8 SATA SSDs vs a lonely NVMe SSD: huge performance difference

Hey all,

I got my hands on eight 500GB Crucial MX300 SSDs for cheap.

“Great, let’s try improving my free (company-wants-to-scrap-server) server with these bad boys.”

The server has the following specs:
CPU: Intel Core i5-4590 (3.3GHz)
Motherboard: Gigabyte GA-H97-D3H
Memory: 22GB of DDR3 at 1333MT/s
HBA: Dell PERC H310 flashed to IT mode
Boot disk: Samsung 980 (the villain, a cheap 1TB cache-less NVMe SSD)
Storage: 8x Crucial MX300 500GB SSDs, all connected to the Dell PERC H310

I am running Rocky Linux 9.2 with kernel 5.14.0.

I wanted to see if my idea was actually any good: create a ZFS pool of four mirrored vdevs for Postgres (the goal of this server). Surely the 8 SSDs, in the layout optimal for random 4k performance, would outperform a single Samsung 980.

The NVMe SSD is using XFS with default settings in /etc/fstab.

The zpool is configured via the following command:

Command
zpool create \
-O compression=lz4 \
-O atime=off \
-O relatime=off \
-O recordsize=8k \
-O primarycache=metadata \
-o ashift=12 \
-o autotrim=on \
-m none \
tank \
mirror /dev/disk/by-id/ata-Crucial_CT525MX300SSD1_0 /dev/disk/by-id/ata-Crucial_CT525MX300SSD1_1 \
mirror /dev/disk/by-id/ata-Crucial_CT525MX300SSD1_2 /dev/disk/by-id/ata-Crucial_CT525MX300SSD1_3 \
mirror /dev/disk/by-id/ata-Crucial_CT525MX300SSD1_4 /dev/disk/by-id/ata-Crucial_CT525MX300SSD1_5 \
mirror /dev/disk/by-id/ata-Crucial_CT525MX300SSD1_6 /dev/disk/by-id/ata-Crucial_CT525MX300SSD1_7

zfs create -o mountpoint=none tank/postgres
zfs create -o mountpoint=/postgres/data tank/postgres/data
zfs create -o mountpoint=/postgres/wal tank/postgres/wal
zpool status
  pool: tank
 state: ONLINE
config:

        NAME                                         STATE     READ WRITE CKSUM
        tank                                         ONLINE       0     0     0
          mirror-0                                   ONLINE       0     0     0
            ata-Crucial_CT525MX300SSD1_174819DFFEE5  ONLINE       0     0     0
            ata-Crucial_CT525MX300SSD1_174819E018E8  ONLINE       0     0     0
          mirror-1                                   ONLINE       0     0     0
            ata-Crucial_CT525MX300SSD1_174819E026BF  ONLINE       0     0     0
            ata-Crucial_CT525MX300SSD1_174819E03626  ONLINE       0     0     0
          mirror-2                                   ONLINE       0     0     0
            ata-Crucial_CT525MX300SSD1_174819E03A29  ONLINE       0     0     0
            ata-Crucial_CT525MX300SSD1_174819E03D1A  ONLINE       0     0     0
          mirror-3                                   ONLINE       0     0     0
            ata-Crucial_CT525MX300SSD1_174819E03D43  ONLINE       0     0     0
            ata-Crucial_CT525MX300SSD1_174819E03D96  ONLINE       0     0     0

errors: No known data errors
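
To double-check that the properties actually landed on the Postgres datasets, something like this should do (just a sanity check, output omitted):

zfs get compression,recordsize,primarycache,atime tank/postgres/data tank/postgres/wal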

After a bunch of benchmarking with fio I saw the following numbers:

| fio settings | zpool read (IOPS) | single NVMe read (IOPS) | zpool write (IOPS) | single NVMe write (IOPS) |
|---|---|---|---|---|
| 4k rand read, fsync, io-depth 1, direct | 119000 | 196000 | - | - |
| 4k rand r/w, fsync, io-depth 1, direct | 502 | 625 | 502 | 626 |
| 4k rand write, end_fsync, io-depth 1 | - | - | 9280 | 101000 |
| 4k rand r/w, io-depth 1, direct, no fsync | 1424 | 11700 | 1423 | 11600 |
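
For reference, the 4k random write test with per-write fsync looked roughly like this (a sketch; the exact commands I copied from the web used different file names and sizes):

fio --name=randwrite-qd1 --filename=/postgres/data/fio-test --rw=randwrite \
    --bs=4k --iodepth=1 --direct=1 --fsync=1 --ioengine=libaio \
    --size=4G --runtime=60 --time_based --group_reporting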

WHAT?! How can 8 SATA SSDs lose to a single cache-less NVMe SSD???

  • Is my Dell PERC H310 an IO bottleneck?
    • I guess not, because the first benchmark shows the pool can push 119K read IOPS through it.
  • Is ZFS badly configured, or is ZFS holding me back for some reason?
    • I don’t know; is 4k random IO just poor on ZFS?
    • Is record size 8k really that bad?
  • Am I benchmarking wrong?
    • Maybe, I copied some fio commands from the web
  • Are SATA SSDs really that slow?
    • I think it is not that bad… right?

Sometime next week I will upgrade this system to an Intel Xeon E3-1275 v6 with 32GB of memory. I will repeat the same tests and see if anything changes.

TL;DR:
8 SATA SSDs in a 4-mirror ZFS pool should be faster than a single cheap NVMe SSD, right? The benchmarks show the NVMe winning. Why?

In the meantime I hope somebody can share their experience with 8 SSDs in a quad-mirror zpool. Or maybe you see my mistake, which I currently do not! XD

NVMe has more bandwidth than SATA? Idk, not a computer scientist, just an idea.

But then again 8 drives provide redundancy. One drive is none.

Don’t mock the 980. I’ve got this model running in basically all my computers by now. Super power-efficient PCIe 3.0 drive with plenty of bang for the buck in the economy department. I use them as boot drives. The write cache wears out really fast under proper load, but a boot drive doesn’t need it.

My bet would be on the 4x mirrors, but 4k QD1 is always special and overhead in managing 8 drives might be the reason. I think the mirrors will catch up as soon as you increase the block size, jobs and/or QD.
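
Bumping the queue depth and job count in fio would show it, something like (a sketch; the test file path and size are placeholders):

fio --name=randrw-qd32 --filename=/postgres/data/fio-test --rw=randrw \
    --bs=4k --iodepth=32 --numjobs=4 --direct=1 --ioengine=libaio \
    --size=4G --runtime=60 --time_based --group_reporting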
But that’s not the use case, so testing is of academic value. Stick to NVMe. Best for 4k is NVMe/Optane/DRAM anyway. And with fsync results being on par and database as a use case, I’d call it a draw anyway.

I’d test without mirrors and just 4 or 8 stripes again. Might be some overhead in writing to both sides of the mirror. Slowest device always sets the pace just like with RAIDZ. And some drives are always slower than others.
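
For a quick stripe-only comparison, something along these lines would do (a sketch reusing the device names from the first post; shown with four drives, same idea with eight):

zpool create -o ashift=12 -O compression=lz4 -O recordsize=8k -O atime=off stripetest \
    /dev/disk/by-id/ata-Crucial_CT525MX300SSD1_0 \
    /dev/disk/by-id/ata-Crucial_CT525MX300SSD1_1 \
    /dev/disk/by-id/ata-Crucial_CT525MX300SSD1_2 \
    /dev/disk/by-id/ata-Crucial_CT525MX300SSD1_3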

Or maybe it’s the SATA controller (are you using the 8 drives across different controllers?)

MX300 is pretty old at this point in time. It’s still 8 drives, but more or less first-gen SATA SSD. We’ve come a long way since then.

8 SATA SSDs have about the same bandwidth as a single PCIe 3.0 NVMe drive. And he’s using 4k reads and writes, where bandwidth isn’t a concern at all. You could shove 10-50MB/s over an old ATA/IDE bus. Only random access latency is important here.

4k random IO is always bad. ZFS has overhead, so don’t expect the IOPS numbers you see on manufacturer data sheets in practice. Integrity features and all the ZFS goodies like CoW take their toll. Writing to ext4/XFS will always be way faster. 10k IOPS with fsync and all isn’t too shabby. I’d love to get 10k at QD1 on a high-performance Ceph cluster; ZFS is still fast in comparison.

Put that NVMe in a new zpool as a single drive and test again. It’s more a ZFS benchmark than SATA vs. NVMe at the moment.
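
Roughly like this (a sketch; /dev/nvme0n1 is a placeholder and this wipes whatever is on it, so use a spare drive or partition):

zpool create -o ashift=12 -O compression=lz4 -O recordsize=8k -O atime=off nvmetest /dev/nvme0n1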

Oh okay, I thought the controller would be the bottleneck.

Correct, but this is 4k random IO that I’m testing, where the gap is much closer. Random 4k IOPS is all I’m after.

Good points.

Changing two parameters in a single test is a big no-no.

I guess I will shrink the XFS partition to create a 40GB zpool to exclude the filesystem differences from testing. When the new hardware arrives I can split the drives between the H310 and motherboard.

My hypothesis is that ZFS is holding it back, OR that 4 vdevs are inefficient for 4k random IOPS.

Do you know what IOPS are, and how they pertain to the testing you are doing?

I have some 400GB high-IOPS SATA SSDs that absolutely decimate even fairly new PCIe 3.0 NVMe drives in a very specific test, but lose by miles in basic storage and desktop use.

You can test and tune your SSDs and get some more out of them, but all things considered that looks pretty normal. They just have multiple different limits you will always bump up against before getting close to NVMe. IOPS is one, but SATA bandwidth, even across the 4 mirrored vdevs, is under what a single PCIe 3.0 NVMe disk is capable of by a fair margin. Not to mention your SSDs were not top performers even at their peak.

As mentioned above… features cost performance. And ZFS is more than just a filesystem; it will always be outperformed by XFS in pure read or write scenarios. It often doesn’t make a difference because ZFS caches so well, and IOPS aren’t worth a damn once you interact with anything over the network, where network latency kills performance anyway.

I agree; it depends on the workload. But I saw Ceph benchmarks last week with 20k IOPS on SATA SSDs while NVMe drives scored 500 IOPS. Enterprise SATA vs. consumer NVMe can make a big difference, especially when talking about syncing and sustained heavy-load scenarios.

Always get the right tool for the job. I trust my 980 drives with my life when it comes to booting, but I wouldn’t run them as OSDs in a Ceph cluster, in write-heavy ZFS pools, or in pools without a separate log device.

| fio settings | zpool read (IOPS) | single NVMe read (IOPS) | NVMe ZFS read (IOPS) | zpool write (IOPS) | single NVMe write (IOPS) | NVMe ZFS write (IOPS) |
|---|---|---|---|---|---|---|
| 4k rand read, fsync, io-depth 1, direct | 119000 | 196000 | 98800 | - | - | - |
| 4k rand r/w, fsync, io-depth 1, direct | 502 | 625 | 1397 | 502 | 626 | 1392 |
| 4k rand write, end_fsync, io-depth 1 | - | - | - | 9280 | 101000 | 11500 |
| 4k rand r/w, io-depth 1, direct, no fsync | 1424 | 11700 | 3187 | 1423 | 11600 | 3180 |

The second test is strange, but otherwise we can conclude that ZFS features have a bigger impact on random IOPS than I expected.

The second write-speed result could possibly be explained by CoW helping the SSD do less work, because every write is a “new” write instead of an update from the SSD’s point of view.

Thanks for everybody’s input.

I know what IOPS are. It was just a test to see if I could.

And you are right, Crucial MX300s were never high-performing SSDs.

ZFS can’t boost reads without you enabling the cache; this is artificially limiting ZFS performance. And compression obviously has some overhead you don’t deal with in XFS. Then there is the sync dataset property, which makes or breaks performance depending on your setting and test bracket. Good and fair benchmarking is hard with so many cogs.

Disabling the acceleration mechanics and letting it fight an uphill battle with compression on top (lz4 is minimal, but still)…
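
Flipping the cache back on for a fairer comparison is a one-liner, and the sync and compression settings are easy to check (dataset names as in the first post):

zfs set primarycache=all tank/postgres/data
zfs get sync,primarycache,compression tank/postgres/data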

In the end you probably have most of your 4k stuff cached, so pulling the data from ARC is a normal use case for reads. You want DRAM for reading small stuff like 4k and metadata.

Unless your database caches data in RAM anyway. In that case read benchmarks are pointless; only the hit rate is important, and that depends directly on the caching algorithm and RAM size. ZFS’s inline compression still allows caching way more data than the RAM size would suggest. And database data usually compresses very well, which means something like 3-4x the data cached. Certainly important if memory is a limiting factor.

You are asking the right and interesting question, but you are using the wrong method to attack it.

Ceteris paribus for the win.

You are comparing the performance of an extremely simple setup (CPU → NVMe directly) to an extremely complicated one (CPU → zpool → RAID controller → SATA array).

Establish a baseline first, then test the effect of each layer on the resulting performance.

All layers have an effect here.

  • how does a plainly formatted SATA SSD perform when connected to the onboard SATA controller? (no ZFS, no RAID, no PERC)
  • how does an mdadm mirror of SATA drives perform when connected to the onboard SATA controller? (no ZFS, no PERC; see the sketch after this list)
  • now replicate both measurements with the drives connected to the PERC H310 (no ZFS)
  • now try ZFS without the PERC
    … etc
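
For the mdadm mirror step, something along these lines would do (a sketch; /dev/sdX and /dev/sdY stand in for two of the Crucials on the onboard controller, and /mnt/mdtest is just an example mount point):

mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdX /dev/sdY
mkfs.xfs /dev/md0
mkdir -p /mnt/mdtest
mount /dev/md0 /mnt/mdtest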

8 SATA SSDs in a 4-mirror ZFS pool should be faster than a single cheap NVMe SSD, right?

In some specific circumstances, yes, but it would have to be a very expensive setup.

NVMe replaced SATA/SAS for a reason, the reason being that those protocols were never designed with something like flash media in mind and ended up fundamentally bottlenecking the SSDs hard.

The same goes for hardware RAID controllers; anything but a direct PCIe connection bottlenecks drive performance.

I benched the NVMe with ZFS (fresh install) and ZFS has a major impact, as expected.

ZFS can boost reads via multiple vdevs, but with an io-depth of one and a blocksize of 4k you are right that I’m basically testing a single SSD for most of the benchmarks.

Thanks for your insight. I think I got the conclusions for the questions I had.

  • NVMe is a lot faster than I thought
  • ZFS has a higher performance impact than expected for random 4k IO
  • I need to understand the fio tool better to create the scenarios I want to test

Have you tried mdadm striped with XFS? I found that to be pretty zippy compared to everything else I tried when I was using an F80 card as a write cache. I didn’t test extensively or anything, though.
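
In case you want to give it a spin, it is only a couple of commands (a sketch; device names and mount point are placeholders):

mdadm --create /dev/md0 --level=0 --raid-devices=8 /dev/sd[b-i]
mkfs.xfs /dev/md0
mkdir -p /mnt/stripetest
mount /dev/md0 /mnt/stripetest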

No, I haven’t. I had read some articles online about the benefits of combining ZFS and Postgres; that is what started this whole thing. I did some basic testing and got results I did not expect.

Some of the articles:

If you want to continue this experiment, look into getting a testing platform with better connectivity options and go for some used enterprise U.2 drives.

You will need the PCIe lanes to attach them directly or to feed HBAs. Direct attach is preferred, because any controller will limit performance.

The CPU itself must also be performant enough to manage the IO operations, hence EPYC again.

A second-hand EPYC platform would be the ideal candidate.

For basic orientation in the area, I would recommend the following:

Both cover enterprise hardware in a depth not seen anywhere else. For funsies, I also remember Linus Sex Tips setting up an EPYC ZFS server with a shitton of fast Kioxia drives (?).

But I can’t find it; they certainly got insane random IO performance out of it. The hardware was, what, definitely 50-100k USD+.

I already have new hardware on the way. And today I also ordered :innocent: two Intel DC P3605 1.6TB NVMe SSDs from eBay. Those will take a couple of weeks because of shipping (early next month). The new motherboard and CPU (Intel Xeon E3-1275 v6 with a Supermicro X11SAE) will support both Intel SSDs at full speed.

It will allow me to keep the Rocky Linux install on the Samsung m.2 and reconfigure the two Intel SSDs without reinstalling the system.

Maybe I will update this thread or start a new one with tests where only one variable changes at a time (file system).

Good, this board has at least one full x16 PCIe slot, so you should be able to use a 4x M.2 adapter board using PCIe bifurcation (like this model).

I have the same board in the mATX configuration with an E3-1225 v6, where this is missing, so I am slowly migrating away from it.

Intel DC P3605s are good drives, but don’t expect a miraculous performance uplift in low-QD and low-threaded workloads. There they are eclipsed by consumer products, but once you load them, they will keep chugging long after consumer drives fail or their performance degrades.

Lower peak performance, but it’s absolutely consistent at nearly any level of load. I don’t even think the E3-1225 v6 has enough oomph to generate enough of a workload to saturate these drives.

Similar drive benchmarks:

This motherboard does not support PCIe bifurcation to my knowledge. There is nothing about it in the manual; the two PCIe x16 slots support x8/x8, but that is all I can gather from reading the motherboard manual and the Intel ARK pages for the CPU and chipset.

It is well known that consumer drives focus on burst and sequential IO to get those nice numbers for the marketing team. But that is not what a database needs.

When you run a query on a database you want it to be predictable, for example selecting a user by id always taking around 5ms. With a consumer drive that would fluctuate more and have a higher standard deviation.

And I guess two mirrored Intel DC P3605s will last me a really long time. Which is great!