Wendell's ZFS performance

Wendell said he was able to 'SATURATE' 10GbE via spinning drives (using disk shelves). I'm not aware of a how-to thread here that explains how he accomplished that performance, or one that collects the trial and error of others.

I can think of a few possible reasons why his results differ so drastically from mine.

Is this a case of purpose-built NAS/SAN hardware (NetApp) being 'better' at this than generic servers? If so, since it's just an HBA feeding the NetApp shelf, that raises more questions than it answers.

I've seen quite a few people stuck in the same MB/s range with similar equipment, but there's also a stack of people who get much better results, and in comparing the setups, nothing stands out as the 'ah-ha' reason.

Was it the disk shelves? Or some other piece of hardware?

• NetApp DS4246
• IOM6 controller (is the IOM12 just SAS-3?)
• Was it the quantity of drives? ~21 HDDs or so?
• Or maybe he was striping the shelves?
• If he's striping shelves, the speed-per-shelf would be good info

Speaking of the IOM6 vs IOM12:
Are all HDDs compatible with it when it's used behind an HBA?

And does it only require NetApp-approved drives when using a NetApp RAID controller?

Regarding performance that’s reasonable/plausible …

There’s an extensive list of REAL test results which I’d love to even get close to.

Calomel (the website) has a page you can Google called zfs_raid_speed_capacity.

Consistent with the speeds in the list, Wendell says he gets 800 - 1,100 MB/s …

However, in a thread here, a member got a whole 400 MB/s using 4x 970 EVO SSDs.

And in threads everywhere, people get 150 - 200 MB/s using 4+ 7200 drives.

(Yes, I know rotational speed supposedly isn't very important. I'm just not going to omit it.)

This is a huge range…
If a customer asked for a storage solution and I quoted that range they’d laugh.

In the last year I've considered just giving up and accepting that ~8 HDDs in RAIDZ2 is ~140 MB/s, depending on voodoo, file size, network speed & traffic.

Then I run into reports of blissful performance, and my optimism overrules my skepticism. So I'm (again) thinking of spending another 20 hours trying to tune the performance.

The performance results are attached as a picture to ensure formatting is legible.

One of those results is a RAIDZ2 of only 6 drives, yet it gets:
Writes - 429 MB/s
Mixed - 71 MB/s
Read - 488 MB/s

Granted, the mixed speeds aren't impressive, but I don't need mixed.
I’d be VERY happy to approximate these 300 - 400 MB/s results.

Right now I get closer to 150 MB/s using 8 drives. This isn’t via Bonnie++ … it’s over 10GbE (I’ve gotten up to 240MB/s, ooooh) … it’s just not even CLOSE to what others are getting.

Dell PowerEdge T320
Xeon E5-2403 v2 Quad 1.80GHz
32GB 1333MHz DDR3 ECC
LSI SAS 9205-8i
8x 7.2K IBM/HGST SAS
Configuration: RAIDZ2
Network - SFP+ (10GbE)

Thanks

@wendell

One NetApp shelf in active/active mode with 24 2TB drives is nearly 2 GB/s read and around 1 GB/s write.

LSI SAS6 HBA, 8 channels. High CPU clock speeds. Lots of memory. Relatively incompressible dataset.

If your disk shelf can do active/active it should be comparable performance.

There is a thread here somewhere that explains the vdevs and layout. The shelf has 4 vdevs, IIRC, RAIDZ2.

6 Likes

Assuming the bottleneck isn't elsewhere (controllers, configuration for workload, other system hardware, etc.), the basic idea of ZFS performance is:
-"More vdevs, more better."
-Followed by having more drives in a vdev, which is quickly subject to diminishing returns.
-Mirrors are better at random IO, because you get a load of vdevs.
-RAIDZx is better at throughput (streaming reads or writes).
-Mixed read/writes will always be pretty shit on rust; use SSDs.
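
A back-of-the-envelope way to picture those rules (this is my own rough model, not a benchmark; the ~150 MB/s and ~100 IOPS per-disk figures are assumptions, and real pools fall short of these ceilings):

```python
# Rough model of how pool layout affects ZFS performance (assumptions only):
# - one 7200 RPM disk streams ~150 MB/s and does ~100 random IOPS
# - streaming roughly scales with the total number of data disks
# - random IO roughly scales with the number of vdevs
DISK_SEQ_MBPS = 150
DISK_IOPS = 100

def pool_estimate(vdevs, disks_per_vdev, parity):
    data_disks = vdevs * (disks_per_vdev - parity)
    streaming = data_disks * DISK_SEQ_MBPS  # idealized ceiling
    random_io = vdevs * DISK_IOPS           # ~one disk's worth of IOPS per vdev
    return streaming, random_io

print(pool_estimate(1, 8, 2))  # 1x 8-disk RAIDZ2  -> (900, 100)
print(pool_estimate(2, 6, 2))  # 2x 6-disk RAIDZ2  -> (1200, 200)
print(pool_estimate(6, 2, 1))  # 6x 2-way mirrors  -> (900, 600)
```

The mirror line understates reads a bit, since both sides of a mirror can serve reads, but the shape of the comparison is the point.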

10 Gbps is 1250 MB/s, though this is an imaginary conversion that doesn't take into account other factors like the encoding of the data (some of what goes over the wire is overhead). 10Gb Ethernet and Fibre Channel use 64b/66b encoding, so really ~1212 MB/s is the perfect theoretical max.
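
Just to show the arithmetic in that paragraph (restating the numbers above, nothing more; protocol headers above the line code would shave off a bit extra):

```python
# 10 Gbit/s converted to MB/s, then reduced by 64b/66b line-code overhead.
line_rate_bits = 10_000_000_000                # 10 Gbit/s
raw_mbps = line_rate_bits / 8 / 1_000_000      # 1250.0, the "imaginary" figure
after_encoding = raw_mbps * 64 / 66            # ~1212, the theoretical ceiling
print(round(raw_mbps), round(after_encoding))  # 1250 1212
```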

Take the 6-disk RAIDZ2 or the 5-disk RAIDZ1. Those are just one vdev. Add another one of those vdevs and you'd be much closer to saturation with just a bit of headroom, and a third vdev would put you just over saturation.

It should also be noted that benchmarks for ZFS are fraught with a variety of pitfalls.
If it doesn't use fio, forgets about ZFS compression (uses /dev/zero), forgets about ARC caching, or doesn't understand the limits of various random number generation methods, then it's probably not a good benchmark. I've seen the problems with bonnie++ discussed before, but I've long since forgotten what they were. Jim Salter has negative opinions about it and that's all I really care to know.
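
For what it's worth, here's a hedged sketch of the kind of fio run that sidesteps those pitfalls; the path and size are placeholders, the fio options are standard ones, and the file size should sit well above installed RAM so ARC can't hide the disks:

```python
# Hypothetical wrapper that builds an fio sequential test avoiding the usual
# ZFS benchmark traps: non-zero buffers (so lz4 can't inflate the numbers),
# a file far larger than RAM (so ARC can't serve it), and a final fsync
# (so buffered writes are actually counted in the result).
import subprocess

def fio_seq_test(directory="/mnt/tank/bench", size="64G", mode="write"):
    cmd = [
        "fio", "--name=seqtest",
        f"--directory={directory}",
        f"--rw={mode}",            # "write" or "read"
        "--bs=1M",                 # big blocks for streaming
        f"--size={size}",          # keep this well above installed RAM
        "--ioengine=posixaio",
        "--iodepth=4",
        "--refill_buffers",        # regenerate buffer contents, defeats compression
        "--end_fsync=1",           # include the final flush in the timing
        "--group_reporting",
    ]
    print(" ".join(cmd))           # inspect the command before running it
    return subprocess.run(cmd, check=True)

# Run the write pass, then export/import the pool (or reboot) before the
# read pass if you want to be certain ARC isn't warm.
```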

As such, a real-world transfer of data is often the best test; just don't forget about ARC caching.

Last but not least, keep in mind that ZFS isn't perfect. Out of the box it's generally good enough, but when really reaching for performance there can be a large variety of esoteric and weird performance issues that can hit some people hard, and no average Joe seems to know why; the professionals who might know certainly aren't getting paid enough to volunteer that information. Here's a thread right now on the abysmal performance of a 12-vdev pool of mirrors. It should be amazing, but it's complete shit and everyone is fumbling to figure out why.

When it comes to SSDs, even more special effort is needed to truly pull out the performance they are capable of. ZFS is still tuned for HDDs, and even that isn't that great.

I do feel your frustration. ZFS on Linux is a stupid black box of performance right now. If you can manage to get it to run, FreeBSD/FreeNAS's ZFS (note that it hates certain USB sticks for some reason and will refuse to install from them, lol) might be worthwhile, as they tend to be a bit more fanatical about the tuning.

Also make sure you aren't getting hit by the performance regressions from the Linux kernel symbol restrictions bullshit. ZFS fixes are coming, but right now I think a patch is needed.

TL;DR
You are talking about performance in a very complex system. There aren't very good answers at this point beyond "add more vdevs" and hope it fucking works. I wish I could help more, but I don't understand it well myself.

6 Likes

LOG … excellent writing / logic / clarity / insight. THANK! YOU!

PRIMARY QUESTIONS:

  • Is the LSI 9200 controller a possible bottleneck?
  • Any tiering solution on the horizon?
  • Is 1.8 GHz slow enough to throttle a single transfer?

Do YOU know what he's referring to by 'active/active', and what it affects? Does it increase throughput?
Is NetApp equipment faster at SAN/NAS roles because it's 'purpose built'?
Can any computer be used to manage a disk shelf?

LESS IMPORTANT – Nested ZFS zvols:
[RAIDZ1 of 4 drives] = one vdev || (3) x (RAIDZ1 vdevs) = a RAIDZ1 of RAIDZ1 vdevs
[2 striped 3-drive RAIDZ1 vdevs] mirrored with [2 striped RAIDZ1 vdevs]

Basically, a RAID comprised of RAIDs vs. striped/mirrored RAID arrays …

TL;DR questions / comments
Tests I plan on running with spinning drives (4TB HGST SAS-2 7200 RPM Ultrastars):
Stripe a pair of RAIDZ1 vdevs (in practice, I'd prefer a minimum of 5-6 drives per stripe, in RAIDZ2)
Mirror a pair of RAIDZ1 vdevs
RAID-0 of 4 drives … mirrored.

Striping RAIDZ2 vdevs seems within the limits of 'risk' …
(Why not call this RAIDZ 20 …?)
Mirroring a set of RAIDZ1 vdevs seems reasonable …
(Why not call this RAIDZ 11 …?)
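
For what it's worth, stock ZFS doesn't nest vdevs: a pool always stripes across whatever top-level vdevs you give it, so "striping a pair of RAIDZ1" is just two raidz1 groups in one pool, while mirroring one RAIDZ on top of another isn't a native layout (hence the nested-zvol idea above). A sketch of how those layouts get expressed; the device names are placeholders:

```python
# Build `zpool create` command lines for pools made of identical top-level
# vdevs. ZFS stripes across all top-level vdevs automatically, which is why
# there's no separate "RAIDZ 20" name for striped RAIDZ2 groups.
def zpool_create(pool, vdev_type, groups):
    parts = ["zpool", "create", pool]
    for disks in groups:
        parts += [vdev_type, *disks]
    return " ".join(parts)

# "Stripe a pair of RAIDZ1" == two raidz1 vdevs in one pool:
print(zpool_create("tank", "raidz1", [("da0", "da1", "da2"),
                                      ("da3", "da4", "da5")]))
# -> zpool create tank raidz1 da0 da1 da2 raidz1 da3 da4 da5

# Two striped 6-disk RAIDZ2 vdevs (the layout described above as getting
# close to 10GbE saturation):
print(zpool_create("tank", "raidz2", [tuple(f"da{i}" for i in range(0, 6)),
                                      tuple(f"da{i}" for i in range(6, 12))]))
```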

If FreeNAS supports my HighPoint NVMe controller I’ll test 4x PM983 NVMe drives:

  • RAID10
  • RAIDZ1
  • RAIDZ2
    though I expect a RAIDZ2 of 4 drives to be slower, yet equally 'inefficient', compared to [mirrored • stripes].
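
On the "equally inefficient" point, the usable-space arithmetic for 4 drives does come out the same either way (ignoring ZFS overhead and padding):

```python
# Usable-capacity fraction for a few 4-drive layouts (raw math only).
def raidz_usable(disks, parity):
    return (disks - parity) / disks

def striped_mirrors_usable(copies=2):
    return 1 / copies

print(raidz_usable(4, 2))        # 0.5  -> 4-drive RAIDZ2
print(striped_mirrors_usable())  # 0.5  -> two striped 2-way mirrors
print(raidz_usable(4, 1))        # 0.75 -> 4-drive RAIDZ1
```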

RAID50 seems too risky as losing a 2nd drive from either stripe risks the entire set.
And, mirroring only seems enticing for ultra precious data or if data is also striped.

PS – in case it's useful to anyone: the Dell PowerEdge T630 / T640 holds 18x LFF or 32x SFF drives.

Since you made a new thread, I guess I get to ask a 3rd time: what OS and ZoL version (if Linux) are you using?

5 Likes

Don’t.

Fine.

Don’t.

Model?

Yes.

Have hot spares.

Is for performance.

4 Likes

So here's some screenshots from my RAIDZ2 8x4TB single pool in FreeNAS. My main box is an ESXi host, with FreeNAS as a VM given 1c/2t of my Ryzen 3700X @ 4.1GHz, 24GB of RAM, and a 9211-8i passed through. The Windows desktop is a VM on the same box with 4c/8t and 16GB of RAM; virtual networking is set for virtual 10G, MTU 9000.

[screenshots: transfer benchmarks and FreeNAS tunables]
1 Like

That’s with all the tunables as well?

Yes, that’s all the tunables I’m using

1 Like

I’ll have to do some retests on the FreeNAS Mini XL (I hate typing that name out) and see what’s up.

Sorry, I missed that screenshot initially (on mobile).

1 Like

You have typos in the names of the third and fourth tunables, which might explain the poor read performance.

Overall, those are some excessively high numbers for a gigabit network imo.

It's 10 gig, no?

Oh, I missed that FreeNAS is a VM on the same host; I thought FreeNAS was backing the VMs for ESXi.

1 Like

If you had 10 gig on FreeNAS but mixed 1G and 10G clients, what do you tune for?

1 Like

Tune it for 10 Gig. I just find it silly to apply a bunch of tuning for 1 Gig as if it's somehow pushing the limits of what FreeNAS can do out of the box. Even for 10 Gig there shouldn't be any "one size fits all" tuning required for a small number of users. You can optimize tuning for a particular workload if you have particular needs (like hundreds of simultaneous users), but the defaults should be good in general; if they're not, it's a bug.

3 Likes

A lot of people who have sequential SMB workloads on 10G use those tunables from 45Drives. I've always seen an improvement, although I'm not sure how much of that is just jumbo frames.
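
For anyone reading along, these are the kinds of FreeBSD network sysctls that guides like that adjust; this is only an illustrative sketch, not the actual 45Drives list, and the values shown are placeholders to check against your own setup:

```python
# Illustrative examples of FreeBSD tunables a 10GbE guide typically touches.
# The sysctl names are real FreeBSD OIDs; the values are just examples.
example_10g_tunables = {
    "kern.ipc.maxsockbuf":       16777216,  # max socket buffer size
    "net.inet.tcp.sendbuf_max":  16777216,  # cap for auto-grown send buffers
    "net.inet.tcp.recvbuf_max":  16777216,  # cap for auto-grown receive buffers
    "net.inet.tcp.cc.algorithm": "htcp",    # congestion control suited to fat pipes
}

for name, value in example_10g_tunables.items():
    print(f"sysctl {name}={value}")
```

Testing with and without them (versus MTU 9000 alone), as discussed below, would show how much of the gain really is just jumbo frames.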

1 Like

Would be an interesting experiment to measure and find out :smiley:

Sorry - I had a long couple of workdays. I’m on FreeNAS 11.2

I’ll do the MTU alone first and see what changes that makes…

Love the info - and very grateful for how helpful you’ve both been.

2 Likes

I'll try the tests again without the tunables (just MTU 9000) when I get the chance. Jumbo frames are a huge boon just on their own.

1 Like