DIY NFS HDD raid 6 array - what reasonable stats should I pursue and how to tune NFS performance?

Hi.

I decided to repurpose my old junk PC as a dedicated RAID array serving an NFS share. I used this PC for about 12 years as a bare-metal workstation with a DAS RAID6 array, but due to my desperate attempts to keep this junk alive it was one hell of a hackjob (including the fact that it used two mid-tower ATX cases duct-taped together to house all 14 HDDs at once), so I don't really know what bare-metal performance I was getting back in the day, since there were bottlenecks everywhere (there probably still are some now).

I know using old hardware is usually a not-all-that-optimal solution, but I just don't like e-waste, plus this machine kind of has sentimental value for me (and honestly it's not THAT bad a machine - still somewhat usable hardware even today, especially for something simpler like an NFS server).

It's using an i7-2600K, an ASUS Sabertooth P67 mobo and 32 GB of RAM, put in a 3U rackmount ATX chassis. I put an IBM SAS controller in it and connected 8 HDDs to it:

Attached SCSI controller: Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 03)

Plus 6 HDDs to the on-board Intel SATA II controller:

00:1f.2 SATA controller: Intel Corporation 6 Series/C200 Series Chipset Family 6 port Desktop SATA AHCI Controller (rev 05)

There's also the OS SSD connected to one of the following SATA III controllers, but I don't know which one exactly:

03:00.0 SATA controller: Marvell Technology Group Ltd. 88SE9123 PCIe SATA 6.0 Gb/s controller (rev 11)
06:00.0 SATA controller: JMicron Technology Corp. JMB362 SATA Controller (rev 10)
0a:00.0 SATA controller: Marvell Technology Group Ltd. 88SE9172 SATA 6Gb/s Controller (rev 11)

That said, it's mostly irrelevant, since the 14-bay HDD RAID6 array is the star of the show here. I'm using 14x 2TB NAS drives of varying age - mostly Seagate (some are indeed old-school Seagate NAS drives, some are newer revisions and some are the newer IronWolf successors), plus a few WD Red 2TB drives:

Device Model:     ST2000VN000-1H3164
Device Model:     ST2000VN000-1H3164
Device Model:     ST2000VN000-1H3164
Device Model:     ST2000VN000-1HJ164
Device Model:     ST2000VN000-1HJ164
Device Model:     ST2000VN004-2E4164
Device Model:     ST2000VN004-2E4164
Device Model:     ST2000VN004-2E4164
Device Model:     ST2000VN004-2E4164
Device Model:     WDC WD20EFRX-68EUZN0
Device Model:     WDC WD20EFRX-68EUZN0
Device Model:     WDC WD20EFRX-68EUZN0
Device Model:     WDC WD20EFRX-68EUZN0
Device Model:     WDC WD20EFRX-68EUZN0

The array is encrypted (LUKS) and the RAID is software-based (btrfs raid56), with lzo compression.
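
For reference, a stack like this gets assembled roughly along these lines - device names, mapper names and the metadata profile are just examples, not my exact commands:

# unlock each member drive (one LUKS container per disk)
cryptsetup open /dev/sdb crypt_sdb
cryptsetup open /dev/sdc crypt_sdc
# ...and so on for the remaining drives

# create the btrfs filesystem with raid6 data across the unlocked mappers
mkfs.btrfs -L raid -d raid6 -m raid1 /dev/mapper/crypt_sd*

# mount with lzo compression (any member device works once they're all scanned)
mount -o compress=lzo /dev/mapper/crypt_sdb /dmnt/raid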

For networking I’m using Intel X710-DA2 2x10G network card with LACP bonding:

02:00.0 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)
02:00.1 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)
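
The bond itself is a standard 802.3ad/LACP setup; with systemd-networkd it would look roughly like this sketch (interface names are just examples for the two X710 ports):

# /etc/systemd/network/10-bond0.netdev
[NetDev]
Name=bond0
Kind=bond

[Bond]
Mode=802.3ad
TransmitHashPolicy=layer3+4
LACPTransmitRate=fast

# /etc/systemd/network/20-bond0-ports.network
[Match]
Name=enp2s0f0 enp2s0f1

[Network]
Bond=bond0

# /etc/systemd/network/30-bond0.network
[Match]
Name=bond0

[Network]
DHCP=yes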

As you may have noticed, the P67 chipset was the last one to only feature PCIe 2.0, so both the IBM SAS/SATA controller and the Intel network card run as PCIe 2.0 x8. In order to make the PC bootable I bought a GT710 PCIe x1 GPU. All in all it looks like this:

Sequential I/O is definitely within expectations; however, I'm not entirely sure whether random I/O performance is adequate for such a setup, and I'm not sure whether it's a networking or a host issue. nfsiostat gives me the following results:
optimistic:

read:              ops/s            kB/s           kB/op         retrans    avg RTT (ms)    avg exe (ms)
                   1.844         637.400         345.616        0 (0.0%)           5.806           5.906

pessimistic:

read:              ops/s            kB/s           kB/op         retrans    avg RTT (ms)    avg exe (ms)
                  10.183        3323.729         326.399        0 (0.0%)           9.714           9.812
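
(These are just nfsiostat polled against the mount on the client, e.g. nfsiostat 5 /mnt/raid - mount point name is an example - with only the read line shown.)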

mountstats:

RPC statistics:
  30657 RPC requests sent, 30657 RPC replies received (0 XIDs not found)
  average backlog queue length: 0

READ:
        14096 ops (45%) 
        avg bytes sent per op: 236      avg bytes received per op: 350632
        backlog wait: 0.048808  RTT: 7.869396   total execute time: 7.977795 (milliseconds)
GETATTR:
        7274 ops (23%) 
        avg bytes sent per op: 219      avg bytes received per op: 226
        backlog wait: 0.007424  RTT: 0.199615   total execute time: 0.232747 (milliseconds)
OPEN_NOATTR:
        2448 ops (7%) 
        avg bytes sent per op: 288      avg bytes received per op: 352
        backlog wait: 0.019199  RTT: 0.189951   total execute time: 0.227533 (milliseconds)
CLOSE:
        2448 ops (7%) 
        avg bytes sent per op: 228      avg bytes received per op: 112
        backlog wait: 0.018382  RTT: 0.160539   total execute time: 0.194036 (milliseconds)
DELEGRETURN:
        2447 ops (7%) 
        avg bytes sent per op: 240      avg bytes received per op: 160
        backlog wait: 3.134859  RTT: 0.567634   total execute time: 3.711483 (milliseconds)
ACCESS:
        723 ops (2%) 
        avg bytes sent per op: 227      avg bytes received per op: 168
        backlog wait: 0.008299  RTT: 0.218534   total execute time: 0.260028 (milliseconds)
...

I'm testing it from a laptop with a 5G Aquantia USB NIC. Sequential I/O saturates this NIC, so there's nothing to be concerned about there, but random I/O feels a little bit sluggish (stuff like thumbnail generation etc.).

I did not do any particular tuning, apart from stuff like no_subtree_check in /etc/exports:

/dmnt/raid *(rw,sync,no_subtree_check,no_root_squash)
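
Off the top of my head the obvious knobs would be the nfsd thread count on the server and rsize/wsize/nconnect on the client - something along these lines, with the values picked pretty much arbitrarily (nconnect needs a reasonably recent client kernel):

# server side, /etc/nfs.conf (the default is 8 threads)
[nfsd]
threads=16

# client side: larger transfer sizes plus multiple TCP connections
mount -t nfs -o rsize=1048576,wsize=1048576,nconnect=4 server:/dmnt/raid /mnt/raid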

And I’m not really sure where to start benchmarking and looking for bottlenecks.
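
For now my only idea is to watch per-disk stats and the NFS server counters on the host while reproducing the load, something like:

# per-disk utilization, queue depth and latency, refreshed every second
iostat -x 1

# NFS server op counters
nfsstat -s

# per-process disk i/o
iotop -o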

It’s slow because you’re likely overloading the CPU and LUKS kinda kills performance.

Good read, I decided to perform similar benchmarks for my machine.

Though I believe I'm probably not even close to being bottlenecked by LUKS.
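
A quick way to sanity-check that is cryptsetup benchmark, which measures raw kernel crypto throughput with no disks involved:

# per-cipher throughput of the kernel crypto that LUKS uses
cryptsetup benchmark
# the 2600K has AES-NI, so the aes-xts numbers should land way above
# anything 14 spinning disks can deliver, especially for random i/o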

I tried to run benchmarks using fio and the script mentioned in the article, but it turns out that… it's apparently very difficult to benchmark a filesystem with native lzo compression, because my results were all over the place depending on what kind of blocks fio encountered (nicely compressible or not).

The test file auto-generated by fio was completely useless because it was so easily compressible that I got over 600 MB/s reads at 1M block size with direct I/O, which is completely unrealistic. Then I used an actual 1TB encrypted VM backup image (so kind of a worst-case scenario for compression) for the test, and I/O fell flat on its face, dipping down to 50 MB/s on some sections while skyrocketing to 400 MB/s on other parts of the file…
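
What should get around this (I haven't verified every corner case) is telling fio to lay the test file down with freshly randomized, effectively incompressible data instead of reusing one buffer; there's also buffer_compress_percentage if you want data that compresses to a specific ratio:

# write the test file with fresh random data for every block
fio --name=prep --filename=/dmnt/raid/fio.test --size=16G --rw=write --bs=1M \
    --refill_buffers --end_fsync=1

# then point the read jobs at that file
fio --name=randread --filename=/dmnt/raid/fio.test --rw=randread --bs=4k \
    --direct=1 --ioengine=libaio --iodepth=16 --runtime=60 --time_based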

Nevertheless, after running hundreds of tests I managed to get some relatively repeatable results on the bare-metal host, which go as follows:

SEQ direct:
64M: IOPS=7    READ=488 MB/s
 1M: IOPS=224  READ=225 MB/s
64k: IOPS=1193 READ=74 MB/s
 4k: IOPS=7788 READ=30 MB/s
SEQ cached:
64M: IOPS=7    READ=456 MB/s
 1M: IOPS=454  READ=455 MB/s
64k: IOPS=7185 READ=449 MB/s
 4k: IOPS=115k READ=448 MB/s
RAND direct:
64M: IOPS=4  READ=285 MB/s
 1M: IOPS=54 READ=54 MB/s
64k: IOPS=80 READ=5.1 MB/s
 4k: IOPS=88 READ=0.3 MB/s
RAND cached:
64M: IOPS=2  READ=162 MB/s
 1M: IOPS=48 READ=48 MB/s
64k: IOPS=76 READ=4.8 MB/s
 4k: IOPS=47 READ=0.2 MB/s

Throughout the entire seq benchmark the CPU load stayed quite low - below 30%. The CPU didn't even crank its clocks up to the full base frequency.
iotop was showing around 60-80% I/O load with cached I/O (and 0 with direct I/O, because it probably doesn't account for direct I/O).

During random read tests on small blocks, the I/O load indicated by iotop during cached operation was very high (99%), and htop mostly showed io_wait on all cores.

The same benchmarks performed over NFS from the laptop look as follows:

SEQ direct:
64M: IOPS=5    READ=342 MB/s
 1M: IOPS=247  READ=248 MB/s
64k: IOPS=2392 READ=150 MB/s
 4k: IOPS=4981 READ=19 MB/s
SEQ cached:
64M: IOPS=5    READ=349 MB/s
 1M: IOPS=347  READ=348 MB/s
64k: IOPS=5274 READ=330 MB/s
 4k: IOPS=88k  READ=344 MB/s
RAND direct:
64M: IOPS=4   READ=259 MB/s
 1M: IOPS=64  READ=64 MB/s
64k: IOPS=345 READ=21 MB/s
 4k: IOPS=424 READ=1.6 MB/s
RAND cached:
64M: IOPS=2   READ=142 MB/s
 1M: IOPS=51  READ=51 MB/s
64k: IOPS=145 READ=9 MB/s
 4k: IOPS=188 READ=0.7 MB/s

Though judging by the HDD activity LEDs on the server, I have serious doubts regarding the legitimacy of those tests over NFS, and I believe they were served mostly or fully from RAM cache on the server (and as such represent best-case-scenario NFS performance at the network level). The server also showed 0% I/O in iotop and 0 MB/s disk reads.

NIC throughput was saturated though, so I guess the data was not served from the laptop's cache. This seems to be confirmed by the fact that after running sh -c "sync; echo 3 > /proc/sys/vm/drop_caches" on the server, disk I/O and HDD activity LEDs appeared again and fio results were noticeably lower - but only for one run of fio. A repeated run returned to the old behavior: 0% I/O, no HDD activity and higher benchmark results.
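
So to get genuinely cold-cache numbers it looks like the caches need to be dropped on both ends before every single run, roughly (client mount point name is just an example):

# on the server: flush dirty data and drop the page cache
sync; echo 3 > /proc/sys/vm/drop_caches

# on the client: same thing, plus remount the share to throw away the NFS client cache
sync; echo 3 > /proc/sys/vm/drop_caches
umount /mnt/raid && mount /mnt/raid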

That said, it's okay, since this behavior allowed me to compare bare-metal I/O limitations to NFS network limitations, and apparently they seem to be more or less aligned and reasonable, so NFS itself doesn't seem to be much of a bottleneck considering how terrible the overall random I/O performance of this array is…