Looking for ideas/help/pointers on maximizing my NFS performance on CentOS 7 and on sorting out systemd issues so the box boots reliably without intervention.
The Good:
- bcache working locally on NFS server as expected (except for systemd/boot configuration). ~1.5GB/s write, ~2GB/s read to /dev/bcache0 mount using 1M block-size
- NFS over 40GbE read performance with 1M blocks is good (2GB/s+)
The Bad:
- systemd randomly fails on reboot to bring up my hand-rolled bcache integration (CentOS does not ship a bcache.service); there is some ordering subtlety I am clearly missing…
- NFS write performance is bottlenecked at ~800MB/s - I suspect by the NFS daemon's block size? I increased NFS server threads to 16 (less than the total cores in the system), but 800MB/s corresponds to a 64K write size via dd (quick check on that theory just below).
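To test that theory I'm planning to look at what wsize the client actually negotiated, e.g. on the client:
nfsstat -m
(or grep for the wsize= field on the NFS entry in /proc/mounts). If it reports 65536 rather than 1048576, that would line up with the 64K dd result further down.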
Background:
OS: CentOS 7.5 (4.15.18 kernel)
NFS Disks: 10x 8TB RAID6 (/dev/md1), 1x 900p 280GB Optane (/dev/nvme0n1p1)
NFS network: 40GbE (Mellanox cards, Dell S6000 switch)
Sanity Check on network throughput:
[SUM] 0.00-10.00 sec 35.3 GBytes 30.3 Gbits/sec 2346 sender
[SUM] 0.00-10.00 sec 35.3 GBytes 30.3 Gbits/sec receiver
Sanity Check on local performance (this is the NFS server):
dd if=/dev/zero of=test.img bs=1M count=10000 oflag=direct
10485760000 bytes (10 GB) copied, 7.29606 s, 1.4 GB/s
dd if=/dev/zero of=test.img bs=65536 count=200000 oflag=direct
13107200000 bytes (13 GB) copied, 16.3757 s, 800 MB/s
I included the second one because it is roughly what my NFS write performance looks like… I'm wondering whether it is as simple as getting the NFS daemon to commit in larger blocks, and if so, how to do it?
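If the transfer size is the culprit, the two knobs I know of are the server-side cap and the client mount options - a sketch of what I'd check/try (the export and mount point names are made up, and max_block_size only accepts writes while nfsd is stopped):
# on the server: largest rsize/wsize nfsd will offer clients
cat /proc/fs/nfsd/max_block_size
systemctl stop nfs-server
echo 1048576 > /proc/fs/nfsd/max_block_size
systemctl start nfs-server
# on the client: request 1MB transfers explicitly
umount /mnt/nfs
mount -t nfs -o rsize=1048576,wsize=1048576 server:/export /mnt/nfs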
NFS client performance:
WRITE:
dd if=/dev/zero of=test.img bs=1M count=10000
10485760000 bytes (10 GB) copied, 12.8208 s, 818 MB/s
READ (cold client cache - first read):
dd if=test2.img of=/dev/null bs=1M
13107200000 bytes (13 GB) copied, 4.96747 s, 2.6 GB/s
Bcache Setup/Issues:
bcache's udev automation does not seem to play well with either the cache device being NVMe or the backing store being a RAID6 (md) device.
It tries to auto-probe the devices at boot, but fails with:
Nov 21 20:01:27 xxxxx systemd-udevd: failed to execute '/usr/lib/udev/probe-bcache' 'probe-bcache -o udev /dev/nvme0n1': No such file or directory
Nov 21 20:01:27 xxxxx systemd-udevd: failed to execute '/usr/lib/udev/probe-bcache' 'probe-bcache -o udev /dev/nvme0n1p1': No such file or directory
....
Nov 21 20:01:28 xxxxx systemd-udevd: failed to execute '/usr/lib/udev/probe-bcache' 'probe-bcache -o udev /dev/md1': No such file or directory
I believe this is a basic ordering issue with mdmonitor.service and udev, but I’m not thrilled about messing with that…
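One thing I noticed, though: "No such file or directory" from systemd-udevd there means the /usr/lib/udev/probe-bcache executable itself is missing, not that the device couldn't be probed - so maybe this is an incomplete bcache-tools install rather than ordering? A quick check (the /usr/local path is only an example of where a source build might have landed):
ls -l /usr/lib/udev/probe-bcache
# if bcache-tools was built from source, a link to where 69-bcache.rules expects the binary should let the rule fire:
# ln -s /usr/local/sbin/probe-bcache /usr/lib/udev/probe-bcache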
The only reason it works despite this failure is that I've added a "bcache.service" file:
[Unit]
Description=Bcache Setup
After=mdmonitor.service
Before=local-fs.target
[Service]
Type=simple
ExecStart=/etc/bcache/init.sh
TimeoutStartSec=0
[Install]
WantedBy=multi-user.target
where "init.sh" is a bash script I created (called by systemd from the ExecStart above) that (re)registers and configures the cache and backing devices on every boot:
#!/bin/bash
# Register the cache device (Optane) and the backing device (md RAID6) with bcache
echo /dev/nvme0n1p1 > /sys/fs/bcache/register
echo /dev/md1 > /sys/fs/bcache/register
# Never bypass the cache because of congestion
echo 0 > /sys/fs/bcache/c575c811-f78f-4953-bfeb-5e090865caa4/congested_read_threshold_us
echo 0 > /sys/fs/bcache/c575c811-f78f-4953-bfeb-5e090865caa4/congested_write_threshold_us
# Cache sequential I/O too (the default 4MB cutoff bypasses large streams)
echo 0 > /sys/block/md1/bcache/sequential_cutoff
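One thing I'm considering adding at the end of init.sh, in case the mount can race the /dev/bcache0 node appearing after registration (I'm not certain the register write is fully synchronous) - a small guard:
# wait up to ~10s for the bcache device node before declaring success
for i in $(seq 1 100); do
    [ -b /dev/bcache0 ] && exit 0
    sleep 0.1
done
exit 1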
Because I don't have good control of (or a good understanding of) the boot ordering here, I can't put /dev/bcache0 in /etc/fstab, so I'm using autofs to mount the volume locally; NFS then automounts it on demand. That seems to work as long as systemd comes up clean.
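An alternative I'm considering instead of autofs is an explicit mount unit that pulls in bcache.service itself, so I don't have to touch local-fs.target at all - a sketch, with the mount point and filesystem type as placeholders (the unit file name has to be the systemd-escaped mount path, e.g. srv-bcache.mount for /srv/bcache):
# /etc/systemd/system/srv-bcache.mount
[Unit]
Description=Local mount of the bcache device
Requires=bcache.service
After=bcache.service
[Mount]
What=/dev/bcache0
Where=/srv/bcache
Type=xfs
[Install]
WantedBy=multi-user.target
Enabled with "systemctl enable srv-bcache.mount", that should replace the autofs entry.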
Systemd will randomly fail to come up cleanly, I presume because I am missing a dependency. I've tried various "Before" and "After" clauses against .targets, but they generally create ordering "loops" in systemd, since messing with local-fs.target, local-fs-pre.target, etc. runs into deep dependencies there.
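For what it's worth, here is roughly what I'd change bcache.service itself to (an untested sketch): Type=oneshot makes ordering actually wait for init.sh to finish (Type=simple counts the unit as started the instant the process forks), RemainAfterExit=yes keeps it active so the mount unit above can Require it, and ordering on the real device units instead of mdmonitor.service ties it to /dev/md1 and the NVMe partition actually existing. I've also dropped Before=local-fs.target, since on a unit with default dependencies that ordering can itself create a cycle (services are ordered after basic.target, which is ordered after local-fs.target) - possibly where my "loops" come from:
[Unit]
Description=Bcache Setup
Requires=dev-md1.device dev-nvme0n1p1.device
After=dev-md1.device dev-nvme0n1p1.device
[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/etc/bcache/init.sh
[Install]
WantedBy=multi-user.target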
If someone has bcache running on NVMe on an Ubuntu system, I'd love to see what the .service files look like.
I'm looking through NFS tuning parameters to see if I can encourage the NFS daemon to use 1MB blocks when writing to the disk locally. This server has a UPS and critical data is backed up elsewhere, so this system can trade risk of corruption for speed to a large degree. Obviously lost data is always a pain, but…
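Given the UPS, the first server-side thing I'm inclined to try is the async export option, since it lets nfsd acknowledge writes and commits before they reach the md array - exactly the corruption-for-speed trade I described. Roughly (the export path and client subnet are made up):
# /etc/exports - async replies before data is on stable storage
/srv/bcache  192.168.0.0/24(rw,async,no_subtree_check)
# /etc/sysconfig/nfs - where the thread count lives on CentOS 7
RPCNFSDCOUNT=16
# apply
exportfs -ra
I'd be curious whether async alone should be enough to get past the ~800MB/s wall, or whether the transfer size is the more likely culprit.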
Thanks in advance for any insight provided…