NFS, 40GbE, bcache, optane, rdma performance/configuration (CentOS7)

cekim · November 24, 2018, 1:12am

Looking for ideas/help/pointers on maximizing my NFS performance on CentOS 7, and on sorting out systemd issues so the server boots reliably without intervention.

The Good:

bcache working locally on NFS server as expected (except for systemd/boot configuration). ~1.5GB/s write, ~2GB/s read to /dev/bcache0 mount using 1M block-size
NFS over 40GbE read performance with 1M blocks is good (2GB/s+)

The Bad:

systemd/boot: random failures on reboot when processing my hacked systemd integration (CentOS does not provide a bcache.service). There is some ordering subtlety I am clearly missing…
NFS write performance bottlenecked at ~800MB/s - I suspect by the NFS daemon block-size? I increased NFS server threads to 16 (less than total cores in the system), but 800MB/s corresponds to a 64K write size via dd.

Background:
OS: CentOS 7.5 (4.15.18 kernel)
NFS Disks: 10x8T raid6 (/dev/md1), 1x900p 280G optane (/dev/nvme0n1p1)
NFS network: 40GbE (Mellanox cards, Dell S6000 switch)

Sanity Check on network throughput:

[SUM]   0.00-10.00  sec  35.3 GBytes  30.3 Gbits/sec  2346             sender
[SUM]   0.00-10.00  sec  35.3 GBytes  30.3 Gbits/sec                  receiver

Sanity Check on local performance (this is the NFS server):

dd if=/dev/zero of=test.img bs=1M count=10000 oflag=direct
10485760000 bytes (10 GB) copied, 7.29606 s, 1.4 GB/s

dd if=/dev/zero of=test.img bs=65536 count=200000 oflag=direct
13107200000 bytes (13 GB) copied, 16.3757 s, 800 MB/s

I included the 2nd one because this is roughly what my NFS write performance looks like… I’m wondering if it is just as simple as adjusting its cache-commit size and how to do so?
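To see where local throughput falls off between 64K and 1M, the two dd runs above can be generalized into a block-size sweep. This is a minimal sketch, not a benchmark harness: the helper names `blocks_for` and `bs_sweep` are made up for illustration, and the target path and dd flag are whatever you pass in.

```shell
#!/bin/bash
# Number of blocks of size BS needed to write TOTAL bytes (integer division).
# Usage: blocks_for TOTAL BS
blocks_for() {
    echo $(( $1 / $2 ))
}

# Write TOTAL bytes to TARGET once per block size and print dd's summary line.
# Usage: bs_sweep TARGET TOTAL DD_FLAG BS...
#   DD_FLAG is a single dd operand, e.g. oflag=direct or conv=fdatasync
bs_sweep() {
    local target=$1 total=$2 flag=$3 bs
    shift 3
    for bs in "$@"; do
        printf '%8d  ' "$bs"
        # dd reports on stderr; the last line holds bytes, seconds and rate
        dd if=/dev/zero of="$target" bs="$bs" \
           count="$(blocks_for "$total" "$bs")" "$flag" 2>&1 | tail -n 1
    done
    rm -f "$target"
}
```

For example, `bs_sweep ./test.img 10485760000 oflag=direct 65536 131072 262144 524288 1048576` repeats the measurements above with three intermediate sizes added.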

NFS client performance:

WRITE:
dd if=/dev/zero of=test.img bs=1M count=10000
10485760000 bytes (10 GB) copied, 12.8208 s, 818 MB/s
READ (cold client cache - first read):
 dd if=test2.img of=/dev/null bs=1M
13107200000 bytes (13 GB) copied, 4.96747 s, 2.6 GB/s

Bcache Setup/Issues:
bcache’s udev automation does not seem to play well with either the cache drive being nvme or the backing store being a raid6 (md) device.

It seems it tries to auto-scan, but fails with:

Nov 21 20:01:27 xxxxx systemd-udevd: failed to execute '/usr/lib/udev/probe-bcache' 'probe-bcache -o udev /dev/nvme0n1': No such file or directory
Nov 21 20:01:27 xxxxx systemd-udevd: failed to execute '/usr/lib/udev/probe-bcache' 'probe-bcache -o udev /dev/nvme0n1p1': No such file or directory
....
Nov 21 20:01:28 xxxxx systemd-udevd: failed to execute '/usr/lib/udev/probe-bcache' 'probe-bcache -o udev /dev/md1': No such file or directory

I believe this is a basic ordering issue with mdmonitor.service and udev, but I’m not thrilled about messing with that…
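Before treating this as an ordering problem, it may be worth ruling out the simpler reading of the log: "No such file or directory" on a "failed to execute" line is what udevd reports when the helper program itself cannot be executed. The sketch below only checks whether the helpers exist and are executable. `probe-bcache` comes from the log above; `bcache-register` is an assumption about what the bcache udev rules on this system also call, and `missing_helpers` is a made-up name.

```shell
#!/bin/bash
# Print the full path of every helper that is missing or not executable.
# Usage: missing_helpers DIR NAME...
missing_helpers() {
    local dir=$1 name
    shift
    for name in "$@"; do
        [ -x "$dir/$name" ] || echo "$dir/$name"
    done
}
```

Running `missing_helpers /usr/lib/udev probe-bcache bcache-register` prints nothing when both helpers are in place. If it prints a path, the udev rules were probably installed without their binaries, and no amount of unit ordering will make the auto-registration work.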

The only reason it works despite this failure is that I’ve added “bcache.service” file:

[Unit]
Description=Bcache Setup
After=mdmonitor.service
Before=local-fs.target

[Service]
Type=simple
ExecStart=/etc/bcache/init.sh
TimeoutStartSec=0

[Install]
WantedBy=multi-user.target

Where “init.sh” (re)registers/configures the cache/store every time. I created this bash script called by systemd.

#!/bin/bash
echo /dev/nvme0n1p1 > /sys/fs/bcache/register
echo /dev/md1 > /sys/fs/bcache/register
echo 0 > /sys/fs/bcache/c575c811-f78f-4953-bfeb-5e090865caa4/congested_read_threshold_us
echo 0 > /sys/fs/bcache/c575c811-f78f-4953-bfeb-5e090865caa4/congested_write_threshold_us
echo 0 > /sys/block/md1/bcache/sequential_cutoff
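A somewhat more defensive variant of this script is sketched below. It makes three changes, each of them an assumption about what goes wrong at boot rather than a confirmed fix: it waits for each device node before registering it, it skips devices the kernel has already registered (writing an already-registered device to `register` returns an error), and it finds the cache set directory by glob instead of hard-coding the UUID. The function names and the `SYSFS` override are made up for illustration.

```shell
#!/bin/bash
# SYSFS is overridable only so the helpers can be dry-run against a scratch tree.
SYSFS="${SYSFS:-/sys}"

# Wait up to SECONDS (default 30) for PATH to exist; non-zero on timeout.
# Usage: wait_for_path PATH [SECONDS]
wait_for_path() {
    local path=$1 tries=$(( ${2:-30} * 10 ))
    while [ ! -e "$path" ] && [ "$tries" -gt 0 ]; do
        sleep 0.1
        tries=$(( tries - 1 ))
    done
    [ -e "$path" ]
}

# Register DEV with bcache unless it already has a bcache/ directory in sysfs.
register_dev() {
    local dev=$1 name=${1##*/}
    wait_for_path "$dev" 30 || { echo "timeout waiting for $dev" >&2; return 1; }
    if [ -e "$SYSFS/class/block/$name/bcache" ]; then
        return 0    # already registered, nothing to do
    fi
    echo "$dev" > "$SYSFS/fs/bcache/register"
}

main() {
    local f
    register_dev /dev/nvme0n1p1 || return 1
    register_dev /dev/md1 || return 1
    wait_for_path /dev/bcache0 30 || return 1
    # Same tunables as the original script, without the hard-coded UUID.
    for f in "$SYSFS"/fs/bcache/*/congested_read_threshold_us \
             "$SYSFS"/fs/bcache/*/congested_write_threshold_us \
             "$SYSFS"/class/block/md1/bcache/sequential_cutoff; do
        [ -w "$f" ] && echo 0 > "$f"
    done
    return 0
}

# main    # uncomment once this is installed as /etc/bcache/init.sh
```

Returning non-zero when a device never appears lets systemd record the unit as failed, which is easier to spot in the journal than a mount that silently never shows up.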

Because I have poor control/understanding of the order here with boot-up, I cannot add /dev/bcache0 to /etc/fstab, so I am using autofs to mount the volume locally - it will be automounted by nfs. That seems to work if systemd is able to come up clean.

Systemd will randomly fail to come up cleanly, I presume because I am missing a dependency. I’ve tried various “Before” and “After” clauses against different targets, but they generally create ordering loops in systemd, since I am messing with local-fs, local-fs-pre, etc. and there are deep dependencies there.
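Two properties of the unit shown earlier are plausible sources of the loops and the randomness. First, a normal service implicitly gets After=sysinit.target and After=basic.target, and those targets are themselves ordered after local-fs.target, so adding Before=local-fs.target without DefaultDependencies=no asks for a cycle. Second, Type=simple counts the service as started the moment init.sh is forked, so anything ordered after it can race the registration. The sketch below writes a candidate unit to a scratch path for inspection. Ordering on `dev-md1.device` instead of mdmonitor.service and the WantedBy=local-fs.target install target are assumptions to test, not known-good settings for this machine, and `write_bcache_unit` is a made-up helper name.

```shell
#!/bin/bash
# Write a candidate bcache.service to the path given as $1.
# Nothing under /etc is touched; copy the file into place yourself.
write_bcache_unit() {
    cat > "$1" <<'EOF'
[Unit]
Description=Bcache Setup
# Drop the implicit After=sysinit.target/basic.target ordering, which
# conflicts with Before=local-fs.target.
DefaultDependencies=no
# Assumption: wait for the md device itself rather than mdmonitor.service.
Requires=dev-md1.device
After=dev-md1.device
Before=local-fs.target

[Service]
# oneshot + RemainAfterExit: later units wait until init.sh has finished.
Type=oneshot
RemainAfterExit=yes
ExecStart=/etc/bcache/init.sh

[Install]
WantedBy=local-fs.target
EOF
}
```

A possible workflow: `write_bcache_unit /tmp/bcache.service`, then `systemd-analyze verify /tmp/bcache.service` to catch syntax problems, and after the next boot `journalctl -b | grep -i 'ordering cycle'` to see whether systemd had to break a cycle by dropping a job.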

If someone has bcache running on nvme on an ubuntu system, I’d love to see what the .service files look like?

I’m looking through nfs tuning parameters to see if I can encourage the NFS daemon to use 1MB blocks when writing to the disk locally? This server has a UPS and critical data is backed up elsewhere, so this system can trade risk of corruption for speed to a large degree. Obviously lost data is always a pain, but…

Thanks in advance for any insight provided…

nx2l · November 24, 2018, 3:27am

Have you tried lvmcache?

FYI: That kernel is not standard for centos 7.5

cekim · November 24, 2018, 3:42am

I’m not a fan of lvm… too much automagic that ends up stepping on its own feet… but no, not yet… it was on the list as was flashcache…

I also thought lvmcache was one that tended to perform poorly for write-caching. It was good at serving stale, frequently accessed data, but less so for tiered hot/cold access. I did intend to try it first-hand though.

Non-standard kernel because:

I needed newer for bcache
I needed the highpoint RR840A driver to compile as it is my HBA on this machine.
I’ve found from other noodling that 3.x + smelt (Spectre/Meltdown) patches hurt a lot more than 4.x, which performs virtually identically to 3.x prior to smelt.

nx2l · November 24, 2018, 10:00am

if you want lvmcache to cache the writes… you have to change the mode to writeback (default is write through)

cekim · November 25, 2018, 7:32pm

After playing with NFS for a bit, I am not able to get write performance to exceed 800MB/s. I tried /etc/sysctl.conf:
nfs.nfs4_bsize = 1048576
and
fs.nfs.nfs4_bsize = 1048576

Neither seemed to change anything… 800MB/s is roughly what I get when I use dd to write with 64k blocks.
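Before trying further sysctl changes, it may help to confirm what transfer sizes the client actually negotiated, since the effective rsize/wsize show up in the client's /proc/mounts regardless of what was requested. A small sketch follows; `mount_opt` and `show_nfs_sizes` are made-up helper names, and the parsing assumes the usual six-field /proc/mounts layout.

```shell
#!/bin/bash
# Print the value of option KEY from one /proc/mounts-style LINE.
# Usage: mount_opt LINE KEY
mount_opt() {
    local line=$1 key=$2
    # Field 4 of /proc/mounts is the comma-separated option list.
    echo "$line" | awk -v k="$key" '{
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++) {
            split(opts[i], kv, "=")
            if (kv[1] == k) { print kv[2]; exit }
        }
    }'
}

# Show negotiated rsize/wsize for every NFS mount on this client.
show_nfs_sizes() {
    local line
    while read -r line; do
        case "$line" in
            *" nfs "*|*" nfs4 "*)
                echo "$(echo "$line" | awk '{print $2}'):" \
                     "rsize=$(mount_opt "$line" rsize)" \
                     "wsize=$(mount_opt "$line" wsize)" ;;
        esac
    done < /proc/mounts
}
```

On the server side, `cat /proc/fs/nfsd/max_block_size` and `cat /proc/fs/nfsd/threads` show the corresponding nfsd limits. If the client reports a wsize well below 1048576, the 64K-like behaviour may come from negotiation rather than from anything on the disk side.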

So, for now I moved on to lvm:

pvcreate --force /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1 /dev/sdi1 /dev/sdj1 /dev/sdk1
vgcreate md1_vg /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1 /dev/sdi1 /dev/sdj1 /dev/sdk1
lvcreate --type raid6 -l 100%free -n md1 md1_vg
vgextend md1_vg /dev/nvme0n1
lvcreate --type cache --cachemode writeback -l 100%free -n md1_cache md1_vg/md1 /dev/nvme0n1
mkfs.xfs /dev/md1_vg/md1
dracut -v -f

12 hours of raid6 resync (the --nosync option is not allowed for raid6) and counting…

Can’t really see real performance until that sync is done… right now it works, but produces ~300MB/s.

I may create a large ramdisk on another server and see if I can isolate the NFS write performance issue…

cekim · November 25, 2018, 8:59pm

machine is still busy with mdx_resync, but I created a 16GB ramdisk and mounted that via NFS:

server:
mkdir -p /mnt/ramdisk
mount -t tmpfs -o size=16G tmpfs /mnt/ramdisk
/etc/exports:
/mnt/ramdisk    *(fsid=25,rw,async,no_root_squash)

client:
 mount -o proto=rdma,port=20049,rsize=1048576,wsize=1048576 server:/mnt/ramdisk /md1

Result was similar and similarly broken:

client write:
dd if=/dev/zero of=test2.img bs=1M count=10000
10485760000 bytes (10 GB) copied, 19.6019 s, 535 MB/s

client read:
dd if=test2.img of=/dev/null bs=1M
10485760000 bytes (10 GB) copied, 4.32758 s, 2.4 GB/s

Something is killing my NFS write speeds…

nx2l · November 25, 2018, 9:15pm

Tried sync instead of async?

cekim · November 25, 2018, 9:18pm

Yes - similar to worse.

nx2l · November 25, 2018, 9:21pm

How do things look during the transfer in top/iostat/vmstat/etc

cekim · November 25, 2018, 9:28pm

iotop doesn’t show ramdisk activity - during bcache tests it showed lumpy activity during write…

top shows nfsd spawning daemons as it should.

Didn’t run vmstat

cekim · November 25, 2018, 9:46pm

BTW - LVM’s version of raid6 is horrific with consumption of disk space…

43T with 10x8T disks?

nx2l · November 26, 2018, 12:15am

Raid in lvm is handled by mdadm last I heard

cekim · November 26, 2018, 12:23am

They clearly have a wrapper around it, or something in place of it, judging by the running processes and the empty /proc/mdstat. This setup was mdadm raid6 before and ~60TB, i.e. 10-2 disks’ worth of space.

more /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
unused devices: <none>
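One explanation that fits the 43T figure, offered as an assumption rather than a diagnosis: lvcreate(8) lists a default of 3 stripes for raid6 when --stripes is not given, and the lvcreate command used earlier did not pass it. That would give a 3 data + 2 parity layout, so only 3/5 of the raw space is usable, instead of the 8/10 that mdadm gives across all ten disks. The arithmetic, with made-up helper names and disks assumed to be exactly 8 decimal TB:

```shell
#!/bin/bash
# Raw capacity of N disks of SIZE_TB decimal terabytes, expressed in TiB.
# Usage: raw_tib N SIZE_TB
raw_tib() {
    awk -v n="$1" -v tb="$2" 'BEGIN { printf "%.2f\n", n * tb * 1e12 / 2^40 }'
}

# Usable TiB of a raid6 layout with D data stripes plus 2 parity stripes.
# Usage: raid6_usable RAW_TIB D
raid6_usable() {
    awk -v raw="$1" -v d="$2" 'BEGIN { printf "%.1f\n", raw * d / (d + 2) }'
}

raw=$(raw_tib 10 8)
echo "raw:        $raw TiB"
echo "3+2 layout: $(raid6_usable "$raw" 3) TiB"
echo "8+2 layout: $(raid6_usable "$raw" 8) TiB"
```

This prints 72.76, 43.7 and 58.2 TiB, which lines up with both the 43T seen under LVM and the roughly 60TB seen under mdadm. If that is the cause, recreating the LV with `--stripes 8` should recover the space, at the price of another full resync.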

cekim · November 28, 2018, 7:26am

Well… nearly 3 days of lvm raid6 sync later…

dd if=/dev/zero of=test.img bs=1M count=10000 oflag=direct
10485760000 bytes (10 GB) copied, 29.2226 s, 359 MB/s

made even more baffling by this:

dd of=/dev/null if=test.img bs=1M
10485760000 bytes (10 GB) copied, 3.23696 s, 3.2 GB/s

It was configured as a writeback cache, but appears to be behaving as write-through…
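One way to check whether the cache is really in writeback mode, and whether writes are landing in it at all, is to ask lvs for the cache fields. The field names below are the ones I understand `lvs -o help` to list; treat them as assumptions if your LVM version differs. The helper names are made up. Also note that dm-cache, which lvmcache is built on, decides per block what to promote, so a single sequential pass over fresh blocks may go mostly to the origin even in writeback mode. That is a guess about this workload, not a measurement.

```shell
#!/bin/bash
# Percentage of cached-LV writes that hit the cache; "n/a" if there were none.
# Usage: write_hit_pct HITS MISSES
write_hit_pct() {
    awk -v h="$1" -v m="$2" 'BEGIN {
        if (h + m == 0) print "n/a"
        else printf "%.1f\n", 100 * h / (h + m)
    }'
}

# Print mode, policy and write counters for one cached LV, e.g. md1_vg/md1.
show_cache_stats() {
    lvs --noheadings \
        -o lv_name,cache_mode,cache_policy,cache_dirty_blocks,cache_write_hits,cache_write_misses \
        "$1"
}
```

Running `show_cache_stats md1_vg/md1` before and after the dd, then feeding the change in hits and misses to `write_hit_pct`, shows how much of the write actually went through the Optane. A reported mode of writeback combined with a hit rate near zero would point at the promotion policy rather than at the mode setting.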