A Neverending Story: PCIe 3.0/4.0/5.0 Bifurcation, Adapters, Switches, HBAs, Cables, NVMe Backplanes, Risers & Extensions - The Good, the Bad & the Ugly

Hardware:

  • Motherboard: Gigabyte MZ72-HB0
  • CPU: 2x EPYC 7443
  • Drives: 4x Intel D5-P5316 U.2 PCI-e 4.0
  • PCIe HBA: HighPoint SSD7580A (4 SlimSAS 8i ports)
  • Cables: 2x HighPoint Slim SAS SFF-8654 to 2x SFF-8639 NVMe Cable for SSD7580

I had a bit of a journey (explained below), but ultimately I was able to
achieve a bandwidth of 28.4 GB/s doing sequential reads with fio against a Linux md raid0 array. The key insight is that on Epyc (and Threadripper?) you may want to change the Nodes Per Socket (NPS) setting in the BIOS.
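
For reference, the /dev/md0 device in the results below is a plain md raid0 stripe across the four drives, created along these lines (the exact mdadm options are just a sketch, assuming the drives enumerate as nvme0n1 to nvme3n1):

# stripe the four U.2 drives into a single md raid0 device
$ sudo mdadm --create /dev/md0 --level=0 --raid-devices=4 \
      /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
# sanity check that the array assembled before pointing fio at it
$ cat /proc/mdstat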

fio Results:

# /opt/fio/bin/fio --bs=2m --rw=read --time_based --runtime=60 --direct=1 --iodepth=16 --size=100G --filename=/dev/md0 --numjobs=1 --ioengine=libaio --name=seqread
seqread: (g=0): rw=read, bs=(R) 2048KiB-2048KiB, (W) 2048KiB-2048KiB, (T) 2048KiB-2048KiB, ioengine=libaio, iodepth=16
fio-3.30
Starting 1 process
Jobs: 1 (f=1): [R(1)][100.0%][r=26.5GiB/s][r=13.6k IOPS][eta 00m:00s]
seqread: (groupid=0, jobs=1): err= 0: pid=10378: Wed Jul 27 09:30:41 2022
  read: IOPS=13.5k, BW=26.4GiB/s (28.4GB/s)(1587GiB/60001msec)
    slat (usec): min=32, max=578, avg=63.60, stdev=23.46
    clat (usec): min=264, max=4962, avg=1117.58, stdev=79.42
     lat (usec): min=328, max=5369, avg=1181.28, stdev=79.04
    clat percentiles (usec):
     |  1.00th=[  963],  5.00th=[ 1057], 10.00th=[ 1074], 20.00th=[ 1090],
     | 30.00th=[ 1090], 40.00th=[ 1106], 50.00th=[ 1123], 60.00th=[ 1123],
     | 70.00th=[ 1123], 80.00th=[ 1139], 90.00th=[ 1139], 95.00th=[ 1172],
     | 99.00th=[ 1434], 99.50th=[ 1598], 99.90th=[ 2024], 99.95th=[ 2114],
     | 99.99th=[ 2343]
   bw (  MiB/s): min=23616, max=27264, per=100.00%, avg=27101.38, stdev=407.26, samples=119
   iops        : min=11808, max=13632, avg=13550.69, stdev=203.63, samples=119
  lat (usec)   : 500=0.01%, 750=0.12%, 1000=1.73%
  lat (msec)   : 2=98.02%, 4=0.11%, 10=0.01%
  cpu          : usr=1.45%, sys=88.36%, ctx=338808, majf=0, minf=8212
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=812351,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
   READ: bw=26.4GiB/s (28.4GB/s), 26.4GiB/s-26.4GiB/s (28.4GB/s-28.4GB/s), io=1587GiB (1704GB), run=60001-60001msec

Disk stats (read/write):
    md0: ios=13175835/0, merge=0/0, ticks=8504656/0, in_queue=8504656, util=99.93%, aggrios=3249404/0, aggrmerge=50764/0, aggrticks=2083607/0, aggrin_queue=2083607, aggrutil=99.85%
  nvme3n1: ios=3249404/0, merge=203059/0, ticks=2283093/0, in_queue=2283093, util=99.85%
  nvme0n1: ios=3249404/0, merge=0/0, ticks=2059957/0, in_queue=2059957, util=99.85%
  nvme1n1: ios=3249404/0, merge=0/0, ticks=1889297/0, in_queue=1889297, util=99.85%
  nvme2n1: ios=3249404/0, merge=0/0, ticks=2102083/0, in_queue=2102083, util=99.85%

In initial testing of this setup I was only able to achieve between 7 and 9 GB/s depending on setup/options. Each individual drive could do 6.9 GB/s on its own, so it was clearly hitting a
bottleneck somewhere.
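
The single-drive number came from the same style of sequential read pointed at one namespace instead of the array; roughly like this (my reconstruction of a comparable job, not the exact command behind the 6.9 GB/s figure):

# sequential read against a single drive rather than /dev/md0
$ sudo /opt/fio/bin/fio --bs=2m --rw=read --time_based --runtime=60 --direct=1 \
      --iodepth=16 --size=100G --filename=/dev/nvme0n1 --numjobs=1 \
      --ioengine=libaio --name=seqread-single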

What was also annoying was that I had asymmetric direct IO access speeds: 7 GB/s from
one CPU socket and only 1 GB/s from the other. I expected some drop-off, but this was
far more than anticipated. The OS runs on a Gen4 x4 M.2 drive and has similar access
speeds from either socket.
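
The per-socket comparison is easy to reproduce with numactl, binding CPU and memory to one socket at a time (with NPS1 there is one NUMA node per socket, so node 0 and node 1 map to the two sockets; the fio options just mirror the job above):

# run the job entirely from socket 0, then repeat with --cpunodebind=1 --membind=1
$ sudo numactl --cpunodebind=0 --membind=0 /opt/fio/bin/fio --bs=2m --rw=read \
      --time_based --runtime=60 --direct=1 --iodepth=16 --size=100G \
      --filename=/dev/md0 --numjobs=1 --ioengine=libaio --name=seqread-node0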

I contacted HighPoint support and they were helpful. There was some back and
forth of “is this plugged into an x16 slot?”, “are you pinning the process
to a given NUMA node?”, “are our drivers installed?” etc. It turns out the magic
option was to set the Nodes Per Socket (NPS) to NPS4 in the BIOS. This gives me a
NUMA node per core complex (4 per socket, 8 total). I had found some information
on setting the NPS in Epyc tuning guides, but no mention of its impact on PCI-e
bandwidth. We have another machine with 2x Epyc 7302 (PCI-e 3.0)
and a HighPoint SSD7540 with 7x M.2 drives. Previously I could only get 6 GB/s
out of it, which I presumed was some bottleneck/inefficiency from using an odd number of
drives. After setting NPS4 in the BIOS on that machine it bumped up to 13.3 GB/s.
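
If you flip this setting yourself it is worth confirming from Linux that it took effect; with NPS4 on a dual-socket board you should see 4 NUMA nodes per socket (8 total here):

# count the NUMA nodes the kernel sees after the BIOS change
$ lscpu | grep -i "numa node(s)"
# full per-node CPU and memory layout
$ numactl --hardware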

If anyone knows why the PCI-e bandwidth gets bottlenecked when NPS1 (the default) is used, I
would be curious. My guess is that, to make the multiple core complexes appear as a
single node per socket, the Infinity Fabric uses some of the lanes/bandwidth, but this is
pure speculation. There is an old post by Wendell on “Fixing Slow NVMe Raid Performance on Epyc” which I didn’t want to bump, but I wonder whether the nodes per socket setting would affect that case as well.

Back to the Epyc 7443 system: here are the bandwidth numbers when pinning fio to
different CPUs (fio --cpus_allowed=<ids>).
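
Each pinned run is the same job as above with a CPU list added, e.g. for the NUMA node 1 cores (take the actual core IDs from the lscpu/numactl output on your own box):

# same sequential read, restricted to the cores in NUMA node 1
$ sudo /opt/fio/bin/fio --bs=2m --rw=read --time_based --runtime=60 --direct=1 \
      --iodepth=16 --size=100G --filename=/dev/md0 --numjobs=1 \
      --ioengine=libaio --cpus_allowed=6-11 --name=seqread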

The drives are all connected to NUMA node 1:

$ cat /sys/block/nvme[0-3]n1/device/numa_node
1
1
1
1

Curiously, I get the best results if I allow any CPU on socket 0 rather than
just the node the drives are connected to.

Cores 0-23 (Socket 0, NUMA nodes 0,1,2,3):
   READ: bw=26.3GiB/s (28.3GB/s), 26.3GiB/s-26.3GiB/s (28.3GB/s-28.3GB/s), io=1580GiB (1696GB), run=60001-60001msec
Cores 24-47 (Socket 1, NUMA nodes 4,5,6,7):
   READ: bw=21.4GiB/s (23.0GB/s), 21.4GiB/s-21.4GiB/s (23.0GB/s-23.0GB/s), io=1283GiB (1377GB), run=60001-60001msec
Cores 6-11 (Socket 0, NUMA node 1):
   READ: bw=24.8GiB/s (26.6GB/s), 24.8GiB/s-24.8GiB/s (26.6GB/s-26.6GB/s), io=1486GiB (1596GB), run=60001-60001msec

We went with the HighPoint HBA because we were originally going to use a Supermicro H12 board. The Gigabyte MZ72-HB0 actually has 5x SlimSAS 4i connectors on board for plugging U.2 drives in directly, although these are split 2 on Socket 0 and 3 on Socket 1. The motherboard manual is quite sparse on details, but I can confirm all the SlimSAS connectors are 4i even though only 2 are labelled as such. It is quite hard to find information on the SFF connector specs/speeds, but my understanding is that MiniSAS is 12Gb/s per lane over 4 lanes (so 6 GB/s), while SlimSAS is 24Gb/s per lane (so 12 GB/s). My understanding is therefore that you should NOT use MiniSAS bifurcation for Gen4 SSDs if you want peak performance.
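
Rough per-connector arithmetic as I understand the specs (these are connector signal ratings; PCI-e Gen4 is shown for comparison, and its usable throughput is a little under the raw rate because of encoding overhead):

MiniSAS:        12 Gb/s x 4 lanes = 48 Gb/s ≈  6 GB/s
SlimSAS:        24 Gb/s x 4 lanes = 96 Gb/s ≈ 12 GB/s
PCI-e Gen4 x4:  16 GT/s x 4 lanes = 64 Gb/s ≈  8 GB/s raw (~7.9 GB/s usable)

So a Gen4 drive needs 16 GT/s per lane on the wire, which is beyond what the 12 Gb/s MiniSAS connector is rated for, while SlimSAS has headroom.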

I found this forum thread before I set the nodes per socket in the BIOS, and decided to order and try some SlimSAS 4i to U.2 cables from Amazon (YIWENTEC SFF 8654 4i Slimline sas to SFF 8639). These were blue-ish cables, and given previous mentions of poor quality (albeit with different connector types, I think) I didn’t have much hope. However, testing these cables I saw the expected speeds and no PCI-e bus errors. YMMV if connecting via a bifurcation card; DeLock seems to be the only option there, but hopefully there will be more possibilities in future, particularly with Gen5 incoming.
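
If you want to check a cable yourself, the link state and error counters are visible from Linux after a sustained run; something like the following (substitute the PCI address of your drive, e.g. from lspci | grep -i "non-volatile"):

# any AER / bus error noise in the kernel log?
$ sudo dmesg | grep -iE "aer|bus error"
# confirm the link trained at the expected speed/width and check the error bits
$ sudo lspci -vvv -s <pci-address> | grep -E "LnkSta:|DevSta:"

For a Gen4 x4 drive, LnkSta should report Speed 16GT/s, Width x4; downtraining to 8GT/s or a narrower width is the usual sign of a marginal cable or connection.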

Edit: You need to change the BIOS setting for 3 of the SlimSAS 4i ports from SATA to NVMe.
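
A quick way to confirm the port mode change took effect is that the attached drives enumerate as NVMe controllers rather than SATA:

# drives behind the re-configured SlimSAS ports should show up here
$ lspci -nn | grep -i "non-volatile"
$ ls /dev/nvme*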
