I am interested in helping with any testing, brainstorming, or otherwise with regard to ZFS performance over 100G. I have a rather high-end setup with hardware similar to what was discussed in today’s video (TRUENAS SCALE: STATE OF THE BETA Q4 2021) and would like to support in any way I can that is non-destructive to my data / pools.
OS: Arch Linux x86_64
Kernel: 5.15.4-arch1-1
CPU: AMD Ryzen Threadripper 3970X (64) @ 3.700GHz
GPU: NVIDIA GeForce RTX 3090
Memory: 10.21GiB / 62.80GiB (16%)
Network: Mellanox / NVIDIA MCX516A-CDAT
Storage: 3 x Sabrent Rocket 4.0 Plus 1TB in a RAID 0
Network:
The important part is that the desktop and server both connect to a Dell S5148F via 100G:
-- Server: 2x100G DAC in an LACP (802.3ad) LAG
-- Desktop: 1x100G via single-mode fiber (the other link is used as a bridge interface for VMs)
Current Type : S5148F
Hardware Revision : A00
Software Version : 10.4.3.6
Physical Ports : 48x25GbE, 6x100GbE
BIOS : 3.36.0.1-2
SMF-FPGA : 0.1
SMF-MSS : 1.2.2
Both the server and desktop have the following network-based tweaks:
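Something along these lines; the interface name and buffer sizes below are illustrative examples, not my exact settings:
sudo ip link set dev enp129s0f1np1 mtu 9000
sudo sysctl -w net.core.rmem_max=268435456
sudo sysctl -w net.core.wmem_max=268435456
sudo sysctl -w net.ipv4.tcp_rmem="4096 87380 268435456"
sudo sysctl -w net.ipv4.tcp_wmem="4096 65536 268435456"
-- jumbo frames plus larger TCP buffers; substitute your own interface name and values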
To save you from having to read the full thread to gather the key results, I will be summarizing them here:
Testing Notes
Note: All write results will show high CPU usage, as I am using a dRAID layout of 3 × 7 RAIDZ2 vdevs and ZFS must calculate parity and handle other internal functions.
Note: Where write results approach 40+ Gbps, that is the theoretical limit of my storage pool assuming a 4% transport overhead. If your storage is faster, your results will scale more closely with the read results.
Note: The same applies to write IOPS, although this is more complicated: the IOPS are not truly random, but random within a small subset of the drive, using file-based, non-destructive testing. I’d estimate that IOPS results approaching 2300-2400 are at the disk subsystem’s limit.
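For reference, the random-IO tests are file-based; a fio job along these lines approximates that kind of non-destructive test (the exact tool and parameters I used are not shown here, so treat these values as assumptions):
fio --name=randrw --directory=/tank/test --size=10G --ioengine=libaio \
    --rw=randrw --bs=4k --iodepth=32 --numjobs=4 \
    --time_based --runtime=60 --group_reporting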
Host Native:
Notes
Note: I converted the numbers from GiB/s to Gbit/s by multiplying by 8 (e.g. 41.7 GiB/s → 333.6 Gbit) to better align with the other results. (This is the worst case, as no protocol/transport overhead is factored in.)
Note: These results reflect the best case and should be used as the reference for the results that follow.
Sequential IO
| Seq. Read | Seq. Read CPU | Seq. Write | Seq. Write CPU |
| --- | --- | --- | --- |
| 333.6 Gbit | <1% | 84.0 Gbit | ~75% (mostly z_wr_int) |
Random IO
| Read IOPS | Read CPU | Write IOPS | Write CPU | R/W IOPS | R/W CPU |
| --- | --- | --- | --- | --- | --- |
| 44.1K | <2.5% | 2359 | ~60% | R: 2353 / W: 2358 | ~63% |
Linux client <--> Linux server:
Sequential IO
| Engine | RDMA | Seq. Read | Seq. Read CPU | Seq. Write | Seq. Write CPU | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| KSMBD | YES | 69.3 Gbit | ~5% | 40.0 Gbit | ~65% | Scales well to parallel jobs and a large # of outstanding IO |
| KSMBD | NO | 19.9 Gbit | ~5.5% | 16.4 Gbit | ~30% | |
| SAMBA | NO | 19.7 Gbit | ~6% | 16.8 Gbit | ~35% | Comparable to KSMBD but less efficient; better enterprise features (e.g. ACLs) |
| NFS | YES | 99.9 Gbit | ~11.7% | 98 Gbit | ~40% | Write figures reflect accurate network IO, but the CPU figure is suspect: I believe not all IO was written to disk, limiting the time ZFS spent calculating hashes and parity (ARC churn) |
| NFS | NO | 35.5 Gbit | ~10% | 21.4 Gbit | ~27% | |
Random IO
| Engine | RDMA | Read IOPS | Read CPU | Write IOPS | Write CPU | R/W IOPS | R/W CPU | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| KSMBD | YES | 8483 | 5.2% | 2331 | ~60% | R: 2241 / W: 2243 | ~60% | |
| KSMBD | NO | 9707 | 5.5% | 1027 | ~55% | R: 1503 / W: 1509 | ~59% | |
| SAMBA | NO | 9760 | 6% | 1926 | ~55% | R: 1872 / W: 1877 | ~57% | |
| NFS | YES | 39.1k | 50.5% | 2249 | ~60% | R: 2295 / W: 2300 | ~62% | |
| NFS | NO | 33.8k | 40% | 2285 | ~60% | R: 2286 / W: 2291 | ~70% | |
Linux Client <--> Windows server VM Finding Summary:
Setup Notes:
I enabled SR-IOV & IOMMU on the server and created a QEMU/libvirt VM, passing through a virtual Mellanox PCI device (VF) for the NIC; a sketch of the passthrough definition follows these setup notes. I then created a 200G RAW (not qcow2) storage device stored in a dir-based storage pool backed by the same ZFS pool as the other testing.
VM SPECS: 8 vCPU | 8GB Ram
NOTE: I have captured the CPU figures from HOST CPU consumption, not the guest, to better align with the other results.
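For reference, passing one of the virtual functions through to the guest boils down to a hostdev entry in the libvirt domain XML along these lines (the PCI address shown is an example, not necessarily the VF address on my system):
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <!-- PCI address of the VF being passed through (example address) -->
    <address domain='0x0000' bus='0x81' slot='0x00' function='0x2'/>
  </source>
</hostdev>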
Sequential IO
| Engine | RDMA | Seq. Read | Seq. Read CPU | Seq. Write | Seq. Write CPU | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| SMB | YES | 81.2 Gbit | ~25% | 23.0 Gbit | ~73% | Scales well to parallel jobs and a large # of outstanding IO |
| SMB | NO | 19.6 Gbit | ~21% | 20.9 Gbit | ~30% | |
Random IO
| Engine | RDMA | Read IOPS | Read CPU | Write IOPS | Write CPU | R/W IOPS | R/W CPU | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SMB | YES | 37.1k | ~63% | 2359 | ~70% | R: 2345 / W: 2354 | 80% | |
| SMB | NO | 37.1k | 67% | 2354 | ~77% | R: 2362 / W: 2369 | ~75% | |
Windows Client <--> Windows server VM Finding Summary:
Setup Note: Same VM setup and CPU methodology as the Linux <-> Windows VM tests.
Sequential IO
| Engine | RDMA | Seq. Read | Seq. Read CPU | Seq. Write | Seq. Write CPU | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| SMB | YES | 100 Gbit | ~32% | 52.9 Gbit | ~79% | Scales well to parallel jobs and a large # of outstanding IO |
| SMB | YES - 2 NICs | 184 Gbit! | ~50% | not tested | n/a | Multi-channel + RDMA (WOW) |
| SMB | NO | 100 Gbit | ~26% | 35.1 Gbit | ~80% | SMB Multichannel works well here |
Random IO
| Engine | RDMA | Read IOPS | Read CPU | Write IOPS | Write CPU | R/W IOPS | R/W CPU | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SMB | YES | 36.0k | 72% | 2354 | ~77% | R: 2282 / W: 2280 | ~50% | |
| SMB | NO | 35.7k | ~70% | 2293 | ~47% | R: 2279 / W: 2277 | 51% | |
Linux VM <-> Host with VIRTIO socket Finding Summary:
Setup Note: Arch VM with 8 cores and 8GB of RAM backed by hugepages.
Setup Note: The host leveraged a virtiofs socket with a thread-pool size of 32.
Setup Note: I converted the numbers from GiB/s to Gbit/s by multiplying by 8 to better align with other results. (This is the worst case, as no protocol/transport overhead is factored in.)
Sequential IO
| Engine | RDMA | Seq. Read | Seq. Read CPU | Seq. Write | Seq. Write CPU | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| VIRTIOFS | N/A | 227.2 Gbit | ~40% | 46.7 Gbit | ~80% | |
Random IO
| Engine | RDMA | Read IOPS | Read CPU | Write IOPS | Write CPU | R/W IOPS | R/W CPU | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| VIRTIOFS | N/A | 49k | 48% | 2477 | ~72% | R: 2463 / W: 2467 | ~75% | |
Environment Updates and configurations:
Server Updates:
SR-IOV
Enable SR-IOV
To achieve the best performance, the server was configured with SR-IOV enabled on the NICs, allowing virtual-function (vNIC) PCI devices to be passed through to the VM.
BIOS Changes Required
I am running a Supermicro H12SSL-C, BIOS version 2.3 dated 10/20/2021, and made the following changes. NOTE: Enabled means Enabled, not Auto.
Advanced - ACPI:
I needed to ensure that PCI AER support was on to allow for PCI ARI.
Note: on Intel platforms you may need to set intel_iommu=on (or similar) on the kernel command line.
Network Card changes:
sudo mstconfig -d 81:00.1 set SRIOV_EN=1 NUM_OF_VFS=4
-- SPECIFIC TO MELLANOX FIRMWARE
-- substitute 81:00.1 with the PCI address of your card
-- NUM_OF_VFS = the number of virtual functions (VFs) to create
-- run this once for each NIC
echo 4 | sudo tee /sys/class/net/enp129s0f1np1/device/sriov_numvfs
-- Tell Linux to initialize the 4 virtual adapters
-- Replace enp129s0f1np1 with the interface name (identify with 'ip link')
-- run once per NIC
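To confirm the VFs actually appeared, something like the following can be used:
lspci -nn | grep -i "virtual function"
ip link show enp129s0f1np1
-- the first command lists the new VF PCI devices; the second shows the vf 0-3 entries on the parent interface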
Validate IOMMU
Run the following to output a list of IOMMU groups; hopefully each vNIC is in its own IOMMU group.
for g in $(find /sys/kernel/iommu_groups/* -maxdepth 0 -type d | sort -V); do
    echo "IOMMU Group ${g##*/}:"
    for d in "$g"/devices/*; do
        echo -e "\t$(lspci -nns "${d##*/}")"
    done
done
Optimize libvirt
I don't know if it helped the results, but I enabled hugepage backing for the VM.
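In the libvirt domain XML (via 'virsh edit'), that amounts to a memoryBacking stanza roughly like this sketch:
<memoryBacking>
  <hugepages/>
</memoryBacking>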
Linux Host changes
Change hugepage options:
I added the following line to /etc/fstab, where 992 is the gid of the kvm group:
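For reference, a typical hugetlbfs entry of that form looks like the following (an illustrative reconstruction, not a copy of my exact line):
hugetlbfs  /dev/hugepages  hugetlbfs  mode=01770,gid=992  0  0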
# used to verify hugepagesize
grep -i Hugepagesize /proc/meminfo
Hugepagesize: 2048 kB
# knowing each hugepage was 2MB, I set up 10GB of hugepages
echo 5120 | sudo tee /proc/sys/vm/nr_hugepages
Have you used ksmbd yet? It has RDMA (but not multichannel).
Also, what about the passthrough filesystem driver for Windows (the one Red Hat signed)? I'm wondering if I can transparently access a ZFS share from the guest Windows OS at faster speeds than something like loopback iSCSI.
We’ve been doing database benchmarking. For storage I have a striped setup of 4 NVMe drives. Freshly formatted, my storage can do 10GB/s read, 5.5GB/s write, 1.4M IOPS read, 1.3M IOPS write.
I use Vagrant, libvirt, Ansible, and containers to make my benchmarking portable and to document the config as much as possible. As usual, storage is a pain. I want it to be easy to script, but because I’m benchmarking databases I also need the storage to be as fast as practical, so running a database on a filesystem in a QCOW2 image on a filesystem doesn’t work for me. LVM as a storage pool didn’t work as automagically with Vagrant and libvirt as the filesystem-backed storage pools. Eventually I want to move the storage array to another box and do NVMe-oF anyway, so I figured I’d have my host run an NVMe-oF target, have the VMs act as NVMe-oF clients, and script the drive creation in Ansible.
It was very easy to prototype; nvmetcli made setting up the target exceptionally easy, and the client side was even easier. However… no RDMA over the virtual network, and performance suffered: I got 3GB/s and 1.5GB/s instead of 10/5.5. After some debugging with iperf3 I found this was a limit of the virtual network; in fact, that was also the speed limit for localhost networking. So I abandoned that approach and now just deal with some manual steps with LVM, which gets pretty close to host performance.
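For anyone curious, the client-side hookup is essentially a single nvme-cli call along these lines (the address, port, and NQN are placeholders):
sudo nvme connect -t tcp -a 192.168.122.1 -s 4420 -n nqn.2021-12.io.example:bench
sudo nvme list
# the remote namespace then shows up as a local /dev/nvmeXnY block device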
Conclusions:
NVMe-oF is definitely where I’m going eventually. NVMe is definitely the better protocol from a performance standpoint, and the tools are pretty easy. Getting RDMA working looked trivial, assuming I had a NIC that supported it.
If you’re looking to do RDMA from within the VM I think there are two options:
KSMBD: not yet; I can look into it over the next couple of days.
The “Desktop” above is a dual boot with Win10, and it also has Windows guests available under the Linux OS. Can you provide a link or two for the passthrough filesystem?
So… I have KSMBD set up and working without RDMA… How do I enable RDMA? I don’t see anything about it here:
https://github.com/cifsd-team/ksmbd-tools/blob/master/Documentation/configuration.txt
Throughput test with Linux Client
FYI: I’m using the ‘Peak’ switch statistics, as gathering them is easier than getting the stats on the host when using RDMA, and it should be an apples-to-apples comparison.
Single instance
Command:
bonnie++ -d __PATH__ -n 0 -f -c4
Results:
KSMBD (NO RDMA):
-- Write to Server: 17026 Mbits/sec, 235589 packets/sec, 17% of line rate
-- Read From Server: 16061 Mbits/sec, 222553 packets/sec, 16% of line rate
KSMBD (RDMA):
-- Write to Server: 20963 Mbits/sec, 632419 packets/sec, 20% of line rate
-- Read From Server: 17958 Mbits/sec, 541960 packets/sec, 17% of line rate
SAMBA (SMB)
-- Write to Server: 22844 Mbits/sec, 316170 packets/sec, 22% of line rate
-- Read From Server: 16915 Mbits/sec, 234754 packets/sec, 16% of line rate
NFS (RDMA):
-- Write to Server: 17808 Mbits/sec, 536497 packets/sec, 17% of line rate
-- Read From Server: 11512 Mbits/sec, 364946 packets/sec, 11% of line rate
Parallel instances:
Command:
bonnie++ -p8
bonnie++ -y s -d __PATH__ -n 0 -f -c4 &
bonnie++ -y s -d __PATH__ -n 0 -f -c4 &
bonnie++ -y s -d __PATH__ -n 0 -f -c4 &
bonnie++ -y s -d __PATH__ -n 0 -f -c4 &
bonnie++ -y s -d __PATH__ -n 0 -f -c4 &
bonnie++ -y s -d __PATH__ -n 0 -f -c4 &
bonnie++ -y s -d __PATH__ -n 0 -f -c4 &
bonnie++ -y s -d __PATH__ -n 0 -f -c4 &
Results:
KSMBD (NO RDMA):
-- Write to Server: 16908 Mbits/sec, 233860 packets/sec, 16% of line rate
-- Read From Server: 25428 Mbits/sec, 352402 packets/sec, 25% of line rate
KSMBD (RDMA):
-- Write to Server: 39502 Mbits/sec, 1191122 packets/sec, 39% of line rate
-- Read From Server: 41601 Mbits/sec, 1255649 packets/sec, 41% of line rate
SAMBA (no RDMA)
-- Write to Server: 22844 Mbits/sec, 316170 packets/sec, 22% of line rate
-- Read From Server: 3461 Mbits/sec, 51796 packets/sec, 3% of line rate
NFS (RDMA):
-- Write to Server: 40182 Mbits/sec, 1209318 packets/sec, 40% of line rate
-- Read From Server: 68411 Mbits/sec, 2168665 packets/sec, 68% of line rate
I was trying to test with the PC booted into Linux, but no luck mounting with the RDMA option:
FAILS: sudo mount -t cifs //server/temp temp -o vers=3.1.1,rdma
WORKS: sudo mount -t cifs //server/temp temp -o vers=3.1.1
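One thing I still need to rule out (an assumption on my part): the in-kernel cifs client only honors the rdma mount option when it was built with SMB Direct support, which can be checked with:
zgrep CONFIG_CIFS_SMB_DIRECT /proc/config.gz
-- expecting CONFIG_CIFS_SMB_DIRECT=y; Arch exposes the running kernel config at /proc/config.gz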
I’ll test Windows soon using IOzone or similar. Also, I will update the previous post with stats leveraging SAMBA multichannel and create a similar post for Windows 10 stats (excluding NFS).
Can you try enabling all the ksmbd debug prints, then do the following operations, recording the output with ‘sudo dmesg -c > mount_rdma.txt’ after each, and share the three output files here?
What’s the reasoning with -Sr? What about -w100 and -w0?
I added a warmup period of 10 seconds (-W10, capital W, not lowercase w) and disabled local caching on the client (-Sr) to ensure all the reported IO traversed the network. FYI: with no -w (lowercase), the default is 100% reads.
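Putting those flags together, the diskspd invocation looks something like this (block size, duration, thread/queue depth, file size, and target path here are illustrative, not the exact command I ran):
diskspd.exe -c10G -b1M -d60 -W10 -Sr -t8 -o32 -w0 Z:\testfile.dat
-- -W10 = 10s warmup, -Sr = bypass the local client cache, -w0 (the default) = 100% reads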
As for speed, the 32 Gbit is better than the Linux cifs mount (32 Gbit vs 25.4 Gbit), but I assume the IO profile differs between bonnie++ and diskspd. I was not capturing the same level of detail as with the Linux results this time, but I wanted to provide a quick update; I’ll post more detailed results when I have time to duplicate the tests.
“did not see RDMA used” - how so?
When using RDMA, the server does not reflect the network IO in the network counters within htop, as the data does not traverse the typical network stack.
Ex: when doing 70+ Gbit over RDMA (NFS), the network counters show <1 Mbit of traffic (SSH session and other misc traffic). FYI: this is also why I’m using switch statistics to measure traffic, to get consistent values.
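If host-side numbers are wanted, the RDMA traffic does still show up in the HCA's hardware counters, e.g. (the device name is an example; these counters are in units of 4-byte words):
cat /sys/class/infiniband/mlx5_0/ports/1/counters/port_rcv_data
cat /sys/class/infiniband/mlx5_0/ports/1/counters/port_xmit_data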
Also, not sure if it matters since it’s supposed to be automatic, but I’m just mounting the share with a Windows “NET USE…” command (yes, I’m dating myself a little with that one, but I prefer the CLI).
May have found something, but haven’t had a chance to mess with it yet:
Looking at the Windows event log:
Event Details
LogName : Microsoft-Windows-SmbClient/Connectivity
Id : 30822
TimeCreated : 12/5/2021 3:06:31 PM
Level : 4
Message : Failed to establish an SMB multichannel network connection.
Error: The transport connection attempt was refused by the remote system.
Server name: nas.domain.com
Server address: 10.0.1.100:445
Client address: 10.0.1.117
Instance name: \Device\LanmanRedirector
Connection type: Wsk
Guidance:
This indicates a problem with the underlying network or transport, such as with TCP/IP, and not with SMB. A firewall that blocks TCP port 445, or TCP port 5445 when
using an iWARP RDMA adapter can also cause this issue. Since the error occurred while trying to connect extra channels, it will not result in an application error.
This event is for diagnostics only.
Looking at the server (Linux), it appears that I need to enable PFC / ETS:
Update: wrong, not required.
Also: on the Dell OS10 switch
The S5148F is a great value at ~$1200 - $1400 US for a 48x25G + 6x100G switch, but it can’t run most open network OSes like SONiC, etc. Below is what I’ve come up with so far, but still no luck with SMB RDMA… Does SMB RDMA work over 802.3ad LAGs? (see ports 1/1/51-1/1/52 below)
switch config snippets
...
class-map type network-qos nqosmap_rdma
match qos-group 3
!
policy-map type application policy-iscsi
!
policy-map type network-qos p_nqos_rdma
!
class nqosmap_rdma
pause
pfc-cos 3
!
system qos
trust-map dot1p default
!
...
!
interface port-channel3
description nas_10.0.1.100
no shutdown
switchport access vlan 1
mtu 9216
spanning-tree port type edge
!
...
!
interface ethernet1/1/49
description Desktop
no shutdown
switchport access vlan 1
mtu 9216
flowcontrol receive off
flowcontrol transmit off
priority-flow-control mode on
service-policy input type network-qos p_nqos_rdma
spanning-tree port type edge
!
interface ethernet1/1/50
description Desktop
no shutdown
switchport access vlan 1
mtu 9216
flowcontrol receive off
flowcontrol transmit off
priority-flow-control mode on
service-policy input type network-qos p_nqos_rdma
spanning-tree port type edge
!
interface ethernet1/1/51
description nas_10.0.1.100
no shutdown
channel-group 3 mode active
no switchport
mtu 9216
flowcontrol receive off
flowcontrol transmit off
priority-flow-control mode on
service-policy input type network-qos p_nqos_rdma
!
interface ethernet1/1/52
description nas_10.0.1.100
no shutdown
channel-group 3 mode active
no switchport
mtu 9216
flowcontrol receive off
flowcontrol transmit off
priority-flow-control mode on
service-policy input type network-qos p_nqos_rdma
!
I feel like I’m asking more questions than providing assistance (I’m new to DCB in general), but once we get over this hurdle, hopefully I can validate findings, run tests, etc.
I haven’t done much in this sphere since 56Gbit IB was a big deal, so take my advice with a tablespoon of salt. But at first pass my gut says PFC is a big deal for optimizing performance sometimes, but it shouldn’t make or break your ability to establish the RDMA connection.
I think the big clue here is that multichannel is failing. The ksmbd docs mention multichannel being a requirement for RDMA, but note that both features are only partially supported. I don’t understand why that should be the case at the protocol level, but it’s possible that’s an implementation quirk of the Windows client code.
Perhaps if you can get multichannel working, you’ll get to the part where the RDMA fails!
Get-SmbMultichannelConnection
Server Name Selected Client IP Server IP Client Interface Index Server Interface Index Client RSS Capable Client RDMA Capable
----------- -------- --------- --------- ---------------------- ---------------------- ------------------ -------------------
nas True 10.0.1.10 10.0.1.100 19 2 False False
nas True 10.0.1.117 10.0.1.100 18 2 False False
Looking further into RDMA, I’m now getting the following event logs:
Windows Events
LogName : Microsoft-Windows-SmbClient/Connectivity
Id : 30822
TimeCreated : 12/6/2021 1:34:49 PM
Level : 4
Message : Failed to establish an SMB multichannel network connection.
Error: The transport connection attempt was refused by the remote system.
Server name: nas
Server address: 10.0.1.100:445
Client address: 10.0.1.117
Instance name: \Device\LanmanRedirector
Connection type: Wsk
Guidance:
This indicates a problem with the underlying network or transport, such as with TCP/IP, and not with SMB. A firewall that blocks TCP port 445, or TCP port 5445 when
using an iWARP RDMA adapter can also cause this issue. Since the error occurred while trying to connect extra channels, it will not result in an application error.
This event is for diagnostics only.
LogName : Microsoft-Windows-SmbClient/Connectivity
Id : 30804
TimeCreated : 12/6/2021 1:34:47 PM
Level : 2
Message : A network connection was disconnected.
Instance name: \Device\LanmanRedirector
Server name: \nas
Server address: 10.0.1.100:445
Connection type: Wsk
InterfaceId: 19
Guidance:
This indicates that the client's connection to the server was disconnected.
Frequent, unexpected disconnects when using an RDMA over Converged Ethernet (RoCE) adapter may indicate a network misconfiguration. RoCE requires Priority Flow
Control (PFC) to be configured for every host, switch and router on the RoCE network. Failure to properly configure PFC will cause packet loss, frequent disconnects
and poor performance.
The server only reports that there is no IPv6; I assume that’s not a requirement?
The log you linked: what was the workflow you captured? Approximate timing would be nice. Example: at time 0 I mounted the fs, ~10s later I read, and ~20s after reading I wrote a file.
Also, I find it odd that Windows reports ‘Client RDMA Capable’ as False in the Get-SmbMultichannelConnection output above, but Get-NetAdapterRDMA reports RDMA as enabled?
Get-NetAdapterRDMA
Name InterfaceDescription Enabled PFC ETS
---- -------------------- ------- --- ---
100G_1 Mellanox ConnectX-5 Ex Adapter True False False
100G_2 Mellanox ConnectX-5 Ex Adapter #2 True False False
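Another place to cross-check the client’s RDMA capability (not captured above) is the SMB client’s own view of each interface:
Get-SmbClientNetworkInterface | Format-Table FriendlyName, RdmaCapable, RssCapable, Speed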
I saw ZFS and “performance” mentioned together… figured I’d bring it up for those that aren’t aware yet: when OpenZFS 3.0 comes out it should have Direct IO, which can potentially resolve some bottlenecks when utilizing NVMe drives, though it’s mostly writes that benefit.