I am interested in helping with any testing, brainstorming, or otherwise with regard to ZFS performance over 100G. I have a rather high-end setup with hardware similar to what was discussed in today’s video (TRUENAS SCALE: STATE OF THE BETA Q4 2021) and would like to support in any way I can that is non-destructive to my data / pools.
OS: Arch Linux x86_64
Kernel: 5.15.4-arch1-1
CPU: AMD Ryzen Threadripper 3970X (64) @ 3.700GHz
GPU: NVIDIA GeForce RTX 3090
Memory: 10.21GiB / 62.80GiB (16%)
Network: Mellanox / NVIDIA MCX516A-CDAT
Storage: 3 x Sabrent Rocket 4.0 Plus 1TB in a RAID 0
Network:
The important part is that the desktop and server both connect to a Dell S5148F via 100G:
-- Server: 2x100G DAC in an LACP (802.3ad) LAG
-- Desktop: 1x100G via single-mode fiber (the other link is used as a bridge interface for VMs)
Current Type : S5148F
Hardware Revision : A00
Software Version : 10.4.3.6
Physical Ports : 48x25GbE, 6x100GbE
BIOS : 3.36.0.1-2
SMF-FPGA : 0.1
SMF-MSS : 1.2.2
Both the server and desktop have the following network-based tweaks:
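Something along these lines; the interface name and buffer sizes below are illustrative examples, not my exact settings:
sudo ip link set dev enp129s0f1np1 mtu 9000
sudo sysctl -w net.core.rmem_max=268435456
sudo sysctl -w net.core.wmem_max=268435456
sudo sysctl -w net.ipv4.tcp_rmem="4096 87380 268435456"
sudo sysctl -w net.ipv4.tcp_wmem="4096 65536 268435456"
-- jumbo frames plus larger TCP buffers; substitute your own interface name and values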
To save you from having to read the full thread to gather the key results, I will be summarizing them here:
Testing Notes
Note: All write results will show high CPU usage, as I am using a dRAID layout of 3 × 7 RAIDZ2 vdevs and ZFS must calculate parity and handle other internal functions.
Note: Where write results approach 40+ Gbps, that is the theoretical limit of my storage pool assuming a 4% transport overhead. If your storage is faster, your results will scale more closely with the read results.
Note: The same applies to write IOPS, although this is more complicated: the IOPS are not truly random, but random within a small subset of the drive, using file-based, non-destructive testing. I’d estimate that IOPS results approaching 2300-2400 are at the disk subsystem’s limit.
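For reference, the random-IO tests are file-based; a fio job along these lines approximates that kind of non-destructive test (the exact tool and parameters I used are not shown here, so treat these values as assumptions):
fio --name=randrw --directory=/tank/test --size=10G --ioengine=libaio \
    --rw=randrw --bs=4k --iodepth=32 --numjobs=4 \
    --time_based --runtime=60 --group_reporting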
Host Native:
Notes
Note: I converted the numbers from GiB/s to Gbit/s by multiplying by 8 (e.g. 41.7 GiB/s → 333.6 Gbit) to better align with the other results. (This is the worst case, as no protocol/transport overhead is factored in.)
Note: These results reflect the best case and should be used as the reference for the results that follow.
Sequential IO
| Seq. Read | Seq. Read CPU | Seq. Write | Seq. Write CPU |
| --- | --- | --- | --- |
| 333.6 Gbit | <1% | 84.0 Gbit | ~75% (mostly z_wr_int) |
Random IO
| Read IOPS | Read CPU | Write IOPS | Write CPU | R/W IOPS | R/W CPU |
| --- | --- | --- | --- | --- | --- |
| 44.1K | <2.5% | 2359 | ~60% | R: 2353 / W: 2358 | ~63% |
Linux client <--> Linux server:
Sequential IO
| Engine | RDMA | Seq. Read | Seq. Read CPU | Seq. Write | Seq. Write CPU | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| KSMBD | YES | 69.3 Gbit | ~5% | 40.0 Gbit | ~65% | Scales well to parallel jobs and a large # of outstanding IO |
| KSMBD | NO | 19.9 Gbit | ~5.5% | 16.4 Gbit | ~30% | |
| SAMBA | NO | 19.7 Gbit | ~6% | 16.8 Gbit | ~35% | Comparable to KSMBD but less efficient; better enterprise features (e.g. ACLs) |
| NFS | YES | 99.9 Gbit | ~11.7% | 98 Gbit | ~40% | Write figures reflect accurate network IO, but the CPU figure is suspect: I believe not all IO was written to disk, limiting the time ZFS spent calculating hashes and parity (ARC churn) |
| NFS | NO | 35.5 Gbit | ~10% | 21.4 Gbit | ~27% | |
Random IO
| Engine | RDMA | Read IOPS | Read CPU | Write IOPS | Write CPU | R/W IOPS | R/W CPU | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| KSMBD | YES | 8483 | 5.2% | 2331 | ~60% | R: 2241 / W: 2243 | ~60% | |
| KSMBD | NO | 9707 | 5.5% | 1027 | ~55% | R: 1503 / W: 1509 | ~59% | |
| SAMBA | NO | 9760 | 6% | 1926 | ~55% | R: 1872 / W: 1877 | ~57% | |
| NFS | YES | 39.1k | 50.5% | 2249 | ~60% | R: 2295 / W: 2300 | ~62% | |
| NFS | NO | 33.8k | 40% | 2285 | ~60% | R: 2286 / W: 2291 | ~70% | |
Linux Client <--> Windows server VM Finding Summary:
Setup Notes:
I enabled SR-IOV & IOMMU on the server and created a QEMU/libvirt VM, passing through a virtual Mellanox PCI device (VF) for the NIC; a sketch of the passthrough definition follows these setup notes. I then created a 200G RAW (not qcow2) storage device stored in a dir-based storage pool backed by the same ZFS pool as the other testing.
VM SPECS: 8 vCPU | 8GB Ram
NOTE: I have captured the CPU figures from HOST CPU consumption, not the guest, to better align with the other results.
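For reference, passing one of the virtual functions through to the guest boils down to a hostdev entry in the libvirt domain XML along these lines (the PCI address shown is an example, not necessarily the VF address on my system):
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <!-- PCI address of the VF being passed through (example address) -->
    <address domain='0x0000' bus='0x81' slot='0x00' function='0x2'/>
  </source>
</hostdev>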
Sequential IO
| Engine | RDMA | Seq. Read | Seq. Read CPU | Seq. Write | Seq. Write CPU | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| SMB | YES | 81.2 Gbit | ~25% | 23.0 Gbit | ~73% | Scales well to parallel jobs and a large # of outstanding IO |
| SMB | NO | 19.6 Gbit | ~21% | 20.9 Gbit | ~30% | |
Random IO
| Engine | RDMA | Read IOPS | Read CPU | Write IOPS | Write CPU | R/W IOPS | R/W CPU | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SMB | YES | 37.1k | ~63% | 2359 | ~70% | R: 2345 / W: 2354 | 80% | |
| SMB | NO | 37.1k | 67% | 2354 | ~77% | R: 2362 / W: 2369 | ~75% | |
Windows Client <--> Windows server VM Finding Summary:
Setup Note: Same VM setup and CPU methodology as the Linux <-> Windows VM tests.
Sequential IO
| Engine | RDMA | Seq. Read | Seq. Read CPU | Seq. Write | Seq. Write CPU | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| SMB | YES | 100 Gbit | ~32% | 52.9 Gbit | ~79% | Scales well to parallel jobs and a large # of outstanding IO |
| SMB | YES - 2 NICs | 184 Gbit! | ~50% | not tested | n/a | Multi-channel + RDMA (WOW) |
| SMB | NO | 100 Gbit | ~26% | 35.1 Gbit | ~80% | SMB Multichannel works well here |
Random IO
| Engine | RDMA | Read IOPS | Read CPU | Write IOPS | Write CPU | R/W IOPS | R/W CPU | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SMB | YES | 36.0k | 72% | 2354 | ~77% | R: 2282 / W: 2280 | ~50% | |
| SMB | NO | 35.7k | ~70% | 2293 | ~47% | R: 2279 / W: 2277 | 51% | |
Linux VM <-> Host with VIRTIO socket Finding Summary:
Setup Note: Arch VM with 8 cores and 8GB of RAM backed by hugepages.
Setup Note: The host leveraged a virtiofs socket with a thread-pool size of 32.
Setup Note: I converted the numbers from GiB/s to Gbit/s by multiplying by 8 to better align with other results. (This is the worst case, as no protocol/transport overhead is factored in.)
Sequential IO
| Engine | RDMA | Seq. Read | Seq. Read CPU | Seq. Write | Seq. Write CPU | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| VIRTIOFS | N/A | 227.2 Gbit | ~40% | 46.7 Gbit | ~80% | |
Random IO
| Engine | RDMA | Read IOPS | Read CPU | Write IOPS | Write CPU | R/W IOPS | R/W CPU | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| VIRTIOFS | N/A | 49k | 48% | 2477 | ~72% | R: 2463 / W: 2467 | ~75% | |
Environment Updates and configurations:
Server Updates:
SR-IOV
Enable SR-IOV
To achieve the best performance, the server was configured with SR-IOV enabled on the NICs, allowing virtual-function (vNIC) PCI devices to be passed through to the VM.
BIOS Changes Required
I am running a Supermicro H12SSL-C, BIOS version 2.3 dated 10/20/2021, and made the following changes. NOTE: Enabled means Enabled, not Auto.
Advanced - ACPI:
I needed to ensure that PCI AER support was on to allow for PCI ARI.
Note: on Intel platforms you may need to set intel_iommu=on (or similar) on the kernel command line.
Network Card changes:
sudo mstconfig -d 81:00.1 set SRIOV_EN=1 NUM_OF_VFS=4
-- SPECIFIC TO MELLANOX FIRMWARE
-- substitute 81:00.1 with the PCI address of your card
-- NUM_OF_VFS = the number of virtual functions (VFs) to create
-- run this once for each NIC
echo 4 | sudo tee /sys/class/net/enp129s0f1np1/device/sriov_numvfs
-- Tell Linux to initialize the 4 virtual adapters
-- Replace enp129s0f1np1 with the interface name (identify with 'ip link')
-- run once per NIC
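To confirm the VFs actually appeared, something like the following can be used:
lspci -nn | grep -i "virtual function"
ip link show enp129s0f1np1
-- the first command lists the new VF PCI devices; the second shows the vf 0-3 entries on the parent interface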
Validate IOMMU
Run the following to output a list of IOMMU groups; hopefully each vNIC is in its own IOMMU group.
for g in $(find /sys/kernel/iommu_groups/* -maxdepth 0 -type d | sort -V); do
    echo "IOMMU Group ${g##*/}:"
    for d in "$g"/devices/*; do
        echo -e "\t$(lspci -nns "${d##*/}")"
    done
done
Optimize libvirt
I don't know if it helped the results, but I enabled hugepage backing for the VM.
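In the libvirt domain XML (via 'virsh edit'), that amounts to a memoryBacking stanza roughly like this sketch:
<memoryBacking>
  <hugepages/>
</memoryBacking>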
Linux Host changes
Change hugepage options:
I added the following line to /etc/fstab, where 992 is the gid of the kvm group:
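For reference, a typical hugetlbfs entry of that form looks like the following (an illustrative reconstruction, not a copy of my exact line):
hugetlbfs  /dev/hugepages  hugetlbfs  mode=01770,gid=992  0  0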
# used to verify hugepagesize
grep -i Hugepagesize /proc/meminfo
Hugepagesize: 2048 kB
# knowing each hugepage was 2MB, I set up 10GB of hugepages
echo 5120 | sudo tee /proc/sys/vm/nr_hugepages
Have you used ksmbd yet? It has RDMA (but not multichannel).
Also, what about the passthrough filesystem driver for Windows (the one Red Hat signed)? I'm wondering if I can transparently access a ZFS share from the guest Windows OS at faster speeds than something like loopback iSCSI.
We’ve been doing database benchmarking. For storage I have a striped setup of 4 NVMe drives. Freshly formatted, my storage can do 10GB/s read, 5.5GB/s write, 1.4M IOPS read, 1.3M IOPS write.
I use Vagrant, libvirt, Ansible, and containers to make my benchmarking portable and to document the config as much as possible. As usual, storage is a pain. I want it to be easy to script, but because I’m benchmarking databases I also need the storage to be as fast as practical, so running a database on a filesystem in a QCOW2 image on a filesystem doesn’t work for me. LVM as a storage pool didn’t work as automagically with Vagrant and libvirt as the filesystem-backed storage pools. Eventually I want to move the storage array to another box and do NVMe-oF anyway, so I figured I’d have my host run an NVMe-oF target, have the VMs act as NVMe-oF clients, and script the drive creation in Ansible.
It was very easy to prototype; nvmetcli made setting up the target exceptionally easy, and the client side was even easier. However… no RDMA over the virtual network, and performance suffered: I got 3GB/s and 1.5GB/s instead of 10/5.5. After some debugging with iperf3 I found this was a limit of the virtual network; in fact, that was also the speed limit for localhost networking. So I abandoned that approach and now just deal with some manual steps with LVM, which gets pretty close to host performance.
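For anyone curious, the client-side hookup is essentially a single nvme-cli call along these lines (the address, port, and NQN are placeholders):
sudo nvme connect -t tcp -a 192.168.122.1 -s 4420 -n nqn.2021-12.io.example:bench
sudo nvme list
# the remote namespace then shows up as a local /dev/nvmeXnY block device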
Conclusions:
NVMe-oF is definitely where I’m going eventually. NVMe is definitely the better protocol from a performance standpoint, and the tools are pretty easy. Getting RDMA working looked trivial, assuming I had a NIC that supported it.
If you’re looking to do RDMA from within the VM I think there are two options:
KSMBD: not yet; I can look into it over the next couple of days.
The “Desktop” above is a dual boot with Win10, and it also has Windows guests available under the Linux OS. Can you provide a link or two for the passthrough filesystem?
So… I have KSMBD set up and working without RDMA… How do I enable RDMA? I don’t see anything about it here:
https://github.com/cifsd-team/ksmbd-tools/blob/master/Documentation/configuration.txt
Throughput test with Linux Client
FYI: I’m using the ‘Peak’ switch statistics, as gathering them is easier than getting the stats on the host when using RDMA, and it should be an apples-to-apples comparison.
Single instance
Command:
bonnie++ -d __PATH__ -n 0 -f -c4
Results:
KSMBD (NO RDMA):
-- Write to Server: 17026 Mbits/sec, 235589 packets/sec, 17% of line rate
-- Read From Server: 16061 Mbits/sec, 222553 packets/sec, 16% of line rate
KSMBD (RDMA):
-- Write to Server: 20963 Mbits/sec, 632419 packets/sec, 20% of line rate
-- Read From Server: 17958 Mbits/sec, 541960 packets/sec, 17% of line rate
SAMBA (SMB)
-- Write to Server: 22844 Mbits/sec, 316170 packets/sec, 22% of line rate
-- Read From Server: 16915 Mbits/sec, 234754 packets/sec, 16% of line rate
NFS (RDMA):
-- Write to Server: 17808 Mbits/sec, 536497 packets/sec, 17% of line rate
-- Read From Server: 11512 Mbits/sec, 364946 packets/sec, 11% of line rate
Parallel instances:
Command:
bonnie++ -p8
bonnie++ -y s -d __PATH__ -n 0 -f -c4 &
bonnie++ -y s -d __PATH__ -n 0 -f -c4 &
bonnie++ -y s -d __PATH__ -n 0 -f -c4 &
bonnie++ -y s -d __PATH__ -n 0 -f -c4 &
bonnie++ -y s -d __PATH__ -n 0 -f -c4 &
bonnie++ -y s -d __PATH__ -n 0 -f -c4 &
bonnie++ -y s -d __PATH__ -n 0 -f -c4 &
bonnie++ -y s -d __PATH__ -n 0 -f -c4 &
Results:
KSMBD (NO RDMA):
-- Write to Server: 16908 Mbits/sec, 233860 packets/sec, 16% of line rate
-- Read From Server: 25428 Mbits/sec, 352402 packets/sec, 25% of line rate
KSMBD (RDMA):
-- Write to Server: 39502 Mbits/sec, 1191122 packets/sec, 39% of line rate
-- Read From Server: 41601 Mbits/sec, 1255649 packets/sec, 41% of line rate
SAMBA (no RDMA)
-- Write to Server: 22844 Mbits/sec, 316170 packets/sec, 22% of line rate
-- Read From Server: 3461 Mbits/sec, 51796 packets/sec, 3% of line rate
NFS (RDMA):
-- Write to Server: 40182 Mbits/sec, 1209318 packets/sec, 40% of line rate
-- Read From Server: 68411 Mbits/sec, 2168665 packets/sec, 68% of line rate
I was trying to test with the PC booted into Linux, but no luck mounting with the RDMA option:
FAILS: sudo mount -t cifs //server/temp temp -o vers=3.1.1,rdma
WORKS: sudo mount -t cifs //server/temp temp -o vers=3.1.1
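One thing I still need to rule out (an assumption on my part): the in-kernel cifs client only honors the rdma mount option when it was built with SMB Direct support, which can be checked with:
zgrep CONFIG_CIFS_SMB_DIRECT /proc/config.gz
-- expecting CONFIG_CIFS_SMB_DIRECT=y; Arch exposes the running kernel config at /proc/config.gz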
I’ll test Windows soon using IOzone or similar. Also, I will update the previous post with stats leveraging SAMBA multichannel and create a similar post for Windows 10 stats (excluding NFS).
Can you try enabling all the ksmbd debug prints, then do the following operations, recording the output with ‘sudo dmesg -c > mount_rdma.txt’ after each, and share the three output files here?
What’s the reasoning with -Sr? What about -w100 and -w0?
I added a warmup period of 10 seconds (-W10, capital W, not lowercase w) and disabled local caching on the client (-Sr) to ensure all the reported IO traversed the network. FYI: with no -w (lowercase), the default is 100% reads.
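Putting those flags together, the diskspd invocation looks something like this (block size, duration, thread/queue depth, file size, and target path here are illustrative, not the exact command I ran):
diskspd.exe -c10G -b1M -d60 -W10 -Sr -t8 -o32 -w0 Z:\testfile.dat
-- -W10 = 10s warmup, -Sr = bypass the local client cache, -w0 (the default) = 100% reads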
As for speed, the 32 Gbit is better than the Linux cifs mount (32 Gbit vs 25.4 Gbit), but I assume the IO profile differs between bonnie++ and diskspd. I was not capturing the same level of detail as with the Linux results this time, but I wanted to provide a quick update; I’ll post more detailed results when I have time to duplicate the tests.
“did not see RDMA used” - how so?
When using RDMA, the server does not reflect the network IO in the network counters within htop, as the data does not traverse the typical network stack.
Ex: when doing 70+ Gbit over RDMA (NFS), the network counters show <1 Mbit of traffic (SSH session and other misc traffic). FYI: this is also why I’m using switch statistics to measure traffic, to get consistent values.
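If host-side numbers are wanted, the RDMA traffic does still show up in the HCA's hardware counters, e.g. (the device name is an example; these counters are in units of 4-byte words):
cat /sys/class/infiniband/mlx5_0/ports/1/counters/port_rcv_data
cat /sys/class/infiniband/mlx5_0/ports/1/counters/port_xmit_data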
Also, not sure if it matters since it’s supposed to be automatic, but I’m just mounting the share with a Windows “NET USE…” command (yes, I’m dating myself a little with that one, but I prefer the CLI).
May have found something, but haven’t had a chance to mess with it yet:
Looking at the Windows event log:
Event Details
LogName : Microsoft-Windows-SmbClient/Connectivity
Id : 30822
TimeCreated : 12/5/2021 3:06:31 PM
Level : 4
Message : Failed to establish an SMB multichannel network connection.
Error: The transport connection attempt was refused by the remote system.
Server name: nas.domain.com
Server address: 10.0.1.100:445
Client address: 10.0.1.117
Instance name: \Device\LanmanRedirector
Connection type: Wsk
Guidance:
This indicates a problem with the underlying network or transport, such as with TCP/IP, and not with SMB. A firewall that blocks TCP port 445, or TCP port 5445 when
using an iWARP RDMA adapter can also cause this issue. Since the error occurred while trying to connect extra channels, it will not result in an application error.
This event is for diagnostics only.
Looking at the server (Linux), it appears that I need to enable PFC / ETS:
Update: wrong, not required.
Also: on the Dell OS10 switch
The S5148F is a great value at ~$1200 - $1400 US for a 48x25G + 6x100G switch, but it can’t run most open network OSes like SONiC, etc. Below is what I’ve come up with so far, but still no luck with SMB RDMA… Does SMB RDMA work over 802.3ad LAGs? (see ports 1/1/51-1/1/52 below)
switch config snippets
...
class-map type network-qos nqosmap_rdma
match qos-group 3
!
policy-map type application policy-iscsi
!
policy-map type network-qos p_nqos_rdma
!
class nqosmap_rdma
pause
pfc-cos 3
!
system qos
trust-map dot1p default
!
...
!
interface port-channel3
description nas_10.0.1.100
no shutdown
switchport access vlan 1
mtu 9216
spanning-tree port type edge
!
...
!
interface ethernet1/1/49
description Desktop
no shutdown
switchport access vlan 1
mtu 9216
flowcontrol receive off
flowcontrol transmit off
priority-flow-control mode on
service-policy input type network-qos p_nqos_rdma
spanning-tree port type edge
!
interface ethernet1/1/50
description Desktop
no shutdown
switchport access vlan 1
mtu 9216
flowcontrol receive off
flowcontrol transmit off
priority-flow-control mode on
service-policy input type network-qos p_nqos_rdma
spanning-tree port type edge
!
interface ethernet1/1/51
description nas_10.0.1.100
no shutdown
channel-group 3 mode active
no switchport
mtu 9216
flowcontrol receive off
flowcontrol transmit off
priority-flow-control mode on
service-policy input type network-qos p_nqos_rdma
!
interface ethernet1/1/52
description nas_10.0.1.100
no shutdown
channel-group 3 mode active
no switchport
mtu 9216
flowcontrol receive off
flowcontrol transmit off
priority-flow-control mode on
service-policy input type network-qos p_nqos_rdma
!
I feel like I’m asking more questions than providing assistance (I’m new to DCB in general), but once we get over this hurdle, hopefully I can validate findings, run tests, etc.
I haven’t done much in this sphere since 56Gbit IB was a big deal, so take my advice with a tablespoon of salt. But at first pass my gut says PFC is a big deal for optimizing performance sometimes, but it shouldn’t make or break your ability to establish the RDMA connection.
I think the big clue here is that multichannel is failing. The ksmbd docs mention multichannel being a requirement for RDMA, but note that both features are only partially supported. I don’t understand why that should be the case at the protocol level, but it’s possible that’s an implementation quirk of the Windows client code.
Perhaps if you can get multichannel working, you’ll get to the part where the RDMA fails!
Get-SmbMultichannelConnection
Server Name Selected Client IP Server IP Client Interface Index Server Interface Index Client RSS Capable Client RDMA Capable
----------- -------- --------- --------- ---------------------- ---------------------- ------------------ -------------------
nas True 10.0.1.10 10.0.1.100 19 2 False False
nas True 10.0.1.117 10.0.1.100 18 2 False False
Looking further into RDMA, I’m now getting the following event logs:
Windows Events
LogName : Microsoft-Windows-SmbClient/Connectivity
Id : 30822
TimeCreated : 12/6/2021 1:34:49 PM
Level : 4
Message : Failed to establish an SMB multichannel network connection.
Error: The transport connection attempt was refused by the remote system.
Server name: nas
Server address: 10.0.1.100:445
Client address: 10.0.1.117
Instance name: \Device\LanmanRedirector
Connection type: Wsk
Guidance:
This indicates a problem with the underlying network or transport, such as with TCP/IP, and not with SMB. A firewall that blocks TCP port 445, or TCP port 5445 when
using an iWARP RDMA adapter can also cause this issue. Since the error occurred while trying to connect extra channels, it will not result in an application error.
This event is for diagnostics only.
LogName : Microsoft-Windows-SmbClient/Connectivity
Id : 30804
TimeCreated : 12/6/2021 1:34:47 PM
Level : 2
Message : A network connection was disconnected.
Instance name: \Device\LanmanRedirector
Server name: \nas
Server address: 10.0.1.100:445
Connection type: Wsk
InterfaceId: 19
Guidance:
This indicates that the client's connection to the server was disconnected.
Frequent, unexpected disconnects when using an RDMA over Converged Ethernet (RoCE) adapter may indicate a network misconfiguration. RoCE requires Priority Flow
Control (PFC) to be configured for every host, switch and router on the RoCE network. Failure to properly configure PFC will cause packet loss, frequent disconnects
and poor performance.
The server only reports that there is no IPv6; I assume that’s not a requirement?
The log you linked: what was the workflow you captured? Approximate timing would be nice. Example: at time 0 I mounted the fs, ~10s later I read, and ~20s after reading I wrote a file.
Also, I find it odd that Windows reports ‘Client RDMA Capable’ as False in the Get-SmbMultichannelConnection output above, but Get-NetAdapterRDMA reports RDMA as enabled?
Get-NetAdapterRDMA
Name InterfaceDescription Enabled PFC ETS
---- -------------------- ------- --- ---
100G_1 Mellanox ConnectX-5 Ex Adapter True False False
100G_2 Mellanox ConnectX-5 Ex Adapter #2 True False False
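Another place to cross-check the client’s RDMA capability (not captured above) is the SMB client’s own view of each interface:
Get-SmbClientNetworkInterface | Format-Table FriendlyName, RdmaCapable, RssCapable, Speed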
I saw ZFS and “performance” mentioned together… figured I’d bring it up for those that aren’t aware yet: when OpenZFS 3.0 comes out it should have Direct IO, which can potentially resolve some bottlenecks when utilizing NVMe drives, though it’s mostly writes that benefit.