How Can I Help with the new TRUENAS / 100G testing?

Okay, so the “Can’t create socket” message seems to be benign. If it were followed by a similar warning about IPv4, then you’d have a problem.
transport_tcp.c - fs/ksmbd/transport_tcp.c - Linux source code (v5.15.4) - Bootlin

But note that’s in setting up the TCP transport. The “ksmbd: smb_direct: init RDMA listener.” message only seems to appear when RDMA on the server is set up satisfactorily, and there is nothing in the log that indicates the server is the primary problem. In fact, the server is rather passive in the matter. I don’t know the protocol well, but the code suggests there are command blocks (maybe called PDUs) used to communicate between client and server. Following the smb2pdu.c smb2_read() path, there is a flag check that determines whether or not to attempt the RDMA transfer.
https://elixir.bootlin.com/linux/v5.15.4/source/fs/ksmbd/smb2pdu.c#L6204

Now this ‘Channel’ flag doesn’t appear to be set anywhere in the ksmbd code but it is set in the cifs client codebase.
https://elixir.bootlin.com/linux/v5.15.4/source/fs/cifs/smb2pdu.c#L3941

However, that does require the kernel to have CONFIG_CIFS_SMB_DIRECT enabled at build time. Can you grep through the .config for this kernel for SMB_DIRECT?

Something like the following

jared@pop-os:/data/workspaces/linux/linux-torvalds$ grep SMB_DIRECT /boot/config-5.11.0-7620-generic
# CONFIG_CIFS_SMB_DIRECT is not set

or, if your kernel exposes its config via /proc:

zcat /proc/config.gz | grep SMB_DIRECT

If it’s not set… you could try rebuilding the kernel on the client side.

As an Arch user I assume you aren’t afraid of a little kernel rebuilding? If not I can help.
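
Roughly, the client-side rebuild looks like this (paths are placeholders and the config tweak is the important part; CONFIG_CIFS_SMB_DIRECT also needs the InfiniBand core options enabled, and on Arch you’d normally wrap this in a PKGBUILD):

cd ~/src/linux                        # your kernel source tree (placeholder path)
zcat /proc/config.gz > .config        # start from the running kernel's config
scripts/config --enable CONFIG_CIFS --enable CONFIG_CIFS_SMB_DIRECT
make olddefconfig
make -j"$(nproc)"
sudo make modules_install install     # or package it the Arch way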

1 Like

The testing I was performing was with a Windows 10 client. I’ll reboot into Arch shortly to try there as well and check the flags.

CIFS: VFS: CONFIG_CIFS_SMB_DIRECT is not enabled

Need to compile a kernel… stay tuned.

================================

UPDATE

Son of a #$%# :face_with_symbols_over_mouth:, it works with the linux cifs client!!!

uname -a
Linux desktop 5.15.6-arch2-1-smbdirect #1 SMP PREEMPT Mon, 06 Dec 2021 20:59:16 +0000 x86_64 GNU/Linux
sudo mount -t cifs  //server/temp temp -o vers=3.1.1,rdma (works)
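
A quick sanity check that the session really is on SMB Direct (assuming cifs debug support is built in; the exact wording varies by kernel version):

cat /proc/fs/cifs/DebugData    # look for SMB Direct / rdma transport details for the session rather than a plain TCP socket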

I Updated the results in the earlier thread:

KSMBD (RDMA):
      -- Write to Server:   39502 Mbits/sec, 1191122 packets/sec, 39% of line rate
      -- Read From Server:  41601 Mbits/sec, 1255649 packets/sec, 41% of line rate

Not sure why the windows 10 client (same hardware) is failing

thanks again,

1 Like

Fantastic!

I do think your observation about Get-SmbMultichannelConnection vs Get-NetAdapterRDMA points to the issue. So it would seem that Get-SmbMultichannelConnection is reporting data from the established SMB link to your ksmbd server.

I think RDMA is likely enabled for the NIC on windows but when it mounts the SMB volume it decides the link is not eligible for RDMA.

If that’s correct then I think the question is why does the Windows client make that decision while the Linux client works?

But I guess you can try the pass-through filesystem to a windows VM? Wasn’t that one of the things @wendell suggested?

1 Like

Yes, another test for another day…

I may need to run the VM on the desktop vs server, but if we’re just worried about I/O throughput / latency, the Desktop is plenty fast :smiley:

FIO 
------------------
READ:  bw=17.8GiB/s (19.1GB/s), 17.8GiB/s-17.8GiB/s (19.1GB/s-19.1GB/s), io=10.0GiB (10.7GB), run=562-562msec
WRITE: bw=9679MiB/s (10.1GB/s), 9679MiB/s-9679MiB/s (10.1GB/s-10.1GB/s), io=10.0GiB (10.7GB), run=1058-1058msec

Thanks again

1 Like

Wow. That’s pretty much line speed, right? FYI, with the right settings I used to get better than line speed on some R/W workloads because the link is full duplex. Not 2x, but I don’t remember if it was 10% or 50% or what.

What’s the CPU utilization like during this? Mine was ~8%.

Here's my best case with Linux client to KSMBD leveraging RDMA:

CIFS mount options:
vers=3.1.1,rdma


READ

Server CPU at ~4.8 to 5% over the 2 min run for Reads
Switch reported 69379 Mbits/sec, 2094465 packets/sec, 69% of line rate

raw data
fio --name READ --filename=/mnt/smb/temp/temp.file --rw=read --size=100g  --bs=1024k --ioengine=libaio  --iodepth=256 --direct=1 --runtime=120 --time_based --group_reporting --numjobs=64
READ: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=256
...
fio-3.28
Starting 64 processes
Jobs: 30 (f=0): [E(1),f(2),E(1),f(1),E(1),_(2),f(2),E(1),_(1),f(1),_(1),f(1),E(1),f(1),_(1),f(2),E(1),f(1),_(1),f(1),_(1),f(1),E(1),_(2),f(1),_(1),f(1),_(1),f(2),_(1),f(1),_(1),f(4),_(2),E(2),f(3),_(1),f(2),_(1),f(1),_(2),E(2),_(1),E(1),_(1),f(1),_(1),f(1)][100.0%][r=23.0GiB/s][r=23.5k IOPS][eta 00m:00s]
READ: (groupid=0, jobs=64): err= 0: pid=9810: Mon Dec  6 19:55:27 2021
  read: IOPS=8063, BW=8063MiB/s (8455MB/s)(945GiB/120011msec)
    slat (usec): min=28, max=283264, avg=7935.08, stdev=15810.73
    clat (usec): min=698, max=4200.3k, avg=2004114.61, stdev=500509.44
     lat (msec): min=4, max=4206, avg=2012.05, stdev=501.60
    clat percentiles (msec):
     |  1.00th=[  751],  5.00th=[ 1167], 10.00th=[ 1368], 20.00th=[ 1603],
     | 30.00th=[ 1754], 40.00th=[ 1888], 50.00th=[ 2022], 60.00th=[ 2140],
     | 70.00th=[ 2265], 80.00th=[ 2433], 90.00th=[ 2635], 95.00th=[ 2802],
     | 99.00th=[ 3138], 99.50th=[ 3239], 99.90th=[ 3507], 99.95th=[ 3641],
     | 99.99th=[ 3977]
   bw (  MiB/s): min= 1914, max=23362, per=100.00%, avg=8064.85, stdev=53.53, samples=15038
   iops        : min= 1914, max=23362, avg=8064.72, stdev=53.53, samples=15038
  lat (usec)   : 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.02%
  lat (msec)   : 100=0.04%, 250=0.13%, 500=0.22%, 750=0.56%, 1000=1.64%
  lat (msec)   : 2000=46.27%, >=2000=51.09%
  cpu          : usr=0.02%, sys=30.02%, ctx=48693931, majf=0, minf=124845722
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.2%, >=64=99.6%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=967664,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=256

Run status group 0 (all jobs):
   READ: bw=8063MiB/s (8455MB/s), 8063MiB/s-8063MiB/s (8455MB/s-8455MB/s), io=945GiB (1015GB), run=120011-120011msec

WRITE

Server CPU at ~60 to 70% over the 2 min run for writes (mostly zfs_wr_iss)
Switch reported 40032 Mbits/sec, 1208070 packets/sec, 40% of line rate
– this is about the theoretical limit for my pool’s spinning disks

raw data
fio --name WRITE --filename=/mnt/smb/temp/temp.file --rw=write --size=100g  --bs=1024k --ioengine=libaio  --iodepth=256 --direct=1 --runtime=120 --time_based --group_reporting --numjobs=64
WRITE: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=256
...
fio-3.28
Starting 64 processes
Jobs: 64 (f=64): [W(64)][100.0%][w=4376MiB/s][w=4375 IOPS][eta 00m:00s]
WRITE: (groupid=0, jobs=64): err= 0: pid=10383: Mon Dec  6 20:02:45 2021
  write: IOPS=4744, BW=4744MiB/s (4975MB/s)(557GiB/120132msec); 0 zone resets
    slat (usec): min=43, max=692942, avg=13413.56, stdev=28613.19
    clat (usec): min=476, max=6221.5k, avg=3382036.28, stdev=809330.73
     lat (usec): min=940, max=6221.8k, avg=3395450.12, stdev=811011.94
    clat percentiles (msec):
     |  1.00th=[  852],  5.00th=[ 2022], 10.00th=[ 2433], 20.00th=[ 2802],
     | 30.00th=[ 3004], 40.00th=[ 3239], 50.00th=[ 3406], 60.00th=[ 3608],
     | 70.00th=[ 3809], 80.00th=[ 4044], 90.00th=[ 4329], 95.00th=[ 4597],
     | 99.00th=[ 5134], 99.50th=[ 5269], 99.90th=[ 5604], 99.95th=[ 5738],
     | 99.99th=[ 5940]
   bw (  MiB/s): min=  642, max=14378, per=99.89%, avg=4738.86, stdev=35.92, samples=14892
   iops        : min=  642, max=14377, avg=4737.93, stdev=35.92, samples=14892
  lat (usec)   : 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.02%, 20=0.02%, 50=0.05%
  lat (msec)   : 100=0.05%, 250=0.15%, 500=0.24%, 750=0.29%, 1000=0.41%
  lat (msec)   : 2000=3.54%, >=2000=95.23%
  cpu          : usr=0.16%, sys=16.26%, ctx=28707074, majf=0, minf=68196247
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=0.4%, >=64=99.3%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=0,569928,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=256

Run status group 0 (all jobs):
  WRITE: bw=4744MiB/s (4975MB/s), 4744MiB/s-4744MiB/s (4975MB/s-4975MB/s), io=557GiB (598GB), run=120132-120132msec

Here's my best case with Linux client to NFS leveraging RDMA:

NFS mount options: rsize=1048576,wsize=1048576,vers=4.2,proto=rdma,port=20049,noatime,nodiratime
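
(For reference, a mount using those options would look roughly like this; the export path is a placeholder:)

sudo mount -t nfs -o rsize=1048576,wsize=1048576,vers=4.2,proto=rdma,port=20049,noatime,nodiratime nas:/mnt/pool/temp /mnt/nfs/nas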


READ

Server CPU at ~11.5 - 11.8% over the 2 min run for Reads
Switch reported 99911 Mbits/sec, 3010898 packets/sec, 99% of line rate

raw data
READ: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=256
...
fio-3.28
Starting 64 processes
READ: Laying out IO file (1 file / 102400MiB)
Jobs: 64 (f=64): [R(64)][20.7%][r=223MiB/s][r=223 IOPS][eta 07m:46s]
READ: (groupid=0, jobs=64): err= 0: pid=11272: Mon Dec  6 20:11:42 2021
  read: IOPS=11.3k, BW=11.0GiB/s (11.8GB/s)(1334GiB/121408msec)
    slat (usec): min=24, max=23176, avg=61.64, stdev=241.72
    clat (msec): min=118, max=5353, avg=1452.38, stdev=361.26
     lat (msec): min=121, max=5353, avg=1452.45, stdev=361.40
    clat percentiles (msec):
     |  1.00th=[ 1401],  5.00th=[ 1401], 10.00th=[ 1401], 20.00th=[ 1401],
     | 30.00th=[ 1401], 40.00th=[ 1401], 50.00th=[ 1401], 60.00th=[ 1401],
     | 70.00th=[ 1401], 80.00th=[ 1401], 90.00th=[ 1401], 95.00th=[ 1418],
     | 99.00th=[ 4279], 99.50th=[ 4665], 99.90th=[ 5067], 99.95th=[ 5201],
     | 99.99th=[ 5336]
   bw (  MiB/s): min= 4948, max=11776, per=100.00%, avg=11587.24, stdev= 9.02, samples=14912
   iops        : min= 4948, max=11776, avg=11587.24, stdev= 9.02, samples=14912
  lat (msec)   : 250=0.04%, 500=0.03%, 750=0.11%, 2000=97.93%, >=2000=1.89%
  cpu          : usr=0.03%, sys=1.10%, ctx=1387859, majf=0, minf=7724180
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.7%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=1366234,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=256

Run status group 0 (all jobs):
   READ: bw=11.0GiB/s (11.8GB/s), 11.0GiB/s-11.0GiB/s (11.8GB/s-11.8GB/s), io=1334GiB (1433GB), run=121408-121408msec

WRITE

Server CPU at 10% with regular spikes to 35% when zfs flushed over the 2 min run for writes
switch reported 98213 Mbits/sec, 2948446 packets/sec, 98% of line rate
– Not sure how this is possible; I’m guessing that ZFS was caching it all (100GB) and it was being updated faster than it could flush
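
A way to confirm that theory on a future run (standard OpenZFS tools, run on the server during the test):

zpool iostat -v 1    # what the disks are actually absorbing vs the ~98 Gbit/s on the wire
arcstat 1            # if installed: memory-side (ARC) growth while disk writes lag behind suggests buffering, not flushing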

raw data
fio --name write --filename=/mnt/nfs/nas/temp.file --rw=write --size=100g  --bs=1024k --ioengine=libaio  --iodepth=256 --direct=1 --runtime=120 --time_based --group_reporting --numjobs=64
write: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=256
...
fio-3.28
Starting 64 processes
Jobs: 48 (f=48): [W(33),_(10),W(1),_(3),E(1),_(1),E(1),W(14)][20.9%][w=4076MiB/s][w=4075 IOPS][eta 07m:43s]
write: (groupid=0, jobs=64): err= 0: pid=11787: Mon Dec  6 20:17:49 2021
  write: IOPS=11.4k, BW=11.1GiB/s (11.9GB/s)(1351GiB/121445msec); 0 zone resets
    slat (usec): min=34, max=23737, avg=87.29, stdev=70.26
    clat (msec): min=20, max=2846, avg=1435.77, stdev=134.87
     lat (msec): min=20, max=2846, avg=1435.86, stdev=134.84
    clat percentiles (msec):
     |  1.00th=[  953],  5.00th=[ 1418], 10.00th=[ 1418], 20.00th=[ 1418],
     | 30.00th=[ 1418], 40.00th=[ 1418], 50.00th=[ 1435], 60.00th=[ 1452],
     | 70.00th=[ 1452], 80.00th=[ 1469], 90.00th=[ 1485], 95.00th=[ 1485],
     | 99.00th=[ 1552], 99.50th=[ 2165], 99.90th=[ 2635], 99.95th=[ 2702],
     | 99.99th=[ 2802]
   bw (  MiB/s): min= 8727, max=15632, per=100.00%, avg=11393.26, stdev= 9.59, samples=15360
   iops        : min= 8727, max=15620, avg=11389.15, stdev= 9.58, samples=15360
  lat (msec)   : 50=0.01%, 100=0.09%, 250=0.27%, 500=0.22%, 750=0.22%
  lat (msec)   : 1000=0.23%, 2000=98.33%, >=2000=0.63%
  cpu          : usr=0.69%, sys=0.93%, ctx=1393157, majf=0, minf=5328390
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.7%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=0,1383709,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=256

Run status group 0 (all jobs):
  WRITE: bw=11.1GiB/s (11.9GB/s), 11.1GiB/s-11.1GiB/s (11.9GB/s-11.9GB/s), io=1351GiB (1451GB), run=121445-121445msec
1 Like

you rock, you’re getting a shout out in this vid. You did a lot of heavy lifting.

Is it accurate to say that if you’re willing to do this, you can get great performance and low CPU overhead?

2 Likes

Thanks! I think the short answer is yes, if you can leverage RDMA. Additionally, RDMA has more impact than the choice of protocol. That said, I’ve done more optimization with NFS than with SMB.

I’ll update with random IO later. Also, I’m happy to help with the Windows Server VM testing. Were you referring to:
https://wiki.archlinux.org/title/QEMU#Using_filesystem_passthrough_and_VirtFS

Linux <-> Linux Finding Summary:


Sequential IO
Engine  RDMA  Seq. Read  Seq. Read CPU  Seq. Write  Seq. Write CPU  Notes
------  ----  ---------  -------------  ----------  --------------  -----
KSMBD   YES   69.3 Gbit  ~5%            40.0 Gbit   ~65%            Scales well to parallel jobs and a large # of outstanding IO
KSMBD   NO    19.9 Gbit  ~5.5%          16.4 Gbit   ~30%
SAMBA   NO    19.7 Gbit  ~6%            16.8 Gbit   ~35%            Comparable to ksmbd, but less efficient; better enterprise features (e.g. ACLs)
NFS     YES   99.9 Gbit  ~11.7%         98 Gbit     ~40%            Write figures reflect accurate network IO, but the CPU figure is suspect as I believe not all IO was written to disk, limiting the time ZFS spends calculating hash and parity (ARC churn)
NFS     NO    35.5 Gbit  ~10%           21.4 Gbit   ~27%

Random IO
Engine  RDMA  Read IOPS  Read CPU  Write IOPS  Write CPU  R/W IOPS           R/W CPU  Notes
------  ----  ---------  --------  ----------  ---------  -----------------  -------  -----
KSMBD   YES   8483       5.2%      2331        ~60%       R: 2241 / W: 2243  ~60%
KSMBD   NO    9707       5.5%      1027        ~55%       R: 1503 / W: 1509  ~59%
SAMBA   NO    9760       6%        1926        ~55%       R: 1872 / W: 1877  ~57%
NFS     YES   39.1k      50.5%     2249        ~60%       R: 2295 / W: 2300  ~62%
NFS     NO    33.8k      40%       2285        ~60%       R: 2286 / W: 2291  ~70%

Linux <-> Windows server VM Finding Summary:


Setup:

I enabled SR-IOV & IOMMU on the server and created a QEMU/libvirt VM, passing through a Mellanox virtual function (VF) PCI device for the NIC. I then created a 200G RAW (not qcow2) storage device on a dir-based storage pool backed by the same ZFS pool as the other testing.

VM SPECS: 8 vCPU | 8GB Ram

NOTE: The CPU figures below are HOST CPU consumption, not guest, to better align with the other results.
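
For anyone recreating the SR-IOV part of this setup, a VF can be created through the standard sysfs interface before handing it to libvirt (the netdev name below is a placeholder for your ConnectX port):

echo 1 | sudo tee /sys/class/net/enp65s0f0np0/device/sriov_numvfs   # create one virtual function
lspci -nn | grep -i 'virtual function'                              # find the VF's PCI address for passthrough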

Sequential IO
Engine  RDMA  Seq. Read  Seq. Read CPU  Seq. Write  Seq. Write CPU  Notes
------  ----  ---------  -------------  ----------  --------------  -----
SMB     YES   81.2 Gbit  ~25%           23.0 Gbit   ~73%            Scales well to parallel jobs and a large # of outstanding IO
SMB     NO    19.6 Gbit  ~21%           20.9 Gbit   ~30%

Random IO
Engine  RDMA  Read IOPS  Read CPU  Write IOPS  Write CPU  R/W IOPS           R/W CPU  Notes
------  ----  ---------  --------  ----------  ---------  -----------------  -------  -----
SMB     YES   37.1k      ~63%      2359        ~70%       R: 2345 / W: 2354  80%
SMB     NO    37.1k      67%       2354        ~77%       R: 2362 / W: 2369  ~75%

Windows <-> Windows server VM Finding Summary:


Setup:

Note: Same VM setup and CPU methodology as the Linux <-> Windows VM testing

Sequential IO
Engine  RDMA          Seq. Read     Seq. Read CPU  Seq. Write  Seq. Write CPU  Notes
------  ------------  ------------  -------------  ----------  --------------  -----
SMB     YES           100 Gbit      ~32%           52.9 Gbit   ~79%            Scales well to parallel jobs and a large # of outstanding IO
SMB     YES - 2 NICs  184 Gbit!!!!  ~50%           not tested  n/a             Multi-channel + RDMA (WOW)
SMB     NO            100 Gbit      ~26%           35.1 Gbit   ~80%            SMB Multichannel works well here

Random IO
Engine  RDMA  Read IOPS  Read CPU  Write IOPS  Write CPU  R/W IOPS           R/W CPU  Notes
------  ----  ---------  --------  ----------  ---------  -----------------  -------  -----
SMB     YES   36.0k      72%       2354        ~77%       R: 2282 / W: 2280  ~50%
SMB     NO    35.7k      ~70%      2293        ~47%       R: 2279 / W: 2277  51%
2 Likes

I am having trouble getting RDMA enabled from a Windows client to the Linux ksmbd server. RDMA is working Windows client > Windows server, Linux client > Linux server, and Linux client > Windows server, but not Windows client > Linux server.

I know initially you had the same problem, but did you ever resolve the windows client > linux server ksmbd rdma problem?

Yep. Red Hat also has some more stuff in the works to make this go, last time I checked.

FYI: updated the random IO above.

Not resolved yet. Have you done a packet capture of a working RDMA Windows → Windows SMB handshake? I don’t have a Windows box set up to generate one yet.

We could compare the working handshake (win → win) against a failing one (win → ksmbd). I bet windows is expecting some flag that ksmbd isn’t passing back.

are you seeing the same symptoms as me?

What I see on the Windows guest, as noted above, is:

The adapters support RDMA:

Get-NetAdapterRDMA

Name                      InterfaceDescription                     Enabled     PFC        ETS
----                      --------------------                     -------     ---        ---
100G_1                    Mellanox ConnectX-5 Ex Adapter           True        False      False
100G_2                    Mellanox ConnectX-5 Ex Adapter #2        True        False      False

but as @Jared_Hulbert pointed out, it appears that it’s not being selected for the session.

Get-SmbMultichannelConnection

Server Name Selected Client IP  Server IP  Client Interface Index Server Interface Index Client RSS Capable Client RDMA Capable
----------- -------- ---------  ---------  ---------------------- ---------------------- ------------------ -------------------
nas         True     10.0.1.10  10.0.1.100 19                     2                      False              False
nas         True     10.0.1.117 10.0.1.100 18                     2                      False              False

Also seeing the Windows events noted above: How Can I Help with the new TRUENAS / 100G testing? - #12 by wallacebw

1 Like

Oh! If y’all can get a packet capture of the mount on Windows > Windows vs Windows > Linux… I think we can figure out how to get the Linux server to play nice; it feels close. The good news is I’m pretty sure all the handshake traffic is over TCP, so something like Wireshark should work. But let’s isolate the capture really well to just the mount. From the server logs I suspect there is a top-level directory scan that happens as part of the mount, or as a consequence of it anyway. I think it would be safer to serve up an empty directory to avoid extra noise in the capture.
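
Something like this on the Linux side keeps the capture down to just the SMB conversation (interface name and server IP are placeholders; the same filter expression works as a Wireshark capture filter on Windows):

sudo tcpdump -i enp1s0 -s 65535 -w mount_only.pcap 'host 10.0.1.100 and tcp port 445'
# start it right before the mount / net use, stop it right after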

1 Like

See attached PCAP for a Windows 10 to ksmbd mount without RDMA (only captured the net use … command).

ksmbdwindows_to_ksmbd_failed_rdma.zip (2.9 KB)

Don’t see much here…

This is a second capture with no filters applied: I map the drive (net use), open the drive in Explorer, then stop the capture (still < 200 frames).
windows_to_ksmbd_failed_rdma_Unfiltered.zip (8.2 KB)

@Jared_Hulbert take a look at frame #51; I have expanded the LACP bond (200Gbit) and the two members (100Gbit mlnx_5).

KSMBD (10.0.1.100) responds that RDMA (and RSS) is not supported

Frame 51 Raw Data
Frame 51: 1234 bytes on wire (9872 bits), 1234 bytes captured (9872 bits) on interface \Device\NPF_{BB601843-EB9D-43CC-91DF-D1146AEEA3B6}, id 0
Ethernet II, Src: 9a:f8:db:78:76:97 (9a:f8:db:78:76:97), Dst: Mellanox_a1:12:c6 (04:3f:72:a1:12:c6)
Internet Protocol Version 4, Src: 10.0.1.100, Dst: 10.0.1.10
Transmission Control Protocol, Src Port: 445, Dst Port: 54349, Seq: 950, Ack: 1270, Len: 1180
NetBIOS Session Service
SMB2 (Server Message Block Protocol version 2)
    SMB2 Header
    Ioctl Response (0x0b)
        StructureSize: 0x0031
        Reserved: 0000
        Function: FSCTL_QUERY_NETWORK_INTERFACE_INFO (0x001401fc)
        GUID handle
        Flags: 0x00000000
        Reserved: 00000000
        Blob Offset: 0x00000070
        Blob Length: 0
        In Data: NO DATA
        Blob Offset: 0x00000070
        Blob Length: 1064
        Out Data
            Network Interface, 200.0 GBits/s, IPv4: 10.0.1.100
                Next Offset: 0x00000098
                Interface Index: 2
                Interface Cababilities: 0x00000000
                    .... .... .... .... .... .... .... ..0. = RDMA: This interface does not support RDMA
                    .... .... .... .... .... .... .... ...0 = RSS: This interface does not support RSS
                RSS Queue Count: 0
                Link Speed: 200000000000, 200.0 GBits/s
                Socket Address, IPv4: 10.0.1.100
                    Socket Family: 2
                    Socket Port: 0
                    Socket IPv4: 10.0.1.100
            Network Interface, 2.0 GBits/s, IPv4: 172.17.1.100
            Network Interface, 4294967.0 GBits/s, IPv4: 169.254.3.1
            Network Interface, 1.0 GBits/s, IPv4: 0.0.0.0
            Network Interface, 1.0 GBits/s, IPv4: 0.0.0.0
            Network Interface, 100.0 GBits/s, IPv4: 0.0.0.0
                Next Offset: 0x00000098
                Interface Index: 7
                Interface Cababilities: 0x00000000
                    .... .... .... .... .... .... .... ..0. = RDMA: This interface does not support RDMA
                    .... .... .... .... .... .... .... ...0 = RSS: This interface does not support RSS
                RSS Queue Count: 0
                Link Speed: 100000000000, 100.0 GBits/s
                Socket Address, IPv4: 0.0.0.0
            Network Interface, 100.0 GBits/s, IPv4: 0.0.0.0
                Next Offset: 0x00000000
                Interface Index: 8
                Interface Cababilities: 0x00000000
                    .... .... .... .... .... .... .... ..0. = RDMA: This interface does not support RDMA
                    .... .... .... .... .... .... .... ...0 = RSS: This interface does not support RSS
                RSS Queue Count: 0
                Link Speed: 100000000000, 100.0 GBits/s
                Socket Address, IPv4: 0.0.0.0

I also added some formatting to the thread; it was getting a little unwieldy and making my eye twitch :woozy_face:

1 Like

Hmm. I totally agree with your take on frame 51.

What’s really breaking my brain is that frame 51 is the response to the FSCTL_QUERY_NETWORK_INTERFACE_INFO ioctl request in frame 46.

https://elixir.bootlin.com/linux/latest/source/fs/ksmbd/transport_rdma.c#L2053

The deal is that ksmbd_rdma_capable_netdev() must be returning false, but it doesn’t seem to depend on anything in the actual ksmbd code; it just jumps into the InfiniBand code. I can’t figure out how anything about the cifs client could affect this response differently than Windows does. I was hoping to find something in the cifs client that ignores this flag for RDMA support, but I don’t see it; the cifs client seems to respect the response.

Unfortunately, what’s logically left is that the Windows client is setting up the link in such a way that the ksmbd side sees the link as un-RDMA-able. Which makes no sense, because the NFS connection is doing RDMA on Windows, right? And I can’t figure out how the NFS server does the same calculation. That’s gonna take some stuff I can’t do with my setup.

Maybe you can look at some details about the link from the server-side perspective, Linux client vs Windows client.
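
Concretely, on the ksmbd box something like this should show whether the kernel even associates the netdev (or the bond members) with an RDMA device (device and bond names are whatever your setup uses):

rdma link show                # iproute2: RDMA devices and which netdevs they are bound to
ls /sys/class/infiniband/     # RDMA devices registered with the core
cat /proc/net/bonding/bond0   # assumed bond name; check the slave interfaces too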

We can look at the output of /proc/fs/cifs/DebugData, and maybe other files there. Enable cifs debugging so it shows up in dmesg, and get a similar trace from a Linux client.
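
A minimal way to turn that on, assuming the standard cifs debug knobs are compiled in:

echo 7 | sudo tee /proc/fs/cifs/cifsFYI    # verbose cifs logging on the Linux client
sudo dmesg -wH | grep -i -e cifs -e smb    # watch the log while redoing the mount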

But maybe we should look at this same sort of trace when doing an NFS operation?

Or… you might be able to try this hack. It just makes the ioctl always say the interface is RDMA capable. This is of course a horrible idea for anything real, but it probably won’t do anything really bad.
stupid.patch (501 Bytes)

FYI: ALL successful RDMA tests have been from a linux client

Thanks, I’ll look at the patch today.

Thought: in Linux we are specifying the use of RDMA via mount options, whereas Windows is negotiating its use.

Here are two captures from a linux client:


RDMA

Note that I don't see anything related to SMB at all in this capture

linux_rdma.zip (1011 Bytes)

Capture Sequence
Start tcpdump: 
	sudo tcpdump --interface=1 -s 65535 -w ~/linux_rdma.pcap --immediate-mode --no-optimize
Run the test
	sudo mount -t cifs //nas/temp temp -o vers=3.1.1,rdma && cd temp && ll && touch test.file && cd .. && sudo umount temp
stop tcpdump

NO RDMA

Note that in Frame 38 we are still seeing RDMA as unsupported. I wonder if RDMA has been implemented, but RDMA negotiation has not?

linux_no_rdma.zip (5.1 KB)

Capture Sequence
Start tcpdump: 
	sudo tcpdump --interface=1 -s 65535 -w ~/linux_rdma.pcap --immediate-mode --no-optimize
Run the test
	sudo mount -t cifs //nas/temp temp -o vers=3.1.1 && cd temp && ll && touch test.file && cd .. && sudo umount temp
stop tcpdump

@Jared_Hulbert I applied your kernel patch, and although the patch worked (based on a packet capture), RDMA from Windows 10 still didn’t. I did some digging and found that our issue may be related to Microsoft limiting SMB Direct to Windows 10 Pro for Workstations only, not Windows 10 Pro. :rage: REF:

Let me see if I can find a cheap grey-market key as a first step. :roll_eyes:


update

I found a KMS key I can use until the activation window expires.

Now the client reports the connection as RDMA capable:

Get-SmbMultichannelConnection

Server Name Selected Client IP Server IP  Client Interface Index Server Interface Index Client RSS Capable Client RDMA
                                                                                                           Capable
----------- -------- --------- ---------  ---------------------- ---------------------- ------------------ ------------
nas         True     10.0.1.10 10.0.1.100 19                     2                      False              True

With the above, I am able to briefly open an RDMA connection to the server (perfmon shows ~2 Kbit for about two seconds, then nothing) when transferring a 20GB file.

I was able to catch a netstat RDMA connection:

netstat.exe -xan

Active NetworkDirect Connections, Listeners, SharedEndpoints

  Mode   IfIndex Type           Local Address          Foreign Address        PID

  Kernel      19 Connection     10.0.1.10:65260        172.17.1.100:445       0
  Kernel      19 Connection     10.0.1.10:65516        172.17.1.100:445       0

It appears that the patch you built works too well :grin:, as the Windows box is trying to open an RDMA connection to an interface on a different subnet that does not support it. I am going to admin-down the other interfaces.

That didn’t work, but it errored faster… progress?

The last thing I can think of: it looks like Microsoft doesn’t support NIC teaming/bonding for RDMA. I’ll try breaking the team and using a single interface… stay tuned.

I didn’t know Win10 Pro Workstation was a thing. Come on!

That Linux behavior is quite unexpected. I’m used to the handshake being done over TCP and then having RDMA engage; it looks like the Linux client just decided a transport is a transport. I don’t think that changes the logical flow. My hunch is that the same ioctls and SMB packets are being sent over RDMA; the difference is that the flags are set right in the Linux case. The InfiniBand code would look plain silly if it tried to send the client a flag saying it wasn’t RDMA capable over the RDMA link itself.

Therefore, I still think the most likely Windows-vs-Linux difference has to do with the state of the link before the connection is made.

I like trying without teaming. I recall seeing something that was concerning there but I can’t seem to remember what it was.

Oh yeah I meant to warn you that patch has the unnatural ability to make all your links RDMA capable. :wink:

More of the same… I have:

  • Disabled all interfaces except the one (unbonded) interface on the ksmbd server
  • Removed the LACP bond on the server and adjusted the switch config accordingly
  • Tried changing the Windows adapter Network Direct setting to RoCE, RoCEv2, and not present (“default”)
screenshot

Capture

RDMA did not work after any of the above changes. I captured a network dump of one of the attempts:

  • Mapping the network drive
  • Reading a small file
  • Writing the file back

windowsWorkstation_to_ksmbd_failed_rdma.zip (18.8 KB)

Additionally, I captured the dmesg log of the same interaction.
dmsg.txt (48.8 KB)

Windows reports the following in the event log:

Events
LogName     : Microsoft-Windows-SmbClient/Connectivity
Id          : 30818
TimeCreated : 12/8/2021 3:01:01 PM
Level       : 3
Message     : RDMA interfaces are available but the client failed to connect to the server over RDMA transport.

              Server name: \nas

              Guidance:
              Both client and server have RDMA (SMB Direct) adaptors but there was a problem with the connection and the client had to fall back to using TCP/IP SMB (non-RDMA).

LogName     : Microsoft-Windows-SmbClient/Connectivity
Id          : 30822
TimeCreated : 12/8/2021 3:01:01 PM
Level       : 4
Message     : Failed to establish an SMB multichannel network connection.

              Error: The transport connection attempt was refused by the remote system.

              Server name: nas
              Server address: 10.0.1.100:445
              Client address: 10.0.1.10
              Instance name: \Device\LanmanRedirector
              Connection type: Rdma

              Guidance:
              This indicates a problem with the underlying network or transport, such as with TCP/IP, and not with SMB. A firewall that blocks TCP port 445, or TCP port 5445 when using an iWARP RDMA adapter can also cause this issue. Since the error occurred while
              trying to connect extra channels, it will not result in an application error. This event is for diagnostics only.

LogName     : Microsoft-Windows-SmbClient/Connectivity
Id          : 30822
TimeCreated : 12/8/2021 3:01:01 PM
Level       : 4
Message     : Failed to establish an SMB multichannel network connection.

              Error: The transport connection attempt was refused by the remote system.

              Server name: nas
              Server address: 10.0.1.100:445
              Client address: 10.0.1.10
              Instance name: \Device\LanmanRedirector
              Connection type: Rdma

              Guidance:
              This indicates a problem with the underlying network or transport, such as with TCP/IP, and not with SMB. A firewall that blocks TCP port 445, or TCP port 5445 when using an iWARP RDMA adapter can also cause this issue. Since the error occurred while
              trying to connect extra channels, it will not result in an application error. This event is for diagnostics only.

I need to walk away from this for a while… my head hurts from banging it against my desk.

Something tells me I need to set up a SPAN port on the switch to see the RDMA frames :sigh:

@wendell Any chance you have been able to capture a successful RDMA SMB session from Windows? Also, did you see the note about needing Windows 10 Pro for Workstations to enable RDMA in Windows 10? It may be easier to use a Windows Server evaluation copy as the client…

2 Likes

Um… I have a really good feeling about this one. @wallacebw, when you can muster one more test, I think we need to enable RSS.

From the ksmbd docs

SMB direct(RDMA)               Partially Supported. SMB3 Multi-channel is
                                required to connect to Windows client.

Your Get-SmbMultichannelConnection results have RDMA set but not RSS, right? So if you look here:
https://docs.microsoft.com/en-us/previous-versions/windows/it-pro/windows-server-2012-r2-and-2012/dn610980(v=ws.11)#requirements-for-using-smb-multichannel

Looks like RSS is one of the prerequisites for SMB Multichannel. I don’t know exactly what that all means, but if it can work on the NICs you have, then it could resolve the error messages you’re seeing.
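
On the Linux server side you can at least confirm the NIC exposes the multiple queues RSS implies (interface name below is a placeholder), though the real question is still what ksmbd reports back in that ioctl:

ethtool -l enp65s0f0np0    # channel/queue counts the NIC advertises
ethtool -x enp65s0f0np0    # current RSS indirection table and hash key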

I’ve updated the patch to even higher levels of unadvised behaviors. Now I’m adding the RSS bit too.
stupid2.patch (588 Bytes)

2 Likes

I haven’t yet because I don’t know if it’s capturing the crypto/password/token part of that as well. I will set up a test domain for testing with Windows Server.

In the past I have been told that Windows 10 client to Windows Server 2019 RDMA is fine, but with Win 10 acting as the server to any other machine, RDMA is disabled.

It would be hilariously short-sighted of Microsoft to block RDMA on regular Win 10 when it is acting as a client for something else. Back to Xsan-type technology I guess, LOL.

https://docs.microsoft.com/en-us/archive/blogs/josebda/deploying-windows-server-2012-with-smb-direct-smb-over-rdma-and-the-mellanox-connectx-3-using-10gbe40gbe-roce-step-by-step

1 Like