How Can I Help with the new TRUENAS / 100G testing?

short-sighted, microsoft… never… :grin:

ref: https://www.microsoft.com/en-us/windowsforbusiness/compare


no dice…

The patch does briefly work, and I was able to transfer some files, but I was unable to make an RDMA connection (or at least if I did, the kernel wasn’t happy about it :slight_smile: ). That said, ksmbd would freeze up and basically become a zombie process that did not respond to kill -9. Here’s what I saw in the journal:

Dec 08 17:42:31 nas kernel: ksmbd: Can't create socket for ipv6, try ipv4: -97
Dec 08 17:42:31 nas kernel: ksmbd: Can't create socket for ipv6, try ipv4: -97
Dec 08 17:43:15 nas ksmbd[11501]: [ksmbd-worker/11501]: ERROR: Bad message: logout_request
Dec 08 17:43:15 nas ksmbd[11501]: [ksmbd-worker/11501]: ERROR: Bad message: logout_request
Dec 08 17:43:15 nas ksmbd[11501]: [ksmbd-worker/11501]: ERROR: Bad message: logout_request
Dec 08 17:44:44 nas ksmbd[11501]: [ksmbd-worker/11501]: ERROR: Bad message: logout_request
Dec 08 17:45:11 nas ksmbd[11501]: [ksmbd-worker/11501]: ERROR: Bad message: logout_request
Dec 08 17:45:11 nas ksmbd[11501]: [ksmbd-worker/11501]: ERROR: Bad message: logout_request
Dec 08 17:45:11 nas ksmbd[11501]: [ksmbd-worker/11501]: ERROR: Bad message: logout_request
Dec 08 17:45:16 nas ksmbd[11501]: [ksmbd-worker/11501]: ERROR: Bad message: logout_request
Dec 08 17:45:33 nas ksmbd[11501]: [ksmbd-worker/11501]: ERROR: SRVSVC: unsupported INVOKE method 21
Dec 08 17:45:33 nas ksmbd[11501]: [ksmbd-worker/11501]: ERROR: SRVSVC: unsupported INVOKE method 21
Dec 08 17:45:39 nas ksmbd[11501]: [ksmbd-worker/11501]: ERROR: Bad message: logout_request
Dec 08 17:45:39 nas ksmbd[11501]: [ksmbd-worker/11501]: ERROR: Bad message: logout_request
Dec 08 17:45:39 nas ksmbd[11501]: [ksmbd-worker/11501]: ERROR: Bad message: logout_request
Dec 08 17:45:39 nas kernel: ------------[ cut here ]------------
Dec 08 17:45:39 nas kernel: kernel BUG at mm/slub.c:379!
Dec 08 17:45:39 nas kernel: invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
Dec 08 17:45:39 nas kernel: CPU: 7 PID: 416 Comm: kworker/7:2 Tainted: P           OE     5.15.6-arch2-1-forcerdma #1 b98c72527cd6d211cd8bbaff0b4cfab7889374>
Dec 08 17:45:39 nas kernel: Hardware name: Supermicro Super Server/H12SSL-C, BIOS 2.3 10/20/2021
Dec 08 17:45:39 nas kernel: Workqueue: ksmbd-io handle_ksmbd_work [ksmbd]
Dec 08 17:45:39 nas kernel: RIP: 0010:__slab_free+0x296/0x480
Dec 08 17:45:39 nas kernel: Code: ff 80 7c 24 5b 00 0f 89 ae fe ff ff 48 8d 65 d8 4c 89 e6 4c 89 f7 ba 01 00 00 00 5b 41 5c 41 5d 41 5e 41 5f 5d e9 5a 11 00>
Dec 08 17:45:39 nas kernel: RSP: 0018:ffffaff349753cd0 EFLAGS: 00010246
Dec 08 17:45:39 nas kernel: RAX: ffff90aa10423990 RBX: ffff90aa10423990 RCX: ffff90aa10423990
Dec 08 17:45:39 nas kernel: RDX: 0000000082000170 RSI: ffffecb4844108c0 RDI: ffff90aa00042200
Dec 08 17:45:39 nas kernel: RBP: ffffaff349753d70 R08: 0000000000000001 R09: ffffffffc124c3ae
Dec 08 17:45:39 nas kernel: R10: ffff90aa10423990 R11: 0000000000000000 R12: ffffecb4844108c0
Dec 08 17:45:39 nas kernel: R13: ffff90aa10423990 R14: ffff90aa00042200 R15: ffff90aa10423990
Dec 08 17:45:39 nas kernel: FS:  0000000000000000(0000) GS:ffff90e84e9c0000(0000) knlGS:0000000000000000
Dec 08 17:45:39 nas kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 08 17:45:39 nas kernel: CR2: 00007f91af7013f0 CR3: 000000013ea6a000 CR4: 0000000000350ee0
Dec 08 17:45:39 nas kernel: Call Trace:
Dec 08 17:45:39 nas kernel:  <TASK>
Dec 08 17:45:39 nas kernel:  ? sock_def_readable+0x3c/0x80
Dec 08 17:45:39 nas kernel:  ? __netlink_sendskb+0x64/0x90
Dec 08 17:45:39 nas kernel:  ? ksmbd_ipc_logout_request+0x9c/0xd0 [ksmbd f32740ff0b4cd2637a7ca0c1c3c4338b922dc778]
Dec 08 17:45:39 nas kernel:  ? ksmbd_free_user+0x1e/0x30 [ksmbd f32740ff0b4cd2637a7ca0c1c3c4338b922dc778]
Dec 08 17:45:39 nas kernel:  kfree+0x384/0x400
Dec 08 17:45:39 nas kernel:  ksmbd_free_user+0x1e/0x30 [ksmbd f32740ff0b4cd2637a7ca0c1c3c4338b922dc778]
Dec 08 17:45:39 nas kernel:  smb2_sess_setup+0xcf4/0xe90 [ksmbd f32740ff0b4cd2637a7ca0c1c3c4338b922dc778]
Dec 08 17:45:39 nas kernel:  handle_ksmbd_work+0x143/0x3c0 [ksmbd f32740ff0b4cd2637a7ca0c1c3c4338b922dc778]
Dec 08 17:45:39 nas kernel:  process_one_work+0x1e5/0x3c0
Dec 08 17:45:39 nas kernel:  worker_thread+0x50/0x3c0
Dec 08 17:45:39 nas kernel:  ? process_one_work+0x3c0/0x3c0
Dec 08 17:45:39 nas kernel:  kthread+0x12f/0x160
Dec 08 17:45:39 nas kernel:  ? set_kthread_struct+0x50/0x50
Dec 08 17:45:39 nas kernel:  ret_from_fork+0x1f/0x30
Dec 08 17:45:39 nas kernel:  </TASK>
Dec 08 17:45:39 nas kernel: Modules linked in: cmac nls_utf8 ksmbd crc32_generic rpcrdma ib_umad rdma_ucm ib_iser libiscsi scsi_transport_iscsi rdma_cm ib_i>
Dec 08 17:45:40 nas kernel: ---[ end trace 32de12428ea34941 ]---
Dec 08 17:45:40 nas kernel: RIP: 0010:__slab_free+0x296/0x480
Dec 08 17:45:40 nas kernel: Code: ff 80 7c 24 5b 00 0f 89 ae fe ff ff 48 8d 65 d8 4c 89 e6 4c 89 f7 ba 01 00 00 00 5b 41 5c 41 5d 41 5e 41 5f 5d e9 5a 11 00>
Dec 08 17:45:40 nas kernel: RSP: 0018:ffffaff349753cd0 EFLAGS: 00010246
Dec 08 17:45:40 nas kernel: RAX: ffff90aa10423990 RBX: ffff90aa10423990 RCX: ffff90aa10423990
Dec 08 17:45:40 nas kernel: RDX: 0000000082000170 RSI: ffffecb4844108c0 RDI: ffff90aa00042200
Dec 08 17:45:40 nas kernel: RBP: ffffaff349753d70 R08: 0000000000000001 R09: ffffffffc124c3ae
Dec 08 17:45:40 nas kernel: R10: ffff90aa10423990 R11: 0000000000000000 R12: ffffecb4844108c0
Dec 08 17:45:40 nas kernel: R13: ffff90aa10423990 R14: ffff90aa00042200 R15: ffff90aa10423990
Dec 08 17:45:40 nas kernel: FS:  0000000000000000(0000) GS:ffff90e84e9c0000(0000) knlGS:0000000000000000
Dec 08 17:45:40 nas kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 08 17:45:40 nas kernel: CR2: 00007f91af7013f0 CR3: 000000013ea6a000 CR4: 0000000000350ee0
Dec 08 17:46:00 nas kernel: ------------[ cut here ]------------
Dec 08 17:46:00 nas kernel: WARNING: CPU: 9 PID: 319 at fs/ksmbd/smb2pdu.c:2045 smb2_session_logoff+0xd8/0xe0 [ksmbd]
Dec 08 17:46:00 nas kernel: Modules linked in: cmac nls_utf8 ksmbd crc32_generic rpcrdma ib_umad rdma_ucm ib_iser libiscsi scsi_transport_iscsi rdma_cm ib_i>
Dec 08 17:46:00 nas kernel: CPU: 9 PID: 319 Comm: kworker/9:1 Tainted: P      D    OE     5.15.6-arch2-1-forcerdma #1 b98c72527cd6d211cd8bbaff0b4cfab7889374>
Dec 08 17:46:00 nas kernel: Hardware name: Supermicro Super Server/H12SSL-C, BIOS 2.3 10/20/2021
Dec 08 17:46:00 nas kernel: Workqueue: ksmbd-io handle_ksmbd_work [ksmbd]
Dec 08 17:46:00 nas kernel: RIP: 0010:smb2_session_logoff+0xd8/0xe0 [ksmbd]
Dec 08 17:46:00 nas kernel: Code: 00 48 8b 7b 08 e8 38 45 ff ff 48 c7 43 08 00 00 00 00 48 8b 45 00 c7 40 40 04 00 00 00 31 c0 5b 5d 41 5c 41 5d 31 f6 89 f7>
Dec 08 17:46:00 nas kernel: RSP: 0018:ffffaff348717e38 EFLAGS: 00010287
Dec 08 17:46:00 nas kernel: RAX: 0000000044000000 RBX: ffff90aa11dc9200 RCX: 0000000000000000
Dec 08 17:46:00 nas kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff90aa105f6800
Dec 08 17:46:00 nas kernel: RBP: ffff90aa105f6800 R08: 0000000000000000 R09: 0000000000000000
Dec 08 17:46:00 nas kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff90ab0eeb9600
Dec 08 17:46:00 nas kernel: R13: ffff90ab0b9b5000 R14: ffff90ab0b9b5094 R15: ffffffffc1286b50
Dec 08 17:46:00 nas kernel: FS:  0000000000000000(0000) GS:ffff90e84ea40000(0000) knlGS:0000000000000000
Dec 08 17:46:00 nas kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 08 17:46:00 nas kernel: CR2: 00007f91aec4aeb8 CR3: 000000016ed4e000 CR4: 0000000000350ee0
Dec 08 17:46:00 nas kernel: Call Trace:
Dec 08 17:46:00 nas kernel:  <TASK>
Dec 08 17:46:00 nas kernel:  handle_ksmbd_work+0x143/0x3c0 [ksmbd f32740ff0b4cd2637a7ca0c1c3c4338b922dc778]
Dec 08 17:46:00 nas kernel:  process_one_work+0x1e5/0x3c0
Dec 08 17:46:00 nas kernel:  worker_thread+0x50/0x3c0
Dec 08 17:46:00 nas kernel:  ? process_one_work+0x3c0/0x3c0
Dec 08 17:46:00 nas kernel:  kthread+0x12f/0x160
Dec 08 17:46:00 nas kernel:  ? set_kthread_struct+0x50/0x50
Dec 08 17:46:00 nas kernel:  ret_from_fork+0x1f/0x30
Dec 08 17:46:00 nas kernel:  </TASK>
Dec 08 17:46:00 nas kernel: ---[ end trace 32de12428ea34942 ]---
Dec 08 17:46:00 nas ksmbd[11501]: [ksmbd-worker/11501]: ERROR: Bad message: logout_request
Dec 08 17:47:18 nas ksmbd[11501]: [ksmbd-worker/11501]: ERROR: Bad message: logout_request
Dec 08 17:47:25 nas ksmbd[11501]: [ksmbd-worker/11501]: ERROR: Bad message: logout_request
Dec 08 17:47:25 nas kernel: ------------[ cut here ]------------
Dec 08 17:47:25 nas kernel: WARNING: CPU: 21 PID: 14340 at fs/ksmbd/smb2pdu.c:2045 smb2_session_logoff+0xd8/0xe0 [ksmbd]
Dec 08 17:47:25 nas kernel: Modules linked in: cmac nls_utf8 ksmbd crc32_generic rpcrdma ib_umad rdma_ucm ib_iser libiscsi scsi_transport_iscsi rdma_cm ib_i>
Dec 08 17:47:25 nas kernel: CPU: 21 PID: 14340 Comm: kworker/21:0 Tainted: P      D W  OE     5.15.6-arch2-1-forcerdma #1 b98c72527cd6d211cd8bbaff0b4cfab788>
Dec 08 17:47:25 nas kernel: Hardware name: Supermicro Super Server/H12SSL-C, BIOS 2.3 10/20/2021
Dec 08 17:47:25 nas kernel: Workqueue: ksmbd-io handle_ksmbd_work [ksmbd]
Dec 08 17:47:25 nas kernel: RIP: 0010:smb2_session_logoff+0xd8/0xe0 [ksmbd]
Dec 08 17:47:25 nas kernel: Code: 00 48 8b 7b 08 e8 38 45 ff ff 48 c7 43 08 00 00 00 00 48 8b 45 00 c7 40 40 04 00 00 00 31 c0 5b 5d 41 5c 41 5d 31 f6 89 f7>
Dec 08 17:47:25 nas kernel: RSP: 0018:ffffaff34c6abe38 EFLAGS: 00010287
Dec 08 17:47:25 nas kernel: RAX: 0000000044000000 RBX: ffff90ab1febd600 RCX: 0000000000000000
Dec 08 17:47:25 nas kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff90aa10d8f700
Dec 08 17:47:25 nas kernel: RBP: ffff90aa10d8f700 R08: 0000000000000000 R09: 0000000000000000
Dec 08 17:47:25 nas kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff90ab0b9e6000
Dec 08 17:47:25 nas kernel: R13: ffff90ab0b9b7600 R14: ffff90ab0b9b7694 R15: ffffffffc1286b50
Dec 08 17:47:25 nas kernel: FS:  0000000000000000(0000) GS:ffff90e84ed40000(0000) knlGS:0000000000000000
Dec 08 17:47:25 nas kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 08 17:47:25 nas kernel: CR2: 000055a29e52c15c CR3: 0000000210506000 CR4: 0000000000350ee0
Dec 08 17:47:25 nas kernel: Call Trace:
Dec 08 17:47:25 nas kernel:  <TASK>
Dec 08 17:47:25 nas kernel:  handle_ksmbd_work+0x143/0x3c0 [ksmbd f32740ff0b4cd2637a7ca0c1c3c4338b922dc778]
Dec 08 17:47:25 nas kernel:  process_one_work+0x1e5/0x3c0
Dec 08 17:47:25 nas kernel:  worker_thread+0x50/0x3c0
Dec 08 17:47:25 nas kernel:  ? process_one_work+0x3c0/0x3c0
Dec 08 17:47:25 nas kernel:  kthread+0x12f/0x160
Dec 08 17:47:25 nas kernel:  ? set_kthread_struct+0x50/0x50
Dec 08 17:47:25 nas kernel:  ret_from_fork+0x1f/0x30
Dec 08 17:47:25 nas kernel:  </TASK>
Dec 08 17:47:25 nas kernel: ---[ end trace 32de12428ea34943 ]---

Bummer.

For the moment, I’m shifting focus to a Windows VM backed by virtiofsd. I grabbed a Server 2022 ISO and will look at this today.

FYI: I’ve reverted my network back to LACP, put the server back into its normal config / kernel, etc., so let’s collect a few ideas I can test in one go for the Windows-to-ksmbd RDMA issues (if we think of any).

Update:

@wendell Did you have high expectations for virtiofs? It seems to fall in line with other FUSE-based systems: high CPU overhead.


I set up a Windows Server 2022 VM on the server I use for all other testing, with hugepages-backed memory and a passthrough virtiofs filesystem.

VM Details
<domain type="kvm">
  <memory unit="KiB">8388608</memory>
  <currentMemory unit="KiB">1048576</currentMemory>
  <memoryBacking>
    <hugepages>
      <page size="2048" unit="KiB"/>
    </hugepages>
    <access mode="shared"/>
  </memoryBacking>
  <vcpu placement="static">4</vcpu>
  <cpu mode="host-passthrough" check="none" migratable="on">
    <numa>
      <cell id="0" cpus="0-3" memory="8388608" unit="KiB" memAccess="shared"/>
    </numa>
  </cpu>
...
  <devices>
    <filesystem type="mount" accessmode="passthrough">
      <driver type="virtiofs" queue="1024"/>
      <binary path="/usr/lib/qemu/virtiofsd" xattr="on">
        <cache mode="always"/>
      </binary>
      <source dir="/pool/data/temp/vm_virtfs"/>
      <target dir="virtiofs"/>
      <address type="pci" domain="0x0000" bus="0x06" slot="0x00" function="0x0"/>
    </filesystem>
   </devices>
...
</domain>

Performance against the virtiofs device “Z:” was horrible and CPU-bound (typical of FUSE-based filesystems):

Virtiofs ZFS path map: Seq. Read ~91 MB/s / 1465 IOPS

diskspd results: virtiofs ZFS map (screenshot)

qcow VirtIO disk: Seq. Read ~1409 MB/s / 22544 IOPS

diskspd results: qcow VirtIO disk (screenshot)
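The screenshots above are diskspd runs against the mapped Z: drive and the local VirtIO disk. The exact parameters weren’t captured here, but a comparable 1MiB sequential-read test would look something like this (file size, queue depth, and thread count are illustrative, not the original settings):

diskspd.exe -c20G -b1M -o32 -t8 -d60 -Sh -w0 -L Z:\test.dat

Here -b1M sets the block size, -o32 and -t8 give 32 outstanding IOs across 8 threads, -d60 runs for 60 seconds, -Sh disables software caching and hardware write caching, and -w0 makes it a pure read test.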


I had hoped it would be better

My $0.02:

Caveat emptor: I’m discounting FCoE and other technologies I cannot test. I am speaking in generalities; let your specific use case drive your solution.

  1. If your use case between the storage and the consumer is *nix-based (BSD / Linux / VMware / etc.), I would recommend an NFS transport. If you have the ability to implement RDMA, absolutely do (a sketch of an NFS-over-RDMA mount follows this list).

Note: I am not referring to your guest OS in the case of virtualization, but the host OS (Linux / ESX / etc.).

  2. If you are in a Windows-only use case, I would recommend SMB with multichannel, leveraging SMB Direct (RDMA) if possible.

  3. If your use case involves a hybrid environment (Windows + something else), I would prioritize RDMA over the transport protocol, which at this time restricts you to SMB (if you can make it work :wink:). In the absence of RDMA/RoCE, I would next look at SMB Multichannel for high-throughput workloads and NFS for everything else.
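As a concrete sketch of the NFS-over-RDMA option above (server address, export, and mount point are made up for illustration; this assumes the NICs and switches are RoCE-capable end to end and the server’s nfsd has an RDMA listener on the standard port 20049):

# server: add an RDMA listener for nfsd (some distros do this via config instead)
echo "rdma 20049" | sudo tee -a /proc/fs/nfsd/portlist

# client: mount with the RDMA transport and confirm it stuck
sudo mount -t nfs -o vers=4.2,proto=rdma,port=20049 192.168.100.10:/pool/data /mnt/data
grep /mnt/data /proc/mounts    # should show proto=rdma

On the Windows side, Get-SmbMultichannelConnection and Get-SmbClientNetworkInterface are the equivalent sanity checks for whether multichannel and SMB Direct are actually in play.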

I have one more test I would like to try: enabling RDMA on a guest by passing the NIC device itself to the VM. I’m looking to try SR-IOV but don’t think I’ll have the chance to look at it until tomorrow.

https://community.mellanox.com/s/article/HowTo-Configure-SR-IOV-for-ConnectX-4-ConnectX-5-ConnectX-6-with-KVM-Ethernet

@wendell @Jared_Hulbert Ok… I must be missing something simple, help me out here. I’m trying to get SR-IOV set up to allow using one of the ports for the host OS and then creating multiple VFs from the other port to pass through to VMs.

Enabled SR-IOV in the BIOS and verified:

[ 1.152329] AMD-Vi: AMD IOMMUv2 loaded and initialized

Enabled SR-IOV in the NIC’s firmware:

sudo mstconfig -d 81:00.1 set SRIOV_EN=1 NUM_OF_VFS=4

reboot

echo 4 | sudo tee /sys/class/net/enp129s0f1np1/device/sriov_numvfs
lspci | grep  "Mel"
81:00.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
81:00.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
81:00.6 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
81:00.7 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function]
81:01.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function] (rev ff)
81:01.1 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function] (rev ff)

They all show up in the same IOMMU group:

ls  /sys/kernel/iommu_groups/13/devices
0000:80:01.0  0000:80:01.1  0000:81:00.0  0000:81:00.1  0000:81:00.6  0000:81:00.7  0000:81:01.0  0000:81:01.1

What am I missing? It looks like I would need to dedicate the entire card (both ports) to QEMU, and that can’t be right. If it helps, I’m using a:

Supermicro H12SSL-C
BIOS Date: 10/20/2021 Ver 2.3


IOMMU is set to On, not Auto, in the BIOS??


You sent me back to looking in the BIOS; thanks for the nudge in the right direction.

IOMMU was ON, not AUTO.

I needed to:

  • Enabled PCI AER Support (under ACPI settings); it was previously Disabled. I think I had changed it because I don’t typically leave any ACPI features on for servers.
  • Set ACS Enable to Enabled (was Auto).
  • Enabled PCIE ARI Support.
  • Added iommu=pt to the kernel options (a quick way to verify the resulting IOMMU groups is sketched below).
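With those changes, the VFs should land in IOMMU groups separate from the physical functions. A quick way to eyeball it (just a sketch; the 0000:81:0x addresses come from the lspci output above):

for dev in /sys/bus/pci/devices/0000:81:0*; do
    group=$(readlink "$dev/iommu_group")       # symlink into /sys/kernel/iommu_groups/<n>
    echo "${dev##*/} -> IOMMU group ${group##*/}"
done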

Thanks Again


@wendell @Jared_Hulbert

I was able to set up IOMMU and SR-IOV on the server and pass the virtual NICs through to a Windows Server 2022 VM. With this done, I was able to test both Windows and Linux clients against a Windows server OS.
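For reference, handing one of those VFs to the guest in the libvirt domain XML looks roughly like this (a sketch; the address is the first VF, 81:00.6, from the lspci output above, and managed="yes" lets libvirt handle the vfio-pci bind/unbind):

<hostdev mode="subsystem" type="pci" managed="yes">
  <source>
    <!-- VF 0000:81:00.6 from the host's lspci listing -->
    <address domain="0x0000" bus="0x81" slot="0x00" function="0x6"/>
  </source>
</hostdev>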

To keep everything together, I added the results to the bottom of the existing results post ( How Can I Help with the new TRUENAS / 100G testing? - #29 by wallacebw ).

I am very impressed with both libvirt’s efficiency with SR-IOV Mellanox NICs and with Windows Server as a VM. The results speak for themselves, but I have to call one out.

With dual vNICs in the Windows guest and a ZFS-backed raw (not qcow2) vdisk, I was able to pull 184 Gbit/s of SMB IO :astonished:. Here’s the fio result:

FIO results


fio --name write --filename=fio.data --rw=read --size=100g --bs=1024k --ioengine=windowsaio --iodepth=256 --direct=1 --runtime=120 --time_based --group_reporting --numjobs=64
fio: this platform does not support process shared mutexes, forcing use of threads. Use the ‘thread’ option to get rid of this warning.
write: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=windowsaio, iodepth=256
fio-3.27
Starting 64 threads
Jobs: 1 (f=0): [(10),f(1),(53)][100.0%][r=15.9GiB/s][r=16.3k IOPS][eta 00m:00s]
write: (groupid=0, jobs=64): err= 0: pid=6020: Fri Dec 10 13:49:24 2021
read: IOPS=20.6k, BW=20.1GiB/s (21.6GB/s)(2449GiB/121670msec)
slat (usec): min=17, max=91408, avg=122.10, stdev=1117.64
clat (usec): min=453, max=4659.4k, avg=785604.18, stdev=1263593.23
lat (usec): min=496, max=4659.4k, avg=785726.28, stdev=1263647.37
clat percentiles (msec):
| 1.00th=[ 20], 5.00th=[ 35], 10.00th=[ 41], 20.00th=[ 46],
| 30.00th=[ 47], 40.00th=[ 47], 50.00th=[ 48], 60.00th=[ 50],
| 70.00th=[ 63], 80.00th=[ 2903], 90.00th=[ 3004], 95.00th=[ 3037],
| 99.00th=[ 3104], 99.50th=[ 3171], 99.90th=[ 4212], 99.95th=[ 4396],
| 99.99th=[ 4597]
bw ( MiB/s): min= 1440, max=75083, per=100.00%, avg=21372.99, stdev=173.83, samples=14689
iops : min= 1388, max=75052, avg=21347.80, stdev=173.83, samples=14689
lat (usec) : 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.04%, 10=0.19%, 20=0.78%, 50=61.27%
lat (msec) : 100=10.78%, 250=0.77%, 500=0.29%, 750=0.25%, 1000=0.19%
lat (msec) : 2000=0.59%, >=2000=24.84%
cpu : usr=0.00%, sys=3.31%, ctx=0, majf=0, minf=0
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.2%, >=64=99.7%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.1%, 32=0.1%, 64=0.1%, >=64=0.1%
issued rwts: total=2508208,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=256

Run status group 0 (all jobs):
READ: bw=20.1GiB/s (21.6GB/s), 4096MiB/s-20.1GiB/s (4295MB/s-21.6GB/s), io=2449GiB (2630GB), run=121670-121670msec

I would not have guessed it would have been that efficient, consider me impressed!


this is very amaze!

It is too bad that we can’t get that efficiency passing through a ZFS filesystem transparently to the Windows guest. I was hoping there was a way to do that, but it seems not. Not without a lot of overhead.

I can manage about 4 gigabytes/sec with virtio-fs, but one CPU is pegged. It must be doing it a byte at a time or something inefficient…


Passthrough should have effectively no overhead; resources are handed directly to the VM, so virtualization has nothing to do.

Always good to see the numbers anyway!

Frankly, vNICs + RDMA should be super efficient too, even without passthrough. I wish we could get there faster.

I’m confused. Storage is where? Protocol is what? Filesystem is on which machine?

To keep everything together, I added the results to the bottom of the existing results post ( How Can I Help with the new TRUENAS / 100G testing? - #29 by wallacebw ).

Near the bottom you will find Linux <-> Windows and Windows <-> Windows. In short, it’s a Windows VM on my server with a disk on a ZFS pool; there are more details above.


@wendell I updated the original post with the setup details and results. Is there anything else you want to test or have me validate before I revert my environment back to how it was?

Are you still seeing only about 100 MByte/sec with virtio-fs to ZFS? I could manage 4 gigabytes/sec, but the CPU is pegged and something doesn’t feel right.

For a Windows guest, yes, ~100MB/s using the WinFsp driver and the instructions here: virtiofs - shared file system for virtual machines / Windows HowTo

The numbers seemed too far off, so I stood up an Arch VM and am seeing slightly better numbers than your 4GB/s:

  • One CPU core pegged at 100% on the host
  • Guest basically idle other than the fio threads
  • READ: bw=6371MiB/s
  • If it matters, I am using hugepages for the VM and have 8 cores and 8GB RAM assigned
FIO results against virtiofs:
fio --name read --filename=/mnt/virtiofs/test.dat --rw=read --size=20g  --bs=1024K --ioengine=libaio  --iodepth=128 --direct=1 --runtime=60 --time_based --group_reporting --numjobs=8
read: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=128
...
fio-3.28
Starting 8 processes
Jobs: 8 (f=8): [R(8)][100.0%][r=6330MiB/s][r=6330 IOPS][eta 00m:00s]
read: (groupid=0, jobs=8): err= 0: pid=563: Sat Dec 11 19:32:25 2021
  read: IOPS=6364, BW=6365MiB/s (6674MB/s)(373GiB/60003msec)
    slat (usec): min=16, max=9529, avg=1254.71, stdev=145.73
    clat (usec): min=681, max=187620, avg=159388.23, stdev=5766.77
     lat (usec): min=1897, max=189183, avg=160643.18, stdev=5779.05
    clat percentiles (msec):
     |  1.00th=[  153],  5.00th=[  155], 10.00th=[  157], 20.00th=[  157],
     | 30.00th=[  159], 40.00th=[  159], 50.00th=[  159], 60.00th=[  161],
     | 70.00th=[  161], 80.00th=[  163], 90.00th=[  165], 95.00th=[  167],
     | 99.00th=[  169], 99.50th=[  171], 99.90th=[  174], 99.95th=[  178],
     | 99.99th=[  184]
   bw (  MiB/s): min= 4166, max= 6604, per=99.78%, avg=6351.03, stdev=28.43, samples=952
   iops        : min= 4166, max= 6604, avg=6351.03, stdev=28.43, samples=952
  lat (usec)   : 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.02%, 50=0.05%
  lat (msec)   : 100=0.08%, 250=99.84%
  cpu          : usr=0.28%, sys=3.04%, ctx=382005, majf=0, minf=262250
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=381906,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
   READ: bw=6365MiB/s (6674MB/s), 6365MiB/s-6365MiB/s (6674MB/s-6674MB/s), io=373GiB (400GB), run=60003-60003msec
[root@archlinux arch]#

Thanks. Virtio has the capacity to at least have less overhead, but I’d probably have to DIY-patch it myself to clear 10 gigabytes per second or more.

It is kinda sad it has more overhead than using something like an iSCSI virtual function.

Too bad Windows doesn’t permit re-sharing an NFS mount over SMB. Hilariously, I bet that would also have less overhead.

@wendell Try using a socket and specifying the thread pool size. I hit 23GB/s :grin:

  • used approx 50% host CPU
  • hit 23GB/s

FYI: it didn’t help Windows at all.

On the host:
sudo /usr/lib/qemu/virtiofsd -o cache=always --socket-path=/var/run/virtiofsd.sock  -o source=/pool/data/temp/vm_virtfs --thread-pool-size=32

And in the VM config:

<filesystem type="mount">
  <driver type="virtiofs" queue="1024"/>
  <source socket="/var/run/virtiofsd.sock"/>
  <target dir="tag"/>
  <alias name="fs0"/>
  <address type="pci" domain="0x0000" bus="0x07" slot="0x00" function="0x0"/>
</filesystem>
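Inside the Linux guest, the share is then mounted by the tag from <target dir> above (literally tag in this config); the mount point matches the path used in the fio runs below:

sudo mkdir -p /mnt/virtiofs
sudo mount -t virtiofs tag /mnt/virtiofs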

Full FIO output:
fio --name read --filename=/mnt/virtiofs/test.dat --rw=read --size=20g  --bs=1024K --ioengine=libaio  --iodepth=128 --direct=1 --runtime=60 --time_based --group_reporting --numjobs=8
read: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=128
...
fio-3.28
Starting 8 processes
Jobs: 8 (f=8): [R(8)][100.0%][r=19.0GiB/s][r=19.5k IOPS][eta 00m:00s]
read: (groupid=0, jobs=8): err= 0: pid=438: Sat Dec 11 21:35:00 2021
  read: IOPS=23.6k, BW=23.0GiB/s (24.7GB/s)(1380GiB/60001msec)
    slat (usec): min=14, max=13578, avg=337.06, stdev=356.95
    clat (usec): min=359, max=94128, avg=43105.98, stdev=7350.09
     lat (usec): min=771, max=94217, avg=43443.32, stdev=7396.18
    clat percentiles (usec):
     |  1.00th=[27919],  5.00th=[31851], 10.00th=[33817], 20.00th=[36963],
     | 30.00th=[39060], 40.00th=[41157], 50.00th=[42730], 60.00th=[44827],
     | 70.00th=[46924], 80.00th=[49021], 90.00th=[52691], 95.00th=[55837],
     | 99.00th=[62129], 99.50th=[64750], 99.90th=[71828], 99.95th=[74974],
     | 99.99th=[81265]
   bw (  MiB/s): min=18112, max=30728, per=100.00%, avg=23585.63, stdev=289.98, samples=952
   iops        : min=18112, max=30728, avg=23585.63, stdev=289.98, samples=952
  lat (usec)   : 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.02%, 50=83.02%
  lat (msec)   : 100=16.95%
  cpu          : usr=1.05%, sys=15.88%, ctx=1303629, majf=0, minf=262253
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=1413477,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
   READ: bw=23.0GiB/s (24.7GB/s), 23.0GiB/s-23.0GiB/s (24.7GB/s-24.7GB/s), io=1380GiB (1482GB), run=60001-60001msec

@wallacebw Okay I get it. Thanks.

50% host CPU :rofl: What is it doing?

@wendell

This might be a more fundamental problem than you think. The software-driven path for intra-host data transfer is slower than you’d expect.

memcpy() << DMA << zero-copy. memcpy() takes processor time the entire time; it’s just a loop doing load and store instructions.

And I think virtio is doing a memcpy(), it’s iffy if we’re gonna fix that.

So think of how a filesystem works. The user process allocates a buffer and hands a pointer to read(), which passes that pointer all the way down to the NVMe drive itself: userspace -> syscall -> VFS -> ZFS -> block -> NVMe. Then the NVMe drive DMAs the data directly into the buffer.

The way SMB Direct works is to effectively make that pointer be on the remote system, right? So virtiofsd needs to be RDMA-compatible to get this to really work. Otherwise we’re going to be memcpy()-bound and therefore CPU-load constrained. Thing is, I’m not 100% clear on how Windows is going to pass the RDMA-ness down to the virtiofsd code.

BTW, I didn’t see the virtiofsd code on the first Google page, just virtiofsd-rs. Do you know where the canonical repo is?

I’m interested in a collab here. At least until I review the current state of virtiofsd. Then we’ll have to see. :wink:

I ran the full fio test set on virtiofs; here are the results. I also updated the OP to save people from having to read the full thread.

Linux VM <-> Host with virtiofsd socket finding summary:


Setup note: Arch VM with 8 cores and 8GB of RAM, backed by hugepages.
Setup note: The host leveraged the virtiofsd socket with a thread pool size of 32.
Setup note: I converted the numbers from GiB/s to Gbit/s by multiplying by 8 (for example, 28.4 GiB/s × 8 = 227.2 Gbit) to better align with the other results. This is worst case, as no protocol/transport overhead is factored in.

Sequential IO

| Engine | RDMA | Seq. Read | Seq. Read CPU | Seq. Write | Seq. Write CPU | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| VIRTIOFS | N/A | 227.2 Gbit | ~40% | 46.7 Gbit | ~80% | |

Random IO

| Engine | RDMA | Read IOPS | Read CPU | Write IOPS | Write CPU | R/W IOPS | R/W CPU | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SRIOVFS | N/A | 49k | 48% | 2477 | ~72% | R: 2463 / W: 22467 | ~75% | |

Wellll I have some experience with this in other contexts. You have the right thinking, but things are actually pretty fast.

20 gigabytes/sec when we’re talking about 178 gigabytes/sec intrahost memory bandwidth is not suuuuuper unreasonable.

We’re already doing those kinds of numbers from guest to host in some of the experiments with Looking Glass and kicking around the shared memory device, although it doesn’t really do it that way anymore.

With those virtual devices, we know only one memcpy is happening, and we looked at doing PCI-to-PCI transfers from GPU memory to GPU memory ("even faster"), for which the stubs are actually already there in the kernel, but… that’s about it.

And a single 100 gigabit link is only about 12.5 gigabytes/sec, so I think 10 gigabytes/sec through a paravirtual device that SHOULD only be a single memcpy is not unreasonable.