Just picked up a couple of ConnectX-4 cards (the VPI variant, so IB/Ethernet capable) and some Intel 100Gb QSFP28 modules.
Trying to get NFS over RDMA working, and on the surface it appears to be. With the adapters in Ethernet mode, I can mount the NFS share with the rdma option and get about 600MB/s of throughput to the remote mount point (CPU usage in htop stays low, so I'm assuming RDMA really is working in some capacity). That said, mounting without explicitly setting the rdma option gives identical performance, which makes me suspicious.
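One way to confirm whether the mount actually negotiated RDMA rather than silently falling back to TCP is to check the mount options the client ended up with. A quick sketch (the mount point path is just a placeholder):

```shell
# Show the effective options for each NFS mount; look for proto=rdma
# in the Flags line. If it says proto=tcp, RDMA was never negotiated.
nfsstat -m

# Alternatively, inspect the kernel's view of the mount directly:
grep ' nfs' /proc/mounts
```

If both show `proto=tcp`, that would explain identical numbers with and without the rdma option.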
But 600MB/s is far shy of the 80-90Gbps (10-11GB/s) I'm seeing from the ib_write_bw test, and the latency is about 5 times what ib_write_lat shows as a practical minimum. Testing against a tmpfs export (to rule out disk) doesn't improve this at all.
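Worth noting that ib_write_bw posts many outstanding RDMA writes, while a naive sequential read of an NFS mount is effectively one stream with limited queue depth, so the comparison isn't apples to apples. A hypothetical fio run with parallel queued I/O (paths and sizes are placeholders) would show what the transport can actually sustain:

```shell
# Parallel, queued reads against the NFS mount; direct=1 bypasses the
# client page cache so we measure the wire, not local RAM.
fio --name=nfsrdma --directory=/mnt/share --rw=read --bs=1M \
    --ioengine=libaio --direct=1 --iodepth=16 --numjobs=4 \
    --size=4G --group_reporting
```

If aggregate throughput scales well past 600MB/s with numjobs, the bottleneck is per-stream RPC behavior rather than the RDMA transport itself.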
The NFS server is a Dell T420 with dual 2470v2's and 96GB of 1333MHz DDR3. The client is a Supermicro board with dual 2690v2's and 160GB of 1600MHz DDR3. Both adapters are in PCIe 3.0 x16 slots, and lshca confirms they're linked at that width.
I'm running the latest firmware for these cards (Dell FW from 2021), and I've got the latest OFED driver from Nvidia (I'm on Rocky 8.7; I tried the inbox drivers from the repo, but one card wouldn't report all of its info, and NFS wouldn't mount with the rdma option either).
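In case it helps anyone reproduce this: for the rdma mount option to work at all, the server has to be listening for NFS/RDMA (port 20049 by default), and the client has to ask for it explicitly. A minimal sketch, assuming an export named `server:/export` (placeholder names):

```shell
# Server side: load the RDMA transport and add an RDMA listener for nfsd.
modprobe rpcrdma
echo "rdma 20049" > /proc/fs/nfsd/portlist
cat /proc/fs/nfsd/portlist   # should now list the rdma 20049 entry

# Client side: request RDMA explicitly with the matching port.
mount -o rdma,port=20049 server:/export /mnt/share
```

Without the server-side listener, some client setups quietly fall back to TCP instead of failing the mount.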
So basically, I'm trying to figure out why I'm getting 600MB/s instead of, say, 4000-8000MB/s, and why my latency is 5 times higher than the practical minimum the hardware demonstrates.
Ultimately, the plan is to share a Jenkins workspace between two servers for code compilation, and possibly to use distcc in the future for distributed builds.