Does anybody here have experience setting up NFSoRDMA in Ubuntu 18.04 LTS with the "inbox" driver?

Let me go look at and/or download the user’s guide/manual for your ConnectX-3 EN.

Hang on.

Would you happen to know what the HP part numbers are for the card?

Or how the HP part numbers map onto the Mellanox part numbers?

I downloaded the product brief for the ConnectX-3 EN cards and apparently, it comes in four flavours:

MCX311A-XCAT Single 10 GbE SFP+
MCX312A-XCBT Dual 10 GbE SFP+
MCX313A-BCBT Single 40/56 GbE QSFP
MCX314A-BCBT Dual 40/56 GbE QSFP

So it would depend on what the HP part numbers map onto the Mellanox part numbers.

I somehow missed this thread. @alpha754293 you are awesome. That is all. Thank you.

2 Likes

@wendell
Thank you.

You’re awesome yourself as well.

This is the card I’m using, the 546SFP+. Looks like that would be the MCX312A-XCBT?

https://support.hpe.com/hpesc/public/docDisplay?docId=c04636262&docLocale=en_US

Pulled this from lspci

                Product Name: HP Ethernet 10G 2-port 546SFP+ Adapter
                Read-only fields:
                        [PN] Part number: 779793-B21
                        [EC] Engineering changes: C-5733
                        [SN] Serial number: IL273302BL
                        [V0] Vendor specific: PCIe 10GbE x8 6W
                        [V2] Vendor specific: 5733
                        [V4] Vendor specific: 98F2B3CE9B50
                        [V5] Vendor specific: 0C
                        [VA] Vendor specific: HP:V2=MFG:V3=FW_VER:V4=MAC:V5=PCAR
                        [VB] Vendor specific: HP ConnectX-3Pro SFP+

So with it being a ConnectX-3 Pro EN card, the Mellanox part number that the dual 10 GbE SFP+ variant maps onto is the MCX312B-XCCT.

“Since the SM is not present, querying a path is impossible. Therefore, the path record structure must be filled with relevant values before establishing a connection. Hence, it is recommended working with RDMA-CM to establish a connection as it takes care of filling the path record structure”

(Source: RDMA over Converged Ethernet (RoCE) - MLNX_EN v4.9-4.1.7.0 LTS - NVIDIA Networking Docs)

You can read this section (RDMA over Converged Ethernet (RoCE) - MLNX_OFED v4.9-4.1.7.0 LTS - NVIDIA Networking Docs) on how to set up and enable RoCE.

Again, with your card being a ConnectX-3 Pro EN, you can use RoCE v2 if you want to.

I’ve never tried setting up RoCE on my Infiniband cards, but from reading the MLNX OFED driver documentation (the documentation for the EN driver points to the “full” MLNX OFED documentation for instructions on how to enable RoCE), it seems like it would be quite the pain in the absence of a subnet manager, because you may have to perform extra steps according to the documentation/instructions, so I have no idea how to do that.

Sorry.

1 Like

The info in that KB article seems to be one of the steps I was missing. Enabling RoCE v2 via modprobe options for mlx4_core has gotten me a few steps farther! I can now do rping and ucmatose between the servers, in both directions.
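For anyone following along, which RoCE version each GID ends up with can be double-checked via sysfs (a sketch; the device name and port index here are examples and may differ on your system):

# each populated GID index reports "IB/RoCE v1" or "RoCE v2"
for f in /sys/class/infiniband/mlx4_0/ports/1/gid_attrs/types/*; do echo "$f: $(cat $f 2>/dev/null)"; done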

I was able to get NFS to mount using version 3, which I think is OK. It says it’s using RDMA, but when I copy large files I’m able to see the traffic on the interface counters, which means the kernel/OS can see the traffic as well, so RDMA isn’t actually doing its magic? At least traffic is showing up on the correct interface!

Command: mount -o proto=rdma 10.19.0.14:/mnt/vmstore /mnt/nfs-ssd
Result:

10.19.0.14:/mnt/vmstore on /mnt/nfs-ssd type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,mountaddr=10.19.0.14,mountvers=3,mountproto=tcp,local_lock=none,addr=10.19.0.14)

and on the NFS server:

Jan 26 11:15:18 server kernel: mlx4_core 0000:0b:00.0: Have more references for index 1, no need to modify mac table
Jan 26 11:15:18 server rpc.mountd[4062]: authenticated mount request from 10.19.0.16:1006 for /mnt/vmstore (/mnt/vmstore)
Jan 26 11:15:18 server kernel: mlx4_core 0000:0b:00.0: Registering MAC: 0x98f2b3ce9b51 for port 2 without duplicate

Yeah…I’m not really sure how RDMA over Converged Ethernet works in that respect, because in theory RDMA shouldn’t affect the network counters, but the Ethernet layer apparently does.

re: " mount -o proto=rdma 10.19.0.14:/mnt/vmstore /mnt/nfs-ssd"
I was reading the Linux man pages for nfs(5) and I don’t have a better understanding of how proto interacts with mountproto when proto=rdma.

(Source: nfs(5) - Linux man page)

So…I have no idea.

(edit#2 On the IB side, things work a little bit differently:

aes0:/home/cluster on /home/cluster type nfs4 (rw,noatime,nodiratime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,clientaddr=10.0.1.117,local_lock=none,addr=10.0.1.100)

I don’t have mountproto for my NFSoRDMA mounts.)

I guess one of the things you might be able to check is whether you see any appreciable speed difference and/or any noticeable reduction in CPU load/utilisation (if any at all).

I know you said that even without RoCE the CPU utilisation was very low, so I’m not sure whether you’ll really see or notice a difference.

But if rping and ucmatose are working, that suggests RDMA is working for you now, and I assume that also means rping and ucmatose WEREN’T working for you previously.

On my Infiniband side of things, RDMA works differently, because with IB I am LITERALLY and COMPLETELY skipping over the Ethernet layer and assigning an IPv4 address via IPoIB, so none of my “normal” network monitoring tools will pick up on the IB/RDMA traffic.

As such, I have no idea how RoCE works because I don’t have it deployed on my cluster.

(If 100 GbE switches weren’t still so damn expensive, I might look into deploying 100 GbE as well, because my ConnectX-4 cards are dual-port VPI, which means that on the same card I can have one port running in IB mode and the other port running in ETH mode. But at that point, the PCIe 3.0 x16 interface becomes my limiting factor: it can only support up to 128 Gbps, and a single 100 Gbps IB port can already take up almost all of that bandwidth. To get more bandwidth, I would have to replace my cards and systems with something that supports at least PCIe 4.0 x16, which I’m not looking to do. But I digress.)

Sounds like it’s up and running for you.

Just make sure that your /etc/rdma/rdma.conf parameters:

on the host and clients:
XPRTRDMA_LOAD=yes

And additionally, on the host:
SVCRDMA_LOAD=yes

are set.
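A quick sanity check on both machines (just grepping the file mentioned above):

grep -E 'XPRTRDMA_LOAD|SVCRDMA_LOAD' /etc/rdma/rdma.conf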

(If you want to test it, and if your systems have enough RAM available, you can try and create a RAM drive (tmpfs), write a file to it, and then send it over and see what kind of speeds you’re getting. That might be one way for you to test to see if RDMA is working for you or not.)
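Something along these lines is what I mean (a rough sketch; the size is a placeholder and the /mnt/nfs-ssd mount point is taken from your example above, so adjust to the RAM you actually have):

# client: RAM-backed scratch area plus a test file
sudo mkdir -p /mnt/ramtest
sudo mount -t tmpfs -o size=8G tmpfs /mnt/ramtest
dd if=/dev/zero of=/mnt/ramtest/testfile bs=1M count=4096

# push it across the NFSoRDMA mount and time it
time cp /mnt/ramtest/testfile /mnt/nfs-ssd/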

I have also found that in “normal” data transfers (i.e. for storage management), RDMA is really limited to the speeds of the storage devices. So for me, because I am using spinning rust hard drives, 800 MB/s write (~6.4 Gbps) is about the best that I can do.

Conversely, when I use an application that uses the message passing interface (MPI), the application can use up to around 80 Gbps during a solve process.

So, true to its name, remote direct memory access - the speeds are usually realised when it’s RAM-to-RAM transfers. I think that even with four Samsung 860 EVO 1 TB SATA 6 Gbps SSDs in RAID0, the best that I’ve been able to see is maybe around 4.69 GB/s (~37.52 Gbps), but that’s the exception rather than the norm. (And sometimes, this is possible due to RDMA to RAM to SSD cache so it’s measuring the interface-to-cache speed.)

So…that might be a way for you to test to check and verify that RoCE is working properly for you.

edit

I’m not sure if it matters or not, but I am using NFS version 4.1 on my CentOS cluster.

1 Like

What a great thread, I’m learning a lot and appreciate your detailed responses.

Good point about Ethernet frames incrementing the counters; that could be what’s going on. I was using glances to monitor CPU and network while copying a 100G file from one SSD to another over the network, so it could have been showing me Ethernet utilization rather than TCP/IP throughput. I will have to use other tools to see exactly which counters are incrementing.

As I understand it, after reading a kernel mailing list thread from 2012 (here), mountproto is the method used to connect and disconnect the mount for NFS v2 and v3 connections, whereas proto is the transport used for carrying the data. So TCP was used to authenticate (or whatever) with the NFS server, and RDMA is used to transport data, I hope?
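Spelled out explicitly, that split would look something like this (a sketch reusing the addresses from my earlier mount):

# mount/umount RPCs go over TCP (mountproto), the actual NFS transport is RDMA (proto)
mount -t nfs -o vers=3,proto=rdma,port=20049,mountproto=tcp 10.19.0.14:/mnt/vmstore /mnt/nfs-ssd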

I’m not certain when, but it looks like the xprtrdma and svcrdma modules are obsolete, and replaced by rpcrdma now… maybe something new in Linux 5.x?

modinfo xprtrdma
filename:       /lib/modules/5.15.14-200.fc35.x86_64/kernel/net/sunrpc/xprtrdma/rpcrdma.ko.xz
alias:          rpcrdma6
alias:          xprtrdma
alias:          svcrdma
license:        Dual BSD/GPL
description:    RPC/RDMA Transport
author:         Open Grid Computing and Network Appliance, Inc.
depends:        ib_core,sunrpc,rdma_cm
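So on this kernel, loading either of the old names should just pull in rpcrdma via the aliases:

modprobe xprtrdma   # resolves to rpcrdma through the alias
lsmod | grep rpcrdma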

An oddity I noticed: one of my virtual machines seems to have established an NFSv4 RDMA connection to the host server. The VM is using an SR-IOV virtual function passed through from the Mellanox card. Not sure why it works, but I won’t complain.

the-ripper:/mnt/storage/movies on /mnt/storage/movies type nfs4 (rw,relatime,vers=4.2,rsize=8192,wsize=8192,namlen=255,soft,proto=rdma,port=20049,timeo=14,retrans=2,sec=sys,clientaddr=192.168.2.12,local_lock=none,addr=192.168.2.14)

However, if I force my test environment to use nfs vers=4, it complains with a protocol error.

Hey at least this is enough progress to keep me interested.

Testing with cached reads on the server and a ramdisk on the client, I was seeing about 8 out of 10 Gbit at times, just showing up as packets on the interface. The TCP and UDP transport numbers were barely moving, probably from other things the server is doing.

The cp command was showing between 70 and 99% single thread CPU usage. But maybe that’s just how cp works, rather than cpu load from transport overhead. I didn’t see nfsd spike on cpu usage for example.

So I guess that means RDMA is actually skipping some of the middle-men along the way!

Now I just need to figure out why my Ubuntu virtual machine will talk NFS v4.2 over RDMA to my server, but my Fedora physical machine will only talk NFS v3 over RDMA.

EDIT:
A bit more not-very-scientific testing, copying from a ramdisk to a ramdisk:

RDMA direct cable connection: Peak of 8gbit sustained, nfsd cpu usage 6 to 9%.
RDMA through basic L3 switch (no DCB/QoS support): Peak of 8gbit sustained, nfsd cpu usage 9-11% (more retries maybe?)
TCP direct connection: Peak of 6gbit sustained, nfsd cpu usage 16-20%
TCP through the switch: same as above

Thank you and you’re welcome.

Yeah, so for IB, because RDMA completely bypasses Ethernet and the network stack, I HAVE to use IB specific tools to be able to read the NIC’s packet counters, etc.

(I haven’t run my IB card in ETH mode in probably over 3-5 years now, so my memory of it is a little bit fuzzy. I VAGUELY remember that when I do run it in ETH mode, the GNOME System Monitor will pick up on the Ethernet traffic (and show/plot it). But I don’t remember if that was back when I had JUST got my cards and was only using a DAC cable between two systems to test them out and make sure they were working, LONG before I learned everything else that I have learned since then.)

Yeah…not sure.

My CentOS cluster headnode is still running the 3.10.0-1062 kernel that ships with CentOS 7.7.1908. My principal thought process with running such an “old” kernel is that my 4930K cluster headnode doesn’t have NVMe slots; it doesn’t have a lot of things. So, if it isn’t broken, there is no real reason for me to update it. (If it works, don’t break it.)

I also don’t know whether those modules are a RedHat/CentOS thing or something else. shrug

For the record though, svcrdma doesn’t appear to be a module because when I type in lsmod | grep rdma, it doesn’t show up in there.

As I mentioned, it’s in /etc/rdma/rdma.conf. Again, shrug. YMMV.

Like I said earlier, “if it works, don’t break it!”

lol…

(Sidenote: I would recommend keeping a OneNote or some other note-taking tool so that you can document your deployment notes and the lessons learned here. That’s how I was able to provide the instructions for how I deploy my NFSoRDMA setup. If you have stumbled upon what works for you, now would be a good time to go back through your command history and jot it down, so that if you ever have to redeploy your server, your clients, or both, you will have all of the commands you need for said deployment.)

I’m glad that this appears to be working for you now.

Yay!

That’s pretty good. 80% of the theoretical/rated capacity is usually the maximum that I would observe in service/in practice.

Yeah…so, that depends.

Like, I know that if I use rsync, it does its own checking to make sure the sources and targets are the same, and that kind of processing takes CPU load.

For nfsd, I would see it show up in the load averages in top, but in the GNOME System Monitor, it doesn’t show up as CPU load. shrug

Yeah…that’s weird.

And what makes that even weirder is that my older CentOS 7.7.1908 is able to negotiate the NFS connection using nfs4 version 4.1 automatically whilst the newer Fedora didn’t/couldn’t.

So, there might be a bit more research needed into that, if you are really that interested in it. (I don’t have Fedora set up/deployed, so unfortunately I don’t have a way to test that; plus, again, the fact that I am using IB instead of ETH also makes a difference.)

re: your results
Looks like the NFSoRDMA between your server and Fedora and Ubuntu clients are working (although, like you said though, it might appear that the Ubuntu, with SR-IOV, might be working a LITTLE bit better than your physical Fedora client). But it still appears to be working, which is the important piece.

(Sidenote: I have struggled to get Ubuntu to be the NFSoRDMA SERVER. My OneNote notes don’t show that I ever got that working properly, because it would complain about proto=rdma and wouldn’t take port=20049 either. So…it seems that Linux distros derived from RedHat work better as NFSoRDMA servers than Ubuntu systems do. I might revisit that in the future, but right now CentOS is working for me as my NFSoRDMA server, so I’m not going to spend much time messing with it when it already works. Ubuntu clients seem to work better than Ubuntu servers for this. shrug Go figure.)

You get higher transfer rates and a lower CPU usage with NFSoRDMA vs. without, so that gives some data/confidence that it appears to be doing what it is supposed to.

I just googled “RDMA copy” and this is a project that came up:

So…it might be worthwhile for you to look into, to see if it helps reduce your CPU load even further, if you really want/need/are otherwise interested in trying to do so.

But 8 Gbps on a 10 GbE link is about where I would expect it to max out unless you start spending quite a bit of time, tuning the network performance parameters.

For an untuned network, it’s still pretty good, I think.

For my 100 Gbps IB network, given that in practice my slowest storage devices are mechanically rotating hard drives, I don’t even bother trying to tune the network, because the best write speed I can get is about 800 MB/s (~6.4 Gbps out of the 100 Gbps that’s theoretically possible).

It’s good enough. It’s faster than my GbE network, and I try not to use SSDs of any kind (if I can avoid it) because if I managed to “unlock” writing to, say, NVMe SSDs at the full ~12.5 GB/s (100 Gbps), I would burn through the write endurance limit on those SSDs, even enterprise-grade ones, through repeated use of that blazing-fast speed (which would mean the SSD would die, and I would have to try to RMA it for a new one).

So I don’t bother with it anymore.

Yes, it’s nice to see those speeds, but if/when you handle or process enough data that you can kill an SSD in a little over a year-and-a-half, it just isn’t worth it to me anymore.

1 Like

I’ll check that out. I have a few bookmarks saved for doing zfs snapshots send/recv over RDMA, that’s probably the bulk of what I’d be sending over the network between the two machines.

I wonder if this has something to do with the bottleneck? I’m not sure if the 16.000 Gb/s the kernel is talking about is gigabytes or gigabits, but it seems my network card is only linking at PCIe 1.0 speed, 2.5 GT/s, vs. 5.0 GT/s for PCIe 2.0 and 8.0 GT/s for PCIe 3.0, IIRC.

threadripper machine:

mlx4_core 0000:0b:00.0: 16.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s PCIe x8 link at 0000:00:03.2 (capable of 63.008 Gb/s with 8.0 GT/s PCIe x8 link)

dual xeon machine:

mlx4_core 0000:85:00.0: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)
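For reference, the trained vs. maximum link speed can also be read straight from lspci (bus address taken from each machine's kernel message above):

# LnkCap = what the card supports, LnkSta = what the link actually negotiated
sudo lspci -s 0b:00.0 -vv | grep -E 'LnkCap:|LnkSta:'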

Unfortunately, I don’t have any direct experience with this, but my thought is that depending on the mechanism being used for the data transfer (whether it’s “straight” cp, rsync, or RDMA copy), there might be ways to make the transfers go faster.

As a part of my google search for “RDMA copy” yesterday, I also found this (that might be of interest to you) if your transfers are primarily unidirectional:

Gigabits.

A PCIe 1.0 2.5 GT/s x8 link is capable of 2.000 gigabytes per second transfers (16 gigabits/s).

(Source: PCI Express - Wikipedia)

Whereas a PCIe 3.0 8 GT/s x8 link is capable of 7.877 GB/s (~63 Gb/s) transfers.
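The gap between the raw transfer rate and the usable bandwidth is just the line encoding (per lane, in one direction):

PCIe 1.0: 2.5 GT/s x 8b/10b    = 2.0 Gb/s per lane  -> x8 = 16 Gb/s  (~2.0 GB/s)
PCIe 3.0: 8.0 GT/s x 128b/130b ≈ 7.88 Gb/s per lane -> x8 ≈ 63 Gb/s  (~7.88 GB/s)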

But it is interesting and strange that your card is only connecting at PCIe 1.0 x8, so I would check that the BIOS settings are correct, and you might need to power down the system and unseat/reseat the card to make sure that it is making proper electrical contact.

That seems strange that your Threadripper machine is doing that.

But to your question if it might have something to do with it - it is entirely possible.

I don’t remember if the PCI Express specification denotes the bit rates or the transfer rates (in transfers per second) as being unidirectional or bidirectional.

If it is bidirectional, then it would make sense for the card to be limited by 8 Gbps transfer rate (half of 16 Gbps). But if it is unidirectional, then even with a 8 Gbps transfer rate, there should be still another 8 Gbps of headroom available, so not really sure.

But I would definitely look into that and try and figure out why it is doing that because your Threadripper system should have no shortage of PCIe 4.0 lanes to support this even if the card itself is only a PCIe 3.0 card.

My box is an old 2017 era Threadripper, so no PCIe 4.0, but plenty of 3.0 lanes.

Plus, I am doing weird, and certainly unsupported, things on this box. I have a Supermicro x8x8 card connected to a vertical GPU riser cable, plugged into an x16 slot which is configured for x8x8 in the BIOS. I suspect there’s an issue with signal degradation somewhere along the way, or maybe just a loose connection.

I have the Mellanox card and an Intel x520 in the vertical GPU bay of my case. The Intel card is sync’d up at full speed, but the Mellanox is not. Every slot except for an x4 PCIe 2.0 is utilized on this machine, so it’s rather busy in there :smile:

Possibly.

Yeah…so you know your system best.

But if there is an opportunity for you to swap the Mellanox card around so that it gets its own dedicated x16 slot, or some other way to rotate the cards around that ensures the Mellanox card gets its full bandwidth, that could be helpful/useful.

The other option might be to force the PCIe x8 slot on the riser to GEN_2, or maybe even GEN_3, link mode and see if your system will still POST.

But I mean, other than that, it sounds like the rest of the whole NFSoRDMA thing is working for you, so once you figure out an arrangement of add-in cards that works, you should be good to go.

Took the case cover off, and the card was actually partly unseated from the slot somehow! Shut down, reseated it, and the PCIe link is now up at the full x8, 63 Gb/s bandwidth.

I have my various virtual machines, and a workstation, talking to the server’s NFS mounts via RDMA now. Don’t know what I’ve gained, other than knowledge and insight into some datacenter-grade networking voodoo.

Thanks again for all the help!

1 Like

That’s good that you were able to reseat the card.

Well…I think that you also gained faster transfer speeds and/or at least lower CPU utilisation for the same transfers (even if it was only at the slower 6 Gbps, and even better if it was transferring at the 8 Gbps).

Now that you have the card reseated, you might want to run the ramdrive-to-ramdrive test again (optional, of course).

Learning something new, in my opinion, can sometimes be the “prize” unto itself.

(e.g. my 12900K system: the Realtek RTL8125 2.5 GbE NIC is possibly causing a kernel panic in CentOS 7.7.1908 (a conflicting module, possibly?). Either way, I’ve disabled that NIC and I am controlling that system entirely over my 100 Gbps Infiniband network instead. I’ll probably plug the cable back in when I need to interact with it via VNC, but at least the simulation that I’ve been trying to get running for a few days now is finally going, and the system is stable enough with the RTL8125 NIC disabled that it looks like it has a higher probability of actually FINISHING the run this time. My point is, sometimes you end up in really odd/crazy situations like this where you end up using your 10 Gbps interface to work with the system instead of the “normal” RJ45 GbE NIC. shrug)

You’re welcome.

Glad that I was able to help you get your system up and running. :slight_smile:

1 Like

Hello,
I might have some useful information to contribute on the Infiniband side.

We have a bunch of different scenarios here so I might know where some people are succeeding, and others are not. Some machines here have only 2 x16 slots used for GPU compute and then one x8 slot so they couldn’t take a ConnectX4 NIC. So we wound up with a mix of CX3 and CX4 cards. It is a slightly confusing deal with Mellanox and NFSoRDMA support. Different OS/versions and the NICs have a compatibility matrix. For ConnectX-3 you need the LTS or 4.x OFED and for CX4 and newer NICs it’s the 5.x version drivers.
Mostly, older OSes are paired with support for older NICs. So if you are trying this on Ubuntu with a CX3 card, you need to be on Ubuntu 18.04 at the newest; if you want to do it on a newer release, you need to upgrade to a CX4 NIC, which puts you on the 5.x version of OFED that supports Ubuntu 20.04. Also, it’s only 18.04.0 that works: the 18.04.1+ kernels are not supported by the OFED installer, so it won’t run and just says “unsupported kernel” without any further information. I did install OFED and then upgrade the kernel afterward with some success; on other OSes and versions the same move would break the drivers. The installers don’t warn you about any of this, but the info is buried in various places; you must know what to search for.
If you use your inbox drivers and they have NFSoRDMA switched on already, you are sweet. That is most RHEL/CentOS and SUSE releases as well. But if you need to use OFED, you are at the mercy of the Mellanox supported list.

Also, there’s a switch you have to pass to the mlnxofedinstall script, --with-nfsrdma, which installs the extra packages. By default it doesn’t install NFSoRDMA support, even on an OS listed as supported.
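So the install step ends up looking something like this (a sketch, run from the extracted MLNX_OFED bundle directory):

sudo ./mlnxofedinstall --with-nfsrdma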

Here is a link to the official setup doc: https://support.mellanox.com/s/article/howto-configure-nfs-over-rdma--roce-x. This doc is old, its instructions are CentOS 7.6 specific, and it leaves out that you need to set the switch on the installer to enable it.

After that, this is the command that enables RDMA once you have the NFS server set up:
echo rdma 20049 > /proc/fs/nfsd/portlist

This is the error you will get if you don’t have support from OFED or the inbox drivers:
bash: echo: write error: Protocol not supported

Also, try running the command twice. On Ubuntu, for some reason, I get an error the first time but the second time works. Happens repeatedly, far out.

After that, just check that you have this stuff enabled, as others mentioned.
client:
modprobe xprtrdma

server:
modprobe svcrdma
Should work; I have done this on the most recent few versions of the most popular Linux flavors, as long as you have inbox driver or OFED support enabled.
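If you want those to persist across reboots on a systemd distro, a modules-load.d entry is one option (this bit is my own habit rather than something from the Mellanox doc; the file name is arbitrary):

# client
echo xprtrdma | sudo tee /etc/modules-load.d/nfsordma.conf
# server
echo svcrdma | sudo tee /etc/modules-load.d/nfsordma.conf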

1 Like

Thank you.

I appreciate your insights.

So I created this thread a little over 2-and-a-half years ago.

Back then, the Mellanox OFED driver did not support NFSoRDMA according to Mellanox’s support staff’s own admission:

"Our own driver, Mellanox OFED, does not support NFSoRDMA.

Many thanks,

~Mellanox Technical Support"

(Source: Why is NFSoRDMA in CentOS 7.6.1810 limited to 10 Gbps? - Software And Drivers - NVIDIA Developer Forums)

(Albeit that was with CentOS 7.6.1810, but the same was true for Ubuntu as well.)

In fact, if anything, I think that I was trying to run OpenFOAM at the time, and there was a critical error that installing OpenFOAM on CentOS produced, which I have noted in my OneNote file (for CentOS):

“DO NOT INSTALL OpenFOAM!!!”

lol…

As a result of that, I probably switched to Ubuntu 18.04, and because of Mellanox’s own admission that their drivers at the time didn’t support NFSoRDMA (in direct contravention of their marketing claims/materials), that’s probably why I created this thread: I can get NFSoRDMA with the “inbox” driver that ships with CentOS, but I probably wasn’t equally successful with the “inbox” Ubuntu driver for mlx5_core (and the rdma kernel modules, etc.).

To that end though:

It’s been over 2 years since I’ve tried using Ubuntu and NFSoRDMA.

(My micro cluster is now exclusively running CentOS because I know it works, save for my not running OpenFOAM anymore.)

For the record, I’m using ConnectX-4 cards. (The original, not the EN nor LX versions of that card.)

Mellanox DID eventually re-enable the NFSoRDMA capability in their driver (I think as of version 4.9.x), but because of my experience with Mellanox disabling features at will, for seemingly no apparent reason, I don’t trust their drivers anymore; I have no confidence that they won’t deactivate features/functionality from their drivers again, at will, in the future.

So, I don’t use their drivers, and I’ve blacklisted their drivers as a philosophical measure.

(The CentOS “inbox” drivers work well enough for what I am doing. The latest tests running ib_send_bw between two Ryzen cluster nodes obtain 96.58 Gbps out of a possible 100 Gbps, which is about as good as I can really get without doing a LOT of performance tuning, as the workload on said 100 Gbps IB link varies significantly.)
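(For reference, that number comes from the stock perftest tools, run with one node as the listener and the other pointing at it; the device name and hostname here are just examples:)

# node A (listener)
ib_send_bw -d mlx5_0
# node B (client), pointing at node A
ib_send_bw -d mlx5_0 nodeA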

Yeah, I’ve read that in the driver documentation/manual.

Yeah, for the CentOS “inbox” drivers, on the host/server; to enable RDMA, I do this:

To enable NFS over RDMA on host:
host$ sudo vi /etc/rdma/rdma.conf
XPRTRDMA_LOAD=yes
SVCRDMA_LOAD=yes
save, quit

and this:

host$ sudo vi /etc/sysconfig/nfs
RPCNFSDARGS="--rdma=20049"
save, quit

On the client side, I do this:

To enable NFS over RDMA on client:
client$ sudo vi /etc/rdma/rdma.conf
XPRTRDMA_LOAD=yes
save, quit

(Those are from my own notes from my OneNote.)

I haven’t tried running 100 Gbps IB in Ubuntu 18.04 (or any other version for that matter) since I posted this.

(Mostly because a lot of the other programs that I use do not always run in Ubuntu, for a variety of reasons too many to list here, whereas those programs, except for OpenFOAM, run in CentOS, so that’s why I’ve stuck with CentOS on my micro HPC cluster.)

Maybe if I’m bored one day and want a headache, I’ll try this again on my Ryzen nodes. (lol…)

(My old Xeon cluster is decommissioned. Wife says that the system is too loud and gives her a headache.)

I just didn’t know if the setup was as easy as it is in CentOS, because in CentOS you run:

# yum groupinstall -y 'Infiniband Support'

and it will pretty much install everything that you need for it.

And then you edit those files, set up your NFS shares, export them, and then mount them with:

host:/home/cluster /home/cluster nfs defaults,rdma,port=20049,noatime,nodiratime 0 0

and [boom], you’re up and running.

It’s really easy.
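(For completeness, the export side is just the usual /etc/exports entry plus an exportfs; something like this, with the path from my fstab line above and illustrative export options:)

# /etc/exports on the host
/home/cluster    *(rw,async,no_root_squash)

# then re-export
sudo exportfs -a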

I think that when I originally wrote this, Ubuntu wasn’t this easy; coupled with the fact that Mellanox had taken NFSoRDMA support out of their then-current driver package (again, inexplicably), I didn’t know if Ubuntu back then had a similar way of getting around the MLNX OFED limitation a la CentOS.

Thanks.

So what are the minimum requirements for attempting NFS over RDMA in the first place? My MikroTik switch does not do ECN, but perhaps that’s not needed for RoCE? This Wikipedia article would seem to indicate that lossless Ethernet is not a strict requirement.

So if I have a couple of RoCE-capable ConnectX-3 cards, then NFS over RDMA is theoretically possible? It probably won’t make too much of a difference, but my curiosity is piqued.

Wow, that sounds like a mess. Yes, I definitely agree the easiest thing is just to use a distro where the inbox drivers support it! IMO Mellanox is just doing planned obsolescence of the older NICs. You can hear the argument for that, but I think most of the time it’s about the money. In reality, something like OpenFOAM or ANSYS (in our case) sees very little speedup from FDR to EDR IB; the big jump is having RDMA going at all. That doesn’t sell network cards though.

Edit: I also found Mellanox support useless. They wanted me to pay to update the support for the switch just to get them to send me docs. I did, and they sent me the same docs, which were just the same as what Google found and gave zero helpful information. I feel like 90% of the practical side of IT is learned in the field once you understand the groundwork theory.

To learn which versions were supported I stepped through the latest five versions of every OS and just tried OFED. Let’s just say their matrix leaves much to be desired.