Does anybody here have experience setting up NFSoRDMA in Ubuntu 18.04 LTS with the "inbox" driver?

My box is an old 2017 era Threadripper, so no PCIe 4.0, but plenty of 3.0 lanes.

Plus, I am doing weird, and certainly unsupported, things on this box. I have a Supermicro x8x8 card connected to a vertical GPU riser cable, plugged into an x16 slot which is configured for x8x8 in the BIOS. I suspect there’s an issue with signal degradation somewhere along the way, or maybe just a loose connection.

I have the Mellanox card and an Intel X520 in the vertical GPU bay of my case. The Intel card is sync’d up at full speed, but the Mellanox is not. Every slot except for an x4 PCIe 2.0 is utilized on this machine, so it’s rather busy in there :smile:

Possibly.

Yeah…so you know your system best.

But if there is an opportunity for you to swap the Mellanox card around so that it gets its own, dedicated x16 slot, or if there is some other way to rotate the cards around that would ensure the Mellanox card gets its full bandwidth, that can be helpful/useful.

The other option might be to see if you can force the PCIe x8 slot on the riser card to GEN_2, or maybe even GEN_3, link mode and see if your system will still POST; that might be another possible solution.

But other than that, it sounds like the rest of the NFSoRDMA setup is working for you, so once you figure out an arrangement of add-in cards that works for you, you should be good to go.

Took the case cover off, and the card was actually partly unseated from the slot somehow! Shut down, reseated it, and the PCIe link is now up at the full x8 ~63 Gbps bandwidth.

I have my various virtual machines and a workstation talking to the server’s NFS mounts via RDMA now. I don’t know what I’ve gained, other than knowledge and insight into some datacenter-grade networking voodoo.

Thanks again for all the help!

1 Like

That’s good that you were able to reseat the card.

Well…I think that you also gained faster transfer speeds and/or at least lower CPU utilisation for the same transfers (even if it was only at the slower 6 Gbps, and even better if it was transferring at the full 8 Gbps).

Now that you have the card reseated, you might want to run the ramdrive-to-ramdrive test again (optional, of course).

Learning something new, in my opinion, can sometimes be the “prize” unto itself.

(e.g. my 12900K system - the Realtek RTL8125 2.5 GbE NIC is possibly causing a kernel panic in CentOS 7.7.1908 (a conflicting module, possibly?). Either way, I’ve disabled that NIC and I am controlling that system entirely over my 100 Gbps Infiniband network instead. I’ll probably plug the cable back in when I need to interact with it via VNC, but at least the simulation that I’ve been trying to get running for a few days now is finally going, and the system is stable enough with the RTL8125 NIC disabled that it looks like it has a higher probability of actually FINISHING the run this time. My point is, sometimes you end up in really odd/crazy situations like this where you end up using your 10 Gbps interface to work with the system instead of going through the “normal” RJ45 GbE NIC. shrug)

You’re welcome.

Glad that I was able to help you get your system up and running. :slight_smile:

1 Like

Hello,
I might have some useful information to contribute on the Infiniband side.

We have a bunch of different scenarios here, so I might know where some people are succeeding and others are not. Some machines here have only two x16 slots, which are used for GPU compute, and then one x8 slot, so they couldn’t take a ConnectX-4 NIC. So we wound up with a mix of CX3 and CX4 cards. NFSoRDMA support with Mellanox is a slightly confusing deal: the OS releases and the NICs have a compatibility matrix. For ConnectX-3 you need the LTS or 4.x OFED, and for CX4 and newer NICs it’s the 5.x drivers.
Mostly, older OS releases are paired with support for older NICs. So if you are trying this on Ubuntu with a CX3 card, you need to be on Ubuntu 18.04 at the newest; if you want to do it on a newer release, you need to upgrade to a CX4 NIC, which puts you on the 5.x version of OFED that supports Ubuntu 20.04. Also, it’s only 18.04.0 that works: the 18.04.1+ kernel is not supported by the OFED installer, which won’t run and just says “unsupported kernel” without any further information. I did install OFED and then upgrade the kernel afterward with some success; on other OS releases and versions the same move would break the drivers. The installers don’t warn you about any of this, and the info is buried in various places; you must know what to search for.
If you use your inbox drivers and they have NFSoRDMA switched on already, you are sweet. That is the case for most RHEL/CentOS and SUSE releases as well. But if you need to use OFED, you are at the mercy of the Mellanox supported list.

Also, there’s a switch you have to pass to the mlnxofedinstall script: --with-nfsrdma, which installs the extra packages. By default, it doesn’t install NFSoRDMA support even on an OS listed as supported.
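
For reference, the install step then looks something like this (a sketch; the exact bundle directory name depends on the OFED version and distro you downloaded):

# from inside the extracted MLNX_OFED bundle
sudo ./mlnxofedinstall --with-nfsrdma
# reload the stack afterwards (or just reboot)
sudo /etc/init.d/openibd restart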

Here is a link to the official setup doc: https://support.mellanox.com/s/article/howto-configure-nfs-over-rdma--roce-x. This doc is old, the instructions are CentOS 7.6 specific, and it leaves out that you need to set the switch on the installer to enable it.

After that, this is the command that enables RDMA once you have the NFS server set up:
echo rdma 20049 > /proc/fs/nfsd/portlist

This is the error you will get if you don’t have support from OFED or the inbox drivers:
bash: echo: write error: Protocol not supported

Also, try running the command twice. On Ubuntu, for some reason, I get an error the first time but the second time works. Happens repeatably, far out.
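
If you want that setting to survive a reboot, newer nfs-utils versions also let you configure it in /etc/nfs.conf instead of echoing into /proc every time; a sketch, assuming your nfs-utils supports the rdma keys there (the exact key names vary a bit between versions, so check man nfs.conf):

[nfsd]
rdma=y
rdma-port=20049

Then restart the NFS server (e.g. sudo systemctl restart nfs-server) and check /proc/fs/nfsd/portlist to confirm the rdma listener shows up.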

After that, just check that you have these modules loaded, as others have mentioned:
client:
modprobe xprtrdma

server:
modprobe svcrdma
That should work; I have done this on the most recent few versions of the most popular Linux flavors, as long as you have the inbox drivers or OFED support enabled.
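
If you want those modules to load automatically at boot, a small sketch using systemd’s modules-load.d mechanism (the file name nfsordma.conf is arbitrary; any *.conf in that directory works):

server$ echo svcrdma | sudo tee /etc/modules-load.d/nfsordma.conf
client$ echo xprtrdma | sudo tee /etc/modules-load.d/nfsordma.conf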

1 Like

Thank you.

I appreciate your insights.

So I created this thread a little over 2-and-a-half years ago.

Back then, the Mellanox OFED driver did not support NFSoRDMA according to Mellanox’s support staff’s own admission:

"Our own driver, Mellanox OFED, does not support NFSoRDMA.

Many thanks,

~Mellanox Technical Support"

(Source: Why is NFSoRDMA in CentOS 7.6.1810 limited to 10 Gbps? - Software And Drivers - NVIDIA Developer Forums)

(Albeit that was with CentOS 7.6.1810, but the same was true for Ubuntu as well.)

In fact, if anything, I think that I was trying to run OpenFOAM at the time, and installing OpenFOAM on CentOS produced a critical error, which I have noted in my OneNote file (for CentOS):

“DO NOT INSTALL OpenFOAM!!!”

lol…

As a result of that, I probably switched to Ubuntu 18.04, and because of Mellanox’s own admission that their drivers at the time didn’t support NFSoRDMA (in direct contravention of their marketing claims/materials), that’s probably why I created this thread: I could get NFSoRDMA with the “inbox” driver that ships with CentOS, but I wasn’t equally successful with the “inbox” Ubuntu driver for mlx5_core (and the rdma kernel modules), etc.

To that end though:

It’s been over 2 years since I’ve tried using Ubuntu and NFSoRDMA.

(My micro cluster is now exclusively running CentOS because I know it works, save for my not running OpenFOAM anymore.)

For the record, I’m using ConnectX-4 cards. (The original, not the EN nor LX versions of that card.)

Mellanox DID eventually re-enable the NFSoRDMA capability in their driver (I think as of version 4.9.x), but because of my experience with Mellanox just disabling features at will, for seemingly no apparent reason, I don’t trust their drivers anymore (i.e. I don’t trust that they won’t deactivate features/functionality again, at will, for no apparent reason, in the future).

So, I don’t use their drivers, and I’ve blacklisted their drivers as a philosophical measure.

(The CentOS “inbox” drivers work well enough for what I am doing. The latest tests running ib_send_bw between two Ryzen cluster nodes obtain 96.58 Gbps out of a possible 100 Gbps, which is about as good as I can really get without doing a LOT of performance tuning, as the workload on said 100 Gbps IB line varies significantly.)
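
For anyone who wants to run the same kind of raw bandwidth check, a minimal sketch using the perftest tools (assuming the perftest package is installed and the device is mlx5_0 on both ends; the IPoIB address here is a placeholder):

node-a$ ib_send_bw -d mlx5_0 --report_gbits
node-b$ ib_send_bw -d mlx5_0 --report_gbits 192.168.100.1

The summary line at the end gives the average bandwidth in Gbps, which you can compare directly against the 100 Gbps line rate.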

Yeah, I’ve read that in the driver documentation/manual.

Yeah, for the CentOS “inbox” drivers, on the host/server; to enable RDMA, I do this:

To enable NFS over RDMA on host:
host$ sudo vi /etc/rdma/rdma.conf
XPRTRDMA_LOAD=yes
SVCRDMA_LOAD=yes
save, quit

and this:

host$ sudo vi /etc/sysconfig/nfs
RPCNFSDARGS="--rdma=20049"
save, quit

On the client side, I do this:

To enable NFS over RDMA on client:
client$ sudo vi /etc/rdma/rdma.conf
XPRTRDMA_LOAD=yes
save, quit

(Those are from my own notes from my OneNote.)

I haven’t tried running 100 Gbps IB in Ubuntu 18.04 (or any other version for that matter) since I posted this.

(Mostly because a lot of the other programs that I use do not always run in Ubuntu, for a variety of reasons (too many to list here), whereas those programs, except for OpenFOAM, run in CentOS, so that’s why I’ve stuck with CentOS on my micro HPC cluster.)

Maybe if I’m bored one day and want a headache, I’ll try this again on my Ryzen nodes. (lol…)

(My old Xeon cluster is decommissioned. Wife says that the system is too loud and gives her a headache.)

I just didn’t know if the setup was as easy as it is in CentOS, because in CentOS, you run:

# yum groupinstall -y 'Infiniband Support'

and it will pretty much install everything that you need for it.

And then you edit those files, set up your NFS shares, export them, and then mount them with:

host:/home/cluster /home/cluster nfs defaults,rdma,port=20049,noatime,nodiratime 0 0

and [boom], you’re up and running.

It’s really easy.
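
(For a quick one-off test, without editing /etc/fstab, the equivalent manual mount would be something like this, assuming the same export and port as above:

client$ sudo mount -t nfs -o rdma,port=20049 host:/home/cluster /home/cluster

and you can confirm it is actually using RDMA by checking that the mount shows proto=rdma in /proc/mounts.)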

I think that when I originally wrote this, Ubuntu wasn’t as easy as this; coupled with the fact that Mellanox had taken NFSoRDMA support out of their driver package at the time (again, inexplicably), I didn’t know if Ubuntu back then had a similar way of getting around the MLNX OFED limitation a la CentOS.

Thanks.

So what are the minimum requirements for attempting NFS over RDMA in the first place? My MikroTik switch does not do ECN, but perhaps that’s not needed for RoCE? This Wikipedia article would seem to indicate that lossless Ethernet is not a strict requirement.

So if I have a couple of RoCE-capable ConnectX-3 cards, then NFS over RDMA is theoretically possible? It probably won’t make too much of a difference, but my curiosity is piqued.

Wow, that sounds like a mess. Yes, I definitely agree the easiest thing is just to use a distro where the inbox drivers support it! IMO Mellanox is just doing planned obsolescence of the older NICs. You can hear the argument for that, but I think most times it’s about the money. In reality, something like OpenFOAM or Ansys (in our case) sees very little speedup from FDR to EDR IB; the big jump is having RDMA going at all. That doesn’t sell network cards, though.

Edit: I also found Mellanox support useless. They wanted me to pay to update the support contract for the switch just to send me docs. I did, and the docs they sent were the same ones Google found, with zero helpful information. I feel like 90% of the practical side of IT is learned in the field once you have the groundwork theory understood.

To learn which versions were supported I stepped through the latest five versions of every OS and just tried OFED. Let’s just say their matrix leaves much to be desired.

So…that depends, a little bit, on what you are willing to spend.

The Infiniband cards, even older/used ones (for example the 100 Gbps Mellanox ConnectX-4 dual-port VPI cards that I am using), still run, at minimum, between $350-400 USD per card.

The pricing of cables vary, mostly by length, then type. For example, direct attached copper (DAC) cables I think can be had for as little as maybe $50-100. (Again, varies depending on when you’re looking to buy them, and supply, etc.)

If you need cables that are longer than a few metres, then you’ll need fibre optic cables, which come in both passive and active flavours. I THINK the last one that I bought was a 100 m active optical cable (AOC) (QSFP28 to QSFP28) for around $160 USD per cable.

So, if you’re only trying to link two systems together, you can buy two NICs and an up-to-$100 1 m long DAC cable, and that will be enough to get you up and running.

But that’s also if you’re willing to spend $850 for 100 Gbps.

If, say, you know that you aren’t EVER going to REMOTELY come CLOSE to hitting or needing that kind of bandwidth, then whilst you can get it at a significant discount compared to retail pricing, there is an argument that can be made that you are spending money on bandwidth that you won’t ever be able to use.

At that point, you might be better off with either 10 Gbps (e.g. Mellanox ConnectX-3) or 25 Gbps or 40 Gbps or 56 Gbps. Again, each “tier” has its own pricing levels for cards, cables, etc.

If you want to connect more than two computers to each other (e.g. three), you can do it where computer A talks to B, and computer B talks to both A and computer C, but neither computer A nor computer C would be able to talk directly to each other (i.e. computer B MUST act as the messenger-in-the-middle to pass data between computer A and computer C).

More than that, then you pretty much NEED to have a switch. I only paid $2950 CAD for my 36-port Mellanox MSB7890 externally managed 100 Gbps Infiniband switch, so again, it depends on what you are willing to pay.

Bottom line:

To run NFSoRDMA, you need at least two NICs, both of which support RDMA.

And a cable that will connect the two cards together.

Stupid question - what’s “ecn”?

If it is about explicit congestion notification - I am not sure if you need ECN for RoCE.

I mean, you can implement it (cf. https://support.mellanox.com/s/article/how-to-configure-roce-over-a-lossless-fabric--pfc---ecn--end-to-end-using-connectx-4-and-spectrum--trust-l2-x), but I’m not sure that’s a requirement for RoCE.

You can consult either the Mellanox driver manual that’s appropriate for your target OS for details in regards to configuring RDMA and/or RoCE.

(RoCE assumes that you are running over the ethernet protocol, instead of, for example, over the Infiniband protocol.)

(I use the Infiniband protocol, not the Ethernet protocol, as I can assign an IPv4 static IP address via IPoIB; IPv4 is easier for software to work with than using the IB GUID or QP or something along those lines.)

RDMA over converged ethernet (RoCE) would need a card that supports RoCE, yes.

But if you just want to say, run NFS over RDMA (NFSoRDMA), you do not “need” to have ethernet (at all).

(Again, varies by implementation.)

For example:
If you have clients that you want to access your servers via NFSoRDMA, then your clients would also need to have that capability as well. And sometimes you might have that, whilst other times, you might not want to spend that kind of money on it.

For example, my Windows systems don’t really support NFSoRDMA with the default Windows Mellanox ConnectX-4 driver anyways, so that’s a little bit of a moot point. (I think that the Windows NFS “feature” can only mount the NFS export the “normal” way, i.e. NOT over RDMA, so enabling it for those clients would be moot.)

I forget if the Mellanox ConnectX-4 driver for Windows allowed for NFSoRDMA on Windows. (It’s been a LONGGGG time since I’ve used it/tested it in Windows.)

So, none of my Windows systems has it. I still access the server over just conventional gigabit ethernet.

Conversely though, my HPC clients (compute nodes), along with the system that runs the LTO-8 tape backup, all run Linux, and since they can all make use of NFSoRDMA, all of those systems have that feature and functionality enabled, because they can actually make use of it.

I am trying to be deliberately clear that the whole “converged ethernet” part is NOT needed to deploy RDMA, especially NFSoRDMA. For that, you can run entirely on the Infiniband protocol and it works fine. I’ve had pretty much no problems with it, so long as I am using CentOS. I vaguely recall that I might have had a tiny bit of an issue with Ubuntu not wanting to start up the subnet manager that’s needed for Infiniband to connect, whereas with Ethernet, you don’t need the subnet manager. But with CentOS, the OpenSM subnet manager runs just fine, so I have my micro HPC cluster head node also act as the subnet manager, and that brings my entire IB network online.
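
(For anyone following along, a minimal sketch of what running the subnet manager on a CentOS head node looks like, assuming the opensm package from the base repos:

host$ sudo yum install -y opensm
host$ sudo systemctl enable --now opensm

Once OpenSM is running somewhere on the fabric, ibstat on the other nodes should show the ports going from “Initializing” to “Active”.)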

(In theory, if CentOS is NOT the main OS that you want to use, you MIGHT be able to passthrough one of the IB ports to a VM that runs CentOS or maybe Mellanox might have fixed that in newer versions of the driver. I don’t really know, again, as I haven’t tried it in over 2.5 years. Once I got CentOS working, I just stuck with it.)

So the requirement on the hardware side, if you want to run NFSoRDMA, is just a NIC that supports RDMA. RDMA over Converged Ethernet is not even 100% necessary, unless you want to skip running the OpenSM subnet manager altogether. (OpenSM doesn’t take much to run at all.)

In my initial testing, on my dual VPI cards, I can configure the ports to run in ethernet mode instead of in IB mode, and that resulted in about a 1-3% additional overhead.

If you just want to run NFSoRDMA, you can get a couple of “normal” (original flavour) Mellanox ConnectX-3 cards (i.e. not the EN nor the LX versions) and a cable, install the OS of your choice, and then follow that OS’s specific instructions for enabling RDMA for NFS and deploying NFSoRDMA, whether that is with the “inbox” drivers that ship with the OS or with the MLNX OFED Linux driver.

(Again, if you’re using CentOS, the instructions for how to deploy NFSoRDMA are provided above; they come from my cluster deployment notes.)

It doesn’t really take a whole lot if you want to test it out.

To be able to take full advantage of it though, that also depends a LOT on the kind of hardware that you are connecting those NICs TO:

CPU, RAM, motherboard, availability of free PCIe lanes, are you using NVMe SSDs, SATA SSDs, or HDDs? etc.

Like the “bulk” storage that’s in my micro HPC cluster head node has eight 10 TB SAS 12 Gbps spinners in RAID5 (I think that it’s controlled with a LSI/Broadcom MegaRAID SAS 12 Gbps HW RAID HBA 9341-8i?), so the most that I can pull from that is about 800 MB/s or about 6.4 Gbps, which means that even 10 GbE would have sufficed.

My SATA 6 Gbps SSDs (four in RAID5, I think) don’t really fare that much better.

NVMe SSDs would be faster, but then I’d also just wear them out that much faster as well.

So, at the end of the day, it can be really cool to have NFSoRDMA, but if you don’t have the hardware that you would really be able to make use of it, again, it’s something nice to play with, but you might not realise the kind of benefits that you might be expecting, for example, over “conventional” 10 GbE.

But I leave that up to you.

I needed my 100 Gbps because of the type of HPC problems that I was solving. The NFSoRDMA was just the cherry on top.

1 Like

SOMETIMES, the inbox drivers are actually easier to use than the Mellanox drivers.

I think maybe once or twice I’d use the Mellanox driver package and tools in order to re-flash the firmware so that it wasn’t Dell firmware anymore, turning it back into the Mellanox “root” part as opposed to a Dell OEM-branded part, which would otherwise require Dell to keep the firmware available, etc., which can sometimes be a bit of a PITA.
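
(For anyone attempting the same thing, a rough sketch of the usual cross-flash procedure with the Mellanox Firmware Tools; the device path and image file name here are placeholders, and flashing the wrong image can brick a card, so double-check the PSID first:

host$ sudo mst start
host$ sudo flint -d /dev/mst/mt4115_pciconf0 query
host$ sudo flint -d /dev/mst/mt4115_pciconf0 -i fw-ConnectX4.bin -allow_psid_change burn

The -allow_psid_change flag is what lets you go from an OEM (e.g. Dell) PSID back to the stock Mellanox one.)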

Yes and no.

I mean, for business that can afford MSRP on their cards, they might not always necessarily care.

I mean, for pete’s sake, some HPC and hyperscalers are running 400 Gbps already. Like…geez…

So as those companies go through their upgrade cycles, the old hardware gets “dumped” (sold) to liquidators, and I got into running 100 Gbps Infiniband because Linus from Linus Tech Tips made a video about how much cheaper it is now to deploy it vs. when it was new at MSRP. (Which is true. Of course, it can still be more expensive than, say, a 10 GbE NIC that sports the Intel X550 chipset (although, apparently, the pricing for those isn’t particularly cheap either; they’re in about the same $350 price range), so if you’re going to pay the same price for 10 GbE vs. 100 Gbps, the only reason that I can think of as to why you wouldn’t go for the 100 Gbps is the cost of the cabling for 100 Gbps, plus the cost of the 100 Gbps switch.)

BUT, if you buy the switch ONCE, as a one-time investment, then to bring additional clients online, it’s just the cost of the NIC plus the cable. And then on a $/Gbps basis, 100 Gbps becomes a much more cost efficient option over 10 GbE.

I would suspect that Mellanox never intended on home labbers to be running Infiniband in their apartment/basement. But this is precisely what I’m doing.

In fact, I’ve talked with the IT admins for the HPC cluster at my work, and I was asking them how to they tune the IB performance to be fast for all sorts of different workloads, and we got into talking a little bit and I told them that I am asking because I’m running 100 Gbps IB in my basement, and their jaws dropped.

Pretty interesting when you “scare” the IT department with what you’re running at home.

I think that it depends on what it is that you’re doing, and also how you’re setting it up.

OpenFOAM, probably not as much, because CFD applications tend NOT to use as much memory compared to FEA applications.

(And even with FEA applications like Ansys, the direct sparse solver is the one that uses more memory compared to their iterative (PCG) solver, which means that the higher bandwidth is probably more noticeable with the direct sparse solver than with the iterative, PCG solver.)

But even then, not necessarily all cases/scenarios can make use of said direct sparse solver, and sometimes, you HAVE to use the iterative/PCG solver for your case instead.

I don’t know about the different generations (i.e. FDR vs. EDR), but I would theorise that the EDR should be at least a little bit faster, if for no other reason than it having a lower point-to-point latency.

Fluent, for example, will show, in the network bandwidth benchmark, that it can hit 10 GB/s (80 Gbps) speeds, even though solving the linear algebra problem is usually only up to about 35% of the entire solution time. (The linearisation from the second-order RANS PDE down to the linear system, which can then be solved by their AMG solver, is what gets sent, mostly, over IB.)

For the RDMA portion of it, it depends on where you have the working directory and/or _ProjectScratch folder (or if you’re using the Ansys ARC or RSM), it depends on where your cluster staging files are stored along with the scratch disk space. If you have to move data back and forth between stage and/or scratch disk, then yes, NFSoRDMA will help tremendously with that.

But if you’re caching it to a local RAM disk (or local SSD on the compute node), then the need for NFSoRDMA may be a little bit less. (I tied my SATA SSDs together because each disk was only 1 TB; by tying them together, I can make a 4 TB scratch disk space, and without doing that, each of the cluster nodes wouldn’t have had sufficient scratch disk space.)

For Ansys, I think that it will depend on how much time it spends actually solving the cases vs. how much time it needs to spend passing the data around into and out of memory (plus the speed of the CPU itself).

There are also times where (based on my experience at work) I know that throwing more hardware at a single run will only yield a marginal reduction in total wall clock simulation time, so sometimes I would purposely limit a case to use only 4 CPU cores, but then run more jobs in parallel, say to complete a large DOE test matrix. Each individual job takes longer to finish, but running more jobs simultaneously takes less wall clock time overall, EXCEPT for the very last run in the queue, which can’t be expanded on the fly to take advantage of the freed-up resources as they become available and accelerate that last job by throwing all of the hardware at it.

To that end, the individual jobs may not necessarily see a benefit of going from FDR to EDR, but the sum of all of the jobs might. (Again, I don’t know. It’s just a hypothesis from my end because I only have experience with EDR IB.)

Yes and no.

There’s a LOT of googling for each issue that I came across when I was doing my initial deployment and testing.

Sometimes it’s asking people in the respective forums, because other people have tried it before, so you don’t necessarily need to “re-invent” the wheel, so to speak; other times, you’re just so far off “normal” that people think you’re crazy and colouring WAYYYY outside the lines.

(I mean, how many people use 100 Gbps IB on their Ryzen 9 5950X systems, and run those nodes completely headless? I do. lol…)

My theory would be that when they’re testing stuff, they either don’t have a “big” testing budget, relatively speaking, to be able to test different permutations, or that their testing isn’t so automated (i.e. more manual involvement, and probably more than they would like to have).

On the other hand, I can also imagine that given the list of capabilities that these cards have or can have, it’s difficult to be able to test everything, permutatively just because there’s so much to it.

I’m AMAZED that they get a working product shipped out the door, along with working drivers considering what the hardware is capable of and the drivers that have to enable it.

Thanks for the detailed response, @alpha754293!

So I have a few ConnectX-3 and Intel X520 cards and a 10 Gbps MikroTik Ethernet switch. Performance is great with that setup on my home LAN. I’m definitely not going to be investing in Infiniband equipment, but never say never, I suppose…

Sounds like I might be able to experiment with RDMA with this very basic equipment, but my cheap switch is lacking data center bridging features like priority flow control that I understand are necessary for RoCE. But I’ve also read there is a version of RoCE that doesn’t require transport-level delivery guarantees, at the cost of some extra handshaking overhead. I know very little about this, so I appreciated your detailed response!

As for Microsoft, Windows for Workstations supports SMB over RDMA. Samba has some new experimental support for this, so that was something I wanted to play with at some point. Even without RDMA, however, file transfers are over 500 or 600 MByte/s, so not too shabby.

mith, maybe you mean Soft RoCE?

Older IB stuff is cheap, and you can do point-to-point with just two NICs and OpenSM running. You can go between two machines that way, and the cable might be the most expensive part.

With the Samba stuff, SMB Direct is impressive. Newer distros support it as a client, and Windows Server is a cakewalk. You can get it going as a server on Linux with ksmbd; Ubuntu 22.04 comes with it as a compiled package.

With SMB multichannel I can only get about 900 MB/s, while using a lot of network and CPU. RDMA is the way to go. From a Windows server to a Linux client I get transfers basically at line rate.

I am working on getting ksmbd going right now. If you have mixed Windows and Linux machines, it is a bit more convenient to either unify on SMB Direct or have both SMB Direct and NFSoRDMA available from a single server, especially if the distros you are using don’t have SMB Direct sorted out, which is the case with a lot of Linux distros where support has been experimental and disabled in the kernel. ksmbd is new and a bit buggy; trying to sort it out now.
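
(For reference, the basic ksmbd bring-up on Ubuntu 22.04 looks roughly like the sketch below; the share name, path, and user are placeholders, the service name may differ by distro, and the SMB Direct part still depends on the kernel having it enabled:

server$ sudo apt install ksmbd-tools
server$ sudo ksmbd.adduser -a smbuser
server$ sudo vi /etc/ksmbd/ksmbd.conf     # add a share, e.g. [data] with path = /srv/data and read only = no
server$ sudo systemctl restart ksmbd

After that, a Windows client that supports SMB Direct should negotiate RDMA automatically, provided both ends have RDMA-capable NICs.)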

Wasnt thinking about soft-roce, this wikipedia article on RoCE triggered me:

Link Level Flow Control: InfiniBand uses a credit-based algorithm to guarantee lossless HCA-to-HCA communication. RoCE runs on top of Ethernet. Implementations may require lossless Ethernet network for reaching to performance characteristics similar to InfiniBand. Lossless Ethernet is typically configured via Ethernet flow control or priority flow control (PFC). Configuring a Data center bridging (DCB) Ethernet network can be more complex than configuring an InfiniBand network.[18]

Maybe I’m reading too much into this statement, but it makes it seem that a lossless transport is not necessarily a requirement for RoCE? There’s also this https://mymellanox.force.com/mellanoxcommunity/s/article/howto-configure-roce-over-a-lossy-fabric–ecn–end-to-end-using-connectx-4-and-spectrum–trust-l3-x

I just don’t know how crucial the data center bridging features are, though. Is it just a matter of performance, or will things simply not work without it? Infiniband is starting to look a lot simpler. I can’t believe I just said that. Damn you all…

Life on the bleeding edge. Usually have to build from source until the distros sort it out.

Edit: Just found this. https://support.mellanox.com/s/article/introduction-to-resilient-roce—faq

Perhaps RoCE over lossy fabrics is a Mellanox-only thing? Anyway, it looks like ConnectX-4 is the minimum requirement.

Edit 2: A rebuttal to Mellanox and RoCE in general by Chelsio: https://www.chelsio.com/wp-content/uploads/resources/Resilient_RoCE.pdf They rip Mellanox to shreds, but it does explain what resilient RoCE is better than Mellanox themselves do.

So now I guess I’ll have to investigate iWARP

I got it working in RHEL 9 today; yes, 100% right, I had to build from source. On Ubuntu I am waiting on their tech support, as I ran into a bug that was fixed eight months ago according to the notes in git. The built-in package seems dated.

Ubuntu 22.04 still has the SMB Direct switch flipped off in the kernel, in spite of advertising the SMB Direct client as a new feature. I could try recompiling my own kernel, and then try to solve whatever else they haven’t done yet (the reason it was disabled), but I am lazy. It seems strange to have the ability to do SMB Direct as a server but have it disabled as a client. While that won’t matter for this machine, since it is the server, why have server mode available for SMB Direct but not client mode? I can see that being annoying later.

To alpha754293’s points: yes, that is a fair call with the different workloads and solvers. I agree.
Going off the deep end a little:
In the case of our process, Fluent solving is scripted, and we move the mesh files onto the root node over RDMA to its local NVMe. The solve runs directly from that, so we aren’t moving big data to the nodes, just whatever data flows during clustered solves. After the solution completes, the result set is generated, and then a script transfers it back over the RDMA channel. Those are pretty big files, but not crazy big. We had to weigh what performance we lost by going down to the FDR IB line rate vs. using the x16 slots on the dual Epyc motherboards for GPU solving. GPU solving won by a considerable margin. We only had one x8 slot remaining, so CX3 and FDR it was. A single GPU was no good; we only saw solid gains in solve times with one GPU per CPU.

We bog down on file transfer performance (without RDMA) in the generation of the models and the CAD manipulation of them. For example, each CAD file can be a few gigabytes; it’s a bit clunky without RDMA but can be OK, though a gigabit line is definitely painful. Workstation workflows do benefit from it.

The significant gains came from two of our jobs. All the data (CFD result sets, customers’ iterative CAD data) goes through an archival system: after a certain timespan of customer data being “unused”, an automated script moves it from the NVMe RAID over to the platter disk storage array. The file sizes are terabytes, across hundreds or thousands of tests. If I go on the road, I need to move some current projects to the mobile workstation, and I want that transfer done in minutes, not half a day. There is another machine running an AI workload doing data analysis on that archive, so the fast connection is important there as well.

The second essential workflow is 3D scan data. If you work with, say, a full-resolution 0.2 mm 3D scan of a car, that will be a 100-200 GB file, and that workflow is constantly saving before each step in data processing.
Consolidating it from multiple local file stores was a pain for us and wasted a lot of space. The software also crashes a lot, so you save a lot. Fast storage has cut many hours from that job, as has a centralized incremental backup system for current work.

mith:
Please let us know how you go with iWARP; it sounds like it could be promising.

I am not sure if prioritized flow control is absolutely critical to the deployment of RoCE.

I suspect that it probably would help, but I don’t know if it is a hard-and-fast requirement.

(I think that the examples that Mellanox gives for the Mellanox OFED drivers DO show them enabling PFC on the switches, but I am not sure what kind of adverse impact skipping those steps, because your switch doesn’t have those capabilities, may or may not have on RoCE performance.)

That might be the difference between RoCE v1 vs. RoCE v2.

Admittedly, I am not an expert in regards to RoCE, because I don’t run Ethernet on my IB cards. Even though I can set the ports to run in ETH mode rather than IB mode, I don’t have a 100 GbE switch (the $/port or $/Gbps to run 100 GbE is higher than running 100 Gbps IB, so I went with IB), so I probably won’t be super helpful, having never done it myself.

On the IB side, enabling RDMA is quite simple. Your software (if you’re using it for a computational application, for example) would need to be able to use it; or, in the case of NFSoRDMA, the deployment instructions for CentOS have been posted above (I think).

For other RoCE applications, unfortunately, I just don’t have any experience with that.

Sorry.

There IS SMB Direct, but if I recall correctly, the SERVER would need to be running either Windows (Professional or Enterprise) or Windows SERVER.

There are instructions that you can find via google on how to deploy that.

The problem that I have with it is that the storage backend on Windows Server SUCKS. WSS sucks.

CentOS with the volume formatted as XFS and then exported to the network via NFS (or NFSoRDMA) – I get better performance with that.

My current TrueNAS server doesn’t have a free PCIe 3.0 x16 slot for an IB card, otherwise I’d drop that in as well and enable RDMA in TrueNAS as well.

But also admittedly, my other storage servers are QNAP NAS units. It’s easier to use them, out of the box, than to try and set it up myself. (I was getting user permission conflicts from my CentOS cluster headnode, and I’m just too lazy/it’s too much of a headache to resolve, so I just pass the data through my QNAP NAS, where I DON’T have this issue. Sadly, to be able to get 100 GbE on a QNAP NAS, you have to buy almost their most top-of-the-line server, which is really expensive, so I don’t bother with that.)

The cards will likely still be the most expensive part. My Mellanox ConnectX-4 cards were $350 each (roughly). I think that the DAC cable was maybe $80. Something like that.

But yes, to go in between two systems, it’s not too bad.

It’s when you need to expand beyond two systems, that’s when it can start getting more expensive.

But there’s a company - I think it’s called T.E.S. Solutions - based, I think, out of Haifa, Israel, that resells used Mellanox hardware.

They have pretty decent pricing.

(https://www.tes-itsolutions.com/)

If you reach out to them, they might be able to help you beat the lowest price that you can find on eBay. The name of my contact there is Itzhak Finkelstein. If you DM me your email address, I can send you his contact information, and he might be able to put together a “kit” for you.

I’ve purchased cards from him, and it didn’t take that long for it to be shipped, I think it was via DHL (which DHL tends to be better for shipping electronics - not like FedEx and DEFINITELY not like UPS).

It’s worth a shot.

The same goes for anybody else who might read this thread in the future - if you are looking for Mellanox IB hardware, Itzhak Finkelstein might be able to assist you with that process.

(Disclaimer: I’m not affiliated with them, nor do I have an affiliate link with them. I’ve just been their customer before, and they were good to work with.)

(I’ll have to read and respond to the rest later.)

We buy from the same place, T.E.S.; they have good pricing for sure. I was thinking of ConnectX-3 cards when I was talking about inexpensive options. I think I saw some on T.E.S. for $40 or so; that could get you up to 56 Gbps without much money.

Yeah…it depends on the absolute cost vs. $/Gbps.

And yes, even 4x FDR IB (56 Gbps) is still PLENTY fast.

I don’t know if you’d be able to run those ports in ETH mode (I am assuming you can), but I would also think that trying to find 56 GbE switches would be a LOT harder than trying to find, say, 100 GbE switches (or 100 Gbps IB switches).

That’s the only “gotcha” that I can think of if one were to deploy 4x FDR (56 Gbps).

I scored a Mellanox SX6012 (12-port 56 Gbps) for ~$200 last year, plus a bunch of Mellanox ConnectX-3 Pro NICs (dual-port 56 Gbps), and with the proper cabling I can run all of these at 56 Gbps (about $100 per NIC + cable).
The overall cost was more than a “normal” person would spend on networking gear, but not out of this world for an enthusiast budget.

It took me considerable time to understand the software side of things. This was mainly caused by the fact that most guides are old (>5 years) and focused on setting up and tuning the Infiniband stack, which I don’t use because I have mostly Ethernet-only NICs. Linux (I use Fedora) has improved nicely and does not require 3rd-party drivers to be installed. Just pulling in the correct software packages (check the Fedora and Red Hat documentation) will enable RoCE without any additional intervention (check with “rdma link”).
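
For reference, a minimal sketch of what that looks like on Fedora; treat the package list as a starting point rather than something exhaustive:

sudo dnf install rdma-core libibverbs-utils perftest qperf
rdma link show        # the RoCE port should be listed with state ACTIVE
ibv_devices           # lists the RDMA-capable devices

If rdma link show reports the port as ACTIVE, the RoCE side is up and the RDMA test tools should work against it.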

I noticed that just plugging in the cables yields only a bit more than about two times 10G speed over a TCP network. That seemed unacceptable to me, as the goal of my investment was to get close to 5x 10G speed. Despite spending hours on the internet, I could not find actual performance test numbers cited by anyone for 56 Gbps gear.

After much troubleshooting I had to console myself to the fact that “close to 5x 10G speed” will be largely theoretical in my home lab for the foreseeable future due to a list of reasons:

  • Generally, apps don’t use RoCE, as RDMA requires using a special API (as stated above). E.g., iPerf3 tests don’t use the RoCE interface and will never show the full technology potential. For that you have to use the RDMA-specific test tools that come with the libibverbs/perftest packages or similar (see the sketch after this list). The observed difference in achieved bandwidth is ~4 GB/s (iPerf3) vs. ~6 GB/s (qperf).
  • Most people talk about using 56 Gbps gear for network storage. Enabling NFS over RDMA could not be simpler in Fedora: simply add “-o rdma” to the mount command. That’s where simple stops. Achieving >10 Gbps storage speed is surprisingly hard despite having multiple NVMe drives both on server and client machines. Do local speed tests first! I’m still working on that, but this is actually not my main use case.
  • Max bandwidth can only be achieved by aligning the stars (or, more specifically, tuning the RDMA/TCP stacks). ibv_devinfo will show the RDMA MTU value, which is different from the TCP MTU values, and I still to this day have not understood the relationship between them. On top of this, the RDMA MTU value is dependent on the TCP MTU value of the card. The difference is quite noticeable: out of the box, a 1500 (tcp) / 1024 (roce) MTU will test at about 4 GB/s with qperf; my X99 Intel-based gear yields max bandwidth when tuned to 4200 (tcp) / 4096 (roce), at about 6 GB/s with qperf. My AMD 5900X-based gear reaches max bandwidth with 9600 (tcp) / 4096 (roce) on the client (just shy of 6 GB/s) - the switch only allows a max MTU of 9216. I don’t understand yet why my Intel and AMD hardware requires different tuning of the software stack (using identical NICs) to reach max results. Also, I don’t understand yet how the MTU values in RoCE and TCP on the client, on the switch, and on the server interact. Logically, the best performance should be achieved when all of these align so that packets don’t have to be “repackaged” on their way…
  • ConnectX-3 Pro cards are connected to the motherboard via 8-lane PCIe Gen3: a total of 64 Gbps theoretical bandwidth. I found out the hard way that an 8-lane PCIe Gen3 slot connected to the chipset (even though the chipset has its own PCIe Gen4 uplink to the CPU) will not offer the full bandwidth. I am using the ASUS Pro WS X570-ACE motherboard because it offers the most usable way (for me) to access all of the PCIe bandwidth offered by the Ryzen platform. I use the 16 PCIe Gen4 lanes directly connected to the CPU in the form of two bifurcated x16 PCIe slots, which allow me to either add 4 Gen4 NVMe SSDs or a graphics card and two Gen4 SSDs. A third slot with 8 lanes of PCIe Gen4 is connected to the X570 chipset, which in turn is connected to the CPU. I prioritise local storage performance over network performance, so I add the 8-lane PCIe Gen3 NIC into the third slot connected to the chipset. In this configuration, qperf performance would not improve past about 5 GB/s. This puzzled me, as the 8-year-old Intel gear would show higher performance. Only moving the NIC into a direct CPU-connected slot would yield comparable performance (~6 GB/s).
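
As referenced in the first bullet, this is roughly how I compare TCP vs. RDMA throughput and latency with qperf (the hostname is a placeholder; qperf just needs to be running with no arguments on the server side):

server$ qperf
client$ qperf server-host tcp_bw tcp_lat
client$ qperf server-host rc_bw rc_lat

iPerf3 only ever exercises the TCP path, which is why its numbers sit well below the rc_bw results.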

Now I feel I have solved the mystery of disappearing bandwidth in my setup:

  • 56 Gbps advertised
  • ~48 Gbps max (~6 GB/s) observed in synthetic tests in ideal situations (using qperf)
  • ~33 Gbps max (~4 GB/s) observed in synthetic tests using non-RDMA testing software (iPerf3)
  • practical NFS-based tests (inconclusive due to open questions on the software and hardware setup) currently don’t show much improvement over 10 Gbps networking.

With this done, I can turn my attention to tuning the network stack towards more practical use. I noticed that tuning the network stack towards max bandwidth resulted in ever increasing latency.

Default configurations (1500/1024 mtu) will show ~10nsec latency in RDMA test tools (qperf), increasing to 50-60 nsec when tuned to max bandwidth.
Regular TCP applications will use the full TCP stack and show similar latency to 10 Gbps networking.

Fun stuff…

If anyone observed considerably different performance or has recommendations to tuning I am very interested :slight_smile:

1 Like

A couple of questions and comments for you:

  1. You mentioned that you’re using ConnectX-3 Pro cards, which are dual-port VPI cards.

Why are you using RoCE instead of IB RDMA and run IPoIB?

  2. From what I can tell, your Mellanox SX6012 switch is an IB switch.

Therefore; as such, I am still wondering why are you trying to run RDMA over converged ethernet (RoCE) rather than just running “straight” IB RDMA?

Yeah, I don’t have enough NVMe SSDs to be able to test it. The U.2 NVMe SSDs are just too expensive for me to use as a wear component, due to the finite write endurance limit. (Even if they can handle 1 DWPD or even 3 DWPD, for HPC applications it is very easy to burn through that very quickly, especially given the write performance that the drives can achieve.)

Therefore, and perhaps ironically, my 100 Gbps IB is connected to spinning rust, where I barely see >6 Gbps (~800 MB/s) write speeds to the array, which is well below what the drives should be capable of. (I am using eight HGST 10 TB SAS 12 Gbps 7200 rpm HDDs, so each drive should be capable of up to about 200 MB/s.)

re: MTU
Yes, configuring and optimising the MTU on IB hardware can be a lot more challenging than it is to optimise the MTU for “conventional” ethernet hardware.

I think that my IB cards are using the datagram mode instead of connected mode, and as such, I think that my MTU is capped off at either 2044 or 2048 bytes. I forget. But it’s quite low, relatively speaking, even though the card should be able to hit 4096 bytes.

And then if I want to be able to use the absolute max MTU of 9216 bytes, I would need to switch it from datagram mode to connected mode, and that brings with it a new set of challenges, where you are supposed to define the queue pairs (QPs), I think, more explicitly in a table somewhere so that it would be able to take advantage of that.

I don’t bother with that.
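
(For anyone who does want to poke at it, a quick sketch of checking and switching the IPoIB mode on an interface; ib0 is a placeholder for whatever your IPoIB interface is called, and not every driver/kernel combination still supports connected mode:

cat /sys/class/net/ib0/mode                       # prints “datagram” or “connected”
echo connected | sudo tee /sys/class/net/ib0/mode
sudo ip link set ib0 mtu 9216                     # connected mode allows a much larger IPoIB MTU than datagram mode

In datagram mode, the IPoIB MTU is limited by the IB path MTU, which is where the 2044-byte figure comes from.)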

I think that tuning the IB for performance works if you can expect or anticipate only a specific type of traffic traversing the line. But if the histogram of your file sizes and their number/distribution is like an average computer, where you have quite a mix of smaller files with a bunch of larger ones, then I have found that it is virtually impossible to tune the IB network so that it gives peak bandwidth for ALL ranges of file sizes, rather than being biased towards one or the other.

So, end net result - I don’t bother tuning my IB network at all.

re: PCIe lane management
Yes, this is a HUGE deal.

On my Asus ROG Strix X570-E Gaming WiFi II, if I plug my Mellanox ConnectX-4 100 Gbps IB card into the second PCIe slot (x16 physical/x4 electrical), I only get about 14 Gbps using ib_send_bw.

But if I take out the discrete GPU, and run the system completely headless and then plug said ConnectX-4 into the primary PCIe 4 x16 slot, then I am able to get full bandwidth on it.

So yes, it is very important that you manage your PCIe lane availability.
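
(A quick way to check what a card actually negotiated, with a placeholder PCI address; comparing LnkCap against LnkSta will show a downtrained link right away:

sudo lspci -s 0000:0b:00.0 -vv | grep -E 'LnkCap|LnkSta'

A card that should be at “Speed 8GT/s, Width x16” but reports a lower width or speed in LnkSta is either in the wrong slot, sharing lanes, or not seated properly.)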

But yes, sadly, being able to use even 20 Gbps or 40 Gbps out of the 56 Gbps that’s theoretically available is not a trivial task.

It looks like I no longer have the results for NFSoRDMA using RAM drives between my micro HPC cluster nodes, from back when I used to be able to do that with GlusterFS 3.7.

But I think that I might have tested it once between two Windows clients (also using software RAM drives), but I don’t remember what kind of transfer speeds I was able to obtain now.

But yes, it is actually quite difficult to be able to make use of all of that theoretically available bandwidth.

Interestingly enough though, for the price that you pay for IB hardware, the Intel X550 10 GbE NICs aren’t that cheap either. They’re actually pretty close to what I paid for my 100 Gbps IB NICs. If you’re going relatively short distances, you can probably save a LOT of money going with Cat 6A Ethernet cables rather than DACs or AOC cables.

But then again, I don’t use my 100 Gbps IB just for storage. I actually have HPC applications that can take advantage of the RDMA that it offers and THAT has a HUGE impact on those types of applications. The ability to use it for storage is just an added bonus at that point.

1 Like

Well, I actually have the Ethernet-only versions of these cards: Mellanox MCX314A-BCCT. I think these cards can somehow be cross-flashed, but I am too chicken to try it. Also, I am not sure what benefits I could expect.

The switch can act as IB only, Ethernet only, or VPI, where IB/EN can be specified on the port level.

Yep, that’s what actually got me to get all of this gear. I started by buying just two Aquantia 10 Gb NICs and found them to work great. But 10G switches were/are relatively expensive ($100+/port).
When I found out that for about the same amount of $$$ I could get “5x” the performance, I did not hesitate :slight_smile:

Right now I’ve sort of finished the “let’s figure out the underlying technology” phase and am getting ready to dive into some HPC-type applications I have in mind.