Does anybody here have experience setting up NFSoRDMA in Ubuntu 18.04 LTS with the "inbox" driver?

I’m trying to set up NFSoRDMA in Ubuntu 18.04 LTS using the rdma-core, mlx5_ib, etc. packages.

nfs-kernel-server has already been installed on the server and nfs-common has already been installed on the client.

I’ve confirmed that my Mellanox ConnectX-4 card is up and running via lsmod, lspci, ibv_devinfo, and ibstat.

ifconfig shows that an IPv4 address has been assigned, and both ping and rping work.

opensm is working (confirmed via ps aux | grep opensm).

When I try to mount:

# mount -v -o rdma,port=20049 server:/export/home/cluster /export/home/cluster

It says that the connection is refused.

I have no idea how to verify which version of NFS comes with nfs-kernel-server and/or nfs-common (e.g. whether it’s NFS v3, v4, or v4.1).

I also tried looking up the guides on how to update /usr/lib/systemd/scripts/nfs-utils_env.sh, to no avail.

Yes, I know that I can install the MLNX_OFED driver, but until relatively recently, Mellanox, in their infinite wisdom, had taken NFSoRDMA support out of the v4.6 version of their drivers. (v4.7 has it back in now, but I want to set this up with the “inbox” drivers in case Mellanox decides to pull it from their drivers again.)

Any help and/or suggestions would be greatly appreciated.

Thank you.


Fingers crossed, digging up a two-year-old thread … @alpha754293, did you get this working? I just received two ConnectX-3 Pro cards and want to try the same NFS over RDMA.

So I’m not sure if you are the same person who asked the question on the Mellanox community/forums as well (cf. https://community.mellanox.com/s/question/0D51T00007yiLFA/lts-ofed-support-for-nfsordma-on-lts-ubuntu), but on the assumption that you’re not, you might want to check there first.

If you are, I read the thread and they gave, once again, another less-than-intelligent answer.

(I had previously busted Mellanox for false advertising because they removed NFSoRDMA on their ConnectX-4 cards despite their advertising material saying that it supported NFSoRDMA from Mellanox’s own MLNX_OFED_LINUX drivers.)

And per that thread, I am not sure if you’ve read the release notes for that version of the MLNX_OFED_LINUX driver (see here: General Support in MLNX_OFED - MLNX_OFED v4.9-4.1.7.0 LTS - NVIDIA Networking Docs), but it says that it does still support NFSoRDMA in Ubuntu 18.04.3.

So, I’m not sure if you’re in a position where your cluster can use that specific version of Ubuntu with that specific version of the OFED LTS driver, but it would appear, at least on the surface, and according to the documentation, that that might be one of your possible solution paths.

To answer your original question though, I’m checking through my OneNote and it does NOT look like I ever got NFSoRDMA on Ubuntu 18.04.1 LTS to work. But then again, I was also using Mellanox’s 4.5-1.0.1.0 driver instead of 4.9, so…that might have something to do with it as well.

There should be a package mlnx-nfsrdma that’s installed with that version of the driver, so maybe that might work for you.

Unfortunately, I run CentOS pretty much exclusively (back when I was learning how to set all of this stuff up), and because it worked for me then with CentOS 7.6 (and now it’s worked for me up through and including CentOS 7.9, but I still stick with CentOS 7.7 for compatibility reasons), I haven’t tried it with Ubuntu at all since then.

I MIGHT play around with it within the next month or so, but since I also have a slightly newer card (ConnectX-4 vs. your ConnectX-3), there is a real possibility that what works for me may not necessarily work for you due to this difference as well.

Let me know if you have any more questions and I will do my best to try and help.

Thanks.

edit
Note: I ended up NOT using (or NOT trying to use) Ubuntu’s “inbox” Infiniband driver and was using Mellanox’s drivers instead.

BUT, having said that, you MIGHT try to see if the Ubuntu “inbox” Infiniband drivers work for you. I use CentOS’s “inbox” Infiniband driver because Mellanox, in their infinite wisdom, at the time when I was learning about this, inexplicably removed NFSoRDMA from their driver (which, like I said, I called them out on as false advertising on their website). I was using v4.5 back then, and they put it back in somewhere around v4.7, but by then I was so used to using CentOS’s “inbox” drivers that I didn’t really need Mellanox’s drivers anymore.

So, SOMETIMES, you might have more luck with Ubuntu’s drivers (if they have it) than with Mellanox’s drivers. And this is one of the reasons why I don’t use Mellanox drivers: they will take away features or support for features without rhyme or reason, so I’ve categorised them as not trustworthy (in that they could take the feature away again in a future version of the driver).

Hope this helps.


Nope, not me on the Mellanox forum… I had x-posted this to Reddit, as well as STH forums.

I am using the drivers supplied with the Linux kernel, which perhaps is part of my problem but I’m not certain. Is that the “inbox” driver?

I have been able to get NFSoRDMA working between certain nodes on my network, using Mellanox hardware, as well as SoftiWarp and RXE (Soft RoCE) on generic Intel 10g hardware. However, the results are not repeatable; I mean, I can’t take the exact same steps used to connect “client1” to “server1” and apply those same steps to connect “client2” to “server1”, despite all running identical hardware and OS/software.

I’m curious to try hardware iWarp but hesitant to buy the cards, modern ones still being rather expensive, and older ones being rather unsupported.

I’ve been working with Fedora 35 as the OS on all my machines. I have also tried Rocky 8.5 as an NFS client, and had success with SoftiWarp and RXE.

modinfo mlx4_core
filename:       /lib/modules/5.15.12-200.fc35.x86_64/kernel/drivers/net/ethernet/mellanox/mlx4/mlx4_core.ko.xz
version:        4.0-0
license:        Dual BSD/GPL
description:    Mellanox ConnectX HCA low-level driver
author:         Roland Dreier

lsmod |grep mlx
mlx4_ib               221184  0
mlx4_en               135168  0
mlx4_core             372736  2 mlx4_ib,mlx4_en
ib_uverbs             167936  2 mlx4_ib,rdma_ucm
ib_core               417792  13 rdma_cm,ib_ipoib,rpcrdma,mlx4_ib,ib_srpt,iw_cm,ib_iser,ib_umad,ib_isert,rdma_ucm,ib_uverbs,siw,ib_cm

lsmod |grep ib
mlx4_ib               221184  0
mlx4_core             372736  2 mlx4_ib,mlx4_en
ib_srpt                69632  0
ib_isert               57344  0
iscsi_target_mod      360448  1 ib_isert
target_core_mod       430080  3 iscsi_target_mod,ib_srpt,ib_isert
ib_iser                49152  0
libiscsi               73728  1 ib_iser
scsi_transport_iscsi   143360  2 ib_iser,libiscsi
ib_umad                40960  0
rdma_cm               131072  5 rpcrdma,ib_srpt,ib_iser,ib_isert,rdma_ucm
ib_ipoib              151552  0
libarc4                16384  1 mac80211
ib_cm                 143360  3 rdma_cm,ib_ipoib,ib_srpt
ib_uverbs             167936  2 mlx4_ib,rdma_ucm
ib_core               417792  13 rdma_cm,ib_ipoib,rpcrdma,mlx4_ib,ib_srpt,iw_cm,ib_iser,ib_umad,ib_isert,rdma_ucm,ib_uverbs,siw,ib_cm

Sorry for the delay. I don’t always see the email notifications that someone has responded to this thread.

Unfortunately, I don’t have any experience with Fedora 35, so I might not be able to give you specific or exact instructions on how to deploy this, step-by-step.

But, let me maybe ask a few probing questions:

(I am running CentOS and they have a package group called Infiniband Support which will install a bunch of Infiniband-related tools. Not sure what Fedora 35’s version of that is called, but you might be able to google it. If not, I’ll try and see if I can get my CentOS system to dump out what’s all installed in the Infiniband Support group so that I can give you a list of packages that you can then manually try and find if Fedora 35 didn’t package them together for you.)

  1. Are you running an Infiniband switch? If so, what make/model? Is it a managed switch or is it an externally managed switch? (I ask this because I am running a Mellanox MSB-7890 36-port 100 Gbps Infiniband switch and that’s externally managed, meaning I need another system running Linux, running the OpenSM subnet manager to get my IB up and running.)

If you are running a managed switch, consult the switch’s owner’s manual/user’s guide on how to make sure that you have one, maybe up to two, subnet managers running. (I only use one, but I know that Mellanox’s managed switches can run up to two subnet managers; if you have more running, that can cause problems for the switch itself.)

If it is an externally managed switch, you can check that on your Linux systems.

  2. If you run sudo iblinkinfo, does it report back what you expect it to?

(i.e. all links to and from the switch are active and up?)

  3. If (2) doesn’t work, then try running sudo ibstat and see if the State: of the port is Active, and also whether the Physical state: says “LinkUp” or not.

I forget what it says when it is waiting for the subnet manager to come online, but I’ve had it happen where all of the rest of my clients were booted up and ready before my cluster headnode was. The headnode runs opensm, so all of the client nodes were in a sort of “standby” mode until the headnode came up. Once the headnode boots up and opensm is running, all of the clients come up on the IB network as well.
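
(For reference, and hedging that the exact fields vary a bit by card and driver, a healthy port in sudo ibstat output looks roughly like this, with placeholder LID/rate values:

CA 'mlx5_0'
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 3
                SM lid: 1
                Link layer: InfiniBand

If State: shows Initializing while Physical state: shows LinkUp, that usually means the port has a cable link but no subnet manager has configured it yet.)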

  4. If your card is running, and you have no problems with opensm running, and your switch is running, then the next thing that you would need to do is to make sure that RDMA is up and running on your systems.

In CentOS, I would enable rdma by typing:

sudo chkconfig rdma on

And also you can check the status of it by typing:

sudo systemctl status rdma

  5. Once you have confirmed that it is working, in CentOS, I would have to edit /etc/sysconfig/nfs where it says:

RPCNFSDARGS=

to

RPCNFSDARGS="--rdma=20049"
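
(Hedged aside, since you’re on Fedora: if /etc/sysconfig/nfs doesn’t exist there, I believe the equivalent setting lives in /etc/nfs.conf, roughly:

[nfsd]
rdma=y
rdma-port=20049

but check nfs.conf(5) on your distro to confirm the exact option names, since I’m going from memory here.)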

I would also have to edit /etc/rdma/rdma.conf on the host to read:

XPRTRDMA_LOAD=yes
SVCRDMA_LOAD=yes

and on the client, confirm in /etc/rdma/rdma.conf that it says:

XPRTRDMA_LOAD=yes

Your NFS exports file would be a normal NFS export:

/etc/exports:

/path/to/export/folder *(rw,[options])

whatever options you normally use for your NFS exports (note: no space between the host spec and the parenthesised options, otherwise /etc/exports treats them as separate entries).
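
(A concrete sketch, with a made-up subnet and the export path from my original mount command:

/export/home/cluster 10.0.1.0/24(rw,sync,no_root_squash,no_subtree_check)

Adjust the network and options to whatever you actually use.)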

And then I would restart NFS using the command (run the following as root):

service nfslock stop; service nfs stop; service portmap stop; umount /proc/fs/nfsd; service portmap start; service nfs start; service nfslock start
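
(On a systemd-based install, I believe the rough equivalent of that chain is just:

sudo exportfs -ra
sudo systemctl restart nfs-server

but double-check your distro’s service names; I’m going from my CentOS 7 notes here.)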

  6. On the client side, NFS mounts in /etc/fstab look like this:

server:/path/to/nfs /mnt/point nfs defaults,rdma,port=20049,[options] 0 0

Whatever other options you may have for the mounts.

  7. And then in my notes for my CentOS deployment, I have the lines:

sudo chkconfig nfs on

and also

sudo chkconfig nfslock on as well.

You can then reboot your server/host first and wait for that to all come up before rebooting your client, and then you can re-run some of the IB diagnostics to see if it has made a successful connection to the switch and from your client to the server, by either using regular ping or using ibping.

(Regular ping is sometimes more telling than ibping because if normal ping doesn’t work, then there might be something that has to do with incorrect and/or improper IPv4 address assignment.)
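
(If you do want to try ibping, a minimal sketch from memory: it needs a responder running on one end, and the destination LID is whatever ibstat reports for the server’s port.

On the server: sudo ibping -S

On the client: sudo ibping 3 (where 3 is a placeholder for the server’s Base lid)

ibping and the rest of these tools come from the infiniband-diags package in the Infiniband Support group.)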

You’ll want to watch to make sure that the network services started up properly for you and that there were no failures or error messages that popped up during boot for either your server and/or your clients.

If everything works well, then you should be able to create a dummy text file on your server to see if the clients can see it and then see if your client can rm that dummy text file and create a new, dummy text file.

If there are no access issues, then the next thing that you can do is to try and see if you can send a decently large file over (e.g. a 10 GiB file, if you have one), because you want there to be enough time to see whether your GUI system monitor picks up the network traffic or not. (With RDMA, you shouldn’t see the network traffic in the system monitor, because that’s kind of the whole point of RDMA: bypassing the network stack.)

rsync can be useful because it can still report the transfer rate over RDMA. (Or if you are just using the GUI for a normal copy operation, that should still be able to tell you what the transfer rate is and system monitor still won’t show it despite that. Or at least this is true in CentOS 7.7.1908.)
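
(A quick sketch of that test, with made-up paths: create a big throwaway file and copy it onto the RDMA-mounted export while watching the system monitor.

dd if=/dev/zero of=/tmp/bigfile.bin bs=1M count=10240     # ~10 GiB test file
rsync -ah --info=progress2 /tmp/bigfile.bin /export/home/cluster/

rsync’s --info=progress2 prints the effective transfer rate for the whole copy, which is handy when the system monitor (correctly) shows nothing.)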

Not sure how these steps will work for you in Fedora 35, but if you find the Fedora 35 “version” of these steps, then you should be able to get your NFSoRDMA up and running on Fedora 35.

(Sidebar: I’ve found, for example, that I can get CentOS to be the NFSoRDMA server, whereas I couldn’t get CAE Linux 2018 (which was originally built on Ubuntu 16.04 LTS) to be the NFSoRDMA server.)

One of the errors that I’ve also encountered (I forget if it was with Ubuntu or some distro other than CentOS) complained that the RDMA port 20049 for the NFS mount option in /etc/fstab wasn’t a valid option. (Might have been on TrueNAS Core 12.0 U1.1? I forget.)

But either way, my point is that you might run into a problem like that, where it doesn’t like the mount port, and/or your specific distro’s version of the NFS client does NOT have RDMA available as an option (if you are using the “inbox” NFS drivers and Infiniband drivers).

(This is one of the reasons why I ended up picking CentOS 7.7 as my OS of choice: it met these functional/operational criteria, despite being an older OS. But you can always upgrade the kernel from elrepo-kernel or do other stuff to bring the OS up to date with other distros like Fedora 35, unless there is something that you need out of Fedora 35 that you can’t get (or that’s harder to get working) in CentOS. YMMV.)

Hopefully this helps.

Thanks.


Thank you for your detailed reply!

The infiniband support package is the same under Fedora, with all the ib* tools in it. Most of the tools return an error indicating there is no infiniband network present. I found a posting on a network mailing list from 2019 that indicates most of the ib* tool stack does not work for RoCE.

I do not have an infiniband switch or a managed switch that supports infiniband or PFC/QoS. I am using a basic layer 3 managed switch, an Aruba 5406R ZL2. I guess that’s part of the unknown for me: are RoCE and Infiniband the same thing? I don’t have Infiniband cards either, or maybe I do? The cards I bought are ConnectX-3 Pro EN (which I believe are firmware-locked to 10G Ethernet, instead of 40G Infiniband?). I don’t have any Infiniband subnet managers running, that I’m aware of.

There’s no distinct rdma service any more as far as I can see. If udev and systemd detect an RDMA capable card installed, certain changes are automatically made (and that’s probably a bad thing in my case since I don’t know what’s going on.) This post on reddit links to several of the changes that aren’t reflected in documentation for RDMA https://www.reddit.com/r/HPC/comments/p46et0/centos_84_does_not_provide_rdmaservice_and/

Same goes for svcrdma and xprtrdma. Those modules have been replaced with a module named rpcrdma, and it’s loaded and active. The rdma port is also already set up in rpc.

I’ve kinda given up at this point. None of my storage at home is fast enough to challenge the CPU, where RDMA offers the most benefit. Even reading and writing from a consumer level nvme SSD is barely a blip for the CPU.

I do appreciate your responses, sooner or later I’ll revisit this, maybe when datacenter grade nvme shows up cheap again on ebay.

Ahh…that makes sense.

No, it’s not.

Remote direct memory access (RDMA) over Converged Ethernet is different from Infiniband.

So here is how you can check:
If you have the “Infiniband Support” group of tools installed on your OS, then you should have a tool called mstconfig.

If you run lspci | grep Mellanox, you should see something like:

04:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
04:00.1 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]

The 04:00.0 is going to be the device ID that you are going to use for the mstconfig command:

sudo mstconfig -d 04:00.0 q

and it should output something like this:

Device #1:
----------

Device type:    ConnectX4
Name:           N/A
Description:    N/A
Device:         04:00.0

Configurations:                              Next Boot
         MEMIC_BAR_SIZE                      0
         MEMIC_SIZE_LIMIT                    _256KB(1)
         FLEX_PARSER_PROFILE_ENABLE          0
         FLEX_IPV4_OVER_VXLAN_PORT           0
         ROCE_NEXT_PROTOCOL                  254
         NON_PREFETCHABLE_PF_BAR             False(0)
         NUM_OF_VFS                          8
         FPP_EN                              True(1)
         SRIOV_EN                            False(0)
         PF_LOG_BAR_SIZE                     5
         VF_LOG_BAR_SIZE                     1
         NUM_PF_MSIX                         63
         NUM_VF_MSIX                         11
         INT_LOG_MAX_PAYLOAD_SIZE            AUTOMATIC(0)
         SW_RECOVERY_ON_ERRORS               False(0)
         RESET_WITH_HOST_ON_ERRORS           False(0)
         CQE_COMPRESSION                     BALANCED(0)
         IP_OVER_VXLAN_EN                    False(0)
         PCI_ATOMIC_MODE                     PCI_ATOMIC_DISABLED_EXT_ATOMIC_ENABLED(0)
         LRO_LOG_TIMEOUT0                    6
         LRO_LOG_TIMEOUT1                    7
         LRO_LOG_TIMEOUT2                    8
         LRO_LOG_TIMEOUT3                    13
         LOG_DCR_HASH_TABLE_SIZE             14
         DCR_LIFO_SIZE                       16384
         LINK_TYPE_P1                        IB(1)
         LINK_TYPE_P2                        IB(1)
         ROCE_CC_PRIO_MASK_P1                255
         ROCE_CC_ALGORITHM_P1                ECN(0)
         ROCE_CC_PRIO_MASK_P2                255
         ROCE_CC_ALGORITHM_P2                ECN(0)
         CLAMP_TGT_RATE_AFTER_TIME_INC_P1    True(1)
         CLAMP_TGT_RATE_P1                   False(0)
         RPG_TIME_RESET_P1                   300
         RPG_BYTE_RESET_P1                   32767
         RPG_THRESHOLD_P1                    1
         RPG_MAX_RATE_P1                     0
         RPG_AI_RATE_P1                      5
         RPG_HAI_RATE_P1                     50
         RPG_GD_P1                           11
         RPG_MIN_DEC_FAC_P1                  50
         RPG_MIN_RATE_P1                     1
         RATE_TO_SET_ON_FIRST_CNP_P1         0
         DCE_TCP_G_P1                        1019
         DCE_TCP_RTT_P1                      1
         RATE_REDUCE_MONITOR_PERIOD_P1       4
         INITIAL_ALPHA_VALUE_P1              1023
         MIN_TIME_BETWEEN_CNPS_P1            0
         CNP_802P_PRIO_P1                    6
         CNP_DSCP_P1                         48
         CLAMP_TGT_RATE_AFTER_TIME_INC_P2    True(1)
         CLAMP_TGT_RATE_P2                   False(0)
         RPG_TIME_RESET_P2                   300
         RPG_BYTE_RESET_P2                   32767
         RPG_THRESHOLD_P2                    1
         RPG_MAX_RATE_P2                     0
         RPG_AI_RATE_P2                      5
         RPG_HAI_RATE_P2                     50
         RPG_GD_P2                           11
         RPG_MIN_DEC_FAC_P2                  50
         RPG_MIN_RATE_P2                     1
         RATE_TO_SET_ON_FIRST_CNP_P2         0
         DCE_TCP_G_P2                        1019
         DCE_TCP_RTT_P2                      1
         RATE_REDUCE_MONITOR_PERIOD_P2       4
         INITIAL_ALPHA_VALUE_P2              1023
         MIN_TIME_BETWEEN_CNPS_P2            0
         CNP_802P_PRIO_P2                    6
         CNP_DSCP_P2                         48
         LLDP_NB_DCBX_P1                     False(0)
         LLDP_NB_RX_MODE_P1                  OFF(0)
         LLDP_NB_TX_MODE_P1                  OFF(0)
         LLDP_NB_DCBX_P2                     False(0)
         LLDP_NB_RX_MODE_P2                  OFF(0)
         LLDP_NB_TX_MODE_P2                  OFF(0)
         DCBX_IEEE_P1                        True(1)
         DCBX_CEE_P1                         True(1)
         DCBX_WILLING_P1                     True(1)
         DCBX_IEEE_P2                        True(1)
         DCBX_CEE_P2                         True(1)
         DCBX_WILLING_P2                     True(1)
         KEEP_ETH_LINK_UP_P1                 True(1)
         KEEP_IB_LINK_UP_P1                  False(0)
         KEEP_LINK_UP_ON_BOOT_P1             False(0)
         KEEP_LINK_UP_ON_STANDBY_P1          False(0)
         KEEP_ETH_LINK_UP_P2                 True(1)
         KEEP_IB_LINK_UP_P2                  False(0)
         KEEP_LINK_UP_ON_BOOT_P2             False(0)
         KEEP_LINK_UP_ON_STANDBY_P2          False(0)
         NUM_OF_VL_P1                        _4_VLs(3)
         NUM_OF_TC_P1                        _8_TCs(0)
         NUM_OF_PFC_P1                       8
         NUM_OF_VL_P2                        _4_VLs(3)
         NUM_OF_TC_P2                        _8_TCs(0)
         NUM_OF_PFC_P2                       8
         DUP_MAC_ACTION_P1                   LAST_CFG(0)
         SRIOV_IB_ROUTING_MODE_P1            LID(1)
         IB_ROUTING_MODE_P1                  LID(1)
         DUP_MAC_ACTION_P2                   LAST_CFG(0)
         SRIOV_IB_ROUTING_MODE_P2            LID(1)
         IB_ROUTING_MODE_P2                  LID(1)
         PCI_WR_ORDERING                     per_mkey(0)
         MULTI_PORT_VHCA_EN                  False(0)
         PORT_OWNER                          True(1)
         ALLOW_RD_COUNTERS                   True(1)
         RENEG_ON_CHANGE                     True(1)
         TRACER_ENABLE                       True(1)
         BOOT_UNDI_NETWORK_WAIT              0
         UEFI_HII_EN                         True(1)
         BOOT_DBG_LOG                        False(0)
         UEFI_LOGS                           DISABLED(0)
         BOOT_VLAN                           1
         LEGACY_BOOT_PROTOCOL                NONE(0)
         BOOT_RETRY_CNT                      NONE(0)
         BOOT_LACP_DIS                       True(1)
         BOOT_VLAN_EN                        False(0)
         BOOT_PKEY                           0
         ADVANCED_PCI_SETTINGS               False(0)
         SAFE_MODE_THRESHOLD                 10
         SAFE_MODE_ENABLE                    True(1)

You will notice that there is an entry in there for LINK_TYPE_P1 which, in my case, says “IB(1)”.

I forget if your card has Mellanox’s virtual port interface (VPI) technology where you can change the port type from Infiniband to Ethernet or Ethernet back to Infiniband.

One of the ways that you could test it would be to issue the command:

sudo mstconfig -d 04:00.0 set LINK_TYPE_P1=2

and then query your card again to make sure that the changes were successful.

If it worked, then you are now running your Infiniband card with the ports in Ethernet mode.
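
(Small sketch of that re-check, and note that the query shows the “Next Boot” configuration, so the new link type only takes effect after you reboot or power-cycle the system:

sudo mstconfig -d 04:00.0 q | grep LINK_TYPE

LINK_TYPE_P1 should then read something like ETH(2) instead of IB(1).)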

If not, then this might be one of the reasons why your attempts at trying to run RDMA over Converged Ethernet (RoCE) isn’t working for you.

(For me, I am running straight RDMA (over Infiniband), and so, I don’t have this problem with the whole Ethernet layer/side of things.)

Your cards should be ETH-only then, in which case you are absolutely correct that you don’t need to have a subnet manager running at all.

Yeah…so it really depends.

Like I am using four HGST 6 TB SATA 6 Gbps 7200 rpm mechanically rotating hard disk drives in RAID0, so it’s not terribly fast either. (The best that I’ve been able to get is about 800 MB/s combined write speed or about 200 MB/s contribution per drive, which works out to be around 6.4 Gbps.)

So by that metric, even regular 10 GbE would work and it would be a LOT easier to set up than RoCE.

In my use case, which is understandably a lot different from what most other people use it for, I will actually push the 100 Gbps Infiniband up to around 80 Gbps or so for high performance computing/computer-aided engineering applications.

That’s where it really comes into play.

All of this NFS over RDMA stuff is just an added extra bonus of “things that I can do with 100 Gbps Infiniband” that I couldn’t really do with normal gigabit ethernet.

I think that when I first got my hardware, I did run some tests with straight NFS (no RDMA) and it was maybe doing about 150 MB/s writes.

So, needless to say, the RDMA DOES make things faster, but with mechanically rotating hard disk drives, it has an upper limit based on the physics of the medium itself. (Not going to be breaking any records anytime soon.)

Conversely, because I went straight to 100 Gbps Infiniband, I’ve kind of skipped over the 10 Gigabit layer/step entirely (along with also skipping 25G, 40G, 50G, and even 56G).

So I use it as much as possible, in as many ways as possible, in as many places/systems as possible.

I installed mstconfig using the mstflint package, here’s the output

# mstconfig -d 85:00.0 q

Device #1:
----------

Device type:    ConnectX3Pro
Device:         85:00.0

Configurations:                              Next Boot
         SRIOV_EN                            False(0)
         NUM_OF_VFS                          0
         PHY_TYPE_P1                         0
         XFI_MODE_P1                         _10G(0)
         FORCE_MODE_P1                       False(0)
         PHY_TYPE_P2                         0
         XFI_MODE_P2                         _10G(0)
         FORCE_MODE_P2                       False(0)
         LOG_BAR_SIZE                        0
         BOOT_OPTION_ROM_EN_P1               False(0)
         BOOT_VLAN_EN_P1                     False(0)
         BOOT_RETRY_CNT_P1                   0
         LEGACY_BOOT_PROTOCOL_P1             None(0)
         BOOT_VLAN_P1                        0
         BOOT_OPTION_ROM_EN_P2               False(0)
         BOOT_VLAN_EN_P2                     False(0)
         BOOT_RETRY_CNT_P2                   0
         LEGACY_BOOT_PROTOCOL_P2             None(0)
         BOOT_VLAN_P2                        0
         IP_VER_P1                           IPv4(0)
         IP_VER_P2                           IPv4(0)
         CQ_TIMESTAMP                        False(0)
         STEER_FORCE_VLAN                    False(0)

I don’t see any mention of RoCE in there … I’m wondering if that feature is disabled? These are unfortunately HP OEM cards, rather than pure Mellanox.

Let me go look at and/or download the user’s guide/manual for your ConnectX-3 EN.

Hang on.

Would you happen to know what the HP part numbers are for the card?

Or how the HP part numbers map onto the Mellanox part numbers?

I downloaded the product brief for the ConnectX-3 EN cards and apparently, it comes in four flavours:

MCX311A-XCAT Single 10 GbE SFP+
MCX312A-XCBT Dual 10 GbE SFP+
MCX313A-BCBT Single 40/56 GbE QSFP
MCX314A-BCBT Dual 40/56 GbE QSFP

So it would depend on what the HP part numbers map onto the Mellanox part numbers.

I somehow missed this thread. @alpha754293 you are awesome. That is all. Thank you.


@wendell
Thank you.

You’re awesome yourself as well.

This is the card I’m using, the 546SFP+. Looks like that would be the MCX312A-XCBT?

https://support.hpe.com/hpesc/public/docDisplay?docId=c04636262&docLocale=en_US

Pulled this from lspci

                Product Name: HP Ethernet 10G 2-port 546SFP+ Adapter
                Read-only fields:
                        [PN] Part number: 779793-B21
                        [EC] Engineering changes: C-5733
                        [SN] Serial number: IL273302BL
                        [V0] Vendor specific: PCIe 10GbE x8 6W
                        [V2] Vendor specific: 5733
                        [V4] Vendor specific: 98F2B3CE9B50
                        [V5] Vendor specific: 0C
                        [VA] Vendor specific: HP:V2=MFG:V3=FW_VER:V4=MAC:V5=PCAR
                        [VB] Vendor specific: HP ConnectX-3Pro SFP+

So with it being a ConnectX-3 Pro EN card, the Mellanox part number that the dual 10 GbE SFP+ variant maps onto is the MCX312B-XCCT.

“Since the SM is not present, querying a path is impossible. Therefore, the path record structure must be filled with relevant values before establishing a connection. Hence, it is recommended working with RDMA-CM to establish a connection as it takes care of filling the path record structure”

(Source: RDMA over Converged Ethernet (RoCE) - MLNX_EN v4.9-4.1.7.0 LTS - NVIDIA Networking Docs)

You can read this section (RDMA over Converged Ethernet (RoCE) - MLNX_OFED v4.9-4.1.7.0 LTS - NVIDIA Networking Docs) on how to set up and enable RoCE.

Again, with your card being a ConnectX-3 Pro EN, you can use RoCE v2 if you want to.

I’ve never tried setting up RoCE on my Infiniband cards, but from reading the MLNX OFED driver documentation about it (the documentation for the EN driver points to the “full” MLNX OFED driver documentation for instructions on how to enable RoCE), it seems like it would be quite the pain in the absence of a subnet manager, because you may have to perform extra steps according to the documentation/instructions, so I have no idea how to do that.

Sorry.


The info in that KB article seems to be one of the steps I was missing. Enabling RoCE v2 via modprobe options for mlx4_core has gotten me a few steps farther! I can now do rping and ucmatose between the servers, in both directions.

I was able to get NFS to mount using version 3, which I think is OK. It says it’s using RDMA, but when I copy large files, I’m able to see the traffic on the interface counters, which means the kernel/OS can see the traffic as well, so RDMA isn’t actually doing its magic? At least traffic is showing up on the correct interface!

Command: mount -o proto=rdma 10.19.0.14:/mnt/vmstore /mnt/nfs-ssd
Result:

10.19.0.14:/mnt/vmstore on /mnt/nfs-ssd type nfs (rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,mountaddr=10.19.0.14,mountvers=3,mountproto=tcp,local_lock=none,addr=10.19.0.14)

and on the NFS server:

Jan 26 11:15:18 server kernel: mlx4_core 0000:0b:00.0: Have more references for index 1, no need to modify mac table
Jan 26 11:15:18 server rpc.mountd[4062]: authenticated mount request from 10.19.0.16:1006 for /mnt/vmstore (/mnt/vmstore)
Jan 26 11:15:18 server kernel: mlx4_core 0000:0b:00.0: Registering MAC: 0x98f2b3ce9b51 for port 2 without duplicate

Yeah…I’m not really sure how RDMA over Converged Ethernet works because in theory, RDMA shouldn’t affect the network counters, but the Ethernet should/does.

re: " mount -o proto=rdma 10.19.0.14:/mnt/vmstore /mnt/nfs-ssd"
I was reading the Linux man pages for nfs(5) and I don’t have a better understanding of how proto interacts with mountproto when proto=rdma.

(Source: nfs(5) - Linux man page)

So…I have no idea.

(edit#2 On the IB side, things work a little bit differently:

aes0:/home/cluster on /home/cluster type nfs4 (rw,noatime,nodiratime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,clientaddr=10.0.1.117,local_lock=none,addr=10.0.1.100)

I don’t have mountproto for my NFSoRDMA mounts.)

I guess that one of the things that you might be able to check would be to see if you have any appreciable speed differences and/or any noticeable reduction in CPU load/utilisation (if any, at all).

I know that you said that even without RoCE, the CPU utilisation was very low, so I’m not sure if you will really see or notice a difference.

But if rping and ucmatose are working, that suggests it is working for you, and I assume this also means that prior to that, rping and ucmatose WEREN’T working for you.

On my Infiniband side of things, RDMA works differently because with IB, I am LITERALLY and COMPLETELY skipping over the Ethernet layer and I assign an IPv4 address (which works with IPoIB), so none of my “normal” system monitoring tools for the network will read or be able to pick up on the IB/RDMA traffic.

As such, I have no idea how RoCE works because I don’t have it deployed on my cluster.

(If 100 GbE switches weren’t still so damn expensive, I might look into potentially deploying 100 GbE as well, because my ConnectX-4 cards are dual-port VPI, which means that on the same card, I can have one port running in IB mode and the other port running in ETH mode. But at that point, the PCIe 3.0 x16 interface will become my limiting factor, because that can only support up to 128 Gbps and a single 100 Gbps IB port can already take up almost all of that bandwidth, which means for me to get more bandwidth, I would have to replace my cards and systems with something that supports at least PCIe 4.0 x16, which I’m not looking to do. But I digress.)

Sounds like it’s up and running for you.

Just make sure that your /etc/rdma/rdma.conf parameters:

on the host and clients:
XPRTRDMA_LOAD=yes

And additionally, on the host:
SVCRDMA_LOAD=yes

are set.

(If you want to test it, and if your systems have enough RAM available, you can try and create a RAM drive (tmpfs), write a file to it, and then send it over and see what kind of speeds you’re getting. That might be one way for you to test to see if RDMA is working for you or not.)
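
(Rough sketch of that tmpfs test, with placeholder sizes and paths; size the ramdisk to fit comfortably in your free RAM:

sudo mkdir -p /mnt/ramdisk
sudo mount -t tmpfs -o size=8G tmpfs /mnt/ramdisk
dd if=/dev/zero of=/mnt/ramdisk/test.bin bs=1M count=4096
time cp /mnt/ramdisk/test.bin /path/to/nfs/mount/

That takes the local disks out of the equation, so whatever rate you see is closer to what the network/RDMA path itself can do.)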

I have also found that in “normal” data transfers (i.e. for storage management), RDMA is really limited to the speeds of the storage devices. So for me, because I am using spinning rust hard drives, 800 MB/s write (~6.4 Gbps) is about the best that I can do.

Conversely, when I use an application that uses the message passing interface (MPI), the application can use up to around 80 Gbps during a solve process.

So, true to its name, remote direct memory access - the speeds are usually realised when it’s RAM-to-RAM transfers. I think that even with four Samsung 860 EVO 1 TB SATA 6 Gbps SSDs in RAID0, the best that I’ve been able to see is maybe around 4.69 GB/s (~37.52 Gbps), but that’s the exception rather than the norm. (And sometimes, this is possible due to RDMA to RAM to SSD cache so it’s measuring the interface-to-cache speed.)

So…that might be a way for you to test to check and verify that RoCE is working properly for you.

edit

I’m not sure if it matters or not, but I am using NFS version 4.1 on my CentOS cluster.
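
(If you want to check what your own machines support and what a given mount actually negotiated, from memory the quick ways are:

cat /proc/fs/nfsd/versions     # on the server: which NFS versions nfsd has enabled
cat /proc/fs/nfsd/portlist     # on the server: listening transports; an "rdma 20049" line should appear here
nfsstat -m                     # on the client: the vers= and proto= that each mount ended up with

If the rdma line is missing from portlist, that would also explain a flat “connection refused” on port 20049.)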


What a great thread, I’m learning a lot and appreciate your detailed responses.

Good point about ethernet frames incrementing the counters, that could be what’s going on. I was using glances to monitor CPU and network while copying a 100G file from one SSD to another over the network, so it could have been showing me ethernet utilization rather than tcp/ip throughput. I will have to use other tools to see exactly which counters are incrementing.

As I understand it, after reading a kernel mailing list thread from 2012 (here), mountproto is the method used to connect and disconnect the mount for NFS v2 and v3 connections, whereas proto is the transport method used for carrying the data. So TCP was used to authenticate or whatever with the NFS server, and RDMA is used to transport the data, I hope?
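
(Writing the explicit form out for my own notes, using the addresses from earlier: something like

mount -t nfs -o vers=3,proto=rdma,port=20049,mountproto=tcp 10.19.0.14:/mnt/vmstore /mnt/nfs-ssd

should pin both transports. And with NFSv4 there is no separate MOUNT protocol at all, which would explain why the v4 mount output shown above has no mountproto option.)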

I’m not certain when, but it looks like the xprtrdma and svcrdma modules are obsolete, and replaced by rpcrdma now… maybe something new in Linux 5.x?

modinfo xprtrdma
filename:       /lib/modules/5.15.14-200.fc35.x86_64/kernel/net/sunrpc/xprtrdma/rpcrdma.ko.xz
alias:          rpcrdma6
alias:          xprtrdma
alias:          svcrdma
license:        Dual BSD/GPL
description:    RPC/RDMA Transport
author:         Open Grid Computing and Network Appliance, Inc.
depends:        ib_core,sunrpc,rdma_cm

An oddity I noticed: one of my virtual machines seems to have established an NFSv4 RDMA connection to the host server. The VM is using an SR-IOV virtual function passed through from the Mellanox card. Not sure why it works, but I won’t complain.

the-ripper:/mnt/storage/movies on /mnt/storage/movies type nfs4 (rw,relatime,vers=4.2,rsize=8192,wsize=8192,namlen=255,soft,proto=rdma,port=20049,timeo=14,retrans=2,sec=sys,clientaddr=192.168.2.12,local_lock=none,addr=192.168.2.14)

However, if I force my test environment to use nfs vers=4, it complains with a protocol error.

Hey at least this is enough progress to keep me interested.

Testing with cached reads on the server and a ramdisk on the client, I was seeing about 8 out of 10 gbit at times, just showing up as packets on the interface. The TCP and UDP transport numbers were barely moving, probably from other things the server is doing.

The cp command was showing between 70 and 99% single thread CPU usage. But maybe that’s just how cp works, rather than cpu load from transport overhead. I didn’t see nfsd spike on cpu usage for example.

So I guess that means RDMA is actually skipping some of the middle-men along the way!

Now I just need to figure out why my Ubuntu virtual machine will talk NFS v4.2 over RDMA to my server, but my Fedora physical machine will only talk NFS v3 over RDMA.

EDIT:
A bit more not-very-scientific testing. Copying from a ramdisk to a ramdisk:

RDMA direct cable connection: Peak of 8gbit sustained, nfsd cpu usage 6 to 9%.
RDMA through basic L3 switch (no DCB/QoS support): Peak of 8gbit sustained, nfsd cpu usage 9-11% (more retries maybe?)
TCP direct connection: Peak of 6gbit sustained, nfsd cpu usage 16-20%
TCP through the switch: same as above

Thank you and you’re welcome.

Yeah, so for IB, because RDMA completely bypasses Ethernet and the network stack, I HAVE to use IB-specific tools to be able to read the NIC’s packet counters, etc.
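
(In case it’s useful on your side too, the ways I’d read those counters, assuming the infiniband-diags package and whatever device name ibv_devinfo reports (mlx5_0 here is just my card):

sudo perfquery
cat /sys/class/infiniband/mlx5_0/ports/1/counters/port_xmit_data
cat /sys/class/infiniband/mlx5_0/ports/1/counters/port_rcv_data

perfquery dumps the per-port hardware counters, and the sysfs files are the same counters exposed directly by the kernel.)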

(I haven’t run my IB card in ETH mode in probably 3-5 years now, so my memory of it is a little bit fuzzy. I think I VAGUELY remember that when I do run it in ETH mode, the GNOME System Monitor will pick up on the ethernet traffic (and show/plot it). But I don’t remember if that was when I JUST got my cards, when I was just using a DAC cable between two systems to test them out and make sure the cards were working, and LONG before I learned everything else that I have learned since then.)

Yeah…not sure.

My CentOS cluster headnode is still running the 3.10.0-1062 kernel that ships with CentOS 7.7.1908. My principal thought process with running such an “old” kernel is that the 4930K in my cluster headnode doesn’t have NVMe slots; it doesn’t have a lot of things. So, if it isn’t broken, there is no real reason for me to update it. (If it works, don’t break it.)

I also, further, don’t know if those modules are a RedHat/CentOS thing or if it is something else. shrug

For the record though, svcrdma doesn’t appear to be a module because when I type in lsmod | grep rdma, it doesn’t show up in there.

As I mentioned, it’s in /etc/rdma/rdma.conf. Again, shrug. YMMV.

Like I said earlier, “if it works, don’t break it!”

lol…

(Sidenote: I would recommend keeping a OneNote or some other note-taking tool/app/whatever so that you can document your deployment notes from the lessons learned here. This is how I was able to provide the instructions for how I deploy my NFSoRDMA setup. If you have stumbled upon what is working for you, now would be a good time to go back through the command history and jot that down, so that in the event that you have to redeploy your server/clients/both, you will have all of the commands that you need for said deployment.)

I’m glad that this appears to be working for you now.

Yay!

That’s pretty good. 80% of the theoretical/rated capacity is usually the max that I would observe in service/in practice.

Yeah…so, that depends.

Like I know that if I use rsync, rsync has/does its own thing to make sure the sources and targets are the same, so it’s doing some kind of checking with that, so I know that takes CPU load to do that kind of processing.

For nfsd, I would see it show up in the load averages in top, but in the GNOME System Monitor, it doesn’t show up as CPU load. shrug

Yeah…that’s weird.

And what makes that even weirder is that my older CentOS 7.7.1908 is able to negotiate the NFS connection using nfs4 version 4.1 automatically whilst the newer Fedora didn’t/couldn’t.

So, there might be a bit more research needed into that, if you are really that interested in it. (I don’t have Fedora set up/deployed, so unfortunately, I don’t have a way to test that; plus, again, the fact that I am using IB instead of ETH also makes a difference.)

re: your results
Looks like NFSoRDMA between your server and your Fedora and Ubuntu clients is working (although, like you said, it might appear that the Ubuntu VM, with SR-IOV, is working a LITTLE bit better than your physical Fedora client). But it still appears to be working, which is the important piece.

(Sidenote: I have struggled to get Ubuntu to be the NFSoRDMA SERVER. I don’t have anything in my OneNote notes that shows I ever got that working properly, because it would complain about proto=rdma, nor would it take port=20049. So…it seems that Linux distros derived from RedHat work better as NFSoRDMA servers than Ubuntu servers/systems. I might revisit that in the future, but again, right now, CentOS is working for me as my NFSoRDMA server, so I’m not going to spend much time messing with it when it (already) works. Ubuntu clients seem to work better than Ubuntu servers for this. shrug Go figure.)

You get higher transfer rates and lower CPU usage with NFSoRDMA vs. without, so that gives some data/confidence that it appears to be doing what it is supposed to.

I just googled “RDMA copy” and this is a project that came up:

So…it might be worthwhile for you to look into, to see if that might help reduce your CPU load even further, if you really want/need/are otherwise interested in trying to do so.

But 8 Gbps on a 10 GbE link is about where I would expect it to max out unless you start spending quite a bit of time, tuning the network performance parameters.

For an untuned network, it’s still pretty good, I think.

For my 100 Gbps IB network, given that in practice my slowest storage devices are mechanically rotating hard drives, I don’t even bother trying to tune my network, because the best write speed that I can get is about 800 MB/s (or ~6.4 Gbps out of the 100 Gbps that’s theoretically possible).

It’s good enough. It’s faster than my GbE network, and I try not to use SSDs of any kind (if I can avoid it, as much as possible), because if I managed to “unlock” writing to, say, NVMe SSDs at the full ~12.5 GB/s (100 Gbps), I would burn through the write endurance limit on said SSDs, even enterprise-grade ones, through repeated use of that blazing-fast speed (which would mean the SSD would die, and I would have to try and RMA it for a new one).

So I don’t bother with it anymore.

Yes, it’s nice to see those speeds, but if/when you handle or process enough data where you can kill a SSD in a little over a year-and-a-half, it just isn’t worth it to me anymore.


I’ll check that out. I have a few bookmarks saved for doing zfs snapshots send/recv over RDMA, that’s probably the bulk of what I’d be sending over the network between the two machines.

I wonder if this has something to do with the bottleneck? I’m not sure if the 16.000 Gb/s the kernel is talking about is gigabytes or gigabits, but it seems my network card is only linking at PCIe 1.0 speed (2.5 GT/s), vs. 5.0 GT/s for PCIe 2.0 and 8.0 GT/s for PCIe 3.0, IIRC.

threadripper machine:

mlx4_core 0000:0b:00.0: 16.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s PCIe x8 link at 0000:00:03.2 (capable of 63.008 Gb/s with 8.0 GT/s PCIe x8 link)

dual xeon machine:

mlx4_core 0000:85:00.0: 63.008 Gb/s available PCIe bandwidth (8.0 GT/s PCIe x8 link)

Unfortunately, I don’t have any direct experience related to this, but my thought is that depending on the mechanism that it is using for the data transfer (whether it’s “straight” cp, or rsync, or RDMA copy), there might be ways to make the transfers go faster.

As a part of my google search for “RDMA copy” yesterday, I also found this (that might be of interest to you) if your transfers are primarily unidirectional:

Gigabits.

A PCIe 1.0 2.5 GT/s x8 link is capable of 2.000 gigabytes per second transfers (16 gigabits/s).

(Source: PCI Express - Wikipedia)

Whereas a PCIe 3.0 8 GT/s x8 link is capable of 7.877 GB/s (64 Gb/s) transfers.

But it is interesting and strange that your card is only connecting at PCIe 1.0 x8, so I would check to make sure that the BIOS settings are correct, and you might need to power down the system so that you can unseat and reseat the card to make sure that it is making proper electrical contact.
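
(One quick check before pulling the card, using the 0b:00.0 address from your dmesg line above: lspci can show both what the slot/card is capable of and what the link actually negotiated:

sudo lspci -vv -s 0b:00.0 | grep -E 'LnkCap|LnkSta'

LnkCap is the capability and LnkSta is what the link actually trained at; if LnkSta says 2.5GT/s while LnkCap says 8GT/s, something (a BIOS setting, a riser, or seating) downgraded the link.)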

It seems strange that your Threadripper machine is doing that.

But to your question if it might have something to do with it - it is entirely possible.

I don’t remember if the PCI Express specification denotes the bit rates or the transfer rates (in transfer per second) as being unidirectional or bidirectional.

If it is bidirectional, then it would make sense for the card to be limited to an 8 Gbps transfer rate (half of 16 Gbps). But if it is unidirectional, then even with an 8 Gbps transfer rate, there should still be another 8 Gbps of headroom available, so I’m not really sure.

But I would definitely look into that and try and figure out why it is doing that because your Threadripper system should have no shortage of PCIe 4.0 lanes to support this even if the card itself is only a PCIe 3.0 card.