How to set up RoCE?

Hello, I recently updated my networking to 40GbE.

I am using a Brocade ICX6610 switch and two Mellanox ConnectX-3 VPI cards between my PC and NAS. Both systems run Arch Linux.

My link is at 40Gbps but my transfer speeds are ~22-25Gbps.

I am looking to implement RDMA over Converged Ethernet (RoCE), but I don’t know where to start. I don’t understand the InfiniBand stuff at all.

Any pointers?

Edit: The Red Hat Documentation suggests that RDMA works automatically: Chapter 13. Configure InfiniBand and RDMA Networks Red Hat Enterprise Linux 7 | Red Hat Customer Portal

I will check to see if I am missing any modules/packages and report back.
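
For reference, this is roughly what I plan to check on the Arch side (the package and module names below are my assumptions for a ConnectX-3 card):

# Userspace RDMA libraries and tools
sudo pacman -S rdma-core

# Kernel modules I believe are needed for a ConnectX-3 doing RoCE
sudo modprobe mlx4_core mlx4_en mlx4_ib   # NIC driver plus its InfiniBand/RoCE side
sudo modprobe ib_core rdma_cm ib_uverbs   # RDMA core, connection manager, userspace verbs

# If the card is recognized as an RDMA device, this should list it with a port state
ibv_devinfo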

What did you use to test your throughput? Was it a single stream test or parallel?

Getting line rate on a NIC can be difficult, especially when you are trying to do it with a single stream.

Are you using jumbo frames?
Does your benchmark tool support zero-copy?
Does your benchmark tool support setting the receive/send socket buffers/windows?
How busy were the CPUs on either endpoint? Is your switch busy?
What kind of tuning is on the switch? Any PFC, CoS shaping / traffic differentiation, link aggregation?
Are you traversing a firewall, or do you have firewalling enabled on either endpoint?
Did you pin your benchmark tools to the NUMA domain that has direct access to your NIC?
What performance profile are you running on the client and the server?
Performance governors? CPU C-states?
What MSI mode are you using?
Have you applied network stack tuning for 40Gbit like the suggestions on fasterdata.es.net?
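
For that last one, here is a rough sketch of the kind of sysctl tuning fasterdata suggests for 40Gbit hosts (the exact values are illustrative placeholders, not prescriptions):

# /etc/sysctl.d/90-40g-tuning.conf (example file name), illustrative values only
# Larger maximum socket buffers so TCP can open a big enough window
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
# TCP autotuning limits: min, default, max (bytes)
net.ipv4.tcp_rmem = 4096 87380 33554432
net.ipv4.tcp_wmem = 4096 65536 33554432
# Fair-queueing qdisc and a modern congestion control help at high rates
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = htcp
# Path MTU probing, useful once jumbo frames come into the picture
net.ipv4.tcp_mtu_probing = 1
# Apply with: sysctl --system (after dropping the file into /etc/sysctl.d/)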

Many questions with many knobs and levers to turn and pull.

Good luck and have fun :slight_smile:

3 Likes

… this and other things @greg_at_redhat mentions.

You don’t necessarily need all of them for mere 40Gbps over TCP, but all of them help a little.

Also, you can try perf top and perf stat … you might get lucky and something recognizable might jump out at you as an outlier.
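
For example, something like this while the benchmark is running (iperf3 is just my assumption about the tool):

perf top                        # live view of the hottest kernel/userspace symbols during the transfer
perf stat -a -- sleep 10        # system-wide counters (cycles, IPC, context switches) over a 10 second window
perf stat -p $(pidof iperf3)    # or attach to the benchmark process directly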

Also, are these by any chance 10+ year old pass-me-down Xeons, or are they more modern machines?

2 Likes

These are some great pointers, I will be very happy to look into all of these. Thank you for your reply!

Just some general info for anyone interested:

The PC is running a Threadripper 2920X; the NAS is a Dell R720xd server running two E5-2650 v2s.

No jumbo frames at the moment.
Testing was done using iperf. Parallel streams did not make much of a difference. I did not try with multiple processes.
CPUs were basically idle.
No tuning on the switch, unfortunately it does not support PFC. The switch has only the two 40Gb hosts on it ATM.
No firewalls on or between the hosts.
I did not look into NUMA at all; this might be critical for my problem, as the NAS is running two CPUs (and it is possibly an issue on the Threadripper, which is running in NUMA mode). See the sketch after this list.
Both systems are running the schedutil CPU governor.
No network tuning as of yet on the hosts.
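
Once the NAS is back up, this is roughly how I plan to check NUMA locality and pin the test (the interface name and address are placeholders for my setup, and I will probably switch to iperf3):

# Which NUMA node the NIC sits on (-1 means no affinity reported)
cat /sys/class/net/enp65s0/device/numa_node

# Pin server and client to that node (node 0 here) and use parallel streams
numactl --cpunodebind=0 --membind=0 iperf3 -s                          # on the NAS
numactl --cpunodebind=0 --membind=0 iperf3 -c 192.168.1.2 -P 4 -t 30   # on the PC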

I unexpectedly had to take the NAS down today to move it because we have to do some repairs at the house, so I won’t be able to run any tests for a little while. I will have a look at everything mentioned and report back once everything is running again.

Some other considerations:

  • Are these (client/server) virtual machines?

    • If so, are they running in KVM?
    • If they are in KVM, are they using the virtio network driver/interface?
    • If it is the virtio network driver, have you enabled multi-queue support?
  • Are they bare metal (client/server) and Linux?

    • Have you looked at the rx/tx ring configuration on the NIC?
    • Have you considered enabling RX and TX packet steering in the host to leverage SMP?
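
On bare metal, checking those last two looks something like this (eth0, the ring sizes, and the CPU mask are placeholders; your hardware maximums will differ):

# Show the current vs. maximum RX/TX ring sizes, then raise them toward the hardware max
ethtool -g eth0
ethtool -G eth0 rx 4096 tx 4096

# Receive/Transmit Packet Steering: spread per-queue packet processing across CPUs
# (ff is a CPU bitmask placeholder meaning CPUs 0-7)
echo ff > /sys/class/net/eth0/queues/rx-0/rps_cpus
echo ff > /sys/class/net/eth0/queues/tx-0/xps_cpus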

We have some KBs on network performance tuning on our site that you might want to look at:

Before you begin
You are running RHEL 7 or the latest compatible SUSE Linux Enterprise Server 12 or 15 service pack operating system. See the NetApp Interoperability Matrix Tool for a complete list of the latest requirements.
Procedure
Install the rdma-core and nvme-cli packages:

zypper install rdma-core

zypper install nvme-cli

RHEL 7

yum install rdma-core

yum install nvme-cli

Set up IPv4 addresses on the Ethernet ports used to connect NVMe over RoCE. For each network interface, create a configuration script that contains the different variables for that interface.
The variables used in this step are based on server hardware and the network environment. The variables include the IPADDR and GATEWAY. These are example instructions for the latest SUSE Linux Enterprise Server 12 service pack:

Create the example file /etc/sysconfig/network/ifcfg-eth4 as follows:

BOOTPROTO='static'
BROADCAST=
ETHTOOL_OPTIONS=
IPADDR='192.168.1.87/24'
GATEWAY='192.168.1.1'
MTU=
NAME='MT27800 Family [ConnectX-5]'
NETWORK=
REMOTE_IPADDR=
STARTMODE='auto'
Create the second example file /etc/sysconfig/network/ifcfg-eth5 as follows:

BOOTPROTO='static'
BROADCAST=
ETHTOOL_OPTIONS=
IPADDR='192.168.2.87/24'
GATEWAY='192.168.2.1'
MTU=
NAME='MT27800 Family [ConnectX-5]'
NETWORK=
REMOTE_IPADDR=
STARTMODE='auto'
Enable the network interfaces:

ifup eth4

ifup eth5

Set up the NVMe-oF layer on the host.
Create the following file under /etc/modules-load.d/ to load the nvme-rdma kernel module and ensure it is always loaded, even after a reboot:

cat /etc/modules-load.d/nvme-rdma.conf

nvme-rdma
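
Once the module is loaded, a quick sanity check that the RDMA stack actually sees the card could look like this (the module and device names are guesses for the ConnectX-3 generation; the address is just the example one from above):

# Confirm the modules are loaded and the RDMA device is visible
lsmod | grep -E 'nvme_rdma|mlx4_ib'
rdma link show          # from iproute2: lists RDMA links and their netdev bindings
ibv_devinfo             # from rdma-core: port state should read PORT_ACTIVE

# Optional raw RDMA bandwidth test between the two hosts (perftest package)
ib_send_bw              # on one host (server side)
ib_send_bw 192.168.1.87 # on the other host, pointing at the first one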

1 Like