[Solved] Dell S6000 40GbE basic setup

TL;DR - I’m clearly missing something in my switch config; I can’t get basic L2 switching working correctly. Ping and ssh work, but any attempt to move traffic beyond a few KB/s kills the socket.

[solution]

I set the MTU to 9000 everywhere and it still failed. I then set the switch to 12000 (its internally stated max) and the clients to 9000, and now it works.

Dell(conf)#interface fortyGigE 0/0
Dell(conf-if-fo-0/0)#mtu 12000
Dell(conf-if-fo-0/0)#interface fortyGigE 0/8
Dell(conf-if-fo-0/8)#mtu 12000
Dell(conf-if-fo-0/8)#interface fortyGigE 0/4
Dell(conf-if-fo-0/4)#mtu 12000
Dell(conf-if-fo-0/4)#interface fortyGigE 0/12
Dell(conf-if-fo-0/12)#mtu 12000

I’m guessing that 9000 + ~100-300 bytes would have worked as well, since that’s what I recall the Linux kernel/driver bug missing in its header computation…
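
A quick way to verify the jumbo path end to end, for anyone following along (Linux iputils ping; 8972 = 9000 minus 20 bytes of IP header + 8 bytes of ICMP header - this should succeed at a true 9000 path MTU and fail with “message too long” otherwise):

# ping -M do -s 8972 10.6.9.20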

I’ve managed to factory-reset my eBay’d S6000 switch by interrupting the boot:

(esc)
BOOT_USER# ignore enable-password
BOOT_USER# reload
(normal boot)
Dell>enable
Dell#restore factory-defaults stack_unit all clear_all
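
One caveat for anyone repeating this: FTOS changes live in the running config only, so once you’ve set things up, save them or a reboot wipes them:

Dell#copy running-config startup-config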

I then connected exactly two Linux machines to ports 0 and 8. These machines are running Mellanox VPI adaptors set to “eth” mode. For reference, here is a single-threaded iperf with those two machines connected point-to-point with a copper QSFP cable:

# iperf3 -c 10.6.9.20
Connecting to host 10.6.9.20, port 5201
[  4] local 10.6.9.22 port 39030 connected to 10.6.9.20 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec  2.47 GBytes  21.2 Gbits/sec    0    866 KBytes       
[  4]   1.00-2.00   sec  2.75 GBytes  23.6 Gbits/sec    0    866 KBytes       
[  4]   2.00-3.00   sec  2.75 GBytes  23.6 Gbits/sec    0    866 KBytes       
[  4]   3.00-4.00   sec  2.75 GBytes  23.6 Gbits/sec    0    866 KBytes       
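
(Aside, for anyone else with VPI cards: the “eth” mode switch is done with Mellanox’s MFT tools. The device path below is a placeholder - use whatever mst status reports on your box - and the change needs a driver reload/reboot to take effect. 2 = ETH:)

# mst start
# mlxconfig -d /dev/mst/mt4099_pci_cr0 set LINK_TYPE_P1=2 LINK_TYPE_P2=2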

Here are the setup steps I’ve taken with the switch:

Dell>enable
Dell#conf
Dell(conf)#stack-unit 0 provision S6000
Dell(conf)#interface fortyGigE 0/0
Dell(conf-if-fo-0/0)#no shutdown
Dell(conf-if-fo-0/0)#switchport
Dell(conf-if-fo-0/0)#interface fortyGigE 0/8
Dell(conf-if-fo-0/8)#no shutdown
Dell(conf-if-fo-0/8)#switchport
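
To sanity-check that both ports actually came up at 40G after this, the standard FTOS show commands should suffice, e.g.:

Dell#show interfaces status
Dell#show interfaces fortyGigE 0/0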

I can ping:

PING 10.6.9.22 (10.6.9.22) 56(84) bytes of data.
64 bytes from 10.6.9.22: icmp_seq=1 ttl=64 time=0.117 ms
64 bytes from 10.6.9.22: icmp_seq=2 ttl=64 time=0.087 ms
64 bytes from 10.6.9.22: icmp_seq=3 ttl=64 time=0.099 ms
64 bytes from 10.6.9.22: icmp_seq=4 ttl=64 time=0.120 ms
64 bytes from 10.6.9.22: icmp_seq=5 ttl=64 time=0.114 ms

I can ssh and do simple commands without issue:

ssh [email protected]
# cd /bin
# ls
abrt-action-analyze-backtrace
abrt-action-analyze-c
abrt-action-analyze-ccpp-local
abrt-action-analyze-core
abrt-action-analyze-oops
abrt-action-analyze-python
abrt-action-analyze-vmcore
....

HOWEVER, any higher-bandwidth activity hangs the link:

Connecting to host 10.6.9.20, port 5201
[  4] local 10.6.9.22 port 39002 connected to 10.6.9.20 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec  6.25 MBytes  52.4 Mbits/sec    2   8.75 KBytes       
[  4]   1.00-2.00   sec  0.00 Bytes  0.00 bits/sec    1   8.75 KBytes       
[  4]   2.00-3.00   sec  0.00 Bytes  0.00 bits/sec    0   8.75 KBytes       
[  4]   3.00-4.00   sec  0.00 Bytes  0.00 bits/sec    1   8.75 KBytes       
[  4]   4.00-5.00   sec  0.00 Bytes  0.00 bits/sec    0   8.75 KBytes     

The “hang” appears to operate at the socket level… ^C-ing iperf allows ping to work again afterward.

....
stack-unit 0 quad-port-profile 0,8,16,24,32,36,40,44,48,52,56,60,64,68,72,76,80,84,88,92,100,108,116,124
!
stack-unit 0 provision S6000
!
interface fortyGigE 0/0
 no ip address
 switchport
 no shutdown
!       
....
interface fortyGigE 0/8
 no ip address
 switchport
 no shutdown
!

EDIT: I’ve tried ports 8 & 12 as well - identical result.
I also see no errors/drops reported on the client interfaces:

ens1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
        inet 10.6.9.22  netmask 255.255.255.0  broadcast 10.6.9.255
        RX packets 147973  bytes 8946946 (8.5 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 2312667  bytes 20721437985 (19.2 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
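
(ifconfig’s counters don’t always catch NIC-level discards; if anyone wants to dig deeper, the driver’s extended stats via ethtool are worth a look - counter names vary by driver:)

# ethtool -S ens1 | grep -iE 'drop|discard|error'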

Well… MTU=9000 appears to be the problem…

But why??? Does this switch really not support jumbo frames over 40GbE?

I dropped the client MTUs back to 1500 and now:

iperf3 -c 10.6.9.20 -P 8
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  4.65 GBytes  3.99 Gbits/sec    0             sender
[  4]   0.00-10.00  sec  4.64 GBytes  3.98 Gbits/sec                  receiver
[  6]   0.00-10.00  sec  4.65 GBytes  3.99 Gbits/sec    0             sender
[  6]   0.00-10.00  sec  4.64 GBytes  3.98 Gbits/sec                  receiver
[  8]   0.00-10.00  sec  4.65 GBytes  3.99 Gbits/sec    0             sender
[  8]   0.00-10.00  sec  4.64 GBytes  3.98 Gbits/sec                  receiver
[ 10]   0.00-10.00  sec  4.65 GBytes  3.99 Gbits/sec    0             sender
[ 10]   0.00-10.00  sec  4.63 GBytes  3.98 Gbits/sec                  receiver
[ 12]   0.00-10.00  sec  4.65 GBytes  3.99 Gbits/sec    0             sender
[ 12]   0.00-10.00  sec  4.63 GBytes  3.98 Gbits/sec                  receiver
[ 14]   0.00-10.00  sec  4.65 GBytes  3.99 Gbits/sec    0             sender
[ 14]   0.00-10.00  sec  4.63 GBytes  3.98 Gbits/sec                  receiver
[ 16]   0.00-10.00  sec  4.65 GBytes  3.99 Gbits/sec    0             sender
[ 16]   0.00-10.00  sec  4.63 GBytes  3.98 Gbits/sec                  receiver
[ 18]   0.00-10.00  sec  4.65 GBytes  3.99 Gbits/sec    0             sender
[ 18]   0.00-10.00  sec  4.63 GBytes  3.98 Gbits/sec                  receiver
[SUM]   0.00-10.00  sec  37.2 GBytes  31.9 Gbits/sec    0             sender
[SUM]   0.00-10.00  sec  37.1 GBytes  31.8 Gbits/sec                  receiver

Gotta be something else you need for jumbo frames; a 40GbE switch without jumbo frames is like a 150 Mbps connection with a 1 TB data limit.

Oh…


Very much that… High-speed NFS for large (many-gigabyte) files on my personal compute cluster is the whole reason for this link…


And to completely answer my own question - the issue is a two-parter:

  1. The default switch MTU is 1554.
  2. Whatever you set, you may need to leave headroom for Linux to “fudge” the MTU. I vaguely recall a bug where some header math was done incorrectly on MTU, which caused 10GbE cards to malfunction at 9000 MTU. (The odd 1554 default also suggests the switch’s MTU figure counts L2 framing overhead on top of the 1500-byte IP payload, so the switch-side value likely needs to sit above the hosts’ 9000 regardless.)

I set the MTU to 9000 everywhere and it still failed. I then set the switch to 12000 (its internally stated max) and the clients to 9000, and now it works.

Dell(conf)#interface fortyGigE 0/0
Dell(conf-if-fo-0/0)#mtu 12000
Dell(conf-if-fo-0/0)#interface fortyGigE 0/8
Dell(conf-if-fo-0/8)#mtu 12000
Dell(conf-if-fo-0/8)#interface fortyGigE 0/4
Dell(conf-if-fo-0/4)#mtu 12000
Dell(conf-if-fo-0/4)#interface fortyGigE 0/12
Dell(conf-if-fo-0/12)#mtu 12000

I’m guessing that 9000 + ~100-300 bytes would have worked as well, since that’s what I recall the Linux kernel/driver bug missing in its header computation…
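
For completeness, the client side is just the usual Linux setting (ens1 is my interface name - adjust to taste; this also won’t survive a reboot without your distro’s persistent network config):

# ip link set dev ens1 mtu 9000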


You need to configure the MTU on the switch ports as well. We use 9126 on appliances which have a similar switch.


40 Gbps / 1500 bytes is 3.33 million packets per second… that’s ~300 ns per packet, which is roughly 1000 cycles on a ~3 GHz core… that should be plenty, no?
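
Back-of-envelope in bash, with a 3 GHz core as an assumed ballpark:

# echo $(( 40 * 10**9 / (1500 * 8) ))   # pps at 1500-byte frames
3333333
# echo $(( 40 * 10**9 / (9000 * 8) ))   # pps at 9000-byte frames
555555
# echo $(( 3 * 10**9 / 3333333 ))       # cycles per packet at 3 GHz
900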

1500 vs 9000 MTU took a big bite out of 10GbE NFS performance when “streaming” (aka block-wise sequential access of) large files, which is my primary use case. Given the underlying cause (kernel churn on packet processing hitting limits - packets per second vs. the GHz to process them), I would expect jumbo-vs-non scaling at 40G nearly identical to what we saw at 10G.

RDMA may render this moot - I haven’t set that up yet.


Have you compared NFS to e.g. just catting the file with mbuffer or similar?
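
(E.g. something like this as a raw-stream baseline that takes NFS out of the picture entirely - port, path, and buffer size are placeholders:)

receiver# mbuffer -I 5201 -m 1G > /dev/null
sender#   mbuffer -i /tank/bigfile -m 1G -O 10.6.9.20:5201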

No, but I’m not sure what that would tell me. Sorry if I caused confusion - my reference to “sequential” access was to describe the flavor of the workload, not the whole of why I’m optimizing NFS access.

What I want isn’t just blind transmit/receive of files; it has embedded caching that minimizes random access and caters to the application’s tendency to process large sequential blocks (MBs to GBs) at a time. That data is also manipulated after being pulled from disk, prior to use, so there’s a benefit to minimizing churn to/from disk for a given region.

Whether with the actual application or dd, 10GbE showed a significant performance delta (it’s been a while, but on the order of 20-30% IIRC) with jumbo vs non-jumbo frames. I’m not alone in that, hence Intel’s app notes on optimizing 10GbE connections.
