TrueNAS ZFS 2x10G MPIO vs 1x40G - lower latency?

Hello, I am planning/testing a homelab ZFS storage backend for self-hosted Kubernetes cluster experiments. It’s an old Supermicro SuperStorage 6028R-E1CR16T with 2x E5-2650 v4, 128 GB of RAM and a bunch of deduped zvols for VMs sitting on a pair of Optane P4800X SSDs. It’s running the latest TrueNAS Core (13-U6.1).

The zvols are exposed via iSCSI, set up in an MPIO configuration over a pair of 10G links. This is working well, but I’ve recently gotten hold of a few Mellanox MCX354A-FCBT 40G NICs that I’d like to put in as a replacement. Due to hardware limitations, at the moment I can only run a single 40G fibre to the switch.

Since the VMs are served by the mighty Optanes, my idea is to try to lower network latency in order to better exploit the Optane-backed VM pool. There are several other pools on this server, but they are either HDD or NAND SSD pools, so I don’t regard them as relevant right now.

My question is - ceteris paribus, what would yield lower total storage latency (on average): 2x10G MPIO or 1x40G?

When I say “storage latencies”, I really mean “lowest average round-trip latency from the Proxmox nodes hosting the VMs to the storage backend”. The primary Proxmox node would be connected to the switch via a 40G link; the others would be connected via 10G fibre links.

Network-wise, looking at the raw serialisation delay at the PHY layer for a standard 1500-byte frame, the 1x40G link should smoke the 2x10G links (~0.3 μs vs ~1.2 μs, respectively), which should translate into lower queue depths… but then again, since there are two 10G links available in MPIO, perhaps the per-link queues would be shorter and the 2x10G setup would be better (on average)? Or am I thinking about this completely the wrong way?
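For reference, here is the back-of-the-envelope calculation behind those numbers (a minimal sketch; the 1500-byte frame size is my assumption, and it ignores preamble/IFG and protocol headers):

```python
# Serialisation delay: the time one frame occupies the wire at a given line rate.
FRAME_BYTES = 1500   # assumed standard-MTU frame size

def serialisation_delay_us(link_gbps, frame_bytes=FRAME_BYTES):
    """Microseconds needed to clock one frame onto the wire."""
    return frame_bytes * 8 / (link_gbps * 1e9) * 1e6

print(f"10G: {serialisation_delay_us(10):.2f} us per frame")   # ~1.20 us
print(f"40G: {serialisation_delay_us(40):.2f} us per frame")   # ~0.30 us
```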

If latency is what you are focused on, that would be 40Gb over InfiniBand …

I agree but that is not an option I have at the moment. The original question still stands, however.

The 40Gb link should have less latency. That being said, if you are looking to link-aggregate two 40Gb ports, just remember that it is a PCIe Gen3 x8 card, and the most you can get out of it, if everything is perfect, is about 64Gb/s.

I don’t know about Proxmox, but I have had issues getting the ESXi hypervisor to see 40Gb cards.

PCIe Gen 3 is about 1 GB per second per lane, which is roughly 8Gb/s per lane, and times 8 lanes that is about 64Gb/s.
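The rough math behind that ceiling, if you want to check it yourself (just a sketch based on the Gen3 line rate and 128b/130b encoding; real-world throughput is further reduced by PCIe protocol overhead):

```python
# PCIe Gen3 runs at 8 GT/s per lane with 128b/130b encoding.
GEN3_GT_PER_S = 8
ENCODING_EFFICIENCY = 128 / 130   # usable bits per bit transferred

def pcie_gen3_gbps(lanes):
    """Approximate usable line rate in Gb/s before PCIe protocol overhead."""
    return GEN3_GT_PER_S * ENCODING_EFFICIENCY * lanes

print(f"x1: {pcie_gen3_gbps(1):.2f} Gb/s (~{pcie_gen3_gbps(1) / 8:.2f} GB/s)")
print(f"x8: {pcie_gen3_gbps(8):.1f} Gb/s (~{pcie_gen3_gbps(8) / 8:.1f} GB/s)")   # ~63 Gb/s, ~7.9 GB/s
```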

It would also be beneficial to look into jumbo frames and packet shaping when going over 10Gb.
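To give a rough idea of what jumbo frames buy you on the wire (a back-of-the-envelope sketch, assuming MTU 1500 vs. 9000 frames and ignoring headers), fewer, larger frames mean far fewer packets to process per second:

```python
# Frame rate and per-frame serialisation time at standard vs. jumbo MTU,
# assuming a 10G link pushed at full line rate (headers ignored).
def frames_per_second(link_gbps, mtu_bytes):
    return link_gbps * 1e9 / (mtu_bytes * 8)

for mtu in (1500, 9000):
    fps = frames_per_second(10, mtu)
    per_frame_us = mtu * 8 / 10e9 * 1e6
    print(f"MTU {mtu}: ~{fps / 1e3:.0f}k frames/s ({per_frame_us:.1f} us per frame)")
```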

Lastly, just remember that two 10Gb links aggregated do not give you 20Gb/s; aggregation allows load balancing and lets two hosts talk to the server at 10Gb each.

Also, not trying to be that guy, but just remember that if you are talking data bandwidth you are normally talking bits, not bytes, so it is a lowercase b, not uppercase.

Thank you - everything you said is very close to my thinking as well.

I won’t be trying, or expecting, to fully saturate a 40G link, probably ever. My thought experiment is whether it’s realistic to expect that 2x10G MPIO will (on average) experience lower queue depths than a 1x40G link. I can’t see that happening, but I also understand that the only definitive way of establishing this as fact is benchmarking under real load. I was hoping someone would poke a hole in my thinking and disprove my expectations.
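For what it’s worth, here is how I’ve been framing it as a toy queueing model: treat 2x10G as two servers and 1x40G as one server that is four times faster, assume Poisson frame arrivals (a big simplification for storage traffic, and the 1500-byte frame size is again my assumption), and compare mean queueing-plus-serialisation time at a few offered loads:

```python
import math

FRAME_BYTES = 1500   # assumed standard-MTU frame size

def erlang_c(servers, offered_load):
    """Erlang C formula: probability an arriving frame has to queue (offered_load = lambda/mu)."""
    top = (offered_load ** servers / math.factorial(servers)) * (servers / (servers - offered_load))
    bottom = sum(offered_load ** k / math.factorial(k) for k in range(servers)) + top
    return top / bottom

def mean_wire_time_us(links, per_link_gbps, demand_gbps, frame_bytes=FRAME_BYTES):
    """Mean queueing + serialisation time per frame (microseconds) for an M/M/c link model."""
    mu = per_link_gbps * 1e9 / (frame_bytes * 8)   # frames/s a single link can serialise
    lam = demand_gbps * 1e9 / (frame_bytes * 8)    # offered frames/s
    if lam >= links * mu:
        raise ValueError("offered load exceeds aggregate link capacity")
    wait = erlang_c(links, lam / mu) / (links * mu - lam)   # mean time spent queued
    return (wait + 1 / mu) * 1e6

for demand_gbps in (1, 5, 10, 15):
    print(f"{demand_gbps:>2} Gb/s offered: "
          f"2x10G ~ {mean_wire_time_us(2, 10, demand_gbps):5.2f} us, "
          f"1x40G ~ {mean_wire_time_us(1, 40, demand_gbps):5.2f} us")
```

Under those admittedly crude assumptions the single 40G link comes out ahead at every load level, mostly because each frame spends a quarter of the time on the wire; and the M/M/2 treatment is arguably generous to MPIO, since it assumes a single shared queue rather than round-robin across two separate ones. Still just a model, though, so benchmarking it is.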