Didn’t find the thread for the recently published video, so I am creating one.
(Mods: Please feel free to move this to wherever you see fit.)
@wendell
Regarding the questions that you asked in your video:
A few things:
- If you're using Ethernet on your Mellanox cards rather than InfiniBand, then you'll probably want to enable RoCE v2.
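For reference, this is roughly how I'd check/force the default RoCE mode for RDMA CM connections on a ConnectX card via configfs. It's only a minimal sketch; `mlx5_0` and port 1 are assumptions, so adjust for whatever `ibv_devices` shows on your box:

```
# load the RDMA CM module and expose the device under configfs
modprobe rdma_cm
mkdir -p /sys/kernel/config/rdma_cm/mlx5_0

# show the current default RoCE mode for port 1
cat /sys/kernel/config/rdma_cm/mlx5_0/ports/1/default_roce_mode

# force RDMA CM connections on that port to RoCE v2
echo "RoCE v2" > /sys/kernel/config/rdma_cm/mlx5_0/ports/1/default_roce_mode
```

(MLNX_OFED also ships a `cma_roce_mode` helper script that does basically the same thing.)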
- In my experience, for just about anything RDMA-related, RHEL derivatives tend to perform better than Debian derivatives.
For example, if you want to enable SR-IOV for InfiniBand (Mellanox ConnectX-4 100 Gbps dual-port IB NIC), the OpenSM subnet manager that ships with Debian doesn't support virtual functions, which matters if you're using an externally managed 100 Gbps IB switch like I am (externally managed switches are cheaper than Mellanox's managed 100 Gbps IB switches). In talking with the dev team, they have no plans to port that support over from RHEL (and its derivatives) to Debian.
So, if you want IB VFs/SR-IOV, you'll either need to run a managed IB switch, or you'll need to run RHEL (or one of its derivatives) so that you can run the RHEL variant of opensm (rough sketch of the SR-IOV setup below).
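For anyone who does end up on RHEL (or a derivative) with an opensm build that supports virtualization, turning on IB SR-IOV on a ConnectX-4 looks roughly like this. Just a sketch; the MST device path, device name, and VF count are examples:

```
# one-time firmware settings (check `mst status` for your actual device path)
mst start
mlxconfig -d /dev/mst/mt4115_pciconf0 set SRIOV_EN=1 NUM_OF_VFS=8

# after a reboot/firmware reset, create the VFs at runtime
echo 8 > /sys/class/infiniband/mlx5_0/device/sriov_numvfs

# the subnet manager also needs virtualization enabled for the VFs' GIDs,
# e.g. virt_enabled 2 in /etc/rdma/opensm.conf on the RHEL-family opensm
```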
- I have found that IB tended to work better than ETH. With two 5950X nodes connected point-to-point (via DAC), the Proxmox host itself could run 8 streams of `iperf` and hit, I think, close to 96.9 Gbps over IB; but when I set the VPI port to run as ETH rather than as IB, it was only hitting around 23.4 Gbps max with the same 8 `iperf` streams running through an Ethernet bridge (rough sketch of the test below).
So, the PHY protocol matters.
(The idea with this test was trying to get multiple VMs and/or LXCs sharing the same 100 Gbps connection as much as possible. It’s a pity that the VFs/SRIOV ended up being a bust on the Debian-based Proxmox system.)
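For what it's worth, the test itself was nothing fancy, just something along these lines (the IP is a placeholder, and the IB case assumes an IPoIB interface since `iperf` is plain TCP):

```
# node A (server)
iperf -s

# node B (client): 8 parallel TCP streams for 30 seconds
iperf -c 10.0.0.1 -P 8 -t 30
```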
It's also a pity that xcp-ng, the type 1 hypervisor that's based on a RHEL derivative, doesn't have as much going for it compared to Proxmox (e.g. it doesn't support virtio-fs, which, ironically, was developed by a senior software engineer at Red Hat).
- If you want higher levels of performance, CentOS/Rocky Linux gets you very CLOSE to RHEL-level performance without actually being RHEL.
(I used to run CentOS on my micro HPC cluster, including the headnode, where I had four Samsung 850 EVO 1 TB SATA 6 Gbps SSDs pegged, at least once, at 38 Gbps out of the possible 24 Gbps combined from the four drives. It was a HW RAID0 array managed by an LSI MegaRAID SAS 9341-8i (I wanted the capacity and speed of the four drives, and didn't care about redundancy nor fault tolerance), formatted with XFS, and then exported to the network via NFSoRDMA; rough sketch of that below.)
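If anyone wants to try NFSoRDMA, the rough shape of it on a RHEL-family box is something like this. The export path, subnet, and server IP below are placeholders:

```
# server: export the XFS-formatted array over NFS with the RDMA listener enabled
modprobe svcrdma
echo "rdma 20049" > /proc/fs/nfsd/portlist
# /etc/exports entry, e.g.:  /export  10.0.0.0/24(rw,async,no_root_squash)
exportfs -ra

# client: mount using the RDMA transport
modprobe xprtrdma
mount -o rdma,port=20049 10.0.0.1:/export /mnt/scratch
```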
- I don’t know if the ETH Mellanox 100 GbE cards can do this, but you can also look into iSER (basically iSCSI over RDMA).
You might be able to get better performance with that.
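As a rough sketch, flipping an existing LIO/targetcli iSCSI target over to iSER looks something like this (the IQN and IP below are placeholders):

```
# target side: enable iSER on the portal of an existing iSCSI target
targetcli /iscsi/iqn.2003-01.org.example:target1/tpg1/portals/10.0.0.1:3260 enable_iser boolean=true
targetcli saveconfig

# initiator side: switch the interface transport to iSER, then log in
iscsiadm -m node -T iqn.2003-01.org.example:target1 -p 10.0.0.1 \
         -o update -n iface.transport_name -v iser
iscsiadm -m node -T iqn.2003-01.org.example:target1 -p 10.0.0.1 --login
```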
But yeah, there’s a LOT that you can do.
- re: AI workloads – I've found that unless you're doing the model training, LOADING a model doesn't actually demand that much from storage. Even a PCIe 3.0 x4 NVMe SSD is PLENTY sufficient. (I think I was loading the codestral:22b model last night at something like < 500 MB/s from the Intel 670p series 1 TB NVMe SSD.)
There are limits to how fast you can push data into an application as it loads, no matter how fast the storage underneath it is.
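(If you want to see what a model load is actually pulling off the drive, watching the device with `iostat` while it loads is usually enough; `nvme0n1` here is just whatever your model drive happens to be:)

```
# per-device stats in MB/s, refreshed every second
iostat -xm 1 nvme0n1
```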
- 100% agree with you that for media storage, the PM7s would be wasted.
- Unless you are going to boot and reboot VMs often, you don't necessarily need super fast storage for that either.
(LXCs boot even faster.)
Interestingly enough, when you analyze Windows' data access pattern while it boots, it's a mix of load time and also a mix of processing time as the Windows kernel comes up.
Again, you can only push both of those so much/so far.
(I've tried booting Windows off of a RAM drive. The RAM itself can be very fast, but Windows didn't really see much of an appreciable reduction in boot time, despite being given a MUCH faster storage medium/interface.)
- For Steam games, I've found that loading a game is, again, not as storage-intensive as getting the game (and its updates) installed.
The installation realised more of a benefit from a faster storage and network subsystem (I'm already running iSCSI for my Windows gaming clients), but the game itself (e.g. Halo Infinite) wouldn't load much faster than 500 MB/s, even if the drive's STR is capable of 2-3.5 GB/s reads. It doesn't scale with the capability of the drive.
(As such, even a SATA 6 Gbps SSD is able to deliver the 500 MB/s that Halo Infinite was requesting. Giving it a faster drive (also tested with a Silicon Power US70 PCIe 4.0 x4 NVMe SSD, on a 7950X with a Supermicro H13SAE-HF) again barely reduced the time it took to load Halo Infinite.)
You'll probably realise more benefit with more concurrent users. But for a single user (or a mostly-single-user scenario), the benefits are mehhh…
edit:
Apparently the codestral:22b model WILL ask for ~1.4 GB/s read speed from my Intel 670p PCIe 3.0 x4 NVMe SSD.
So I guess there is a use case for an all NVMe SSD NAS.