3 VMS, AI & gaming shared multi-GPUs + 1

Please tell me if the following is possible for my first VFIO build, or how I should adjust the hardware or my expectations so that it is:

  • Proxmox host with Looking Glass [iGPU]

2 VMs always active:

  • Primary Linux VM (General use: code/ML/LLMs) [3090-A, dynamic 3090-B & cores]

  • Secondary Linux VM (Lightweight use: testing/privacy/security) [GTX 970]

1 VM sometimes active:

  • Windows VM (Gaming) [Dynamic 3090-B & cores]

The 2 x (used) 3090s allow for LLMs that won't fit in a single card's 24 GB of VRAM, and it would be great to also benefit from parallel or multi-GPU use for AI more generally. To make the best use of resources, I would like to be able to dynamically move 3090-B (and some cores or RAM, if that makes sense) from the Primary Linux VM to the Windows VM whenever Windows is active for 1440p gaming, perhaps with a hotswap script; a rough sketch of what I have in mind is below.
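To be concrete, this is a minimal sketch of the hotswap idea using plain `qm` calls on the Proxmox host, wrapped in Python. The VMIDs (101 = Primary Linux, 103 = Windows), the PCI address of 3090-B, and the hostpci slot are all assumptions to be replaced; note that hostpci changes to a running VM generally only take effect on the next stop/start, so in practice the Linux guest would need to release the card (or be restarted) before the handover.

```python
#!/usr/bin/env python3
# Sketch of swapping 3090-B between the Primary Linux VM and the Windows
# gaming VM on the Proxmox host. VMIDs, PCI address and hostpci slot are
# placeholders - substitute your own.

import subprocess
import sys

PRIMARY_LINUX = "101"   # assumed VMID of the Primary Linux VM
WINDOWS = "103"         # assumed VMID of the Windows gaming VM
GPU_B = "0000:0b:00"    # assumed PCI address of 3090-B
SLOT = "hostpci1"       # assumed passthrough slot used for 3090-B


def qm(*args: str) -> None:
    """Run a qm command on the host and raise if it fails."""
    subprocess.run(["qm", *args], check=True)


def to_windows() -> None:
    # Release 3090-B from the Linux VM config, attach it to the Windows VM,
    # then boot Windows for the gaming session.
    qm("set", PRIMARY_LINUX, "--delete", SLOT)
    qm("set", WINDOWS, f"--{SLOT}", f"{GPU_B},pcie=1")
    qm("start", WINDOWS)


def to_linux() -> None:
    # Shut Windows down cleanly, wait for it to stop, then hand the card back.
    qm("shutdown", WINDOWS)
    qm("wait", WINDOWS)
    qm("set", WINDOWS, "--delete", SLOT)
    qm("set", PRIMARY_LINUX, f"--{SLOT}", f"{GPU_B},pcie=1")


if __name__ == "__main__":
    to_windows() if sys.argv[1:] == ["gaming"] else to_linux()
```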

Even if this is possible, I appreciate it may rule out using NVLINK, as I think both cards would need to be in the same IOMMU group. I have read that NVLINK is not relevant for LLM inference, and not so important for DL either, as most homelab workloads hit compute bottlenecks before interconnect bottlenecks, so I hope this would not preclude using both cards for AI generally. I have wondered whether moving both 3090s between VMs together could sidestep the issue, but then another card would have to be passed to the Primary Linux VM while Windows is active, which, on top of the GTX 970 for the Secondary Linux VM, means four discrete cards in total; I am unsure that is practical. At least checking the actual grouping on the host is easy enough, as in the sketch below.
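This is just /sys reading plus lspci for labels, nothing Proxmox-specific, to confirm whether the two 3090s (and the GTX 970) end up in groups that can be passed through independently.

```python
#!/usr/bin/env python3
# List every PCI device by IOMMU group on the host.

import subprocess
from pathlib import Path

groups = Path("/sys/kernel/iommu_groups")
for group in sorted(groups.iterdir(), key=lambda p: int(p.name)):
    for dev in sorted((group / "devices").iterdir()):
        # lspci -nns <addr> prints a human-readable name for the device.
        desc = subprocess.run(["lspci", "-nns", dev.name],
                              capture_output=True, text=True).stdout.strip()
        print(f"group {group.name:>3}: {desc or dev.name}")
```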

It was also explained to me recently that 4 cores per VM is a good rule of thumb for VRAM-intensive applications, so as not to saturate dual-channel memory bandwidth, and that performance beyond that does not scale linearly. I hope that, when I am not doing anything intensive on the Primary Linux VM while Windows is active, I can still take advantage of more than 4 cores in a single VM, even with diminishing returns; the kind of rebalancing I have in mind is sketched below.
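A rough sketch of that rebalancing, assuming CPU hotplug is enabled on the Primary Linux VM (so `vcpus` can change at runtime) and a Proxmox release recent enough to support the per-VM `affinity` option; the VMIDs and core ranges are placeholders, not a recommended 7950X3D/9950X3D layout.

```python
#!/usr/bin/env python3
# Shrink the Primary Linux VM's active vCPUs and pin the Windows VM to its
# own block of host cores for a gaming session, then restore afterwards.

import subprocess


def qm(*args: str) -> None:
    subprocess.run(["qm", *args], check=True)


def gaming_session() -> None:
    # Primary Linux (assumed VMID 101) drops to 4 active vCPUs; Windows
    # (assumed VMID 103) is pinned to a contiguous range of host cores.
    qm("set", "101", "--vcpus", "4")
    qm("set", "103", "--affinity", "8-15")


def ai_session() -> None:
    # Give Primary Linux its cores back once the gaming session ends.
    qm("set", "101", "--vcpus", "12")
```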

If dual 3090s are unworkable, perhaps 4090s using PCIe for interconnect instead of NVLINK (which the 4090 lacks) are my only option? I was initially set on a single-4090 build before I started to consider running LLMs larger than 24 GB a priority. But the best used 4090 prices in the EU are ~220% of the best used 3090 prices, and LLM inference with 2 x 4090s is only slightly faster, though closer to twice as fast in other general AI benchmarks I have seen, so the 3090s look like better value for my use case.

One reason I would like to include the Secondary Linux VM is that, alongside the other VMs, it would benefit from Proxmox delta/incremental sync or backup (a minimal example of what I mean is below), which might also let me run it from my laptop when mobile. If it would make everything else possible, I could look at running it on an SBC/SoC or some other lightweight second computer instead.
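By delta/incremental backup I mean something like the dirty-bitmap backups you get when targeting a Proxmox Backup Server datastore; the VMID 102 and the storage name "pbs" below are assumptions.

```python
#!/usr/bin/env python3
# Snapshot-mode backup of the Secondary Linux VM (assumed VMID 102) to an
# assumed PBS datastore named "pbs"; with PBS, follow-up runs are incremental.

import subprocess

subprocess.run(["vzdump", "102", "--storage", "pbs", "--mode", "snapshot"],
               check=True)
```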

I was planning to build this when used GPU prices hopefully drop in the run-up to the 5000-series release, which may also narrow the gap between 3090 and 4090 prices. I am looking at a 7950X3D or 9950X3D (8, or possibly 16, V-Cache cores should relieve some of the pressure on the memory channels) on a board like the ASUS ProArt B650-Creator, but I have read that the IOMMU grouping may be so good that it actually causes an issue for NVLINK/multi-GPU, and that if the board can accommodate 3 cards they may have to run at x8, x4, and x1.

Where possible I would like to balance getting the best performance for the money with not cutting corners so much that I introduce bottlenecks. Can something like this work?
