I am trying to build a machine learning rig with 2× A6000 or 2× 4090, as they become available. I also need it to run multiple virtual machines, maybe 4-5 at any given time, which will share these GPUs.
I saw the video on the channel where @wendell talked about running multiple VMs on a 32-core Threadripper Pro. With the new 7950X out, I'm wondering whether it can match the 32-core Threadripper Pro's performance for training AI models while running the VMs at the same time.
If by sharing you mean multiple VMs using the same GPU, be aware that sharing those 4090s might not be an easy task (if it's possible at all). vGPU on consumer cards is a bit of a pain. It should be doable with the A6000, but IIRC it still requires a license.
For model training the CPU shouldn't really be that relevant, as long as you have a proper data pipeline with already-processed data so the GPUs don't get starved. If you're doing transformations to your data during the training pipeline, the CPU becomes more relevant, but a 7950X should be more than enough to handle it, given its high single-threaded performance.
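The "don't starve the GPUs" point can be sketched in plain Python: keep a bounded queue of preprocessed batches so the training loop never waits on the CPU. This is a toy illustration with made-up function names; in practice PyTorch's `DataLoader` (with `num_workers`) or `tf.data` prefetching does this for you.

```python
import queue
import threading
import time

def preprocess(sample):
    # Stand-in for CPU-side decode/augmentation work.
    time.sleep(0.01)
    return sample * 2

def producer(samples, q):
    # CPU worker: prepares batches ahead of the consumer.
    for s in samples:
        q.put(preprocess(s))
    q.put(None)  # sentinel: no more data

def train(samples, prefetch_depth=4):
    # Bounded queue: at most `prefetch_depth` batches buffered ahead.
    q = queue.Queue(maxsize=prefetch_depth)
    t = threading.Thread(target=producer, args=(samples, q))
    t.start()
    seen = []
    while (batch := q.get()) is not None:
        # Stand-in for the GPU training step; while it runs, the
        # producer thread is already preparing the next batch.
        seen.append(batch)
    t.join()
    return seen

print(train(range(8)))  # [0, 2, 4, 6, 8, 10, 12, 14]
```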
Thank you @igormp. Oh, I didn't know that VMs created on a host machine with GPUs could access those GPUs.
I was thinking those VMs would get access to the GPUs and also share the VRAM.
But you're saying the A6000 can be shared, right? When you say license, what kind of license is it? Is it something paid, and if so, do you have any idea of the cost?
Sadly that’s a paid feature. Usually, when you spin up a VM it has no access whatsoever to PCIe devices; you need to pass them through if you want the VM to access them (which leaves the host with no access to the device while the guest uses it).
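For reference, under libvirt/KVM that passthrough boils down to attaching the GPU's PCI address to the guest as a `<hostdev>` entry. The bus/slot values below are placeholders; find yours with `lspci`:

```xml
<!-- Guest XML snippet: assigns the host GPU at 0000:41:00.0 (placeholder
     address) to the VM. While attached, the host cannot use this device. -->
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x41' slot='0x00' function='0x0'/>
  </source>
</hostdev>
```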
Yeah, it is paid. I’m not sure if the newest A6000 has official support yet, but you can read more about it here:
There are some ways to get around that license, but they require modded drivers and whatnot.
One thing you can do, instead of using VMs, is go with containers; those should work as you expect. Attach volumes with the datasets/saved checkpoints, allow them to use the GPU, and you’re off to the races.
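As a sketch of the container route (image tag and paths are placeholders; this assumes the NVIDIA Container Toolkit is installed on the host), a Compose service with GPU access could look like:

```yaml
# docker-compose.yml sketch: a training container sharing the host's GPUs.
services:
  trainer:
    image: nvcr.io/nvidia/pytorch:24.01-py3   # placeholder image tag
    volumes:
      - ./datasets:/data          # dataset volume
      - ./checkpoints:/ckpt       # saved checkpoints
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all          # or pin specific GPUs via device_ids
              capabilities: [gpu]
```

For a one-off container, `docker run --gpus all …` achieves the same thing.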
One thing to keep in mind is that unless you are dealing with insanely large datasets and highly complex neural networks, dual 4090s will be massive overkill, delivering performance that will be pretty much obsolete by the next generation in two years anyway.
In your case I would aim for either a 3090 Ti (Ampere) or a 4080 16 GB. The only differences will be how large a neural net you can fit (24 vs 16 GB of VRAM) and how fast you can train it (at most ~60% faster, which won’t make much difference for smaller NNs, since CPU, disk speed, and PCIe bottlenecks come into play).
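A rough way to sanity-check the 24 GB vs 16 GB question is to estimate training memory from parameter count. The multiplier below is a common rule of thumb (weights + gradients + Adam's two moment buffers, all fp32), not an exact figure, and activations add more on top:

```python
def training_vram_gb(n_params, bytes_per_param=4, copies=4):
    """Rough VRAM estimate for fp32 training with Adam.

    copies=4 counts weights, gradients, and Adam's two moment
    buffers; activations come on top and scale with batch size,
    so treat the result as a lower bound.
    """
    return n_params * bytes_per_param * copies / 1024**3

# A 1-billion-parameter model needs roughly 15 GB before activations,
# which already brushes against a 16 GB card but fits in 24 GB.
print(round(training_vram_gb(1_000_000_000), 1))  # 14.9
```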
If you have a good grasp of your algorithm and have decided it needs more juice, then dual 4090s are the right call. If you’re not sure, save yourself some money now and spend it on a 5000-series card down the road.
Thanks @wertigon. I will be working with raw image and video footage data all the time. Each of my datasets is around 10 GB or so.
One reason I was considering a 5000-series chip was mainly the VMs I wanted to create to utilize the GPUs. But it looks like the way to share the GPUs is Docker, so I’m still torn between the 7950X and the 5975WX.
It basically boils down to whether you need many cores, many PCIe lanes, or more than 128 GB of RAM. If you do need any of those, go with the 5975WX; otherwise it doesn’t make much sense.
@VRC can you go into a bit more detail… e.g., where I work we distill large models down to a few gigs so that our inference workloads are good enough (i.e. latency / throughput / total hardware bill when serving models).
10GB of video or stills doesn’t sound like a large dataset - are you sure you need all that horsepower?
(Also, check whether you can actually train two things in parallel on the same GPU.)
This machine is for my personal learning and projects. At work we have been working on large datasets and CV tasks in parallel, which consume around 16-18 GB across training and inference.
So I was thinking of starting with a 4090 so as not to hit a bottleneck after a while. And I was thinking of two because my students, who work on at least 4 projects, need 4 VMs that can share the other GPU; all together those VMs might not need more than 10-12 GB. That’s the reason I was considering the Threadripper Pros: the core count and RAM capacity.
I was thinking 4 VMs running on the same machine plus my own tasks might need more than 16 cores (this is my guess, I’m not sure though). And with 4 VMs running all the time, I was also wondering how much RAM we’d need (if it’s more than 128 GB, then Threadripper is the only choice, I guess).
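As a back-of-the-envelope check on the 128 GB question (every per-VM and per-host figure here is an assumption; adjust to your actual workloads):

```python
# Hypothetical sizing: 4 guest VMs plus the host's own ML work.
vm_count = 4
ram_per_vm_gb = 16      # assumed allocation per student VM
host_tasks_gb = 48      # assumed for your own training jobs
host_overhead_gb = 8    # hypervisor + OS headroom

total = vm_count * ram_per_vm_gb + host_tasks_gb + host_overhead_gb
print(total)  # 120 -> still under the 7950X platform's 128 GB ceiling
```

Under these assumptions the 7950X squeaks by; bump the per-VM or host figures much higher and only Threadripper Pro's larger RAM capacity works.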
The difference in price and bang for the buck between 5975wx and 7950x is huge.
If your livelihood is not at stake, 7950x should get you most of the way there and leave you with a bit of cash to spend on fancier GPUs and/or perhaps a more specialized accelerator.
If you can start with one GPU, why not…
Good news is that ML workloads typically don’t require tons of PCIe bandwidth… 2 GPUs would probably run fine too (3 if you count the one built into the CPU, which is super useful for driving a desktop).
You could perhaps just buy a second machine from the money you save too.
@wertigon But one thing I am skeptical about is having two different GPUs in one machine: how compatible are they, say an Intel Arc and a 4090?
Also @wertigon, I think one video we (me and many more AI enthusiasts) have been looking for is training and running inference with neural networks on an Intel or AMD GPU (an Arc, maybe). Please consider this for a future video. Thanks.