Platform selection for a multi-5090 AI training computer

Having gotten to the point where I can keep several GPUs busy around the clock running training experiments, I’m strongly considering buying some training hardware instead of renting cloud instances. With the upcoming 5090s rumored to have 32 GB of VRAM, this seems like a good time to do so.

I don’t want to overspend on the platform; for my use case, money is best allocated towards GPUs. Turin and Granite Rapids prices are eye-watering, so even though I’m going to treat this system like a server, I’m probably looking at Threadripper or Sapphire Rapids Refresh workstation parts. I’ll go for an 8-memory-channel variant, because I do want IPMI. Depending on the actual launch price of the 5090s, the platform price, and my stomach for water cooling, I’ll go with 2-4 5090s, and even if I don’t get all 4 on “launch day”, I’ll expand to 4 cards over the course of a year.

If prices are too high, I might fall back to an older DDR4 platform, perhaps an F-series Epyc Rome or Milan if I can find one, but let’s see if I can put together something workable with W790 or Threadripper.

Given the GPU count and the (tragic) prevalence of Python in ML work, I reckon I only really need about 20 cores, plus as much single-thread performance as I can get. I do use OpenVINO, and I’d be able to leverage the features (AMX, AVX-512) and software ecosystem that Intel offers, so I’m leaning towards W790. That said, I’m not super familiar with the options here, hence this post. This box will run Linux.
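As an aside, for anyone comparing chips on these features: here’s a minimal sketch for checking which of these extensions a given Linux host actually advertises. The flag names follow the kernel’s /proc/cpuinfo conventions; the list of flags to check is just my pick.

```python
# Minimal sketch: list which ISA extensions the host CPU advertises,
# by parsing the "flags" line of /proc/cpuinfo (Linux only). Flag
# names follow kernel conventions, e.g. "avx512f", "amx_tile".
FLAGS_OF_INTEREST = ["avx2", "avx512f", "avx512_bf16", "amx_tile", "amx_bf16", "amx_int8"]

cpu_flags = set()
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("flags"):
            cpu_flags = set(line.split(":", 1)[1].split())
            break

for flag in FLAGS_OF_INTEREST:
    print(f"{flag:12s} {'yes' if flag in cpu_flags else 'no'}")
```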

In summary, I think the top factors for platform selection are cost, IPMI, single-thread performance running Python, and the ability to keep 4 GPUs fed. Next-tier factors are the availability of things like AVX-512 and AMX.

Input is appreciated. Happy new year!

No advice, just wondering: what are you training on those, and what does the ROI look like?

If you’re going to be doing AI/ML things on Linux, use AMD. A Threadripper system can have as few as 16 cores or as many as 96, while giving you all the PCIe lanes you could ever want. Either way, with 2-4 GPUs you’ll need either Threadripper or Xeon, and between those two, AMD is outperforming Intel on Linux right now. The ASUS WS TRX50 and WS WRX90 both have IPMI, as do the Supermicro Threadripper boards.

what are you training on those, and what does the ROI look like?

Largely doing research in self-supervised computer vision, and training or fine-tuning diffusion models. Overall ROI is hard to measure since it’s more on the research side than “production” work.

If you’re going to be doing AI/ML things on Linux, use AMD.

I looked into it a bit more and yeah, TR does look like it compares favorably against Intel in compute per dollar.

After I wrote my first post, I thought about it a bit more and realized that AM5 could actually be a good choice too, if unorthodox, at a very substantial cost savings. I already know that 16x Gen 4 lanes per GPU is enough for my purposes, so 8x Gen 5 lanes per GPU is probably fine too. True, an AM5 system would be limited to 2 GPUs instead of 4, but in principle you could scale to 4 by just getting two systems, which also avoids headaches around cooling and power delivery (e.g. obviating the need for dual power supplies). If you want to run one training job on all 4 cards, it’s obviously better to have all four in one box, but if you want to run multiple experiments in parallel (which, at least right now, is my situation), then it would hardly matter.
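For what it’s worth, running independent experiments in parallel like that is easy to script by pinning each run to its own device. A minimal sketch; the train.py script and config names are made up:

```python
# Sketch: run one independent training job per GPU by pinning each
# process to a single card via CUDA_VISIBLE_DEVICES. train.py and the
# config names are placeholders for whatever your experiments look like.
import os
import subprocess

experiments = [
    ["python", "train.py", "--config", "exp_a.yaml"],
    ["python", "train.py", "--config", "exp_b.yaml"],
]

procs = []
for gpu_id, cmd in enumerate(experiments):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
    procs.append(subprocess.Popen(cmd, env=env))

for p in procs:
    p.wait()
```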

Best on the market is 2x PCIe 5.0 x8 slots from ASRock and Supermicro, assuming you’re using ECC, which is strongly recommended for AI workloads.

One error in memory can very easily poison the output.

Unfortunately, GeForce cards don’t have ECC on their VRAM. But yes, I’ll use ECC on the host.
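If anyone wants to check this on their own cards, here’s a minimal sketch using the NVML Python bindings (pip install nvidia-ml-py); on GeForce parts, which lack VRAM ECC, the query should come back as unsupported:

```python
# Sketch: query GPU ECC error counters via NVML. On cards without
# VRAM ECC (e.g. GeForce), the query raises "Not Supported".
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    name = pynvml.nvmlDeviceGetName(handle)
    try:
        errs = pynvml.nvmlDeviceGetTotalEccErrors(
            handle,
            pynvml.NVML_MEMORY_ERROR_TYPE_UNCORRECTED,
            pynvml.NVML_VOLATILE_ECC,
        )
        print(f"GPU {i} ({name}): {errs} uncorrected ECC errors")
    except pynvml.NVMLError:
        print(f"GPU {i} ({name}): ECC not supported")
pynvml.nvmlShutdown()
```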

Best on the market is 2x PCIe 5.0 x8 slots from ASRock and Supermicro

Yes, I was comparing the existing system I use (a server with Gen 4 x16 to each card) with what AM5 has to offer in the case of just 2 GPUs, and noting that the bandwidth is the same.
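The back-of-the-envelope math bears that out; both links land at roughly the same per-direction bandwidth:

```python
# Back-of-the-envelope PCIe bandwidth: per-lane transfer rate times
# lane count, scaled by 128b/130b encoding overhead (PCIe 3.0+).
def pcie_gbps(gt_per_s, lanes):
    return gt_per_s * lanes * (128 / 130) / 8  # GB/s per direction

print(f"Gen4 x16: {pcie_gbps(16, 16):.1f} GB/s")  # ~31.5 GB/s
print(f"Gen5 x8:  {pcie_gbps(32, 8):.1f} GB/s")   # ~31.5 GB/s
```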
