Having gotten to the point where I can keep several GPUs busy around the clock with training experiments, I’m strongly considering buying training hardware instead of renting cloud instances. With the upcoming 5090s rumored to have 32 GB of VRAM, this seems like a good time to do so.
I don’t want to overspend on the platform; for my use case, money is best allocated towards GPUs. Turin and Granite Rapids prices are eye-watering, so even though I’m going to treat this system like a server, I’m probably looking at Threadripper or Sapphire Rapids refresh workstation parts. I’ll go for an 8-memory-channel variant, because I do want IPMI and those are the boards that tend to offer it. Depending on the actual launch price of the 5090, on the platform price, and on my stomach for water cooling, I’ll start with 2-4 5090s, and even if I don’t get all 4 on “launch day”, I will expand to 4 cards over the course of a year.
If prices are too high, I might fall back to an older DDR4 platform, perhaps an F-series EPYC Rome or Milan if I can find one at a reasonable price, but first let’s see if I can put together something workable with W790 or Threadripper.
Given the GPU count and the (tragic) prevalence of Python in ML work, I reckon I only really need about 20 cores, plus as much single-thread performance as I can get. I do use OpenVINO, and I would be able to leverage the features (AMX, AVX-512) and software ecosystem that Intel offers, so I’m sort of leaning W790. That said, I’m not super familiar with the options here, hence this post. This box will run Linux.
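Since the box will run Linux, checking whether a given CPU actually exposes AMX and AVX-512 is easy: the kernel lists feature flags in /proc/cpuinfo (e.g. avx512f, avx512_vnni, amx_tile, amx_int8 on Sapphire Rapids). A minimal sketch, with a hypothetical helper name of my own choosing:

```python
# Sketch: report which ISA extensions are missing on this machine,
# by parsing the kernel's `flags` line from /proc/cpuinfo.

def missing_features(flags_line: str, wanted: set[str]) -> set[str]:
    """Return the subset of `wanted` CPU flags absent from a flags line."""
    present = set(flags_line.split())
    return wanted - present

if __name__ == "__main__":
    # Flag names as the Linux kernel spells them.
    wanted = {"avx512f", "avx512_vnni", "amx_tile", "amx_int8"}
    flags = ""
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                flags = line.split(":", 1)[1]
                break
    absent = missing_features(flags, wanted)
    print("all present" if not absent else f"missing: {sorted(absent)}")
```

Handy for sanity-checking a secondhand or refresh SKU before committing, since AMX in particular is limited to recent server/workstation parts.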
In summary, I think the top factors for platform selection are cost, IPMI, single-thread performance (for Python), and the ability to keep 4 GPUs fed. Next-tier factors are the availability of extensions like AVX-512 and AMX.
Input is appreciated. Happy new year!