Server build for AI workloads

Hello, as recommended by @BKV I am opening a separate topic to address my numerous dilemmas concerning the build of a server dedicated to AI usage.

The specific purpose of the build is deep learning work, e.g. training GANs and training NLP models (mostly fine-tuning larger ones).

The budget I am considering is around 2000€ at most for the motherboard and CPU, though I would like to spend far less if possible (I am from Europe).

I will gradually upgrade the hardware, so I will probably start with 2 RTX 3090s and work my way toward an A6000 and others. At the moment I own a Thermaltake View 51 case and a decent PSU (it will be upgraded as I add GPUs…).

I have a 1 Gbit internet connection at the location, so around 120 MB/s download speed, and I would like to make the system as energy-efficient as it can be.

I will most certainly run large workloads (in whatever sense we consider it).

I will probably be the only user of the server for the foreseeable future (~1 year).

Total cost will be more than 10k€ by the end of next year (with gradual improvements)…


  1. Why are new server builds (from stores) so expensive compared to the ones ‘we’ build using PCPartPicker, for example?
  2. What is the catch with PSUs over 2 kW?
  3. What is the catch with different-size PCIe slots on motherboards? Why do some (premium) motherboards have all slots at x16, if a riser cable should nullify the performance difference anyway?
  4. Is it worth paying extra hundreds of euros for those premium-looking motherboards, or is something like the H13SSL-N (for Epyc Zen 4) more than enough? Will I see a notable difference between mid-range and high-end motherboards for my specific purpose?
  5. Is it worth going for the newest CPU generation, or is the better trade-off an older one in the same price range with, e.g., more cores (for example, a 16c/32t Milan costs the same as a 32c/64t Rome), for my specific purpose?
  6. What about ordering CPUs from China? What are the risks? Is it worth it?
  7. How big an impact does AVX-512 have when comparing CPUs (in % if possible)? Is AVX2 enough (when comparing the Genoa series with Milan, which is far cheaper at the moment)?
  8. What is the optimal number of CPU cores per GPU, e.g. for loading batches, or for AI work in general?
  9. Do I get pooled memory when I use NVLink between GPUs?
  10. Are these (server) CPUs overclockable, in case I need some extra MHz?
  11. How does the Genoa 9124 compare to a W5-2465 (16c/32t)? How about third-gen Intel Xeon? How about a generation from 2020?
  12. What is the optimal bang for buck at this point, within my price range, for this specific purpose?
  13. What about riser cables with PCIe 5.0?
  14. Cooling? Is water cooling absolutely necessary? Any alternatives?

Thanks in advance for the help.


I cannot answer everything, but I can certainly cover a few important ML/AI topics.

There is a ton of good info on the internet about speccing a deep-learning machine for any use case. But in short, I can tell you:


  • 7800X
  • 64 GB RAM
  • 1 TB NVMe
  • RTX 3060 12 GB

The 3060 is perfect for training quite a lot of models. If you really want to dive into NLP, a 3090 with 24 GB is a must. There are cheaper alternatives with older 32 GB cards like the V100 (eBay). I would stay away from any experiments with AMD; it works, but the topic is already difficult with Nvidia, so let’s focus on that. The price-to-performance sweet spot is probably the 4080 16 GB at the moment.

I run a powerful rig and work on ML topics a couple of days a week. Cloud services are nice once you have your model at a point to train it for production, but a lightweight workstation is perfect for experiments. I wouldn’t spend money on an Epyc or Xeon W-34xx unless you really need to. There is just no point.

If you elaborate in detail on what you want to do, I can point you in the right direction.


Thank you for your answer @Styp. As I wrote in the post, I am planning on fine-tuning NLP models, training GAN models from scratch, and so on (preprocessing of large datasets included). The reason I am considering a CPU with more PCIe lanes is multi-GPU usage. I already own an RTX 3090, but unfortunately it is not enough for many tasks, so I plan on moving to an A40 or A6000s in the future…

2000€ is very tight for a HEDT system. Does that budget include memory? A 16-core Threadripper or Xeon W is over 1000€, and the motherboards are 800–1000€… Either you‘d need to go second-hand, or go for a consumer platform with two x8 slots and 2 GPUs.

The 3090 is not enough in which sense? GPU memory? Performance? Are two 3090s with NVLink an option?


Yeah, you need to get more scrappy here.
I’m sure we can come up with some wacky solution that will absolutely work as a starting point… but 2000€ isn’t going to get you anything NEW that’s worth buying, IMO.

But you haven’t really defined your workload and what you will be doing.


Well, I am okay with used components, so we can take that into account. The RTX 3090 is not enough because it does not have enough memory; for example, adding several residual blocks blows through its memory. I plan on going to several GPUs, which should give me at least 50% savings over buying A6000-class GPUs, at least until I get more cash.

In my experience, if a 3090 doesn’t cut it, things get complicated.
Multi-GPU training only makes sense if you can split the gradient calculation into mini-batches; everything else is a nightmare of a headache.

And on a side note: x8/x8 for dual 3090s is enough. The models are so big that the bandwidth limitation is not an issue, because most of the time is spent calculating the gradient, not loading the data. Maybe a 5–10% performance impact, and another couple of percent for limiting the 3090 to 200–250 W, because the thing gets seriously hot.
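A minimal sketch (my own illustration, not the poster's setup) of why that mini-batch split works: the full-batch gradient equals the average of the per-shard gradients, so each GPU can compute its shard's gradient independently and the results are combined with one all-reduce. Shown here with a plain least-squares loss in NumPy rather than a real framework:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 4))      # one global batch of 64 samples
y = rng.normal(size=64)
w = rng.normal(size=4)

def grad(Xb, yb, w):
    # gradient of 0.5 * mean((Xb @ w - yb)**2) with respect to w
    return Xb.T @ (Xb @ w - yb) / len(yb)

full = grad(X, y, w)

# split the batch across two "GPUs" and average the shard gradients
shards = np.split(np.arange(64), 2)
averaged = np.mean([grad(X[s], y[s], w) for s in shards], axis=0)

print(np.allclose(full, averaged))  # True: identical update either way
```

This is exactly the property data-parallel frameworks (e.g. PyTorch's DistributedDataParallel) rely on; the PCIe/NVLink traffic per step is only the gradient exchange, which is why x8/x8 is usually fine.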

Yes, I am aware of that. But for the sake of argument, suppose I want to train a ResNet-level network on my system. I don’t think I will go any further without using the cloud, at least for now. Anyway, the GPUs are not the main topic right now. I am looking for a CPU capable of supplying that small cluster of GPUs with data in reasonable time. If I understood your last sentence correctly, you are implicitly saying that I could go with a cheaper PCIe 3 CPU/motherboard and not lose too much performance for my task? So, for example, something like an Epyc Rome CPU may be enough?

That entirely depends on what you mean by “supplying GPUs with data”. If you have all feature engineering done (i.e. you just load a dataset from disk), you just need an NVMe SSD and any kind of semi-modern CPU. If you need to do more feature engineering, massaging and prepping data on the CPU, you need something beefier.
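For the "feature engineering is done" case, here is a minimal sketch (my own illustration; `load_batch` is a hypothetical stand-in for reading one preprocessed file from an NVMe SSD) of the usual pattern: one background thread streams ready-made batches into a small queue while the consumer, the GPU in real life, drains it:

```python
import queue
import threading
import time

def load_batch(i):
    time.sleep(0.01)          # pretend this is one NVMe read
    return f"batch-{i}"

def reader(n, q):
    for i in range(n):
        q.put(load_batch(i))  # blocks when the buffer is full
    q.put(None)               # sentinel: no more batches

q = queue.Queue(maxsize=4)    # small buffer keeps the disk one step ahead
threading.Thread(target=reader, args=(8, q), daemon=True).start()

consumed = []
while (b := q.get()) is not None:
    consumed.append(b)        # the GPU would crunch the batch here

print(len(consumed))          # 8
```

The point of the small `maxsize` is that the reader never races far ahead: it only needs to hide disk latency, which is why a semi-modern CPU is plenty when there is no heavy per-batch compute.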

I’d say experiment on the hardware you have or can easily afford and take it from there. You can make DL models as large or small as you want, just by choosing the number of parameters. Take the LLaMA models, for instance, which come in versions from 7B to 65B. It does not make sense to scale to the largest models unless you are making money. If you want to learn: experiment with the smaller models, build some experience, and get a job or a gig where your employer provides hardware or cloud expenses.

Yes, that is exactly what I mean. Every slightly more advanced project requires a large amount of data, and with a large amount of data there is a lot of cleaning etc., all done on the processor. I did a small project earlier and my preprocessing literally took days to complete. I want to be able to parallelize that across multiple threads to make my work faster.

I am curious to see a use case where you are NOT IO-limited and you manage to starve an RTX 3090 using a 7800x. Seriously, either you have an IO problem or a software problem.

Yes, PCIe 3.0 is fine. The SSD is the only thing that might become an issue, but in the end, latency is often the problem on the IO side, not throughput: you load single files from the disk.
In the end, get the CPU that is the best value; it doesn’t matter as much as we think it does. How much data are you trying to process? And is the encoding correct, or are you artificially inflating the whole thing before loading it to the GPU?

I am not concerned about that; I am mostly concerned about the questions I asked above. I can get a ‘Threadripper Pro 3955 + ASUS motherboard’ or an ‘Epyc Milan 7313 + Supermicro motherboard’ for 1.5k at the moment, but I am wondering whether that is too much for a start. I definitely feel that I need more slots for GPUs, so consumer hardware is out of the question. A used Epyc Rome (7302) + Supermicro motherboard is at least 400€ cheaper, which I could put toward extra RAM and storage. There is also an option with an Intel Xeon (PCIe 3) and an ASUS motherboard, also at 1.5k. All of them 16c/32t. So the older generation should not pose any real issues?

It really doesn’t matter. The only thing I am curious about: in which case do you want to squeeze more than 2 GPUs, 3 at most? A 3090 is at least 2.5 slots wide!

Thermaltake View 51

My build thread based on the Epyc 9124 is here:

The CPU is $1200.
The motherboard is $800.
32 GB sticks of RAM (parted out from 25-stick kits) are around $80 each.
You need a boot drive. You may have parts, or you may need to buy something. Don’t make it USB; SATA is fine. I got a pair of Optanes, and recently a larger data-store SSD after using spinning rust for a few weeks. I was able to see that yes, it was the bottleneck.

Get a good power supply that can power your CPU and motherboard. The motherboard requires two 8-pin connections in addition to the ATX+4.
If you need more wattage than your power supply can accommodate, you can use a power-supply combiner cable. It uses a few pins on the ATX cable to establish ground and one pin to determine whether the computer is on or off. After that, add a power supply for every other GPU.

1 terabyte at 1 gigabit is 8000 seconds, or a bit over 2 hours; at 10 gigabit it is 800 seconds, or under 15 minutes. Many models are 1 TB. How patient are you?
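The arithmetic above as a tiny helper (sizes in bytes, links in bits per second); it ignores protocol overhead, so real transfers run a bit slower:

```python
def transfer_seconds(size_bytes, link_bits_per_second):
    # a byte is 8 bits on the wire
    return size_bytes * 8 / link_bits_per_second

TB = 10**12
print(transfer_seconds(1 * TB, 1e9))   # 8000.0 seconds at 1 Gbit (~2.2 h)
print(transfer_seconds(1 * TB, 1e10))  # 800.0 seconds at 10 Gbit (~13 min)
```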
The NT model with dual 10 Gb Ethernet does not take PCIe lanes away from anything on the N model: 108 PCIe lanes are used on the NT model, 104 on the N model. But be aware that drivers for that 10G chipset only exist for Linux and Windows Server; there are no Windows 10/11 drivers.

You can get x8 PCIe riser cards, or just get cables from the OCuLink ports to PCIe slots, like:

I know there are others that don’t require an ATX cable. I used a similar one for about 10 years that brought the PCIe M.2 out of an Intel NUC so that I could use an Nvidia GPU. The Genoa Supermicro Epyc motherboard has 3 PCIe 5.0 x8 OCuLink ports.

In total you could theoretically connect 8 GPUs at 8x or faster.

I just got an SSD because the HDD was taking 6 minutes of disk IO per 1 minute of GPU compute just to load the models for Stable Diffusion. I got:

A 6.4 TB Intel D7-P5600 for $369. This is the PCIe 4.0 x4 high-write-endurance model, rated 3 DWPD for 5 years. And it was the equivalent price of a Samsung 980 for the amount of capacity.
I also got the U.2-to-OCuLink cable for the motherboard. This drive should load the model in 2 seconds instead of 6 minutes.

BTW the bandwidth to main memory on the Genoa is 460GB/sec. On the previous models it was 80GB/sec max.


Hello, thank you for your reply. However, I have already decided to go with a Milan build.
CPU) Epyc 7313, 800€
MOBO) H12SSL-I-O, 600€
RAM) 2×32 GB non-ECC 3200 MHz, 160€
PSU) be quiet! Pure Power 12 M 1000 W, 150€
CASE) Thermaltake View 51, 200€
CPU_FAN) Noctua NH-D15, 90€

I am aiming for a 7373X from Milan-X in the future, or similar.
Also, I am not quite sure how to fit 2 PSUs in my case, but I will have to come up with something soon. The thing is, I do not believe the CPU is that important at this point; in other words, a 5% improvement in GPU performance outweighs 15% on the CPU, at least for my use case. Do you see any possible bottlenecks there?

That cooler is not compatible.

Yeah, it is the NH-U14S. Typo.

The NH-U14S TR4-SP3 would be a better option. The NH-U14S (non-TR4/SP3) is meant for smaller chips, like Ryzen and Intel (aka non-HEDT).

Well, I also considered water cooling, since my main idea is to push those clock speeds as high as they can go, but I haven’t found any viable water cooling that fits SP3, so this Noctua seems to be the second-best option.