Help choosing an ML-ready GPU

Hello!

I’m trying to get into the ML field as a programmer and user of existing tools/models.
However, choosing a GPU feels very overwhelming. :frowning:
From my understanding, the more VRAM, the better. But the card’s FP8/FP16/FP32 and even FP64 performance counts, too.
After watching Wendell’s video showcasing the Nvidia RTX 4000 Ada (“TINY 70 watt RTX 4000 ADA GPU For AI, Docker, Plex, Jellyfin, and MORE!”), I’m considering getting one. However, its price is fairly close to the AMD PRO W6800, which has 12 GB more VRAM. There’s also the newer AMD PRO W7800, which offers far higher TFLOPS than the previous-generation card.

I can’t find any resources about the W7800 or any benchmark for its performance in the ML space.

Should I use an AMD GPU or stick to the Nvidia offering?

I also have an EPYC 7252 and about 96 GB RAM (ECC) available for this type of project/machine.

Thank you for your help figuring this one out.

Unless you need the pro features (ECC RAM, … ) or the combination of low TDP with a lot of VRAM, get the 4090 for about the same price, or a second-hand 3090 to save money and still get 24 GB of VRAM.

Believe it or not, the 4090 is more expensive than the W7800 where I live.
And it has less VRAM; not by much, but still 8 GB less.

So, it’s also more of a CUDA vs ROCm question.
I see ROCm improve every month, which gives me some hope.

If Wendell ever sees this thread, here’s a video idea: A primer on getting into AI from the hardware perspective.

Thank you!

Go for CUDA. ROCm is improving, but support depends on the application. If you want to learn and experiment, you don’t yet know what you’ll settle on, and CUDA will give you more freedom: everything will run on CUDA for sure.
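
To show what I mean by “everything will run on CUDA”, here’s a minimal sketch assuming PyTorch (whose ROCm build happens to expose the same `torch.cuda` API through HIP, so the identical code is what you’d run on AMD too, if your card is on the support list):

```python
# Minimal device check; assumes a recent PyTorch install.
import torch

def pick_device() -> torch.device:
    """Return the first visible GPU, otherwise fall back to CPU."""
    if torch.cuda.is_available():
        print(f"Found GPU: {torch.cuda.get_device_name(0)}")
        return torch.device("cuda:0")
    print("No GPU visible to PyTorch, using CPU")
    return torch.device("cpu")

device = pick_device()
x = torch.randn(1024, 1024, device=device)  # allocate a test tensor on the chosen device
print((x @ x.T).sum())                      # small matmul to confirm kernels actually execute
```

On Nvidia this works out of the box with any recent driver; on AMD it depends on whether your specific card and OS are covered by ROCm.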

While more VRAM is better, you probably don’t need 32 GB or even 24 GB just to get started and learn. If you want to get a job in the field, knowing how to run workloads in containers and cloud VMs is also a very valuable skill to have, so you could split your budget between a cheaper card for prototyping and cloud credits.
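
To put rough numbers on “you don’t need 32 GB”, here’s a back-of-envelope sketch (my own hypothetical helper, not from any library) of the VRAM needed just to hold model weights at different precisions; activations, KV cache, and (for training) optimizer state come on top of this:

```python
# Rough VRAM for model weights alone: parameters * bytes per parameter.
# Hypothetical helper for illustration; real usage is higher once activations,
# KV cache, and optimizer state are included.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1, "int4": 0.5}

def weight_vram_gib(params_billion: float, dtype: str) -> float:
    return params_billion * 1e9 * BYTES_PER_PARAM[dtype] / 2**30

for dtype in ("fp32", "fp16", "int8", "int4"):
    print(f"7B model @ {dtype}: ~{weight_vram_gib(7, dtype):.1f} GiB")
# ~26.1, ~13.0, ~6.5, ~3.3 GiB respectively
```

So a 16–24 GB card already covers a lot of inference and fine-tuning experiments with quantized models, which is why I’d rather split the budget than sink it all into the biggest card.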


I agree wholeheartedly with all of this. As someone who got into learning ML nine months ago and had to make the CUDA/ROCm decision, it came down to whether I wanted to spend more of my time learning ML and less of it worrying about hardware/software compatibility, which is why I chose CUDA. We all want ROCm and AMD to succeed here, but I don’t think that means we as newcomers to the industry need to be part of that process yet. Also, knowing CUDA is currently the more valuable job skill, since Nvidia hardware is just more ubiquitous.

What are you looking to run?

  • large language models
  • reinforcement learning
  • stable diffusion

CUDA and a lot of VRAM are good for beginners;
you can also rent GPUs by the hour to start.
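
To put rough numbers on the rent-vs-buy trade-off (the prices here are placeholders, check current listings and cloud rates yourself):

```python
# Placeholder numbers for illustration only; substitute real local prices
# and the hourly rate of whichever GPU-rental service you'd actually use.
card_price_usd = 1600.0   # hypothetical upfront cost of a 24 GB card
rent_usd_per_hour = 0.50  # hypothetical hourly rate for a comparable rented GPU

break_even_hours = card_price_usd / rent_usd_per_hour
print(f"Break-even after ~{break_even_hours:.0f} GPU-hours")  # ~3200 h with these numbers
```

If you’re only experimenting a few hours a week at first, renting stretches the budget a long way; once the card would be busy most days, owning wins.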

NVIDIA has made this harder and harder to achieve as they try to separate their consumer/workstation cards from the datacenter cards. I bought a 4090, but I really have a hard time calling a ~$2000 GPU a ‘consumer’ card, especially considering that, depending on the task, it performs as well as the Quadro-class card that costs 4x as much and has 2x the VRAM, yet still uses GDDR6X on the same size bus. The 24 GB consumer card from AMD would be nearly half the price.

So when you look at NVIDIA’s lineup, the price-to-performance and VRAM just aren’t there. But they have the complete ecosystem right now, so for most individuals and businesses it is the best solution. Hopefully that tide is turning with the maturation of Rocm, Infinity Link, etc.

Wishful thinking doesn’t fix the software; it’ll take someone buying the hardware and doing the work… because AMD didn’t bother to give ML projects CI access to the hardware.

Case in point: it’s been more than a year and the flash-attention repo still has no support for the 7900 XTX.

The split between RDNA and CDNA was a dumb move that has personally caused me a bunch of software-support issues.

Triton only has WMMA support in a branch, benchmarking around 44 TFLOPS versus roughly 150 TFLOPS on the 4090.

Outside of the MI250 and MI300, I wouldn’t waste my energy on ML with AMD.


I agree with everything you said.

The software support AMD is giving is focused on the CDNA/Instinct series. Their consumer cards might give more VRAM per dollar, but the software support isn’t there, and they don’t offer more than 24 GB on any consumer-level card either.

NVIDIA gives software support across all of its cards. A good move, since anyone can get started with CUDA cheaply and move up when they go professional.

Very valid point; I don’t understand why they ever split them. I haven’t experienced this firsthand, but it was another reason I bought an NVIDIA card.

I agree. I do think people have to buy the hardware and work on the open-source software; I just don’t think it needs to be those of us who are newer to the industry. That’s what I was trying to say in an earlier post.
