Cheap used GPUs for multi-GPU hobby ML rig

So I’m building an ML server for my own amusement (I’m also looking to make a career pivot into ML ops/infra work).

Most of what I do is reinforcement learning, and most of the models that I train are small enough that I really only use the GPU for calculating model updates. Model inference happens on the CPU, and I don’t need huge batches, so GPUs are somewhat of a secondary concern in that context.
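In case it helps frame things, here’s a toy sketch of the kind of loop I mean (PyTorch with fake rollout data, not my actual code): rollouts and inference stay on the CPU, and only the gradient step touches the GPU.

```python
# Minimal sketch of the CPU-rollout / GPU-update split described above,
# using a toy policy network and fake data (illustrative only).
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
policy = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 4))
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

# Rollout phase: the policy runs on the CPU alongside the (CPU-bound) simulator.
with torch.no_grad():
    obs = torch.randn(2048, 8)               # stand-in for environment observations
    actions = policy(obs).argmax(dim=-1)     # greedy actions, purely for illustration
    returns = torch.randn(2048)              # stand-in for computed returns

# Update phase: ship the batch and the policy to the GPU just for the gradient step.
policy.to(device)
obs, actions, returns = obs.to(device), actions.to(device), returns.to(device)
logp = torch.distributions.Categorical(logits=policy(obs)).log_prob(actions)
loss = -(logp * returns).mean()              # vanilla policy-gradient-style loss
optimizer.zero_grad()
loss.backward()
optimizer.step()
policy.to("cpu")                             # back to the CPU for the next rollout
```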

That said, I have some cash left over in my budget, and I’d like to throw in an array of 4-8 GPUs for supervised learning. I kind of want a mix of VRAM size and performance, as I expect that when not doing RL, I’ll be wanting to play around with LLMs, diffusion models, and a few other “transformery” things.

I’ll also be giving some remote friends access to this rig via Jupyter, so being able to fork off a GPU into a VM or container is appealing.
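For what it’s worth, the crude version of “forking off a GPU” I have in mind is just masking devices per user process (or the container-level equivalent, e.g. Docker’s --gpus flag), roughly:

```python
# Hypothetical per-user launcher: pin a remote user's Jupyter kernel to one
# physical GPU by hiding the others before anything initializes CUDA.
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "2"   # e.g. this user only sees physical GPU 2

import torch                               # imported after setting the mask

print(torch.cuda.device_count())           # -> 1; the card shows up as cuda:0
```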

My absolute max budget for this is around $3,100 USD, though I’d like to keep to around $2,500. I live in NZ, so figure around $150 or so for shipping, plus 15% markup due to import taxes and brokering fees.

At the moment I’m debating between the A4000 and MI25 cards. Pros on both are that they’re cheap and they have fast VRAM. Unfortunately neither supports memory pooling via NVLink or Infinity Fabric, but cards that do are either too power hungry, or too far out of my price range.

I did have a look at the A5000 and MI50/MI60 cards in an attempt to get something that could do memory pooling. The A5000 is still a bit too expensive, and it’s not clear whether Infinity Fabric Link is actually supported on anything earlier than the MI100, even though the physical connector is there. Also, the only four-GPU Infinity Fabric Link bridge card that I could find was like $600 USD (same price as the only two-GPU one that I found).

I also had a look at Pascal and even Maxwell Tesla cards. The P40 is cheap as chips, but it also doesn’t support NVLink, and it doesn’t have quite the performance that I’d like. I could go P100 for NVLink, but then I’d be in a price range that’s similar to used A4000s, and that perf boost looks appealing.

I’m currently leaning heavily toward A4000, but the price of used MI25 cards is really appealing. Perf per watt isn’t great, but I could learn to live with it.

Most of the feedback that I’m getting from other hobbyist ML practitioner friends seems to be to avoid team red, but I have a feeling that I may get more learning framework internals XP out of taking the hard path with ROCm.

Does anyone on this forum have a good amount of experience with both vendors? If so, what would you advise?

Also in general, is there anything that I should be considering that I’m not? If it helps any, I’m a fairly experienced software engineer (20+ years professional experience) with a fair amount of time in the saddle in ops and systems engineering roles.


Depends on the type of ML you’re doing; Stable Diffusion has something that compresses RAM usage on Nvidia cards.

AMD does have more VRAM for the price; you can even get a 32GB version.

Specs aren’t everything though; a Radeon VII is a lot faster at Stable Diffusion despite having roughly the same FLOPS.

Yeah, so it’ll be a pretty broad mix of things. There are some specific models that I want to train for reinforcement learning stuff (policy pretraining, classifiers for use in reward functions, etc), but otherwise it’ll likely be a hodgepodge of generative models.

There are a lot of efforts going on at the moment around things like quantization, hypernetworks, etc., and all of that is definitely lowering the barrier to entry, but I think I’m mostly looking for the Pareto-optimal setup that maximizes performance and total VRAM while still fitting in my budget.

Oh, I wasn’t aware of that. That’s a really interesting tip, thanks!

I see from your tags that you’ve worked with the Instinct cards? If so, how difficult/annoying/cumbersome did you find the process of setting up ROCm and the major learning frameworks as compared to nVidia cards? Have you found the software to be stable enough once it’s set up? Do updates happen frequently and are they smooth/easy to keep up with? Have there been any major performance changes over time with driver or ROCm updates?

Finally, on the whole - if I told you I am planning to grab a pile of MI25s for the above purposes instead of the A4000 cards, would you think that’s a good idea, bad idea, or something in between?

So the MI25 might be towards the end of ROCm support; it’s technically not supported anymore, but still works on at least 5.2.
PyTorch recently updated to 5.4.2, so I need to test that.

CUDA just works; it’s kinda wild how easy it is.
With ROCm you need a hyper-specific kernel version (listed in the AMD docs) for the driver to install completely and correctly.

The docs keep switching portals, so finding documentation is sometimes a pain.

If you aren’t a Linux Jedi master (I’m not), the hyper-specific kernel requirement can keep you on really old OS versions, like Ubuntu 20.04.4 for ROCm 5.2.

As someone who has dabbled on both sides, I absolutely don’t get why people say Nvidia’s Linux drivers are trash when AMD’s exist.

I made a guide for mi25

Not sure; I literally learned how to use Linux and ROCm about 4 weeks before I made that guide.

stable diffusion has been

So the A4000 is just below the line where Nvidia stops gutting cards of features: its FP16 isn’t unlocked and its double precision is gimped.

I am not sure how it compares in the real world against a Titan RTX.

It’s about $175 more, at $675, and has 24GB and a load of tensor cores, but supposedly gimped double precision.

The 3090 is also about the same price at $680, and stronk.

here is a handy AI benchmark site to compare

I think @oO.o has an A4000 for ML.


I didn’t even know that Radeons were legit contenders for Stable Diffusion. Last time I checked it was mostly CUDA or bust (OK, Apple Silicon and ONNX now too, but in general those always seem to be more work).
I’ll have to search for some more pointers, I guess. But the kernel acrobatics would turn me off.
Just splurged on an RTX 4090, but affordable alternatives are highly appreciated, especially with a decent amount of VRAM.

I think that rules it out for my purposes, then. I am somewhat of a Linux Jedi (former kernel dev from back in the bad old 2.6 days), so the need to run a specific kernel version doesn’t really put me off, but with the money that I’m spending on this rig I want to be sure that I’ll be able to run newer models and ML framework versions for a while to come.

Yeah, I did briefly consider a quad 3090 setup, but then I remembered that they’re triple width cards. Unfortunately I’d only be able to fit at most two in this system (Gigabyte MZ72-HB0 in a 4U Rosewill chassis). I could get around that with an external GPU chassis (e.g. a Cubix xpander), but that would eat dramatically into the budget. Unless I’m mistaken, I think the only other option to fit four triple width cards would be creative application of risers and heavily modding my chassis.

Those tensor cores do make it tempting, though.

Oh crap, this reddit post just made me face palm. The obvious way around this is to water cool them. I’ll have to do some shopping and see if that’s feasible, though.

May also need a larger PSU. Right now I’ve spec’d a 1200W one, as I initially wasn’t planning on dumping a bunch of GPUs in this rig. I may need to bump that up to 2kW just to be safe against transients.

Definitely something to think about.

Gigabyte had a blower-style 3090 that could be a contender, but I think Nvidia pulled some strings and stopped Gigabyte from selling those models.

Dr. Kinghorn at Puget Systems ran a quad 3090 setup with those blower cards. He had to tune the cards down to 180W using nvidia-smi to keep them in check.

I am sure some Chinese vendor has a blower cooling solution for the reference 3090 cards.


Memory pooling isn’t really a thing in either major ML framework (Keras/TF or PyTorch); you can either split your batches across your GPUs (effectively multiplying your batch size during training) or, if possible, split parts of the model across them, so I wouldn’t worry about that. Also, NVLink won’t net you much benefit if you only have 2–4 GPUs.
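To illustrate the batch-splitting option, here’s a rough single-process PyTorch sketch (nn.DataParallel is the lazy route; DistributedDataParallel is the recommended one but needs more boilerplate):

```python
# Rough sketch of data parallelism: the same model is replicated onto every
# visible GPU and each replica processes a slice of the batch.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10))
model = nn.DataParallel(model).cuda()    # replicas on all visible GPUs

x = torch.randn(256, 1024).cuda()        # one big batch...
y = model(x)                             # ...scattered across the GPUs, outputs
                                         # gathered back onto cuda:0
loss = y.sum()                           # placeholder loss, just to show backward
loss.backward()                          # gradients reduced onto the master replica
```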

Don’t go for anything pre-Turing; tensor cores are amazing perf-wise for ML and also let you easily use FP16, which effectively halves your VRAM consumption and makes things go vrooom. Even my previous 2060 Super was able to beat a P100 thanks to its tensor cores (as long as the model fit within its VRAM, ofc).
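For reference, mixed precision is only a few extra lines in PyTorch these days; a rough sketch:

```python
# Sketch of automatic mixed precision (AMP): matmuls/convs run in FP16 on the
# tensor cores, and a GradScaler keeps small gradients from underflowing.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 10)).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(128, 512, device="cuda")
target = torch.randint(0, 10, (128,), device="cuda")

with torch.cuda.amp.autocast():          # ops inside run in FP16 where it's safe
    loss = nn.functional.cross_entropy(model(x), target)

scaler.scale(loss).backward()            # scale the loss, then backprop
scaler.step(optimizer)                   # unscale grads, skip the step on inf/nan
scaler.update()
```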

ROCm is a pain to get working, and IMO you’d have way too many headaches trying to run anything not as popular as Stable Diffusion. However, my experience is limited to when I had an RX 480, so your mileage may vary.

I have a 3090 and can recommend it a hundred times. 24GB of VRAM, cheap if you manage to get one used, and there are blower models as mentioned above. Power limit them to 250–275W (that’s what I did with mine) and you get awesome perf/W.
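The power limit itself is just an nvidia-smi call (needs root, and it resets on reboot unless you reapply it, e.g. from a systemd unit); a quick Python wrapper if you want to hit every card at once:

```python
# Hedged sketch: cap every visible NVIDIA GPU to 275 W via nvidia-smi.
import subprocess

POWER_LIMIT_W = 275   # whatever perf/W sweet spot you settle on

# List the GPUs, then apply the limit per index.
gpus = subprocess.run(["nvidia-smi", "--list-gpus"],
                      capture_output=True, text=True, check=True)
for index in range(len(gpus.stdout.strip().splitlines())):
    subprocess.run(["nvidia-smi", "-i", str(index), "-pl", str(POWER_LIMIT_W)],
                   check=True)
```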


Renting GPUs or Google Colab can be a quick way to do machine learning!


I do, and no complaints. AFAICT it’s the cheapest way to get 16GB of VRAM on a single card within the past couple of Nvidia generations. My main need was fine-tune training of Real-ESRGAN, which used very nearly all of the 16GB (in my system the A4000 is not the primary GPU, so all of its VRAM is available for ML). Training on ~1000 images took about 8 days uninterrupted, with good results.

My A4000 was around $500 USD shipped from eBay, most likely from a mining operation. I recommend looking for mining cards sold in large quantities from sellers with good ratings as they are less likely to have performed extreme overclocks or used some hacky cooling solution.

I went with 5 A4000s. They’ll hopefully be here this week.

In the end it came down to 3090s vs A4000s. With my motherboard I’d only have been able to fit at most two 3090s, unless I went with single-width water cooling blocks, but that bumped up the price tremendously. The A4000s are single width, only consume 140W each, have ECC, and IIRC the aggregate performance and tensor core count came out similar to 3x 3090s (although in practice I imagine scaling inefficiencies will have the array performing a bit worse). It’s also a touch more total VRAM, which is nice considering that I’ll likely be doing mostly LLM fine-tuning and inference in the rare cases when I actually manage to load down the full array.
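For anyone curious, here are the back-of-the-envelope numbers I used (spec-sheet values from memory, so treat them as approximate):

```python
# Rough aggregate comparison: 5x A4000 vs 3x RTX 3090 (approximate spec values).
a4000 = {"fp32_tflops": 19.2, "tensor_cores": 192, "vram_gb": 16, "tdp_w": 140}
rtx3090 = {"fp32_tflops": 35.6, "tensor_cores": 328, "vram_gb": 24, "tdp_w": 350}

for name, card, count in [("5x A4000", a4000, 5), ("3x 3090", rtx3090, 3)]:
    print(f"{name}: ~{card['fp32_tflops'] * count:.0f} TFLOPS FP32, "
          f"{card['tensor_cores'] * count} tensor cores, "
          f"{card['vram_gb'] * count} GB VRAM, "
          f"{card['tdp_w'] * count} W board power")

# 5x A4000: ~96 TFLOPS FP32, 960 tensor cores, 80 GB VRAM, 700 W board power
# 3x 3090:  ~107 TFLOPS FP32, 984 tensor cores, 72 GB VRAM, 1050 W board power
```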

Thanks for the insights, everyone!


How’s the machine/project going now? Did you get 5xA4000 set up?

Yes, I got it set up, although I ran into some cooling issues with the ethernet controller, as it’s located between the edge of the board and the PCIe slots. I have a post about it, and how I resolved that issue here: How to actively cool the ethernet controller on a Gigabyte MZ72-HB0 v3 motherboard?

I’m also running into some cooling issues with the Dynatron A39 chillers. I’m just about to post another thread on that topic. I’ll reply here again with a link when I do.

P40 for $190-ish
Basically a Titan Xp with 24GB of VRAM

I’m all set now, but good to know for anyone else finding this thread later on.

Although, in saying that, bear in mind that if someone wants to run any of the new FOSS LLMs out there, the P40 cards lack tensor cores and therefore can’t do mixed-precision tensor work like FP16, FP8, etc. This means (I think) that you’ll need twice as much VRAM to run the same model on Pascal as you would on architectures with tensor cores (Volta, Ampere, etc). For the larger models this also means you’ll need to run the model across more cards (usually twice as many), and the resulting additional bus overhead will slow things down that much more, not to mention the additional power consumption.
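To put rough numbers on that, model weights alone scale linearly with bytes per parameter (this ignores activations, KV cache, and framework overhead, so real usage will be higher):

```python
# Weights-only VRAM estimate at different precisions (illustrative only).
def weights_gb(params_billion: float, bytes_per_param: int) -> float:
    return params_billion * 1e9 * bytes_per_param / 1024**3

for params in (7, 13, 30):
    print(f"{params}B params: ~{weights_gb(params, 4):.0f} GB in FP32 "
          f"vs ~{weights_gb(params, 2):.0f} GB in FP16")

# 7B params:  ~26 GB in FP32 vs ~13 GB in FP16
# 13B params: ~48 GB in FP32 vs ~24 GB in FP16
# 30B params: ~112 GB in FP32 vs ~56 GB in FP16
```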

But all of that said, if you’re on a budget and messing around in a home lab, there’s still lots of fun to be had with these older Tesla cards.

Finally got around to creating that thread. It can be found here: Notes/Build log: Dual Epyc 7773X DRL server thermal issues