A lot of these new LLMs require huge amounts of VRAM to run because of their parameter counts. There are ways around this, like offloading to system RAM or even disk, but those end up being incredibly slow compared to having the entire model in memory, and any training/fine-tuning has to be done with the entire model in VRAM (at least for now). Right now I’m getting by with a 3070 and a 3080 together, which gets me to 18GB of VRAM, with fun hacks to get the models to work across multiple cards.
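For reference, the multi-card “hacks” I’m talking about are mostly just letting the loader spread layers across devices and spill whatever doesn’t fit into system RAM. A rough sketch of what that looks like with Hugging Face transformers/accelerate (the model name and per-device memory caps here are just placeholders for a 3080 + 3070 setup, adjust for your own cards):

```python
# Rough sketch: splitting a model across two GPUs and spilling the rest
# to system RAM. Model name and memory caps are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-13b"  # placeholder, any causal LM on the Hub

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",                                   # accelerate places layers per device
    max_memory={0: "9GiB", 1: "7GiB", "cpu": "32GiB"},   # leave headroom on each card
    torch_dtype="auto",
)
```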
Is it worth picking up something like a K80 for $80 or a P40 for $200 off of eBay just to get to 24GB of VRAM, or is buying a 6- or 10-year-old enterprise card just going to be a dead end? 24GB seems to be a nice sweet spot: it doesn’t cost too much but can still run the 13B parameter models at 8-bit quantization. The alternatives seem to be paying Google Colab $10 for 100 “credits” (which seem to be pretty nebulous in value based on some looking around) or going to a cloud provider like Linode and paying $1.50 an hour for an A100.
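For what it’s worth, the back-of-envelope math behind the “24GB sweet spot” claim, counting weights only and ignoring KV cache and activation overhead, so treat it as a rough lower bound:

```python
# Rough VRAM estimate for a 13B-parameter model, weights only.
params = 13e9
for name, bytes_per_param in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    gib = params * bytes_per_param / 1024**3
    print(f"{name}: ~{gib:.1f} GiB")

# fp16: ~24.2 GiB -> doesn't fit on a 24GB card once overhead is added
# int8: ~12.1 GiB -> fits comfortably, with room for context/cache
```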
I had some posts in this thread that may or may not be helpful
I am using 2x RTX 3090 with NVLink, though, admittedly, I have not actually used them for much yet
I think you should seriously consider just paying for time in the cloud. If you are just messing around with running models, you will likely never recoup the cost of buying your own hardware for this.
Personally, I am not into this type of thing, but have you looked at the Nvidia Tesla P100? It only has 16GB of RAM, but it is HBM2, and the cost is about the same as the P40 at just under $200.
I have both the P40 and the P100, and the P100 outdoes the P40 when running simulations for F@H.
The K80 is actually two Kepler chips with 12GB each on one card. So it’s not 24GB “in one piece”, but two separate 12GB GPUs…
While the K80 is not too bad at FP64 (MilkyWay@home), its Kepler cores only support older CUDA stuff (compute capability 3.7 if I remember correctly), there is no FP16 support (so inference like Stable Diffusion etc. will only run in FP32), there is no display output, and you need a separate power adaptor and extra cooling… Also, your workload needs to be able to make use of the two GPUs (which a lot of inference workloads do not, though BOINC does), so the argument that you are getting a dual GPU is only valid in some rare cases.
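If you want to see how that looks from software before committing to one, something like this shows it (assumes a PyTorch build that still ships Kepler kernels, which recent wheels may not):

```python
# A K80 enumerates as two separate devices, each ~12 GiB, compute capability 3.7.
import torch

for i in range(torch.cuda.device_count()):
    p = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {p.name}, cc {p.major}.{p.minor}, "
          f"{p.total_memory / 1024**3:.0f} GiB")
```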
Gaming on these in a VM is possible, but not very powerful, and the DX feature level is only 11… Given the limited performance, games that NEED 12GB of VRAM are not a good fit anyway - so for gaming, an older GCN or Maxwell consumer card will give more power and less installation hassle for the same money… Prices have dropped since last year; I just bought an R9 290X for 50 bucks, which is so much faster than a single K80 core in gaming…
I am not sure how long BOINC projects like MilkyWay will keep supporting these older CUDA versions - as long as they do, the K80 is fine for FP64 workloads (actually faster than most newer GTX/RTX cards), but for everything else it’s too old and slow.
Personally, I’d only get one if the extra effort to get it running is no problem for you, you can get it cheap, you have an explicit use for its FP64 power, and you are fine with older CUDA versions… Otherwise - stay away from them.
The P100 has good FP16 performance, and Vega-based cards are good too…
If you are looking into machine learning inference stuff like upscaling or Stable Diffusion - don’t bother with anything that doesn’t have tensor cores… A 12GB 2060 or 3060 will easily outperform even a compute beast like a Radeon VII as soon as the tensor cores are supported.
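A quick way to check whether a card is in the “has tensor cores” club from PyTorch - tensor cores start at compute capability 7.0 (Volta), so Turing/Ampere consumer cards qualify while Pascal (P40/P100) does not; just a sketch:

```python
# Tensor cores arrived with Volta (compute capability 7.0); the 2060/3060
# qualify, Pascal cards like the P40 and P100 do not.
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability {major}.{minor}, tensor cores: {major >= 7}")
```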