LLMs/Games VRAM Nvidia/AMD Question

It seems one of the biggest hurdles for many people wanting to use AI models is VRAM. I know you can split between VRAM and system RAM with Ollama, but performance takes a big ding, at least in my experience. I know that in games VRAM is fairly closely tied to the graphics processor's capabilities, or at least that's my thought.

With Nvidia’s latest card releases I thought the 5080 would have 20 to 24GB of VRAM, but obviously not, which gets me thinking they’re really pushing AI and their high end toward the 5090. Does that also mean the 5080 doesn’t really have enough VRAM for its GPU, or is it right on the edge, at least for games that could fully utilize it?

Would it benefit AMD to release cards with large amounts of VRAM as an alternative? I mean, if AMD released cards with much higher VRAM, maybe at the low end you have 16GB and maybe even up to 48 or 64GB on the high end. I get that games most likely won’t use that much VRAM, but AI would, and there could be a market for people in machine learning who need the RAM but don’t have the budget to drop $2K on a 32GB card.

Feel free to tell me why I’m way off base and correct my thinking.

Is VRAM just so expensive that it would make no sense?

1 Like

I’m pretty sure the primary reason that NVIDIA (and to a lesser extent, AMD) doesn’t put large amounts of VRAM in its gaming cards is so they can sell much more expensive, higher-margin data center cards for machine learning/LLM uses.

2 Likes

That makes sense. I still can’t understand the 5080 VRAM spec, though. That one seems very odd. It should have had 24GB of VRAM.

I kinda want a 9070 XTX with 32GB RAM – if you’re building cards with those two binned GPU chips but can put any amount of GDDR next to it, why not? – but I don’t know how that message gets to card builders.

(Yeah, maybe the 9070 chips can’t address more than 16GB.)

K3n.

Finally, the real 5090 we were all waiting for:

:troll:

2 Likes

You’re spot on about VRAM being a major bottleneck, especially for AI workloads. While games might not need excessive amounts just yet, AI models absolutely do, and relying on system RAM is far from ideal. Nvidia’s segmentation strategy makes sense from a business perspective, but it does feel like they’re pushing serious AI users towards their high-end cards.

1 Like

Agreed, it is a challenge right now to build a good dual-purpose gaming and AI rig, especially with 24GB VRAM cards like the 3090 Ti and 4090 not getting any cheaper.

As much as I love R1 671B, it takes a server or workstation to run it fast enough. For running on my gaming rig I go with the 32B model size, which fits nicely in 24GB VRAM with enough context to do bigger code refactors, etc. The breakpoints on common model sizes make 16GB VRAM just too little, unfortunately.

I wanna check out this brand new model that dropped today which claims to beat full R1 671B at some tasks:

I’d suggest the IQ4_XS quant, or at least no lower than 4bpw.
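
Rough math on why those breakpoints land where they do. This is back-of-the-envelope only; the ~4.25 bits/weight figure for an IQ4_XS-class quant and the fixed overhead are my assumptions, and real KV cache grows with context length:

```python
# Back-of-the-envelope VRAM fit check for quantized models.
# Assumptions (mine, not exact): ~4.25 bits/weight for an IQ4_XS-class quant,
# plus a rough fixed allowance for KV cache and runtime buffers.

def model_vram_gb(params_b: float, bits_per_weight: float = 4.25, overhead_gb: float = 1.5) -> float:
    """Weights-only estimate: params (billions) * bits per weight / 8, plus overhead."""
    return params_b * bits_per_weight / 8 + overhead_gb

print(f"14B: {model_vram_gb(14):.1f} GB")  # ~8.9 GB  -> fits 16 GB, but context eats the rest
print(f"32B: {model_vram_gb(32):.1f} GB")  # ~18.5 GB -> over 16 GB, comfortable on a 24 GB card
print(f"70B: {model_vram_gb(70):.1f} GB")  # ~38.7 GB -> needs 48 GB or multi-GPU
```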

1 Like

Yeah… I have a 16GB 4080 Super that I bought to see how much I like AI and also for gaming. I’m Linux only, so it was an iffy purchase since Nvidia drivers can be problematic on Linux where gaming is concerned.

I definitely notice the difference when using larger models, usually above 13B, when it spills from VRAM into RAM. It can become so laggy it’s just a pain. I’m using it mostly with VSCode. I don’t actually have AI write code for me (not a great experience), but I like when it writes comments on my code. Sometimes it does have good suggestions and sometimes not, like when it put an import in a loop or added multithreading support to something that was already multithreaded. But at writing thorough comments it’s pretty good, and I’m lazy.

If I could buy some lower cost cards with large amounts of vram, I’d do it. I’d even build a dedicated AI box, but right now the video card market is nuts.

1 Like

That’s why I posted it. I just hoped someone would see it and think it could be a winner. Right now all the new cards except the 5090 have 16GB, so I won’t be buying this round.

I’ve always wondered: what would it take to just add more RAM yourself? Like going from a 1GB chip to a 2GB chip. There are folks out there who perform similar upgrades on soldered laptop memory, so why not a GPU? Would it need a custom VBIOS?

1 Like

Now there’s an idea. Make a video card where you can add VRAM. Not sure how that works with the drivers, etc.

1 Like

You’re spot on, this is already happening. I posted a link above with a 96GB VRAM 4090 custom job. Folks on r/LocalLLaMA have been circulating screenshots of 48GB variants. I’d be wary of buying one lest it be a scam or not work, though.

I’m not 100% sure, but from my electrical engineering experience, I believe that a skilled tech with access to good (de)soldering equipment could manually remove the existing 12x GDDR6X surface-mount TFBGA packages and replace them with higher-density modules.

(The 12x black, almost square packages surrounding the core processor in this 4090 image from TechPowerUp.)

It seems like chip manufacturers are using the same physical package and pin-to-pin surface-mount I/O standard across GDDR families for now. I haven’t dug up exact part numbers from Micron, Samsung, SK Hynix, etc. for a PDF datasheet. Samsung has a double-density GDDR6W product that could be what they are using.

By applying an advanced packaging technology to GDDR6, GDDR6W delivers twice the memory capacity and performance of similar-sized packages.

Seems like a stacked-wafer approach similar to AMD’s 3D V-Cache, to fit more onto the same physical PCB footprint.

Exactly, I’m not sure whether a firmware/driver change would be needed to address the additional memory or not. It’s possible it would just be addressed without changes, though.

EDIT: Here we go, it’s a hacked driver, working only on Ubuntu, running dual 96GB 4090 cards for a total of 192GB VRAM.

About the best I can do myself at home is 1206 resistors/caps and barely 2mm TQFP-size surface-mount components, manually picking and placing with tweezers, solder paste, lots of flux, a hot plate, and a hot-air rework station. I don’t have the skills or equipment to try upgrading GDDR6X, especially given it’s a multi-layer PCB with components on both sides.

It may also be that some PCB assembly factories are making limited runs of the 4090, replacing only the GDDR6X chips with the 2x (and now 4x) density variants in low quantities, e.g. hundreds at a time.

So yeah at the mercy of manufacturers to dole out that sweet sweet GDDR… :sweat_smile: lol

The 48GB variants are done using a custom PCB that allows for a clamshell design, so you can have 24x 16Gb GDDR6X modules installed. The PCB is likely a design similar to the original 3090, which had 24x 8Gb modules.

No idea how they did the 96GB variant though; as far as I’m aware there are no higher-density GDDR6X modules available.
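
Spelling out the capacity math there (just restating the figures above):

```python
# Clamshell layout: memory chips on both sides of the PCB, so double the module count.
modules = 24              # 12 per side
density_gbit = 16         # 16 Gb GDDR6X parts
print(modules * density_gbit / 8)   # 48.0 GB, vs. the original 3090's 24 x 8 Gb = 24 GB
```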

I’m sure it’s physically possible. It’s more about the software/driver hack. I’m going to send my laptop to DOSdude to have the RAM upgraded.

1 Like

It’s wild that we are being artificially limited by a scarcity of RAM capacity. It’s as if this whole thing is a scam made to sell high-end GPUs, when apparently people can just bolt on their own RAM to double the capacity.

Is it really impractical to have 5060s and 9060s with 100+ GB of video memory?

I wish the wild minds developing RISC-V would make an FPGA with insane memory bandwidth that lets us build AI/NPU hardware with high-capacity DIMM slots.

1 Like

I wonder if it was just someone mistaking the 96GB pro card that’s coming soon for a 4090.

Nvidia RTX Pro 6000 Blackwell GPU spotted with 24,064 CUDA cores, 96GB GDDR7, and 600W — 11% more cores than RTX 5090

Are you making a pun about the startup Bolt Graphics Zeus RISC-V GPU with combo LPDDR5X and DDR5 (SO)DIMMs?

If they can pull it off in 2026, we might have a new option for AI inferencing other than a rack of used gaming GPUs, haha…

2 Likes

I was not aware of this. This sounds amazing!


Look at me, I am the computer now

1 Like

The reasons (or maybe better, “supposed reasons”) usually given for why VRAM is soldered down and not user-upgradeable are that GDDR is more susceptible to interference due to its high data rates, that using RAM slots instead would lower signal integrity, and so on. Some of that might well be true, but it has often struck me as a very self-serving argument by the manufacturers of dGPUs. After all, how often are GPUs replaced simply because they don’t have enough VRAM? Quite often, I believe; I’ve done it myself.
I find those examples of successful DIY upgrades very interesting, but, at least for now, I just don’t trust my limited soldering skills enough to attempt anything like this. The danger of destroying the GPU with one small mistake is very real. And that’s before the question of whether I could manage the software know-how that’s also needed.
However, I’m definitely rooting for these guys!

1 Like

This one is a really weird design. Copy-pasting what I said in another forum:

ServeTheHome has a better post on that with more details:

Each board can have 1, 2 or 4 chiplets, with each chiplet having access to 32 or 64GB LPDDR5X, plus DDR5 SO-DIMMs, as seen in the above pictures for the 1c config.

Their LPDDR5 is rated at 273GB/s, so I will infer that it has a 256-bit bus at 8533MT/s (similar to Strix Halo, but with faster clocks). The DDR5 part is rated at 90GB/s, so I will infer they’re using 128-bit (your DIY equivalent of “dual channel”) at 5600MT/s.

Their quoted total memory seems to assume 64GB of LPDDR5X + 2x 48GB SO-DIMM modules, with the bandwidth being the aggregate of both pools.
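
A minimal sketch of the arithmetic behind those inferences (bus width times transfer rate; the bus widths and clocks are the guesses above, not confirmed specs):

```python
def bandwidth_gbs(bus_width_bits: int, transfer_rate_mts: int) -> float:
    """Peak memory bandwidth in GB/s: bytes per transfer x transfers per second."""
    return bus_width_bits / 8 * transfer_rate_mts / 1000

print(bandwidth_gbs(256, 8533))  # ~273 GB/s, matching the quoted LPDDR5X figure
print(bandwidth_gbs(128, 5600))  # ~89.6 GB/s, matching the quoted ~90 GB/s DDR5 figure

# The quoted totals would then just be the two pools added together:
capacity_gb = 64 + 2 * 48   # 64 GB LPDDR5X + 2x 48 GB SO-DIMMs = 160 GB
bandwidth_total = 273 + 90  # ~363 GB/s aggregate, though not uniform across the pools
```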

All in all, seems like a DPU with a CPU that has a really beefed up SIMD unit. The memory bandwidth is really subpar compared to any other accelerator, and even their FP64 numbers are far from impressive compared to other GPUs that have it enabled.
It may be interesting for really specific scenarios, such as large path-traced renderings, but I don’t think it’ll be replacing any GPU farm.

For stuff like LLMs, I think it’d end up bottlenecked by the SO-DIMM memory bandwidth, which is identical to a consumer x86 platform. Going for the higher-end models (with the 4 chiplets) would likely be slower than a 12c EPYC system while also costing more.

2 Likes