In my understanding, mixture-of-experts models are less sensitive to this; otherwise yes, for sure, high-performance (i.e. high-bandwidth) memory is a must.
The problem is that larger models are slower as is, since you have to load more parameters and do more calculations per token. I suspect a 395+ will be too slow. If you have the option to return the product, maybe you could buy one and see for yourself whether it is fast enough.
Definitely.
Sounds like you've got a solid plan. Realistically, if you want to do a trial with a smaller set of agents, you might be able to test on your gaming PC. You won't be able to fit massive LLMs, but a 12-14b model would probably fit with a reasonably sized context window. I would also recommend testing out smaller models: if you find a smaller model works, you might find you can run at least two in parallel, meaning you get to the end of the turn quicker.
I'll also advocate again for a compression agent. At the least, I'd recommend testing it out after you have the core workflow figured out. One of the projects I've worked on involved using multiple agents for parallel research. The amount of data returned was massive, usually on the order of 100-200k tokens. We introduced a compression agent that distilled the information into ~20-50k tokens with a negligible loss in quality for the final output, but also a massive speed and cost reduction for the final agent (synthesizing the final artefact).
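A minimal sketch of that compression step, under stated assumptions: `call_llm`, `compress`, and `synthesize` are hypothetical names (not from any real project), and `call_llm` is stubbed so the sketch runs standalone; a real version would call your actual model client.

```python
# Sketch: a compression agent that distills parallel research results
# before they reach the final synthesis agent.

COMPRESS_PROMPT = (
    "Condense the following research notes, keeping all facts, figures, "
    "and sources. Target roughly a quarter of the original length:\n\n{notes}"
)

def call_llm(prompt: str) -> str:
    # Stub so the sketch is runnable; replace with a real model call.
    return prompt[-200:]

def compress(research_outputs: list[str]) -> str:
    # Join the raw agent outputs (often 100-200k tokens total) and distill them.
    combined = "\n\n---\n\n".join(research_outputs)
    return call_llm(COMPRESS_PROMPT.format(notes=combined))

def synthesize(research_outputs: list[str]) -> str:
    # The final agent now sees the distilled ~20-50k tokens instead of the raw dump.
    distilled = compress(research_outputs)
    return call_llm("Write the final artefact from these notes:\n" + distilled)
```

The key design point is that only the compressed notes, not the raw research, reach the expensive final synthesis call.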
I have never played RPGs, but will all of those NPC characters need full "focus"? One million tokens is a lot. I just did a test using LM Studio and Ernie 4.5 21b-a3b, which should be a pretty fast model. On a dual RTX 3090 system it gave me 120 t/s. At that speed, 1 million tokens still takes over 2 hours. A much smaller model like lfm2-1.2b is only a bit faster (165 t/s).
The most important thing for a use case like this, I think, is to use vLLM and to build the game so requests can be sent in parallel as much as possible. If you can run 10 or 20 parallel requests at almost full speed, the numbers look much better.
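As a rough back-of-envelope for the numbers above, assuming each stream keeps near-full speed under batching (which vLLM's continuous batching makes plausible up to a point, but is an assumption, not a measurement):

```python
def wall_clock_hours(total_tokens: int, tokens_per_sec: float,
                     parallel_streams: int = 1) -> float:
    """Naive estimate: total time if each stream runs at (near) full speed."""
    return total_tokens / (tokens_per_sec * parallel_streams) / 3600

# 1M tokens at 120 t/s on a single stream: over 2 hours.
print(f"{wall_clock_hours(1_000_000, 120):.2f} h")       # ~2.31 h
# The same load spread over 20 parallel requests:
print(f"{wall_clock_hours(1_000_000, 120, 20):.2f} h")   # ~0.12 h (~7 min)
```

In practice per-stream throughput degrades somewhat as batch size grows, so treat these as an optimistic lower bound.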
I think your best bet is to get an RTX 5090. That card is about the best you can get for smaller models. The RTX Pro 6000 Blackwell is even better, but you still can't run very big models at the speed you need.
I would start with that; it's a pretty simple system with just a regular GPU. Try to find a model with as much speed as possible and decent intelligence, then build the application around that. You also have an upgrade path (especially if you have 2 or more PCIe 5.0 slots on your current motherboard): you can add another RTX 5090. All other solutions are either complex, not fast enough, or very expensive.
Someone else mentioned the MI50 32GB. I am sure it is a great tinkering card, but it is still limited by speed. With an RTX 5090 you can reach high speeds on smaller models; the MI50 will crawl in comparison. With a good motherboard you can use multiple of them, but even with four you will probably still not be as fast as an RTX 5090, and you will have a much more complex system. You would have 128GB of VRAM, which is nice for regular applications, but models of that size will not be fast enough on MI50s for your use case.
I can always use it as a family media center (I have a strong dislike for smart TVs and it would be perfectly adequate, overpowered even, to handle that job heh).
This is a reasonable suggestion, but the action requests (stuff that describes what an agent wants to do in a turn) do not seem to be too large for context windows, and you would need a worst-case scenario (all 100+ agents in one location generating action requests with multiple items) to saturate a modern LLM.
Still, this is an interesting edge case, thanks for bringing it up.
The 1 million token figure is an estimate of the total generated across all the tasks used to process a turn. Tasks are processed in parallel.
Generation of action requests is done in parallel (for each character) via a queue; resolution is done in parallel (for each location) via a queue.
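A sketch of that two-phase structure, using a thread pool in place of whatever queue mechanism the game actually uses; `generate_action` and `resolve_location` are hypothetical placeholders for the real LLM calls.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def generate_action(character: dict) -> dict:
    # Placeholder: a real version would prompt the LLM with the character sheet.
    return {"actor": character["name"], "location": character["location"],
            "action": "idle"}

def resolve_location(location: str, actions: list[dict]) -> str:
    # Placeholder: a real version would prompt the LLM with every action request
    # made at this location and narrate the outcome.
    return f"{location}: resolved {len(actions)} action(s)"

def process_turn(characters: list[dict]) -> list[str]:
    with ThreadPoolExecutor(max_workers=8) as pool:
        # Phase 1: one action-request task per character, run in parallel.
        actions = list(pool.map(generate_action, characters))
        # Phase 2: group actions by location, then resolve locations in parallel.
        by_location = defaultdict(list)
        for a in actions:
            by_location[a["location"]].append(a)
        return list(pool.map(lambda kv: resolve_location(*kv), by_location.items()))
```

The per-location grouping between the two phases is what keeps resolution parallel without two agents in the same place being resolved inconsistently.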
I was thinking about this token requirement. You could do some kind of distributed computing, like the Folding@home project, to generate these tokens… assuming you design the game such that the number of tokens required scales linearly with the number of players.
In other words, you harness some of the GPU power of each player to run the agents.
This is an excellent idea, thanks!
The generation is already largely parallel, I can spread tasks between players.
I.e., if we have a party of 5, it would spread NPC character sheet generation (and, more importantly, plan generation) between them to a much more manageable degree.
From my testing I have some news:
- DeepSeek 3.2 is adequate for most tasks (particularly with supporting elements like RAG, JSON schemas, etc.)
- For a full scenario, the token load before the first turn is resolved is 2.5M tokens (800k for world building, 1.7M for the initial plan setup)
DeepSeek 3.2 seems like a model that could be run from home, and a fairly small one at that, but I didn't manage to find benchmarks for it (i.e. tokens per second on common hardware).
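For scale, here is what that 2.5M-token setup load would cost in wall-clock time if DeepSeek 3.2 ran at the throughput seen in the earlier LM Studio tests (120 t/s is an assumption borrowed from a different model, not a DeepSeek 3.2 benchmark):

```python
def hours(tokens: int, tps: float) -> float:
    # Simple serial-throughput estimate: tokens / (tokens per second) in hours.
    return tokens / tps / 3600

setup_tokens = 800_000 + 1_700_000  # world building + initial plans = 2.5M
print(f"{hours(setup_tokens, 120):.1f} h")        # ~5.8 h on a single stream
print(f"{hours(setup_tokens, 120 * 20):.2f} h")   # ~0.29 h if 20 parallel streams held full speed
```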