New guy LLM home lab

Hello everyone!

I am a bit new to CS, but I am looking to set up a home lab for hobby AI projects.
However, I am unfortunately not very experienced in this area and would welcome assistance.

To start off, what are the reasonable expectations from a home lab?
I.e., I know that running an LLM at home is possible, but how much performance could one reasonably get from a home lab?

1 Like

Sorry for my answer, but your question is too broad ($1,000 to $200,000 or more). It depends on what you want to do (inference, finetuning, training, agentics, etc., model size and type) and your budget. There’s a lot of information on the web; set your goals, do your research, and evaluate the hardware and budget you need.
You can have lots of performance, but at a price. My two cents.
Just as an example to start:
AI Home Server 24GB VRAM $750 Budget Build and LLM Benchmarking
Strix Halo Mini PC vs DIY Build – Does it make sense for Gaming? For AI? What for? - YouTube

High End discussion:
WIP: Blackwell RTX 6000 Pro / Max-Q Quickie Setup Guide on Ubuntu 24.04 LTS / 25.04 - Wikis & How-to Guides - Level1Techs Forums

3 Likes

There’s a lot of “it depends” there. Given your self-proclaimed newness, it might be worthwhile to search around a bit and read.

https://forum.level1techs.com/search?q=llm

Try to understand that such open-ended questions are challenging to approach or provide solid feedback on. It is generally helpful to share these bits of information when you’re asking:

  • What hardware do you currently have (if any) you could use?
  • What’s your budget?
  • What types of things would you like to do with your homelab/self-hosted LLM(s)?
  • How much experience do you have with Linux, Docker, shell scripts, programming, etc.? (I know you said you were new, but that doesn’t mean you don’t have any experience)
  • How much time do you plan to spend working on things?

This will give better context and make it easier for others to give you a reply. Otherwise you might not see any replies, or lots of time could get wasted on advice based on assumptions.

2 Likes

I’ll share that I self-host LLMs. I try to balance low maintenance with squeezing the most functionality from what I have. I started out with Ollama and Open WebUI b/c there’s not too much I need to tweak with either, and they were easy to set up (for me), but there are lots of reasonable options available.

My most common use cases are web research, OCR on scanned documents, classifying/categorizing documents, asking questions about legal documents (contracts, agreements, policies, etc.), workshopping code ideas/projects, and I even used it to help align my resume to job descriptions. Lots of people do a lot of different things with their LLMs. Without coding you’ll be a bit more limited, but you could try something like n8n to help with more automated items if that’s your interest.

I would recommend starting to learn Docker and bash commands. These will help a lot along the way, and once you have the LLM set up, you can even ask for support with both. From there you’ll have to do a lot of exploring and experimenting, but there’s a lot of enjoyment and learning you’ll gain from doing it. As you get more experience, you’ll have more ideas, and continue to grow and learn more around using your LLM and homelab.
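
To make that concrete, here’s a minimal sketch of asking a locally hosted model a question from Python, assuming an Ollama server running on its default port (11434) and a model already pulled; the model tag and timeout are placeholders:

```python
# Minimal sketch: ask a locally hosted model a question via Ollama's REST API.
# Assumes `ollama serve` is running on the default port and the model has been
# pulled (e.g. `ollama pull llama3.1:8b`); the model tag is a placeholder.
import requests

def ask(prompt: str, model: str = "llama3.1:8b") -> str:
    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "stream": False,
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["message"]["content"]

if __name__ == "__main__":
    print(ask("Explain what a Docker volume is in two sentences."))
```

Once something like that works, the same loop is what you’d build your own tooling (and your Docker/bash questions) on top of.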

1 Like

I see, thanks for the input!

To clarify I have:

  • a gaming PC (RX 6900 XT, 5800X3D, 32 GB RAM)
  • some experience with VMs, Linux, Python, Hugging Face (my objective was to set up a local translation model, and it worked, sort of)

I am considering investing in a dedicated local machine, i.e. a Framework Desktop, perhaps a mini cluster of them if it makes sense.

My use case: I am working on an AI-powered RPG, where NPCs are driven by LLM-powered agents.
Currently, being the neophyte that I am, I am trying to build one with Gemini 3, using Google’s platform.

The problem, as I see it, is that with my preferred approach (numerous agentic NPCs creating a living world) I would need to run several inference tasks per NPC per game turn (currently it looks like around 5, but I am not there yet), at ~5 turns per in-game day. With many NPCs (say 125), that adds up to a lot of tokens (625 tasks per turn, say 1M tokens).

(The above are very rough, intuition-level estimates; hopefully I can make it much, much more efficient.)

(Sorry for split posting)

Hence my question: how much can I do with a local machine or a mini cluster? I.e. if I could process 100k tokens in a reasonable time (5-10 minutes), then my project may be viable, as I could try to optimize the workflow/process (e.g. through heavier use of scripting).
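
As a sanity check on those numbers, a quick back-of-envelope calculation (all figures are the rough estimates above; the per-node generation speed is a made-up placeholder you’d replace with measured numbers):

```python
# Back-of-envelope check on the numbers above (all of them rough estimates).
tokens_per_turn = 1_000_000        # ~125 NPCs x 5 tasks, per the guess above
viability_tokens = 100_000         # the "viable" target mentioned above
budget_s = 10 * 60                 # 10 minutes, the slow end of "reasonable"

print(f"1M tokens/turn in 10 min  -> ~{tokens_per_turn / budget_s:.0f} tokens/s aggregate")   # ~1667
print(f"100k tokens in 10 min     -> ~{viability_tokens / budget_s:.0f} tokens/s aggregate")  # ~167

# If one node sustains, say, 50 tokens/s of generation (made-up placeholder --
# measure on real hardware; prompt processing is much faster than generation):
node_tps = 50
print(f"Nodes for the 100k target -> ~{viability_tokens / budget_s / node_tps:.1f}")          # ~3.3
```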

Hardware-wise, I’d suggest a workstation/server-grade mainboard with as much ECC RAM as budget allows and 3-5 GPUs, each with 20GB of VRAM or more. As others indicated, you’re getting close to 6 figures if purchased new (you’d also need a rack case, PSU and networking stuff), and bad timing on the used market means that won’t reduce your costs much. And then comes the electricity bill :roll_eyes: AI is very, actually extremely, power-hungry. In fact, if your mains only has a 120V AC option, don’t even bother. You really need 240V power to have a fighting chance of running it without tripping your fuses. :money_with_wings:

HTH!

:wave: :netherlands:

Thanks!

I have 240v power available, so that shouldn’t be an immediate issue.
If I have a highly parallel workload, would a large cluster of smaller machines, e.g. an AMD 395+ farm, be viable?
With each element in the cluster processing its own task separately?
Networking shouldn’t be too much of an issue, at least from my very limited intuition, as task results could be exchanged asynchronously and are relatively small in size.
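
That pattern is easy to prototype in software before committing to hardware. A minimal sketch of fanning tasks out over several nodes and gathering results asynchronously, assuming each node runs an Ollama (or similar) server exposing /api/chat on port 11434; the hostnames and model tag are placeholders:

```python
# Sketch: fan NPC inference tasks out across several nodes and gather results
# asynchronously. Assumes each node runs an Ollama (or similar) server exposing
# POST /api/chat on port 11434; hostnames and model tag are placeholders.
import asyncio
import aiohttp

NODES = ["http://node1:11434", "http://node2:11434", "http://node3:11434"]

async def run_task(session: aiohttp.ClientSession, node: str, prompt: str) -> str:
    payload = {
        "model": "llama3.1:8b",  # placeholder model tag
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    async with session.post(f"{node}/api/chat", json=payload) as resp:
        data = await resp.json()
        return data["message"]["content"]

async def run_turn(prompts: list[str]) -> list[str]:
    async with aiohttp.ClientSession() as session:
        # Round-robin tasks over nodes; each node processes its own task separately.
        coros = [
            run_task(session, NODES[i % len(NODES)], p)
            for i, p in enumerate(prompts)
        ]
        return await asyncio.gather(*coros)

if __name__ == "__main__":
    npc_prompts = [f"NPC {i}: state your intended action this turn." for i in range(6)]
    for result in asyncio.run(run_turn(npc_prompts)):
        print(result[:80])
```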

For a homelab given your idea, I think you should aim for something that you could scale out. At this very early stage of planning, I would not recommend getting multiple 20GB+ GPUs (a substantial investment).

You could instead develop your idea using one or two agents on a single (probably CUDA-capable) GPU. You may even be able to slip it into your current gaming PC, depending on how many lanes you have available.

NPC AI has historically been pretty good without all the generative-AI bells and whistles, so it’s unclear to me that you’ll require as many resources as are being discussed here.

I have 4 PCIe lanes available but no space (despite a full-ATX mobo, I have a big chungus GPU that doesn’t leave much room).

I suppose a 395+ box could make sense as a starting base; it appears that it could run a high-end LLM (i.e. the open OpenAI one), and I could use the experience gained working with it to scale to a cluster.

Does this make sense?

1 Like

Hi, I could probably give you advice on that. I have a workstation with an RTX Pro 6000, and an AI server too. Can you please answer the following questions:

  1. Do you want to do inferencing (running LLMs) or do you want to develop AI software too (and use tools like CUDA, ROCm, PyTorch)?

  2. How much are you willing to spend?

Thanks for your input!

  1. I plan to focus only on inference for this specific project (maybe others, if they pop up)
  2. This is a good question, my willingness to spend depends on how far my money would go

Right now, based on the responses, it seems that starting small (i.e. one 395+ node) and then scaling up (e.g. to 5-10-30 nodes) could be the way to go.

Do you have plans to sell the game once it’s finished? Is there a business plan in place? Do you have investors to keep happy?

Real time or “offline”? The answer will make a big difference. If it’s realtime, you’ll need a lot of hardware to make it work well and reliably. If it’s offline, meaning that no one consumes the output immediately and/or is not waiting on the output to do something else, then you might be able to make it work with less hardware and some clever orchestration. I’m going to assume for now that you mean offline (since you mentioned a timeframe of 5-10 minutes).

Context engineering and shared state management are going to be the big challenges, I suspect. I use a Qwen3 1.7B for web search queries with a 32k context window and it uses ~3.6 GB of VRAM, with an 8-bit KV cache and Flash Attention enabled. Larger context will obviously take up more VRAM, but my question to you would be: do you know for certain you need that large a context? I’ll assume your 100k context is definitely needed for now.
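
For sizing intuition, the KV cache footprint can be estimated directly from a model’s config. A small sketch with illustrative hyperparameters (close to, but not necessarily, the exact Qwen3 1.7B values; plug in the real config for your model):

```python
# Rough KV-cache sizing: 2x (keys and values), one entry per layer, per KV
# head, per position. Hyperparameters below are illustrative placeholders.
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: float) -> float:
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

# Example: a small GQA model with an 8-bit (1 byte) KV cache at 32k context
print(f"{kv_cache_gb(n_layers=28, n_kv_heads=8, head_dim=128, context_len=32_768, bytes_per_elem=1):.2f} GB")
# ~1.9 GB for the cache alone; the model weights come on top of that.
```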

Trying to do that with multiple agents at the same time will eat up a lot of memory quickly. I don’t know how much information you’re expecting each NPC to generate, but I presume there will be some shared information between them, so you’ll also have to figure out what makes the most sense for tracking that across NPCs (and the world), and there will be some sequential dependence that will add up to more wait time. Using multiple smaller models could mean you could rely on one or two cheaper-ish GPUs (or the 395+ with sufficient RAM), and maybe you scale the model size based on NPC intelligence?

One thing you could try to do to help reduce the context size for shared information across NPCs, is to use a “meta” agent to condense the output from one as context for the others. I actually think this would be good too, b/c it means that (just like real life) there will be miscommunication and incomplete information about what was said or done. So if one agent outputs ~20k tokens, you could use the meta agent to “compress” the output down to ~5k tokens (could vary depending on your prompt and constraints) to use as what’s passed to other NPCs. You’ll have to experiment with what works best, but lots of smaller models can summarize well and have sizable context windows. For tracking NPC outputs and/or world state, I’d guess that a database might make the most sense, but that’ll be up to you (usually a lot of personal preference there too) to determine what makes the most sense.
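
A minimal sketch of that meta/compression agent, assuming a local Ollama endpoint and a small summarizer model (the model tag and word budget are placeholders):

```python
# Sketch of the "meta" agent idea: compress one NPC's long output into a short
# summary that gets passed to the other NPCs. Assumes a local Ollama server on
# the default port; the model tag and word budget are placeholders.
import requests

OLLAMA_URL = "http://localhost:11434/api/chat"
SUMMARIZER = "qwen3:1.7b"  # any small model that summarizes well

def compress_for_others(npc_output: str, max_words: int = 200) -> str:
    prompt = (
        "Summarize the following NPC actions and dialogue in at most "
        f"{max_words} words. Keep only what other characters could plausibly "
        "observe or hear about; drop private thoughts.\n\n" + npc_output
    )
    resp = requests.post(
        OLLAMA_URL,
        json={
            "model": SUMMARIZER,
            "messages": [{"role": "user", "content": prompt}],
            "stream": False,
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["message"]["content"]

# Usage: pass compress_for_others(long_output) to the other NPCs instead of the
# full transcript; losing detail here doubles as "imperfect information".
```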

So I would aim for trying to use a few small models for most NPCs. Include an agent to handle context compression. Include a few larger models for the more “intelligent” or important NPCs, but know that without a dedicated machine for those, you’ll slow down everything waiting for replies. Then figure out how you want to store, track, and access all the NPC outputs and world state.

It is a hobby project for me.

1 Like

It is a turn-based, 5e-SRD-inspired, social-only, theater-of-the-mind RPG.
The baseline that I would like to reach is 5 turns per in-game day, with a living world (125 NPCs), and each turn taking a reasonable amount of time (minutes) to process.

Context engineering is presumed; I plan to use local RAG via Zep (knowledge graph), possibly with vector memory on top.

I considered using meta agents (e.g. for factions), but the ideal here is to see if there is emergent behavior from the NPCs and their individual decision-making and planning.

The current plan (not yet implemented) is to have, for each turn (rough sketch after the list):

  1. generate action requests (agents state what they want to do based on their plans)
  2. aggregate those requests by location into a package (each NPC can be in a finite number of pre-set locations)
  3. process and resolve each package (dice rolls, etc.), outputting dry outcomes to agents
  4. process resolved packages in locations with human player(s) to generate scenes
  5. use outcomes received by agents to patch their plans
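
Here’s a rough sketch of that pipeline as plain data flow, just to pin down the shapes; every class, stub, and name below is a placeholder for illustration (the real LLM calls would replace the stub functions):

```python
# Rough sketch of the turn pipeline above as plain data flow. Everything here
# is a placeholder; the stub functions stand in for LLM calls and game rules.
import random
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class ActionRequest:
    npc_id: str
    location: str
    intent: str                      # free-text output from the NPC agent

@dataclass
class Outcome:
    npc_id: str
    location: str
    result: str                      # "dry" outcome after rules resolution

def request_action(npc_id: str, location: str) -> ActionRequest:
    # 1. placeholder for an LLM call where the agent states what it wants to do
    return ActionRequest(npc_id, location, f"{npc_id} pursues its current plan")

def resolve_package(location: str, reqs: list[ActionRequest]) -> list[Outcome]:
    # 3. placeholder rules engine: dice rolls etc., producing dry outcomes
    return [
        Outcome(r.npc_id, location,
                f"{r.intent}: {'success' if random.random() > 0.4 else 'failure'}")
        for r in reqs
    ]

def narrate_scene(location: str, outcomes: list[Outcome]) -> str:
    # 4. placeholder for the scene-generation LLM call for player locations
    return f"[{location}] " + "; ".join(o.result for o in outcomes)

def run_turn(npc_locations: dict[str, str], player_locations: set[str]) -> list[str]:
    # 1. generate action requests
    requests = [request_action(npc, loc) for npc, loc in npc_locations.items()]

    # 2. aggregate requests by location into packages
    packages: dict[str, list[ActionRequest]] = defaultdict(list)
    for req in requests:
        packages[req.location].append(req)

    # 3. resolve each package
    outcomes = [o for loc, reqs in packages.items() for o in resolve_package(loc, reqs)]

    # 4. narrate only locations with a human player present
    scenes = [narrate_scene(loc, [o for o in outcomes if o.location == loc])
              for loc in player_locations]

    # 5. outcomes would be fed back to each agent to patch its plan (not shown)
    return scenes

if __name__ == "__main__":
    print(run_turn({"Alice": "tavern", "Bram": "tavern", "Cyra": "market"}, {"tavern"}))
```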

However, a prototype (fewer NPCs, longer turns) seems like a good instrumental objective.

2 Likes

So the cheapest, best-value GPUs you can get for inferencing are probably AMD Instinct MI50s with 32GB of VRAM. Those GPUs cost about $200-$400 each if you can get your hands on one or more of them. Be aware though that these cards are getting rare.

Some additional information on this. Back in the day you needed the manufacturer’s compute stack to do inferencing, meaning you’d need CUDA support on Nvidia or ROCm support on AMD. The MI50 needs ROCm and is EOL, so AMD does not officially support it anymore.

However, there are two tricks here. The first is that some distributions like Debian and Arch build ROCm themselves and enable support for older GPUs as well. According to the compatibility list, the MI50 is still supported on Debian Trixie!
Secondly, some smart people realized you can do inferencing with Vulkan as well: if you build software like llama.cpp with the Vulkan backend, the MI50 will do inference just fine with the default amdgpu driver, completely without ROCm.

The advantage of the MI50 is that you can get 96GB of VRAM with acceptable performance for a homelab at just $600-$900. For comparison, an RTX 3090 with 24GB of VRAM, which is regarded as good value, still goes for around $700.

If you are serious and need top-notch performance, or you have serious money backing you, the RTX Pro 6000 series cards are really the best of the best right now. They have the most efficient silicon as well as the fastest memory you can get as a normal person, though they weigh in heavily on the financial side at around $7,500 for 96GB.

2 Likes

To add to this, my advice would be to get a cheap Threadripper or Epyc system, possibly used. What you are looking for here is not the compute but mostly the ability to attach multiple GPUs. I would not recommend running LLMs on CPU+RAM, because I think the speed of regular memory is too slow for what you have in mind. Take my advice with a grain of salt, though, because I do not know your use case and have not actually tested it; it’s my experience from having played with this a bunch.

Get a Threadripper or Epyc system with enough PCIe lanes and attach a couple of MI50s to it. That way you could get started and begin experimenting for as low as $2,000-$3,000, with no need to run inferencing on CPU+RAM. You could run multiple smaller LLMs locally, distributed over 3-5 MI50s.

Then, once you have a proof of concept and have gathered some understanding of what kind of performance you’d need for a commercial product, you can extrapolate from the performance numbers to see what you’d need to buy to go into staging and production.

1 Like

Thanks for your input!

In my understanding, the 395+ uses onboard GPU cores to process the models, equivalent to a modest modern discrete GPU, but with enough memory capacity to handle large LLMs (that won’t fit into 24-32 GB class GPUs). It also appeals to me from the future-scaling perspective: setting up many parallel async tasks, each on their own hardware, feeding completed tasks to a master node.

The downside is that I am not sure how it would work out in terms of tokens per unit of cost; I suppose this would vary a lot depending on the nature of the tech stack (which I don’t have much understanding of beyond “let’s go with the biggest open OpenAI model”).

Do you have the data for tokens per second for those GPUs you mention, for various models?

You do have access to more memory, but even though the 395+ has four memory channels, it is slower than a GPU. So while you can run larger models, you will have to run them a fair bit slower.
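
A rough rule of thumb, since single-stream generation is mostly memory-bandwidth-bound: tokens/s is roughly memory bandwidth divided by the bytes that have to be streamed per token (about the size of the active weights). The bandwidth figures below are approximate, and the model size and efficiency factor are illustrative assumptions:

```python
# Very rough rule of thumb: single-stream decode speed is usually limited by
# memory bandwidth -- each generated token streams the active weights through
# memory. All numbers below are approximate / illustrative.
def rough_tokens_per_sec(bandwidth_gb_s: float, model_gb: float, efficiency: float = 0.6) -> float:
    """Upper-bound-ish estimate: bandwidth / bytes-per-token, derated by an
    efficiency factor (real-world utilisation is well below 100%)."""
    return bandwidth_gb_s * efficiency / model_gb

model_gb = 40  # e.g. a ~70B model at 4-bit quantisation (illustrative)

for name, bw in [("Strix Halo 395+ (~256 GB/s LPDDR5X)", 256),
                 ("RTX 3090 (~936 GB/s GDDR6X)", 936)]:
    print(f"{name}: ~{rough_tokens_per_sec(bw, model_gb):.0f} tokens/s")
# Roughly 4 vs 14 tokens/s for this example -- same model, same math, just
# different memory bandwidth. Prompt processing (prefill) scales more with
# compute, so batched/parallel workloads behave differently.
```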