We’re looking to build a local compute cluster to run DeepSeek-V3 670B (or similar top-tier open-weight LLMs) for inference only, supporting ~100 simultaneous chatbot users with large context windows (ideally up to 128K tokens).
Our preferred direction is an Apple Silicon cluster — likely Mac minis or studios with M-series chips — but we’re open to alternative architectures (e.g. GPU servers) if they offer significantly better performance or scalability.
Looking for advice on:
Is it feasible to run 670B locally in that budget?
What’s the largest model realistically deployable with decent latency at 100-user scale?
Can Apple Silicon handle this effectively — and if so, which exact machines should we buy within $40K–$80K?
How would a setup like this handle long-context windows (e.g. 128K) in practice?
Are there alternative model/infra combos we should be considering?
Would love to hear from anyone who’s attempted something like this or has strong opinions on maximizing local LLM performance per dollar. Specifics about things to investigate, recommendations on what to run it on, or where to look for a quote are greatly appreciated!
I've reached the conclusion from my own research that a full context window at the user count I specified isn't feasible. Thoughts on how to adjust the context window and/or quantization without major quality loss, to bring things in line with budget, are welcome.
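For anyone curious, here's the rough math behind that conclusion. The sketch below estimates KV-cache memory alone, using cache dimensions I took from the published DeepSeek-V3 config (61 layers, a 512-dim compressed KV plus a 64-dim RoPE component per token per layer, MLA-style, stored in 16-bit); treat those numbers as assumptions to verify against the actual config rather than gospel.

```python
# Rough KV-cache sizing for DeepSeek-V3-style MLA attention.
# Assumed values (check against the model's config.json): 61 layers,
# a 512-dim compressed KV plus a 64-dim RoPE component per token per layer.
LAYERS = 61
KV_DIMS_PER_TOKEN_PER_LAYER = 512 + 64   # MLA compressed cache
BYTES_PER_ELEM = 2                       # fp16/bf16 cache

def kv_cache_gb(context_tokens: int, concurrent_users: int) -> float:
    """Approximate KV-cache memory in GB for all active sessions."""
    per_token = LAYERS * KV_DIMS_PER_TOKEN_PER_LAYER * BYTES_PER_ELEM
    return context_tokens * concurrent_users * per_token / 1e9

for ctx in (4_096, 32_768, 131_072):
    print(f"{ctx:>7} tokens x 100 users ~ {kv_cache_gb(ctx, 100):,.0f} GB of cache")
```

At 128K context the cache alone for 100 active sessions comes out on the order of a terabyte, before the weights are even loaded, which is why I don't think the full window is happening in budget.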
I think the kicker is going to be your 100 user count.
Are you talking 100 users occasionally accessing it (e.g., maybe 5-10 or so with active queries at any given moment), or 100 concurrent users hammering on it? There’s a significant difference.
Either way I think doing it locally in budget with low latency is going to be extremely difficult.
I'm no AI expert, but I'd consider using some of the smaller models to handle the bulk of the queries, and giving end users either a choice of model or some sort of queue for access to the large model(s).
The specification given to me was, more or less, "$40-80k to spend, largest model we can run with a peak of 100 concurrent users." While researching this myself in parallel with posting around, I've found that that many concurrent users drives the hardware requirements up hugely.
I'm not sure how best to handle that: quantize the same model, shrink the context window, or something else.
Guessing the directive came from management who have limited clue.
At that price I'd guess a bunch of Mac Studios would be your best bet; however, it's going to be pretty janky, in the way that a bunch of Mac Studios strung together with Thunderbolt is always going to be.
Key question: Do they know what they’re trying to do?
80k buys a fair amount of licenses for Claude, ChatGPT plus, Copilot, etc. Or time put into building a Teams agent, if they’re an MS shop (for things like knowledge base queries, etc.).
It also goes some way towards refreshing end user machines with some capability of running local models via lmstudio/llama/etc. and then handing off things that are too big to the cloud.
If they have some fantasy about getting 100 users' worth of responsive access out of consumer hardware, my guess is they're in for disappointment.
My M4 Max can run local models that fit into its 64 GB of RAM, but the speed is "OK, I guess" for a single user. Smaller models are much faster. If you have an actual workload they want, I'd be testing it with Gemma 3 27B and similar, which might give you more headroom for more users. The improvement in small models lately has been huge.
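If you do have a concrete workload, a quick-and-dirty way to see how concurrency bites is to hammer whatever local server you're trialling (LM Studio and llama.cpp both expose an OpenAI-compatible endpoint) with a few parallel chat requests and watch per-user tokens/sec fall off. A minimal sketch, assuming a local endpoint on port 1234, a placeholder model id, and a server that reports a usage block:

```python
import time
import requests
from concurrent.futures import ThreadPoolExecutor

# Placeholders: point at whatever OpenAI-compatible local server you're testing
# (LM Studio, llama.cpp server, etc.) and use the model id it reports.
URL = "http://localhost:1234/v1/chat/completions"
MODEL = "gemma-3-27b-it"   # hypothetical model id, adjust to your server
PROMPT = "Summarise the pros and cons of running LLMs on-prem."

def one_request(_):
    start = time.time()
    r = requests.post(URL, json={
        "model": MODEL,
        "messages": [{"role": "user", "content": PROMPT}],
        "max_tokens": 256,
    }, timeout=600)
    r.raise_for_status()
    tokens = r.json()["usage"]["completion_tokens"]   # assumes a usage block is returned
    return tokens / (time.time() - start)             # tokens/sec as seen by this "user"

for users in (1, 4, 8, 16):
    with ThreadPoolExecutor(max_workers=users) as pool:
        rates = list(pool.map(one_request, range(users)))
    print(f"{users:>2} concurrent: {sum(rates)/len(rates):.1f} tok/s per user")
```

Even a crude test like that makes it obvious whether a given box has any hope at the concurrency you're being asked for.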
Also, one model is not a one-size-fits-all solution. Reasoning models are good for some things but more prone to hallucination. For example, if they want to use it for information retrieval, reasoning is a bad fit: you'll end up with information pulled from multiple, potentially unrelated documents, plus made-up details from hallucination in the response.
Disclaimer: I’m no AI expert, but I’ve been dabbling with local models (via lmstudio/llama and gpt4all) and chatGPT, etc. for about 6 months now.
We’ve also been considering this locally where I work, but I think right now we’re on the cusp of AI browsers with inbuilt agents being the next big thing and we may be best off investing in hardware with local NPU for small stuff that sends larger stuff off to a cloud model.
I'm an undergraduate researcher working with limited context from the professor. From discussion on other forums, the best bet I'm seeing would be a few RTX Pro 6000 Blackwells running a 4-bit quantized DeepSeek with a decent context window for a good few users. Over there I've been heavily discouraged from going Apple at this price point; not sure if that holds true here.
Nothing I said in my post is particularly firm apart from concurrent user counts. That’s my main target.
It may be worth investigating multiple high-end Xeon machines with a lot of local memory. The large 670B models need a heap of memory, and getting GPUs with enough could prove expensive? Haven't found benchmarks, but I'm guessing there's a lot of three-year-old server hardware out there with 1/2 TB of RAM or more that could run this.
Not sure how the benchmarks stack up, though. But what you lose in raw performance could (maybe?) be made up for by multiple nodes just brute-forcing it with CPU?
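To put some rough numbers on the CPU idea: decode on a memory-bound box is limited to roughly memory bandwidth divided by the bytes streamed per token. The figures below (about 37B active parameters per token for the MoE, ~4.5 bits/weight for a Q4-class quant, and ballpark DDR5 bandwidths) are assumptions for illustration, not benchmarks.

```python
# Very rough upper bound on decode speed for a memory-bandwidth-bound CPU server.
# Assumed numbers, not measurements.
ACTIVE_PARAMS = 37e9            # DeepSeek-V3 activates roughly 37B params per token (MoE)
BITS_PER_WEIGHT = 4.5           # roughly a Q4-class quant
bytes_per_token = ACTIVE_PARAMS * BITS_PER_WEIGHT / 8

for label, bw_gbs in [("8-ch DDR5, single socket", 300),
                      ("12-ch DDR5, single socket", 460),
                      ("dual socket, combined", 900)]:
    print(f"{label:<26} ~{bw_gbs * 1e9 / bytes_per_token:5.1f} tok/s ceiling")
```

And that ceiling is shared across every concurrent user, before any prompt processing, so 100 users on CPU alone would get painful fast.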
This should be doable, especially if you are not counting software developer time as part of that 40-80k.
Personally I would avoid the Mac route, especially for anything nearing production demands. $40k x 2 would get you two decent machines with a moderate CPU, ECC RAM, and a bit of storage. With a bit of context engineering you can run multiple users off the same model, with latency being the main thing that increases as you scale. The only reason for two machines would be to keep things up when you need to take one down, at the cost of increased latency while the remaining node services requests. Anyway, this could end up being a whole design doc, but…
Macs are great for running large models, but not for running them fast. With 100 users, token throughput is what matters, as each of them will be waiting quite a while after every request!
Running 670B is going to take a lot of VRAM. The best bet would be something like a Supermicro box with 10 RTX 6000s, running 670B quantized to Q4 (even that setup can't run the full-precision model). That would cost about $80k.
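Back-of-envelope on the VRAM side, treating per-card capacity as an assumption (48 GB for an RTX 6000 Ada, 96 GB for an RTX Pro 6000 Blackwell), and ignoring KV cache and framework overhead:

```python
# Approximate weight memory for a 671B-parameter model at different quant levels,
# and how many GPUs of a given size you'd need just to hold the weights.
import math

PARAMS = 671e9
QUANTS = {"fp8": 8, "Q4-ish": 4.5, "Q2-ish": 2.5}   # bits per weight, rough
GPUS = {"48 GB card": 48, "96 GB card": 96}

for name, bits in QUANTS.items():
    weights_gb = PARAMS * bits / 8 / 1e9
    counts = ", ".join(f"{math.ceil(weights_gb / gb)}x {g}" for g, gb in GPUS.items())
    print(f"{name:>7}: ~{weights_gb:4.0f} GB of weights -> at least {counts}")
```

So ten 48 GB cards hold a Q4-class quant with some headroom left for KV cache, while anything close to full precision does not fit, which matches the parenthetical above.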
Since the requirement is “40-80k to spend, largest model we can run with peak 100 concurrent users,” it might be useful to consider the absolute maximum amount of VRAM that much money could buy. No amount of system RAM bandwidth is remotely going to be enough to serve 100 concurrent users at any affordable scale.
Used RTX 3090s are maybe $800 a pop? That might be one of the only ways to get enough raw VRAM capacity to run one of the Q4 quants on that kind of funding, but I think it would be an expensive maintenance nightmare, given the age and mining-era usage history of the cards still on the market, even setting aside how hacky and, yes, unreliable it would be for institutional usage.
It's certainly possible, though; someone did a 16x 3090 build a while ago, albeit with a Q3 quant, and they didn't benchmark batch performance in that discussion.
Lower quants would improve capacity and speed, but quality will suffer. Q2_K quants are still surprisingly intact, but it may take testing to confirm whether they're still suitable for the purposes you have in mind.
The only setup I can imagine coming close to these requirements would be a used DGX A100, if you can find one for $75-80k. It would get you 1 TB of RAM, and the NVSwitch inside lets all the GPUs be pooled to look like a single massive GPU with potentially up to 640 GB of VRAM, depending on whether your used system is fully built out or not. That might get you close to your desired system use.
Just be aware of your electricity prices as well. Depending on where you are located, the energy cost to run that system could be anywhere from as low as $5,000 to as high as $19,000 per year.
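For what it's worth, that range is just wall power times hours times your tariff; a fully built-out DGX A100 is specced at roughly 6.5 kW, and the rates below are only examples.

```python
# Rough annual energy cost for a box drawing ~6.5 kW around the clock.
POWER_KW = 6.5
HOURS_PER_YEAR = 24 * 365

for rate in (0.09, 0.20, 0.33):   # example $/kWh for cheap, average, expensive regions
    cost = POWER_KW * HOURS_PER_YEAR * rate
    print(f"${rate:.2f}/kWh -> ~${cost:,.0f} per year")
```

Cooling and the fact that you won't be at max draw 24/7 shift the real number, but it gives you the order of magnitude.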
You need to find an expert who has successfully implemented such installations.
The consultation costs money, but it’s necessary.
As soon as you need more than one node, things get complicated.
What is your time horizon for the project?
The next server generation will triple memory bandwidth, so you’ll have 3.2 TB/s combined bandwidth in a two-socket system.
You might be able to get the consultation (at least for hardware configs) for free, depending on how busy the prebuilt vendors are at the moment. They could probably sort that out with a 30-minute phone call.
I don't think they're going to hit their 100 concurrent users, though, without some software development time, no matter which way they cut it at that budget.
Without really structuring these thoughts too much, maybe it would be enough for someone to crawl GitHub and find a gem:
Scaling context windows. No reason to fully allocate a 160k context when most of your users will probably burn through a thousand or two per session.
Queuing system for requests that handles migration of contexts based on session. E.g., two workers servicing the queue: unload the previous context blob, store it somewhere, then load an earlier blob or allocate a new one for the next session (roughly the shape sketched after this list).
Load balancing across model instances; you probably want some sort of sticky connection by way of a cookie or header. Doing this statelessly would balloon the cost.
And as a bonus
Evaluation of the performance requirements for the model, plus exploration of various quantizations or casts of it. This reduces resources but might not be able to do what's needed. E.g., make-believe numbers: an IQ1_S quant can reliably recall 80 digits of pi before it starts hallucinating, the Q4_K_M can do 10,000, but we only need 50. Can we get away with a q8_0 KV cache (ctk/ctv) or do we need fp32?
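To make the queue and sticky-session ideas above concrete, here is a toy sketch of the shape of it. Everything in it is made up for illustration (the "context blob" is just bytes and the inference call is a stand-in); a real version would persist contexts somewhere durable and talk to an actual backend.

```python
import hashlib
import queue
import threading

# Toy sketch of session-sticky request handling: each worker owns one model
# instance, sessions are pinned to a worker by hashing the session id, and the
# worker swaps per-session context "blobs" in and out as it serves requests.
# All names here are illustrative, not a real API.
NUM_WORKERS = 2
work_queues = [queue.Queue() for _ in range(NUM_WORKERS)]

def worker(q: queue.Queue):
    contexts = {}                                     # session_id -> stored context blob
    while True:
        session_id, prompt, reply = q.get()
        blob = contexts.get(session_id, b"")          # load or allocate the session context
        blob = blob + prompt.encode()                 # stand-in for real inference
        contexts[session_id] = blob                   # store it back for the next turn
        reply.put(f"[{len(blob)} bytes of context] echoed: {prompt}")

def submit(session_id: str, prompt: str) -> str:
    # Sticky routing: the same session always lands on the same worker,
    # which is what a cookie/header-based load balancer would give you.
    idx = int(hashlib.sha1(session_id.encode()).hexdigest(), 16) % NUM_WORKERS
    reply: queue.Queue = queue.Queue()
    work_queues[idx].put((session_id, prompt, reply))
    return reply.get()

for q in work_queues:
    threading.Thread(target=worker, args=(q,), daemon=True).start()

print(submit("alice", "hello"))
print(submit("alice", "again"))      # hits the same worker, reuses her context
```

The hash-based pinning is the moral equivalent of the cookie/header stickiness mentioned above: it keeps a session's context on one worker so you're not shipping blobs around on every request.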
I don’t really know much about the subject.
I’ve only been experimenting with LLM for 2-3 months.
How much quality is lost through quantization seems to be an open debate.
Since reading this, I’ve been leaning toward full-precision.
The question is what else the model might have lost if it used to be able to recite pi to 10,000 digits with full accuracy but now can only manage 80. It may well start making mistakes that render it useless in an academic setting, if it was useful in any manner to begin with.
Subjectively, q2 quants of 2-3 bits/param already show a perceptible reduction in subtlety and output quality, when compared to q4 quants of the same model, even for some of the largest open-weight models currently available.
This is compounded by the fact that few people bother to benchmark quantized versions beyond the bare minimum (e.g. perplexity or MMLU) to check whether they have accidentally scrambled the model into gibberish.
I see two factors working against this kind of thing in any current open-weight models, though:
First, the most popular, most useful, and most powerful of them (DeepSeek V3/R1, Llama 3/4, Mistral Small, Qwen, Kimi-K2, etc.) are no doubt under close enough scrutiny that any unexpected behaviour enabled by quantization, or any other seemingly innocuous change, would be quite disastrous for their creators. Imagine the headlines if any one of them were discovered to have hidden finetunings to the effect outlined in that paper.
Second, the paper is from 2023, when the state of the art of the open-weight landscape did indeed include the models the authors used. Take a barely competent model overfitted to the problem types evaluated, and (to my very limited understanding) quantization would certainly damage its performance on those problems. Current models tend to be more competent in a more useful way to begin with, possibly reducing the qualitative shift observed, but I would be curious to see the same test redone on some current open-weight coding models.
Either way, running unquantized (in the case of DeepSeek-V3-0324, 671 billion parameters, mostly in fp8) would be entirely out of the question for the OP and their performance specifications, while quantization may at least bring it into the realm of possibility.
It’s possible I missed it but I didn’t see any latency or token rate requirements. Which means you can scale it across the time axis. Each user might only get 0.05 token/s but it would work.
But yes, @ethann OP might want to specify that as well. Requirements for chatbot-like interactive usage, RAG-backed content generation, and code completion are vastly different, mostly due to differences in prompt size and latency sensitivity.
Realistically speaking, for new hardware (presumably with academic discounts), get the most RTX Pro 6000s you can. Personally, I would put them in a 2x EPYC 9004/9005 system with 24 channels of DDR5 (1 TB+ ideally, but whatever you can fit after GPU cost). If you can fit 8x RTX Pro 6000s, then you can run decent quants with SGLang or vLLM, which would be your top options for concurrency. (This would fit in your budget even without discounts.)
Your other option is potentially a DGX Station: 288 GB of fast HBM plus another 496 GB of "decent"-speed LPDDR5X (~396 GB/s). It should be in your budget, but these don't come out for another few months. And some vendors like Exxactcorp are still selling old MI250s. I normally wouldn't recommend them, but with an academic discount and 128 GB of HBM2e per module, it might be worth chasing down a quote just to compare.
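As a rough sketch of what the vLLM route above could look like in practice (the checkpoint path is a placeholder for whichever quantized DeepSeek you settle on, and every number is a starting point to tune rather than a recommendation):

```python
from vllm import LLM, SamplingParams

# Illustrative only: in production you'd run the OpenAI-compatible server
# instead of the offline API, but the knobs are essentially the same.
llm = LLM(
    model="path/to/quantized-deepseek-v3",  # hypothetical local checkpoint
    tensor_parallel_size=8,                 # one shard per RTX Pro 6000
    max_model_len=32768,                    # trimmed context to leave VRAM for batching
    max_num_seqs=100,                       # cap on concurrently scheduled sequences
    gpu_memory_utilization=0.90,
)
params = SamplingParams(max_tokens=512, temperature=0.7)
outputs = llm.generate(["Explain KV-cache paging in two sentences."], params)
print(outputs[0].outputs[0].text)
```

SGLang exposes equivalent knobs; the general point is that continuous batching plus a trimmed max context is what actually buys concurrency on a fixed VRAM budget.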