Running DeepSeek-R1 (full) locally - what are my options with a WRX80 workstation?

So I hate cloud. And I hate datacenters (that aren’t mine). Up until now I wasn’t even really interested in AI. However, when I heard that DeepSeek is challenging GPT AND can be run locally, it kind of captured my attention, as I’m a typical maniac running a homelab datacenter.

When I saw the full requirements for running R1 in 32-bit mode I thought it was a lost cause, but later on I started digging, testing those small 70B models on my PC, learning about quantization, and came to the conclusion that… apparently I may not be too far from barely running full R1.

My home PC happens to be based on the WRX80 platform (TR PRO 5965WX) because I’m into VFIO gaming and crazy storage solutions, and I YOLO bought 256 GB of RAM for it plus 2x 3090 + a Quadro RTX 4000 because I decided it would be funny to play Skyrim on that. So it does in fact already have some capacity and effortlessly runs 70B, even inside a VM. But 70B is not the real deal. It’s not BAD, but it doesn’t really have that “wow” factor I hoped for. So I want more.

In order to make it viable in any form I bought 512 GB of DDR4 RAM (8x64 GB 2Rx4 3200) specifically for this project (hopefully I will be able to sell my 256 GB set, otherwise I’m gonna feel very silly). My GPUs also happen to have 56 GB of VRAM combined, which totals around… 568 GB of RAM+VRAM. Then, worst case scenario, I have a 4x NVMe Gen 4 RAID 10 array capable of 28 GB/s reads.

It’s a very YOLO project that I didn’t really think through all that much, but I think I’m quite close to somewhat being able to make it happen.

So my question is: has anyone here run R1 locally on home hardware? My main question is whether I can count RAM+VRAM like this (aka do they really add up, or will I still be limited to 512 GB with the 56 GB just duplicated into VRAM), and can I offload some of it to stream from NVMe to make a bit of space for longer context?..

I’m fairly new to this and fairly clueless as to what I’m doing so sorry if my questions are really dumb . _ .

I can’t say I’ve done it, but this topic has been surprisingly popular; here’s a guide from someone who is running DeepSeek on a local setup that sounds very close to yours: How To Run Deepseek R1 671b Fully Locally On a $2000 EPYC Server – Digital Spaceport


I’ve seen that one, but the guy says it only runs with 2048 context, which according to my testing is fairly useless. But he didn’t have any GPUs in that rig and didn’t have particularly fast storage, so I think I might be able to squeeze out at least 4096 context maybe?.. As long as it adds up the way I think it does. I would call that a day, but if not, then I guess I will need to look into SSD streaming options, which… I don’t know much about.

I think the other key is mentioned in that guide: they mention the use_mlock option of Ollama, which is the same as the --mlock option if you’re running directly with llama.cpp. Using that will attempt to keep the parts of the model that are loaded in memory from being swapped or compressed, and presumably it will then read bits back off the disk as needed.
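
For reference, here’s a minimal llama.cpp sketch of that flag (the GGUF filename, layer offload count, and context size are placeholders, not a tested config):

# --mlock pins the mapped weights in RAM so the kernel won't swap or compress them;
# -ngl offloads a few layers to the GPUs, -c sets the context length
./llama-server \
    -m /mnt/models/DeepSeek-R1-UD-IQ1_S.gguf \
    --mlock \
    -ngl 8 \
    -c 4096 \
    --host 127.0.0.1 --port 8080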

I’ve been testing the 671b model locally on my 32-core EPYC with 512GB of DDR4. I can get just under 2 t/s running it all on the CPU. However, it’s limited to a maximum of 5K context with that much memory, and I’ve found that it mostly just talks itself in circles while <think>ing and then runs out of memory and crashes. I think for any useful testing of the full-fat model you might need 1TB of RAM, or at least 768GB for those lucky folks who have 12 channels.

docker run -d -p 3000:8080 --gpus=all -v ollama:/root/.ollama -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:ollama

Then open localhost:3000 in a browser. Make an account, open the admin settings, go to connections, press Manage Ollama, type “secfa/DeepSeek-R1-UD-IQ1_S” into the import field, and wait for the download.
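
If you’d rather skip the web UI for the import step, the same tag can (as far as I know) also be pulled with the ollama CLI bundled in that container, using the container name from the docker run line above:

docker exec -it open-webui ollama pull secfa/DeepSeek-R1-UD-IQ1_S
# or chat with it straight from the terminal
docker exec -it open-webui ollama run secfa/DeepSeek-R1-UD-IQ1_S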

I was running 32B as well as 70B in various quantizations, but they’re just Llama/Qwen models re-trained. It’s not the real deal, and 32B is fairly dumb actually, especially if you ask it about some niche stuff. 70B Q5_K is a bit better, but still, AFAIK, not even close to full-blown DS.

lapsio is correct, the DeepSeek R1 models hosted on Ollama are Llama distills, meaning they used Llama as a base and trained it on DeepSeek outputs to create a better Llama model.

Are you quite sure? If I ask it what it is, it says its name is Deepseek, and if I try censored questions like “What’s Tiananmen Square?” it won’t answer unless I edit its reply to “Sure, let me explain that for you.” and then press Continue Answer.

Yes. On Hugging Face, in the official DeepSeek R1 repository, the original models have full names like “deepseek-r1-distill-llama-70B”, so you can see more clearly that it is in fact Llama. 70B is Llama; 32B and below are Qwen.

Also, FYI, there is no such thing as an uncensored full R1 available, AFAIK. At least not publicly. It was fairly sad news to me. I consulted the team responsible for pre-training the uncensored R1 variants, and apparently there are bugs in AMD and Nvidia drivers that make retraining in Q8 and below impossible, and they weren’t able to obtain hardware capable of retraining at FP16 because it would require something like 2.5 TB of VRAM. So until that gets fixed, I guess we need to wait. There are known procedures for un-censoring any R1 variant, but nobody in the community has the hardware and will to do that, I guess.

Thanks for the tip about editing its answer, I’ll try that and see if it’s able to provide the same answers as the uncensored builds.

Thanks for letting me know, I feel so deceived. Still pretty impressive for just being Qwen.
I did find out that someone else had merged together the unsloth quant of the real DeepSeek R1 and put it up on ollama, and that seems to work fine. I’m downloading the IQ2_XXS to merge and add to ollama manually right now to see if maybe that will work better. I’ll just copy secfa’s modelfile to get the proper chatbot initiator; my own attempt to merge and install IQ1_S ended miserably and wouldn’t even do any thinking, probably because I didn’t have a proper modelfile for it.
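
For anyone else trying the manual route, this is roughly what I mean (the split filenames are placeholders for whatever the unsloth download actually ships):

# merge the split GGUF parts with llama.cpp's gguf-split tool
./llama-gguf-split --merge DeepSeek-R1-UD-IQ2_XXS-00001-of-00004.gguf DeepSeek-R1-UD-IQ2_XXS.gguf
# reuse the known-good template/parameters from secfa's upload
ollama show --modelfile secfa/DeepSeek-R1-UD-IQ1_S > Modelfile
# edit the FROM line to point at the merged .gguf, then register it
ollama create deepseek-r1-iq2_xxs -f Modelfile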

“How many 'R’s are there in the word Strawberry?”
“In this case, there are two. So, the total is 8 (from Pineapple) + 2 (from Strawberry) = 10.”

You can run full DeepSeek at a decent quant with ktransformers.


Yeah, we have a whole thread on running the DeepSeek-R1 671B unsloth quants (not the little distill models) over here:

tl;dr:

Your best bet today is to check out this guide on ktransformers, like @piper-nigrum mentions. It has precompiled binary Python .whl files you can simply uv pip install, which should get you going.
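
Very roughly, that route looks like this (the wheel filename is just an example pattern, grab the one from their releases matching your Python/torch/CUDA, and the GGUF path is a placeholder):

uv venv ./venv --python=3.12 && source venv/bin/activate
uv pip install ktransformers-0.2.1+cu124torch24avx2-cp312-cp312-linux_x86_64.whl
# point it at the downloaded R1 GGUF quant; flags follow their local_chat example,
# with --cpu_infer set roughly to your physical core count
python -m ktransformers.local_chat \
    --model_path deepseek-ai/DeepSeek-R1 \
    --gguf_path /mnt/raid/models/DeepSeek-R1-UD-Q2_K_XL/ \
    --cpu_infer 24 \
    --max_new_tokens 1000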

Details

  1. You want the fastest RAM bandwidth you can get to maximize performance. Guessing you can hit 10 tok/sec with the smaller UD-Q2_K_XL quant and maybe 6 tok/sec with the Q4_K_M quant (rough arithmetic sketch after this list); exact quants and sizes are listed in the guide. Probably configure your BIOS to NPS1 for a single NUMA node to keep it simple, as the software is not easy to optimize.

  2. You’re probably only gonna use a single GPU, whatever you have that is fastest. There isn’t really an advantage to using multiple GPUs right now in terms of speed; if you offload more MoE layers it hurts CUDA graph performance, so it’s actually slower. You might get a little more context by using them together, but you’ll have to manually edit the .yaml file to offload them (with ktransformers anyway).

  3. No need for RAID arrays, as a single Gen5 NVMe maxes out the system’s kswapd0 buffered I/O disk access anyway in situations where the model is too big to fit into RAM. But 512 GB is perfect for running anything smaller than Q4, which is still great quality.
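
To put rough (assumed, not measured) numbers behind point 1:

# 8-channel DDR4-3200 peaks around 8 x 25.6 GB/s ~= 205 GB/s
# R1 is MoE, so only ~37B of the 671B params are active per token; at roughly
# 2.5 bits/weight (UD-Q2_K_XL territory) that's ~11-12 GB read per token
echo "scale=1; (8 * 25.6) / 12" | bc   # ~17 tok/s theoretical ceiling
# real-world overhead puts you closer to the ~10 tok/sec guessed above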

Smaller Options

If you want speeeed, I’d suggest going with either vllm or sglang and using the new Qwen/QwQ-32B, which is about the best model going right now with similar <think></think> reasoning ability. It’s better than the other, earlier distill models.

You could also use llama.cpp, but these engines do a pretty good job with tensor parallelism on multiple GPUs, depending on the quant (not all quants support -tp 2).

Keep in mind this assumes you are not using NVLINK and are not using the hacked nvidia driver to re-enable P2P on supported GPUs.

vllm

# https://docs.astral.sh/uv/getting-started/installation/
curl -LsSf https://astral.sh/uv/install.sh | sh

# https://docs.vllm.ai/en/stable/getting_started/quickstart.html
mkdir vllm
cd vllm/
uv venv ./venv --python=3.12 --python-preference=only-managed
source  venv/bin/activate
uv pip install vllm
vllm --version
# INFO 03-07 11:13:15 __init__.py:207] Automatically detected platform cuda.
# 0.7.3

CUDA_VISIBLE_DEVICES="0,1" \
NCCL_P2P_DISABLE=1 \
vllm serve \
    OPEA/QwQ-32B-int4-AutoRound-gptq-sym \
    --download-dir /mnt/raid/models/ \
    --dtype half \
    --served-model-name "QwQ-32B-GPTQ-INT4" \
    --tensor-parallel-size 2 \
    --enforce-eager \
    --disable-custom-all-reduce \
    --host 127.0.0.1 \
    --port 8080

INFO 03-12 11:29:19 metrics.py:455] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 52.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.3%, CPU KV cache usage: 0.0%.

sglang

git clone https://github.com/sgl-project/sglang.git
cd sglang
uv venv ./venv --python 3.12 --python-preference=only-managed
source  venv/bin/activate
uv pip install wheel setuptools packaging
uv pip install -e "python[all]" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer-python

CUDA_VISIBLE_DEVICES="0,1" \
NCCL_P2P_DISABLE=1 \
python -m sglang.launch_server \
    --model-path OPEA/QwQ-32B-int4-AutoRound-gptq-sym \
    --download-dir /mnt/raid/models/ \
    --served-model-name "QwQ-32B-GPTQ-INT4" \
    --dtype half \
    --tensor-parallel-size 2 \
    --enable-p2p-check \
    --host 127.0.0.1 \
    --port 8080

[2025-03-12 11:26:18 TP0] Decode batch. #running-req: 1, #token: 657, token usage: 0.00, gen throughput (token/s): 56.91, largest-len: 0, #queue-req: 0,