How can I be sure it's using my GPU? Also, should I go down to a smaller model with this GPU? I realize it's not a super high-end card. Honestly, I googled how to run DeepSeek R1 locally and chose one of the first links I came across. So is using Ollama the way to go, or might there be other potentially better ways to run LLMs?
EDIT: I definitely know that something’s not right. While I was trying to type this up, my desktop started thrashing due to Ollama using all of my memory, which is only 32GB.
You’ll be unable to fit an entire 70b model in a single 16GB GPU.
When you're using Ollama, you can run the ollama ps command to see whether a model is fully on the GPU or partly offloaded to the CPU. Here's an example of me using deepseek-r1:32b on my 6900 XT, which also has 16GB of VRAM.
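The PROCESSOR column is the part to look at: 100% GPU means the whole model is in VRAM, while any CPU percentage means part of it spilled into system RAM. Roughly what that looks like (illustrative output, exact columns and numbers depend on your Ollama version and setup):

```
$ ollama ps
NAME               ID           SIZE     PROCESSOR          UNTIL
deepseek-r1:32b    <model id>   22 GB    15%/85% CPU/GPU    4 minutes from now
```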
Because the 70b model is 42GB on disk, you'd need roughly 48GB of VRAM at minimum to run it entirely on GPU.
Given that you only have 32GB of system memory, you won't be able to run 70b models with any real efficiency: for each token generated, it has to swap different sections of the model in and out of memory, which works out to roughly one token every few seconds. You'd be better off measuring in tokens per minute rather than tokens per second.
I highly recommend trying the 14b model, as that's what I use when running DeepSeek on my local machine. It's quick and only takes up ~11GB of VRAM.
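If you want to try it, pulling and running it is a one-liner each (tag as currently listed in the Ollama library):

```
ollama pull deepseek-r1:14b
ollama run deepseek-r1:14b
# in another terminal, confirm where it loaded
ollama ps
```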
Also keep in mind that the model needs GPU memory for context (the previous user-input tokens and model-generated tokens it uses to remember the conversation history), and your OS will have some overhead as well. Considering all that, you'll probably want to max out around 14-ish billion parameters at Ollama's default quants.
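Related knob: Ollama keeps the context window fairly small by default, and raising it costs additional VRAM for the KV cache, which can push a model that barely fit back onto the CPU. You can bump it per session from inside the REPL (syntax from memory; run /? inside ollama run to double-check):

```
$ ollama run deepseek-r1:14b
>>> /set parameter num_ctx 8192
```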
Don't underestimate the usefulness of the lower-parameter models. There are many strong 7b and 14b models out there that will fit on consumer GPUs, and they run faster by not having to offload to the CPU. I think it makes for a way better experience.
It really depends on your expected time to response. If you’re trying to have a “conversation” it makes sense to run a lower parameter model, but I’ve got 70b models that I’ll run on my CPU if I’m going to ask it something then step away for 20-30 minutes.
Another rule of thumb is roughly 1GB of VRAM for every 1 billion parameters, though the model's file size is a much better indicator of whether it will fit on your GPU.
There's also a trade-off between parameter count and quantization level. Generally, for better model outputs you want a higher-parameter model at the highest quantization level your system can handle.
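To make that concrete: a 14B model at the default ~4-bit quant and a 7B model at 8-bit end up with a similar footprint, and the 14B will usually give better outputs. Tags and sizes below are from memory and may have changed, so check the Ollama library pages:

```
# similar download size, different trade-off
ollama pull qwen2.5:14b               # 14B at the default ~4-bit quant, roughly 9 GB
ollama pull qwen2.5:7b-instruct-q8_0  # 7B at 8-bit, also roughly 8 GB
```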
Answer to the OP’s question: use the 14B r1 model for the 9070
Given the OP is running an RX 9070: the card has 16GB of VRAM, but after OS overhead and context you realistically have ~12GB free for weights, and the 14B model's size is 9GB. So the upper end of your models should be about 14B; by the rule of thumb above, you should be able to comfortably run ~12B-parameter models.
When your GPU doesn't have enough VRAM to fit the model weights, Ollama will use system memory and the CPU for the part that doesn't fit. Tokens/second will be much lower than running fully on the GPU, but it lets you run bigger models at slower speeds.
LM Studio is great, just keep in mind that its license is for personal use. Ollama has an MIT license, so if you're making money with your model endpoint, I'd definitely learn how to use it.
That said, I really like how LM Studio has hardware compatibility filters for models. It's great if you don't know what parameter count and quantization level to pick. The devs also have someone who fine-tunes and releases models pretty quickly when companies put new ones out. I follow their Discord to learn about new models XD
Other model suggestions: look into qwq for advanced reasoning models
If you are dead set on running a model similar to the r1 model, I would consider running qwq, as the r1 models on Ollama are fine-tuned Qwen models.
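If you want to give it a shot (size from memory, check the library page; a ~20GB 32B model will spill past a 16GB card, so expect some CPU offload):

```
ollama pull qwq   # 32B reasoning model, roughly 20 GB at the default quant
ollama run qwq
ollama ps         # expect a CPU/GPU split on a 16GB card
```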
I've been trying out the unsloth models and I find that the outputs are much better than the base qwq models.
The 70B r1 model is a fine-tuned Llama model.
Quantization vs Parameter count: Go for more parameters at a lower quantization when possible
Given the OP has ~12GB of usable VRAM, I would start with the base 27B model, as its size is only 16GB. If performance is good and it utilizes the GPU enough, stay there. Otherwise drop down to the 9B base model.
I would agree that if you need an LLM to follow your prompts, instruct is much better.
If you aren't happy with the base models, try the instruct variant at a higher quantization level and drop it down if Ollama isn't using your GPU. The next gemma version I'd try with the 9070 is the 27b-instruct-q3_K_S; the model size is about 12GB. If Ollama is offloading it to the CPU, then I'd drop down to the 9b-instruct-q8_0.
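Put together, that check-and-drop-down loop looks something like this. I'm assuming the gemma2 tags here, and the sizes are approximate, so check the library page for the exact tag names:

```
ollama pull gemma2:27b-instruct-q3_K_S   # ~12 GB
ollama run gemma2:27b-instruct-q3_K_S
ollama ps                                # if PROCESSOR shows a CPU share, drop down:
ollama pull gemma2:9b-instruct-q8_0      # ~10 GB
```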
Repeat this process for any other popular models on Ollama. The LocalLLaMA forum is a great place to see what models people use, and the LM Studio and Ollama Discord servers are also great for learning when new models are available.
Sorry for leaving this hanging. Doing a bunch of homelab stuff rn, and meanwhile work’s been busy while I have also been applying for jobs elsewhere. I’ll have to come back to this when I don’t have to deprioritize it.