Help installing local LLMs

I would like to stretch this guy’s legs, my AMD Radeon RX 9070, with a local LLM. I followed this guide on installing DeepSeek R1 with Ollama and I chose the 70 billion parameter model. I’m not sure if that’s too much, but it seemed rather slow to respond to my “Hello” prompt.

How can I be sure it’s using my GPU? Also, should I go down to a smaller model with this GPU? I realize it’s not a super high-end GPU. Honestly, I googled how to run DeepSeek R1 locally and chose one of the first links I came across. So is using Ollama the way to go, or are there other, potentially better, ways to run LLMs?

EDIT: I definitely know that something’s not right. While I was trying to type this up, my desktop started thrashing due to Ollama using all of my memory, which is only 32GB.

Also try https://lmstudio.ai ?

1 Like

You’ll be unable to fit an entire 70b model in a single 16GB GPU.

When you’re using ollama, you can use the ollama ps command to see whether a model is loaded on the GPU or the CPU. Here’s an example of me running deepseek-r1:32b on my 6900xt, which also has 16GB of VRAM.
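Roughly what that output looks like (the numbers below are illustrative, not an exact copy). The PROCESSOR column is the part to check; a 32b model doesn’t fully fit in 16GB, so it reports a CPU/GPU split:

~ » ollama ps
NAME               ID              SIZE     PROCESSOR          UNTIL
deepseek-r1:32b    <model id>      22 GB    27%/73% CPU/GPU    4 minutes from now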

Because the 70b model is 42GB in size, you’ll likely need 48GB at minimum to run it entirely on GPUs.

Given that you only have 32GB of system memory, you’ll not be able to run 70b models with any amount of efficiency, because for each token generated it will need to swap different sections of the model in and out of memory, which works out to roughly one token every few seconds. You’d be better off measuring in tokens per minute rather than tokens per second.

I highly recommend trying the 14b model, as that’s what I use when running deepseek on my local machine. It’s quick and only takes up ~11 GB of VRAM.

Also keep in mind that the model needs GPU memory for context (the previous user-input tokens and model-generated tokens it uses to remember the conversation history), and your OS will have some overhead as well. Considering all that, you’ll probably want to max out around 14ish billion parameters at ollama’s default quants.
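If you’re right at the edge of your VRAM, the context window is the main knob you can turn. A sketch assuming ollama’s interactive /set command (run /? inside the REPL to confirm it on your version):

~ » ollama run deepseek-r1:14b
>>> /set parameter num_ctx 4096
>>> Hello

A smaller num_ctx means less VRAM reserved for the KV cache, at the cost of the model forgetting older parts of the conversation sooner.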

5 Likes

don’t underestimate the usefulness of the lower parameter models. There are many strong 7b and 14b models out there that will fit on consumer GPUs, and they run faster by not having to offload to the CPU. I think it makes for a way better experience.

2 Likes

It really depends on your expected time to response. If you’re trying to have a “conversation” it makes sense to run a lower parameter model, but I’ve got 70b models that I’ll run on my CPU if I’m going to ask it something then step away for 20-30 minutes.

1 Like

Heuristic for model size, parameter count, and GPU VRAM

Agreed. To add a heuristic: the file size listed on the download page is approximately the amount of VRAM you need.


Another is roughly 1 GB of VRAM for every 1 billion parameters, though the file size is a much better indicator of whether the model will fit on your GPU.

There’s also a trade-off between parameter count and quantization level. Generally, for better model outputs, you want a higher-parameter model at the highest quantization level your system can handle.
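As a quick sanity check, ollama list shows the on-disk size of each model you’ve pulled. Using the sizes already mentioned in this thread (the rest of the output is illustrative):

~ » ollama list
NAME               ID              SIZE      MODIFIED
deepseek-r1:14b    ea35dfe18182    9.0 GB    2 days ago
deepseek-r1:70b    <model id>      42 GB     3 days ago

9 GB fits comfortably on a 16 GB card; 42 GB clearly doesn’t.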

Answer to the OP’s question: use the 14B r1 model for the 9070

Given the OP is running an RX 9070, the VRAM is 16 GB (call it ~12 GB usable after context and OS overhead) and the 14b model size is about 9 GB, so the upper end of your models should be about 14B; you should comfortably be able to run ~12B parameter models.

When your GPU doesn’t have enough VRAM to fit the model for inference, ollama will fall back to system memory and the CPU. The tokens/second will be much lower than running fully on the GPU, but it does let you run bigger models at slower speeds.
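If you want to see where the model actually landed, ollama ps shows the split, and on AMD you can also watch VRAM usage directly (assuming the ROCm tools are installed; exact flags may vary by ROCm version):

ollama ps                                # PROCESSOR shows e.g. "35%/65% CPU/GPU" when partially offloaded
watch -n 2 rocm-smi --showmeminfo vram   # used VRAM should jump when the model loads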

LMStudio is great, just keep in mind that the license is for personal use. Ollama has an MIT license, so if you are making money with your model endpoint, I’d definitely learn how to use it.
That said, I really like how LMStudio has hardware compatibility filters for models. It’s great if you don’t know what parameter count and quantization to pick. The devs also have someone who frequently fine-tunes and releases models relatively quickly when companies put out new ones. I follow their discord to learn about new models XD

Other model suggestions: look into qwq for advanced reasoning models

If you are dead set on running a model similar to the r1 model, I would consider running qwq, as the smaller r1 models on ollama are fine-tuned (distilled) Qwen models.

I’ve been trying out the unsloth models and I find that the outputs are much better than the base qwq models

The 70B model is a fine tuned llama model

Quantization vs Parameter count: Go for more parameters at a lower quantization when possible

If you want to learn more about model parameters vs quantization, this is a great post
https://old.reddit.com/r/LocalLLaMA/comments/1fjo7zx/which_is_better_large_model_with_higher_quant_vs/lnpiwxx/

In my experience, this still holds up.

There aren’t as many quantization levels available for the deepseek-r1 models on ollama.

Gemma 2 and picking a model at varying parameter sizes and quantization levels

Gemma 2 is a pretty good example of having a wide range of quantization levels and parameter counts.

To illustrate this process, let’s assume that there isn’t a quality drop-off at more aggressive quantization levels.

Technically, quality does start to drop off once you go more aggressive than q5:

https://old.reddit.com/r/LocalLLaMA/comments/1etzews/interesting_results_comparing_gemma2_9b_and_27b/

Given the OP has 16 GB of VRAM, I would start with the base 27B model, as the size is only 16 GB. If performance is good and it stays on the GPU, stay there. Otherwise drop down to the 9B base model.

You’ll also notice that there are other tags such as ...b-text-q... or ...b-instruct-q...; these are different variants of the same model. instruct refers to instruction fine-tuned, which indicates the model is designed to follow instructions. text refers to the plain text-completion variant, which just continues the input text rather than following instructions.
https://old.reddit.com/r/LocalLLaMA/comments/16qvh2o/noob_question_whats_the_difference_between_chat/k1zboy0/

I would agree: if you need an LLM to follow your prompts, instruct is much better.

If you aren’t happy with the base models, try the instruct variant at a higher quantization level and drop down if ollama isn’t using your GPU. The next Gemma tag I’d try with the 9070 is 27b-instruct-q3_K_S; the model size is about 12 GB. If ollama is offloading it to the CPU, then I’d drop down to 9b-instruct-q8_0.
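The whole loop is quick to test. A sketch of what I’d run (tag names as listed on the ollama library; double-check them there before pulling):

ollama pull gemma2:27b-instruct-q3_K_S
ollama run gemma2:27b-instruct-q3_K_S
# from a second terminal while it is answering:
ollama ps
# if PROCESSOR shows a CPU share, drop down:
ollama pull gemma2:9b-instruct-q8_0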

Repeat this process for any other popular models on ollama. The LocalLLaMA forum is a great place to see what models people use. The LM Studio and Ollama Discord servers are also great for learning when new models are available

2 Likes

My suspicions are confirmed. It’s not using my GPU. I must have something configured incorrectly.

~ » ollama ps                                                                                           linuxdragon@FirnenTechMachine
NAME               ID              SIZE     PROCESSOR    UNTIL              
deepseek-r1:14b    ea35dfe18182    11 GB    100% CPU     4 minutes from now    

Found more information:

Are you running this on bare metal or inside of a docker container? If it’s in a docker container, you should be able to pass the GPU through to the container.
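For reference, passing an AMD GPU into the container is a couple of device flags. Something like this, going off the ollama Docker instructions for ROCm (double-check the image tag against their docs):

docker run -d \
  --device /dev/kfd --device /dev/dri \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama:rocm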

Bare metal. Though, had I known that I could run it in a docker container, I would have.

I can say I have an AMD GPU in my main PC and I do what Vishal_Rao suggested and use LMStudio. Shockingly fast.

What distro? You likely don’t have the ROCm drivers installed.

If they’re not installed, you’ll not be able to use the GPU no matter what software tool you use.

Fedora 41. I use Wayland too.

Check this page then:

https://fedoraproject.org/wiki/SIGs/HC#Installation

I’m not super familiar with fedora anymore, so I’m not super helpful on fedora-isms.
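Roughly what I’d check once the packages from that page are installed (package names below are my guess, the wiki is authoritative):

sudo dnf install rocminfo rocm-smi   # assumed Fedora package names; see the wiki page above
rocminfo | grep -i gfx               # your GPU should show up as a gfx* agent
sudo systemctl restart ollama        # restart the daemon so it picks up the ROCm runtime
ollama ps                            # PROCESSOR should now say GPU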

1 Like

Magic

~ » ollama ps                                                                                           linuxdragon@FirnenTechMachine
NAME               ID              SIZE     PROCESSOR    UNTIL              
deepseek-r1:14b    ea35dfe18182    11 GB    100% GPU     4 minutes from now    
2 Likes

But…

~ » ollama run deepseek-r1:14b                                                                                                                                                                                                                    linuxdragon@FirnenTechMachine
>>> Hello
Error: POST predict: Post "http://127.0.0.1:41819/completion": EOF

Need to see the ollama daemon logs.
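Assuming the standard Linux install with the systemd service, they’re in the journal:

journalctl -u ollama -e --no-pager
# or follow them live while reproducing the error:
journalctl -u ollama -f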

Sorry for leaving this hanging. Doing a bunch of homelab stuff rn, and meanwhile work’s been busy while I have also been applying for jobs elsewhere. I’ll have to come back to this when I don’t have to deprioritize it.