I’ve been grumbling for ages about being stuck with Ollama: the wife prefers gpt-oss, I need qwen3-coder, and I only have one machine, without enough VRAM to run both simultaneously. That means we need something that can switch models on the fly and manage the GPU accordingly.
Well, today I discovered that llama.cpp gained a router mode a few weeks ago: basically, you fire up the server without specifying a model, only a model directory (--model-dir). Then point Open-WebUI at it (for the multi-user capability, which llama.cpp’s server lacks), disable Ollama, and enjoy a much happier life with all the stuff Ollama wouldn’t let you tweak.
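To be clear about how the switching actually works (this is my understanding of the router, so double-check against the docs): the server exposes the usual OpenAI-compatible endpoints, and whichever model name the client puts in the model field of a request is what gets loaded from the model directory. Roughly:

```bash
# Model name is just an example; it should match whatever the router
# lists for the GGUFs in your --model-dir (GET /v1/models shows them).
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen3-coder",
        "messages": [{"role": "user", "content": "Write hello world in Rust"}]
      }'
```

Open-WebUI does the same thing under the hood when you pick a model from its dropdown, which is why the switch-on-demand behaviour just falls out of pointing it at the router.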
The only catch appears to be that you need to restart the server to pick up newly added models (if somebody knows a better way to do it, I’m all ears). I’ve got it running from docker compose, so a restart is pretty trivial.
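For reference, my compose file looks roughly like the sketch below. Treat it as a starting point rather than anything official: the image tags, ports, volume paths and the GPU reservation block are just my setup, and the Open-WebUI environment variable names are from memory, so check their docs.

```yaml
services:
  llama-server:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda   # use :server for CPU-only
    command: --model-dir /models --host 0.0.0.0 --port 8080
    volumes:
      - ./models:/models               # drop GGUF files in here
    ports:
      - "8080:8080"
    deploy:                            # NVIDIA only; needs the container toolkit
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    environment:
      - OPENAI_API_BASE_URL=http://llama-server:8080/v1   # point at the router
      - ENABLE_OLLAMA_API=false                           # Ollama disabled
    ports:
      - "3000:8080"
    volumes:
      - open-webui:/app/backend/data
    depends_on:
      - llama-server

volumes:
  open-webui:
```

When I add a new GGUF to ./models, picking it up is just a `docker compose restart llama-server`, which is why the restart caveat doesn’t bother me much.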
Apologies if I’m late to the party on this one, but…yeesh, I’d have thought they’d be shouting this one from the rooftops; Ollama is basically obsolete now.
Nice, and agreed: ik_llama.cpp/llama.cpp are the way to go for the best performance and latest features! Though the various precompiled downstream projects are a convenient way for new folks to get started.
Before the new “router mode” feature, I believe people were using a proxy in front of the llama-server endpoint to handle the model switching. Not sure whether that needs a restart to pick up new models or not.
Cool to hear it’s working out for you, as yes, ain’t no one got enough VRAM to keep all the unused models loaded haha…