I’ve been grumbling for ages about being stuck with Ollama: the wife prefers gpt-oss, I need qwen3-coder, and I only have one machine, without enough VRAM to run both simultaneously. That means we need something that can switch models on the fly and manage the GPU accordingly.
Well, today I discovered that llama.cpp gained a router mode a few weeks ago: basically, you fire up the server without specifying a model, only a model directory (--model-dir). Then point Open-WebUI at it (for the multi-user capability, which llama.cpp’s server lacks), disable Ollama, and enjoy a much happier life with all the stuff Ollama wouldn’t let you tweak.
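To be clear about how the switching actually works (this is my understanding of the router, so double-check against the docs): the server exposes the usual OpenAI-compatible endpoints, and whichever model name the client puts in the model field of a request is what gets loaded from the model directory. Roughly:

```bash
# Model name is just an example; it should match whatever the router
# lists for the GGUFs in your --model-dir (GET /v1/models shows them).
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen3-coder",
        "messages": [{"role": "user", "content": "Write hello world in Rust"}]
      }'
```

Open-WebUI does the same thing under the hood when you pick a model from its dropdown, which is why the switch-on-demand behaviour just falls out of pointing it at the router.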
The only catch appears to be that you need to restart the server to pick up newly added models (if somebody knows a better way to do it, I’m all ears). I’ve got it running from docker compose, so a restart is pretty trivial.
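For reference, my compose file looks roughly like the sketch below. Treat it as a starting point rather than anything official: the image tags, ports, volume paths and the GPU reservation block are just my setup, and the Open-WebUI environment variable names are from memory, so check their docs.

```yaml
services:
  llama-server:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda   # use :server for CPU-only
    command: --model-dir /models --host 0.0.0.0 --port 8080
    volumes:
      - ./models:/models               # drop GGUF files in here
    ports:
      - "8080:8080"
    deploy:                            # NVIDIA only; needs the container toolkit
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    environment:
      - OPENAI_API_BASE_URL=http://llama-server:8080/v1   # point at the router
      - ENABLE_OLLAMA_API=false                           # Ollama disabled
    ports:
      - "3000:8080"
    volumes:
      - open-webui:/app/backend/data
    depends_on:
      - llama-server

volumes:
  open-webui:
```

When I add a new GGUF to ./models, picking it up is just a `docker compose restart llama-server`, which is why the restart caveat doesn’t bother me much.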
Apologies if I’m late to the party on this one, but…yeesh, I’d have thought they’d be shouting this one from the rooftops; Ollama is basically obsolete now.
Nice, and agreed: ik_llama.cpp/llama.cpp are the way to go for the best performance and latest features! Though the various precompiled downstream projects are a convenient way for new folks to get started.
Before the new “router mode” feature, I believe people were using a proxy in front of the llama-server endpoint to handle the model switching. Not sure whether that needs a restart to pick up new models or not.
Cool to hear it’s working out for you, as yes, ain’t no one got enough VRAM to keep all the unused models loaded haha…