I have found myself in an odd spot when it comes to self-hosting an AI chat with a WebGUI, or at least it feels that way. Currently I am stuck and not sure how to proceed. I have a desktop I built last October that is not being used to its full potential at the moment. Here are the specs:
AMD Ryzen 9 7950X3D
MSI MAG X670E Tomahawk
96GB (2x48GB) DDR5-5600
4TB Samsung 990 EVO Plus
AMD Radeon RX 9070 XT 16GB
Previously I had an RTX 4070 Super in the system, but I kept having issues with Nvidia drivers (just unstable), so I got the 9070 XT. I have not had an issue on the system since, and it has been stable. I am using Fedora 42 at the moment, and I feel that it may be part of my issue. I have not been able to get llama.cpp or Ollama to work. I was able to get Ollama to work when I had the 4070 Super, but I kept running into errors with the 9070 XT that are beyond my ability to resolve. The 9070 XT is my first dedicated AMD GPU since the ATI days (a Radeon 9200). I know llama.cpp provides a build for Ubuntu with Vulkan, so I was thinking of trying it and going from there. I am open to using another distro; I just want some knowledge/advice before moving forward. If anyone has any tips, advice, links to other forum posts, etc., I would appreciate it.
Solution: I installed the rocm and llama-cpp packages from the package manager on Fedora 43. I also put a reply at the end of the post with a little more info. I can confirm the 9070 XT does work with ROCm on Fedora 43.
I did not document any of the errors. They were always vague and did not reveal a cause from what I could see. With Ollama, the service would simply not start, and I did not have any errors during install/setup; it just did not work. I never got anywhere with llama.cpp: I would have had to build it from source, and since I had issues with Ollama I decided not to try, as it is a longer process and would likely not have worked from what I saw. If I set up Ollama to use the CPU only it would work, but that defeats the point for what I want, as I would be limited in performance. Running it on the GPU is the goal. Using Ubuntu is the only direction I know of at the moment, since things are documented there; I have not seen documentation for anything else.
Were you using Docker or running it directly on the machine? I have Ollama + Open WebUI running on a Pop!_OS machine with an NVIDIA GPU (it’s a Quadro-type card) via Docker Compose, without any issues. IIRC, for AMD GPU acceleration, you need either ROCm or Vulkan, and the Ollama docs show compatible cards for ROCm support (although I’d assume the 9070 XT might not have been added yet). That doc also has info on using Vulkan for Ollama, and you might want to check out this issue.
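As a rough sketch, the plain docker run equivalents look something like this (the ROCm image and device flags are what the Ollama Docker docs describe for AMD; I have only run the NVIDIA path myself, so treat the AMD line as untested on my end):
# NVIDIA: needs the NVIDIA Container Toolkit installed on the host
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
# AMD (ROCm): pass the GPU devices through and use the rocm image tag
docker run -d --device /dev/kfd --device /dev/dri -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama:rocm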
The 4070 does not have more VRAM than the 9070 XT, so it should not matter much.
You should be able to use any Linux distribution you’d like. Then you compile llama.cpp with the Vulkan backend yourself and you’re ready to go. As a front end, I would start an Open WebUI container locally that uses llama.cpp as the backend, as sketched below.
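A rough sketch of that front-end wiring, assuming llama-server is already listening on port 8080 on the host (the OPENAI_API_BASE_URL variable comes from the Open WebUI docs; double-check the current name before relying on it):
# Open WebUI container pointed at llama-server's OpenAI-compatible endpoint on the host
docker run -d -p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-e OPENAI_API_BASE_URL=http://host.docker.internal:8080/v1 \
-v open-webui:/app/backend/data \
--name open-webui ghcr.io/open-webui/open-webui:main
The UI then lives at http://localhost:3000 and llama-server shows up as an OpenAI-style model provider.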
You are a tad limited in which models you can run. @ubergarm here has models on Hugging Face that are suited to running on both GPU + CPU, but they are usually aimed at GPUs with 24GB of VRAM, so you need to see which ones fit in 16GB. You could reference the other thread.
I was having issues too. I followed a whole bunch of NetworkChuck videos and failed…
Then I switched boxes: instead of a dedicated Linux box, I added the GPU to my main server running bare-metal Windows Server 2022, with WSL2 + Docker Desktop.
Ollama for Windows runs natively, and then it’s a single command in Command Prompt to download the preconfigured WebUI container, and it’s done and works perfectly. One command, zero configuration necessary, and I avoided needing a dedicated server for it.
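For anyone wanting to replicate it, the single command would be something like the stock line from the Open WebUI README (flags may have changed since; it expects a native Ollama already running on the host):
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:main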
With 96GB RAM and 16GB VRAM you have some flexibility. Given your new GPU is AMD, you unfortunately won’t be able to run as many of my ik_llama.cpp quants, as they generally use quant types that don’t have kernels on the Vulkan or ROCm backends.
However, you can use ik_llama.cpp with mainline quants and still get speed-ups on the CPU/RAM-offloaded portion, and ik works fine with the Vulkan/ROCm backends for those mainline quant types (stuff by bartowski/unsloth/mradermacher etc. on Hugging Face).
You can build ik_llama.cpp pretty easily on Linux if you have some experience installing OS-level dependencies, e.g.:
git clone https://github.com/ikawrakow/ik_llama.cpp.git
cd ik_llama.cpp
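# note: the Vulkan backend needs the Vulkan development headers and the glslc shader compiler
# installed first (on Fedora roughly vulkan-headers, vulkan-loader-devel, and glslc -- package
# names are a best guess, check your distro) or the cmake configure step below will fail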
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=OFF -DGGML_VULKAN=ON
cmake --build build --config Release -j $(nproc)
# start off with a small model like Qwen3-14B Q4_0 (a legacy quant type with broad compatibility across Vulkan/ROCm etc.)
# https://huggingface.co/bartowski/Qwen_Qwen3-14B-GGUF?show_file_info=Qwen_Qwen3-14B-Q4_0.gguf
wget https://huggingface.co/bartowski/Qwen_Qwen3-14B-GGUF/resolve/main/Qwen_Qwen3-14B-Q4_0.gguf
./build/bin/llama-server \
--model Qwen_Qwen3-14B-Q4_0.gguf \
--alias bartowski/Qwen3-14B-Q4_0 \
-ctk q8_0 -ctv q8_0 \
-c 8192 \
-ngl 99 \
--threads 1 \
--host 127.0.0.1 \
--port 8080 \
--parallel 1 \
--jinja
Then open your browser to http://127.0.0.1:8080 to get a free built-in web UI experience.
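The same server also exposes an OpenAI-compatible API, so as a quick sanity check from another terminal (using the alias set above) something like this should return a completion:
curl http://127.0.0.1:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "bartowski/Qwen3-14B-Q4_0", "messages": [{"role": "user", "content": "Say hello in one short sentence."}]}'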
Once you have this up and running, you can experiment with adding custom clients (e.g. roo/cline/mistral-vibe/zed/openwebui, etc.) or trying different models and quants. With that much RAM you could get a larger MoE model, put all the routed experts on CPU/RAM and only the attn/shexp/first N dense layers on GPU, and have a decent-quality local model. 16GB of VRAM will cramp you on how much context you can fit.
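A minimal sketch of that expert-offload idea, assuming a MoE GGUF you have already downloaded (the filename is a placeholder; the -ot pattern routes every tensor whose name matches exps, i.e. the routed experts, to CPU/RAM, while -ngl 99 keeps everything else on the GPU):
# placeholder filename -- substitute whichever MoE quant you actually grabbed
./build/bin/llama-server \
--model Some-MoE-Model-Q4_0.gguf \
-ngl 99 \
-ot exps=CPU \
-c 8192 \
--host 127.0.0.1 \
--port 8080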
Fedora 42 does not support ROCm; Fedora 43 might, so I am upgrading to it and going to try again. I have been using Fedora 43 on my Framework 13 for about a week and have only noticed improvements. If that does not work, I am going to try Ubuntu. I have a spare SSD I can swap in, and I have been needing to remove two HDDs that I no longer need in the system anyway. I am just debating between 24.04 and 25.10, a return of the LTS vs. latest-release debate… Ubuntu 26.04 is going to have native support for ROCm, but that is still ~5 months away.
I am running it directly on the machine. I am not a big user of Docker and prefer to set things up manually. ROCm does support the 9070 XT from what I have seen and read. I am upgrading to Fedora 43 at the moment, as Fedora 42 does not support ROCm; 43 might.
The 4070 Super that I have is 12GB; my 9070 XT has 16GB. It is not much of a difference, but it is 4GB more, and when I used the 4070 Super that extra 4GB would have made a big difference. I would likely still be using it if it had 16GB, as Nvidia, CUDA, and all things AI are better supported and documented. I am just trying something different, since Nvidia’s prices for a better GPU were a deal breaker. I got my 9070 XT for just above MSRP, only about $50 more than I paid for the 4070 Super. I also do not expect to be able to run the latest and greatest models.
I upgraded to Fedora 43 and found that ROCm and llama.cpp are in the package manager; I do not recall whether I looked for them on Fedora 42. I decided to try the ROCm package that was there, and it is working! I then did my own build of llama.cpp. I had some dependency issues, which I was able to resolve, and the build completed without errors, but when I went to test it, it did not run and I got more errors. I then decided to try the llama.cpp package from the package manager instead, and it works, and I verified it is using the GPU. I just used an example from the GitHub page to test with, and there were no issues.
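Roughly, the steps look like this; the package names are a guess from memory, so run the searches rather than trusting them exactly:
# see what Fedora 43 actually ships
dnf search rocm
dnf search llama-cpp
# the ROCm tooling plus the packaged llama.cpp
sudo dnf install rocminfo llama-cpp
# confirm ROCm can see the card before pointing llama.cpp at it
rocminfo | grep -i gfx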
I also tested Ollama again and it is working without any issues, and I verified it is using the GPU as well. It is also working with Alpaca.
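If it helps anyone verify the same thing, the quick checks would be along these lines (the model name is just an example); ollama ps reports whether the loaded model is running on the GPU:
# pull and run a small model, then check which processor the loaded model is using
ollama run llama3.2 "hello"
ollama ps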
I am sure the two packages in the package manager are not the latest, but they work, and that is enough for me for the time being. I still have more to learn, as I am very new to llama.cpp. I have used Hugging Face before, but I feel overwhelmed by the number of models, so if anyone has any recommendations or tips, I would appreciate it.