I use Open WebUI with Ollama 0.9.0
Wow, thanks for this! Very informative.
I’ll test my Qwen3 14B Q8 and see whether its performance differs from what you have posted here, and also the 32B once it exceeds 24 and 32 GB.
I have done plenty of work with both Qwen3 14B Q8 and BF16, and your response token/s numbers look about the same as what I get with the W7900, but I still want to test with the same prompt. However, it looks like you are only getting the performance of a single 7900 XTX despite having two of them.
If the model fits entirely on a single GPU, Ollama loads it onto that GPU.
Otherwise, Ollama loads as much as possible onto one GPU, offloads whatever is left to the second one, and the compute load is distributed across the GPUs accordingly.
With llama3.3:70b, for example, the first GPU shows about 90% load and the second about 50%.
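If you want to check the split yourself, Ollama's API reports it. Here is a minimal sketch in Python, assuming the default endpoint at http://localhost:11434 and the /api/ps route (which, as far as I know, lists each loaded model with its total size and the portion sitting in VRAM):

```python
# Minimal sketch: query Ollama's /api/ps endpoint to see how a loaded
# model is split between VRAM and system RAM. Assumes the default
# endpoint at http://localhost:11434; adjust if yours differs.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:11434/api/ps") as resp:
    data = json.load(resp)

for model in data.get("models", []):
    total = model.get("size", 0)        # total bytes the model occupies
    vram = model.get("size_vram", 0)    # bytes resident in GPU memory
    leftover = total - vram             # anything left over sits in system RAM
    pct = 100 * vram / total if total else 0
    print(f"{model['name']}: {vram/2**30:.1f} GiB of {total/2**30:.1f} GiB "
          f"in VRAM ({pct:.0f}%), {leftover/2**30:.1f} GiB elsewhere")
```

As far as I can tell this only shows the VRAM vs. system-RAM split, not which of the two cards the layers landed on; for the per-GPU picture I just watch rocm-smi while a prompt is running.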
The thing is, I don’t have a clue yet what I can expect from this setup.
I will try another inference framework next week, simply to see how the cards behave in different environments.
Maybe that would be something for you, too.
Well, I’m just saying that the performance numbers you got on Qwen3 14B BF16 are almost identical to what I see with my W7900. Same with the 14B Q8; my numbers are actually slightly better.
With your prompt I get 20.33 tokens/second on the 14B BF16 and 30.15 on the Q8 version, although I had other things running at the time, so it would likely have been a bit faster otherwise.
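For what it's worth, the tokens/second figure I'm quoting is just what Ollama reports itself: eval_count divided by eval_duration from a non-streaming /api/generate response. A quick sketch, assuming the default endpoint; the model tag and prompt are placeholders, swap in whatever you actually pulled:

```python
# Quick sketch: reproduce a response tokens/s figure from Ollama's own timing
# fields. Assumes the default endpoint; the model tag and prompt below are
# placeholders, not the exact ones used in this thread.
import json
import urllib.request

payload = json.dumps({
    "model": "qwen3:14b-q8_0",   # hypothetical tag, use whichever quant you tested
    "prompt": "Summarise the trade-offs between Q8 and BF16 quantisation.",
    "stream": False,
}).encode()

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    data = json.load(resp)

# eval_count = generated tokens, eval_duration = time spent generating (in ns)
print(f"response: {data['eval_count'] / data['eval_duration'] * 1e9:.2f} tokens/s")
```

The same response also carries prompt_eval_count and prompt_eval_duration if you want the prompt-processing rate computed the same way.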