Local AI with very low GPU usage. Bottleneck or setup problem? (Solved)

I’m repurposing my old Z440 workstation (E5-2673 v4, 20 cores, 2x Quadro M4000 in SLI) to host AI and let my main machine breathe. Everything seems to work fine, except that GPU usage spikes to around 90% for half a second after I send a prompt, then both GPUs settle between 10-20% while actively generating an answer. Both GPUs are ancient Quadro M4000s. SLI is enabled and active, because that seemed to help GPU utilization just a little. Could memory bandwidth cause such a dramatic effect? Currently only half of the memory channels are populated.
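For a rough sense of what half-populated channels cost, here is a back-of-envelope sketch. The platform and speed grade are my assumptions (quad-channel DDR4-2400 is typical for this CPU; "half populated" taken as 2 of 4 channels), not figures from the thread:

```python
# Back-of-envelope system RAM bandwidth estimate.
# Assumption: quad-channel DDR4-2400 platform, 64-bit channels.
MT_PER_S = 2400e6        # DDR4-2400: 2400 mega-transfers per second
BYTES_PER_TRANSFER = 8   # each channel is 64 bits wide

def ram_bandwidth_gb_s(channels: int) -> float:
    """Peak theoretical system RAM bandwidth in GB/s."""
    return channels * MT_PER_S * BYTES_PER_TRANSFER / 1e9

print(ram_bandwidth_gb_s(2))  # half populated -> 38.4 GB/s
print(ram_bandwidth_gb_s(4))  # all channels   -> 76.8 GB/s
```

So half-populating the channels halves the peak system RAM bandwidth, which matters mainly for any layers that spill to CPU rather than for work that stays in VRAM.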

  • Edit: Version 0.3.23 of LM Studio fixed the issue

I suspect Windows Task Manager is inaccurate. Is there a better tool to measure GPU usage for NVIDIA cards on Windows?

The Task Manager percentages seem plausible, though, as the fans don't spin up, and the temps would be near 90 degrees if the Quadros were actually doing any work :sweat_smile:


Could be a software implementation detail, such as specific tensor types being forced to run on the CPU instead of the GPU.


I'm thinking maybe those old Quadros might just be a bit too old :sweat_smile: The latest OpenAI GPT 20B model just crashes immediately after prompting, but works fine on the same system with the Quadros replaced by an A4000. Gemma and DeepSeek both seem to get along nicely with the SLI Quadro setup, if only it weren't for the Quadros almost idling while generating an answer :joy:


I think you need to select the CUDA option in Windows Task Manager to view GPU utilization for things like this. If you can, use nvidia-smi or nvtop/nvitop instead.
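For a quick poll from the command line, something like this works with the nvidia-smi that ships with the NVIDIA driver (the install path on Windows can vary; it is often already on PATH):

```shell
# Print per-GPU utilization and VRAM use in CSV, refreshing every second.
# --query-gpu selects the fields, -l 1 loops with a 1-second interval.
nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used --format=csv -l 1
```

Watching this while a response is generating gives a much more trustworthy picture than the default Task Manager "3D" graph.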

Couldn’t it be that those GPUs are on the older side with slow memory bandwidth (less than 200 GB/s each)? After prompt processing (which is the compute-bound part), you get heavily bottlenecked by memory bandwidth during token generation, so die usage is really low because the GPU is data-starved.
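To make that intuition concrete: during decoding, generating each token has to stream roughly all of the model weights through the GPU, so bandwidth puts a hard ceiling on tokens per second while the compute units mostly wait. A rough sketch (the model size here is an illustrative assumption, not a measurement):

```python
# Rough token-generation ceiling from memory bandwidth alone:
#   tokens/s <= bandwidth / bytes of weights read per token
M4000_BW_GB_S = 192.0   # Quadro M4000 peak memory bandwidth, ~192 GB/s
MODEL_SIZE_GB = 4.0     # assumption: e.g. a ~7B model at 4-bit quantization

def max_tokens_per_s(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on decode speed if every token streams all weights."""
    return bandwidth_gb_s / model_gb

print(max_tokens_per_s(M4000_BW_GB_S, MODEL_SIZE_GB))  # -> 48.0
```

Even at that ceiling the ALUs are mostly idle waiting on VRAM, which is exactly what shows up as 10-20% "utilization" in monitoring tools.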

Update:
Got the update to LM Studio version 0.3.23 today, which brought a huge performance uplift to the old SLI Quadros, and openai gpt-oss-20b now works perfectly. GPU utilization is much better, and compute switches very nicely between the cards with SLI enabled.
I'm still trying to figure out the optimal models and settings for this system, but it seems the old Quadros still have some life and useful compute power in them. :smiley:
