My current local AI system is an older platform (Intel Core i7 6700K, Asus Z170-E motherboard, 64 GB DDR4-2400 RAM, 2x Gigabyte 3090s).
At idle, it is sucking back around 70 W.
I was wondering if there might be a more performant CPU/platform out there whose idle power consumption, while it is waiting for AI requests, would be < 70 W, but which would still give decent performance when it is answering said AI requests?
i.e. I know that slower processors can be lower power, but that comes at the sacrifice of performance.
I am trying to optimise between performance and power such that it would be the lowest idle power consumption with the minimum loss of performance.
I thought about using mini PCs, but the problem is that a lot of them don't have multiple PCIe 3.0 x16 slots for my nearly triple-width GPUs, which limits motherboard options.
Any recommendations will be greatly appreciated.
Thank you.
(For reference, I am using open-webui and running the Codestral:22b model via ollama, so it's not a super intensive AI workload in terms of responding to AI queries.)
You don't have a lot of room for manoeuvre here - those 3090s will be idling in the 21-25 W range each. That leaves you with a budget of 20-30 W for the rest of the system.
I was about to suggest the RTX 7000, but then I read about the 3090s. Wendell ran a Ryzen 7900 on his AI inference rig.
But if you are doing most of your inferencing on the GPUs, you can maybe look into Wolfgang's channel.
He did a low-power NAS build.
Unless you were thinking of using the CPU as another potential AI inference device? If so, you might want to wait for the desktop equivalent of Lunar Lake with AI cores.
Or for a Chinese manufacturer to make a Lunar Lake desktop board like this.
I suspect even low-end modern CPUs will at least get you something closer to the 6700K. And since you are inferencing with GPUs anyway, the CPU doesn't really matter as long as it has the required PCI-E lanes for the 2 GPUs.
Overall Rank:
495th fastest vs 840th fastest in multithreading out of 4754 CPUs
410th fastest vs 816th fastest in single threading out of 4754 CPUs
165th fastest vs 270th fastest out of 1368 Desktop CPUs
Other names: Intel(R) Core™ i7-6700K CPU @ 4.00GHz, Intel Core i7-6700K CPU @ 4.00GHz
CPU First Seen on Charts: Q3 2015
Overall Rank:
1211th fastest in multithreading out of 4754 CPUs
788th fastest in single threading out of 4754 CPUs
380th fastest out of 1368 Desktop CPUs
I'm using 3090s because it's existing hardware that I already have (trying to spend as little money as possible if I were to buy new hardware to change the system/platform out).
So, according to HWiNFO, the 3090s are actually idling at somewhere between 6-8 W. But that's based on the software reporting.
I don't have the tools necessary to measure the idle power draw of the individual 3090s, so I can't really tell how much of that 70 W idle power draw is because of the 3090s vs. because of the rest of the platform.
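One way to cross-check the HWiNFO number without extra hardware is to ask the driver directly via `nvidia-smi` (assuming the NVIDIA driver is installed; note this is still a software reading from the same source, not a wall measurement). The helper names here are my own illustration:

```python
import subprocess

def parse_power_draws(csv_text: str) -> list[float]:
    """Parse per-GPU power draw (watts) from nvidia-smi CSV output."""
    return [float(line.strip()) for line in csv_text.splitlines() if line.strip()]

def query_gpu_power() -> list[float]:
    # power.draw comes from the driver, so it is still a software
    # reading, not a measurement at the wall.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=power.draw",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    return parse_power_draws(out)

# Parsing demo on a captured reading (hypothetical values for two idle 3090s):
sample = "7.42\n6.88\n"
print(parse_power_draws(sample))               # [7.42, 6.88]
print(round(sum(parse_power_draws(sample)), 2))  # 14.3 W total
```

On a live system you would call `query_gpu_power()` while the box sits idle for a few minutes, since the cards take a while to drop into their lowest power state.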
So, in terms of the impact of CPU performance on the inferencing speed: it does make a difference, because "typing out" the response(s) uses the CPU, while getting the answer comes from the GPU.
So, the 3090 can be really fast at the inference, but when it comes to open-webui "typing out" the response, that's entirely a CPU-bound task.
Thus, really fast GPUs paired with a really slow CPU still results in a modest/relatively low tokens/second being processed with the CPU being the bottleneck.
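If you want to put a number on that token rate rather than eyeballing it, ollama's REST API returns `eval_count` and `eval_duration` (in nanoseconds) in its non-streaming `/api/generate` response. A minimal sketch, assuming a local ollama server on the default port (the `measure` helper and prompt are my own illustration):

```python
import json
import urllib.request

def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Ollama reports eval_duration in nanoseconds."""
    return eval_count / (eval_duration_ns / 1e9)

def measure(prompt: str, model: str = "codestral:22b") -> float:
    # Assumes ollama is serving on its default port 11434.
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({"model": model, "prompt": prompt,
                         "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return tokens_per_second(body["eval_count"], body["eval_duration"])

# e.g. 512 tokens generated in 12.8 s of eval time:
print(tokens_per_second(512, 12_800_000_000))  # 40.0
```

Comparing that number across CPUs (same model, same prompt) would tell you directly how much the CPU is actually holding the 3090s back.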
Would an AMD CPU be better for this task/use case, or would Intel still fare better here?
(I haven't done much testing running the AI workload on an AMD-based system.)
(I forget if I mentioned this above, but the reason why I am running two 3090s is because I need the combined total of 48 GB of VRAM to fit the TheBloke/Mixtral:34b model (IIRC), as it takes something like 34-38 GB of VRAM, so a single 3090 with only 24 GB of VRAM isn't enough (I was getting out-of-memory errors). And again, rather than buying a different GPU with more VRAM, I was just re-using what I already had, and I was able to split the model up between the two 3090s to leverage the distributed VRAM. The cards drop down to PCIe 3.0 x8 rather than being able to run at PCIe 4.0 x16 each, but I don't have a Threadripper system, which would give the cards the PCIe lanes that they're due.)
I might be wrong, but I don't think you can even find a modern low-cost general-purpose CPU that can drive two 3090s where running the web frontend would be the bottleneck. I wouldn't overthink the CPU beyond the required PCI-e lanes.
(I mean, it could've been possible that HWiNFO is not reading the power consumption correctly.)
So… my testing on this is a little bit limited, because I don't really have very many systems that can take multiple PCIe x16 devices and still have enough power for everything else.
(i.e. with my HP Z420 workstations, I forget if they can even physically fit two 3090s, and even if they could, there aren't enough PCIe 8-pin power connectors to power said two 3090s. My only other system that could do that has a 6500T in it, and it also doesn't have enough RAM to run some of the larger AI models (I think that it tops out at 16 GB).)
But in testing some of the smaller models, I noticed that the GPUs barely break a sweat, while the CPU is frantically trying to "type out" the responses, which results in a lower token rate even though the GPUs barely even budge.
This is why I think that my CPU might be the bottleneck as a LOT of other people (tech YouTubers, Wendell, etc.) are using VASTLY more powerful CPUs than I am; hence the question.
Like, it would be one thing if it were fielding questions from multiple users all day, every day; then sure, idle power consumption really isn't much of an issue. But if it's just me that's really using it, and I'm at work for 10 hours a day and then on parenting duty after that, it means that the system would spend most of its time idling.
So, HWiNFO reports that they idle at somewhere between 6-8 W each, which, if it is anywhere remotely close to accurate, would peg the GPUs at 12-16 W total out of the 70 W that the system is consuming at idle.
Therefore, that would put the rest of the system/platform at somewhere closer to 54-58 W at idle, which means that if I can get the rest of the platform down to the sub-10 W range, I'd be saving between 44-48 W, which doesn't sound like a lot, but it's also not nothing either.
If there is a low enough cost system/platform, it could be worth it, as that would mean I'd be able to run the system 24/7, vs. right now, where I only power it on when I am actively looking to use the locally hosted AI system.
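The arithmetic above, spelled out (using the software-reported 6-8 W per GPU, which may be optimistic):

```python
def rest_of_platform_w(total_idle_w: float, per_gpu_w: float, n_gpus: int = 2) -> float:
    """Idle draw left over for CPU/board/RAM/PSU after subtracting the GPUs."""
    return total_idle_w - per_gpu_w * n_gpus

print(rest_of_platform_w(70, 8))  # 54 W (GPUs at the high end of 6-8 W)
print(rest_of_platform_w(70, 6))  # 58 W (GPUs at the low end)

def annual_kwh(watts: float, hours_per_day: float = 24, days: int = 365) -> float:
    """Energy used per year by a constant draw, in kWh."""
    return watts * hours_per_day * days / 1000

# Shaving ~45 W of always-on idle draw:
print(annual_kwh(45))  # 394.2 kWh/year
```

At a typical residential rate that ~394 kWh/year is what the always-on convenience costs, which gives a rough ceiling on how much a replacement platform is worth paying for.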
One important thing for you to note is the difference between the power consumption reported by your software and what your PSU is actually pulling from the wall. It'd be interesting if you could get one of those wall watt-meters to see how much your system is actually drawing at idle.