Viability of self-hosted CPU Inference in 2023

With Mixtral (aka Mistral's 8x7B MoE) handily beating GPT-3.5, the enthusiast urge to run ML models on your own hardware has never been stronger (at least for me).

However, my i7-12700H + 3070 Ti laptop GPU won't cut it, VRAM-wise. And boy, is GPU VRAM expensive.
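To put rough numbers on "won't cut it": Mixtral is on the order of 47B total parameters, so even aggressively quantized the weights alone dwarf an 8 GB laptop GPU. A quick back-of-the-envelope sketch (the parameter count and bit-widths are approximations on my part):

```python
# Rough weight-memory footprint for Mixtral 8x7B (assuming ~47B total params).
# KV cache and runtime overhead are ignored, so real usage is higher still.
total_params = 47e9

for label, bits_per_weight in [("FP16", 16), ("INT8", 8), ("4-bit", 4)]:
    gib = total_params * bits_per_weight / 8 / 2**30
    print(f"{label}: ~{gib:.0f} GiB of weights")

# FP16 ~88 GiB, INT8 ~44 GiB, 4-bit ~22 GiB -- all far beyond 8 GB of laptop VRAM.
```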

Which got me thinking: how well would a CPU-only inference server with tons of memory really work out in the short to medium term?

The Neural Magic folks seem to like the AMD Sienna (especially when paired with their software): "Optimal CPU AI Inference with AMD EPYC™ 8004 Series Processors and Neural Magic DeepSparse" (Neural Magic).

AMD Sienna + a terabyte of memory could potentially handle all the local inference we'd need (throw in a 16+ GB VRAM GPU for some offloading wherever possible).
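For the offloading part, something like llama.cpp already supports splitting a model between CPU and GPU. A minimal sketch via the llama-cpp-python bindings (the model filename and layer count below are placeholders I picked for illustration, not a tested config):

```python
from llama_cpp import Llama

# Keep most of the MoE weights in system RAM; push what fits onto a 16 GB GPU.
llm = Llama(
    model_path="mixtral-8x7b-instruct.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=8,    # how many layers to offload; tune to available VRAM
    n_ctx=4096,        # context window
    n_threads=16,      # CPU threads for the layers that stay on the host
)

out = llm("Explain MoE routing in one paragraph.", max_tokens=128)
print(out["choices"][0]["text"])
```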

The ability to upgrade is also very interesting. Sienna’s Zen 5 counterpart should be a decent bump in performance yet again.

What are people's thoughts on this?

CPUs and GPUs are too inefficient for these kinds of loads; NPUs are the clear way forward. I don't think it's feasible, even in the near future, to work with complex LLMs without proper hardware.

Because their sales pitch emphasizes reduced cost and lower power consumption, not because Siena (not Sienna) is especially powerful. It isn't.

Siena (not Sienna) is just a cut-down Genoa for lower power consumption. And for a TB of memory you need 2DPC, which drops you to 3600 MT/s and hurts memory bandwidth.
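To illustrate why that matters: token generation is largely memory-bandwidth bound, so DIMM speed caps your tokens/s fairly directly. A crude estimate (the channel count, transfer rates, and Mixtral's ~13B active parameters per token are my assumptions, and these are theoretical ceilings, not measured numbers):

```python
# Crude upper bound: tokens/s ~= memory bandwidth / bytes streamed per token.
channels = 6                 # Siena (EPYC 8004) memory channels
bytes_per_transfer = 8       # 64-bit DDR5 channel

active_params = 13e9         # ~params touched per token for Mixtral (2 of 8 experts)
bytes_per_param = 0.56       # ~4.5 bits/weight quantization
bytes_per_token = active_params * bytes_per_param

for label, mts in [("1DPC @ 4800 MT/s", 4800e6), ("2DPC @ 3600 MT/s", 3600e6)]:
    bandwidth = channels * mts * bytes_per_transfer   # theoretical peak, bytes/s
    print(f"{label}: ~{bandwidth / 1e9:.0f} GB/s peak, "
          f"~{bandwidth / bytes_per_token:.0f} tok/s ceiling")
```

Real-world throughput lands well below those ceilings, but it shows the gap 2DPC opens up.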

I've seen some interesting INT8 and INT16 AI stuff on CPU… but CPUs aren't really good at this in general. ASICs, FPGAs, GPUs… that's where the real perf/watt is.

So unless CPUs get the fancy inference accelerators everyone is so eager to put on their products nowadays, I doubt general-purpose CPUs will be competitive. There is a good reason why we use GPUs.

And you could get a great GPU like an A6000 instead of buying a whole server. Siena certainly isn't cheap; it's still an EPYC.