With Mixtral (a.k.a. Mistral's 8x7B MoE) beating GPT-3.5 handily, the enthusiast urge to run ML models on your own hardware has never been more exciting (at least for me).
However, my i7-12700H + 3070 Ti laptop GPU won't cut it, VRAM-wise. And boy, is GPU VRAM expensive.
Which got me thinking: how well would a CPU-only inference server with tons of memory really work out in the short to medium term?
AMD Sienna + a terabyte of memory could potentially handle all the local inference we'd need. (Throw in a 16+ GB VRAM GPU for some offloading wherever possible.)
The ability to upgrade is also very interesting. Sienna’s Zen 5 counterpart should be a decent bump in performance yet again.
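For the offloading idea mentioned above, here's a minimal sketch using llama-cpp-python, assuming a GGUF quant of Mixtral on disk and a CUDA-enabled build; the file name and layer count are placeholders to tune against whatever VRAM is available:

```python
# Minimal sketch: split Mixtral between CPU RAM and a ~16 GB GPU
# using llama-cpp-python. Model path and n_gpu_layers are placeholders;
# raise n_gpu_layers until VRAM is full and leave the rest on the CPU.
from llama_cpp import Llama

llm = Llama(
    model_path="mixtral-8x7b-instruct.Q4_K_M.gguf",  # hypothetical local GGUF file
    n_gpu_layers=12,   # transformer layers pushed onto the GPU
    n_ctx=4096,        # context window
    n_threads=12,      # CPU threads for the layers that stay in RAM
)

out = llm("Q: What is the capital of France? A:", max_tokens=32)
print(out["choices"][0]["text"])
```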
CPUs and GPUs are too inefficient for these kinds of loads; NPUs are the clear way forward. I don't think it's feasible, even in the near future, to work with complex LLMs without proper hardware.
Because their sales pitch emphasizes reduced cost and lower power consumption, not because Siena (not Sienna) is especially powerful. It isn't.
Siena is just a cut-down Genoa built for lower power consumption. And to get a TB of memory you need 2DPC, which drops you to 3600 MT/s, and that's bad.
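To put rough numbers on why that bandwidth matters, here's a back-of-envelope sketch. The figures are assumptions, not measurements: Siena's 6 DDR5 channels running 2DPC at 3600 MT/s, and roughly 13B parameters active per token for Mixtral at int8. Real throughput will land below this ceiling.

```python
# Rough bandwidth-bound ceiling for CPU token generation.
# Assumptions (not measured): 6 DDR5 channels at 3600 MT/s (2DPC),
# ~13B active parameters per token for Mixtral, 1 byte/param at int8.
channels = 6
mt_per_s = 3600e6          # transfers per second per channel
bytes_per_transfer = 8     # 64-bit DDR5 channel
bandwidth = channels * mt_per_s * bytes_per_transfer   # bytes/s

active_params = 13e9       # Mixtral activates ~2 of 8 experts per token
bytes_per_param = 1        # int8 quantization

tokens_per_s = bandwidth / (active_params * bytes_per_param)
print(f"{bandwidth/1e9:.0f} GB/s -> ~{tokens_per_s:.0f} tokens/s ceiling")
```

That works out to roughly 170 GB/s and a ceiling in the low teens of tokens per second, before any compute or software overhead.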
I've seen some interesting int8 and int16 AI work on CPUs… but CPUs aren't really good at this in general. ASICs, FPGAs, GPUs… that's where the real perf/watt is.
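As an illustration of the int8-on-CPU idea, a minimal sketch with PyTorch's dynamic quantization; the toy model is made up and a real LLM needs far more than this, but it shows the mechanism of storing Linear weights as int8 and running the matmuls through quantized CPU kernels:

```python
# Minimal sketch of int8 inference on CPU via PyTorch dynamic quantization.
# The tiny model is a stand-in; Linear weights are stored as int8 and the
# matmuls run through quantized CPU kernels.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(4096, 11008),
    nn.ReLU(),
    nn.Linear(11008, 4096),
).eval()

qmodel = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 4096)
with torch.inference_mode():
    y = qmodel(x)
print(y.shape)  # torch.Size([1, 4096])
```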
So unless CPUs get the fancy inference accelerators everyone is so eager to put on their products nowadays… I doubt general-purpose CPUs will be competitive. There's a good reason we use GPUs.
And you can get a great GPU like an A6000 instead of buying a whole server. Siena certainly isn't cheap; it's still an EPYC.