Hi all,
I recently purchased an eBay-special Xeon Max 9480 with a Supermicro MBD-X13SEI-F-(O/B). I bought this chip to replace my EPYC 7713 and ASRock Rack Rome server board so that I could experiment with llama.cpp (with AMX), bitnet.cpp and lavapipe at a higher thread count.
It feels quick in Windows, but when I run LM Studio (a fancy GUI for llama.cpp) across all 112 threads I get lower results than with 32 threads. I reckon the issue is that I am getting severely throttled: I was seeing 500MHz reported under a power virus (fully loaded llama.cpp). HWiNFO is reporting 260W CPU draw under load. Intel markets these chips at 350W, and the board has two 8-pin CPU headers for extra power. How can I let the CPU stay clocked higher or set manual multipliers?
I fiddled with the power settings in the BIOS, but they're a bit cryptic, and none of the obvious Windows power plan tricks worked. ThrottleStop didn't help either, and XTU refuses to even install because of its platform check. Any ideas?
How are you cooling it and what temperatures are you seeing? These CPUs will hit thermal throttle roughly 20C sooner than “normal” processors so they need significantly more cooling than their wattage would suggest.
Also, it might be hard to tune it to use the memory optimally in Windows. I think Linux might be needed to pin processes to the nearest HBM stack.
I’m eventually going to write a guide on how to do this once I figure out how to cool mine.
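In the meantime, here is a rough sketch of what pinning to the nearest HBM stack could look like on Linux. It assumes the CPU's NUMA nodes are exposed under `/sys/devices/system/node/` (standard on Linux) and that node 0 is the one you want, which is an arbitrary pick for illustration. Whether the HBM shows up as its own nodes depends on the BIOS memory mode, so treat this as a starting point and not a recipe. The helper names `parse_cpulist` and `pin_to_node` are mine.

```python
import os
from pathlib import Path


def parse_cpulist(text):
    """Expand a Linux cpulist string such as '0-3,8,10-11' into sorted CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # tolerate empty input or a trailing comma
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def pin_to_node(node):
    """Restrict the current process to the CPUs of one NUMA node (Linux only)."""
    # Assumption: the kernel exposes the node's CPUs in this sysfs file.
    cpulist = Path(f"/sys/devices/system/node/node{node}/cpulist").read_text()
    cpus = parse_cpulist(cpulist)
    os.sched_setaffinity(0, cpus)  # pid 0 means "the calling process"
    return cpus
```

Calling `pin_to_node(0)` before launching the workload from the same process only restricts which CPUs it runs on. It does not bind memory allocations; for that you would need something like `numactl --cpunodebind=0 --membind=0` around the command.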
Those temps should be good so I doubt they are the problem. Have you tried disabling C-states in BIOS (or at least some of the higher C-states)?
For reference on power usage: running two of them on an ASUS Z13PE-D16, I’m seeing 1100-1200 watts at the wall under load. This is with one of the ASUS OC options on, but all cores will stay at 2.6GHz under load.
What I know about Xeons is that they need a lot more cooling than most other processors, so you may need a more powerful cooler. Also, make sure you’ve disabled C-states and enabled turbo mode if available.
I briefly hit 320W @ 2GHz but it was short-lived; then it clocks itself down to around 270W per HWmon.
I'll try to get PassMark tonight. I didn't disable hyperthreading; do you think disabling it would keep the clocks high? I reckon HT will help with llama.cpp performance.
The RAM isn't straightforward. I think it's running the HBM2 in cache mode; the BIOS settings weren't very clear. I'm also on Windows, so I'm not sure if Windows can even handle it appropriately. The only good sign is that the AIDA64 trial said I was doing 460GB/s on 2 DDR5 DIMMs, so I assume that's the HBM caching.
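A quick sanity check on that bandwidth number, as a back-of-the-envelope sketch. It assumes the two DIMMs sit on separate channels and run at DDR5-4800 with a 64-bit (8-byte) data bus each. The DIMM speed is my assumption and isn't confirmed above, so substitute the real one.

```python
def ddr_peak_gbs(channels, mega_transfers_per_s, bus_bytes=8):
    """Theoretical peak bandwidth in GB/s: channels x MT/s x bytes per transfer."""
    return channels * mega_transfers_per_s * bus_bytes / 1000


# Assumption: two populated channels at DDR5-4800 (not stated in the thread).
ddr5_peak = ddr_peak_gbs(channels=2, mega_transfers_per_s=4800)
measured = 460  # GB/s, the AIDA64 figure quoted above

print(f"DDR5 theoretical peak: {ddr5_peak:.1f} GB/s")
print(f"Measured is {measured / ddr5_peak:.1f}x the DDR5 peak")
```

Under those assumptions the two DIMMs top out at about 77 GB/s, so a 460GB/s reading is roughly six times what DDR5 alone could deliver. That is consistent with the reads being served from HBM.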
llama.cpp uses AVX2 extensions that have their own power limits. I wonder if this is what is tripping the CPU up? I wonder if AVX-512 and AMX have the same behavior?
Here the 32T run has clocks over 3GHz consistently:
`(base) PS C:\Users\Cam\Desktop\repos\llama.cpp\build\bin\Release> llama-cli -m C:\Users\Cam.cache\lm-studio\models\NousResearch\Hermes-3-Llama-3.1-8B-GGUF\Hermes-3-Llama-3.1-8B.Q4_K_M.gguf -p "I believe the meaning of life is" -n 128 -t 32
llama_perf_sampler_print: sampling time = 24.83 ms / 136 runs ( 0.18 ms per token, 5476.58 tokens per second)
llama_perf_context_print: load time = 4542.30 ms
llama_perf_context_print: prompt eval time = 102.92 ms / 8 tokens ( 12.87 ms per token, 77.73 tokens per second)
llama_perf_context_print: eval time = 8018.75 ms / 127 runs ( 63.14 ms per token, 15.84 tokens per second)
llama_perf_context_print: total time = 8183.02 ms / 135 tokens`
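To compare the 32T and 112T runs without eyeballing the logs, something like the following could pull the throughput numbers out of the `llama_perf_context_print` lines. It is a minimal sketch that assumes the output keeps the exact format shown above; the names `PERF_RE` and `parse_perf` are mine and not part of llama.cpp.

```python
import re

# Matches the "prompt eval time" and "eval time" lines in the format shown above.
PERF_RE = re.compile(
    r"llama_perf_context_print:\s+(?P<stage>prompt eval|eval) time\s*=\s*"
    r"(?P<ms>[\d.]+) ms\s*/\s*(?P<count>\d+) (?:tokens|runs)"
)


def parse_perf(log_text):
    """Return {stage: tokens_per_second}, recomputed from the ms and count fields."""
    rates = {}
    for m in PERF_RE.finditer(log_text):
        seconds = float(m.group("ms")) / 1000
        rates[m.group("stage")] = int(m.group("count")) / seconds
    return rates


# The two relevant lines from the 32-thread run above.
SAMPLE = """\
llama_perf_context_print: prompt eval time = 102.92 ms / 8 tokens ( 12.87 ms per token, 77.73 tokens per second)
llama_perf_context_print: eval time = 8018.75 ms / 127 runs ( 63.14 ms per token, 15.84 tokens per second)
"""

for stage, rate in parse_perf(SAMPLE).items():
    print(f"{stage}: {rate:.2f} tokens/s")
```

Running the same parser over the output of each `-t` setting gives one generation rate per thread count, which makes it easier to spot where adding threads starts to hurt.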