I just got my hands on a new Threadripper Pro 5955WX and I was rather surprised to see performance in one area fall off a cliff compared to my old Threadripper 1920X:
I tested inference for the Bloom language model, both on GPU and CPU. For Bloom 7b1 on CPU, the 1920X finished in 1m 36s 440ms and the 5955WX in 4m 13s 210ms. For Bloom 3b the 1920X did it in 28s and the 5955WX in 1m 6s 630ms.
Even when running inference on the GPU I see a slowdown: 2s 400ms on the 1920X system vs 5s 40ms on the 5955WX.
What makes it even more peculiar is that for minGPT the tables turn: here the 5955WX comes out ahead, taking 48m 55s 590ms to train the playChars notebook at batch size 348 vs 54m 5s 970ms on the 1920X (using the GPU). Inference also improved, to 6s 70ms (5955WX) from 9s 680ms (1920X).
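To put the timings above on a common scale, here is a minimal Python sketch that converts them to seconds and prints the 5955WX/1920X ratio for each test. The helper names (`to_seconds`, `slowdown`) and the row labels are just illustrative; the numbers are the ones quoted in this post.

```python
import re

# Seconds per unit used in the timings quoted above.
UNIT_SECONDS = {"m": 60.0, "s": 1.0, "ms": 0.001}


def to_seconds(text: str) -> float:
    """Convert a duration such as '4m 13s 210ms' into seconds."""
    total = 0.0
    # 'ms' must be tried before 'm' and 's' so that '210ms' is not read as minutes.
    for value, unit in re.findall(r"(\d+)\s*(ms|m|s)", text):
        total += int(value) * UNIT_SECONDS[unit]
    return total


def slowdown(old: str, new: str) -> float:
    """Return new/old; values above 1 mean the new system is slower."""
    return to_seconds(new) / to_seconds(old)


# (1920X, 5955WX) timings as reported in the post.
TIMINGS = {
    "Bloom 7b1, CPU inference": ("1m 36s 440ms", "4m 13s 210ms"),
    "Bloom 3b, CPU inference": ("28s", "1m 6s 630ms"),
    "Bloom, GPU inference": ("2s 400ms", "5s 40ms"),
    "minGPT training, GPU": ("54m 5s 970ms", "48m 55s 590ms"),
    "minGPT inference": ("9s 680ms", "6s 70ms"),
}

if __name__ == "__main__":
    for name, (old, new) in TIMINGS.items():
        print(f"{name}: {slowdown(old, new):.2f}x")
```

With these inputs the Bloom runs come out somewhere between roughly 2x and 2.6x slower on the 5955WX, while both minGPT numbers land below 1x.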
Potential issue:
Right now I am only running quad-channel memory (with DOCP activated), so maybe populating all eight channels would speed this up, though I don’t have a new memory kit yet.
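For a rough sense of what is at stake, the theoretical peak bandwidth of DDR4 can be estimated as channels × transfer rate × 8 bytes (each channel is 64 bits wide). The sketch below is back-of-the-envelope arithmetic only; real-world throughput is lower and depends on the platform, and the function name is made up for illustration.

```python
def peak_bandwidth_gbs(channels: int, mega_transfers: int, bus_bytes: int = 8) -> float:
    """Theoretical peak memory bandwidth in GB/s.

    channels       -- number of populated memory channels
    mega_transfers -- memory speed in MT/s (e.g. 3200 for DDR4-3200)
    bus_bytes      -- bytes per transfer per channel (8 for a 64-bit DDR4 channel)
    """
    return channels * mega_transfers * 1e6 * bus_bytes / 1e9


if __name__ == "__main__":
    for channels in (4, 8):
        print(f"{channels} channels @ 3200 MT/s: {peak_bandwidth_gbs(channels, 3200):.1f} GB/s")
```

On paper that is about 102.4 GB/s for four channels and 204.8 GB/s for eight at 3200 MT/s, so a half-populated board leaves a lot of theoretical headroom unused.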
For the sake of testing, I used the same components on both systems:
64 GB Kingston 3200 UDIMMS, ElementaryOS 6.1 (based on Ubuntu 20.04), MSI 3090 Suprim X 24G.
Those components were moved across from one system to another.
What I noticed was that Bloom inference also didn’t seem to fully utilise all cores on the 5955WX.
For comparison I also tried Fedora Workstation 37 on the 5955WX, but the results were within the margin of error.
I am a bit surprised at this and will test inference on a few more open source models, as well as some internal ones, to see if this is Bloom-specific or something more general.
On the 11th page, you can find the DIMM configuration recommended by AMD. Of course, you will need to figure out the model of your motherboard and the order of its DIMM slots.
Yes, that did it. That and a bit of convincing for the memory to load the DOCP profile. Sometimes it would still only clock at 2400 MT/s, even though 3200 was set. It took multiple trips to the UEFI until it would finally apply properly, but with it all working I have finally bridged the gap and now hit higher numbers in every test I ran.
Luckily this memory kit is only temporary (i.e. left over from my old system and the fastest compatible one I had on hand) and will be replaced with some LRDIMMs in a couple of months.
Interesting. Asus labels them from top to bottom as H1, G1, F1, E1, A1, B1, C1 and D1. I will see what happens if I use the same slots as on ASRock, which with the Asus labels would come out to E1, F1, C1 and D1. So far the best results I got were from H1, G1, C1, D1, basically what was recommended for Epyc.
After some further testing, it seems there is no noticeable difference between most memory configurations, EXCEPT the one I initially used. [H1, G1, A1, B1], [E1, F1, C1, D1] and [A1, B1, C1, D1] all seem to perform exactly the same, at least for Bloom inference (and that’s the only one where I noticed an issue).
What does make a massive difference, however, is the memory speed. I noticed that selecting DOCP does not always actually apply the profile. The most reliable way I found to make it stick is to manually set 3600, wait for the boot to fail, and then manually go back to 3200 (the default in my case). Otherwise the memory just stays at 2400 even with DOCP on and the 3200 profile selected. This might be a UEFI bug, not sure.
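Since the profile can silently fail to apply, it may be worth checking the configured speed from within Linux after each UEFI change. One way is to look at the output of `sudo dmidecode --type 17` and compare the rated `Speed` with the configured speed for each DIMM. The Python sketch below only parses such output. The sample text and slot names in it are made up for illustration, and the field is assumed to be called `Configured Memory Speed` (or `Configured Clock Speed` on older dmidecode versions), so the pattern may need adjusting for your system.

```python
import re

# Illustrative sample only; real dmidecode output has more fields per device,
# and the locator names depend on the board.
SAMPLE = """\
Memory Device
    Locator: DIMM_C1
    Speed: 3200 MT/s
    Configured Memory Speed: 2400 MT/s
Memory Device
    Locator: DIMM_D1
    Speed: 3200 MT/s
    Configured Memory Speed: 3200 MT/s
Memory Device
    Locator: DIMM_E1
    Speed: Unknown
    Configured Memory Speed: Unknown
"""


def find_downclocked(text: str) -> list:
    """Return (locator, rated, configured) for DIMMs running below their rated speed."""
    results = []
    # Each populated or empty slot starts with a 'Memory Device' header.
    for block in text.split("Memory Device")[1:]:
        locator = re.search(r"^\s*Locator:\s*(.+)$", block, re.M)
        rated = re.search(r"^\s*Speed:\s*(\d+)", block, re.M)
        configured = re.search(
            r"^\s*Configured (?:Memory|Clock) Speed:\s*(\d+)", block, re.M
        )
        # Empty slots report 'Unknown' and are skipped here.
        if not (locator and rated and configured):
            continue
        rated_speed, configured_speed = int(rated.group(1)), int(configured.group(1))
        if configured_speed < rated_speed:
            results.append((locator.group(1).strip(), rated_speed, configured_speed))
    return results


if __name__ == "__main__":
    for slot, rated_speed, configured_speed in find_downclocked(SAMPLE):
        print(f"{slot}: rated {rated_speed}, running at {configured_speed}")
```

To use it on a real system, replace `SAMPLE` with the captured dmidecode output, for example by piping the command's output into the script and reading `sys.stdin`.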
The Asus Pro WS WRX80E-Sage Wifi II. Sadly the Asus manual contains no references besides “visit the Asus website”, which only has a PDF of the same handbook.
If the RAM config is temporary and you are getting a new set, get an 8x8GB, 8x16GB, or 8x32GB set from the QVL. That fills all slots and no worries about what goes where.
You're right, it's not listed…
So I would assume they went with a standard layout for the RAM,
in that 4 sticks populate the 2 outermost slots on each side:
G1, H1 and C1, D1.
This is where it might get worse, as I am planning on 8x 128GB since I need at least 650GB for some workloads I want to run… so I wonder how that will go, since the V1 of the same board has issues according to another thread on here. I am hoping they were fixed in BIOS 0204, which ships with the V2.
I run 8x16GB on mine with zero issues, but believe it is a v1 board (Not 100% sure of that tho.) Fully populating the RAM slots should make your processor work at optimum levels. Have had zero RAM issues so far on mine. It far outperforms my old TR 1950x across the board.
Is the RAM you are using currently on the mobo QVL? If not, replacing it may fix your issues.