16 Zen 3 cores SLOWER than 12 Zen 1 Cores?

I just got my hands on a new Threadripper Pro 5955WX and I was rather surprised to see performance in one area fall off a cliff compared to my old Threadripper 1920X:

I tested inference for the Bloom language model, both on GPU and CPU. For Bloom 7b1 on CPU, the 1920X finished in 1m 36s 440ms and the 5955WX in 4m 13s 210ms. For Bloom 3b the 1920X did it in 28s and the 5955WX in 1m 6s 630ms.
Even when running inference on the GPU I see a slowdown: 2s 400ms on the 1920X system vs 5s 40ms on the 5955WX.
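
For anyone who wants to reproduce this kind of comparison, here is a minimal timing sketch using only the Python standard library. It is not the script used for the numbers above; `run_inference` is a hypothetical placeholder that you would swap for your real model call, and taking the best of several runs is just one reasonable choice.

```python
import time


def format_duration(seconds):
    """Format seconds as e.g. '1m 36s 440ms', like the timings quoted above."""
    total_ms = round(seconds * 1000)
    minutes, rem_ms = divmod(total_ms, 60_000)
    secs, ms = divmod(rem_ms, 1000)
    prefix = f"{minutes}m " if minutes else ""
    return f"{prefix}{secs}s {ms}ms"


def time_call(fn, repeats=3):
    """Run fn() `repeats` times and return the fastest wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def run_inference():
    # Hypothetical placeholder workload: replace with the real model call,
    # e.g. one generation on a fixed prompt with a fixed token count.
    sum(i * i for i in range(100_000))


if __name__ == "__main__":
    print(format_duration(time_call(run_inference)))
```

Keeping the prompt, token count and thread settings identical on both machines matters more than the harness itself.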

What makes it even more peculiar: for minGPT the tables turn. Here the 5955WX comes out ahead, taking 48m 55s 590ms to train the playChars notebook at batch size 348 vs 54m 5s 970ms on the 1920X (using the GPU). Inference also improved, to 6s 70ms (5955WX) from 9s 680ms (1920X).

Potential issue:
Right now I am only running Quad Channel memory (but DOCP is activated), so maybe populating all eight channels would speed this up, though I don’t have a new memory kit yet.
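
As a rough back-of-the-envelope check on what the extra channels could offer, theoretical peak DRAM bandwidth is channels × transfer rate × 8 bytes (a DDR4 channel is 64 bits wide). The sketch below only computes that upper bound; real throughput will be lower, and the function name is made up for illustration.

```python
def peak_bandwidth_gb_s(channels, mega_transfers_per_s, bus_width_bytes=8):
    """Theoretical peak DRAM bandwidth in GB/s (decimal gigabytes).

    channels * MT/s * bytes-per-transfer gives MB/s; divide by 1000 for GB/s.
    """
    return channels * mega_transfers_per_s * bus_width_bytes / 1000


if __name__ == "__main__":
    for channels in (4, 8):
        bandwidth = peak_bandwidth_gb_s(channels, 3200)
        print(f"{channels} channels @ DDR4-3200: {bandwidth:.1f} GB/s")
```

On paper that is 102.4 GB/s for four channels of DDR4-3200 and 204.8 GB/s for eight, so a memory-bound workload could in principle gain a lot from full population.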

For the sake of testing, I used the same components on both systems:
64 GB Kingston 3200 UDIMMS, ElementaryOS 6.1 (based on Ubuntu 20.04), MSI 3090 Suprim X 24G.
Those components were moved across from one system to another.
What I noticed was that bloom inference also didn’t seem to fully utilise all cores on the 5955WX.
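
To put a number on "not fully utilised", one option on Linux is to sample `/proc/stat` twice and compare the per-core counters. This is a minimal sketch under that assumption; the function names are my own, and tools like `htop` or `mpstat` show the same information.

```python
import time


def parse_proc_stat(text):
    """Return {cpu_name: (busy_ticks, total_ticks)} for each per-core 'cpuN' line."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep only per-core lines (cpu0, cpu1, ...); skip the aggregate 'cpu' line.
        if not parts or not parts[0].startswith("cpu") or not parts[0][3:].isdigit():
            continue
        # Fields: user nice system idle iowait irq softirq steal.
        # Guest time is already counted in user/nice, so it is ignored here.
        fields = [int(value) for value in parts[1:9]]
        idle = fields[3] + (fields[4] if len(fields) > 4 else 0)
        total = sum(fields)
        counters[parts[0]] = (total - idle, total)
    return counters


def core_utilisation(before, after):
    """Return {cpu_name: busy fraction between two parse_proc_stat() samples}."""
    usage = {}
    for cpu, (busy_after, total_after) in after.items():
        busy_before, total_before = before.get(cpu, (0, 0))
        delta = total_after - total_before
        usage[cpu] = (busy_after - busy_before) / delta if delta > 0 else 0.0
    return usage


def sample_utilisation(interval=1.0):
    """Read /proc/stat twice, `interval` seconds apart (Linux only)."""
    with open("/proc/stat") as stat_file:
        before = parse_proc_stat(stat_file.read())
    time.sleep(interval)
    with open("/proc/stat") as stat_file:
        after = parse_proc_stat(stat_file.read())
    return core_utilisation(before, after)
```

Calling `sample_utilisation()` in a second terminal while inference is running shows which cores sit idle.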

For comparison I also tried Fedora Workstation 37 on the 5955WX, but the results were within the margin of error.
I am a bit surprised at this and will test inference on a few more open source models, as well as some internal ones, to see if this is Bloom-specific or something more general.

You might need to populate specific slots so the memory is correctly associated with the chiplets.

I am still trying to figure out which slots those are, sadly the motherboard handbook has no indicator, unlike my old X399 one.

I now moved the memory to A1, B1, C1 and D1, in hopes they’d be in order, but this only resulted in a marginal difference.

I’ve heard of some problems with those on TR Pro, maybe that’s the culprit.

8-channel configuration is always strongly recommended, but you may refer to this document when you are using 4-channel.

https://www.amd.com/system/files/TechDocs/56873_0.80_PUB.pdf

On page 11 you can find the DIMM configuration recommended by AMD. Of course, you will need to work out your motherboard model and the order of its DIMM slots.

Yes, that did it. That and a bit of convincing for the memory to load the DOCP profile. Sometimes it would still only clock at 2400 MT/s, even though 3200 was set. It took multiple trips to the UEFI until it would finally apply properly, but with it all working I have finally bridged the gap and now hit higher numbers in every test I ran.

Luckily this memory kit is only temporary (i.e. left over from my old system and the fastest compatible one I had on hand) and will be replaced with some LR DIMMs in a couple months.

8 DIMMs: fully populated
4 DIMMs: use A1, B1, G1, H1
2 DIMMs: use C1, D1
1 DIMM: use C1

chap 2.3
page 25 of the user manual

user manual english…

You may also see a difference using dual-rank or single-rank DIMMs,
if you have the option of testing, of course.

Interesting. Asus labels them from top to bottom as H1, G1, F1, E1, A1, B1, C1 and D1. I will see what happens if I use the same slots as on ASRock, but with the Asus labels that would come out to E1, F1, C1 and D1. So far the best results I got were from H1, G1, C1, D1, basically what was recommended for Epyc.

After some further testing, it seems there is no noticeable difference between most memory configurations EXCEPT the one I initially used: [H1, G1, A1, B1], [E1, F1, C1, D1] and [A1, B1, C1, D1] all seem to perform exactly the same, at least for Bloom inference (and that’s the only one where I noticed an issue).

What does make a massive difference, however, is the memory speed. I noticed that selecting DOCP does not always actually apply the profile. The most reliable way I found to apply it is to manually set 3600, wait for the boot to fail and then manually go back to 3200 (the default in my case); otherwise the memory just stays at 2400 even if DOCP is on and the 3200 profile is selected. This might be a UEFI bug, not sure.
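
Since the profile can silently fail to apply, it may be worth checking the configured speed from the OS after each boot rather than trusting the UEFI screen. Below is a minimal sketch that parses the output of `dmidecode -t memory` (needs root). It assumes the usual `Locator`, `Speed` and `Configured Memory Speed` fields (older dmidecode versions print `Configured Clock Speed` and MHz instead), and what the `Speed` field reports can vary between boards, so treat the comparison as a hint only. The function names are my own.

```python
import re
import subprocess

# Matches e.g. "3200 MT/s" (newer dmidecode) or "2400 MHz" (older versions).
SPEED_RE = re.compile(r"(\d+)\s*(?:MT/s|MHz)")


def _speed(value):
    """Extract the numeric speed from a dmidecode field, or None if unknown."""
    match = SPEED_RE.search(value)
    return int(match.group(1)) if match else None


def parse_dimm_speeds(dmidecode_text):
    """Return one dict per populated DIMM: slot, rated speed, configured speed."""
    dimms = []
    blocks = re.split(r"^Memory Device[ \t]*$", dmidecode_text, flags=re.MULTILINE)
    for block in blocks[1:]:
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        rated = _speed(fields.get("Speed", ""))
        if rated is None:
            continue  # empty slot: dmidecode reports "Speed: Unknown"
        configured = _speed(
            fields.get("Configured Memory Speed", fields.get("Configured Clock Speed", ""))
        )
        dimms.append({"slot": fields.get("Locator", "?"), "rated": rated, "configured": configured})
    return dimms


def downclocked(dimms):
    """Return the DIMMs whose configured speed is below their reported speed."""
    return [d for d in dimms if d["configured"] is not None and d["configured"] < d["rated"]]


def read_dmidecode():
    """Run dmidecode for memory devices; requires root privileges."""
    result = subprocess.run(
        ["dmidecode", "-t", "memory"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Running `downclocked(parse_dimm_speeds(read_dmidecode()))` as root should return an empty list once the 3200 profile has really been applied.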

? I thought you had ASRock, not Asus… my bad.
I actually thought you just missed the mem config section :smiley:

Which Asus board?

The Asus Pro WS WRX80E-Sage Wifi II. Sadly, the Asus manual contains no references besides “visit the Asus website”, which only has a PDF of the same handbook.

If the RAM config is temporary and you are getting a new set, get an 8x8GB, 8x16GB, or 8x32GB set from the QVL. That fills all slots and no worries about what goes where.

You're right, it's not listed…
So I would assume they went with a standard layout for the RAM,
in that 4 sticks populate the two outermost slots on each side:
G1, H1 and C1, D1

This is where it might get worse, as I am planning on 8x 128GB since I need at least 650GB for some workloads I want to run… so I wonder how that will go, since the V1 of the same board has issues according to another thread on here. I am hoping they were fixed in BIOS 0204, which ships with the V2.

I run 8x16GB on mine with zero issues, but believe it is a v1 board (Not 100% sure of that tho.) Fully populating the RAM slots should make your processor work at optimum levels. Have had zero RAM issues so far on mine. It far outperforms my old TR 1950x across the board.

Is the RAM you are using currently on the mobo QVL? If not, replacing it may fix your issues.