Unlocking the 2990WX For Less-Than NUMA aware Apps | Level One Techs

Yes, I agree. I just think we might be waiting a long time for Microsoft to fix the scheduler.

Since this issue is to do with NUMA, is Windows also allocating all the memory just from the RAM directly connected to that NUMA node? If so, it would also be effectively limited to two channels. It might be interesting to run the test with just two DIMMs in the sockets connected to one die, to see if the performance is the same or lower.

The Windows scheduler under NUMA is meant to try, where possible, to schedule threads on the NUMA node where their memory has been allocated.
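If anyone wants to try pinning a process to a single node by hand in the meantime, here is a minimal sketch of how an affinity bitmask gets built. It assumes each node owns a contiguous block of 16 logical processors (0-15 for the first die, and so on). That layout is my assumption, not something confirmed in this thread, so check the real topology first. The helper names are made up for illustration.

```python
def affinity_mask(cpus):
    """Return an integer bitmask with one bit set per logical CPU index."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU index must be non-negative")
        mask |= 1 << cpu
    return mask


def node_cpus(node, cpus_per_node=16):
    """Hypothetical layout: node N owns logical CPUs N*16 .. N*16+15."""
    start = node * cpus_per_node
    return range(start, start + cpus_per_node)


if __name__ == "__main__":
    # Print one 64-bit hex mask per assumed node.
    for node in range(4):
        print(f"node {node}: mask = {affinity_mask(node_cpus(node)):#018x}")
```

A hexadecimal mask of this kind is what `start /affinity` in cmd expects, so it is a cheap way to test whether keeping a benchmark on one die changes the result.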

Edit:
Think I’m repeating Ext3h

I have a 2990WX installed on an MSI MEG Creation running at a 3.8GHz overclock. So far, testing with 3DMark has shown that something is indeed lowering the performance of this CPU.

My prior machine was a 1950X installed on an MSI Gaming Carbon Pro - and the benches follow:

(2990WX and 1950X with 2080 Ti)
Night Raid:
2990WX (3.8GHz): 37,653
1950X (4GHz): 37,308

Time Spy:
2990WX (3.8GHz): 8,751
1950X (4GHz): 12,387

Based on the information that was posted here - this leads me to believe that there may be something related to the later DX11/DX12 code that is causing these problems when Windows sees more than 16 cores.

In fact, it’s not too surprising that the bench for Night Raid is very similar, as I think the bottleneck there is the GPU, not the CPU.
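To put a number on the gap, here is a quick sketch that computes the relative difference between the two CPUs from the scores above. The only inputs are the four 3DMark results already posted; the function name is just for illustration. Keep in mind the chips were at different clocks (3.8GHz vs 4GHz), so this is not a like-for-like comparison.

```python
# 3DMark scores as posted above.
scores = {
    "Night Raid": {"2990WX": 37653, "1950X": 37308},
    "Time Spy": {"2990WX": 8751, "1950X": 12387},
}


def relative_change(new, old):
    """Percentage change of `new` relative to `old`."""
    return (new - old) / old * 100.0


for bench, s in scores.items():
    delta = relative_change(s["2990WX"], s["1950X"])
    print(f"{bench}: 2990WX is {delta:+.1f}% vs 1950X")
# Night Raid: 2990WX is +0.9% vs 1950X
# Time Spy: 2990WX is -29.4% vs 1950X
```

So Night Raid is within about 1%, while Time Spy is down by nearly 30%.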

Thanks

It kinda feels to me like some kind of timeout issue. I.e. Windows has never had to deal with memory being allocated through another die - and perhaps there is something hard-coded into Windows that is indifferent to memory allocation when the NUMA node has direct access - but does hit an exception that is then later handled when the NUMA node does not have direct access.
Obviously, we’re not “breaking” something or erroring out. And since it works flawlessly on Linux - we can assume that it has to be related to the Windows codebase and how it handles something related to the way that the TR4 architecture is designed.

64GB, 2080 Ti, AMD 2990WX system on an MSI MEG Creation board overclocked via setting 6 to 3.8GHz.

Indigo Benchmark v4.0.64, Windows 64-bit build.
AMD Ryzen Threadripper 2990WX 32-Core Processor (AuthenticAMD) - Bedroom: 2.330 M samples/s (1280x720, 60.187 s, 152.141 samples/pixel)
GeForce RTX 2080 Ti (NVIDIA Corporation) - Bedroom: 8.003 M samples/s (1280x720, 60.125 s, 411.042 samples/pixel)
AMD Ryzen Threadripper 2990WX 32-Core Processor (AuthenticAMD) - Supercar: 4.027 M samples/s (1280x720, 60.122 s, 262.693 samples/pixel)
GeForce RTX 2080 Ti (NVIDIA Corporation) - Supercar: 26.957 M samples/s (1280x720, 60.124 s, 1583.122 samples/pixel)

Max temp 74C on the CPU.

I understand that there is some software that might be downloaded to help with the supposed issue. I will try to locate and will post results when I can.

Windows has, however, had to deal with different speed memory sticks, different amounts and levels of CPU cache, CPU cache misses, etc.

I don’t buy that “oh, it’s just because it is accessing memory via an additional hop.”

And yes, as you say, Linux works fine on it.

It’s definitely a Windows thing… it feels like they hardcoded something to handle multiple NUMA nodes (meaning 2) - with the thought process of - there will only ever be 2 NUMA nodes per CPU package.
And now that we have 4, when it attempts to hit 3 and 4, it throws an exception, which then filters up the stack until it gets handled. Of course, all of that apparently takes time - which is what is slowing us down.

What I find interesting is that an older benchmark (Night Raid) performs as well as on the 1950X at a slightly slower clock speed, indicating that the culprit might be nestled in the newer tech that the Time Spy bench uses (DX12?). It would be interesting to explore whether it’s an “everything” issue or just a latest-library issue.


This problem actually isn’t new.
If you ever played around with Bulldozer Opterons, or even all the way back to Magny-Cours (K10), when you had multiple CPUs with lots of threads… the performance simply wouldn’t scale.
I remember seeing comparisons of Opteron vs Xeon on Windows with the Xeon just mopping the floor with it in ways that make no sense (like 10c vs 64c or 48c).
The problems were in Windows, and honestly I don’t think all of those have ever been fixed either.


The partnership between Intel and Microsoft wasn’t called “Wintel” for over a decade for nothing.


Sorry for the bump…

Did this ever get fixed?

Did you measure power consumption at the wall? That might have been a telltale sign.