Yes I agree, I just think might be waiting a long time for microsoft to fix the scheduler.
Since this issue is to do with NUMA is windows also allocating all the memory just from the RAM directly connected to that NUMA node. So also effectively limited to two channels. Might be interesting running the test with the just two DIMMS in the sockets connected to one die. To see if the performanse is the same or lower.
The windows scheduler under NUMA is meant to try where possible to schedule the threads on the NUMA node where the memory has been allocated.
I have a 2990WX installed on a MSI MEG Creation running at 3.8Ghz overclock. So far, testing has shown that with 3DMark, indeed weâre seeing something happening that is lowering the performance of this CPU.
My prior machine was a 1950 installed on an MSI Gaming Carbon Pro - and the benches follow:
(2990Wwx and 1950x with 2080 TI)
Night Raid:
2990WX (3.8Ghz: 37,653
1950X (4Ghz): 37,308
Time Spy:
2990WX (3.8Ghz): 8,751
1950X (4Ghz): 12,387
Based on the information that was posted here - this leads me to believe that there may be something related to the later DX11/DX12 code that is causing these problems when Windows seeâs more than 16 cores.
In fact, Itâs not too surprising that the bench for Night raid is very similar as I think the bottleneck is the GPU, not the CPU.
It kinda feels to me like some kind of timeout issue. I.e. windows has never had to deal with memory being allocated through another core - and perhaps there is something hard-coded into windows that is indifferent to memory allocation when the NUMA has direct access - but does hit an exception that is then later handled when the NUMA does not have direct access.
Obviously, weâre not âbreakingâ something or erroring out. And since it works flawlessly on Linux - we can assume that it has do be related to the codebase of windows and how it handles something related to the way that the TR4 architecture is designed.
I understand that there is some software that might be downloaded to help with the supposed issue. I will try to locate and will post results when I can.
itâs definitely a windows thing⌠it feels like they hardcoded something to handle multiple NUMAs (meaning 2) - with the thought process of - There will only ever be 2 NUMAs per CPU die.
And now that we have 4, when it attempts to hit 3 and 4, it throws an exception, which then filters up the stack, until it getâs handled. Of course, all of that apparently takes time - which is what is slowing us down.
What I find interesting is that an older benchmark (night raid) performs equally as well as the 1950x at a slightly slower speed. Indicating that the culprit might be nestled into the newer tech that the Time Spy bench uses. (DX12?). It would be interesting to explore a bit around if itâs an âeverythingâ issue, or just a latest library issue.
This problem actually isnât new.
If you ever played around with Bulldozer Opterons, or even all the way back to Magny-Cours (K10) when you had multiple CPUâs with lots of threads⌠the performance simply wouldnât scale.
I remember seeing the comparisons of Opteron vs Xeon on windows and the Xeon just mopping the floor with it in ways that make no sense (like 10c vs 64c or 48c).
The problems were in windows, and honestly I donât think all of those have ever been fixed either.