I am a lot less worried about dual-channel memory potentially being a bottleneck for the Ryzen 3000 series CPUs now (if they are indeed as insane as the leaks suggest).
Software engineer/computer engineer at heart? Very nice discovery.
Do you think this will occur on all chiplet-based AMD processors? I'm not sure if Ryzen 3 is going to be chiplet-based; I have yet to look into the leaks.
The biggest issue with maintaining and providing patches and upgrades to Windows, from Microsoft's perspective, is literally the fact that there is still so much unnecessary stuff in the Windows kernel. I commend how much Windows 10 streamlined it, but there are still a lot of issues. It's where Linux has the advantage development-wise, IMHO.
It's not so much the total number of processes; if you look at the numbers on stock Windows versus Linux (open top), they are pretty similar. It really comes down to the kernel of the OS.
It's probably unrelated, but somehow this feels a little bit similar to the performance issues that Windows had when the first AMD FX 8-core Zambezi CPUs came out.
They later provided some patches that sorta-kinda fixed it, but not really.
They are definitely going to be inclined to fully fix this one, because Zambezi did not affect the Windows server and enterprise markets much. This issue actually does, so I hope to see Microsoft be “more committed” to solving it.
See, I worked on this huge project several years ago, and Microsoft was involved in the fix. It involved Xeon E5 v4 CPUs, their hidden NUMA node, and spookily similar performance regressions.
Microsoft’s explanation of what the fix was never fully satisfied me. I have tried to find some concrete connection between the two things… but none so far. The tenuous connection is that the first NUMA node other than the ideal one will also run properly, while the other two are just a swirling mess of constantly moving threads.
I ran this on my system (MEG CREATION, 2990WX, 128GB RAM @ 2933) and Indigo jumps from 2.228 room/5.530 car stock to 3.476 room/7.272 car with it enabled.
One thing I did note was that with Process Lasso and CorePrio both enabled there was no delta; the buggish behavior prevails. Is that a noted/obvious thing, running both being an issue?
Hey Wendell - not sure if you saw this comment on the previous YouTube video but I will paste it again here just in case.
Long-time viewer, first-time poster, engagement yay - years ago (literally) when Ryzen first released and the whole Windows scheduler “scandal” surfaced, I wrote a (very rough) tool for messing with thread scheduling - [Reddit Thread][Github Project]
It basically allows for easy thread-loading “patterns”… even in some games on a 4- or 6-core Intel CPU you can see a difference (for the better); I would be very interested to see what would happen on a Threadripper.
I don’t have a Threadripper CPU, so it’s all based on assumptions instead of actually querying NUMA state, but I’m interested to see how all this progresses as these CPUs start to become more compelling with the 3xxx series.
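For reference, actually querying NUMA state on Windows isn't too bad. Here's a minimal sketch (untested by me on real Threadripper hardware, assuming Windows 7+ APIs) of enumerating the nodes and their processor masks:

```cpp
// Minimal sketch: enumerate NUMA nodes and their processor masks on Windows.
#include <windows.h>
#include <cstdio>

int main() {
    ULONG highestNode = 0;
    if (!GetNumaHighestNodeNumber(&highestNode)) {
        std::printf("GetNumaHighestNodeNumber failed: %lu\n", GetLastError());
        return 1;
    }
    std::printf("Highest NUMA node: %lu\n", highestNode);

    for (USHORT node = 0; node <= highestNode; ++node) {
        GROUP_AFFINITY affinity = {};
        // Which logical processors (and which processor group) belong to this node.
        if (GetNumaNodeProcessorMaskEx(node, &affinity)) {
            std::printf("Node %u: group %u, mask 0x%llx\n",
                        node, affinity.Group,
                        (unsigned long long)affinity.Mask);
        }
    }
    return 0;
}
```

On a 2990WX this should report four nodes; the interesting part is that two of the dies have no local memory attached, which seems to be where all this scheduler weirdness comes from.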
Just finished up with the video (better late than never), and looks like I’ve got some fun stuff to look at once I get home. I’ll grab the CorePrio software and see if I can’t help out. Is there anything that you’d want people to look at with respect to performance?
To your program: can you expose the Windows-reported “ideal_cpu”? And manipulate that? Manipulating that, more than affinity, might be the really interesting bit.
I’ll have to take a look again and see whether or not I could even test/debug such functionality without a NUMA-based system - somewhat difficult, I suspect.
The whole point of the code I posted was that it sets the thread affinity within the process… not in a scientific way, but in a way that you can manipulate via the various command-line switches… allowing one to observe changes in behavior…
It entirely stomps all over the Windows scheduler (in theory the ideal-processor hint will be ignored) by forcing thread affinity (again, important distinction: thread affinity, not process affinity).
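In rough terms the technique is just this (a minimal, stripped-down sketch, not my actual tool - assuming a PID and a hex affinity mask on the command line, and a single processor group):

```cpp
// Minimal sketch: pin every thread of a target process to a CPU mask.
// Note this is per-thread SetThreadAffinityMask, not SetProcessAffinityMask.
#include <windows.h>
#include <tlhelp32.h>
#include <cstdio>
#include <cstdlib>

int main(int argc, char** argv) {
    if (argc < 3) {
        std::printf("usage: %s <pid> <hex-affinity-mask>\n", argv[0]);
        return 1;
    }
    DWORD pid = (DWORD)std::strtoul(argv[1], nullptr, 10);
    DWORD_PTR mask = (DWORD_PTR)std::strtoull(argv[2], nullptr, 16);

    // Snapshot all threads in the system, then filter by owning PID.
    HANDLE snap = CreateToolhelp32Snapshot(TH32CS_SNAPTHREAD, 0);
    if (snap == INVALID_HANDLE_VALUE) return 1;

    THREADENTRY32 te = {};
    te.dwSize = sizeof(te);
    for (BOOL ok = Thread32First(snap, &te); ok; ok = Thread32Next(snap, &te)) {
        if (te.th32OwnerProcessID != pid) continue;
        HANDLE thread = OpenThread(THREAD_SET_INFORMATION | THREAD_QUERY_INFORMATION,
                                   FALSE, te.th32ThreadID);
        if (!thread) continue;
        if (SetThreadAffinityMask(thread, mask))
            std::printf("Pinned thread %lu to mask 0x%llx\n",
                        te.th32ThreadID, (unsigned long long)mask);
        CloseHandle(thread);
    }
    CloseHandle(snap);
    return 0;
}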
Another important note (not an expert on NUMA, mind you) is that SetThreadIdealProcessorEx (a Win32 API call) is not something I believe the OS ever just calls by itself; it’s the software itself that would be expected to make these API calls… and ultimately it’s the software employing questionable thread grouping.
“Specifying a thread ideal processor provides a hint to the scheduler about the preferred processor for a thread. The scheduler runs the thread on the thread’s ideal processor when possible.”
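A minimal sketch of what driving that hint looks like (operating on the current thread for simplicity - for another process’s threads you’d OpenThread with THREAD_SET_INFORMATION - and the CPU 0 target below is just an arbitrary example):

```cpp
// Minimal sketch: read and move the "ideal processor" hint for the current thread.
#include <windows.h>
#include <cstdio>

int main() {
    HANDLE thread = GetCurrentThread();

    PROCESSOR_NUMBER ideal = {};
    if (GetThreadIdealProcessorEx(thread, &ideal)) {
        std::printf("Current ideal CPU: group %u, number %u\n",
                    ideal.Group, ideal.Number);
    }

    // Hint to the scheduler that CPU 0 in group 0 is preferred (arbitrary example).
    PROCESSOR_NUMBER wanted = {};
    wanted.Group = 0;
    wanted.Number = 0;
    PROCESSOR_NUMBER previous = {};
    if (SetThreadIdealProcessorEx(thread, &wanted, &previous)) {
        std::printf("Moved ideal CPU hint from %u to %u\n",
                    previous.Number, wanted.Number);
    }
    return 0;
}
```

Note that, per the docs quoted above, this is only a hint; unlike SetThreadAffinityMask, the scheduler is free to ignore it.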
Out of interest - have you attached a debugger while the benchmark is running and ascertained where the contention lies? Most likely it is a spinlock or mutex/critical section.
Is there any info on whether this affects only client-grade Windows systems?
Traditionally, Windows client/workstation systems only ever supported a max of 2 sockets and thus, supposedly, 2 NUMA nodes.
What about Windows Server? The Datacenter editions support up to 64 sockets and should therefore, in theory, work with more than 2 NUMA nodes without a problem. (Server 2016 Standard also supports 64 sockets.)
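One easy way to check what a given edition actually exposes is to ask the OS directly; a minimal sketch (assuming Windows 7+ APIs):

```cpp
// Minimal sketch: report how many processor groups and NUMA nodes
// this Windows installation exposes.
#include <windows.h>
#include <cstdio>

int main() {
    WORD groups = GetActiveProcessorGroupCount();
    ULONG highestNode = 0;
    GetNumaHighestNodeNumber(&highestNode);

    std::printf("Active processor groups: %u\n", groups);
    for (WORD g = 0; g < groups; ++g)
        std::printf("  group %u: %lu logical processors\n",
                    g, GetActiveProcessorCount(g));
    std::printf("NUMA nodes exposed: %lu\n", highestNode + 1);
    return 0;
}
```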
That is the common explanation for why Windows shuffles threads around. However, GPUs don't do that, are unprotected, and are much larger than CPUs, so thermal stress is unlikely.
I would take a blind guess and say it is “we need more features” winning over “we need a stable backbone”.