2990WX Threadripper Performance Regression FIXED on Windows* #threadripper | Level One Techs

*At least in cases like this one :D [ sry, clickbait works, but see these awesome articles below v]


This is a companion discussion topic for the original entry at https://level1techs.com/video/2990wx-threadripper-performance-regression-fixed-windows-threadripper

Great vid; those are the kind I LOVE!


Fantastic video… it will be interesting if someone at Microsoft who knows what the fuck they are doing gets in touch with you.

Which Windows 10 version were you using? If it was said in the video I missed it.

Awesome

I am a lot less worried about dual channel potentially being a bottleneck for 3000 series Ryzen CPUs now (if they are indeed as insane as the leaks suggest).


Well yeah, Windows has so much unnecessary crap running in the background
compared to Linux.


It would make my day if Mark Russinovich said hi. Been following him since before Microsoft hired him.

Every blog post of his back in the day was “Damn, Microsoft needs to hire this guy, yesterday.” They finally did :smiley:

TimmyTechTV might be able to help us out here, too. It sure is spoopy.


I've watched the interviews with him on Channel 9 on the MS site and a few deep dives. He is a great guy. Smart.

Would be nice if he tackled this issue, considering the rumors of 16 core AM4 chips being announced next week.

The rumored 12 core 5.0 GHz boost CPU is a must buy if true.


Software engineer/computer engineer at heart? :sweat_smile: Very nice discovery.

Do you think this will occur on all chiplet-based AMD processors? Not sure if Ryzen 3 is going to be chiplet-based; I have yet to look into the leaks.

The biggest issue in maintaining and providing patches and upgrades to Windows, from the Microsoft perspective, is literally the fact that there is still so much unnecessary stuff in the Windows kernel. I commend how much Windows 10 streamlined it, but there are still a lot of issues. It's where Linux has the advantage development-wise, imho.

It's not so much the total number of processes; if you look at the numbers on stock Windows versus stock Linux (open top), they are pretty similar. It really comes down to the kernel of the OS.

It's probably unrelated, but somehow this feels a little bit similar
to the performance issues that Windows had when the first AMD FX 8 core Zambezi CPUs came out.
They later provided some patches to sorta kinda fix it, but not really.


They are definitely going to be inclined to fully fix this one. The reason being that Zambezi did not affect the Windows server and enterprise markets much. This issue actually does, so I hope to see Microsoft be “more committed” to solving it.

See, I worked on this huge project several years ago, and Microsoft was involved in the fix. It involved Xeon E5 V4 CPUs and their hidden NUMA node and spookily-similar performance regressions.

Microsoft’s explanation of what the fix was never fully satisfied me. I have tried to find some concrete connection between the two things… but none so far. The tenuous connection is that the first NUMA node other than the ideal one will also run properly, while the other two are just a swirling mess of constantly moving threads.


What was the logic in shuffling threads around the cores? There must be some history to doing it, like heat spreading etc.


I ran this on my system (MEG CREATION, 2990WX, 128GB RAM @ 2933) and Indigo jumps from 2.228 room/5.530 car stock to 3.476 room/7.272 car with it enabled.
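For anyone who wants those numbers as a relative gain, here is a quick sketch of the arithmetic. It assumes the Indigo scores are higher-is-better (which matches "jumps from ... to ..." above); the `speedup` helper and the dictionary names are just made up for illustration.

```python
def speedup(before, after):
    """Ratio of the new score to the old score (assumes higher is better)."""
    return after / before


# Scores quoted in the post above: stock vs. with the fix enabled.
stock = {"room": 2.228, "car": 5.530}
fixed = {"room": 3.476, "car": 7.272}

for scene in stock:
    gain_percent = (speedup(stock[scene], fixed[scene]) - 1) * 100
    print(f"{scene}: {gain_percent:.1f}% faster")
```

That works out to roughly a 56% gain on the room scene and about 31.5% on the car scene.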

One thing I did note was that with Process Lasso & CorePrio both enabled there was no delta; the buggish behavior prevails. Is that a noted/obvious thing, running both being an issue?

Hey Wendell - not sure if you saw this comment on the previous YouTube video but I will paste it again here just in case.

Long time viewer, first time poster, engagement yay - years ago (literally) when Ryzen first released and the whole Windows scheduler “scandal” surfaced, I wrote (very rough) a tool for messing with thread scheduling - [Reddit Thread] [Github Project]

Basically allows for easy thread loading “patterns” … even in some games on a 4 or 6 core Intel CPU you can see a difference (for the better) ; would be very interested to see what would happen on a Threadripper.

I don’t have a Threadripper CPU etc, so it’s all based off assumptions instead of actually querying NUMA state etc, but I’m interested to see how all this progresses as these CPUs start to become more compelling with the 3xxx series etc.


Just finished up with the video (better late than never), and looks like I’ve got some fun stuff to look at once I get home. I’ll grab the CorePrio software and see if I can’t help out. Is there anything that you’d want people to look at with respect to performance?

Also, awesome job tracking this down!

In your program, can you expose the Windows-reported “ideal_cpu” and manipulate that? Manipulating that more than affinity might be the really interesting bit.

I’ll have to take a look again and see whether or not I could even test/debug such functionality without a NUMA-based system - somewhat difficult, I suspect.

The whole point of the code I posted was that it sets the thread affinity within the process… not in a scientific way but in a way that you can manipulate via the various command line switches… allowing one to make an observation as to behavior change…

It entirely stomps all over the windows scheduler (in theory ideal thread will be ignored) forcing thread affinity (again important distinction, thread affinity, not process affinity)
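To make the thread affinity point concrete, here is a minimal sketch of hard-pinning the calling thread through the Win32 `SetThreadAffinityMask` call from Python via `ctypes`. This is not the code from the GitHub project mentioned above; `affinity_mask` and `pin_current_thread` are hypothetical helper names, and the sketch only covers a single processor group (at most 64 logical CPUs).

```python
import ctypes
import sys


def affinity_mask(cpus):
    """Build a bitmask with one bit set per logical CPU index."""
    mask = 0
    for cpu in cpus:
        if cpu < 0 or cpu >= 64:
            raise ValueError("one processor group holds at most 64 logical CPUs")
        mask |= 1 << cpu
    return mask


def pin_current_thread(cpus):
    """Hard-pin the calling thread to the given CPUs.

    Returns the previous affinity mask, or None when not running on Windows.
    """
    if sys.platform != "win32":
        return None  # the Win32 call below only exists on Windows
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.GetCurrentThread.restype = ctypes.c_void_p
    kernel32.SetThreadAffinityMask.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
    kernel32.SetThreadAffinityMask.restype = ctypes.c_size_t
    previous = kernel32.SetThreadAffinityMask(
        kernel32.GetCurrentThread(), affinity_mask(cpus)
    )
    if previous == 0:  # zero means the call failed
        raise ctypes.WinError(ctypes.get_last_error())
    return previous
```

For example, `pin_current_thread([0, 1, 2, 3])` restricts only the calling thread to the first four logical CPUs; other threads in the same process keep whatever affinity they had.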

Another important note (not an expert on NUMA, mind you) is that SetThreadIdealProcessorEx (a Win32 API call) is not something I believe the OS ever just calls by itself; it is the software that would be expected to make these API calls… and ultimately it would be the software employing questionable thread grouping.


The API documentation states the following:

“Specifying a thread ideal processor provides a hint to the scheduler about the preferred processor for a thread. The scheduler runs the thread on the thread’s ideal processor when possible.”
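For comparison with hard affinity, here is a rough sketch of setting that hint from Python via `ctypes`. `SetThreadIdealProcessorEx` and the `PROCESSOR_NUMBER` structure are the real Win32 names; the `set_ideal_processor` wrapper is a hypothetical helper, and I have not tested it on a NUMA system.

```python
import ctypes
import sys


class PROCESSOR_NUMBER(ctypes.Structure):
    """Mirrors the Win32 PROCESSOR_NUMBER structure (group + index in group)."""
    _fields_ = [
        ("Group", ctypes.c_ushort),
        ("Number", ctypes.c_ubyte),
        ("Reserved", ctypes.c_ubyte),
    ]


def set_ideal_processor(group, number):
    """Set the ideal-processor hint for the calling thread.

    Returns the previous (group, number) pair, or None when not on Windows.
    This is only a hint to the scheduler, unlike a hard affinity mask.
    """
    if sys.platform != "win32":
        return None  # the Win32 call below only exists on Windows
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.GetCurrentThread.restype = ctypes.c_void_p
    kernel32.SetThreadIdealProcessorEx.argtypes = [
        ctypes.c_void_p,
        ctypes.POINTER(PROCESSOR_NUMBER),
        ctypes.POINTER(PROCESSOR_NUMBER),
    ]
    kernel32.SetThreadIdealProcessorEx.restype = ctypes.c_int
    wanted = PROCESSOR_NUMBER(group, number, 0)
    previous = PROCESSOR_NUMBER()
    ok = kernel32.SetThreadIdealProcessorEx(
        kernel32.GetCurrentThread(), ctypes.byref(wanted), ctypes.byref(previous)
    )
    if not ok:
        raise ctypes.WinError(ctypes.get_last_error())
    return previous.Group, previous.Number
```

The return value tells you what the ideal processor was before the call, which is a cheap way to see what the thread had been assigned.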

Out of interest - have you attached a debugger while the benchmark is running and ascertained where the contention lies? Most likely it is a spinlock or a mutex/critical section.

Very outstanding work Wendell!

Is there any info if this affects only client-grade Windows systems?
Traditionally Windows client/workstation systems only ever supported max 2 sockets and thus, supposedly, 2 NUMA nodes.

What about Windows Server? The Datacenter editions support up to 64 sockets and by that should, in theory, work with more than 2 NUMA nodes without a problem. (Server 2016 Standard also supports 64 sockets)

That is the common explanation as to why Windows shuffles threads around. However, GPUs don't do that, are unprotected and much larger than CPUs. So thermal stress is unlikely.

I would take a blind guess and say it is “we need more features” over “we need a stable backbone”