Unlocking the 2990WX For Less-Than NUMA aware Apps | Level One Techs

It has been a long journey.

I remember the first time I ever used a dual processor system. I fondly recall it as if it were a religious experience. It was a dual Socket 370 system that was running two hacked up overclocked Celeron 300A processors, as best I can recall. I’ve never gone back to single core computing.


This is a companion discussion topic for the original entry at https://level1techs.com/article/unlocking-2990wx-less-numa-aware-apps

Didn’t know about dual NUMA modes, thanks for that :slight_smile:
I’m just left wondering why it had to be a ‘Wendell’ that told me this… now, after months of… silence really.

Murky waters here, so… my thanks again, love learning new stuff.

Edit: No offense or anything, hopefully you get what I mean. Hopefully, lol

Edit 2: So what happens on a dual socket Epyc/Rome mobo running Windows Server?
(I always thought there was ‘one’ NUMA… regardless of anything)

Are the CPUs enumerated the same under both memory (UMA/NUMA) configurations? Under the “slow” version, ideal CPU is clustered on 32 to 47.
Fast version has two clusters. One at 0-15 (~108 hits) and one at 48-63 (36/45 hits).

I wonder if it’s as simple as Windows treating the 4x4 array in ACPI as a 4x2 array. That would explain why both Threadripper and EPYC have a problem.

What’s the output of /sys/bus/node/devices/node[0-3] on a 2990WX in both modes? It should be easy to guess at the algorithm based on your ideal CPU enumeration.
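For anyone wanting to collect that, here is a minimal Python sketch. It assumes a Linux box where each node directory exposes `cpulist` and `meminfo` files (the usual sysfs layout under /sys/devices/system/node); the helper names `parse_cpulist` and `read_nodes` are made up for illustration.

```python
import glob
import os
import re


def parse_cpulist(text):
    """Expand a sysfs cpulist string such as '0-7,32-39' into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # an empty cpulist means the node has no CPUs
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def read_nodes(base="/sys/devices/system/node"):
    """Return {node_id: (cpus, mem_total_kb)} for every NUMA node found."""
    nodes = {}
    for path in sorted(glob.glob(os.path.join(base, "node[0-9]*"))):
        node_id = int(re.search(r"node(\d+)$", path).group(1))
        with open(os.path.join(path, "cpulist")) as f:
            cpus = parse_cpulist(f.read())
        mem_kb = 0
        with open(os.path.join(path, "meminfo")) as f:
            for line in f:
                # Lines look like: "Node 0 MemTotal:  32812345 kB"
                if "MemTotal" in line:
                    mem_kb = int(line.split()[-2])
        nodes[node_id] = (cpus, mem_kb)
    return nodes


if __name__ == "__main__":
    for node_id, (cpus, mem_kb) in sorted(read_nodes().items()):
        print(f"node{node_id}: cpus={cpus} MemTotal={mem_kb} kB")
```

Running it once in each memory mode and comparing the two listings would show whether the CPU numbering changes, and which nodes report no local memory.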

Apologies if I missed this information.

When your scheduler is using more CPU utilization than the actual work it’s scheduling.


Sick burn,

@wendell did you submit this to Microsoft for some bug money? (doubt they would pay but still)


Very interesting.

Curious if AMD would watch this video as well.


“bug” in windows kernel
or
tinfoil hat engaged

paid crippling

Would not be the first time Microsoft has deliberately gimped things. DR DOS anyone? It would also be nothing like the most sketchy thing Intel has done. And it would line right up with the Intel slides from 2017 about cores being glued together and “software optimisations required” for EPYC even before anyone really tested it much.

I’m not SERIOUSLY suggesting that this is the case. Buuuut…

@wendell
Wonder if the windows scheduler has the same problem in Windows 7, 8 (or Win2008/2003) or earlier? Wonder if this is why Microsoft officially and overtly “do not support” new CPUs on prior operating systems, which is kinda new since Windows 10. It could be that this bug was introduced (either maliciously or “accidentally”) in later versions of Windows.

Is this something we could somehow test for? Will earlier platforms even run on it? If they need to be virtualized for driver support (running only a single VM under a bare metal hypervisor), would that invalidate the scheduler testing?

I was also skeptical about memory bandwidth being the problem because:

  • Intel is doing high core counts with similar memory channel counts
  • AMD has BIGGER caches than Intel

Congrats and thanks for figuring it out! :slight_smile:


No

Toggling between slow and fast is just changing affinity via task manager, manually. Which is baffling.
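The same kind of affinity change can also be made from code rather than through Task Manager. Below is a rough Python sketch, assuming the process lives in a single processor group (so at most 64 logical CPUs) and using the Win32 `SetProcessAffinityMask` call through `ctypes`; the helper names are hypothetical, and this is not what CorePrio itself does.

```python
import ctypes
import sys


def affinity_mask(cpus):
    """Build a bitmask with one bit set per logical CPU index."""
    mask = 0
    for cpu in cpus:
        if not 0 <= cpu < 64:
            raise ValueError("only CPUs 0-63 fit in a single affinity mask")
        mask |= 1 << cpu
    return mask


def set_own_affinity(cpus):
    """Restrict the current process to the given CPUs (Windows only)."""
    if sys.platform != "win32":
        raise OSError("this sketch only targets Windows")
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.GetCurrentProcess.restype = ctypes.c_void_p
    kernel32.SetProcessAffinityMask.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
    kernel32.SetProcessAffinityMask.restype = ctypes.c_int
    handle = kernel32.GetCurrentProcess()
    # A zero return value means the call failed.
    if not kernel32.SetProcessAffinityMask(handle, affinity_mask(cpus)):
        raise ctypes.WinError(ctypes.get_last_error())
```

For example, `set_own_affinity(range(0, 16))` would pin the calling process to logical CPUs 0-15, the first cluster mentioned earlier in the thread.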

lshw -H is in the anandtech article (from me) which seems to have a pretty good topology for me.
lstopo clearly shows Linux knows that only two of the nodes are attached to memory.

Would be cool if Microsoft would acknowledge this video.
And that they might try to fix it.

Actively bringing it to their attention in their more software development oriented channels and forums might actually get it sent up to higher people. I’m not in on those networks though. I do not do much software development, but hmmm, I do know a few developers who work at the Boise office for Microsoft. Not that they really have the clout to help send it up the line lol

So to sum this up in layman’s terms is it right for me to say that MS Windows is gimping the performance of certain AMD cpus in certain situations and that maybe just maybe this could be intentional as per a deal struck between MS and Intel? At the very least though it can be said Windows is gimping AMD cpus. Yes?

This has not been proven and my suggestion was somewhat tongue in cheek. So I think it is definitely WAY WAY TOO EARLY to assume that. It may be/likely is just a bug.

But I do think it would at least be worth testing old Windows versions for this, based on prior behavior by both Intel and Microsoft.

Either there is some unexplained strategy in the Windows scheduler that we do not and never will know about as it is closed source, or it’s a bug.

edit:
Maybe it is some high core count monolithic die strategy thing to distribute CPU temperature across the die to avoid hot spots (so boost can run higher on Intel perhaps)? Who knows.

Does the same behavior (thread juggling) happen on high core count Intel (but doesn’t hurt it as badly due to the inherently UMA architecture)?

Given that this problem appears to be isolated to Windows, I think we should be asking the question: “Why?”.

As in “why is Windows doing this in its scheduler?”… now that we know what the scheduler is doing to cause the performance regression. There’s quite obviously an algorithm difference between Windows and other platforms here.

It’s either a deliberate algorithm choice due to some benefit on some architectures, it’s an accident or something insidious. There are only 3 options.


This is fascinating, …
and some of it is way over my head. Still, I enjoyed the read :grin:

Thank you Wendell!


This is the kind of stuff I love this channel for. :smiley:


I too had a dual cpu celeron from the pentium 3 days

trying to remember the name of the motherboard…

think it was an abit


I have not read/watched what’s new and I am far, very far, from an expert. Disclaimer over.

I don’t see why it would not happen there too. Though whatever the reason Intel will probably end up saving this… Their new strategy is also multi chip designs so somewhere Microsoft will have to either fix this or Intel will fix it for them for their own ends.

My worry with that idea is market segmentation. Intel loves this, so I can easily see, assuming the above is correct, that they fix it for enterprise, where their multi chip server CPUs will be used, while continuing to make monolithic consumer CPUs. So they get the speed for both and still manage to leave AMD out of the fix.

This is WILD speculation.

But at the same time, with Intel hurting from patches slowing them down, AMD eating them and whatever happens to them next, they cannot afford to be slow even in the face of also artificially slowed AMD parts. So one way or another I expect it to be fixed.

That or people with serious workloads will just start dropping windows for something that can take their hardware and effectively run it way faster with just an OS change.

I just ran CorePrio on a server with 2x AMD Epyc 7301 CPUs, with all memory slots populated (256GB in total).
IndigoBench score was 1.4 on the bedroom scene without CorePrio, and 2.4 with CorePrio.
Here comes the weird part:
I run IndigoBench with CorePrio and get the 2.4 score. I now close IndigoBench, and open Cinebench. The score is 4000, around the same as it has always been. Now, when I go and open IndigoBench again, the score is back to 1.4. Note that this is without touching CorePrio in the meantime, so the service is still running.
Maybe CorePrio gets confused by two different programs run one after the other, both using 100% CPU?

If I close IndigoBench and restart the CorePrio service, then open and run IndigoBench again, the score goes to 2.4 again.

EDIT: If I close IndigoBench and open it again without restarting the service, at some point the score goes to 2.4 again

Isn’t this technically just a workaround and not a “fix”?
Guess we need some Microsoft people to actually “fix” it?

Unless I got something wrong here lol.


This is just thinking, but I remember there being some confusion like this when gen1 threadripper came along. The CPU will learn workloads so if you do a bunch at once all the same it will speed up as it can predict what is next, but switch loads constantly just treats it like a regular CPU and it starts fresh every time.

Was this SenseMI? I can’t remember but it has a name.

someone didn’t read the fine print :wink: