Unlocking the 2990WX for Less-Than-NUMA-Aware Apps

Is this unique to AMD, or could this have an impact on Intel performance as well?

That’s the debate. The single-chip Intel stuff should be fine, but their multi-chip stuff should show similar problems, though it could also come down to how they handle their memory.

Yeah, I thought about that, but I don’t think that is what is causing the performance discrepancy, since it bounces between non-CorePrio and CorePrio speeds (71.4% more performance), and IIRC SenseMI does not have that much of an impact on performance.
Furthermore, running IndigoBench multiple times without CorePrio does not really change the result, maybe by 0.02 points or so.


After watching Wendell’s video I tried running IndigoBench (v4.0.64) and coreprio (v0.0.3.5) on my dual-socket 32-core Xeon workstation (specs below). The score for the bedroom scene was 2.1 both with and without coreprio’s NUMA Dissociater running.

Sysinternals coreinfo.exe reports that I have two NUMA nodes. How many does the 2990WX report?

System specs:
2 sockets / 2 NUMA nodes / 32 cores total / 64 threads
Dual Xeon CPU E5-2675 v3 @ 1.80GHz
Z10PE-D16 WS
32 GB RAM
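
For anyone who wants to cross-check programmatically rather than with coreinfo, here’s a minimal sketch using the documented Win32 calls GetNumaHighestNodeNumber and GetNumaNodeProcessorMaskEx (just a quick probe, not coreprio code):

```c
/* Print the NUMA topology Windows reports. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    ULONG highest = 0;
    /* Node count is highest node number + 1. */
    if (!GetNumaHighestNodeNumber(&highest)) {
        fprintf(stderr, "GetNumaHighestNodeNumber failed: %lu\n", GetLastError());
        return 1;
    }
    printf("NUMA nodes reported: %lu\n", highest + 1);

    /* Group-aware processor mask for each node. */
    for (USHORT node = 0; node <= (USHORT)highest; node++) {
        GROUP_AFFINITY aff;
        if (GetNumaNodeProcessorMaskEx(node, &aff))
            printf("node %u: group %u, mask 0x%llx\n",
                   node, aff.Group, (unsigned long long)aff.Mask);
    }
    return 0;
}
```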

For giggles, can you set interleave to die? You will then have two NUMA nodes instead of 8.

Coreprio doesn’t have smarts for dealing with 2S systems yet. I’d guess that if your program happens to run on the first socket, it resets the ideal CPU; if not, it doesn’t.

You can run Coreprio -b processname to get a dump of what hints Windows has assigned.

You should also manually set affinity via Task Manager, taking one core out of each processor group, to see what happens.
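
If the Task Manager clicking gets tedious, the same experiment can be scripted. A rough sketch, assuming all 64 logical processors sit in a single processor group and each NUMA node owns 16 consecutive ones (that bit layout is an assumption; verify it with coreinfo first):

```c
/* Sketch: remove one logical processor per 16-LP NUMA node from a
   process's affinity mask. Bits 0/16/32/48 are assumed to be the
   first LP of each node; with SMT a whole core would be two bits. */
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s <pid>\n", argv[0]); return 1; }

    HANDLE h = OpenProcess(PROCESS_SET_INFORMATION | PROCESS_QUERY_INFORMATION,
                           FALSE, (DWORD)strtoul(argv[1], NULL, 10));
    if (!h) { fprintf(stderr, "OpenProcess failed: %lu\n", GetLastError()); return 1; }

    DWORD_PTR procMask = 0, sysMask = 0;
    GetProcessAffinityMask(h, &procMask, &sysMask);

    DWORD_PTR newMask = procMask;
    for (int node = 0; node < 4; node++)
        newMask &= ~((DWORD_PTR)1 << (node * 16));

    if (SetProcessAffinityMask(h, newMask))
        printf("affinity 0x%llx -> 0x%llx\n",
               (unsigned long long)procMask, (unsigned long long)newMask);

    CloseHandle(h);
    return 0;
}
```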

Could you also change the polling interval to 50 ms to see what that does?

I bet die interleave will be face-melting, though.

1 socket, 4 NUMA nodes.

Sorry if this is naive, but isn’t this NUMA issue what AMD’s Dynamic Local Mode is supposed to address?

I didn’t see any mention of “DLM” or “dynamic” in the article or comments, either skimming or via search. Were the Windows results shown here taken with Ryzen Master installed and DLM enabled? If that hasn’t been tried, it would be interesting to compare it to the performance achieved with Coreprio.

It’s not applicable to the 2990WX, as it simply doesn’t support the “Distributed” memory mode at all. Not in hardware, at least, while the software / operating system may of course still choose to map contiguous virtual memory to both NUMA nodes in an alternating pattern. (That achieves a somewhat similar effect for random memory access, though not so much for peak bandwidth on linear access…)
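
To illustrate that software-side interleaving: an allocator can simply alternate the preferred node per chunk with the documented VirtualAllocExNuma call. A minimal sketch, assuming a two-node layout (node IDs and chunk sizes are illustrative):

```c
/* Sketch: "software interleave" by alternating the preferred NUMA
   node chunk by chunk. The node is only a preference; Windows may
   still place pages elsewhere under memory pressure. */
#include <windows.h>
#include <string.h>
#include <stdio.h>

#define CHUNK  (2 * 1024 * 1024)  /* 2 MiB per chunk */
#define CHUNKS 16

int main(void)
{
    for (int i = 0; i < CHUNKS; i++) {
        void *p = VirtualAllocExNuma(GetCurrentProcess(), NULL, CHUNK,
                                     MEM_RESERVE | MEM_COMMIT,
                                     PAGE_READWRITE, i % 2 /* node 0 or 1 */);
        if (!p) { fprintf(stderr, "chunk %d failed\n", i); return 1; }
        memset(p, 0, CHUNK);  /* touch, so the pages are actually backed */
    }
    puts("committed 16 chunks alternating between nodes 0 and 1");
    return 0;
}
```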

Still got to ask though: on what basis were memory bandwidth issues ruled out?

Affinity to a specific CPU means that all memory allocations happen on that CPU’s NUMA node, even when threads spill over to other nodes because the process has more threads than the node has cores.

Correspondingly, when running a non-NUMA-aware application on a NUMA system, the (classic) default behavior is for the OS to ensure, and assume, that the entire process stays on a single NUMA node. For as long as possible, at least, until either active threads or memory allocations spill over.
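
A quick way to see that policy in action: pin a thread to one node’s processors before it allocates, and its pages should end up in that node’s local memory. A rough sketch (node 0 chosen arbitrarily):

```c
/* Sketch: pin the current thread to NUMA node 0, then do a plain
   allocation with no node requested; first touch should place the
   pages in node 0's local memory under the default policy. */
#include <windows.h>
#include <string.h>
#include <stdio.h>

int main(void)
{
    GROUP_AFFINITY aff = {0};
    if (!GetNumaNodeProcessorMaskEx(0, &aff)) return 1;
    if (!SetThreadGroupAffinity(GetCurrentThread(), &aff, NULL)) return 1;

    size_t size = 64 * 1024 * 1024;
    char *buf = VirtualAlloc(NULL, size, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    if (!buf) return 1;
    memset(buf, 1, size);  /* first touch while pinned to node 0 */

    puts("allocated and touched 64 MiB while pinned to node 0");
    return 0;
}
```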

Run Indigo 4 times in parallel on the 2990WX, each instance with only 16 threads (for example via cmd’s start /node N /affinity switches), and it should scale as expected. Without messing with the default behavior of the Windows scheduler, obviously.

And please try dumping the heap for a “bad run”. I expect that you will see that memory allocations are also very lopsided towards a single node, matching the claimed affinities.
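
QueryWorkingSetEx is the tool for exactly that: each resident page reports which node backs it. A minimal self-inspection sketch (a real dump would walk the target process’s heap regions instead of a local buffer):

```c
/* Sketch: count which NUMA node backs each page of a buffer.
   May need psapi.lib on older toolchains. */
#include <windows.h>
#include <psapi.h>
#include <string.h>
#include <stdio.h>

#define PAGES 64

int main(void)
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);

    char *buf = VirtualAlloc(NULL, PAGES * si.dwPageSize,
                             MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    memset(buf, 1, PAGES * si.dwPageSize);  /* make the pages resident */

    PSAPI_WORKING_SET_EX_INFORMATION info[PAGES];
    for (int i = 0; i < PAGES; i++)
        info[i].VirtualAddress = buf + (size_t)i * si.dwPageSize;

    if (!QueryWorkingSetEx(GetCurrentProcess(), info, sizeof(info)))
        return 1;

    DWORD perNode[64] = {0};  /* Node is a 6-bit field */
    for (int i = 0; i < PAGES; i++)
        if (info[i].VirtualAttributes.Valid)
            perNode[info[i].VirtualAttributes.Node]++;

    for (int n = 0; n < 64; n++)
        if (perNode[n]) printf("node %d: %lu pages\n", n, perNode[n]);
    return 0;
}
```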

As far as I can tell: on the basis that, for the tested applications, there was no performance difference between the octa-channel EPYC and the quad-channel Threadripper system once all other variables (cores, clock speed, …) were equal (after the performance problem was “fixed”).

To me that is a reasonable conclusion. It’s probably still possible to find memory-(bandwidth-)limited applications, but that will always be the case, and Wendell tested real-world applications.

Let me rephrase the question: on what basis did you assume that you were actually using more than 2 memory channels of the quad / octa-channel setups in the first place?

EDIT:
What I’m implying is that you get a couple of peculiar effects if you put too much stress on a single NUMA node. Not only raw bandwidth, latency, and memory transactions, but also secondary effects, like hard bottlenecks in page fault interrupts and the like. The latter are especially nasty on Windows.

So if the tested application isn’t NUMA aware, the 2990WX as well as EPYC in NUMA mode are bound to perform only as well as, or even worse than, a single module, aka a Ryzen. In the worst-case scenarios at least, but there are a couple of those, and not all of them are trivial.

This got picked up by Guru3D and HotHardware

EDIT: And Extremetech

EDIT2: And bit-tech

And Hexus.


Interesting: AnandTech has noticed Windows scaling issues going from 2 NUMA nodes to 4 NUMA nodes before. Again AMD, but I suspect nobody who publishes benchmarks can afford an Intel 4-NUMA-node machine. Maybe AMD should allow NUMA to be disabled in the BIOS for Threadripper.

2013
https://www.anandtech.com/show/6508/the-new-opteron-6300-finally-tested/13

2011
https://www.anandtech.com/show/4486/server-rendering-hpc-benchmark-session/5

One can’t ‘just’ disable it: for EPYC, but particularly for Threadripper, Non-Uniform Memory Access is a quintessential part of the design, of the very heart of the machine.

It needs to be understood, tweaked, and harnessed to its full potential with AMD’s chips, not treated as a handicap like it largely has been in the past.

NUMA can work great if the OS is properly aware of it. Linux has demonstrated that to me and many others countless times, first in the data center and then on the desktop.

It’s Windows’ turn to adapt :slight_smile:


Yeah, the VFIO folks see this too. GTA V in an 8-core VM pinned to the die servicing the graphics card has no problems at all, and it’s great. Highest FPS and everything.


x264 and x265 thread tuning next! So many people will appreciate some code tweaks there for real-time encoding with proper thread utilization on massively high-core-count platforms!

By disable I really meant stop reporting to Windows that it is NUMA, so Windows thinks it’s UMA and won’t try the ideal-CPU logic. The option seems to be there for the server products. Not an ideal solution, I know.

Alas, MS can be slow, and they may take the line that applications should be NUMA aware. So: won’t fix.

It’d be interesting to see dumps of applications which do scale well. Are they setting affinity, or setting the ideal NUMA node for groups of threads? My guess is that if the app sets the ideal CPU itself, the kernel doesn’t reset it.
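
For reference, the “ideal CPU” is a per-thread hint with documented getters and setters, so a dump is straightforward. A minimal sketch reading it back for the current thread (a real dump would iterate every thread of the target process):

```c
/* Sketch: read the ideal processor Windows has assigned to this thread. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    PROCESSOR_NUMBER ideal;
    if (!GetThreadIdealProcessorEx(GetCurrentThread(), &ideal)) {
        fprintf(stderr, "GetThreadIdealProcessorEx failed: %lu\n", GetLastError());
        return 1;
    }
    printf("ideal processor: group %u, number %u\n", ideal.Group, ideal.Number);

    /* An app can also choose its own, which is what a well-scaling
       app might be doing:
       SetThreadIdealProcessorEx(GetCurrentThread(), &wanted, &previous); */
    return 0;
}
```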

I see …
In that case, better to ask Wendell. I don’t have any in-depth knowledge on this topic and am just trying to learn about the peculiarities of this workstation hardware for my own work.

However, if I understand you correctly, your explanation would also imply significantly worse performance for the EPYC and Threadripper lineup overall compared to Intel’s design. Worse than it currently is, anyway.

Now this is what tech sites are supposed to do :slight_smile:

Wendell… again. :+1:


You would then give up the advantage that apps needing fewer cores stay on one node.

The simple solution is for Microsoft to fix its scheduler, but considering they cannot even get it to stop throwing all its background processes at a single core, I suspect it may take some time.

Core Zero For Life! Seriously, it is so annoying. It will even throw games on top of everything else. But it is a bit rich of me to complain about that, because I have no idea how hard it is to fix.