Windows Doing Dumb Stuff on Big Systems

I am going to make a collection of this kind of thing:

https://answers.microsoft.com/en-us/windowserver/forum/all/is-there-a-way-to-modify-processor-groups-in/a6f0c666-f2a1-46e6-8165-8f6f1021dc8f

So Microsoft added processor groups as a way to deal with more than 64 logical processors in a system, but nobody bothers to use that API versus just writing NUMA-aware code. This came up in the falcon video today, along with Mark R's thread image utility.

I found the above thread.

I'm also the OG person who found that the Windows scheduler behaves suboptimally if a NUMA node advertises that it technically has no local memory (Intel Sub-NUMA Clustering is an interesting tangent here).

The above thread is more evidence of bitrot. I can't see that processor groups were a good solution to this problem, especially now with CXL. CXL memory devices seem to be exposed as NUMA nodes with memory but no local compute, and once again Windows' API seems unfun to deal with here.
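For anyone who hasn't poked at that API, here's roughly what the topology side of it looks like. A minimal C++ sketch (not production code) that enumerates processor groups and NUMA nodes; a CXL-style memory-only node would show up as a node with an empty processor mask but nonzero memory:

```cpp
// Sketch: enumerate processor groups and NUMA nodes on Windows.
#include <windows.h>
#include <cstdio>

int main() {
    // Processor groups: each holds at most 64 logical processors.
    WORD groups = GetActiveProcessorGroupCount();
    for (WORD g = 0; g < groups; ++g)
        std::printf("group %u: %lu logical processors\n", g, GetActiveProcessorCount(g));

    // NUMA nodes: report each node's CPU mask and free memory.
    ULONG highestNode = 0;
    GetNumaHighestNodeNumber(&highestNode);
    for (USHORT node = 0; node <= highestNode; ++node) {
        GROUP_AFFINITY mask = {};
        ULONGLONG freeBytes = 0;
        GetNumaNodeProcessorMaskEx(node, &mask);
        GetNumaAvailableMemoryNodeEx(node, &freeBytes);
        std::printf("node %u: group %u, cpu mask 0x%llx, %llu MiB free%s\n",
                    node, mask.Group,
                    (unsigned long long)mask.Mask,
                    (unsigned long long)(freeBytes >> 20),
                    mask.Mask == 0 ? "  <-- memory-only (CXL-style) node" : "");
    }
    return 0;
}
```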

Does anyone work on Big Compute, or hear of problems with scheduling, task management 😜, etc.?

I know Linux has had some fun experimenting on their side too. Red Hat's thread-tuning guide is pretty enlightening, but I haven't seen something like that in a Windows context, since they don't give you much in the way of tunables.

I'm aware of the core and root scheduler options on Windows, but those don't improve the situation.

Interestingly, the 96c Threadripper seems to have a special option that fakes 3x 32c NUMA nodes to placate Windows.

What suboptimal stuff have you seen or experienced??

“Prior to 5/2022 we thought it’d be fine to just make up NUMA nodes for old apps. It Was Not Fine. Processor groups!”

6 Likes

One fun thing I’ve been able to observe (after reading about it on Jisakuhibi) is that when the processor groups are asymmetric (e.g. 64+48 in the case of a 56c SPR with HT) and Windows happens to assign a process to the smaller group, it’ll still be limited to that lower count (e.g. 48 threads) even if you manually move it to the bigger group.

There seem to be applications that can span processor groups just fine (Blender) and ones that seemingly can but see lower utilization on one of the groups (HandBrake).
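That matches how the API is designed: a process starts out confined to a single group, and a thread only escapes its group if the application explicitly hands it a GROUP_AFFINITY. A minimal sketch of that pattern (not Blender's or HandBrake's actual code):

```cpp
#include <windows.h>
#include <cstdio>
#include <thread>
#include <vector>

// Pin the calling thread to every logical processor in the given group.
static void pin_to_group(WORD group) {
    GROUP_AFFINITY ga = {};
    ga.Group = group;
    DWORD cpus = GetActiveProcessorCount(group);
    // Mask covering all processors in the group; a real app would usually
    // pick individual bits per worker instead.
    ga.Mask = (cpus >= sizeof(KAFFINITY) * 8) ? ~KAFFINITY(0)
                                              : ((KAFFINITY(1) << cpus) - 1);
    SetThreadGroupAffinity(GetCurrentThread(), &ga, nullptr);
}

int main() {
    WORD groups = GetActiveProcessorGroupCount();
    std::vector<std::thread> workers;
    for (WORD g = 0; g < groups; ++g) {
        workers.emplace_back([g] {
            pin_to_group(g);   // without this, every worker stays in the creation group
            // ... do the actual work here ...
        });
    }
    for (auto& t : workers) t.join();
    std::printf("spread workers across %u processor group(s)\n", groups);
    return 0;
}
```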

1 Like

I’m not sure how tangential or direct this is, but COMSOL went through fixing Windows memory-allocation problems back in 2018, when they dumped Windows’ built-in memory allocator and adopted a scalable memory allocator:
“we are no longer using Windows’ built in memory allocator for computers with more than 8 cores. Instead, we are using a scalable memory allocator adapted for modern computers with many cores and several CPU sockets.”
Seems like maybe a similar problem is still happening?
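COMSOL doesn't say which allocator they settled on, but as one example of the pattern, Intel TBB's tbbmalloc is a well-known "scalable allocator" that avoids contention on a single shared heap on many-core, multi-socket boxes:

```cpp
// Sketch of using TBB's scalable allocator instead of the default process heap.
#include <tbb/scalable_allocator.h>
#include <vector>

int main() {
    // Drop-in replacement for malloc/free ...
    void* p = scalable_malloc(1 << 20);
    scalable_free(p);

    // ... or as an STL allocator, so large containers stop contending on the
    // default heap.
    std::vector<double, tbb::scalable_allocator<double>> samples(1000000, 0.0);
    return samples.empty() ? 1 : 0;
}
```

TBB also ships a tbbmalloc_proxy library that swaps out malloc/new process-wide at link or load time, which is the less invasive way to do what that COMSOL note describes.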

@dahlia123 I wonder if Wendell’s post is related to your Windows performance observations on your Genoa monster.

1 Like

Just in time for this thread, the Supermicro X13SWA-TF BIOS v2.0 (W790) added a Virtual NUMA option to split the CPU into 2 NUMA nodes, with a description of “May help with performance on Windows”. From what I can see with the 3495X, it splits the 56 cores + HT into two nodes: 0-27 (physical) + 56-83 (HT), and 28-55 (physical) + 84-111 (HT) :slight_smile:

Unlike Sub-NUMA Clustering (SNC2/SNC4), it looks like Virtual NUMA still acts internally as UMA (Quadrant), as all PCIe devices still hang off the first NUMA node. With true SNC, lstopo will correctly see which PCIe device belongs to which NUMA node. However, Virtual NUMA still splits memory into two pools, so it presumably side-steps the issue Wendell mentioned (about local memory).
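One way to sanity-check that each virtual node really exposes its own memory pool is to ask for node-local allocations explicitly. A small sketch, assuming the 2-node split above (and noting that the node argument to VirtualAllocExNuma is only a preference, not a guarantee):

```cpp
#include <windows.h>
#include <cstdio>

int main() {
    HANDLE self = GetCurrentProcess();
    // Node numbers 0 and 1 assume the 2-node Virtual NUMA split described above.
    for (DWORD node = 0; node <= 1; ++node) {
        void* p = VirtualAllocExNuma(self, nullptr, 1 << 20,
                                     MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE,
                                     node);  // preferred node only, not a hard bind
        std::printf("node %lu: %s\n", node,
                    p ? "node-preferred allocation succeeded" : "allocation failed");
        if (p) VirtualFree(p, 0, MEM_RELEASE);
    }
    return 0;
}
```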

Yeah, the ASUS Sage TRX50 has a not-well-documented feature that splits the 96c into 3x 32c NUMA nodes. It’s pretty awesome.

1 Like
I'll, uh...show myself out

I’m using a 2S 96c EPYC 9654 + Gigabyte MZ73-LM0 on Windows 10 Pro for Workstations.

With SMT off, Windows splits the 192 cores into 4 asymmetric nodes: 64 + 32 + 64 + 32. This splitting overrides the manual NPS=1 setting in the BIOS. Theoretically this allows symmetric memory-channel distribution within a single NUMA node, but I’m not sure Windows splits in the best way.

If NPS=2 is set manually in the BIOS, the nodes become a symmetric 48+48+48+48, but the memory-channel distribution within a node becomes asymmetric.

See if you’ve got NPS3? I swear “somehow” I got 3 better-balanced 32c/64t NUMA nodes on the 96c, to 1) avoid the processor-groups special case of developer hell and 2) get better memory balance on the ASUS TRX50. I sent the question in about… how the hell I did that? Did I hallucinate? wtf? Because I haven’t been able to reproduce it.

Even Autodesk doesn’t seem to have implemented processor groups properly at this point; I’ve not found a single case where NUMA is worse than processor groups, even with crazy stuff like 64/32/32/64.

Only NPS=0, 1, 2, 4, and auto are available on my system, and I haven’t seen 3x 32c NUMA nodes (SMT off) in my testing with NPS=auto so far; I always get 64+32 per socket. If it were possible, 3 NUMA nodes on the 96c would be much better, because everything becomes symmetric, including cores and memory channels.

What is interesting is that when I run the COMSOL benchmark (the “50GB” and “200GB” files shared by twin_savage) on my 2S system, turning off 4 CCDs and making the 2S 9654 run as 32+32+32+32 (two groups) with NPS=2 gives better performance than the original 48+48+48+48 (four groups?) with NPS=2 or 64+32+64+32 (three groups) with NPS=auto. I understand that COMSOL is NUMA-aware, but that wasn’t much help for running COMSOL without NUMA/group optimizations on Windows.

2 Likes