5995WX NUMA mode

I have an ASUS Sage II motherboard with a 5995WX and 1TB RAM (2933MHz)

I am having a weird issue where performance is very poor for a Python-based quality signal processing script. I have a few Dell R630s with 88 threads each, and they process about 3,000 documents a second. My TR Pro machine (using the same software versions: Ubuntu 20.04, Python 3.10, etc.) starts out doing something like 11k documents a second, then slowly drops to 200 documents a second and stays there. I have looked at CPU, disk, and RAM bottlenecks and nothing comes back as the smoking gun. I started looking at NUMA configs and noticed that it shows as a single NUMA node. I tried changing it in the BIOS to NPS4, but on save the machine reboots, turns off completely, and then boots back up with the previous settings. I cannot manually set the NUMA nodes per socket or disable memory interleaving. I am on BIOS version 1501. Any ideas?
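Independent of the BIOS setting, one way to confirm how many NUMA nodes the OS actually sees is to read the kernel's sysfs node topology. A minimal sketch, assuming a standard Linux sysfs layout:

```python
import os

def numa_nodes(sysfs="/sys/devices/system/node"):
    """List the NUMA node directories the kernel has enumerated."""
    if not os.path.isdir(sysfs):
        return []
    return sorted(d for d in os.listdir(sysfs)
                  if d.startswith("node") and d[4:].isdigit())

# A single entry like ['node0'] means the OS sees one NUMA node,
# i.e. the NPS setting did not take effect.
print(numa_nodes())
```

`numactl --hardware` and `lscpu` report the same information, if installed.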

It is an ASUS Pro WS WRX80E-SAGE SE WIFI II; I forgot to put the entire mobo model.

Smells like memory DIMM layout. What does the DIMM layout look like?

I’d also install lm-sensors, as there are hints of poor heatsink contact in the aroma too … could be thermals in general


The memory layout is:

Samsung 128GB DDR4-2933 (M393AAG40M3B-CYF) x 8

I currently have them water cooled (Bitspower 4-DIMM blocks x2). But yeah… those guys get hot.

If it is a memory issue, wouldn’t the CPU load diminish as the RAM struggles to keep up? The CPU usage stays at 90% overall, whether crunching 11k files a second or 200 files a second. I was originally running the workload in a WSL2 instance and switched to a native Ubuntu OS to rule out software/virtualization.


I will try to either get an LM sensor on it (if I can find one) or maybe just shoot it with an IR gun.

What I’m thinkin too

Exactly what I suspected from the rip.

You outlined multiple problems simultaneously:
NUMA is related to RAM config.

Declining performance is typically thermal throttling of some sort (or poor memory management in Python).

Run some benchmarks and watch your clock speeds.
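For watching temps alongside a benchmark, the raw hwmon interface that lm-sensors wraps can be polled directly. A minimal sketch, assuming the standard Linux hwmon sysfs layout (temp inputs are in millidegrees Celsius):

```python
import glob
import os

def read_temps(hwmon_root="/sys/class/hwmon"):
    """Collect {device/sensor: temp_C} from the hwmon sysfs tree."""
    temps = {}
    for dev in sorted(glob.glob(os.path.join(hwmon_root, "hwmon*"))):
        try:
            with open(os.path.join(dev, "name")) as f:
                name = f.read().strip()
        except OSError:
            continue
        for path in sorted(glob.glob(os.path.join(dev, "temp*_input"))):
            try:
                with open(path) as f:
                    # hwmon reports millidegrees, e.g. 45500 -> 45.5 C
                    temps[f"{name}/{os.path.basename(path)}"] = int(f.read()) / 1000.0
            except (OSError, ValueError):
                continue
    return temps

print(read_temps())
```

Running this in a loop while the workload ramps down would show whether any sensor climbs in step with the slowdown.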

Sounds like a thermal throttle

OK, I’ll figure out how to monitor in Ubuntu and put it back under load. I’ll share what I find!


I can’t seem to figure out how to get lm-sensors to see the RAM. I have Tctl, two Composites, and a Sensor 1 in psensor and lm-sensors:
theskaz@KevenPC:~$ sensors
k10temp-pci-00c3
Adapter: PCI adapter
Tctl: +35.4°C

nvme-pci-2c00
Adapter: PCI adapter
Composite: +28.9°C (low = -20.1°C, high = +74.8°C)
(crit = +79.8°C)

nvme-pci-2d00
Adapter: PCI adapter
Composite: +58.9°C (low = -273.1°C, high = +71.8°C)
(crit = +84.8°C)
Sensor 1: +58.9°C (low = -273.1°C, high = +65261.8°C)

Here is some info. Screenshots below show the docs per second at the beginning, and then about 10 minutes later. You can see that I went from 12k to 1.5k in that time, and the relative temps of the DIMMs and CPU at that time. The max on the DIMMs was 63°C, then the fans kicked in and brought it down. Keep in mind that this is all one watercooled system.


Watercooling loop: 2x 480mm rads.

As soon as I find my blower doodad I’ll clean it lol

After a couple of hours of running, the DIMMs range from 45-66°C and the CPU is at 45°C, with the liquid at 31°C. Seven DIMMs fall between 45-50°C and one is an oddball at 66°C. Also wanted to note that everything is extremely sluggish; for example, opening htop in the terminal took over a minute.

Can you please run the workload and monitor the used frequency of the cores during the run?

watch -n 1 cat /proc/cpuinfo | grep MHz

Do you have PCIe Advanced Error Reporting (AER) enabled in the motherboard’s BIOS?

Extensive PCIe Bus Error Correction can cause the symptoms you’ve described. With AER enabled you should be able to see PCIe Bus Errors in the operating system’s logs otherwise these might stay hidden and you’re in the dark why a system is acting weirdly since the platform doesn’t report them to an OS.
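To dig these out of a captured kernel log (e.g. saved `dmesg` output), a simple filter helps. A sketch; the sample lines below are illustrative, in the style of kernel AER reports, and not taken from this system:

```python
import re

# Substrings typical of kernel PCIe AER reports (AER must be enabled in
# firmware for these to appear in the log at all).
AER_PAT = re.compile(r"AER|PCIe Bus Error|BadTLP|BadDLLP|aer_status",
                     re.IGNORECASE)

def aer_lines(log_text):
    """Return kernel-log lines that look like PCIe AER error reports."""
    return [line for line in log_text.splitlines() if AER_PAT.search(line)]

# Illustrative log lines (hypothetical, for demonstration only):
sample = """\
pcieport 0000:00:01.1: AER: Corrected error received: 0000:01:00.0
nvme 0000:2d:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer
usb 1-1: new high-speed USB device number 2 using xhci_hcd
"""
print(aer_lines(sample))
```

A steady stream of corrected errors under load would fit the gradual-slowdown symptom, since each correction costs a retransmission.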

Thermal expansion and contraction of the metal in your watercooling blocks that are also linked together with rigid fittings can slightly move the dGPUs in the PCIe slots which might lead to a bad pin-pad contact in a PCIe slot.

(One of the reasons I prefer loose soft tubing and never cooling multiple components as a fixed unit even if it looks less dope)

That command as-is shows nothing, but dropping the grep and looking at the CPU MHz lines, it ranges from 2700 - 3990. It only stays at 2700 for a single tick, then goes back to 39xx.

[screenshot of the command output]


Sorry, the watch command seems to interfere with the output. Please run the command without the watch, like the following, manually a couple of times during the execution of the application. It should show all your 64 processors; in your screenshot you only see the first core. My suspicion here is that maybe the cores drop to a low frequency because of power management troubles?

cat /proc/cpuinfo | grep MHz

If your application applies load to all cores, they should all be around the base frequency of the processor; in your case that would be about 2.7GHz under full load.
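Rather than eyeballing 100+ lines of output per sample, the same /proc/cpuinfo field can be summarized in one line. A minimal sketch:

```python
import re
import statistics

def core_mhz(cpuinfo_text):
    """Extract every per-thread 'cpu MHz' value from /proc/cpuinfo content."""
    return [float(v) for v in re.findall(r"cpu MHz\s*:\s*([\d.]+)", cpuinfo_text)]

try:
    freqs = core_mhz(open("/proc/cpuinfo").read())
except OSError:
    freqs = []
if freqs:
    print(f"{len(freqs)} threads: min {min(freqs):.0f} "
          f"mean {statistics.mean(freqs):.0f} max {max(freqs):.0f} MHz")
```

Under full load the mean should sit near the base clock; a mean far below it during the slow phase would point at throttling or power management.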


Thanks!

I have it throttled to 88 CPUs to see if it works better (and it does: 500 docs/s vs ~100).

I tried looking for this in the BIOS to enable it, without success. I found a setting that references it (ACS Enable), but could not find the original AER option.

Does not seem like the cores are throttling either. I am out of guesses on this one, sorry.


@wendell

Do you happen to know if there are explicit PCIe AER settings in Zen 3 Threadripper motherboard UEFIs?

I’m unfortunately a Threadripper ownership virgin (wanted to go for Zen 4 but am not happy with AMD’s product segmentation choices), my experiences only come from AM4/AM5 motherboards and their PCIe behavior.
