Update 2022-09-09
OK, it looks like I was mistaken: the mainboard is innocent.
I thought I had observed that mprime unexpectedly ran out of memory if non-transparent hugepages were configured, but only on one of my two mainboards, which made me suspect a problem with that mainboard.
However, when testing again now, I run into the problem regardless of mainboard, and @vic tells me that this is not unexpected in this kind of circumstance.
TL;DR: mprime does not account for statically allocated (non-transparent) hugepages, so when they are configured it tries to allocate more memory than is actually available. This triggers an out-of-memory event, and the offending process (in my case mprime) gets killed.
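To see how much RAM the static hugepage pool takes away from ordinary allocations, the numbers in /proc/meminfo can be compared. The following is a minimal Python sketch, not anything taken from mprime: the function names are made up for illustration, and it assumes that a program sizing itself from MemTotal would overshoot by roughly the size of the pool. It only looks at the default hugepage size; pools of other sizes are listed under /sys/kernel/mm/hugepages and are ignored here.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: int} (kB values or counts)."""
    fields = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    return fields


def hugepage_reserved_kb(fields):
    """kB set aside for the static pool of default-size hugepages."""
    return fields.get("HugePages_Total", 0) * fields.get("Hugepagesize", 0)


def normal_memory_kb(fields):
    """kB of RAM that is not locked up in the static hugepage pool."""
    return fields["MemTotal"] - hugepage_reserved_kb(fields)


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        info = parse_meminfo(f.read())
    print("MemTotal (kB):           ", info["MemTotal"])
    print("Reserved hugepages (kB): ", hugepage_reserved_kb(info))
    print("Left for normal use (kB):", normal_memory_kb(info))
```

If the reserved figure is a large fraction of MemTotal, a test that aims for most of the total RAM cannot be satisfied from what is left.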
Original post:
I have encountered a weird memory problem with Prime95’s Torture Test 3 (“large FFTs, stresses memory controller and RAM”) on a specific EPYC mainboard: ASRock ROMED8U-2T.
The problem
Basically, as soon as I run Prime95’s “Large FFTs” torture test (Test 3) on said mainboard, from either Linux or Windows, it eats all available memory and gets killed by the OOM killer (or, in the case of Windows, performance tanks severely). The same thing happens if I run the default “Blend” test, which includes test 3. Tests 1 and 2 work fine, but test 3 is the one that is supposed to test memory, so it allocates a lot of it.
Test details (from the menu of the Linux client, mprime)
Choose a type of torture test to run.
1 = Smallest FFTs (tests L1/L2 caches, high power/heat/CPU stress).
2 = Small FFTs (tests L1/L2/L3 caches, maximum power/heat/CPU stress).
3 = Large FFTs (stresses memory controller and RAM).
4 = Blend (tests all of the above).
Blend is the default. NOTE: if you fail the blend test but pass the
smaller FFT tests then your problem is likely bad memory or bad memory
controller.
Tested on Mersenne Prime Test Program: Linux64,Prime95,v29.8,build 6, but I get the same behaviour on other versions, including the Windows version. It happens regardless of CPU (I’ve tried two different ones). My other EPYC mainboard, Supermicro H12SSL-I, has no problem with these tests. I’ve basically moved the system disks between the machines, so there should be no software differences.
So it seems that either the mainboard or Prime95 is behaving outside of expectations. I’m suspecting the mainboard, and would ask ASRock support, but I’d like to hear the opinions of others before that.
Possibly related issue?
One reason I suspect the mainboard is another memory problem that I reported before here: there is no way on this board to read the SPD info from the DIMMs. CPU-Z, Hwinfo64, decode-dimms, and even MemTest86 all fail to read it (I later found other ways to probe actual timings).
My hunch
This led me to the following hypothesis: what if the mainboard has problems reporting memory details, in a way that makes mprime/Prime95 believe it can allocate more memory than is available?
Unlike most programs, memory benchmarks and stress tests like Prime95 try to allocate as much memory as they can get away with without crashing. Prime95’s test 3 does this, unlike tests 1 and 2, which together with my hypothesis above would predict an out-of-memory event.
On Linux, I could work around the issue by limiting the memory available to Prime95 using the prlimit tool, which simply sets a limit on how much memory a process gets to use. This supports the above hypothesis.
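The same kind of cap can be applied from a small Python wrapper. This is a sketch of the idea rather than the exact prlimit invocation used above: it assumes the limit is placed on the address space (RLIMIT_AS), and the function name, the ./mprime path, and the 16 GiB figure are placeholders.

```python
import resource
import subprocess


def run_with_memory_cap(argv, max_bytes, **kwargs):
    """Run argv with its address space capped at max_bytes (Unix only)."""

    def apply_limit():
        # Runs in the child just before exec, so only the child is limited.
        # Assumption: RLIMIT_AS is the resource being capped.
        resource.setrlimit(resource.RLIMIT_AS, (max_bytes, max_bytes))

    return subprocess.run(argv, preexec_fn=apply_limit, **kwargs)


# Hypothetical usage: cap mprime at 16 GiB, so that requests beyond the cap
# should fail inside the program rather than exhaust the machine.
# run_with_memory_cap(["./mprime"], 16 * 2**30)
```

With a cap in place, an oversized allocation request is refused by the kernel and the program sees an ordinary allocation failure, which it can handle by using less memory.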
Even if this is correct, why does it happen? How does Prime95 find out how much memory there is, and why does it work on other systems?
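As a partial pointer for the last question: on Linux, one common way for a program to learn the amount of RAM is sysconf. Whether Prime95 does it this way is not something the tests above establish, so the following Python sketch only illustrates the general mechanism; the function name is made up.

```python
import os


def physical_memory_bytes():
    """Return (total, currently_free) physical memory in bytes via sysconf."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    total = os.sysconf("SC_PHYS_PAGES") * page_size
    free = os.sysconf("SC_AVPHYS_PAGES") * page_size
    return total, free


total, free = physical_memory_bytes()
print(f"total RAM: {total / 2**30:.1f} GiB, free right now: {free / 2**30:.1f} GiB")
```

The total reported this way includes memory set aside for static hugepages, so a program that budgets from the total alone will overestimate what it can allocate with ordinary pages.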