[SOLVED] Prime95 (mprime) memory alloc runaway on ASRock ROMED8U-2T

oegat · August 1, 2022, 8:45pm

Update 2022-09-09

OK, it looks like I was mistaken: the mainboard is innocent.

I thought I had observed that mprime unexpectedly ran out of memory if non-transparent hugepages were configured, but only on one mainboard out of two, which made me suspect a problem with that mainboard.

However, when testing again now I run into the problem regardless of mainboard, and @vic tells me that it is not unexpected in this kind of circumstance.

TL;DR: mprime does not account for statically allocated (non-transparent) hugepages, so when they are configured it tries to allocate more memory than is actually available. This triggers an out-of-memory event, and the process (in my case mprime) gets killed.

Original post:

I have encountered a weird memory problem with Prime95’s Torture Test 3 (“large FFTs, stresses memory controller and RAM”) on a specific EPYC mainboard: ASRock ROMED8U-2T.

The problem

Basically, as soon as I run Prime95’s “Large FFTs” torture test (Test 3) on said mainboard, from either Linux or Windows, it eats all available memory and gets killed by the OOM killer (or tanks severely in the case of Windows). The same thing happens if I run the default “Blend” test, which includes test 3. Tests 1 and 2 work fine, but test 3 is the one meant to test memory, and so the one that allocates a lot of it.

Test details (from the menu of the Linux client, mprime)

Choose a type of torture test to run.
  1 = Smallest FFTs (tests L1/L2 caches, high power/heat/CPU stress).
  2 = Small FFTs (tests L1/L2/L3 caches, maximum power/heat/CPU stress).
  3 = Large FFTs (stresses memory controller and RAM).
  4 = Blend (tests all of the above).
Blend is the default.  NOTE: if you fail the blend test but pass the
smaller FFT tests then your problem is likely bad memory or bad memory
controller.

Tested on Mersenne Prime Test Program: Linux64,Prime95,v29.8,build 6, but I get the same behaviour on other versions, including the Windows version. It happens regardless of CPU (I’ve tried two different ones). My other EPYC mainboard, Supermicro H12SSL-I, has no problem with these tests. I’ve basically moved the system disks between the machines, so there should be no software differences.

So it seems that either the mainboard or Prime95 is behaving outside of expectations. I’m suspecting the mainboard, and would ask ASRock support, but I’d like to hear the opinions of others before that.

Possibly related issue?

One reason I suspect the mainboard is another memory problem that I reported here before: there is no way on this board to read the SPD info from the DIMMs. CPU-Z, HWiNFO64, decode-dimms, and even MemTest86 fail to read it (I later found other ways to probe the actual timings).

My hunch

This led me to the following hypothesis: what if the mainboard has problems reporting memory details, in a way that makes mprime/Prime95 believe it can allocate more memory than is available?

Unlike most programs, memory benchmarks and stress tests like Prime95 try to allocate as much memory as they can get away with without crashing. Prime95’s test 3 does this, unlike tests 1 and 2. Together with my hypothesis above, this would predict an out-of-memory event.

On Linux, I could work around the issue by limiting the memory available to Prime95 using the prlimit tool, which simply sets a limit on how much memory a process gets to use. This supports the above hypothesis.
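
The post does not say which limit was set or to what value. A minimal sketch of such a workaround, assuming the address-space limit (`--as`, which prlimit takes in bytes), a hypothetical cap of 48 GiB, and an mprime binary in the current directory started in torture-test mode with `-t`:

```sh
#!/bin/sh
# Convert a size in GiB to bytes (prlimit's --as limit is given in bytes).
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Hypothetical cap: choose a value comfortably below the memory that is
# really free on the machine.
LIMIT_GIB=48

# Print the command instead of running it, so the sketch is safe to paste.
echo "prlimit --as=$(gib_to_bytes "$LIMIT_GIB") ./mprime -t"
```

An address-space cap is a blunt tool, since it counts all mappings and not only resident memory, but it is enough to make an over-eager allocation fail early instead of waking the OOM killer.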

Even if this is correct, why does it happen? How does Prime95 find out how much memory there is, and why does it work on other systems?

oegat · August 8, 2022, 7:19pm

Updates: Hugepages were part of the problem. Not the entire problem though, as the mainboard model still explains variance in the bug’s occurrence (see below).

Not specific to Prime95

First, my last post neglected something obvious - I did not test with another program expected to hog all available memory just for the sake of it. stressapptest does this, and indeed leads to the same result (OOM-kill). As I sort of expected.

Still specific to one mainboard, but also to Hugepages

Second, I found that the problem seems related to hugepages: I was using the boot drive from my other EPYC build with Supermicro H12SSL, and I remembered having configured hugepages in the past. Namely, I had this line in /etc/sysctl.conf:

vm.nr_hugepages = 6144

Removing that line fixed the out-of-memory problem on the ROMED8U-2T. However, the H12SSL never had the problem, with or without hugepages. Why? Still no idea.
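
For reference, a small sketch for checking where the setting comes from and what the kernel currently holds. The helper name `configured_hugepages` is made up for illustration; the paths are the standard Linux ones.

```sh
#!/bin/sh
# Print the vm.nr_hugepages value set in a sysctl-style config file, if any.
# Commented-out lines are ignored because the pattern is anchored at the start.
configured_hugepages() {
    sed -n 's/^[[:space:]]*vm\.nr_hugepages[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$1"
}

# Boot-time setting (files under /etc/sysctl.d/ may also set it).
if [ -r /etc/sysctl.conf ]; then
    configured_hugepages /etc/sysctl.conf
fi

# Current size of the static pool, as the kernel reports it.
if [ -r /proc/sys/vm/nr_hugepages ]; then
    cat /proc/sys/vm/nr_hugepages
fi

# To release the pool at runtime without rebooting (needs root):
#   sysctl -w vm.nr_hugepages=0
```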

> grep -i hugepagesize /proc/meminfo  
Hugepagesize:       2048 kB

Hmm…

> ulimit -H -l
65536

Hm hmm.

I could replicate the issue with 64 GB or 128 GB of RAM alike.

vic · September 4, 2022, 4:49am

Prime95/mprime is pretty good at allocating memory. Very occasionally it’ll OOM, so I always turn the memory setting down a notch from the suggested value when running large FFT tests.

Your problem is solely hugepages though.

oegat · September 4, 2022, 8:44am

Thanks, could you elaborate on this? I’m wondering whether I’m doing something wrong when setting up hugepages, that could explain why it fails on one motherboard but not another.

My current solution is to not use hugepages at all on the ASRock based machine, not because I have to be able to run Prime95, but because I don’t know what else could fail with hugepages enabled.

vic · September 4, 2022, 12:31pm

There are two kinds of hugepages: transparent and non-transparent.

Applications that benefit from hugepages will have code written to take advantage of transparent hugepages automatically. No user configuration needed. QEMU is an example. When the application quits, the hugepages are returned automatically to the free memory pool.

Applications requiring a large, fixed amount of memory can also be manually configured to use non-transparent hugepages when the application is launched. QEMU is again an example. Usually this way is more efficient. However, you will need to manually release non-transparent hugepages to the free memory pool.
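
Which of the two kinds is in play on a given machine can be checked from sysfs and /proc/meminfo. A rough sketch; `thp_active_mode` is a helper name invented here, while the paths and field names are the standard Linux ones:

```sh
#!/bin/sh
# Extract the active mode from a line such as "always [madvise] never".
thp_active_mode() {
    sed -n 's/.*\[\([a-z]*\)\].*/\1/p'
}

thp=/sys/kernel/mm/transparent_hugepage/enabled
if [ -r "$thp" ]; then
    echo "transparent hugepages: $(thp_active_mode < "$thp")"
fi

# AnonHugePages counts memory currently backed by transparent hugepages;
# HugePages_Total and HugePages_Free describe the static pool.
if [ -r /proc/meminfo ]; then
    grep -E '^(AnonHugePages|HugePages_Total|HugePages_Free):' /proc/meminfo || true
fi
```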

I believe you allocated 16GB of non-transparent hugepages on startup. Not sure if any applications were manually made to use them. It seems they were never manually released. In practice, that means your free memory pool is less than 16GB for most applications.

mprime isn’t designed to make use of any kind of hugepages. So when you ran mprime, it thought you had close to 32GB free, when actually only close to 16GB was available. Hence, OOM.
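
The size of the gap can be read off /proc/meminfo. A minimal sketch, with function names made up for illustration; it shows only how much the static pool takes away from ordinary allocations and assumes nothing about how mprime itself sizes its buffers:

```sh
#!/bin/sh
# Print how much RAM the static hugepage pool reserves, in kB,
# given a meminfo-style file (defaults to /proc/meminfo).
hugepage_reserved_kib() {
    awk '/^HugePages_Total:/ { n = $2 }
         /^Hugepagesize:/    { sz = $2 }
         END { printf "%d\n", n * sz }' "${1:-/proc/meminfo}"
}

# Print MemTotal minus the hugepage pool: an upper bound on what
# ordinary (non-hugepage) allocations can ever obtain.
usable_upper_bound_kib() {
    awk '/^MemTotal:/        { t = $2 }
         /^HugePages_Total:/ { n = $2 }
         /^Hugepagesize:/    { sz = $2 }
         END { printf "%d\n", t - n * sz }' "${1:-/proc/meminfo}"
}

if [ -r /proc/meminfo ]; then
    echo "reserved for hugepages: $(hugepage_reserved_kib) kB"
    echo "usable by normal allocations (at most): $(usable_upper_bound_kib) kB"
fi
```

A program that sizes its buffers from the total figure alone will overshoot by roughly the first number.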

oegat · September 6, 2022, 6:09pm

Right, my intention that time had been to manually configure (non-transparent) hugepages for use with a VM with 8 GB. I had allocated 6144 × 2 MB = 12 GB (out of 64 GB in total) in order to overshoot the capacity (leaving the tweaking for later). Then I forgot about it, and only remembered it when I ran into this issue and started thinking about possible causes.
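
The arithmetic, spelled out as a small sketch (function names are illustrative; sizes are binary units, so the results are strictly GiB):

```sh
#!/bin/sh
# Number of hugepages needed to back a guest of the given size.
# $1 = guest size in GiB, $2 = hugepage size in kB (2048 on this system).
pages_needed() {
    echo $(( $1 * 1024 * 1024 / $2 ))
}

# Size in GiB of a pool of $1 pages of $2 kB each (integer division).
pool_gib() {
    echo $(( $1 * $2 / 1024 / 1024 ))
}

pages_needed 8 2048     # pages for the 8 GB guest: prints 4096
pool_gib 6144 2048      # size of the pool actually reserved: prints 12
```

So 4096 pages would have covered the guest exactly; 6144 leaves a third of the pool as slack, all of it withheld from ordinary allocations.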

My understanding of hugepages configuration is quite shallow, so I’m interested in learning if I’m not doing it optimally. I wasn’t sure whether I needed to release them, or whether they could also be used by other software at will. Thanks for clearing that up.

I can understand if some software fails to detect that this memory chunk is unavailable, and thus OOMs. What I don’t understand, in that case, is why I only had the problem on one mainboard and not the other?

vic · September 7, 2022, 3:36am

I misread your figures in previous posts. lol. In my last reply, “16GB” should read “12GB”, and “32GB” should read “64GB”. The logic remains the same and follows through with the updated figures.

I think it’s unlikely a motherboard could make a difference since memory management is a pretty generic thing. If you look at ‘/proc/meminfo’ on both systems, what differences stand out?
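
One way to make that comparison less noisy is to keep only the fields that matter here. A sketch; `mem_summary` is a made-up helper name and the field list is a suggestion, not an exhaustive one:

```sh
#!/bin/sh
# Keep only the memory-pool lines relevant to this comparison.
# Reads a meminfo-style file (defaults to /proc/meminfo).
mem_summary() {
    grep -E '^(MemTotal|MemFree|MemAvailable|HugePages_Total|HugePages_Free|Hugepagesize):' "${1:-/proc/meminfo}"
}

if [ -r /proc/meminfo ]; then
    mem_summary
fi
```

Saving the output on each machine to a file (say `romed8u.txt` and `h12ssl.txt`, hypothetical names) and running `diff romed8u.txt h12ssl.txt` shows the differences directly.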

oegat · September 9, 2022, 5:47pm

Ok, I believe you are right! I can elicit the OOM-killing on H12SSL too now, in conflict with my earlier observation. (I did not test the ASRock board again).

On H12SSL:
128 GB total / 12 GB hugepages => mprime test 3 gets OOM-killed within 30s
128 GB total / 0 hugepages => mprime test 3 eats about 95% memory and keeps running

It took me a few days before I was back with the machines. The Linux installation (Ubuntu) with which I observed the discrepancy is gone, but now with another Linux installation (Pop OS) I can replicate the failure also on the H12SSL.

So it looks like either I was mistaken about having hugepages allocated when managing to run mprime without problem in the past, or some other lucky circumstance was present. I suspect the former after all.

Case closed!
