
Ryzen 3900X memory errors in Ubuntu 20.04

System keeps locking up totally. Logs suggest memory issues. I recently did a fresh install of Ubuntu 20.04. I was previously running Ubuntu 18.04.4 (kernel 5.3, I think?) since ~December 2019 and never had issues. Now I’m getting total lockups. Sometimes there’s an error log, sometimes not. I’m doing genomics data analysis, so I’m usually running CLI programs leveraging R.

I made no hardware changes to my system. Just a fresh install of Ubuntu 20.04 about 2 weeks ago. My 18.04 install was chaos with a patchwork of conflicting dependencies and it was frustrating so I wanted a fresh start… oh how naive of me

Logs:
see attached
log-messages.txt (771 Bytes)
log-messages-2.txt (186 Bytes)
Updated BIOS (and therefore microcode), still getting what looks like memory errors:
log-messages-3.txt (443 Bytes)

Hardware:

  • Ryzen 3900X
  • Scythe Mugen 5 Rev.B CPU cooler with Noctua fan @ 100% (never goes over ~60C)
  • B450M Pro4 (just updated to latest BIOS & microcode, still issues)
  • 64GB DDR4 2666MHz
  • Corsair CX550M PSU
  • a couple NVMe SSDs
  • Quadro P620 in a chipset x4 slot b/c I only need basic display (proprietary drivers)

Some current BIOS settings:

  • Stock CPU voltage and clocks, no XFR
  • DDR4 2666 1.2v stock timings (previously running 2933 @1.35v in 18.04.4 without issue… lowered to 2400 still issues)
  • AER enabled
  • IOMMU enabled
  • CSM disabled
  • SVM mode enabled
  • C-states disabled

I’m running a python3 program leveraging a couple packages: R, numpy, rpy2. I really don’t think it’s the issue as it’s locking up randomly when running the program, not consistently at the same step of analysis. Program is single threaded, not loading the CPU much at all. Memory usage is low, maybe 10GB at most.

This really seems like a software thing seeing as everything was fine under 18.04.4. Or maybe it’s hardware degradation? Seems really unlikely. I never ran crazy voltages or temps on anything.

Is there a better way to log what’s going on?

I suppose the “smartest” solution is to go back to 18.04.4?

Maybe boot memtest and let it go through a few passes? I think Ubuntu still includes it as an option from the GRUB menu.

Well, reinstall 18.04 and see what happens?

What brand and type of RAM do you have? ECC or non-ECC?

Good idea, haven’t run memtest in a long time. Will test. Going to take a while with 64GB :weary:

Non-ECC Corsair LPX CMK64GX4M4A266C16W with SK Hynix ICs.

Ran for 6 months at 2933mhz and C18-19ish timings without issue @ 1.30V

Other things of note:
Only started having issues when I updated to 20.04
This particular program has recently been patched to be compatible with Python3 instead of Python2. I worked with the author on a few patches, which seemed to fix all issues. When I run the program with truncated samples/files, just a few MB in size, for a quick test run, it completes flawlessly. When I run it with my full-size samples, >8GB each, it locks up my system during a later step of the analysis. I ran it with python3 --trace option; it locks up roughly during the same phase of analysis, but in different functions, not consistently at the same command. One time it was doing some vector math, another time it was pooling some values from multiple files into a single file. It’s usually during a command that has something to do with R though.

IDK, I just run the programs, I don’t write them. These small “homemade” genomics programs are a huge variable, so I didn’t want to throw that out there, but maybe it’s important. I can try some other genomics programs to really hammer the CPU.

Update:
Just locked up again. This time it looks like:
NVRM: Xid (PCI:0000:06:00): 79, pid=1206, GPU has fallen off the bus.

log-messages-4.txt (4.5 KB)
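
For anyone wanting to pull these events out of a long kernel log, here is a minimal Python sketch. It assumes every Xid message has the same shape as the single NVRM line quoted above; other driver versions may format it differently. The names `parse_xid` and `scan` are made up for illustration.

```python
import re

# Pattern modelled only on the one NVRM line quoted in this thread (assumption):
#   NVRM: Xid (PCI:0000:06:00): 79, pid=1206, GPU has fallen off the bus.
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<bus>[0-9a-fA-F:.]+)\): (?P<code>\d+),\s*(?P<detail>.*)"
)


def parse_xid(line):
    """Return a dict describing the Xid event in `line`, or None if there is none."""
    match = XID_RE.search(line)
    if match is None:
        return None
    return {
        "bus": match.group("bus"),
        "code": int(match.group("code")),
        "detail": match.group("detail").strip(),
    }


def scan(lines):
    """Collect every Xid event found in an iterable of log lines."""
    return [hit for hit in map(parse_xid, lines) if hit is not None]
```

One way to use it would be to save the previous boot's kernel messages to a file (for example with `journalctl -k -b -1`, which only works if the journal is persistent) and pass the open file to `scan`.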

A lot of people are saying maybe it’s a power supply issue? Either the power delivery of the motherboard or the PSU itself. Though my PSU isn’t being heavily taxed for a 550W model and I’ve hammered this system in the past without issue. There’s no overheating, as all the fans are at 100%.

More googling suggests it’s c-state issues conflicting with something in the Ubuntu 20.04 kernel?

More reading, hmm maybe PCIe power management, messing with the gpu in a chipset pcie slot?

no errors in memtest x86

disabled C6 c-state. Set boot flag to include ‘processor.max_cstate=5’

because the gpu is dropping out I added ‘pcie_aspm=off’ as well

locked up within 5 minutes. :dizzy_face:
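
Before blaming the flags themselves, it may be worth confirming they actually reached the running kernel. Below is a rough Python sketch that compares a kernel command line against the two parameters mentioned above. The helper names and the sample command line are hypothetical, and the plain whitespace split ignores quoted values, so treat it as a sanity check rather than a full parser.

```python
from pathlib import Path


def parse_cmdline(text):
    """Split a kernel command line into {name: value}; bare flags map to None.

    Simplification (assumption): values containing quoted spaces are not handled.
    """
    params = {}
    for token in text.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def missing_params(text, wanted):
    """Return the entries of `wanted` that are absent or set to another value."""
    have = parse_cmdline(text)
    return {name: value for name, value in wanted.items() if have.get(name) != value}


def read_running_cmdline():
    """Read the command line of the currently booted kernel (Linux only)."""
    return Path("/proc/cmdline").read_text()


# The two flags tried in this thread.
WANTED = {"processor.max_cstate": "5", "pcie_aspm": "off"}

# Made-up sample line for illustration, not taken from the poster's system.
SAMPLE = "BOOT_IMAGE=/boot/vmlinuz root=/dev/nvme0n1p2 ro quiet splash processor.max_cstate=5"
```

On the affected machine, `missing_params(read_running_cmdline(), WANTED)` returning an empty dict would suggest both flags were applied at boot. With the sample line above it reports `pcie_aspm` as missing.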

How long did you run memtest?

And these crashes and lockups you are talking about: does all of this happen when you use “this particular program”?

Do you do anything else with this computer? Does any other workload cause a crash?

I probably should have titled my post, “potential hardware issues” rather than memory errors. It just looked like memory errors at first glance :sweat_smile:

I ran one pass, no errors. I don’t think it’s memory errors. I’ve never had memory just start “going bad” over time when run at normal speeds and voltages. In my experience memory is either flaky from the start or good basically forever. Obviously I wouldn’t stick with flaky memory, and this stuff has been fine for like 2 years (I previously ran it with an R7 2700).

I tried running other programs that hammer the cpu, such as building a genome index, or hit the cpu and i/o hard by compressing/decompressing massive files with pigz. No issues. The offending python3 program is totally single threaded, mostly using R, numpy and rpy2. I wonder if it’s a weird c-state thing with only a single thread loaded?

I did a fresh install of 18.04.4, no issues with locking up after a few test runs with the previously offending program. No clue what’s going on with 20.04. 18.04.4 is using kernel 5.3, 20.04 is using 5.4. AFAIK from some googling there’s been no major changes to how 5.4 handles Ryzen platforms :man_shrugging:

That board is … not great. Apart from a 3900X being a bit much for it, I had one of those for a few days and had to send it back because it wasn’t stable. Since then I’ve seen a few others having problems on it.

You’re right, it’s not an uber-premium motherboard. But it’s basically on par, in terms of features and VRM quality, with any generic OEM board that would run the 3900X. I’m not overclocking the CPU, and yeah, I can’t get 3600MHz memory running. I just need normal, basic function. I’m not trying to squeeze the last extra 5% of peak performance out of it.

There seems to be some sort of incompatibility with something in 20.04. Maybe it’s something with the version of python3, maybe it’s lower down, like the kernel and c-states. Don’t know. Back on 18.04 and it’s working fine again.

Is it possible to run the offending program from a 20.04 live ISO after installing numpy and so on into RAM?

(Pretty much everything is different between 18.04 and 20.04 … but not as much different between 20.04 live iso and 20.04 installed)

Uuuuuuhhmmm, I think that would make the rounds but I haven’t seen anything so far. Could be, sure… buuuut I would still not fully dismiss the idea of the board just being a bit crap. 18.04 might just mask it better by being less optimised and less snappy overall.

I recently had a similar RAM instability issue on a system running Ryzen 3950X on ASUS PRIME X570-PRO with 128GB DDR4 (GSKILL F4-3600C18Q-128GVK). Both RAM and motherboard were listed in each other’s QVL.
The system would sometimes run memtest for hours with no errors, only to crash the next day, sometimes it would consistently show errors in memtest in test 7. It was driving me nuts.
I did a lot of tests (enable/disable DOCP, adjusting RAM speed, adjusting AMD memory controller voltage (SoC), test DIMMs on an Intel system) and the conclusion was that the speeds and voltage did not cause the instability, since the issue would show up at both low and high speed configs, it was only happening on the AMD system, and problem went away when I swapped DIMMs around … Now the RAM was stable for weeks with everything default and DOCP (3600MHz) enabled.
Let me tell you … I’ll never touch those DIMMs ever again.

There have been a lot of reports of 3000 series Ryzen having stability issues and memory errors with Corsair LPX memory. I don’t remember the exact cause; I think it had something to do with the memory controllers not liking each other. Have you tested a different kit of RAM?