Unstable Threadripper system

I have a Threadripper 2990WX in an X399 Aorus Xtreme. I was running 128 GB of G.Skill DRAM, but removed 64 GB of it until memtest86 returned zero errors.

Despite memtest86 returning zero errors after running for days, I am still dealing with system instability. The issues look like the following:

  • Compiling large code bases often ends in a segfault (using either gcc or clang). Often a build will segfault, I’ll restart it, it will segfault again, and after another restart the build will finish.
  • gzip will often fail CRC checks, but the same job succeeds if I run it on another machine.
  • Numerical subroutines return different results on this system than on EC2 Ubuntu systems.
  • File downloads using S3 clients do not hash to the same value as the object in remote storage, but sometimes the hashes match (a quick checksum loop to reproduce this is sketched just below this list).
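
A rough way to make that last symptom reproducible is to hash the same file repeatedly; on healthy hardware the digest never changes. A minimal sketch (the file path is just a placeholder for any multi-GB file):

# Hash the same on-disk file 20 times; the digests should all be identical.
# /path/to/large-file.bin is a placeholder for any large file.
for i in $(seq 1 20); do
    sha256sum /path/to/large-file.bin
done | sort | uniq -c
# More than one distinct digest means something is corrupting data in flight.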

All of this screams bad memory, but my memory seems to pass checks. There’s no overheating or anything bizarre in the kernel logs.

What should I be doing to profile my system and debug this? I am running Ubuntu focal.

Try running badblocks to rule out storage problems? If it were that bad I’d expect to see other things failing too, but maybe everything you’ve listed is the “other things”. :rofl:
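
If it helps, the read-only mode (the default) doesn’t write anything, so it’s safe to run on a disk that’s in use. A minimal sketch, with the device node as a placeholder for whichever drive you want to scan:

# Read-only surface scan; -s shows progress, -v is verbose.
# /dev/nvme0n1 is a placeholder -- substitute the disk you want to check.
sudo badblocks -sv /dev/nvme0n1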


Thanks, I’ll give it a go. It’s bad enough that I can’t run my scientific computing workload on this machine because I don’t trust it, and I can’t reliably compile the software I need.

Yeah, I mean it’s weird that the OS seems exempt from the problems you’re having with your workloads, but maybe the OS alone just isn’t stressing the hardware enough to bring it out.

It does tend to happen when running under full load. For instance, compiling a code base on one thread usually works fine, but if I use all 64 threads the system tends to become unstable.
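
If you want a repeatable trigger that doesn’t require kicking off a full build every time, something like stress-ng (it’s in the Ubuntu repos) can load every core and the memory controller at once. A rough sketch; the worker counts and sizes are just a starting point for 64 GB of installed RAM:

sudo apt install stress-ng
# 16 memory workers of 2 GB each (32 GB total) with read-back verification,
# plus one CPU worker per online core (--cpu 0), for 30 minutes.
stress-ng --vm 16 --vm-bytes 2G --vm-method all --verify \
          --cpu 0 --timeout 30m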

Zero errors from badblocks.

I’ve never seen Memtest86 catch unstable memory clocks or even faulty memory. Try Memtest64 under Windows; it tends to load the CPU and memory controller more, and does a better job of producing errors.
If you’re using XMP/DOCP, try turning it off and running JEDEC settings.

It could be that the memory itself is “fine” in that any given bit can be flipped correctly and will hold its data, but under load the memory or the controller isn’t holding up to the stress.


Interesting. Unfortunately, I don’t have Windows, though.

It could be that the memory itself is “fine” in that any given bit can be flipped correctly and will hold its data, but under load the memory or the controller isn’t holding up to the stress.

Something like this sounds plausible to me. How would I go about testing the memory controller?

P.S. I just checked, and XMP was already set to disabled.
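
For what it’s worth, the configured memory speed can also be double-checked from the OS side; dmidecode reports both the rated and the currently configured clock for each module, so you can confirm the sticks really are running at a JEDEC speed. A quick sketch (run as root):

# "Speed" is the module's rated speed; "Configured Memory Speed"
# (or "Configured Clock Speed" on older dmidecode) is what the board
# is actually running the DIMMs at.
sudo dmidecode -t memory | grep -iE 'speed|voltage'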

I’m not sure how well Memtest64 in Wine does for reporting memory errors, but I know I was able to get it to run at least.
Unfortunately, because overclocking hasn’t been popular in Linux, the tools for it, including those for testing hardware stability, aren’t really all that great. Intel Burn Test and Memtest64 are what I’ve found to be the most effective stress tests.
That said, Memtest64 has been problematic for testing larger amounts of memory in Windows for some time anyway, so you may as well try getting it to work in Wine.

Ran memtester, and I am seeing tons of errors. So it’s definitely either the memory modules or the motherboard.

root@threadripper-1:~# sudo memtester 60000M 1
memtester version 4.3.0 (64-bit)
Copyright © 2001-2012 Charles Cazabon.
Licensed under the GNU General Public License version 2 (only).

pagesize is 4096
pagesizemask is 0xfffffffffffff000
want 60000MB (62914560000 bytes)
got 60000MB (62914560000 bytes), trying mlock …Killed
root@threadripper-1:~# sudo memtester 59000M 1
memtester version 4.3.0 (64-bit)
Copyright © 2001-2012 Charles Cazabon.
Licensed under the GNU General Public License version 2 (only).

pagesize is 4096
pagesizemask is 0xfffffffffffff000
want 59000MB (61865984000 bytes)
got 59000MB (61865984000 bytes), trying mlock …locked.
Loop 1/1:
Stuck Address : testing 0FAILURE: possible bad address line at offset 0x4347bda90.
Skipping to next test…
Random Value : FAILURE: 0xafa7745577dbe688 != 0xa7a7745577dbe688 at offset 0x4347bdc90.
FAILURE: 0x34a9efa8799d164 != 0x24a9efa8799d164 at offset 0x4347baed8.
FAILURE: 0x42c2db8994043b7e != 0x4ac2db8994043b7e at offset 0x4347bda58.
FAILURE: 0x1a1a5b55bc2012cb != 0x121a5b55bc2012cb at offset 0x4347bdc90.
Compare XOR : FAILURE: 0x958c43ce8c99dbc4 != 0x938c43ce8c99dbc4 at offset 0x4347baed8.
FAILURE: 0xd404805d990445de != 0xdc04805d990445de at offset 0x4347bda58.
FAILURE: 0xab5c0029c1201d2b != 0xa35c0029c1201d2b at offset 0x4347bdc90.
Compare SUB : FAILURE: 0x6714f61287165c34 != 0xad14f61287165c34 at offset 0x4347baed8.
FAILURE: 0x3ededaea80c43ea6 != 0x26dedaea80c43ea6 at offset 0x4347bda58.
FAILURE: 0xfb60793870a5811f != 0x1360793870a5811f at offset 0x4347bdc90.
FAILURE: 0x6267afb962b7b46b != 0x4067afb962b7b46b at offset 0x4347be0d0.
Compare MUL : FAILURE: 0x100000000000000 != 0x00000001 at offset 0x4347baed8.
FAILURE: 0x800000000000001 != 0x00000001 at offset 0x4347bda90.
FAILURE: 0x800000000000001 != 0x00000000 at offset 0x4347bdc90.
Compare DIV : FAILURE: 0xf7dbee287ff6e642 != 0xf6dbee287ff6e643 at offset 0x4347baed8.
FAILURE: 0xfedbee287ff6e643 != 0xf6dbee287ff6e643 at offset 0x4347bda90.
FAILURE: 0xfedbee287ff6e643 != 0xf6dbee287ff6e642 at offset 0x4347bdc90.
Compare OR : FAILURE: 0xe7dbec082374c402 != 0xe6dbec082374c403 at offset 0x4347baed8.
FAILURE: 0xeedbec082374c403 != 0xe6dbec082374c403 at offset 0x4347bd8d8.
FAILURE: 0xeedbec082374c403 != 0xe6dbec082374c403 at offset 0x4347bda90.
FAILURE: 0xeedbec082374c403 != 0xe6dbec082374c402 at offset 0x4347bdc90.
Compare AND : FAILURE: 0xf3c2c3e174396c1a != 0xfbc2c3e174396c1a at offset 0x4347bde10.
Sequential Increment: Solid Bits : testing 0FAILURE: 0xf7ffffffffffffff != 0xffffffffffffffff at offset 0x4347bde10.
FAILURE: 0x800000000000000 != 0x00000000 at offset 0x4347c1288.
Block Sequential : testing 0FAILURE: 0x100000000000000 != 0x00000000 at offset 0x4347baed8.
FAILURE: 0x800000000000000 != 0x00000000 at offset 0x4347bda90.
FAILURE: 0x800000000000000 != 0x00000000 at offset 0x4347bdc90.
Checkerboard : testing 0FAILURE: 0x5d55555555555555 != 0x5555555555555555 at offset 0x4347bda90.
FAILURE: 0x5d55555555555555 != 0x5555555555555555 at offset 0x4347bdc90.
FAILURE: 0xa2aaaaaaaaaaaaaa != 0xaaaaaaaaaaaaaaaa at offset 0x4347c0c88.
Bit Spread : testing 0FAILURE: 0x800000000000005 != 0x00000005 at offset 0x4347bda90.
FAILURE: 0x800000000000005 != 0x00000005 at offset 0x4347bdc90.
FAILURE: 0x800000000000005 != 0x00000005 at offset 0x4347c10c0.
Bit Flip : testing 0FAILURE: 0x100000000000001 != 0x00000001 at offset 0x4347baed8.
FAILURE: 0x800000000000001 != 0x00000001 at offset 0x4347bd8d8.
FAILURE: 0xf7fffffffffffffe != 0xfffffffffffffffe at offset 0x4347bde10.
FAILURE: 0x800000000000001 != 0x00000001 at offset 0x4347c1288.
Walking Ones : testing 0FAILURE: 0xf7fffffffffffffe != 0xfffffffffffffffe at offset 0x4347bde10.
Walking Zeroes : testing 0FAILURE: 0x800000000000001 != 0x00000001 at offset 0x4347bda90.
FAILURE: 0x800000000000001 != 0x00000001 at offset 0x4347bdc90.
8-bit Writes : -FAILURE: 0x6d6d839d4e6e9656 != 0x6c6d839d4e6e9656 at offset 0x4347baed8.
FAILURE: 0xb6eef64059bf4887 != 0xbeeef64059bf4887 at offset 0x4347bde10.
16-bit Writes : |FAILURE: 0x7bc9b074affd0aa0 != 0x7ac9b074affd0aa0 at offset 0x4347baed8.
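
Something that might narrow down which DIMMs are actually at fault: the 2990WX spreads its memory across NUMA nodes (only two of the four dies have local DRAM), so binding memtester to one memory-backed node at a time should implicate the DIMMs on that die’s channels. A rough sketch, assuming numactl is installed and nodes 0 and 2 are the memory-backed ones (confirm with --hardware first):

# Show which NUMA nodes actually have memory attached.
numactl --hardware
# Bind a run to each memory-backed node in turn; failures while bound to a
# node point at the DIMMs on that node's channels.
sudo numactl --membind=0 memtester 12G 1
sudo numactl --membind=2 memtester 12G 1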

Or compatibility between the two. That’s definitely a thing, probably in large part due to AMD just not having had a DDR4 platform for so long.
I’d RMA the sticks, buy a replacement kit in the meantime, and then sell the extra set locally once the RMA comes back. Getting 32GB DIMMs is also not a bad idea, since 32GB UDIMMs generally didn’t exist before Ryzen launched, as is going for 1.2V memory rather than higher-voltage memory intended for XMP/DOCP.
Might also get ECC if your work needs to be reliably correct.

I thought about getting ECC, but last time I checked, the probability of a bit flip was extremely low, so I don’t think that will be a serious issue for my experiments.

Before I just order a new 128 GB RAM kit, I decided to run sensors to see if there’s anything else that might be causing the stability issues.
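
For reference, sensors comes from the lm-sensors package, and sensors-detect needs a one-time run so the right hwmon modules get loaded; accepting the defaults is usually fine. A minimal sketch:

# Install the userspace tools, probe for supported monitoring chips,
# then dump the readings.
sudo apt install lm-sensors
sudo sensors-detect
sensors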

I upgraded my kernel to 5.8.0-50-generic and ran sensors:

it8792-isa-0a60
Adapter: ISA adapter
in0: 785.00 mV (min = +0.00 V, max = +2.78 V)
in1: 1.50 V (min = +0.00 V, max = +2.78 V)
in2: 1.04 V (min = +0.00 V, max = +2.78 V)
in3: 1.97 V (min = +0.00 V, max = +2.78 V)
in4: 1.80 V (min = +0.00 V, max = +2.78 V)
in5: 1.50 V (min = +0.00 V, max = +2.78 V)
in6: 2.78 V (min = +0.00 V, max = +2.78 V) ALARM
3VSB: 1.66 V (min = +0.00 V, max = +2.78 V)
Vbat: 1.60 V
fan1: 1483 RPM (min = 0 RPM)
fan2: 1527 RPM (min = 0 RPM)
fan3: 1366 RPM (min = 0 RPM)
temp1: +37.0°C (low = +127.0°C, high = +127.0°C) sensor = thermistor
temp2: -55.0°C (low = +127.0°C, high = +127.0°C) sensor = Intel PECI
temp3: +31.0°C (low = +127.0°C, high = +127.0°C) sensor = thermistor
intrusion0: ALARM

k10temp-pci-00db
Adapter: PCI adapter
Tctl: +59.2°C
Tdie: +32.2°C

k10temp-pci-00cb
Adapter: PCI adapter
Tctl: +58.8°C
Tdie: +31.8°C

enp7s0-pci-0700
Adapter: PCI adapter
PHY Temperature: +55.0°C

nvme-pci-4100
Adapter: PCI adapter
Composite: +43.9°C (low = -273.1°C, high = +84.8°C)
(crit = +84.8°C)
Sensor 1: +43.9°C (low = -273.1°C, high = +65261.8°C)
Sensor 2: +41.9°C (low = -273.1°C, high = +65261.8°C)

iwlwifi_1-virtual-0
Adapter: Virtual device
temp1: N/A

k10temp-pci-00d3
Adapter: PCI adapter
Tctl: +58.5°C
Tdie: +31.5°C

k10temp-pci-00c3
Adapter: PCI adapter
Tctl: +61.9°C
Tdie: +34.9°C
Tccd1: +58.2°C
Tccd2: +58.5°C
Tccd3: +59.8°C

nvme-pci-0900
Adapter: PCI adapter
Composite: +36.9°C (low = -273.1°C, high = +84.8°C)
(crit = +84.8°C)
Sensor 1: +36.9°C (low = -273.1°C, high = +65261.8°C)
Sensor 2: +40.9°C (low = -273.1°C, high = +65261.8°C)

A lot of bizarre stuff here (negative temps???), and I have not yet mapped out what each of these sensors corresponds to.

EDIT: Ran a build of LLVM at full load and it failed; temps didn’t exceed 32 °C. Booted into memtest86 and there are now tons of errors.

One thing that is bizarre here is the sequence of events:

  1. I removed half the RAM modules when the system first became unstable months ago.
  2. memtest86 then returned zero errors.
  3. The system ran stable for a couple of months.
  4. The system is now unstable again, and the modules no longer pass memtest.

Why would RAM break after passing tests? The modules passed memtest on arrival as well. Could a misconfigured or faulty motherboard be damaging the DRAM modules?

While it may “scream bad memory”, something is whispering “power supply”.
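
If you want to chase that angle, one low-effort check is to keep the voltage readout from the post above on screen while a build or stress run is hammering the machine, and watch for rails sagging under load:

# Refresh the sensor readout every second during a heavy workload;
# look for the in0..in6 rails from the it8792 dipping as load ramps up.
watch -n 1 sensors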

Sorry to bump this, but I forgot to close with results. Right after my last post I purchased a new 128GB kit, and everything stabilized. I’ve been using it for a while, running memtests regularly, and still have zero failures.

