Mystery System Instability - Threadripper PRO 9995WX - MCE Hardware Error

Hi all,

First post here, and I apologize for the wall of text. I’m hoping to get some help and opinions on a deeply frustrating stability issue with a new HEDT build. I’ve been troubleshooting this for a while and have reached a point where the evidence seems to point to a core hardware fault, but I’d appreciate some expert eyes on the data before I proceed with an RMA.

System Specs:

  • CPU: AMD Ryzen Threadripper Pro 9995WX
  • Motherboard: ASUS Pro WS TRX50-SAGE WIFI (Latest BIOS installed)
  • RAM: V-COLOR DDR5 1024GB (256GBx4) 5600MHz CL38 16Gx4 8Rx4 OC R-DIMM (EXPO disabled, passed MemTest86+)
  • GPUs: 2x RTX 6000 Pro Blackwell Workstation Edition
  • Storage: 4 TB Samsung 9100 Pro
  • PSU: 2x Corsair HX1500i
  • OS: Ubuntu 24.04.3 LTS

The system experiences hard freezes and spontaneous reboots, primarily under heavy load (especially GPU-related tasks, but also all-core CPU stress tests). The crashes are often so sudden that no logs are written. This happens on a clean installation of Ubuntu. Initially, I couldn’t even get the Ubuntu 24.04 installer to boot without crashing, so I had to resort to installing Ubuntu 22.04 first and then doing an in-place upgrade to 24.04.

After one of the crashes, I managed to pull the following Boot Error Record Table log from the previous boot’s journal. This is the most concrete evidence I have of a hardware-level fault.

BERT: Error records from previous boot:
[Hardware Error]: event severity: fatal
[Hardware Error]: Error 0, type: fatal
[Hardware Error]: fru_text: Perr S0:T09F:B06
[Hardware Error]: section_type: IA32/X64 processor error
[Hardware Error]: Error Structure Type: cache error
[Hardware Error]: Processor Context Corrupt: true
[Hardware Error]: Uncorrected: true

mce: [Hardware Error]: Machine check events logged
mce: [Hardware Error]: CPU 175: Machine Check: 0 Bank 6: baa0000000000118
mce: [Hardware Error]: PROCESSOR 2:b00f81 TIME 1758090392 SOCKET 0 APIC 9f microcode b008112
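
For anyone reading the raw values above, here is a minimal sketch of how they can be picked apart. It decodes only the architectural flag bits (63 down to 57) of the machine-check status word and splits the fru_text string. The helper names decode_status and parse_fru_text are made up for illustration, and the fru_text layout (socket, APIC ID in hex, bank in hex) is an assumption inferred from the logs in this thread, not from vendor documentation.

```python
import re

# Architectural flag bits in the upper part of an x86 MCi_STATUS register.
STATUS_FLAGS = {
    63: "VAL",    # register contains a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register holds valid data
    58: "ADDRV",  # ADDR register holds valid data
    57: "PCC",    # processor context corrupt
}


def decode_status(status: int) -> dict:
    """Return the set flag names and the low 16-bit error code."""
    flags = [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]
    return {"flags": flags, "error_code": status & 0xFFFF}


def parse_fru_text(text: str) -> dict:
    """Split 'Perr S<socket>:T<apic>:B<bank>' (ASSUMED layout, hex fields)."""
    m = re.fullmatch(r"Perr S(\d+):T([0-9A-Fa-f]+):B([0-9A-Fa-f]+)", text)
    if m is None:
        raise ValueError(f"unrecognised fru_text: {text!r}")
    return {
        "socket": int(m.group(1)),
        "apic_id": int(m.group(2), 16),
        "bank": int(m.group(3), 16),
    }


if __name__ == "__main__":
    print(decode_status(0xBAA0000000000118))
    print(parse_fru_text("Perr S0:T09F:B06"))
```

For the status value above this reports VAL, UC, EN, MISCV and PCC as set, which lines up with the "Uncorrected: true" and "Processor Context Corrupt: true" lines from BERT. Under the assumed layout, the fru_text gives APIC ID 0x9f and bank 6, the same as the mce line.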

I have been through an exhaustive process of elimination:

  1. OS & Drivers: Tried installing Ubuntu 24.04 and 22.04. The issue persists. Tried all available NVIDIA drivers for the Blackwell GPU (570-open, 575-open, 580-open) with clean purges in between.
  2. Kernel Parameters: Tested numerous boot flags (nomodeset, acpi=off, nvme_core.default_ps_max_latency_us=0, etc.) with no success.
  3. BIOS: Updated to the latest motherboard BIOS from ASUS. Configured with known-good settings for Linux (CSM off, 4G Decoding On/Off, Resizable BAR On/Off, etc.).
  4. Hardware:
  • SSD: The issue persists even after swapping the primary NVMe drive with a brand new one from a different manufacturer.
  • RAM: Ran MemTest86+ for a couple hours with zero errors. The system is also running with EXPO disabled.
  5. Mechanical Checks: Re-seated the CPU and carefully re-torqued the CPU cooler mounting bracket after suspecting uneven pressure. This seemed to improve stability temporarily, but the system eventually crashed again.
  6. Power & Thermals: The CPU does not exceed 70°C under full load. The dual 1500W PSU setup should be sufficient to power the machine.
  7. Core Isolation: Limiting the CPU to 32 cores allowed one specific GPU-heavy workload to complete, but this isn’t a definitive test.

Given the explicit “processor error” and “cache error” in the BERT log, is it safe to conclude this is a faulty CPU, despite the exceedingly low failure rate of Threadripper Pro chips?

At this point, does the evidence point more strongly to the CPU or the motherboard? I can’t swap either easily, so any insight on which is the more likely culprit would be incredibly helpful for a potential RMA.

Thanks in advance for any insight you can provide. This has been a tough one.

Honestly I am not sure.
Error msg points to CPU, but it might be power related (since it only occurs during high power load).

What I’d do in your situation:

  • try y-cruncher or prime95 (or something else that stresses the CPU caches)
  • check whether this board has a BMC; if it does, try to access it and read the System Event Log (SEL). This can be done with ipmitool on Linux.

Can you connect both GPUs to only one PSU? So that the other one can run cpu/mb “undisturbed”.

Appreciate the response!

Going to run prime95 overnight to see if I can force a crash, but given that limiting the CPU to 32 cores has been stable while SSHed in all day today, I’m leaning towards the CPU being faulty.

The motherboard does have a BMC, but I believe it’s disabled, so I’ll see if I can get that working and then check the SEL like you suggest.

I’ll try both GPUs on one PSU, but the Pro WS TRX50-SAGE WIFI A manual describes a very specific dual-PSU setup in which the 24-pin connector on PSU 2 is not jumpered but instead connected to the motherboard in some way, so it’s hard to get a “clean” dual-PSU setup.

Thanks a ton!

TRX50-SAGE doesn’t have onboard BMC chip. Unless you’ve installed the IPMI extension card, you won’t be able to use the related functions. Monitoring DRAM & SSD temperature during the work might also help.

Ah you are right I had it confused with my Pro WS WRX80E-SAGE SE WIFI II build.

I am not familiar with this kind of hardware. On servers one can usually choose PSU setup between load-balance and hot-spare. Maybe there is an option in BIOS?

maybe… but at the same time - 32 cores use less power than all cores.
This is quite difficult to reason about.
My main worry about PSUs is that “1500 watts” doesn’t tell the whole story - it is a total, so there might still be some circuits/connections with far less capability.

Given that this is a high-end CPU, I’d probably try to contact AMD support and MB support and give them all the info - especially the BERT messages.
At the same time, if You can “rotate” which cores / CCDs are enabled, and test them, then maybe one of those will behave abnormally… lots of work and time, sadly.
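
A rough sketch of what that rotation could look like from Linux, without touching the BIOS: build one CPU set per CCD and run the stress tool pinned to each set in turn. This only restricts where the test runs (the other cores stay powered), so it is not identical to disabling CCDs in firmware. The assumptions here are 8 cores per CCD and Linux numbering where the SMT sibling of core N is CPU N + (number of cores); both should be checked with lscpu -e before trusting the groups. The function names are made up for illustration.

```python
import os
import subprocess


def ccd_cpu_groups(n_cores: int, cores_per_ccd: int = 8, smt: bool = True) -> list:
    """Return one set of logical CPU numbers per CCD.

    ASSUMPTION: cores are numbered contiguously per CCD and the SMT
    sibling of core N is logical CPU N + n_cores. Verify with `lscpu -e`.
    """
    groups = []
    for first in range(0, n_cores, cores_per_ccd):
        cores = range(first, min(first + cores_per_ccd, n_cores))
        cpus = set(cores)
        if smt:
            cpus |= {core + n_cores for core in cores}
        groups.append(cpus)
    return groups


def run_pinned(cmd: list, cpus: set) -> int:
    """Run cmd restricted to the given CPUs and return its exit code (Linux only)."""
    previous = os.sched_getaffinity(0)
    os.sched_setaffinity(0, cpus)  # child processes inherit this mask
    try:
        return subprocess.run(cmd).returncode
    finally:
        os.sched_setaffinity(0, previous)
```

With a 96-core part, ccd_cpu_groups(96) gives 12 groups, and under the numbering assumption CPU 175 from the first post falls in the tenth group (index 9) as the sibling of core 79. A loop over the groups calling run_pinned with the stress command of choice would then test one CCD at a time.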

I am encountering mystery reboots on a TR 9970X system:

CPU: 9970X. No overclock.
MB: ASRock TRX50 WS - 13.05 BIOS
RAM: 384 GB RAM (4x Micron MTC40F204WS1RC56BR 96GB)
GPUs: 1x RTX5090, 2x RTX3090, 1x ARC A310
NVMe: Kioxia CM7-R, 6x M.2
PSU: Silverstone 2050R (Based in UK)
SATA: 8x 20 TB hard disks

Just at the start of trying to diagnose this one. Will update on findings, but it sounds similar to your situation.

On the most recent reboot, my system had been running at 100% load (multiple AutoGluon models being fitted) for over 24 hours.

If not overclocked, RMA the CPU. This machine check is for an MCA error in the processor cache.

Thanks, it just failed an overnight prime95 blend… Time for RMA.

Good job I haven’t sold the 7960X that was previously in use.

Hope you find a solution, @kearm !

Are you guys checking the DIMM temps? I have seen MCEs from DIMMs being over temperature for extended periods…

Thanks for the input.

Unfortunately my modules don’t seem to report temps. I have used an infrared thermometer, and the highest temperature it reported was in the low 50s °C. I have all fans at full speed, although I am using the Arctic Freezer 4U-M, so space is a little tight in there.

I should note that upon closer inspection in my case, the first instance of rebooting (under 100% load) was followed by 3 further unprompted reboots (under no significant load) within an hour…

Thought I would post an update here. @kearm did you get a solution in the end?

I swapped my 7960X into the system and the reboots continued.

I ran memtest86+; no errors after 4 passes. I may still try y-cruncher when I get time.

Interestingly, I tested prime95 in an Ubuntu 24.04.3 LTS install with the 6.8 kernel; stable for 24 hours. So I installed 24.04.3 LTS w/ 6.8 kernel. I then upgraded to 6.14; reboots returned…

Back to 6.8 and stable under heavy CPU, memory and triple GPU load (XGBoost model training on all three GPUs, with AutoGluon model training on CPU in parallel).

I am wondering what could be causing this instability on 6.14?

EDIT: Just had a lockup on 6.8, so definitely looking like a hardware issue.

Well, I stand corrected! Thank you for your suggestion!

I replaced the Freezer 4u-m with the XE360-TR5, giving me more room to direct air over the dimms. I mounted a 140mm Noctua and 2x 120mm Arctic fans blowing air at and across them. Stable for the past 60 hours on 6.14 kernel using y-cruncher, GPU-burn, AutoGluon, various genomic search tools.

Frustrating that some of these dimms don’t report temperature.

Interestingly, when I ran y-cruncher and gpu-burn in parallel (one RTX5090 and 2x RTX3090s), I started seeing some corrected ECC errors pop up. The system draws 1.8-1.9 kW under those conditions, so it is potentially coming close to the PSU limit (Silverstone HELA 2050, based in UK), or it could be something else…

Hello,

We’ve encountered the same issue when running LLMs with inference frameworks such as vLLM or SGLang in a multi-GPU configuration with two NVIDIA RTX PRO 6000 Blackwell Max-Q GPUs. After testing LLMs, when I attempt to shut down the machine, either via sudo shutdown now or the desktop UI’s Power Off, it occasionally reboots instead of powering off. After it reboots once, I am usually able to shut it down normally. The issue is non-deterministic: sometimes the machine shuts down correctly, and other times it restarts. We tested four machines with the configuration below and see the same issue on all of them. Any help fixing it would be appreciated.

  • Motherboard: Gigabyte TRX50 AI TOP
  • CPU: AMD Ryzen Threadripper 9960X 24-Cores
  • GPU: 2xNVIDIA RTX PRO 6000 Blackwell Max-Q
  • PSU: FSP2500-57APB
  • OS: Ubuntu 24.04.3 LTS
  • Kernel: 6.14.0-37-generic

Here is what appears after an unsuccessful shutdown:

Full Error log:

Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: DMI: Memory slots populated: 4/8
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: ACPI: Reserving FACP table memory at [mem 0x55e6d000-0x55e6d113]
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: ACPI: Reserving DSDT table memory at [mem 0x55e5a000-0x55e6c3b4]
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: ACPI: Reserving FACS table memory at [mem 0x57e5c000-0x57e5c03f]
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: ACPI: Reserving MSDM table memory at [mem 0x55e7c000-0x55e7c054]
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: ACPI: Reserving HWIN table memory at [mem 0x55e7b000-0x55e7b06b]
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: ACPI: Reserving SSDT table memory at [mem 0x55e70000-0x55e7a8dc]
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: ACPI: Reserving SSDT table memory at [mem 0x55e6f000-0x55e6fa1f]
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: ACPI: Reserving SSDT table memory at [mem 0x55e6e000-0x55e6e066]
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: ACPI: Reserving FIDT table memory at [mem 0x55e59000-0x55e5909b]
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: ACPI: Reserving MCFG table memory at [mem 0x55e58000-0x55e5803b]
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: ACPI: Reserving SSDT table memory at [mem 0x55e57000-0x55e575c4]
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: ACPI: Reserving HPET table memory at [mem 0x55e56000-0x55e56037]
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: ACPI: Reserving SSDT table memory at [mem 0x55e55000-0x55e552e6]
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: ACPI: Reserving BERT table memory at [mem 0x55e54000-0x55e5402f]
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: ACPI: Reserving HEST table memory at [mem 0x55e53000-0x55e53a13]
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: ACPI: Reserving PHAT table memory at [mem 0x55e52000-0x55e5210a]
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: ACPI: Reserving ABLT table memory at [mem 0x55e51000-0x55e512c1]
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: ACPI: Reserving FPDT table memory at [mem 0x55e50000-0x55e50043]
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: ACPI: Reserving SSDT table memory at [mem 0x55e4d000-0x55e4f447]
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: ACPI: Reserving SSDT table memory at [mem 0x55e4a000-0x55e4c447]
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: ACPI: Reserving SSDT table memory at [mem 0x55e47000-0x55e49447]
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: ACPI: Reserving SSDT table memory at [mem 0x55e44000-0x55e46447]
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: ACPI: Reserving SSDT table memory at [mem 0x55e41000-0x55e43447]
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: ACPI: Reserving SSDT table memory at [mem 0x55e3e000-0x55e40447]
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: ACPI: Reserving SSDT table memory at [mem 0x55e3b000-0x55e3d447]
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: ACPI: Reserving SSDT table memory at [mem 0x55e38000-0x55e3a447]
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: ACPI: Reserving BGRT table memory at [mem 0x55e37000-0x55e37037]
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: ACPI: Reserving SSDT table memory at [mem 0x55e2b000-0x55e3696d]
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: ACPI: Reserving HMAT table memory at [mem 0x55e2a000-0x55e2a0a3]
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: ACPI: Reserving TPM2 table memory at [mem 0x55e29000-0x55e2904b]
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: ACPI: Reserving IVRS table memory at [mem 0x55e28000-0x55e28267]
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: ACPI: Reserving WPBT table memory at [mem 0x55cce000-0x55cce033]
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: ACPI: Reserving SSDT table memory at [mem 0x55ccd000-0x55ccd6d3]
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: ACPI: Reserving SSDT table memory at [mem 0x55cc3000-0x55cccaf1]
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: ACPI: Reserving WSMT table memory at [mem 0x55cc2000-0x55cc2027]
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: ACPI: Reserving APIC table memory at [mem 0x55cc0000-0x55cc10b7]
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: ACPI: Reserving SSDT table memory at [mem 0x55cbd000-0x55cbf6a3]
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: ACPI: Reserving SSDT table memory at [mem 0x55cbc000-0x55cbc4ff]
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: ACPI: Reserving SSDT table memory at [mem 0x55cbb000-0x55cbb9cf]
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: ACPI: Reserving SSDT table memory at [mem 0x55cba000-0x55cbab71]
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: ACPI: Reserving SSDT table memory at [mem 0x55cb9000-0x55cb944d]
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: Early memory node ranges
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: PM: hibernation: Registered nosave memory: [mem 0x00000000-0x00000fff]
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: PM: hibernation: Registered nosave memory: [mem 0x000a0000-0x000fffff]
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: PM: hibernation: Registered nosave memory: [mem 0x04000000-0x04039fff]
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: PM: hibernation: Registered nosave memory: [mem 0x099ff000-0x09ffffff]
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: PM: hibernation: Registered nosave memory: [mem 0x0b000000-0x0b020fff]
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: PM: hibernation: Registered nosave memory: [mem 0x4a366000-0x4a45ffff]
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: PM: hibernation: Registered nosave memory: [mem 0x4a579000-0x4a579fff]
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: PM: hibernation: Registered nosave memory: [mem 0x4fc48000-0x5dffefff]
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: PM: hibernation: Registered nosave memory: [mem 0x5f769000-0x5f76bfff]
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: PM: hibernation: Registered nosave memory: [mem 0x5fffb000-0x67ffffff]
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: PM: hibernation: Registered nosave memory: [mem 0x6a700000-0x77ffffff]
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: PM: hibernation: Registered nosave memory: [mem 0x7d800000-0x10000ffff]
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: PM: hibernation: Registered nosave memory: [mem 0x407c000000-0x407c7fffff]
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: Freeing SMP alternatives memory: 48K
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: Memory: 263101260K/267753680K available (21475K kernel code, 4587K rwdata, 15072K rodata, 5140K init, 4412K bss, 4598948K reserved, 0K cma-reserved)
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: x86/mm: Memory block size: 2048MB
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: Freeing initrd memory: 71392K
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: BERT: Error records from previous boot:
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: [Hardware Error]: event severity: fatal
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: [Hardware Error]:  Error 0, type: fatal
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: [Hardware Error]:  fru_text: Perr S0:T000:B15
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: [Hardware Error]:   section_type: memory error
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: [Hardware Error]:    error_status: Bus parity error - must also set the A, C, or D Bits (0x0000000000011600)
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: [Hardware Error]:   node:0 card:0 module:0
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: [Hardware Error]:   error_type: 8, parity error
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: [Hardware Error]:  Error 1, type: recoverable
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: [Hardware Error]:  fru_text: Perr S0:T000:B15
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: [Hardware Error]:   section_type: IA32/X64 processor error
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: [Hardware Error]:   Local APIC_ID: 0x0
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: [Hardware Error]:   CPUID Info:
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: [Hardware Error]:   00000000: 00b00f81 00000000 00300800 00000000
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: [Hardware Error]:   00000010: 76fa320b 00000000 178bfbff 00000000
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: [Hardware Error]:   00000020: 00000000 00000000 00000000 00000000
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: [Hardware Error]:   Error Information Structure 0:
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: [Hardware Error]:    Error Structure Type: micro-architectural error
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: [Hardware Error]:    Check Information: 0x0000000000050001
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: [Hardware Error]:     Error Type: 5, Internal Unclassified
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: [Hardware Error]:   Context Information Structure 0:
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: [Hardware Error]:    Register Context Type: MSR Registers (Machine Check and other MSRs)
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: [Hardware Error]:    Register Array Size: 0x0080
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: [Hardware Error]:    MSR Address: 0xc0002151
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: [Hardware Error]:  Error 2, type: corrected
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: [Hardware Error]:  fru_text: Perr S0:T002:B16
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: [Hardware Error]:   section_type: memory error
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: [Hardware Error]:    error_status: Bus parity error - must also set the A, C, or D Bits (0x0000000000011600)
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: [Hardware Error]:   node:0 card:3 module:0
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: [Hardware Error]:   error_type: 8, parity error
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: mce: [Hardware Error]: Machine check events logged
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 21: fea000000004080b
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: mce: [Hardware Error]: TSC 0 ADDR f6f2ada8d1 MISC d0150fff01000000 PPIN 2b0e2ec762dc05a SYND 5d000000 SYND1 3a30532072726550 SYND2 3531423a30303054 IPID 9600050f00
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: mce: [Hardware Error]: PROCESSOR 2:b00f81 TIME 1766471971 SOCKET 0 APIC 0 microcode b008112
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: RAS: Correctable Errors collector initialized.
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: Freeing unused decrypted memory: 2028K
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: Freeing unused kernel image (initmem) memory: 5140K
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: Freeing unused kernel image (text/rodata gap) memory: 1052K
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: Freeing unused kernel image (rodata/data gap) memory: 1312K
Dec 23 11:39:34 admin2-TRX50-AI-TOP-ProArt-S0EB systemd[1]: Listening on systemd-oomd.socket - Userspace Out-Of-Memory (OOM) Killer Socket.
Dec 23 11:39:35 admin2-TRX50-AI-TOP-ProArt-S0EB kernel: MCE: In-kernel MCE decoding enabled.
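
One detail worth noting in the log above: the SYND1 and SYND2 values on the mce line look like plain ASCII packed into 64-bit words in little-endian order. A small sketch to unpack them follows (synd_to_text is a made-up helper name; that the firmware stores the FRU text in these registers is an inference from this one log, not a documented guarantee).

```python
def synd_to_text(*words: int) -> str:
    """Interpret 64-bit register values as little-endian ASCII text."""
    raw = b"".join(word.to_bytes(8, "little") for word in words)
    # Replace anything non-ASCII so an unexpected value still prints.
    return raw.decode("ascii", errors="replace").rstrip("\x00")


if __name__ == "__main__":
    # SYND1 and SYND2 copied from the mce line in the log.
    print(synd_to_text(0x3A30532072726550, 0x3531423A30303054))
    # prints: Perr S0:T000:B15
```

The result is the same string as the fru_text of Error 0 and Error 1 in the BERT record, and B15 read as hex is 21, which agrees with "Bank 21" on the mce line.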

Can you point a 140mm fan at the DIMM area of the motherboard and retest? Is there anything in the output of lm-sensors?
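
If lm-sensors is not installed, the same readings can be pulled straight from the kernel's hwmon interface in sysfs. Below is a minimal sketch, assuming the usual layout where each /sys/class/hwmon/hwmon* directory has a name file and temp*_input files in millidegrees Celsius. Whether the DIMMs show up at all depends on the kernel, the board and the modules; on newer kernels DDR5 modules may appear under a driver named spd5118, but treat that as something to check rather than a given. read_hwmon_temps is a made-up name.

```python
from pathlib import Path


def read_hwmon_temps(root: str = "/sys/class/hwmon") -> list:
    """Return (device, driver name, sensor file, degrees C) for each sensor."""
    readings = []
    for dev in sorted(Path(root).glob("hwmon*")):
        name_file = dev / "name"
        if not name_file.is_file():
            continue
        name = name_file.read_text().strip()
        for temp_file in sorted(dev.glob("temp*_input")):
            try:
                milli_c = int(temp_file.read_text().strip())
            except (OSError, ValueError):
                continue  # sensor present but not readable right now
            readings.append((dev.name, name, temp_file.name, milli_c / 1000.0))
    return readings


if __name__ == "__main__":
    for device, name, sensor, temp_c in read_hwmon_temps():
        print(f"{device} {name:<12} {sensor:<14} {temp_c:.1f} C")
```

Running this in a loop during the workload and logging it to disk gives a temperature trace leading up to a crash, which is more useful than a single snapshot.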

Hi @wendell, the RAM is 256GB (Kingston KSM56R46BD4PMI-64MDI). We are using the ASUS ProArt PA602 case with the two fans inside, and we have four identical machines. Could the issue be related to the BIOS version? We are currently running version F10.

the memory could be throttling properly but give it more airflow to see if that changes the situation. just pop the side off and manually place a fan so there’s more airflow over the dimms just to see what happens

Hi, I have the same issue. I also have a TRX50 AI TOP (BIOS F12d), a Threadripper Pro 9985WX and 2x RTX5090s; the memory is G.Skill F5-6000R3036G16GE8-G5N. I run it as a server with Proxmox. It shuts down only when it wants to: sometimes on the first try, and sometimes I have to try many times after a couple of spontaneous reboots.

The issue was due to the BIOS firmware. We updated it from F10 to F12e. Hope it helps.

@Shakhizat_Nurgaliyev and @BigSound if you continue to have instability, could you report it here?

My system has been stable for two months now with substantial improvements to the airflow over the DIMMs.
