Debugging a crashy server

That’s a good question.

Is your BIOS up to date?

I’ve only ever encountered this on a Lenovo 8th gen Intel laptop. I wasn’t able to solve it, so I had to send it back.

Yea… this is the second CPU this bloody computer had, with this exact problem.

Sorry, I should have verified that this is not likely a CPU problem.

What specific firmware version are you running?

I just updated the BIOS to the latest version (although I was already running the AGESA 2.0 patches), and installed the amd-ucode package I had missed during installation.
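If you want to confirm the microcode actually got loaded after installing amd-ucode, something like this rough sketch might help. It assumes an x86 Linux box where /proc/cpuinfo has a "microcode" line per logical CPU; the function name microcode_revisions is just made up for this example.

```python
from pathlib import Path


def microcode_revisions(cpuinfo_text):
    """Collect the distinct microcode revision strings from /proc/cpuinfo-style text."""
    revisions = set()
    for line in cpuinfo_text.splitlines():
        # Lines look like "key<TAB>: value"; skip anything without a colon.
        key, sep, value = line.partition(":")
        if sep and key.strip() == "microcode":
            revisions.add(value.strip())
    return revisions


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        # More than one revision here would mean the cores disagree.
        print(sorted(microcode_revisions(cpuinfo.read_text())))
```

Run it before and after a reboot: if the revision never changes, the package may not be getting picked up by the initramfs or bootloader.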

I think the real test for this is probably 2 weeks without a crash.

If updates don’t stop it, try a different power supply.
It doesn’t necessarily need to be more reputable or more powerful, as long as its internals are different. Consider swapping with the supply from a different PC.

you can try

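  1. in the BIOS, go to Advanced > AMD CBS > Zen Common Options and set Power Supply Idle Control to Typical Current Idle
  2. compile the kernel with CONFIG_RCU_NOCB_CPU=y and add rcu_nocbs=0-12 to the kernel command line in GRUB (adjust the range to your CPU’s thread count)
  3. disable the C6 power state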

https://bugzilla.kernel.org/show_bug.cgi?id=196683
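
For option 2, here is a small sketch for working out the rcu_nocbs value and checking whether the running kernel was actually booted with it. It assumes you want every logical CPU in the range and that /proc/cmdline is readable; rcu_nocbs_param and find_param are hypothetical helper names, not part of any tool.

```python
import os
from pathlib import Path


def rcu_nocbs_param(cpu_count):
    """Build an rcu_nocbs= argument covering logical CPUs 0..cpu_count-1."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    if cpu_count == 1:
        return "rcu_nocbs=0"
    return f"rcu_nocbs=0-{cpu_count - 1}"


def find_param(cmdline, name):
    """Return the value of name=value on a kernel command line, or None if absent."""
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key == name:
            return value
    return None


if __name__ == "__main__":
    # os.cpu_count() can return None, so fall back to a single CPU.
    print("suggested:", rcu_nocbs_param(os.cpu_count() or 1))
    cmdline = Path("/proc/cmdline")
    if cmdline.exists():
        print("currently booted with:", find_param(cmdline.read_text(), "rcu_nocbs"))
```

If the second line prints None after you edited the GRUB config, the bootloader config probably was not regenerated. Also keep in mind the parameter only has an effect on a kernel built with CONFIG_RCU_NOCB_CPU=y.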