Threadripper 3970X and Linux hardware error

Hi all.

I just bought a Threadripper 3970X together with an ASRock TRX40 Taichi motherboard. This system is my upgrade from a Threadripper 1950X, which has served me very well. I run Fedora 31 on both systems.

During attempts to get GPU passthrough working (without success) on the new system, I have received the following Linux kernel message: (image attached)

feb. 15 22:26:48 oddstr kernel: mce: [Hardware Error]: Machine check events logged
feb. 15 22:26:48 oddstr kernel: [Hardware Error]: Deferred error, no action required.
feb. 15 22:26:48 oddstr kernel: [Hardware Error]: CPU:2 (17:31:0) MC22_STATUS[-|-|MiscV|-|-|-|SyndV|Deferred|-|-]: 0x982010000001010b
feb. 15 22:26:48 oddstr kernel: [Hardware Error]: IPID: 0x0000001813d17000, Syndrome: 0x000000004b00000c
feb. 15 22:26:48 oddstr kernel: [Hardware Error]: Northbridge IO Unit Ext. Error Code: 1, PCIE error.
feb. 15 22:26:48 oddstr kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: GEN

I seem unable to find information on what this actually means. Does anyone have enough insight into the CPU's MCA registers to tell me whether my 3970X is broken? If not, suggestions on where I can ask are also welcome.

Best regards,
Odd Skancke

https://www.phoronix.com/scan.php?page=news_item&px=Linux-Boot-Threadripper-Zen2MCE

I should have mentioned that I was running a custom-built v5.5.0-rc3 kernel at the time. So, unless I have missed something, this is not the MCE issue that prevented a successful boot.

What I am after is whether anyone knows how to read and interpret the hardware error message. I think I could figure it out myself if I knew where to find info about the different MCA CPU registers. I just want to make sure that my CPU is physically OK :)

Kernel mailing list maybe? Are you getting these errors a lot, or when running stress tests?

I got those errors when attempting to pass through a Radeon RX 570 to a Windows 10 guest, which works perfectly on the 1950X rig.

Since I had to go back to the 1950X rig, I cannot do any tests atm. But no, I did not stress the system - just attempted to boot the Win10 guest.

I’ve ordered an Asus Zenith Extreme II board. I’m hoping this will work for me, as I’ve seen others with that board have success with passthrough.

Just in case someone is interested and can enlighten me, here is a picture of the dmesg output during one of those attempts.

Alas, I do not have experience in this area and I don’t know The Answer. But it interests me, and I offer a few points for your consideration.

If you add the “helpdesk” tag to your post, I think it may draw attention from helpful experts on the forum who watch for that tag.

This is an MCE (Machine Check Exception) error, and you might want to google “Linux MCE” and “Linux mcelog” for some general info (if you haven’t already). I learned that, depending on your distribution, some of the following may apply:

  • The mcelog manpage may shed light on the situation;
  • You may want to install the mcelog package (or similar) for additional tool(s);
  • Info may be written into /var/log/mcelog;
  • “sudo /usr/sbin/mcelog > logfile” may provide more info. But the data seems to be discarded (by a daemon?) once mcelog has read it, so be sure to save the output into a log file. Of course, the path and syntax may vary on your distribution.

Looking over the dmesg output, I notice something you may already be aware of. There are 3 very similar MCE errors, all reported as “PCIE error” with the same status, address, syndrome, etc. Essentially a single error repeated three times. The MCE error at 123.735459 follows setup for a vfio-pci device (possibly passthrough for the PCIE bus?), and immediately follows “enabling device” for that device.
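If you would rather not compare the entries by eye, here is a rough Python sketch that counts how often each bank/status pair shows up in a saved log. The function name count_mce_events and the regular expression are my own, and it assumes the lines look like the ones pasted in the first post; the sample text is just those lines, since the dmesg picture is not available as text.

```python
import re
from collections import Counter

# Matches the "CPU:n ... MCnn_STATUS[...]: 0x..." line as it appears in the first post.
STATUS_RE = re.compile(r"CPU:(\d+) .*MC(\d+)_STATUS\[[^\]]*\]: (0x[0-9a-fA-F]+)")


def count_mce_events(log_text):
    """Count occurrences of each (bank, status) pair in kernel log text."""
    counts = Counter()
    for line in log_text.splitlines():
        if "[Hardware Error]" not in line:
            continue
        match = STATUS_RE.search(line)
        if match:
            _cpu, bank, status = match.groups()
            counts[(int(bank), int(status, 16))] += 1
    return counts


# Lines copied from the first post of this thread.
sample = """\
feb. 15 22:26:48 oddstr kernel: mce: [Hardware Error]: Machine check events logged
feb. 15 22:26:48 oddstr kernel: [Hardware Error]: Deferred error, no action required.
feb. 15 22:26:48 oddstr kernel: [Hardware Error]: CPU:2 (17:31:0) MC22_STATUS[-|-|MiscV|-|-|-|SyndV|Deferred|-|-]: 0x982010000001010b
feb. 15 22:26:48 oddstr kernel: [Hardware Error]: IPID: 0x0000001813d17000, Syndrome: 0x000000004b00000c
"""

for (bank, status), n in count_mce_events(sample).items():
    print(f"bank {bank} status {status:#018x} seen {n} time(s)")
```

Fed the full dmesg text instead of the sample, a single key with a count of 3 would confirm that it is one error repeating rather than three different ones.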

To my non-expert eye, this does not strongly indicate a CPU failure, but makes me wonder about possible configuration issues for the passthrough. Still, I understand why you particularly want to determine whether the CPU is at fault. You might learn more by exercising virtualization in other ways, without passthrough, to see whether any other errors crop up.
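On reading the raw numbers: below is a minimal Python sketch that splits the status and IPID words from the first post into fields. It assumes the bit positions the Linux kernel uses for AMD machine-check registers (arch/x86/include/asm/mce.h and the AMD decoder in drivers/edac/mce_amd.c); the helper names decode_status and decode_ipid are made up for illustration, and this is not a full decoder.

```python
# Flag bit positions, assumed from the Linux kernel's arch/x86/include/asm/mce.h.
STATUS_FLAGS = {
    63: "Val",       # register contents are valid
    62: "Over",      # an earlier error was overwritten
    61: "UC",        # uncorrected error
    60: "En",        # error reporting enabled for this bank
    59: "MiscV",     # MISC register holds valid data
    58: "AddrV",     # ADDR register holds valid data
    57: "PCC",       # processor context corrupt
    55: "TCC",       # task context corrupt
    53: "SyndV",     # syndrome register holds valid data
    44: "Deferred",  # deferred error
    43: "Poison",    # poisoned data consumed
}


def decode_status(status):
    """Split an MCA status word into its set flags and error codes."""
    flags = [name for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    return {
        "flags": flags,
        "ext_error_code": (status >> 16) & 0x3F,  # bits 21:16
        "error_code": status & 0xFFFF,            # bits 15:0
    }


def decode_ipid(ipid):
    """Split an MCA IPID word into type, hardware ID and instance ID."""
    return {
        "mca_type": (ipid >> 48) & 0xFFFF,    # bits 63:48
        "hardware_id": (ipid >> 32) & 0xFFF,  # bits 43:32
        "instance_id": ipid & 0xFFFFFFFF,     # bits 31:0
    }


status = decode_status(0x982010000001010B)
ipid = decode_ipid(0x0000001813D17000)
print(status["flags"])
print(hex(status["ext_error_code"]), hex(status["error_code"]))
print(hex(ipid["hardware_id"]), hex(ipid["mca_type"]))
```

For the values in this thread, the set flags come out as Val, En, MiscV, SyndV and Deferred, with UC and PCC both clear, and the extended error code is 1, which agrees with the bracketed flags and “Ext. Error Code: 1” in the kernel message. The hardware ID is 0x18, which as far as I can tell is the ID the kernel associates with the Northbridge IO unit named in the log.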

I regret this isn’t more definitive, and hope some more knowledgeable folks will take notice and chime in.