Ryzen 7 1700 machine check error - instant reboot

So my Fedora 28 box just rebooted while idle for no apparent reason. I was reading level1techs not even touching the mouse and it spontaneously rebooted.

journalctl shows these gems:

Jun 08 02:30:24 localhost.localdomain kernel: mce: [Hardware Error]: Machine check events logged
Jun 08 02:30:24 localhost.localdomain kernel: mce: [Hardware Error]: CPU 4: Machine Check: 0 Bank 5: bea0000000000108
Jun 08 02:30:24 localhost.localdomain kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffffb28b9f22 MISC d012000101000000 SYND 4d000000 IPID 500b000000000
Jun 08 02:30:24 localhost.localdomain kernel: mce: [Hardware Error]: PROCESSOR 2:800f11 TIME 1528443020 SOCKET 0 APIC 4 microcode 8001129
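For anyone wanting to read the status word by hand, here is a rough sketch that picks apart the Bank 5 value from the log above. The bit positions are taken from the MCI_STATUS_* definitions in the Linux kernel's arch/x86/include/asm/mce.h; the helper names (decode_status, mce_time) and the 6-bit extended error code field are my own assumptions for illustration, not anything mcelog or the kernel exposes under those names.

```python
from datetime import datetime, timezone

# Bit positions per the MCI_STATUS_* defines in Linux's asm/mce.h.
# TCC and SYNDV are the AMD-specific bits; the rest are architectural.
STATUS_BITS = {
    "VAL": 63,    # register holds a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # uncorrected error
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # MISC register is valid
    "ADDRV": 58,  # ADDR register is valid
    "PCC": 57,    # processor context corrupt
    "TCC": 55,    # task context corrupt (AMD)
    "SYNDV": 53,  # SYND register is valid (AMD)
}


def decode_status(status):
    """Split a 64-bit MCA status word into flag names and error codes."""
    flags = [name for name, bit in STATUS_BITS.items() if (status >> bit) & 1]
    return {
        "flags": flags,
        "error_code": status & 0xFFFF,            # bits 15:0
        "ext_error_code": (status >> 16) & 0x3F,  # bits 21:16 (assumed width)
    }


def mce_time(epoch_seconds):
    """Convert the TIME field of the log line to a UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


if __name__ == "__main__":
    print(decode_status(0xBEA0000000000108))
    print(mce_time(1528443020).isoformat())
```

On the value logged here, both UC and PCC come back set, i.e. an uncorrected error with the processor context marked corrupt, which would fit an instant reset rather than a recoverable event.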

Checking mcelog nets me this:

mcelog: ERROR: AMD Processor family 23: mcelog does not support this processor. Please use the edac_mce_amd module instead.
CPU is unsupported

Trolling the interwebs results in several suggestions: bad memory, early SKU defective Ryzen, power management problem, PSU issue.

This is a pre-built Dell, though I have added an extra 16GB DDR4-2400 with identical manufacturer (Kingston) and timings as the original DIMM. The PSU has been recently upgraded from the craptastic Dell 460w to a Corsair CX650M.

I have been using this system for virtualization, so it’s possible that I’ve misconfigured something. I configured these kernel options in GRUB to get GPU passthrough working:

rd.driver.blacklist=nouveau modprobe.blacklist=nouveau nvidia-drm.modeset=1 resume=UUID=e2e831d7-33f5-4868-8976-4710eac595bf amd_iommu=on iommu=pt rd.driver.pre=vfio-pci vfio-pci.ids=10de:1c82,10de:0fb9 video=efifb:off rhgb quiet
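To see which options the running kernel actually booted with (as opposed to what is sitting in the GRUB config), something like the sketch below can help. It only splits the command line into key/value pairs; parse_cmdline and read_cmdline are made-up helper names, and /proc/cmdline is the standard Linux location for the booted command line.

```python
def parse_cmdline(cmdline):
    """Turn a kernel command line into a dict.

    Options with a value split on the first '=' only, so
    'resume=UUID=abcd' becomes {'resume': 'UUID=abcd'}.
    Bare flags such as 'quiet' map to None. If an option is
    repeated, the last occurrence wins in this sketch.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def read_cmdline(path="/proc/cmdline"):
    """Read and parse the command line of the running kernel."""
    with open(path) as handle:
        return parse_cmdline(handle.read())


# Example usage on a Linux box:
#   opts = read_cmdline()
#   print(opts.get("amd_iommu"), opts.get("resume"))
```

For what it's worth, resume= just tells the kernel which device to look on for a hibernation image, so it is usually there because the installer added it.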

I’ve got to admit I’m not familiar with some of these options, but they seemed to be working fine for a few weeks. The GTX 1050 Ti that was installed has since been removed, and I’m contemplating getting a more powerful GPU for running Windows games. Currently only an RX 580 is installed.

Before I go down the rabbit hole, any suggestions on what I should start testing in what order?

Personally I always disable C-States in the UEFI. Don’t know if you can on a Dell board.

Nope. I can disable some S states in BIOS, but that’s it. I notice that there is a “resume” parameter in the kernel options that I didn’t put there. Could that be a problem?

So this happened again today. I researched this problem on AMD’s forum:
https://community.amd.com/thread/216084?start=180&tstart=0

and one possible fix is to disable C-states, in particular C6. My BIOS is locked down hard, but this script does allow for the disabling of C6:

So I’m going to give it a try and see if that eliminates the problem. The curious thing is that Windows 10 seems to operate fine. But I do use my Fedora partition a lot more than my Windows one.

First thing to do is just reseating your CPU. And honestly if you still get the issue RMA the chip. Chances are good that you just have a faulty early production run CPU.

The MCE hard reset is a known silicon level problem. I don’t recommend the C6 workaround for this, since essentially your CPU can not be trusted if you have this issue.

It may sound scary, but do try and approach AMD support about an RMA.

Hmm, I bought the CPU as part of a pre-built. Would AMD issue an RMA for an OEM part? If not, I’m kind of screwed as this problem only seems to happen in Linux. And Dell’s position is that if the problem is only occurring under Linux on a system designed for Windows you’re SOL.

Give the C6 workaround a try.
But also at the same time contact AMD and Dell about an RMA. Maybe even CC them :joy:

The OS really shouldn’t matter if the hardware is potentially defective. It can’t hurt to try and be persistent about it. Make mention of the Linux Ryzen SEGV bug.

You can also try and run the ryzen kill script

You shouldn’t have to do these workarounds to get usable hardware with a good Ryzen CPU.

Well, I’d try to contact AMD, but I get this when I try to go to the contact webform:

emailcustomercare.amd.com uses an invalid security certificate.

The certificate is not trusted because the issuer certificate is unknown. The server might not be sending the appropriate intermediate certificates. An additional root certificate may need to be imported.

Error code: SEC_ERROR_UNKNOWN_ISSUER

LOL, that’s one way to keep your RMA costs down…

Just mail direct to [email protected]

OR

https://support.amd.com/en-us/contact/email-form

You will get a ticket number back, and a rep will respond to it at some point.

Also taking this into consideration you may need to go via Dell.
https://support.amd.com/en-us/warranty/oem

Works fine here

The site loaded in Chromium OK, but then the form timed out. But I got a confirmation email from AMD anyway. We’ll see if they have any suggestions, or offer an RMA.

I agree that this should be no problem in the first place.
But maybe give processor.max_cstate=1 a shot?
Works for my laptop. :wink:
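If you want to check that the option took effect after a reboot, a small sketch like this lists the idle states the kernel exposes for one CPU. It reads the standard cpuidle sysfs files (name and usage under /sys/devices/system/cpu/cpuN/cpuidle/stateM); list_idle_states is a made-up helper name, and the expectation that nothing deeper than C1 shows up with max_cstate=1 is an assumption that depends on which idle driver is in use.

```python
from pathlib import Path


def list_idle_states(cpu_dir):
    """Return (state_dir, name, usage) for each cpuidle state of one CPU.

    cpu_dir is something like /sys/devices/system/cpu/cpu0.
    'usage' is the number of times the state has been entered.
    """
    state_dirs = sorted(
        Path(cpu_dir).glob("cpuidle/state*"),
        key=lambda p: int(p.name[len("state"):]),  # numeric order: state2 < state10
    )
    states = []
    for state_dir in state_dirs:
        name = (state_dir / "name").read_text().strip()
        usage = int((state_dir / "usage").read_text())
        states.append((state_dir.name, name, usage))
    return states


# Example usage on a Linux box:
#   for entry in list_idle_states("/sys/devices/system/cpu/cpu0"):
#       print(entry)
```

Comparing the output before and after adding the option should show whether the deeper states disappeared or at least stopped being entered.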

Thanks, I’ll give it a shot. But doesn’t that kill your battery life? Have you RMA’d it?

Well, it does shorten the battery life a little probably but it is still clocking down just sitting there doing nothing. I won’t RMA, it works perfectly otherwise and on a laptop … if that’s all that is wrong with it, I take it. :wink:

In @noenken’s case it’s a different, kernel-related issue and not a hardware problem.

I also had a defective R7 1700X with the same MCE issue you have. It doesn’t just crash under Linux. Under the right circumstances it would also occur in Windows.

noenken’s issue is related to how the Linux kernel handles C-states and power management of the Raven Ridge Vega GPU and PCIe bus when C-state changes occur.

Essentially the linux kernel developers haven’t got all the software code in place to make it work right.

Oh wow, I didn’t realize that. I thought it was also something that would be done in a BIOS update at some point. Yeah, if it’s hardware then of course be loud about it, @imrazor.

MCEs almost always (99.99%) mean there’s a hardware issue.

Great, I hate trying to coerce support out of companies like Dell and AMD. Since it’s a consumer product and not an enterprise PC, it’ll be even harder. I should’ve gotten a workstation, but I don’t think there are any Ryzen workstations out there yet.

I’ll wait for a response from AMD, then I’ll ping Dell and see if I get any useful response. In the meantime, I’ll try disabling C6 with the zenstates.py script, then all C-states with @noenken’s kernel option if that fails.

If I’m forced to, would getting a brand new retail gen 1 Ryzen fix the issue?

EDIT: Oh, one other thing. Could this be a RAM issue? I’ve seen that mentioned in other threads.

I’m almost 99.99% sure it’s not a RAM issue. If you tell me this also happens when your RAM is running at JEDEC (stock) spec then that makes the last 0.01% which makes it 100% sure. :wink:

Yep, my RAM is running at JEDEC speed (DDR4-2400). I don’t even have the option of overclocking RAM or XMP in the BIOS.

Got a prompt response from AMD. They want my serial number. Is there any way to get it without tearing apart the PC and removing the Dell cooler?