Ryzen 7 1700 machine check error - instant reboot

imrazor · June 8, 2018, 8:20am

So my Fedora 28 box just rebooted while idle for no apparent reason. I was reading level1techs not even touching the mouse and it spontaneously rebooted.

journalctl shows these gems:

Jun 08 02:30:24 localhost.localdomain kernel: mce: [Hardware Error]: Machine check events logged
Jun 08 02:30:24 localhost.localdomain kernel: mce: [Hardware Error]: CPU 4: Machine Check: 0 Bank 5: bea0000000000108
Jun 08 02:30:24 localhost.localdomain kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffffb28b9f22 MISC d012000101000000 SYND 4d000000 IPID 500b000000000
Jun 08 02:30:24 localhost.localdomain kernel: mce: [Hardware Error]: PROCESSOR 2:800f11 TIME 1528443020 SOCKET 0 APIC 4 microcode 8001129
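
For anyone who wants to read the Bank 5 status word by hand, here is a minimal sketch of a decoder. The bit positions are an assumption based on the Linux kernel's x86 MCE definitions (the AMD-specific ones in particular), and `STATUS_BITS` and `decode_status` are names made up for this example, so verify against your kernel source and AMD's documentation before drawing conclusions.

```python
# Bit positions follow the Linux kernel's x86 MCE definitions (an assumption
# here; verify against your kernel source and AMD's documentation).
STATUS_BITS = {
    63: "VAL",       # register holds a valid error
    62: "OVER",      # an earlier error was overwritten
    61: "UC",        # uncorrected error
    60: "EN",        # error reporting was enabled
    59: "MISCV",     # MISC register is valid
    58: "ADDRV",     # ADDR register is valid
    57: "PCC",       # processor context corrupt
    55: "TCC",       # task context corrupt (AMD)
    53: "SYNDV",     # syndrome register is valid (AMD)
    44: "DEFERRED",  # deferred error (AMD)
    43: "POISON",    # poisoned data consumed (AMD)
}


def decode_status(status: int) -> dict:
    """Split a 64-bit MCi_STATUS value into named flags and error codes."""
    flags = [name for bit, name in sorted(STATUS_BITS.items(), reverse=True)
             if (status >> bit) & 1]
    return {
        "flags": flags,
        "error_code": status & 0xFFFF,                 # bits 0-15
        "extended_error_code": (status >> 16) & 0x3F,  # bits 16-21
    }


if __name__ == "__main__":
    info = decode_status(0xBEA0000000000108)  # Bank 5 value from the log above
    print(", ".join(info["flags"]))
    print(hex(info["error_code"]), hex(info["extended_error_code"]))
```

Under those assumptions the logged value has VAL, UC, EN, MISCV, ADDRV, PCC, TCC and SYNDV set, i.e. an uncorrected error with the processor context flagged as corrupt, which would fit an instant reset.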

Checking mcelog nets me this:

mcelog: ERROR: AMD Processor family 23: mcelog does not support this processor. Please use the edac_mce_amd module instead.
CPU is unsupported
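
The "family 23" in the mcelog message can be cross-checked against the `PROCESSOR 2:800f11` field in the kernel log. Below is a small sketch that decodes that signature using the standard x86 CPUID family/model/stepping layout, and converts the `TIME` field to UTC. The function names are invented for this example.

```python
from datetime import datetime, timezone


def decode_signature(sig: int) -> dict:
    """Decode a CPUID family/model/stepping signature (standard x86 layout)."""
    stepping = sig & 0xF
    base_model = (sig >> 4) & 0xF
    base_family = (sig >> 8) & 0xF
    ext_model = (sig >> 16) & 0xF
    ext_family = (sig >> 20) & 0xFF
    if base_family == 0xF:
        # Extended fields only apply when the base family is 0xF (AMD rule).
        family = base_family + ext_family
        model = (ext_model << 4) | base_model
    else:
        family, model = base_family, base_model
    return {"family": family, "model": model, "stepping": stepping}


def mce_time_utc(seconds: int) -> str:
    """Render the MCE TIME field (seconds since the Unix epoch) as UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).isoformat()


if __name__ == "__main__":
    print(decode_signature(0x800F11))  # signature from the PROCESSOR field
    print(mce_time_utc(1528443020))    # value from the TIME field
```

For the values in the log this gives family 23 (0x17), model 1, stepping 1, and a timestamp of 07:30:20 UTC on June 8, 2018.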

Trawling the interwebs turns up several suggestions: bad memory, early SKU defective Ryzen, power management problem, PSU issue.

This is a pre-built Dell, though I have added an extra 16GB DDR4-2400 with identical manufacturer (Kingston) and timings as the original DIMM. The PSU has been recently upgraded from the craptastic Dell 460w to a Corsair CX650M.

I have been using this system for virtualization, so it’s possible that I’ve misconfigured something. I set these kernel options in GRUB to get GPU passthrough working:

rd.driver.blacklist=nouveau modprobe.blacklist=nouveau nvidia-drm.modeset=1 resume=UUID=e2e831d7-33f5-4868-8976-4710eac595bf amd_iommu=on iommu=pt rd.driver.pre=vfio-pci vfio-pci.ids=10de:1c82,10de:0fb9 video=efifb:off rhgb quiet
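
When debugging options like these it can help to confirm what the running kernel actually booted with. The following is a rough sketch that parses a kernel command line into a dictionary. `parse_cmdline` and `CMDLINE` are names made up here, and the fallback string is simply the option list quoted above.

```python
def parse_cmdline(cmdline: str) -> dict:
    """Map each kernel option to its value; bare flags map to True."""
    options = {}
    for token in cmdline.split():
        # Split on the first "=" only, so "resume=UUID=..." keeps its value.
        key, sep, value = token.partition("=")
        options[key] = value if sep else True
    return options


# The option list from the post, used when /proc/cmdline is unavailable.
CMDLINE = (
    "rd.driver.blacklist=nouveau modprobe.blacklist=nouveau "
    "nvidia-drm.modeset=1 resume=UUID=e2e831d7-33f5-4868-8976-4710eac595bf "
    "amd_iommu=on iommu=pt rd.driver.pre=vfio-pci "
    "vfio-pci.ids=10de:1c82,10de:0fb9 video=efifb:off rhgb quiet"
)

if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as fh:  # what the kernel really booted with
            text = fh.read()
    except OSError:
        text = CMDLINE
    opts = parse_cmdline(text)
    for key in ("amd_iommu", "iommu", "vfio-pci.ids", "resume",
                "processor.max_cstate"):
        print(f"{key} = {opts.get(key, '(not set)')}")
```

This only reports what is set; it does not judge whether a given option is appropriate for the hardware.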

I’ve got to admit I’m not familiar with some of these options, but they seemed to be working fine for a few weeks. The GTX 1050 Ti that was installed is currently removed. I’m contemplating getting a more powerful GPU for running Windows games. Currently only an RX 580 is installed.

Before I go down the rabbit hole, any suggestions on what I should start testing in what order?

noenken · June 8, 2018, 8:34am

Personally I always disable C-States in the UEFI. Don’t know if you can on a Dell board.

imrazor · June 8, 2018, 8:49am

Nope. I can disable some S states in BIOS, but that’s it. I notice that there is a “resume” parameter in the kernel options that I didn’t put there. Could that be a problem?

imrazor · June 30, 2018, 9:18pm

So this happened again today. I researched this problem on AMD’s forum:
https://community.amd.com/thread/216084?start=180&tstart=0

and one possible fix is to disable C-states, in particular C6. My BIOS is locked down hard, but this script does allow for the disabling of C6:

Overclock your Ryzen CPU from Linux Wikis & How-to Guides

So you’ve probably heard of Ryzen Master? Except you happen to use Linux. What’s there to do? Well with this little python application and the help of MSR’s (Model Specific Registers) on Ryzen CPU’s you can manually overclock your Ryzen processor. While it’s running!

1. Load 2 kernel modules
These should come as standard on any distro
sudo modprobe msr cpuid

2. Download the following repo
git clone https://github.com/r4m0n/ZenStates-Linux.git
Now once that’s done it’s time to get well acquainted with your processor

3. Run zenstates
sudo ./zenstates.py -l

For a 1700X on stock settings you should see something like this
P0 - Enabled - FID = 88 - DID = 8 - VID = 20 - Ratio = 34.00 - vCore = 1.35000
P1 - Enabled - FID = 78 - DID = 8 - VID = 2C - Ratio = 30.00 - vCore = 1.27500
P2 - Enabled - FID = 84 - DID = C - VID = 68 - Ratio = 22.00 - vCore = 0.90000
P3 - Disabled
P4 - Disabled
P5 - Disabled
P6 - Disabled
P7 - Disabled
C6 State - Package - Enabled
C6 State - Core - Enabled

For a stock 1800X it looks like this:
P0 - Enabled - FID = 90 - DID = 8 - VID = 20 - Ratio = 36.00 - vCore = 1.35000
P1 - Enabled - FID = 80 - DID = 8 - VID = 2C - Ratio = 32.00 - vCore = 1.27500
P2 - Enabled - FID = 84 - DID = C - VID = 68 - Ratio = 22.00 - vCore = 0.90000
P3 - Disabled
P4 - Disabled
P5 - Disabled
P6 - Disabled
P7 - Disabled
C6 State - Package - Enabled
C6 State - Core - Enabled

Most Ryzen CPU’s only seem to support P-states P0, P1 and P2, keep this in mind. All values are set in hexadecimal values and require a bit of thought and math to use safely.

YOU CAN TOTALLY BREAK THINGS HERE! You and only you the reader take responsibility for performing any of the actions described here on your system. Don’t come crying to me if you entered the wrong values and flames came shooting out of your Hardware.

Now what are some of the numbers we see above?
FID – Frequency ID in Hexadecimal
VID – Voltage ID in Hexadecimal
DID – Divisor ID in Hexadecimal
R…
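
The Ratio and vCore columns in that listing can be reproduced from the hexadecimal FID, DID and VID values. The sketch below uses formulas inferred from the numbers quoted above (ratio = 2 × FID / DID, vCore = 1.55 V − 0.00625 V × VID). They match every row shown, but they are an assumption and are not taken from the zenstates.py source. The function names are made up for this example.

```python
def ratio(fid: int, did: int) -> float:
    """CPU multiplier implied by a frequency ID and divisor ID."""
    return 2.0 * fid / did


def vcore(vid: int) -> float:
    """Core voltage in volts implied by a voltage ID."""
    return round(1.55 - 0.00625 * vid, 5)


if __name__ == "__main__":
    # (P-state, FID, DID, VID) rows from the stock 1700X listing.
    rows = [("P0", 0x88, 0x8, 0x20),
            ("P1", 0x78, 0x8, 0x2C),
            ("P2", 0x84, 0xC, 0x68)]
    for name, fid, did, vid in rows:
        print(f"{name}: Ratio = {ratio(fid, did):.2f} - vCore = {vcore(vid):.5f}")
```

This only recomputes the displayed values; it does not read or write any MSR.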

So I’m going to give it a try and see if that eliminates the problem. The curious thing is that Windows 10 seems to operate fine. But I do use my Fedora partition a lot more than my Windows one.

catsay · June 30, 2018, 9:38pm

First thing to do is just reseating your CPU. And honestly if you still get the issue RMA the chip. Chances are good that you just have a faulty early production run CPU.

The MCE hard reset is a known silicon level problem. I don’t recommend the C6 workaround for this, since essentially your CPU can not be trusted if you have this issue.

It may sound scary, but do try and approach AMD support about an RMA.

imrazor · June 30, 2018, 9:40pm

Hmm, I bought the CPU as part of a pre-built. Would AMD issue an RMA for an OEM part? If not, I’m kind of screwed as this problem only seems to happen in Linux. And Dell’s position is that if the problem is only occurring under Linux on a system designed for Windows you’re SOL.

catsay · June 30, 2018, 9:46pm

Give the C6 workaround a try.
But also at the same time contact AMD and Dell about an RMA. Maybe even CC them.

The OS really shouldn’t matter if the hardware is potentially defective. It can’t hurt to try and be persistent about it. Make mention of the Linux Ryzen SEGV bug.

You can also try and run the ryzen kill script

You shouldn’t have to do these workarounds to get usable hardware with a good Ryzen CPU.

imrazor · June 30, 2018, 10:18pm

Well, I’d try to contact AMD, but I get this when I try to go to the contact webform:

emailcustomercare.amd.com uses an invalid security certificate.

The certificate is not trusted because the issuer certificate is unknown. The server might not be sending the appropriate intermediate certificates. An additional root certificate may need to be imported.

Error code: SEC_ERROR_UNKNOWN_ISSUER

LOL, that’s one way to keep your RMA costs down…

catsay · June 30, 2018, 10:23pm

Just mail direct to [email protected]

OR

https://support.amd.com/en-us/contact/email-form

And you will get a ticket number back, which a rep will respond to at some point.

Also taking this into consideration you may need to go via Dell.
https://support.amd.com/en-us/warranty/oem

Works fine here

imrazor · June 30, 2018, 11:05pm

The site loaded in Chromium OK, but then the form timed out. But I got a confirmation email from AMD anyway. We’ll see if they have any suggestions, or offer an RMA.

noenken · June 30, 2018, 11:49pm

I agree that this should be no problem in the first place.
But maybe give processor.max_cstate=1 a shot?
Works for my laptop.
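
To see whether a C-state limit actually took effect, the idle states the kernel exposes can be listed from sysfs. This is a minimal sketch that assumes the usual cpuidle layout under `/sys/devices/system/cpu/cpuN/cpuidle/stateM/`. `list_idle_states` is a name made up for this example, and which states appear depends on the idle driver in use.

```python
from pathlib import Path


def list_idle_states(cpu_dir) -> list:
    """Return (directory, name, disabled) for each cpuidle state of one CPU."""
    cpuidle = Path(cpu_dir) / "cpuidle"
    states = []
    # Sort numerically so state10 comes after state9, not after state1.
    for state in sorted(cpuidle.glob("state[0-9]*"),
                        key=lambda p: int(p.name[5:])):
        name = (state / "name").read_text().strip()
        disable_file = state / "disable"
        disabled = (disable_file.exists()
                    and disable_file.read_text().strip() != "0")
        states.append((state.name, name, disabled))
    return states


if __name__ == "__main__":
    for entry in list_idle_states("/sys/devices/system/cpu/cpu0"):
        print(entry)
```

Comparing the output before and after adding the kernel option should show whether the deeper states are still listed.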

imrazor · July 1, 2018, 1:32am

Thanks, I’ll give it a shot. But doesn’t that kill your battery life? Have you RMA’d it?

noenken · July 1, 2018, 8:57am

Well, it does shorten the battery life a little probably but it is still clocking down just sitting there doing nothing. I won’t RMA, it works perfectly otherwise and on a laptop … if that’s all that is wrong with it, I take it.

catsay · July 1, 2018, 12:21pm

In @noenken’s case it’s a different, kernel-related issue and not a hardware problem.

I also had a defective R7 1700X with the same MCE issue you have. It doesn’t just crash under Linux. Under the right circumstances it would also occur in Windows.

noenken’s issue is related to how the Linux kernel handles C-states and power management of the Raven Ridge Vega GPU and PCI-e bus when C-state changes occur.

Essentially the linux kernel developers haven’t got all the software code in place to make it work right.

noenken · July 1, 2018, 12:25pm

Oh wow, I didn’t realize that. I thought it was also something that would be done in a BIOS update at some point. Yeah, if it’s hardware then of course be loud about it, @imrazor.

catsay · July 1, 2018, 12:27pm

MCEs almost always (99.99%) mean there’s a hardware issue.

imrazor · July 1, 2018, 6:31pm

Great, I hate trying to coerce support out of companies like Dell and AMD. Since it’s a consumer product and not an enterprise PC, it’ll be even harder. I should’ve gotten a workstation, but I don’t think there are any Ryzen workstations out there yet.

I’ll wait for a response from AMD, then I’ll ping Dell and see if I get any useful response. In the meantime, I’ll try disabling C6 with the zenstates.py script, then all C-states with @noenken’s kernel option if that fails.

If I’m forced to, would getting a brand new retail first-gen Ryzen fix the issue?

EDIT: Oh, one other thing. Could this be a RAM issue? I’ve seen that mentioned in other threads.

catsay · July 1, 2018, 7:41pm

I’m almost 99.99% sure it’s not a RAM issue. If you tell me this also happens when your RAM is running at JEDEC (stock) spec then that makes the last 0.01% which makes it 100% sure.

imrazor · July 1, 2018, 10:36pm

Yep, my RAM is running at JEDEC speed (DDR4-2400). I don’t even have the option of overclocking RAM or XMP in the BIOS.

imrazor · July 2, 2018, 10:35pm

Got a prompt response from AMD. They want my serial number. Is there any way to get it without tearing apart the PC and removing the Dell cooler?