Ryzen 7 1700 machine check error - instant reboot

imrazor · June 30, 2018, 11:05pm

The site loaded in Chromium OK, but then the form timed out. But I got a confirmation email from AMD anyway. We’ll see if they have any suggestions, or offer an RMA.

noenken · June 30, 2018, 11:49pm

I agree that this should be no problem in the first place.
But maybe give processor.max_cstate=1 a shot?
Works for my laptop.
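
For anyone trying this, here is a minimal Python sketch, assuming a Linux system that exposes the boot command line at /proc/cmdline, to confirm after a reboot that the option actually made it onto the kernel command line. The helper name get_kernel_param is made up for illustration and is not part of any tool mentioned in this thread.

```python
from pathlib import Path


def get_kernel_param(cmdline: str, name: str):
    """Return the value of a `name=value` token on a kernel command line, or None."""
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == name and sep:
            return value
    return None


if __name__ == "__main__":
    # Assumption: /proc/cmdline holds the parameters the running kernel booted with.
    cmdline_path = Path("/proc/cmdline")
    if cmdline_path.exists():
        value = get_kernel_param(cmdline_path.read_text(), "processor.max_cstate")
        if value is None:
            print("processor.max_cstate is not set on the kernel command line")
        else:
            print(f"processor.max_cstate={value}")
```

This only shows what was passed at boot; it does not prove which idle states the CPU actually enters.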

imrazor · July 1, 2018, 1:32am

Thanks, I’ll give it a shot. But doesn’t that kill your battery life? Have you RMA’d it?

noenken · July 1, 2018, 8:57am

Well, it probably does shorten the battery life a little, but it is still clocking down just sitting there doing nothing. I won’t RMA; it works perfectly otherwise, and on a laptop … if that’s all that is wrong with it, I’ll take it.

catsay · July 1, 2018, 12:21pm

In @noenken’s case it’s a different, kernel-related issue and not a hardware problem.

I also had a defective R7 1700X with the same MCE issue you have. It doesn’t just crash under Linux; under the right circumstances it would also occur in Windows.

noenken’s issue is related to how the Linux kernel handles C-states and power management of the Raven Ridge Vega GPU and PCIe bus when C-state changes occur.

Essentially, the Linux kernel developers haven’t got all the code in place yet to make it work right.

noenken · July 1, 2018, 12:25pm

Oh wow, I didn’t realize that. I thought it was also something that would be done in a BIOS update at some point. Yeah, if it’s hardware then of course be loud about it, @imrazor.

catsay · July 1, 2018, 12:27pm

MCEs almost always (99.99%) mean there’s a hardware issue.

imrazor · July 1, 2018, 6:31pm

Great, I hate trying to coerce support out of companies like Dell and AMD. Since it’s a consumer product and not an enterprise PC, it’ll be even harder. I should’ve gotten a workstation, but I don’t think there are any Ryzen workstations out there yet.

I’ll wait for a response from AMD, then I’ll ping Dell and see if I get any useful response. In the meantime, I’ll try disabling C6 with the zenstates.py script, then all C-states with @noenken’s kernel option if that fails.

If I’m forced to, would getting a brand new retail first-gen Ryzen fix the issue?

EDIT: Oh, one other thing. Could this be a RAM issue? I’ve seen that mentioned in other threads.

catsay · July 1, 2018, 7:41pm

I’m almost 99.99% sure it’s not a RAM issue. If you tell me this also happens when your RAM is running at JEDEC (stock) spec then that makes the last 0.01% which makes it 100% sure.

imrazor · July 1, 2018, 10:36pm

Yep, my RAM is running at JEDEC speed (DDR4-2400). I don’t even have the option of overclocking RAM or XMP in the BIOS.

imrazor · July 2, 2018, 10:35pm

Got a prompt response from AMD. They want my serial number. Is there any way to get it without tearing apart the PC and removing the Dell cooler?

noenken · July 2, 2018, 10:59pm

Did you get Dell paperwork? Maybe in there?

imrazor · July 3, 2018, 12:17am

I kept the box and paperwork for about 30 days, then tossed it. I don’t think I’ll be able to tear down the PC until this weekend.

SgtAwesomesauce · July 3, 2018, 2:38am

Some (not many) BIOSes display the serial number of the CPU; might be worth checking.

catsay · July 3, 2018, 2:02pm

Got to take the cooler off.
That’s usually the only place it’s located, unless Dell printed or stored it somewhere else as well.

https://imgur.com/bwpKVrr

It should start with 9R6 and then xxxxxxx numbers.

imrazor · July 4, 2018, 3:11am

Seeking a little clarity here. I tried the kill-ryzen.sh script (on Ubuntu 17.04) you posted a couple of days ago to test for the Ryzen segfault bug. It ran for about an hour and a half without crashing, whereas the bug normally trips within 5 minutes or so. @catsay that makes me doubt that it really is your early adopter bug. I bought this PC in Jan 2018.

Another difference between my problem and the SEGV bug is that bug occurs under heavy load, whereas my crash always occurs when the system is idle, or being prodded out of idle by a mouse click. I thought it might be @noenken’s bug, but mine is not a PCIe error, but an MCE crash.

So I’m not really sure I’ve found the root of the problem. There have been a few theories about problems with PSUs not supporting C6, but that seems far-fetched. The system also came with a single 16GB stick. I bought a “system compatible” stick from Kingston (same as the OEM stick) with the same timings, but I’m starting to suspect it too.

Perhaps I should try Windows for a few days and see if I experience any crashes. If it truly is hardware, I should also see problems with Windows, which I just haven’t used much.

Jimster480 · July 4, 2018, 4:49am

Does it only occur when you have a lot of RAM in use?
I can tell you I have 32GB of DDR4 with real nice lean timings, and I have experienced a couple of weird hardware crashes when I’m at 25+GB of memory in use.
I only use Windows 10, so… it could be RAM also, but I have an R7 1700X and it’s not my OC or anything.
But I haven’t been able to reproduce it reliably, and recently I’ve had a very weird bug where opening any website that contains Flash (even if Flash is disabled) in any browser instantly takes down the entire system.

catsay · July 4, 2018, 7:54am

Many were affected by a number of issues.
SEGV is just the big one.

There were also:
Freezes - Instant lockups
Hard resets
Memory & Cache corruption issues leading to MCE events

These were caused by:
A number of C-state issues in silicon
Cache memory defects
Branch prediction problems
FMA3 load induced voltage drops

I have now tested many Ryzen systems under Linux, and of all those about 5% had C-state issues as a result of hardware defects, resolved only by an RMA.

Some chips that were not affected by the SEGV issue may still encounter the MCE events. The very second chip I had behaved exactly like yours. No MCEs under testing, but it would randomly hard reset or freeze at idle.

You can try increasing SoC voltage, disabling C6, limiting to C1 max. But in all honesty none of this should be necessary.
Perhaps your mainboard also just needs a BIOS update?
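
Before limiting or disabling C-states, it can help to see which idle states the kernel exposes at all. The following is a minimal Python sketch, assuming a Linux kernel with the cpuidle sysfs interface under /sys/devices/system/cpu/cpu0/cpuidle (the directory may be absent if no cpuidle driver is loaded). The function name list_idle_states is made up for illustration.

```python
from pathlib import Path

# Assumption: cpu0 is representative; other CPUs have their own cpuidle directory.
CPUIDLE_DIR = Path("/sys/devices/system/cpu/cpu0/cpuidle")


def list_idle_states(cpuidle_dir: Path = CPUIDLE_DIR):
    """Return (directory name, state name, disabled flag) for each idle state."""
    state_dirs = sorted(
        cpuidle_dir.glob("state[0-9]*"),
        key=lambda p: int(p.name[len("state"):]),  # numeric order: state2 before state10
    )
    states = []
    for state_dir in state_dirs:
        name = (state_dir / "name").read_text().strip()
        disabled = (state_dir / "disable").read_text().strip() == "1"
        states.append((state_dir.name, name, disabled))
    return states


if __name__ == "__main__":
    for dir_name, name, disabled in list_idle_states():
        status = "disabled" if disabled else "enabled"
        print(f"{dir_name}: {name} ({status})")
```

The sketch only reads the state list. Writing 1 to a state's disable file as root turns that state off for that CPU until it is re-enabled or the machine reboots, which is a different mechanism from both zenstates.py and the processor.max_cstate boot option.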

11142 · June 11, 2021, 12:00am

How did the story end? I have the same problem.

imrazor · June 11, 2021, 4:45am

Windows never had a problem with the CPU, so I took Linux off the machine. Six months or so later I put it back on, and no problem. I’ve been running the CPU overclocked for the past year or so. Recently I had to reinstate it as my main gaming rig. Doing nicely with a 2070 Super in both OSes.

Must’ve been a bug. Either the kernel code was corrected, or they wrote around a firmware/microcode bug. Very stable now.