Ryzen crashing while idle

Yeah, sorry for the lack of updates. I’ve been working through the pain of reboots and lost work over the last few days.

As a quick update:

  • AMD support talk: AMD said they are ready to RMA the CPU should I choose to do so, but they do not believe it’s the problem, which was also confirmed by swapping the CPU into my wife’s motherboard. The guy I talked to (located in Canada) tried to replicate the problem with Ubuntu 20.10, a 5900X and an Asus X570 Pro board, but had no luck: it ran stable for a week. He said it looks like the problem is with this particular motherboard, most likely with power delivery. In addition, he confirmed that this error IS NOT with the CPU; it is an event that the CPU catches and reports back. The error points to a timeout in data reception, which is probably caused by a slow power state change due to poor power delivery. Current state: they are waiting for me to exchange the board and report the results.
  • The MSI ticket was closed with a quick suggestion to proceed with an RMA; however, I need to do it through the store where I purchased the board, not with MSI directly.
  • I’ve talked to the store (I’m located in Germany, and everything is slowed down by the fact that my German is not that good just yet). They obviously need to have a look at the motherboard first, so I need to send it in, and then it will be either repaired or replaced.
  • Current status: I had a lot of work, so I was moving data to my laptop and only switched away from the desktop today. I’ll remove the motherboard and send it back tomorrow.

Thanks, appreciate your help. For now I think I’m good: I have an English-speaking support guy from ComputerUniverse, and luckily I ordered most of my system there. The CPUs were purchased at Mindfactory, but AMD works directly with me, should I need to replace the unit.

I also got my stuff from Mindfactory. Let us know if there’s anything new to your case.

In the meantime I was told to try overclocking or undervolting my RTX 2080 Super. Any thoughts on whether this is a good idea? I was told that any GPU is likely to produce power peaks even when idle.

Just a quick update from my side:

The store received my motherboard and confirmed there is no physical damage to it, so they sent it back to MSI for investigation, which is supposed to take ~14 days (counting from Friday 26th).

P.S. It turned out to be unbelievably painful to switch from my desktop to an Intel-based laptop… so I hope MSI will send a new motherboard soon.

Finally got my motherboard back yesterday. Wow, it was a tough several weeks working on a laptop again. I didn’t get any conclusion from either the store or MSI, just a status that it’s been repaired (not RMA’ed). I couldn’t find any visual indication that anything was done to the board, let’s see…

So, just another quick update. Again, I couldn’t find what’s different about the motherboard, but it has now been 3.5 days with absolutely default settings (only XMP and SVM enabled; idle current set to Auto, C-state control enabled, etc.) and it is running completely stable. I’m also now running kernel 5.11.5 (I saw some people on BZ reporting that it aggravates the problem) and it’s also perfectly fine. So, while I’d love to see at least 3-4 weeks of stability before drawing a conclusion, I’d say this “repair” worked and it was indeed the motherboard, as all other components remain the same.

Scratch that… just got system reset ffs…

So, given that the motherboard “whateveritwas” didn’t help, I’ve swapped my wife’s GPU (Gigabyte RTX 3070) with my RX 5700XT. So far my system hasn’t crashed; however, an interesting fact is that my wife’s PC rebooted on boot with an MCE error… I’ll try to run this configuration for at least a week to see what comes out of it. I’ve also updated to the latest AGESA, but it crashed just as well.

As for myself, I also had about 3 weeks with only one crash recently. The only thing I did was activate XMP, that’s it; everything else was left on default. Just a few days ago the crashes started again, but none in idle mode. Most crashes happen while playing games on Steam (black screen only), and my system also crashed while I was burning some data to a DVD last Friday. I came back to my PC to find the BSoD with the sad ASCII face :frowning:

I am currently running ComboAM4PIV2 because for my B550-A PRO MB there is currently only a beta available. Going to try that one as soon as it’s out of beta status.

Anyone tried the latest GRD v461.92 from NVIDIA already?

So, it’s been 9 days running my system with the nVidia RTX 3070 with zero issues. No tweaks to the BIOS or kernel boot parameters; everything is at default and running smooth and stable. Like I mentioned initially, I now seem to see similar (but not 100% the same) issues on my wife’s PC. However, the Windows driver seems to be better at handling hangs and other issues, so it manages to recover from some GPU crashes. The MCE error count stays at one so far. I’m more than happy to try other GPUs, but the current state of the market does not allow me to.
Can we do a quick check of which GPUs the people with problems are using? I have a slight suspicion about the PCI-E generation settings, especially in light of the latest USB dropout issues, which were also worked around by switching PCIe to gen3.
I’ll be switching back to my card on Wednesday so I can report and look at the problem. Since the Windows behavior is at least better, there is definitely room for improvement on the Linux side.

Update on my end. I decided to forgo RMAing the board and instead just bought a new one. I switched to an ASUS ROG Crosshair VIII Hero about two weeks ago and haven’t had any problems since. At this point I’m fairly confident it was the motherboard.

@agurenko Regarding your question on the GPU - I am using ZOTAC Gaming GeForce RTX 2080 SUPER Triple Fan-Edition.

What do you mean by saying ‘windows driver seems to be better handling hangs …’? Don’t you use any nVidia drivers on your wife’s PC at all?

@nerDrums I meant that I’m using only Linux and she is using only Windows, hence the Windows AMD driver seems to handle GPU resets better than the Linux amdgpu driver. Can you try switching your PCI-E port to gen3 mode and see if it makes a difference? I’m just guessing here, but I’m all out of options.

I’ve also opened the following defect: [mce] random reboots Machine Check: 0 Bank 5: bea0000000000108 (#1551) · Issues · drm / amd · GitLab
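
For anyone who wants to compare their own MCE values with the one in that issue title, here is a rough Python sketch that pulls the bank number and status value out of such a line and lists which of the generic x86 MCA status flags (bits 63 to 57) are set. The function names are made up for illustration, and the sketch deliberately does not try to interpret the AMD-specific fields or the meaning of the low 16-bit error code; that needs AMD’s own documentation.

```python
import re

# Architectural flag bits in the upper part of an x86 MCA status register.
STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register holds extra information
    58: "ADDRV",  # ADDR register holds a valid address
    57: "PCC",    # processor context may be corrupt
}


def parse_mce_line(line):
    """Extract (bank, status) from text like 'Machine Check: 0 Bank 5: bea0000000000108'."""
    match = re.search(r"Bank\s+(\d+):\s*([0-9a-fA-F]{16})", line)
    if not match:
        return None
    return int(match.group(1)), int(match.group(2), 16)


def decode_status(status):
    """Return the names of the set flag bits and the low 16-bit MCA error code."""
    flags = [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]
    return flags, status & 0xFFFF
```

For the value from the issue title this gives bank 5, the flags VAL, UC, EN, MISCV, ADDRV and PCC, and an error code of 0x0108. An uncorrected error with possibly corrupt processor context would at least fit a hard reset rather than a recoverable hang.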

Keep us posted with the results. At this point, I’m still convinced it’s a combination of factors, but since we don’t have any interested party, we’ll probably never know…

Cool, keep us updated on it.

Actually, I can’t really follow this instruction due to lack of knowledge :sweat_smile: I am a simple user, so before all this I never had to make any changes in my BIOS, which was a very big step for me. Tampering with the hardware is something completely different.

How would I switch this PCI-E port to gen3? I am using an MSI B550-A PRO motherboard with MSI Click BIOS 5.

In the BIOS, go to Settings → Advanced → PCI Subsystem Settings

and change PCI_E1 Gen Switch from Auto to Gen3.

I couldn’t find out whether it also switches the nvme port to gen3, but I would guess so.
Not sure about the B550 boards, but AFAIK nvme 0 and PCIe x16 port 0 are both connected to the CPU, and this is the switch for those. I have a separate switch for the PCIe ports connected to the chipset (X570 in my case), which you probably don’t have.
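
If you are on Linux, one way to check what the switch actually did (including whether the nvme port followed) is to read the link speed the kernel reports for each PCI device. Below is a minimal Python sketch, assuming the usual sysfs attributes `current_link_speed` and `max_link_speed` under `/sys/bus/pci/devices`; the helper names `pcie_gen` and `link_report` are just made up for this example.

```python
import re
from pathlib import Path

# Per-lane transfer rates (GT/s) defined for each PCIe generation.
GT_PER_SEC_TO_GEN = {2.5: 1, 5.0: 2, 8.0: 3, 16.0: 4, 32.0: 5}


def pcie_gen(speed_text):
    """Map a sysfs link-speed string such as '8.0 GT/s PCIe' to a generation number, or None."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)\s*GT/s", speed_text)
    if not match:
        return None
    return GT_PER_SEC_TO_GEN.get(float(match.group(1)))


def link_report(sysfs_root="/sys/bus/pci/devices"):
    """Return {device address: (current gen, max gen)} for devices that expose link speeds."""
    report = {}
    for dev in sorted(Path(sysfs_root).glob("*")):
        try:
            current = (dev / "current_link_speed").read_text()
            maximum = (dev / "max_link_speed").read_text()
        except OSError:
            continue  # not a PCIe device, or the attribute is not readable
        report[dev.name] = (pcie_gen(current), pcie_gen(maximum))
    return report
```

Calling `link_report()` and looking up the addresses of the GPU and the nvme drive (as shown by `lspci`) should show whether they now run at Gen3. Keep in mind that GPUs may lower their link speed while idle to save power, so it is probably best to check while the card is under load.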

@agurenko So, I checked and found the settings you mentioned, thanks for the guidance. But please tell me, what exactly does this do? What is the difference between Auto and Gen3?

Btw. sorry for the late answer. I was busy all week long but now it’s four days off for the Easter holidays. Hope you guys have a good weekend!

Auto is what MSI calls the default. Gen3 forces PCIe Gen3 speeds for the Primary GPU slot.
I still got one restart in the last week, but it has only happened once so far, so overall my stability has greatly improved. It does not look like a permanent solution, but no one is looking into the BZs or amdgpu reports, so we’ll have to live with that, I guess.

