AMD News Roundup: X399, Threadripper, Vega Demos, and More! (Early June 2017) | Level One Techs

Am still running into them, though less often still. With all 12 threads assigned, 3/10 compilation runs segfaulted, but since I'm too stupid to be able to generate core dumps for perusal, the best I can do is point out that there are a few fairly in-depth reports here: https://forums.gentoo.org/viewtopic-t-1061546-postdays-0-postorder-asc-start-175.html?sid=72fd9f884ff6a0898bd471e2146df1bc , https://community.amd.com/thread/215773

One thing about this that struck me as particularly odd was that some people got a huge decrease in the error rate when they disabled aslr (which is far from ideal, leaving aside that it doesn't make the problem go away entirely).

1 Like

I seriously doubt there is something wrong with GCC. Problem is observed on BSD too. Stock system have the bug. It sure looks like a hardware bug. Co-worker have a completely stock system, it has the problem when compiling.

http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/b48dd28447fc8ef62fbc963accd301557fd9ac20

https://community.amd.com/message/2796982

https://www.phoronix.com/forums/forum/hardware/processors-memory/955368-some-ryzen-linux-users-are-facing-issues-with-heavy-compilation-loads/page7#post955498

Quote (empatis mine):

Hi, Matt Dillon here. Yes, I did find what I believe to be a hardware issue with Ryzen related to concurrent operations. In a nutshell, for any given hyperthread pair, if one hyperthread is in a cpu-bound loop of any kind (can be in user mode), and the other hyperthread is returning from an interrupt via IRETQ, the hyperthread issuing the IRETQ can stall indefinitely until the other hyperthread with the cpu-bound loop pauses (aka HLT until next interrupt). After this situation occurs, the system appears to destabilize. The situation does not occur if the cpu-bound loop is on a different core than the core doing the IRETQ. The %rip the IRETQ returns to (e.g. userland %rip address) matters a LOT. The problem occurs more often with high %rip addresses such as near the top of the user stack, which is where DragonFly's signal trampoline traditionally resides. So a user program taking a signal on one thread while another thread is cpu-bound can cause this behavior. Changing the location of the signal trampoline makes it more difficult to reproduce the problem. I have not been able to completely mitigate it. When a cpu-thread stalls in this manner it appears to stall INSIDE the microcode for IRETQ. It doesn't make it to the return pc, and the cpu thread cannot take any IPIs or other hardware interrupts while in this state.

The bug is completely unrelated to overclocking. It is deterministically reproducable.

I sent a full test case off to AMD in April.

I should caution here that I only have ONE Ryzen system (1700X, Asus mobo), so its certainly possible that it is a bug in that system or a bug in DragonFly (though it seems unlikely given the particular hyperthread pairing characteristics of the bug). Only IRETQ seems to trigger it in the manner described above, which means that AMD can probably fix it with a microcode update.

-Matt

Sure looks like a hardware bug. Not too terrible though as it is probably fixable in a microcode update.

1 Like

This user found indications that it's specific to certain cpu/mobo combinations, and at least affected by the bios: https://community.amd.com/thread/215773#2803482

Interesting. I've certainly done loads of compiling with 4 ryzen systems now (1600x, 1700, 1700x, 1800x) and haven't tripped over this issue BUT I can say that all 4 of those systems are running custom voltage/memory speed settings. e.g. higher than stock soc voltage, dram voltage, and load line calibration.
That may be part of why I haven't had much luck.

I don't have any asus boards to test on -- most of my testing has been on an asrock taichi and a gigabyte x370 gaming 5.

It does look as though in that thread that some users encountered the issues on specific boards:
Finally, I updated the BIOS on the ASRock board to P2.30 (AGESA 1.0.0.4a), installed my R7-1700 and ran my tests in that configuration. With default BIOS settings, the R7-1700 was able to compile software all night with no segfault or hard lockup.
_ _
So I'm going to try to RMA my MSI Board as it seems to be the common denominator in my case for the lockups.

so it may not be a problem with the ryzen cpus per se, or because I've been rocking agesa 1006 for almost two weeks now (well really one week as I was gone to taiwan) and the 1004a I ran in the beta uefis.. that may also be a contributing factor as to why I haven't seen it myself

1 Like

Ah yes, who runs their AMD system at stock?!?

lol :smiley:

Seriously though, I have also read accounts on how setting higher LLC etc have improved the situation. One guy said that turning off all power saving features on his Asus board fixed the issue.

1 Like

Not sure which particular Msi board you have right now?
And what kind of overclocking, voltage numbers etc.
But the Nikos mosfets used on Msi boards arent that great unfortunatlly.
So i wouldn't be suprised if it were the motherboard that caused the lockups with an R7 OC'd.
If you did really push it.

That was a quote from the linked thread above.

And according to that post he had an MSI x370 Gaming Pro Carbon