One thing that struck me as particularly odd was that some people got a huge decrease in the error rate when they disabled ASLR (which is far from ideal, leaving aside that it doesn't make the problem go away entirely).
I seriously doubt there is something wrong with GCC. The problem is observed on BSD too. Stock systems have the bug. It sure looks like a hardware bug. A co-worker has a completely stock system, and it has the problem when compiling.
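For anyone comparing notes on the ASLR observation, here is a minimal sketch for checking the current ASLR mode. It assumes a Linux system, where the setting is exposed at /proc/sys/kernel/randomize_va_space; the helper names are made up for illustration and nothing here applies to the BSDs.

```python
# Sketch only: assumes Linux. describe_aslr/read_aslr are illustrative names.
ASLR_MODES = {
    0: "disabled",
    1: "stack, mmap base and VDSO randomized",
    2: "as mode 1, plus heap (brk) randomized",
}


def describe_aslr(raw):
    """Turn the raw contents of randomize_va_space into a readable label."""
    value = int(raw.strip())
    return ASLR_MODES.get(value, f"unknown mode {value}")


def read_aslr(path="/proc/sys/kernel/randomize_va_space"):
    """Read and describe the current ASLR mode (Linux only)."""
    with open(path) as f:
        return describe_aslr(f.read())


if __name__ == "__main__":
    try:
        print("ASLR:", read_aslr())
    except OSError as exc:
        # Not Linux, or /proc is not mounted.
        print("could not read ASLR setting:", exc)
```

Recording this value alongside each crash report would at least make the "ASLR on vs. off" comparisons easier to line up.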
Hi, Matt Dillon here. Yes, I did find what I believe to be a hardware issue with Ryzen related to concurrent operations.
In a nutshell, for any given hyperthread pair, if one hyperthread is in a cpu-bound loop of any kind (can be in user mode), and the other hyperthread is returning from an interrupt via IRETQ, the hyperthread issuing the IRETQ can stall indefinitely until the other hyperthread with the cpu-bound loop pauses (aka HLT until next interrupt). After this situation occurs, the system appears to destabilize. The situation does not occur if the cpu-bound loop is on a different core than the core doing the IRETQ.
The %rip the IRETQ returns to (e.g. userland %rip address) matters a LOT. The problem occurs more often with high %rip addresses such as near the top of the user stack, which is where DragonFly's signal trampoline traditionally resides. So a user program taking a signal on one thread while another thread is cpu-bound can cause this behavior. Changing the location of the signal trampoline makes it more difficult to reproduce the problem. I have not been able to completely mitigate it.
When a cpu-thread stalls in this manner it appears to stall INSIDE the microcode for IRETQ. It doesn't make it to the return pc, and the cpu thread cannot take any IPIs or other hardware interrupts while in this state.
The bug is completely unrelated to overclocking. It is deterministically reproducible.
I sent a full test case off to AMD in April.
I should caution here that I only have ONE Ryzen system (1700X, Asus mobo), so it's certainly possible that it is a bug in that system or a bug in DragonFly (though it seems unlikely given the particular hyperthread pairing characteristics of the bug). Only IRETQ seems to trigger it in the manner described above, which means that AMD can probably fix it with a microcode update.
-Matt
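Since the report above hinges on which logical CPUs form a hyperthread pair, here is a rough sketch for listing the sibling pairs. It assumes Linux, where each CPU's siblings are exposed in sysfs under topology/thread_siblings_list (DragonFly exposes topology differently); the function names are made up for illustration. It only reports topology and does not try to reproduce the stall.

```python
# Sketch only: assumes Linux sysfs. All function names are illustrative.
from pathlib import Path


def parse_cpu_list(text):
    """Parse a cpulist string such as "0,8" or "0-1,4" into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def sibling_groups(sysfs_root="/sys/devices/system/cpu"):
    """Return the distinct groups of logical CPUs that share a physical core."""
    groups = set()
    pattern = "cpu[0-9]*/topology/thread_siblings_list"
    for path in Path(sysfs_root).glob(pattern):
        groups.add(tuple(parse_cpu_list(path.read_text())))
    return sorted(groups)


def same_core(cpu_a, cpu_b, groups):
    """True if both logical CPUs appear in the same sibling group."""
    return any(cpu_a in g and cpu_b in g for g in groups)


if __name__ == "__main__":
    # Prints nothing if the sysfs files are absent (e.g. not on Linux).
    for group in sibling_groups():
        print("hyperthread siblings:", group)
```

With the pairs known, a test that pins a busy loop and a signal-heavy thread to the same pair, and then to different cores, could check whether the pairing dependence shows up on other hardware.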
Sure looks like a hardware bug. Not too terrible though as it is probably fixable in a microcode update.
Interesting. I've certainly done loads of compiling with 4 Ryzen systems now (1600X, 1700, 1700X, 1800X) and haven't tripped over this issue, BUT I can say that all 4 of those systems are running custom voltage/memory speed settings, e.g. higher than stock SoC voltage, DRAM voltage, and load line calibration. That may be part of why I haven't had much luck reproducing it.
I don't have any Asus boards to test on -- most of my testing has been on an ASRock Taichi and a Gigabyte X370 Gaming 5.
It does look as though some users in that thread encountered the issue on specific boards:
"Finally, I updated the BIOS on the ASRock board to P2.30 (AGESA 1.0.0.4a), installed my R7-1700 and ran my tests in that configuration. With default BIOS settings, the R7-1700 was able to compile software all night with no segfault or hard lockup."
"So I'm going to try to RMA my MSI Board as it seems to be the common denominator in my case for the lockups."
So it may not be a problem with the Ryzen CPUs per se. Also, I've been running AGESA 1006 for almost two weeks now (well, really one week, as I was away in Taiwan), and before that the 1004a in the beta UEFIs. That may also be a contributing factor as to why I haven't seen it myself.
Seriously though, I have also read accounts of how setting higher LLC etc. has improved the situation. One guy said that turning off all power saving features on his Asus board fixed the issue.
Which particular MSI board do you have right now? And what kind of overclocking, voltage numbers etc.? The Nikos MOSFETs used on MSI boards aren't that great, unfortunately, so I wouldn't be surprised if it were the motherboard that caused the lockups with an overclocked R7, if you really did push it.