Mitigations=off considered harmful or spurious SIGILL on AMD Zen4

After months of hunting spurious illegal instruction faults (where there were no illegal instructions) on an AMD Ryzen 7950X while compiling open source packages with GCC (for https://t2sde.org), I finally found the regular user-land code that triggers some pseudo-random speculative execution state corruption on Zen 4! The good news: running with mitigations enabled (I had mitigations=off for performance) or disabling SMT works around it for now. https://www.youtube.com/watch?v=1UnoBfw6soI I guess this is one of the first documented cases of regular user programs causing speculative execution marginalities! So much for running with mitigations=off for performance, even with your “own” trusted code … not anymore. I’ll continue to reduce the test case and hope to eventually reach AMD and linux-kernel to find the root cause of this … :-/
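For anyone who wants to check whether their own box matches the setup described above (mitigations=off with SMT active), here is a minimal Python sketch. It assumes a Linux system that exposes /proc/cmdline and /sys/devices/system/cpu/smt/active; the function names are made up for illustration, and treating a missing mitigations= option as "auto" reflects the kernel default rather than anything stated in this post.

```python
from pathlib import Path


def mitigations_setting(cmdline: str) -> str:
    """Return the value of the mitigations= boot option.

    Falls back to "auto" (the kernel default) when the option is absent.
    Assumption: if the option appears more than once, the last one is used.
    """
    value = "auto"
    for token in cmdline.split():
        if token.startswith("mitigations="):
            value = token.split("=", 1)[1]
    return value


def smt_enabled(smt_active: str) -> bool:
    """Interpret the content of /sys/devices/system/cpu/smt/active ("0" or "1")."""
    return smt_active.strip() == "1"


def matches_reported_setup(cmdline: str, smt_active: str) -> bool:
    """True when both conditions from the post hold: mitigations=off and SMT on."""
    return mitigations_setting(cmdline) == "off" and smt_enabled(smt_active)


def read_or_none(path: str):
    """Read a small text file, returning None if it does not exist or is unreadable."""
    try:
        return Path(path).read_text()
    except OSError:
        return None


if __name__ == "__main__":
    cmdline = read_or_none("/proc/cmdline")
    smt = read_or_none("/sys/devices/system/cpu/smt/active")
    if cmdline is None or smt is None:
        print("could not read kernel command line or SMT state")
    else:
        print("mitigations =", mitigations_setting(cmdline))
        print("SMT active  =", smt_enabled(smt))
        print("matches reported setup:", matches_reported_setup(cmdline, smt))
```

This only reports the configuration. It says nothing about whether a given CPU is actually affected.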

6 Likes

Wow, that is crazy. I wonder how many others have run into similar issues and did not attribute it to Speculative Execution bugs.

@rener Is there a significant performance hit between mitigations on and mitigations off? I am still running the heavy machinery architecture on my desktop. My laptop is Zen 3+ (6800U), but I have to run MS Windows for school and I have not been able to get T2 or any GNU/Linux to run properly on the laptop just yet (TPM 2.0 issues, I think). BSD seems to work fine on the laptop, if a bit slow, though. haha.

3 Likes

Thankfully, on Zen 4 the performance impact of enabling mitigations is currently barely measurable for my use case. On Zen 3 and Zen 2 it had grown to quite a few percent after years of accumulated mitigations. In the meantime a kernel dev commented that it could also be a generic TLB bug in the Linux kernel. So the research will continue …

2 Likes

I will give it a try on my Ryzen 7 6800U and my FX-6300 and see what happens.

It’s probably not a speculation problem at all but a correctness problem in AMD’s CPU cores. A similar problem was recently found in Zen 2 (Zenbleed).

It stems from the CPU core logic being fundamentally unsound and causing corruption of the CPU’s internal state. Speculation just introduces a ton of added complexity which exposes the bug. Pile on memory barriers and it masks the corruption.

The guy who found it did so through fuzzing: he ran the same code with and without a ton of memory barriers before and after each instruction. Any time the results differed, it meant CPU state had been corrupted.
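The comparison idea can be sketched in a few lines of Python. This is only a toy model of differential testing, not the actual fuzzer: Python cannot emit memory barriers, so `reference_sum` stands in for the heavily serialized run and `buggy_sum` for the normal run with a deliberately planted fault. All names here are hypothetical.

```python
import random


def find_divergences(programs, run_fast, run_serialized):
    """Return every program for which the two executions disagree."""
    return [p for p in programs if run_fast(p) != run_serialized(p)]


def reference_sum(program):
    """Stand-in for the trusted, fully serialized execution."""
    return sum(program)


def buggy_sum(program):
    """Stand-in for the fast execution, with a planted fault:
    a 13 that directly follows a 7 is silently dropped."""
    total = 0
    for prev, cur in zip([None] + program, program):
        if prev == 7 and cur == 13:
            continue
        total += cur
    return total


def random_programs(count, length, seed=0):
    """Generate reproducible random test inputs (lists of small integers)."""
    rng = random.Random(seed)
    return [[rng.randint(0, 15) for _ in range(length)] for _ in range(count)]


if __name__ == "__main__":
    cases = random_programs(count=2000, length=8)
    bad = find_divergences(cases, buggy_sum, reference_sum)
    print(f"{len(bad)} of {len(cases)} cases diverged")
```

The point is that no separate oracle for the "right" answer is needed. Any disagreement between the two runs is already a finding.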

I’d report this to AMD immediately.

He did, and has had no feedback for almost a month, I believe.

1 Like

I would not call it “fundamentally unsound”. It is likely “just” some SMT sibling state resource confusion, or misspeculation, or similar low-level details that are difficult to get 100% right on overly complex CISC machines … Of course I reported it long ago through multiple channels (AMD website, LinkedIn contacts, and LKML). However, so far the only productive thing from any AMD employee was: “LKML: Borislav Petkov: Re: [RFC] AMD Zen4 CPU bug? Spurious SMT Sibling Invalid Opcode Speculation”

2 Likes

This became AMD Zen 4 Erratum 1485. The fix went into x86/urgent and has been merged upstream: kernel/git/tip/tip.git

4 Likes

Awesome work!

2 Likes

I wonder if Wendell will be able to patch the server with this soon?

If one runs with the default mitigations, or explicitly with only spectre_v2_user=on and thus STIBP, this also appeared to prevent the illegal speculation bug from occurring, so most production servers are probably safe :wink: Maybe Zen4c was already independently unaffected, or maybe newer Epyc microcode helps … I don’t usually have a ton of those around …
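To see whether STIBP is in effect on a running system, the kernel reports it in /sys/devices/system/cpu/vulnerabilities/spectre_v2. The Python sketch below pulls the STIBP field out of that line. The helper names are made up, the sample strings in the comments are illustrative rather than copied from a real machine, and treating "forced" and "always-on" as the states where STIBP is active for all tasks is my reading, not something confirmed in this thread.

```python
import re
from pathlib import Path

SPECTRE_V2_PATH = "/sys/devices/system/cpu/vulnerabilities/spectre_v2"


def stibp_state(spectre_v2_line: str):
    """Extract the STIBP field from the kernel's spectre_v2 status line.

    Works for both ", " and "; " separated fields, for example
    "Mitigation: Retpolines; IBPB: conditional; STIBP: always-on; RSB filling".
    Returns None when the line has no STIBP field.
    """
    match = re.search(r"STIBP:\s*([\w-]+)", spectre_v2_line)
    return match.group(1) if match else None


def stibp_active_for_all_tasks(state) -> bool:
    """Assumption: 'forced' and 'always-on' mean STIBP is not opt-in per task."""
    return state in ("forced", "always-on")


if __name__ == "__main__":
    try:
        line = Path(SPECTRE_V2_PATH).read_text().strip()
    except OSError:
        print("spectre_v2 status not available on this system")
    else:
        state = stibp_state(line)
        print("kernel reports:", line)
        print("STIBP state:", state)
        print("active for all tasks:", stibp_active_for_all_tasks(state))
```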

1 Like

Are we sure that mitigations=off is still more performant than leaving them on? Especially on newer processors, there have been a lot of performance improvements the last year and maybe those were designed around these mitigations.

1 Like

Definitely a good point. It’s going to depend heavily on the exact CPU model and the specific mitigations enabled or disabled, and whether your workload actually ends up affected, as opposed to benchmarks that hammer one thing in particular. I’ve come across claims that turning off particular mitigations causes a decrease in performance on newer models.

I don’t know of any comprehensive testing and comparison though.
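Lacking published numbers, the practical option is to time your own workload once per boot configuration and compare. A rough Python sketch of that, assuming you run it once with mitigations on and once with them off and then compare the two medians by hand; the `python -c pass` command is only a placeholder for a real build step.

```python
import statistics
import subprocess
import sys
import time


def median_runtime(command, runs=3):
    """Run a command several times and return the median wall-clock time in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(
            command,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def percent_slower(baseline_seconds, candidate_seconds):
    """Relative slowdown of candidate versus baseline, in percent."""
    return (candidate_seconds - baseline_seconds) / baseline_seconds * 100.0


if __name__ == "__main__":
    # Placeholder workload; substitute the build or job you actually care about.
    seconds = median_runtime([sys.executable, "-c", "pass"], runs=3)
    print(f"median runtime: {seconds:.3f} s")
```

Note the result for each boot configuration and feed the two medians into `percent_slower`. A short synthetic command like the placeholder will mostly measure process startup, so use something representative.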

In @rener’s case, he is cross-compiling for many different CPU architectures at once. His use case is not the normal user’s use case, but he does get an uplift in performance by turning off the mitigations. I think it is less than 10%, but when you are compiling for 22 different architectures, that small percentage makes a difference.

2 Likes

yes

32-ish different architectures and build variants (glibc, musl, x32, …) :wink:

1 Like