AMD News Roundup: X399, Threadripper, Vega Demos, and More! (Early June 2017) | Level One Techs

foppe · June 9, 2017, 11:34am

@wendell: you said something about 1006 providing more power (to the cpu? soc? memory?) in some cases, could you say a bit more about what that was about? (Specifically, might it impact the segfault issues some people have been running into while compiling with most/all cores?)

wendell · June 9, 2017, 11:49am

The FMA3 instruction bug was fixed with a bundled microcode update.
http://techreport.com/news/31621/amd-readies-a-fix-for-ryzen-fma3-bug

The segfault issues are, I think, way overblown and likely down to a faulty compiler or a bad overclock (don't forget that running memory over 2666 is technically an overclock).

From what I have read so far, the issues people are reporting are classic memory problems, if not a compiler bug. I have been unable to reproduce them on kernel 4.10 and newer distros: Debian testing, Arch, and Fedora 25 and 26.

sanfordvdev · June 9, 2017, 12:06pm

So the IOMMU has been fixed, or is it going to be fixed in future updates? About to possibly order a Ryzen board and CPU.

mihawk90 · June 9, 2017, 12:08pm

It's been fixed for the most part, provided your motherboard manufacturer has shipped AGESA 1006 in a UEFI/BIOS update.

foppe · June 9, 2017, 12:16pm

Okay.
Dunno if it's overblown or not, but I ran into a few of them myself with everything at stock, cpu@auto (f25 VM on Qubes, r5 1600 icw AB350 Pro4 & cmk16gx4m2b3000c15 -- about 50% chance during a kernel compile lasting ~13m, with 10 threads assigned to that VM), and my situation actually improved since activating XMP, and OCing to 3.7GHz @ 1.325v.
I've just updated my BIOS to the new one, with AGESA 1006 support; going to do a bit more testing later.

sanfordvdev · June 9, 2017, 12:17pm

What is still lacking, since you said "for the most part"?

mihawk90 · June 9, 2017, 12:20pm

More on that here:

I don't know of any specific bugs, but I'm not all that much into it in the first place. Considering it's rather new I wouldn't expect it to be perfect though, that's what I meant.

wendell · June 9, 2017, 12:29pm

Try it and let us know please. I haven't been able to reproduce.
If it still fails, try upping your SoC voltage to 1.1 V.

Atatax · June 9, 2017, 1:56pm

I'm going to be so disappointed if Radeon RX Vega doesn't deliver at least GTX 1080-level performance. It takes so much willpower to not buy an Nvidia card right now... I want to support AMD and keep competition alive, but it's taking so long...

catsay · June 9, 2017, 3:27pm

You don't need it now. If what you've got is better than mine, you'll be fine.

I've been fine using a GTX650 Ti on my Ryzen 1700x work desktop for ages now. It runs most games worth playing just fine at 1080p.

Mind you, it is an atrocious flaming garbage card overclocked past any legal limits, but it works just fine; I just need to spend less time gaming and more time being productive. Maybe sometime later this year it'll get replaced by an RX Vega card.

If I get really impatient I'd replace it with an RX580; can't really go wrong with those for the price.

mihawk90 · June 9, 2017, 3:30pm

If you can get a hold of them lol.

Like 110% of the resellers are out of stock here. Damn miners...

catsay · June 9, 2017, 3:32pm

I'm in South Africa, we've only got actual miners here.
So lots of RX580s are available direct from China. Dat BRICS deal sometimes works.

Power is too expensive to waste on mining virtual money that you can't really use anywhere.
Unless you've got solar panels, that is, but very few do.

mihawk90 · June 9, 2017, 3:33pm

You could open a dealership for European customers then huehue

foppe · June 9, 2017, 4:53pm

Am still running into them, though less often still. With all 12 threads assigned, 3/10 compilation runs segfaulted, but since I'm too stupid to be able to generate core dumps for perusal, the best I can do is point out that there are a few fairly in-depth reports here: https://forums.gentoo.org/viewtopic-t-1061546-postdays-0-postorder-asc-start-175.html?sid=72fd9f884ff6a0898bd471e2146df1bc , https://community.amd.com/thread/215773

One thing about this that struck me as particularly odd was that some people got a huge decrease in the error rate when they disabled ASLR (which is far from ideal, leaving aside that it doesn't make the problem go away entirely).
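For anyone who wants to put a number like "3/10 runs segfaulted" on their own machine, here is a minimal Python sketch that repeats a command and tallies the outcomes. It is only an illustration, not a tool anyone in this thread used. The function names are made up, and the assumption that a crashing compiler leaves the text "Segmentation fault" in the build output should be checked against what your toolchain actually prints. Because the error rate reportedly changes with ASLR, it also records the kernel's `randomize_va_space` setting (Linux only) alongside the counts.

```python
import signal
import subprocess

# Assumed marker text; adjust to whatever your toolchain actually prints.
SEGFAULT_MARKER = "segmentation fault"
ASLR_PATH = "/proc/sys/kernel/randomize_va_space"  # Linux only


def aslr_mode(path=ASLR_PATH):
    """Return the kernel ASLR setting (0 = off, 1 or 2 = on), or None if unreadable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def classify(returncode, output):
    """Label one run as 'ok', 'segfault', or 'other'."""
    if returncode == 0:
        return "ok"
    # A child killed by SIGSEGV shows up as a negative return code. A compiler
    # crashing inside make usually only shows up in the captured output.
    if returncode == -signal.SIGSEGV or SEGFAULT_MARKER in output.lower():
        return "segfault"
    return "other"


def run_trials(cmd, runs):
    """Run cmd `runs` times and tally the outcomes together with the ASLR mode."""
    counts = {"ok": 0, "segfault": 0, "other": 0}
    for _ in range(runs):
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            errors="replace",
        )
        counts[classify(proc.returncode, proc.stdout)] += 1
    return {"aslr": aslr_mode(), **counts}
```

A hypothetical call would be `run_trials(["sh", "-c", "make clean && make -j12"], 10)`. The clean step matters, since otherwise a successful first build leaves nothing to recompile on later runs.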

Pholostan · June 9, 2017, 5:20pm

I seriously doubt there is something wrong with GCC. The problem is observed on BSD too. Stock systems have the bug. It sure looks like a hardware bug. A co-worker has a completely stock system, and it has the problem when compiling.

http://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/b48dd28447fc8ef62fbc963accd301557fd9ac20

https://community.amd.com/message/2796982

https://www.phoronix.com/forums/forum/hardware/processors-memory/955368-some-ryzen-linux-users-are-facing-issues-with-heavy-compilation-loads/page7#post955498

Quote (emphasis mine):

Hi, Matt Dillon here. Yes, I did find what I believe to be a hardware issue with Ryzen related to concurrent operations. In a nutshell, for any given hyperthread pair, if one hyperthread is in a cpu-bound loop of any kind (can be in user mode), and the other hyperthread is returning from an interrupt via IRETQ, the hyperthread issuing the IRETQ can stall indefinitely until the other hyperthread with the cpu-bound loop pauses (aka HLT until next interrupt). After this situation occurs, the system appears to destabilize. The situation does not occur if the cpu-bound loop is on a different core than the core doing the IRETQ. The %rip the IRETQ returns to (e.g. userland %rip address) matters a LOT. The problem occurs more often with high %rip addresses such as near the top of the user stack, which is where DragonFly's signal trampoline traditionally resides. So a user program taking a signal on one thread while another thread is cpu-bound can cause this behavior. Changing the location of the signal trampoline makes it more difficult to reproduce the problem. I have not been able to completely mitigate it. When a cpu-thread stalls in this manner it appears to stall INSIDE the microcode for IRETQ. It doesn't make it to the return pc, and the cpu thread cannot take any IPIs or other hardware interrupts while in this state.

The bug is completely unrelated to overclocking. It is deterministically reproducable.

I sent a full test case off to AMD in April.

I should caution here that I only have ONE Ryzen system (1700X, Asus mobo), so its certainly possible that it is a bug in that system or a bug in DragonFly (though it seems unlikely given the particular hyperthread pairing characteristics of the bug). Only IRETQ seems to trigger it in the manner described above, which means that AMD can probably fix it with a microcode update.

-Matt

Sure looks like a hardware bug. Not too terrible though as it is probably fixable in a microcode update.
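Since the quoted report says the stall only happens when both threads sit on the same core, anyone trying to test this needs to know which logical CPUs are hyperthread siblings. The Python sketch below reads that from Linux sysfs. It is not Matt's test case and does not reproduce the bug. The function names are illustrative, and it assumes the standard `thread_siblings_list` topology files are present.

```python
import glob

# Standard Linux sysfs location for per-CPU SMT sibling lists.
SYSFS_GLOB = "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"


def parse_cpu_list(text):
    """Parse a sysfs CPU list such as '0-1' or '0,8' into sorted CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def sibling_groups(pattern=SYSFS_GLOB):
    """Return the distinct groups of logical CPUs that share a physical core."""
    groups = set()
    for path in glob.glob(pattern):
        with open(path) as f:
            groups.add(tuple(parse_cpu_list(f.read())))
    return sorted(groups)
```

With the groups in hand, a CPU-bound process and a signal-heavy process could be pinned to one sibling pair (for example with `taskset -c` or `os.sched_setaffinity`) and then to two different cores for comparison.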

foppe · June 9, 2017, 5:48pm

This user found indications that it's specific to certain cpu/mobo combinations, and at least affected by the bios: https://community.amd.com/thread/215773#2803482
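If the problem really does depend on the CPU/board/BIOS combination, reports are much easier to compare when each one lists those details. Here is a small Python sketch that collects them on Linux. The function names are made up for illustration. It reads the BIOS version rather than the AGESA version, because as far as I know the AGESA version is not exposed directly through these files.

```python
import platform

DMI_DIR = "/sys/class/dmi/id"  # Linux DMI/SMBIOS attributes
DMI_FIELDS = ("board_vendor", "board_name", "bios_version", "bios_date")


def read_text(path):
    """Return the stripped contents of a file, or None if it cannot be read."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


def parse_cpuinfo(text):
    """Pull 'model name' and 'microcode' from the first CPU entry of /proc/cpuinfo text."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key in ("model name", "microcode") and key not in info:
            info[key] = value.strip()
    return info


def system_report():
    """Collect kernel, CPU, board, and BIOS details for a bug report."""
    report = {"kernel": platform.release()}
    report.update(parse_cpuinfo(read_text("/proc/cpuinfo") or ""))
    for field in DMI_FIELDS:
        report[field] = read_text(f"{DMI_DIR}/{field}")
    return report
```

Note that inside a VM the DMI fields usually describe the hypervisor's virtual board, so this is best run on the host.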

wendell · June 9, 2017, 6:09pm

Interesting. I've certainly done loads of compiling with four Ryzen systems now (1600X, 1700, 1700X, 1800X) and haven't tripped over this issue, BUT I can say that all four of those systems are running custom voltage/memory speed settings, e.g. higher than stock SoC voltage, DRAM voltage, and load line calibration.
That may be part of why I haven't had much luck.

I don't have any Asus boards to test on -- most of my testing has been on an ASRock Taichi and a Gigabyte X370 Gaming 5.

It does look as though some users in that thread encountered the issues on specific boards:
"Finally, I updated the BIOS on the ASRock board to P2.30 (AGESA 1.0.0.4a), installed my R7-1700 and ran my tests in that configuration. With default BIOS settings, the R7-1700 was able to compile software all night with no segfault or hard lockup."
"So I'm going to try to RMA my MSI Board as it seems to be the common denominator in my case for the lockups."

So it may not be a problem with the Ryzen CPUs per se. I've also been running AGESA 1006 for almost two weeks now (well, really one week, as I was away in Taiwan), and 1004a in the beta UEFIs before that, so that may be another contributing factor in why I haven't seen it myself.

Pholostan · June 9, 2017, 7:25pm

Ah yes, who runs their AMD system at stock?!?

lol

Seriously though, I have also read accounts of how setting higher LLC etc. has improved the situation. One guy said that turning off all power saving features on his Asus board fixed the issue.

MisteryAngel · June 9, 2017, 11:28pm

Not sure which particular MSI board you have right now?
And what kind of overclocking, voltage numbers etc.
But the Nikos MOSFETs used on MSI boards aren't that great, unfortunately.
So I wouldn't be surprised if it were the motherboard that caused the lockups with an R7 OC'd.
If you did really push it.

mihawk90 · June 9, 2017, 11:49pm

That was a quote from the linked thread above.

And according to that post he had an MSI X370 Gaming Pro Carbon.