Ryzen Pre-Week 25 fabrication RMA issue

Disabling ASLR is working a little too well, which is odd given that it did next to nothing for the kill-ryzen GCC builds. Those usually failed at around 2/6 or 28 hours.
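
For anyone wanting to try the same thing: on Linux, ASLR is controlled through the kernel.randomize_va_space sysctl. A minimal sketch, assuming a stock sysctl setup:

```bash
# Check the current ASLR setting (2 = full randomization, the usual default)
cat /proc/sys/kernel/randomize_va_space

# Disable ASLR system-wide until reboot (or until you set it back)
sudo sysctl -w kernel.randomize_va_space=0

# Re-enable full randomization afterwards
sudo sysctl -w kernel.randomize_va_space=2
```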

But I fear it’s only moved the criteria for a problem to occur further away.

But yeah. It’s usually the Instruction Pointer that at some point goes wildly off track.

I should add that I have two Ryzen 1700X systems.
One of my CPUs basically has all of the problems: segfaults, MCEs, reboots, freezes, VTT drop. Whereas my other 1700X has none of those.
Three of those are solved by just manually setting the RAM timings + voltages to their stock values.
The segfaults and the VTT drop under load are something else, though.
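
In case anyone wants to check their own box for the same symptoms: the machine-check events (MCEs) end up in the kernel log, so something along these lines (assuming journalctl or dmesg access) should show them:

```bash
# Look for machine-check events in the kernel log (systemd journal)
sudo journalctl -k | grep -iE 'mce|machine check'

# Or, without systemd, straight from the kernel ring buffer
sudo dmesg | grep -iE 'mce|machine check'
```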

Nope, ASLR is no good. Like I thought, it only moves the criteria.

And a second build passed again.

So, I’m too tired/lazy/stupid to figure out the multithreaded way right now, but I can recreate the error with Pamac even on the fixed voltages up to 1.375 V outlined above. (I’m not gonna push 1.4 V or more into a 65 W TDP CPU.) So yeah, gonna start the RMA.
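
If anyone wants to kick off the same build from a terminal instead of the Pamac GUI, pamac also has a CLI build command. Just a sketch, assuming AUR support is enabled and that blender-git is the package the GUI was building:

```bash
# Build the Blender AUR package the same way the GUI does
# (blender-git is an assumption; swap in whatever package you actually use)
pamac build blender-git
```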

Actually not the worst time, to be honest. The rig is currently still open, and I will take this chance to downgrade to a 1600X. I am hoping to get the same gaming + recording performance as with my 1700 when overclocked, but with a little less heat.

@catsay I will gladly run your script when I’m not tired/lazy anymore.

Gnight.

Yeah, I’ll get to it at some point. Probably tomorrow evening or Thursday morning.

Currently tired as hell too and slogging through some obscure-as-hell CPU doc I found.
Hopefully AMD has an RMA chip for me at some point, but it seems like they are real busy over there.

The instructions for Fedora on the nav panel are for 21-22. On Fedora 26, I get something like “fatal error: iostream not found”. The dependencies do not compile successfully, so I can’t even try compiling Blender.
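
Side note: that particular error usually just means the C++ standard headers aren’t installed. On Fedora, pulling in gcc-c++ before re-running the dependency script might be enough (this is a guess at the cause, not a confirmed fix):

```bash
# "iostream ... not found" on a fresh Fedora install usually means the C++ headers are missing
sudo dnf install gcc-c++
```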

Blender is a huge project. Installing all of those dependencies just to run the Ryzen test is error-prone and causes all sorts of other issues. For example…

Trying to install all the dependencies actually took a very long time (an hour or so), during which some part of the dependency script updated the entire Fedora 26 RTM installation, downloaded hundreds of MB from the interwebs, and even installed a new kernel. This is bad for people with terrible internet speeds and for those just wanting a small, reliable test, which is exactly what the kill-ryzen.sh GCC test provides.
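
For reference, the whole idea of that test is just to run several GCC builds in parallel and note how long it takes for one of them to die. A stripped-down sketch of the approach (this is NOT the actual kill-ryzen.sh; the paths, loop count and -j value are made up):

```bash
#!/bin/bash
# Minimal sketch of the parallel-GCC-build stress idea (not the real kill-ryzen.sh).
# Assumes a GCC source tree is already unpacked into ./gcc-src
# (and that ./contrib/download_prerequisites was run inside it).
set -u
START=$(date +%s)

build_loop() {
    local id=$1
    local dir="build-$id"
    while true; do
        rm -rf "$dir" && mkdir "$dir" && cd "$dir" || exit 1
        if ! ../gcc-src/configure --disable-multilib > configure.log 2>&1 || \
           ! make -j4 > build.log 2>&1; then
            echo "[loop $id] build failed after $(( $(date +%s) - START )) s"
            exit 1
        fi
        cd ..
    done
}

# Run a few build loops in parallel and stop the clock at the first failure
for i in 1 2 3 4; do
    build_loop "$i" &
done
wait -n                       # bash >= 4.3: returns when the first background job exits
kill $(jobs -p) 2>/dev/null   # stop the remaining loops
echo "First failure after $(( $(date +%s) - START )) s"
```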

| GCC test # | Fedora 26 kernel | RAM | Failure time (s) |
|---|---|---|---|
| 1 | Ubuntu VM | 24 GB | 380 |
| 2 | RTM | 32 GB | 240 |
| 3 | RTM | 32 GB | 240 |
| 4 | RTM | 32 GB | 280 |
| 5 | RTM | 32 GB | 42 |
| 6 | RTM | 32 GB | 239 |
| 7 | RTM | 32 GB | 242 |
| 8 | RTM | 16 GB | 298 |
| 9 | RTM | 16 GB | 288 |
| 10 | RTM | 32 GB | 239 |
| 11 | RTM | 32 GB | 239 |
| 12 | RTM | 16 GB | 180 |
| 13 | New | 32 GB | 948 |
| 14 | New | 32 GB | 1525 |
| 15 | RTM | 32 GB | 2842 |
| 16 | RTM | 32 GB | ~180 (kernel panic) |
| 17 | New | 32 GB | no crash (2 hrs) |
| 18 | RTM | 32 GB | 3638 (~1 hr) |

RTM means kernel 4.11.8-300.fc26.x86_64 (the Fedora 26 install as shipped, no updates).
New means kernel 4.12.9-300.fc26.x86_64 (after applying updates).
Ubuntu VM means an Ubuntu guest running on a Windows host.
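
To check which of those two kernels a given run is actually on, the exact version string is easy to pull up:

```bash
# Which kernel is this boot running?
uname -r        # e.g. 4.11.8-300.fc26.x86_64 ("RTM") or 4.12.9-300.fc26.x86_64 ("New")

# All kernels currently installed on a Fedora box
rpm -q kernel
```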

Conclusion: it should be pretty obvious from the test numbers and failure times when updates were applied. Using a newer kernel/software only masks the problem. This is not a software issue; software-level mitigations can only prolong the expected time to failure. The chips are still defective and need to be RMA’d.

If Blender requires/defaults to a newer kernel and updated packages, that might be why the GCC test does not fail in a timely manner. I would advise against creating a kill-ryzen.sh-style script based on Blender, since installing the dependencies takes a long time (especially for users with poor net speeds) and the newer kernels/software installed by default only mask the issue.

If you cannot get the GCC test to “work”, try using a fresh install, no updates.

Oh believe me, the GCC kill-ryzen test works.
It just takes aaages on this machine. I’m talking sometimes 14 to 28 hours.

Whereas I had a blender git repo already available to build with that would segfault much more reliably (just minutes).

Just checked a picture I took when I got my Ryzen, and it turns out it might be a faulty one :frowning:
Any idea how long the RMA process takes? (I haven’t tested it yet, though.)

Can confirm. I just used the Manjaro GUI package manager to start the build. On fixed settings I get Error 2 or a segfault at around 5-7% of the process, and that is with all dependencies already in place. Just for lolz I did the same thing on my Core m Skylake notebook and it went through like a charm.

It takes a while.

It’s always a good idea to start sooner rather than later.


I intend to. Even though I don’t have any problems now, I’d rather RMA my CPU while I still can, before I’m facing any issues.
That being said, is there some kind of cut-off date up until which RMAs of this kind are accepted? Or does the option lapse when the warranty expires?

I specifically cited this issue, and my batch number, and the RMA was approved “just like that.” Also, my replacement was shipped as soon as I dropped off the box and gave them the FedEx tracking number (or so I was told).

However, from what I understand, none of this is “official.” Not sure if even the defect itself is “official.”

You’d probably have a much harder time of it if you waited that long. Personally, I’d return it now, but it’s up to you to decide if you’d rather leave well enough alone.


You need to test it. You cannot know if yours is or is not affected without testing it. Instructions for testing are available, just scroll up.

It appears that a significant number of defective chips were shipped, but there are also a lot of chips that do not have the issue.


Hmm, I will have to try that on mine. I gave up after 8-10 hours.

I’m aware of that, besides I wouldn’t want to replace a working CPU :wink:

I tried to, but as it turns out there might be some problems

Edit: You mentioned something about testing it in a VM. Is it enough to download and install Fedora 26 in VirtualBox, grant the guest all cores, and run the above test?
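
The core assignment can also be done from the host’s command line, if that’s easier. A sketch, assuming the VM is named “Fedora26” (the name is made up):

```bash
# Give the powered-off VM as many vCPUs as the host exposes
VBoxManage modifyvm "Fedora26" --cpus "$(nproc)"

# Confirm the setting took
VBoxManage showvminfo "Fedora26" | grep -i 'Number of CPUs'
```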

I had something like that once, too. But I was testing out settings for CPU and memory so I just thought whatever and hit reset. Never saw it again but it might be an indicator… maybe?

Thought I’d ask the obvious question here, since I read somewhere that Threadripper is based on multiple 1800X dies ‘glued’ together (I may be a tad off there). In any case, has anyone tried similar testing on the 1920X/1950X?

Clarification: what I meant to say was, has anyone seen segfaults on Threadripper similar to this, irrespective of fabrication date?

Well, I didn’t really change anything to begin with, except enabling virtualisation support (SVM? something like that?). I don’t know why ASRock keeps disabling it; I need to re-enable it after every BIOS update.
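
For what it’s worth, you can check from inside Linux whether the firmware actually left SVM (AMD’s virtualisation extension) enabled. A rough check, assuming the kvm_amd module is available:

```bash
# Does the CPU advertise AMD-V (SVM) at all?
lscpu | grep -i 'virtualization'     # should report AMD-V

# If the firmware has SVM disabled, loading KVM usually complains in the kernel log
sudo modprobe kvm_amd
sudo dmesg | grep -i kvm
```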

My problem might be non-QVL RAM though…

From the Phoronix article linked by the OP:

> AMD has confirmed this issue doesn’t affect EPYC or Threadripper processors […]


Interesting… Cheers @pFtpr

Oh, yeah, been there. xD