Ryzen Pre-Week 25 fabrication RMA issue

Disabling ASLR is working a little too well, which is odd given that it did next to nothing for the kill-ryzen GCC builds. Those usually failed at around 2/6 or 28 hours.
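
For anyone wanting to try the same thing: on Linux, ASLR is controlled through the kernel.randomize_va_space sysctl. A minimal sketch, assuming a stock sysctl setup:

```bash
# Check the current ASLR setting (2 = full randomization, the usual default)
cat /proc/sys/kernel/randomize_va_space

# Disable ASLR system-wide until reboot (or until you set it back)
sudo sysctl -w kernel.randomize_va_space=0

# Re-enable full randomization afterwards
sudo sysctl -w kernel.randomize_va_space=2
```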

But I fear it’s only moved the criteria for a problem to occur further away.

But yeah. It’s usually the Instruction Pointer that at some point goes wildly off track.

I should add that I have two Ryzen 1700X systems.
One of my CPUs basically has all of the problems: segfaults, MCEs, reboots, freezes, VTT drop. Whereas my other 1700X has none of those.
Three of those are solved by just manually setting the RAM timings + voltages to their stock values.
The segfaults and the VTT drop under load are something else, though.
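
In case anyone wants to check their own box for the same symptoms: the machine-check events (MCEs) end up in the kernel log, so something along these lines (assuming journalctl or dmesg access) should show them:

```bash
# Look for machine-check events in the kernel log (systemd journal)
sudo journalctl -k | grep -iE 'mce|machine check'

# Or, without systemd, straight from the kernel ring buffer
sudo dmesg | grep -iE 'mce|machine check'
```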

Nope, ASLR is no good. Like I thought, it only moves the criteria.

And a second build passed again.

So, I’m too tired/lazy/stupid to figure out the multithreaded way right now, but I can recreate the error with Pamac even on the fixed voltages up to 1.375 V outlined above. (I’m not gonna push 1.4 V or more into a 65 W TDP CPU.) So yeah, gonna start the RMA.
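
If anyone wants to kick off the same build from a terminal instead of the Pamac GUI, pamac also has a CLI build command. Just a sketch, assuming AUR support is enabled and that blender-git is the package the GUI was building:

```bash
# Build the Blender AUR package the same way the GUI does
# (blender-git is an assumption; swap in whatever package you actually use)
pamac build blender-git
```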

Actually not the worst time, to be honest. The rig is currently still open, and I will take this chance to downgrade to a 1600X. I am hoping to get the same gaming + recording performance as with my 1700 when overclocked, but with a little less heat.

@catsay I will gladly run your script when I’m not tired/lazy anymore.

Gnight.

Yeah, I’ll get to it at some point. Probably tomorrow evening or Thursday morning.

Currently tired as hell too and slogging through some obscure-as-hell CPU doc I found.
Hopefully AMD has an RMA chip for me at some point, but it seems like they are real busy over there.

The instructions for Fedora on the nav panel are for 21-22. On Fedora 26, I get something like “fatal error: iostream not found”. The dependencies do not compile successfully, so I can’t even try compiling Blender.
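
Side note: that particular error usually just means the C++ standard headers aren’t installed. On Fedora, pulling in gcc-c++ before re-running the dependency script might be enough (this is a guess at the cause, not a confirmed fix):

```bash
# "iostream ... not found" on a fresh Fedora install usually means the C++ headers are missing
sudo dnf install gcc-c++
```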

Blender is a huge project. Installing all of those dependencies just to run the Ryzen test is error-prone and causes all sorts of other issues. For example…

Trying to install all the dependencies actually took a very long time (an hour or so), during which some part of the dependency script updated the entire Fedora 26 RTM installation, downloaded hundreds of MB from the interwebs, and even installed a new kernel. This is bad for people with terrible internet speeds and for those just wanting a small, reliable test, which is exactly what the kill-ryzen.sh GCC test provides.
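
For reference, the whole idea of that test is just to run several GCC builds in parallel and note how long it takes for one of them to die. A stripped-down sketch of the approach (this is NOT the actual kill-ryzen.sh; the paths, loop count and -j value are made up):

```bash
#!/bin/bash
# Minimal sketch of the parallel-GCC-build stress idea (not the real kill-ryzen.sh).
# Assumes a GCC source tree is already unpacked into ./gcc-src
# (and that ./contrib/download_prerequisites was run inside it).
set -u
START=$(date +%s)

build_loop() {
    local id=$1
    local dir="build-$id"
    while true; do
        rm -rf "$dir" && mkdir "$dir" && cd "$dir" || exit 1
        if ! ../gcc-src/configure --disable-multilib > configure.log 2>&1 || \
           ! make -j4 > build.log 2>&1; then
            echo "[loop $id] build failed after $(( $(date +%s) - START )) s"
            exit 1
        fi
        cd ..
    done
}

# Run a few build loops in parallel and stop the clock at the first failure
for i in 1 2 3 4; do
    build_loop "$i" &
done
wait -n                       # bash >= 4.3: returns when the first background job exits
kill $(jobs -p) 2>/dev/null   # stop the remaining loops
echo "First failure after $(( $(date +%s) - START )) s"
```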

| GCC test # | Fedora 26 kernel | RAM | Failure time (s) |
|---|---|---|---|
| 1 | Ubuntu VM | 24 GB | 380 |
| 2 | RTM | 32 GB | 240 |
| 3 | RTM | 32 GB | 240 |
| 4 | RTM | 32 GB | 280 |
| 5 | RTM | 32 GB | 42 |
| 6 | RTM | 32 GB | 239 |
| 7 | RTM | 32 GB | 242 |
| 8 | RTM | 16 GB | 298 |
| 9 | RTM | 16 GB | 288 |
| 10 | RTM | 32 GB | 239 |
| 11 | RTM | 32 GB | 239 |
| 12 | RTM | 16 GB | 180 |
| 13 | New | 32 GB | 948 |
| 14 | New | 32 GB | 1525 |
| 15 | RTM | 32 GB | 2842 |
| 16 | RTM | 32 GB | ~180 (kernel panic) |
| 17 | New | 32 GB | no crash (2 hrs) |
| 18 | RTM | 32 GB | 3638 (~1 hr) |

RTM means kernel 4.11.8-300.fc26.x86_64 (the Fedora 26 install as shipped, no updates).
New means kernel 4.12.9-300.fc26.x86_64 (after applying updates).
Ubuntu VM means an Ubuntu guest running on a Windows host.
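
To check which of those two kernels a given run is actually on, the exact version string is easy to pull up:

```bash
# Which kernel is this boot running?
uname -r        # e.g. 4.11.8-300.fc26.x86_64 ("RTM") or 4.12.9-300.fc26.x86_64 ("New")

# All kernels currently installed on a Fedora box
rpm -q kernel
```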

Conclusion: it should be pretty obvious from the test numbers and failure times when updates were applied. Using a newer kernel/software only masks the problem. This is not a software issue; software-level mitigations can only prolong the expected time to failure. The chips are still defective and need to be RMA’d.

If Blender requires/defaults to a newer kernel and updated packages, that might be why the GCC test does not fail in a timely manner. I would advise against creating a kill-ryzen.sh-style script based on Blender, since installing the dependencies takes a long time (especially for users with poor net speeds) and the newer kernels/software installed by default only mask the issue.

If you cannot get the GCC test to “work”, try using a fresh install, no updates.

Oh believe me, the GCC kill-ryzen test works.
It just takes aaages on this machine. I’m talking sometimes 14 to 28 hours.

Whereas I had a blender git repo already available to build with that would segfault much more reliably (just minutes).

Just checked a picture I took when I got my Ryzen, and it turns out it might be a faulty one :frowning:
Any idea how long the RMA process takes? (I haven’t tested it yet, though.)

Can confirm. I just used the Manjaro GUI package manager to start the build. On fixed settings I get Error 2 or a segfault at around 5-7% of the process, and that is with all dependencies already in place. Just for lolz I did the same thing on my Core m Skylake notebook and it went through like a charm.

It takes a while.

It’s always a good idea to start sooner rather than later.


I intend to. Even though I don’t have any problems now, I’d rather RMA my CPU while I still can, before I’m facing any issues.
That being said, is there some kind of cut-off date up until which RMAs of this kind are accepted? Or does the option lapse when the warranty expires?

I specifically cited this issue, and my batch number, and the RMA was approved “just like that.” Also, my replacement was shipped as soon as I dropped off the box and gave them the FedEx tracking number (or so I was told).

However, from what I understand, none of this is “official.” Not sure if even the defect itself is “official.”

You’d probably have a much harder time of it if you waited that long. Personally, I’d return it now, but it’s up to you to decide if you’d rather leave well enough alone.


You need to test it. You cannot know if yours is or is not affected without testing it. Instructions for testing are available, just scroll up.

It appears that a significant number of defective chips were shipped, but there are also a lot of chips that do not have the issue.


Hmm, I will have to try that on mine. I gave up after 8-10 hours.

I’m aware of that, besides I wouldn’t want to replace a working CPU :wink:

I tried to, but as it turns out there might be some problems

Edit: You mentioned something about testing it in a VM. Is it enough to download and install Fedora 26 in VirtualBox, grant the guest all cores, and run the above test?
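
The core assignment can also be done from the host’s command line, if that’s easier. A sketch, assuming the VM is named “Fedora26” (the name is made up):

```bash
# Give the powered-off VM as many vCPUs as the host exposes
VBoxManage modifyvm "Fedora26" --cpus "$(nproc)"

# Confirm the setting took
VBoxManage showvminfo "Fedora26" | grep -i 'Number of CPUs'
```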

I had something like that once, too. But I was testing out settings for CPU and memory so I just thought whatever and hit reset. Never saw it again but it might be an indicator… maybe?

Thought I’d ask the obvious question here, since I read somewhere that Threadripper is based on multiple 1800X dies ‘glued’ together (I may be a tad off there). In any case, has anyone tried similar testing on the 1920X/1950X?

Clarification: what I meant to say was, has anyone seen segfaults on Threadripper similar to this, irrespective of fabrication date?

Well, I didn’t really change anything to begin with, except enabling virtualisation support (SVM? something like that?). I don’t know why ASRock keeps disabling it; I need to re-enable it after every BIOS update.
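
For what it’s worth, you can check from inside Linux whether the firmware actually left SVM (AMD’s virtualisation extension) enabled. A rough check, assuming the kvm_amd module is available:

```bash
# Does the CPU advertise AMD-V (SVM) at all?
lscpu | grep -i 'virtualization'     # should report AMD-V

# If the firmware has SVM disabled, loading KVM usually complains in the kernel log
sudo modprobe kvm_amd
sudo dmesg | grep -i kvm
```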

My problem might be non-QVL RAM though…

From the Phoronix article linked by the OP:

> AMD has confirmed this issue doesn’t affect EPYC or Threadripper processors […]


Interesting… Cheers @pFtpr

Oh, yeah, been there. xD