Ryzen Pre-Week 25 fabrication RMA issue

I've been reading this thread for a while now. Today I took the cooler off the CPU and checked the serial, part, and batch numbers: UA 1709PGT on a Ryzen 7 1700. SHIT!
I was curious, so I stopped my work, git cloned the kill-ryzen script, and started the test.
Seems like I have a bad one.

Extract GCC sources
Download prerequisites
2017-10-01 15:10:35 URL: ftp://gcc.gnu.org/pub/gcc/infrastructure/gmp-6.1.0.tar.bz2 [2383840] -> “./gmp-6.1.0.tar.bz2” [1]
2017-10-01 15:10:40 URL: ftp://gcc.gnu.org/pub/gcc/infrastructure/mpfr-3.1.4.tar.bz2 [1279284] -> “./mpfr-3.1.4.tar.bz2” [1]
2017-10-01 15:10:45 URL: ftp://gcc.gnu.org/pub/gcc/infrastructure/mpc-1.0.3.tar.gz [669925] -> “./mpc-1.0.3.tar.gz” [1]
2017-10-01 15:10:50 URL: ftp://gcc.gnu.org/pub/gcc/infrastructure/isl-0.16.1.tar.bz2 [1626446] -> “./isl-0.16.1.tar.bz2” [1]
gmp-6.1.0.tar.bz2: OK
mpfr-3.1.4.tar.bz2: OK
mpc-1.0.3.tar.gz: OK
isl-0.16.1.tar.bz2: OK
All prerequisites downloaded successfully.
cat /proc/cpuinfo | grep -i -E "(model name|microcode)"
model name : AMD Ryzen 7 1700 Eight-Core Processor
microcode : 0x8001129
model name : AMD Ryzen 7 1700 Eight-Core Processor
microcode : 0x8001129
model name : AMD Ryzen 7 1700 Eight-Core Processor
microcode : 0x8001129
model name : AMD Ryzen 7 1700 Eight-Core Processor
microcode : 0x8001129
model name : AMD Ryzen 7 1700 Eight-Core Processor
microcode : 0x8001129
model name : AMD Ryzen 7 1700 Eight-Core Processor
microcode : 0x8001129
model name : AMD Ryzen 7 1700 Eight-Core Processor
microcode : 0x8001129
model name : AMD Ryzen 7 1700 Eight-Core Processor
microcode : 0x8001129
model name : AMD Ryzen 7 1700 Eight-Core Processor
microcode : 0x8001129
model name : AMD Ryzen 7 1700 Eight-Core Processor
microcode : 0x8001129
model name : AMD Ryzen 7 1700 Eight-Core Processor
microcode : 0x8001129
model name : AMD Ryzen 7 1700 Eight-Core Processor
microcode : 0x8001129
model name : AMD Ryzen 7 1700 Eight-Core Processor
microcode : 0x8001129
model name : AMD Ryzen 7 1700 Eight-Core Processor
microcode : 0x8001129
model name : AMD Ryzen 7 1700 Eight-Core Processor
microcode : 0x8001129
model name : AMD Ryzen 7 1700 Eight-Core Processor
microcode : 0x8001129
sudo dmidecode -t memory | grep -i -E "(rank|speed|part)" | grep -v -i unknown
Speed: 2134 MHz
Part Number: F4-3200C16-16GTZSW
Rank: 2
Configured Clock Speed: 1067 MHz
Speed: 2134 MHz
Part Number: F4-3200C16-16GTZSW
Rank: 2
Configured Clock Speed: 1067 MHz
Speed: 2134 MHz
Part Number: F4-3200C16-16GTZSW
Rank: 2
Configured Clock Speed: 1067 MHz
Speed: 2134 MHz
Part Number: F4-3200C16-16GTZSW
Rank: 2
Configured Clock Speed: 1067 MHz
uname -a
Linux COMPUTERNAME 4.11.0-14-generic #20~16.04.1-Ubuntu SMP Wed Aug 9 09:06:22 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
cat /proc/sys/kernel/randomize_va_space
2
/ /mnt/ramdisk/workdir
/mnt/ramdisk/workdir
Using 16 parallel processes
Hint: You are currently not seeing messages from other users and the system.
Users in the ‘systemd-journal’ group can see all messages. Pass -q to
turn off this notice.
No journal files were opened due to insufficient permissions.
[loop-0] Sun Oct 1 15:10:51 CEST 2017 start 0
[loop-1] Sun Oct 1 15:10:52 CEST 2017 start 0
[loop-2] Sun Oct 1 15:10:53 CEST 2017 start 0
[loop-3] Sun Oct 1 15:10:54 CEST 2017 start 0
[loop-4] Sun Oct 1 15:10:55 CEST 2017 start 0
[loop-5] Sun Oct 1 15:10:56 CEST 2017 start 0
[loop-6] Sun Oct 1 15:10:57 CEST 2017 start 0
[loop-7] Sun Oct 1 15:10:58 CEST 2017 start 0
[loop-8] Sun Oct 1 15:10:59 CEST 2017 start 0
[loop-9] Sun Oct 1 15:11:00 CEST 2017 start 0
[loop-10] Sun Oct 1 15:11:01 CEST 2017 start 0
[loop-11] Sun Oct 1 15:11:02 CEST 2017 start 0
[loop-12] Sun Oct 1 15:11:03 CEST 2017 start 0
[loop-13] Sun Oct 1 15:11:04 CEST 2017 start 0
[loop-14] Sun Oct 1 15:11:05 CEST 2017 start 0
[loop-15] Sun Oct 1 15:11:06 CEST 2017 start 0
> [loop-9] Sun Oct 1 15:12:31 CEST 2017 build failed
> [loop-9] TIME TO FAIL: 100 s
> [loop-10] Sun Oct 1 15:12:52 CEST 2017 build failed
> [loop-10] TIME TO FAIL: 121 s
> [loop-1] Sun Oct 1 15:49:45 CEST 2017 build failed
> [loop-1] TIME TO FAIL: 2334 s
> [loop-2] Sun Oct 1 18:05:58 CEST 2017 build failed
> [loop-2] TIME TO FAIL: 10507 s
> [loop-3] Sun Oct 1 18:08:18 CEST 2017 build failed
> [loop-3] TIME TO FAIL: 10647 s
> [loop-5] Sun Oct 1 18:38:33 CEST 2017 build failed
> [loop-5] TIME TO FAIL: 12462 s
[loop-6] Sun Oct 1 18:46:49 CEST 2017 start 1
[loop-4] Sun Oct 1 18:46:50 CEST 2017 start 1
[loop-7] Sun Oct 1 18:47:04 CEST 2017 start 1
[loop-13] Sun Oct 1 18:47:16 CEST 2017 start 1
[loop-0] Sun Oct 1 18:47:21 CEST 2017 start 1
[loop-15] Sun Oct 1 18:47:24 CEST 2017 start 1
[loop-12] Sun Oct 1 18:47:26 CEST 2017 start 1
[loop-8] Sun Oct 1 18:47:36 CEST 2017 start 1
[loop-14] Sun Oct 1 18:47:39 CEST 2017 start 1
> [loop-7] Sun Oct 1 18:48:04 CEST 2017 build failed
> [loop-7] TIME TO FAIL: 13033 s
[loop-11] Sun Oct 1 18:48:14 CEST 2017 start 1
> [loop-12] Sun Oct 1 19:21:48 CEST 2017 build failed
> [loop-12] TIME TO FAIL: 15057 s
> [loop-13] Sun Oct 1 19:21:48 CEST 2017 build failed
> [loop-13] TIME TO FAIL: 15057 s
> [loop-11] Sun Oct 1 21:35:50 CEST 2017 build failed
> [loop-11] TIME TO FAIL: 23099 s
[loop-6] Sun Oct 1 21:39:22 CEST 2017 start 2
[loop-4] Sun Oct 1 21:39:30 CEST 2017 start 2
[loop-15] Sun Oct 1 21:39:34 CEST 2017 start 2
[loop-8] Sun Oct 1 21:39:34 CEST 2017 start 2
[loop-14] Sun Oct 1 21:39:46 CEST 2017 start 2
[loop-0] Sun Oct 1 21:39:53 CEST 2017 start 2
> [loop-8] Sun Oct 1 23:59:22 CEST 2017 build failed
> [loop-8] TIME TO FAIL: 31711 s
[loop-6] Mon Oct 2 00:32:24 CEST 2017 start 3
[loop-15] Mon Oct 2 00:32:39 CEST 2017 start 3
[loop-4] Mon Oct 2 00:32:48 CEST 2017 start 3
[loop-0] Mon Oct 2 00:32:48 CEST 2017 start 3
[loop-14] Mon Oct 2 00:33:02 CEST 2017 start 3
> [loop-15] Mon Oct 2 00:34:17 CEST 2017 build failed
> [loop-15] TIME TO FAIL: 33806 s

Test was run under Linux Mint 18.2 KDE, kernel 4.11.0-14, no OC, all settings in EFI at default, with the latest EFI downloaded as of today.
RAM: 64GB (4x16) at default 2133 MHz; Mobo: X370 Taichi
Number of parallel processes: 16; Test duration: 9h 02min; Number of fails: 12
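
The fail count can be tallied from a saved copy of the script's output rather than counted by hand. A minimal sketch, assuming the output was captured to a file (the name `kill-ryzen.log` is hypothetical) and that failures show up as `build failed` lines, as in the log above:

```shell
# count_failures: read kill-ryzen.sh output on stdin and print the number of
# "build failed" lines. Assumes the log format shown in the post above.
count_failures() {
    grep -c 'build failed'
}

# failed_loops: print the distinct loop ids that failed, one per line.
failed_loops() {
    grep 'build failed' | sed -n 's/.*\[\(loop-[0-9]*\)\].*/\1/p' | sort -u
}

# Usage (hypothetical file name):
#   count_failures < kill-ryzen.log
#   failed_loops   < kill-ryzen.log
```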

I also have another Ryzen chip in my server waiting to be tested, R3 1200 with batch number UA 1724PGT. Tomorrow will be a long day :tired_face:


ouch. that fucking sucks

Do you have 32GB of RAM?

You might be running out of memory. If the ramdisk says 64 capacity / 64 utilized then the test will fail once it reaches that mem cap.

df -h

I point it out because I happen to have a 1700 with 32GB of RAM. I received a replacement (UA 1733SUS), and with the script at default settings it also runs out of memory at ~15,000 seconds, or 4-5 hrs, every time, depending on OC settings.

There are some options on the ryzen-test GitHub page that lower the memory usage, so you can test for longer periods using fewer GCC compiling instances. The GCC test can manifest the concurrency bug in a few minutes in the majority of cases.
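
To check whether the ramdisk is actually filling up during a run, something like the following could be used. This is only a sketch: the mount point `/mnt/ramdisk` is taken from the log in the first post, and the helper name `ramdisk_use_percent` is made up for illustration.

```shell
# ramdisk_use_percent: read `df -P <mountpoint>` output on stdin and print the
# use percentage of the first listed filesystem as a bare number.
ramdisk_use_percent() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Usage, assuming the ramdisk is mounted at /mnt/ramdisk:
#   df -P /mnt/ramdisk | ramdisk_use_percent
#   watch -n 60 df -h /mnt/ramdisk     # re-check once a minute during a run
```

If the number sits near 100 when the loops start failing, the failures are more likely an out-of-space problem than a CPU fault.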


Low memory should not be an issue. I have 64GB.
I also got the first fails after about a minute of testing, and then it continues with random fails.

It is defective then. :frowning:

If the crashes happen all at once with memory full, then it is memory, but if they are spread out with the first few very soon, then there is instability in the system.
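
That rule of thumb can be checked roughly against a saved log by looking at how far apart the TIME TO FAIL values are. A minimal sketch, assuming the log format from the first post; the function name is hypothetical and the interpretation is a heuristic, not a diagnosis:

```shell
# ttf_spread: read kill-ryzen.sh output on stdin and print
# "<count> <min> <max>" for the TIME TO FAIL values (seconds).
ttf_spread() {
    awk '/TIME TO FAIL:/ {
             t = $(NF - 1) + 0          # value just before the trailing "s"
             if (n == 0 || t < min) min = t
             if (n == 0 || t > max) max = t
             n++
         }
         END { if (n) print n, min, max; else print 0, "-", "-" }'
}

# Usage (hypothetical file name):
#   ttf_spread < kill-ryzen.log
```

A small minimum with a large maximum means failures scattered across the run; a minimum and maximum close together points to everything dying at once.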

I'm going to test it for more hours overnight and tomorrow. If I still get fails, my only solution will be to ask for a replacement, I guess. The 1700 was bought at the end of April, the 1200 at the beginning of August.

cite the batch number, your test results, and this thread in your RMA request. should be approved quick. they even shipped my replacement as soon as I confirmed the defective one was in the mail (didn’t wait to receive+inspect).

was able to reproduce segfaults, but other times the script would just fail out (build failed) without segfault.

RMA’d and now with no segfaults (yet), but it runs for under a minute before failing. 8 2 and 4 4 seem to make no difference. On either chip, never got past about 5 minutes.

32gb ram ran through typical memtest to completion, no issues.

trying to determine the issue here.

ie, there is no [KERN] … segfault
just TIME TO FAIL

all build loops fail from 0 to n successively within a few seconds of each other. ~30 sec for 4 4 or 8 2, and 200 seconds or so for all threads, like clockwork.

well, running testRyzenGCC sh for now and without the same issues, and without a segfault…

still don’t know what’s up with the one on here… I would still like to determine why.

Instead of typing “./kill-ryzen.sh” and then entering your sudo passphrase when asked,
try it the other way by typing “sudo ./kill-ryzen.sh” directly. This should give you more details about the error messages; at least for me it does.

I think it’s because of this:

Hint: You are currently not seeing messages from other users and the system.
Users in the ‘systemd-journal’ group can see all messages.

With “./kill-ryzen.sh” and then sudo:

[loop-2] Sat Oct 7 17:33:36 CEST 2017 build failed
[loop-2] TIME TO FAIL: 120 s

With “sudo ./kill-ryzen.sh”:

[loop-6] Sat Oct 7 17:44:30 CEST 2017 build failed
[loop-6] TIME TO FAIL: 149 s
[KERN] Oct 07 17:44:30 Gundam-Exia kernel: bash[9471]: segfault at 1973dbe ip 00007f29009cfcc2 sp 00007ffe5757c000 error 6 in libc-2.23.so[7f2900869000+1c0000]
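
The kernel messages can also be checked separately after a run. A small sketch, assuming the segfault lines contain the text `segfault at` as in the [KERN] line above; the helper name is made up:

```shell
# count_segfaults: read kernel log lines on stdin and print how many report a
# segfault, matching the "segfault at" text seen in the [KERN] line above.
count_segfaults() {
    grep -c 'segfault at'
}

# Usage (needs root, or membership in the systemd-journal group):
#   sudo journalctl -k --no-pager | count_segfaults
#   sudo dmesg | count_segfaults
```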

Thanks noidea, reply is much appreciated.

I was using su - , then running, so I should still see. Keep in mind I was able to see a few segfaults anyway (just not always failing that way), the majority of the time it just failed out as described. Have yet to see a segfault after RMA also…

testRyzenGCC ran for about 2 days without issue.

now I’m back to playing with kill-ryzen, just to see if I can get through it, since that’s how I got segfaulted to begin with.

  • with ramdisk at 32G, it ran until that was exhausted, which was much much longer than 64G ever got… shrug.
  • with ramdisk false and it went a while, but I backed out to start fresh:
  • everything in user’s home dir instead of root’s, default ramdisk, and it seems like it’s going past the usual threshold… this time using “sudo ./kill-ryzen.sh”.

maybe there’s a clue in there as to what was happening… the reliability in time to failure seemed like a setup issue to me, which only changed dramatically with the above changes.

I have sent my 1700 to AMD in the Netherlands for replacement. AMD paid the express delivery costs, and now I am waiting for their e-mail.

Now I don’t know what to do with my remaining R3 1200 (Batch UA 1724PGT):
I ran the kill-ryzen.sh script 7 times, with a total testing time of 74 hours, and during this time I got 2 segfaults.
I don’t know. Is this a good or a bad CPU now?

(in comparison the R7 1700 (Batch UA 1709PGT):
5 runs, 47 hours total, 37 segfaults)
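
To put the two chips on the same scale, the counts can be turned into segfaults per hour of testing. A quick sketch using the numbers from this post (the function name is made up):

```shell
# segfaults_per_hour SEGFAULTS HOURS: print the rate with three decimals.
segfaults_per_hour() {
    LC_ALL=C awk -v s="$1" -v h="$2" 'BEGIN { printf "%.3f\n", s / h }'
}

segfaults_per_hour 2 74     # R3 1200, batch UA 1724PGT  -> 0.027
segfaults_per_hour 37 47    # R7 1700, batch UA 1709PGT  -> 0.787
```

So the 1700 segfaulted roughly 29 times as often per hour as the 1200, but the 1200's rate is still not zero.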

@noidea - since you were kind enough to reply to me earlier.

You should not expect to see segfaults, however you may never experience any real world issues, depending on what you do with it. It isn’t clear how the issue may manifest itself otherwise.

Since they are paying for shipping, you may just want to RMA it.


Thank you gip!

Then I’ll RMA the 1200 just like the 1700.

How’s your Ryzen chip?

it’s ok so far, I haven’t been able to reproduce the issue since, and the other issues testing seemed to be resolved by doing what I described. I didn’t really investigate any further.

I hope you have good luck with things from here on.

I had no idea when my Ryzen 7 1700 was made. So I ran kill-ryzen.sh to see what would happen. It failed a few minutes in.

So I disassembled my system and found my chip was made in the 33rd week of 2017 (UA 1733PGS).

So I started playing with various settings as if I were overclocking the system. The longest run I’ve achieved so far is 28,581s (just under 8 hours). That was with vCore at 1.36875 V. I also tried 1.375 V, which ran for 5,903s. 1.3875 V only lasted 237s.

I’m running with two sticks of Kingston ECC ValueRAM (KVR24E17D8/16). Last time I checked, that memory wasn’t on the QVL for my motherboard (ASRock X370 Taichi).

Suggestions?

you should be good.

8 hours should be plenty of time


might try the blender build if you're worried

Finally got my replacement for the 1700.
I have run the kill script multiple times on my new CPU. I have to point out that the behaviour changed.
Now I don’t get any errors.
BUT, the kill script doesn’t loop anymore. That is, I execute it, it keeps running until all my 16 threads are done compiling (about 7 hours), then it stops immediately. No loop.

Before, on the faulty CPU:
I execute it, and the script runs and produces errors; regardless of whether a specific thread has produced an error, every thread loops again each time it finishes compiling, cleanly or with an error, and the script runs ENDLESSLY.

It turns out my newer 1700X has the segfault bug as well. I tested it in VirtualBox on Ubuntu 17.04 using the kill script.

I might as well wait two months though and grab a 2700X instead of a 1700X now.
