Calling all Linux Ryzen owners! Submit your CPU numbers (uCode/ProdDate)

This is in relation to the segfault and MCE issues seen here: https://www.reddit.com/r/Amd/comments/6rtqj0/information_i_could_find_on_these_segfault_issues/

Some people have been reporting that chips from production week 25 do not show the problem that week 16 chips do. There is no confirmed evidence for this, though: my own chip is from a very early production run and does not have the segfault issue.
In light of this I want to gain some insight into the diversity of microcode versions, or even steppings, out there.

My Reddit thread here:

Since not everyone has a Reddit account, I’m recreating this post here:

Post the output of the following commands below and state your motherboard and BIOS version:

How to get your CPU version and microcode

    grep 'stepping\|model\|microcode' /proc/cpuinfo | head -4

How to get BIOS Version

    dmesg | grep -e 'DMI.*BIOS'

Here’s my example:

    ASRock X370 Gaming K4, BIOS P3.00 07/07/2017

    model           : 1
    model name      : AMD Ryzen 7 1700X Eight-Core Processor
    stepping        : 1
    microcode       : 0x8001126
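
If you would rather collect these values from a script than copy them by hand, here is a minimal Python sketch that pulls the same four fields out of /proc/cpuinfo-style text. The function name `cpu_id_fields` is made up for illustration, and it assumes the usual `key : value` layout seen in the example above.

```python
def cpu_id_fields(cpuinfo_text):
    """Return model, model name, stepping and microcode of the first CPU listed."""
    wanted = ("model", "model name", "stepping", "microcode")
    fields = {}
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        # Keep only the first occurrence, i.e. the first logical CPU.
        if sep and key in wanted and key not in fields:
            fields[key] = value.strip()
    return fields


def read_cpu_id_fields(path="/proc/cpuinfo"):
    """Convenience wrapper; /proc/cpuinfo only exists on Linux."""
    with open(path) as f:
        return cpu_id_fields(f.read())
```

Calling `read_cpu_id_fields()` on a Linux box should give a small dict that is easy to paste into a reply or a spreadsheet.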

Edit

Reading your production date/batch number

This is a short guide on how to read your CPU’s batch number.
Note: This requires taking off your CPU cooler; I currently don’t know of another way to get this information.

Take this example image:

The text you see is explained as follows:

SKU: YD1700BBM88AE
BATCH: UA 1706PGT
SERIAL: 9R6xxxxxxxxxx

The batch number consists of the prefix UA, followed by two two-digit numbers and three letters.

The first number is the year the CPU was produced ([17] = 2017) and the second is the week ([06] = week 6).

UA [2digits-YEAR][2digits-WEEK] [3letters]

Week 6 of 2017 means it was produced anywhere between February 6 and February 12.
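
The week-to-date arithmetic can be sketched in a few lines of Python. This assumes the date code uses ISO 8601 week numbering (weeks start on Monday), which is only a guess on my part, although it does reproduce the February 6 to February 12 range above. The function name `production_week_range` is made up for illustration, and `fromisocalendar` needs Python 3.8 or newer.

```python
import datetime


def production_week_range(yyww):
    """Map a 'YYWW' date code such as '1706' to its (Monday, Sunday) dates."""
    year = 2000 + int(yyww[:2])
    week = int(yyww[2:4])
    # Assumption: ISO 8601 weeks, day 1 = Monday, day 7 = Sunday.
    monday = datetime.date.fromisocalendar(year, week, 1)
    sunday = datetime.date.fromisocalendar(year, week, 7)
    return monday, sunday
```

For example, `production_week_range("1706")` gives the Monday and Sunday of week 6 of 2017.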

Update (25th August 2017)

Google Sheets doc of affected & RMA chip UA numbers

Testing for this

Previously people used the kill-ryzen script, which compiles gcc with lots of threads, but that’s unreliable since it doesn’t stress the CPU power management as easily.

My way of testing for this:

  1. Reset your BIOS to stock
  2. Install mprime (linux prime95 basically)
  3. Run mprime -t for a good few minutes to get the CPU quite warm.
  4. While mprime is running, attempt to compile pretty much anything a few times over.
    A good project to use is tesseract:

    git clone https://github.com/rigred/tesseract.git

    ## Install dependencies (sdl2, zlib, maybe a few more)

    cd tesseract
    ## repeat the below a few times until it fails (probably immediately)
    make -C src clean && make -C src -j16 install
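
If you don’t want to re-run the build by hand, the "repeat until it fails" part of step 4 could be automated roughly like this. It is only a sketch: the helper name `runs_until_failure` and the default of 20 runs are my own choices, and it only covers the compile loop, so mprime still has to be running separately.

```python
import subprocess


def runs_until_failure(command, max_runs=20, cwd=None):
    """Run a shell command up to max_runs times.

    Returns the 1-based number of the first run that exits non-zero,
    or None if every run succeeded.
    """
    for attempt in range(1, max_runs + 1):
        result = subprocess.run(command, shell=True, cwd=cwd)
        if result.returncode != 0:
            return attempt
    return None


# The build step from the list above; run it from inside the tesseract checkout.
BUILD_COMMAND = "make -C src clean && make -C src -j16 install"
```

With `mprime -t` running in another terminal, `runs_until_failure(BUILD_COMMAND, cwd="tesseract")` would report which build attempt failed first, if any.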

Here is mine:

PRIME B350M-A, BIOS 0805 06/20/2017

model		: 1
model name	: AMD Ryzen 7 1700 Eight-Core Processor
stepping	: 1
microcode	: 0x8001126

I have the same microcode as you, ASRock AB350 Pro4, BIOS P3.00 7/13/2017

I ran into a few segfaults (no MCEs) on earlier BIOSes, including 2.60, which included AGESA 1.0.0.6, but none since, though I also haven’t compiled much since. Having said that, for me they mostly disappeared when I upped vSoC to 1.05 V and memory to 1.375 V (I run 4x8 GB of b3000c15 Hynix chips). So I think on those early chips it mostly came down to luck: you could still get a functional chip, because the binning/QA process didn’t actively check for markers related to this, whereas later on they fixed the underlying issue that could lead to the behavior/instability.

Is it safe to share microcode? Seems like a security issue waiting to happen?

I don’t want to be that guy, but your question demonstrates that you don’t know what you are talking about here. I don’t mean that in a bad way; it’s always good to ask questions.

This is not the microcode itself (you don’t have access to that anyway).
This is the version number of the microcode and BIOS and is not unique to you like a serial number would be.


model : 1
model name : AMD Ryzen 7 1700 Eight-Core Processor
stepping : 1
microcode : 0x8001126

X370 Taichi, BIOS P2.40 06/06/2017

I have updated the original post to include how to check your production batch number.

This is useful for those who have the segfault/mce issue and are not afraid to take their CPU cooler off. :smiley:


Yay or rather Nay.

After about 3 hours I got this:

[KERN] Aug 08 12:51:13 jupiter kernel: traps: as[4687] general protection ip:7f10bcd4428a sp:7ffd14de3248 error:0 in libc-2.25.so[7f10bcccd000+19d000]
[loop-15] Tue Aug 8 12:51:16 WAT 2017 build failed
[loop-15] TIME TO FAIL: 11179 s
[loop-2] Tue Aug 8 12:54:01 WAT 2017 build failed
[loop-2] TIME TO FAIL: 11344 s
[loop-11] Tue Aug 8 12:56:58 WAT 2017 build failed
[loop-11] TIME TO FAIL: 11521 s
[KERN] Aug 08 12:57:24 jupiter kernel: perf: interrupt took too long (2504 > 2500), lowering kernel.perf_event_max_sample_rate to 79800
[loop-6] Tue Aug 8 13:02:28 WAT 2017 build failed
[loop-6] TIME TO FAIL: 11851 s
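
To compare results between chips it helps to boil a log like this down to one number per loop. Here is a rough Python sketch that extracts the "TIME TO FAIL" values. The line format is taken only from the excerpt above, so other versions of the script may print something different, and the name `times_to_fail` is made up for illustration.

```python
import re

# Matches lines such as "[loop-15] TIME TO FAIL: 11179 s".
FAIL_LINE = re.compile(r"^\[(loop-\d+)\] TIME TO FAIL: (\d+) s$")


def times_to_fail(log_text):
    """Return {loop name: seconds until that loop's first reported failure}."""
    result = {}
    for line in log_text.splitlines():
        match = FAIL_LINE.match(line.strip())
        if match and match.group(1) not in result:
            result[match.group(1)] = int(match.group(2))
    return result
```

Taking `min(times_to_fail(text).values())` then gives the time to the first failure, which for the log above is 11179 s, a little over 3 hours.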

I’m not certain what caused the first GP fault, but all the subsequent build failures were due to this:

checking for unsigned long long int... yes
checking for uintmax_t... no
checking for uintptr_t... no
configure: error: uint64_t or int64_t not found
make[2]: *** [Makefile:4348: configure-stage3-gcc] Error 1
make[2]: Leaving directory '/mnt/ramdisk/workdir/buildloop.d/loop-15'
make[1]: *** [Makefile:27416: stage3-bubble] Error 2
make[1]: Leaving directory '/mnt/ramdisk/workdir/buildloop.d/loop-15'
make: *** [Makefile:942: all] Error 2

Going to debug the assembler now.


It would appear that I have almost unlocked the riddle of the AMD CPU Batch number.

http://phx.corporate-ir.net/External.File?item=UGFyZW50SUQ9NTk4MTgwfENoaWxkSUQ9MzA4NjM5fFR5cGU9MQ==&t=1

It does indeed seem that PGT chips come from Penang, Malaysia, and SUS/SUT chips from Suzhou, China, both handled by NFME’s (Nantong Fujitsu Microelectronics) ATMP (assembly, test, mark, and pack) facilities.

http://ir.amd.com/phoenix.zhtml?c=74093&p=irol-newsArticle&ID=2097683

http://www.marketwired.com/press-release/amd-nantong-fujitsu-microelectronics-co-ltd-close-on-semiconductor-assembly-test-joint-2119770.htm

The dies themselves are diffused in the US at Fab 8, Luther Forest Technology Campus, Saratoga County, New York.
https://translate.google.com/translate?hl=en&sl=auto&tl=en&u=http%3A%2F%2Fwww.bitsandchips.it%2F9-hardware%2F8147-il-100-delle-cpu-ryzen-sono-prodotte-presso-globalfoundries


The alternative is Samsung’s 14nm 300mm fab in Austin, Texas.

http://www.samsung.com/semiconductor/foundry/manufacturing/

This allows us to attribute the first and last letters.

    UA   1706       PGT
    UA [YY][WW]  [1][2][3]

    YY -> Year
    WW -> Week

    1 -> ATMP Location ([P]enang, Malaysia or [S]uzhou, China) (Exact factory addresses known )
    2 -> (G and U)? Perhaps [G]lobalfoundries but who is U?
    3 -> DIE Production ([S]aratoga or [T]exas) (addresses known)
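
Here is the layout above written as a small Python decoder, which is handy for sorting submitted batch numbers. The letter mappings are just the guesses from this post and are not confirmed by AMD; the second letter is passed through untouched because its meaning is still unknown. The name `decode_batch` and the table names are made up for illustration.

```python
import re

# "UA", two digits for the year, two for the week, then three letters.
BATCH_PATTERN = re.compile(r"^UA\s*(\d{2})(\d{2})\s*([A-Z])([A-Z])([A-Z])$")

# Tentative mappings from this thread, NOT confirmed by AMD.
ATMP_SITES = {"P": "Penang, Malaysia", "S": "Suzhou, China"}
DIE_SITES = {"S": "Saratoga", "T": "Texas"}


def decode_batch(batch):
    """Split a batch string such as 'UA 1706PGT' into its guessed parts."""
    match = BATCH_PATTERN.match(batch.strip().upper())
    if match is None:
        raise ValueError("unrecognised batch string: %r" % batch)
    yy, ww, first, second, third = match.groups()
    return {
        "year": 2000 + int(yy),
        "week": int(ww),
        "atmp": ATMP_SITES.get(first, "unknown"),
        "second_letter": second,  # meaning still unknown
        "die": DIE_SITES.get(third, "unknown"),
    }
```

The pattern tolerates extra or missing spaces, since people type the code in slightly different ways.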

Updated Topic with new details.

FYI - I just updated the BIOS on my X370 Taichi from 3.00 to 3.10, which updated the CPU microcode from 0x8001126 to 0x8001129.

I have a “bad” Ryzen CPU - it segfaults under high compilation load in Linux, though I haven’t retested since the BIOS upgrade.

Details:

model           : 1
model name      : AMD Ryzen 7 1800X Eight-Core Processor
stepping        : 1
microcode       : 0x8001129
[    0.000000] DMI: To Be Filled By O.E.M. To Be Filled By O.E.M./X370 Taichi, BIOS P3.10 08/25/2017

CPU is from batch “UA 1711PGS”


My chip:

ASUS PRIME X370-PRO, BIOS 0810 08/01/2017

model		    : 1
model name	    : AMD Ryzen 7 1700 Eight-Core Processor
stepping	    : 1
microcode	    : 0x8001126

SKU             : YD1700BBM88AE
BATCH           : UA  1707 SUT
SERIAL          : 9GU4765N70042

Gonna be RMAing it, it’s shitting the bed on blender builds. #sad

n = n + 1

small

UA 1713PGT

So Penang, Global Foundries, Texas

This chip is the one I had to RMA due to consistent concurrent workload errors. No MCE errors. Will update later with the one I get as a replacement.


Another one

model		: 1
model name	: AMD Ryzen 7 1700 Eight-Core Processor
stepping	: 1
microcode	: 0x8001126
DMI: Micro-Star International Co., Ltd MS-7A34/B350 PC MATE(MS-7A34), BIOS A.60 07/27/2017
Kernel: 4.12.6
ProdDate: 1728

kill-ryzen.sh first fails after approx. 82 s. Four other threads then progressively fail, and after that it is stable up to 2 h with 11 threads still running under kill-ryzen. The failures are either traps or segfaults (it varies).
I also run mythtv on this server which does a lot of IO to PCIE and USB for DVB EIT scanning. I notice I get more lines like

Sep 14 08:02:50 kernel: [88333.523310] usb 1-5: dvb_usb_af9015: command failed=2
Sep 14 08:02:50 kernel: [88333.523337] i2c i2c-5: af9013: i2c rd failed=-5 reg=d391 len=1
Sep 14 08:02:50 kernel: [88333.524139] usb 1-5: dvb_usb_af9015: command failed=1
Sep 14 08:02:50 kernel: [88333.524163] i2c i2c-5: af9013: i2c wr failed=-5 reg=d391 len=1

These don’t happen much until things are under load. This is indicative of the RETIQ issue mentioned I believe. When it starts the M/B NIC starts behaving badly.
Lots of building Qt will make this happen too.
Since the proddate is post 1725 I shouldn’t be seeing the kill-ryzen issues, according to all the advice I’ve seen posted. Is it worth pursuing this with AMD?
No ucode update either ATM.
NOTE: all BIOS defaults, no overclocking, RAM (GSKILL RipjawsV 2666 @2400 but tried the others too).
One PCIE card (DVICO DUAL DIGITAL PCIE) is not even detected by this M/B. MS were no help there.
Another in windows running fine but no real load.
Other than this the ryzen is great.
Doing a PS swap shortly.
HTH anyone/someone.

The kill-ryzen script just tests for instability. Some general steps are also outlined in this other thread. If you want, you can have the Ryzen tech support walk you through double-checking everything (cooling/voltages/prime95/JEDEC memory/QVL memory). In my case (pun intended), there were some intermittent I/O errors, which were fixed with a different SATA cable and port.

Recently I RMAd my Ryzen 1700 for segfault issues. After a long wait I finally received my replacement, a week 33 chip.

Using kill-ryzen.sh on Ubuntu 16.04.3 with the latest AGESA, stock RAM speeds, memtest validated, I had the following…

The good:

  • No segfault
  • No MCE reboot or system crashes

The bad?

  • “interrupt took too long…” message followed by failed builds.

I’m not familiar with this message. Does this suggest another bad chip or am I facing a kernel issue with new architecture blues?

Attached is a screenshot.

hmmm…

kernel version?
uname -a
EDIT: OK, I saw it’s 4.10.0-35. Could you perhaps try upgrading to a recent 4.12 kernel and see if things improve?

Aside from that what are the mainboard and other system details?

Also attach dmesg output when it occurs.

PS: I suspect you might simply be hitting the limit of RAM.

Regarding RAM, I agree.

FWIW I have run this test for 2 hours so far without issues. This segfault test doesn’t use up nearly as much RAM. It seems to be a valid test. AMD forum users report that it typically fails within 10-20 min.


./run.sh 16 1000000000

Linux tiny-ryzen 4.10.0-35-generic #39~16.04.1-Ubuntu SMP Wed Sep 13 09:02:42 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

ASRock Fatal1ty x370 ITX

I’ll certainly look into an upgraded kernel.

Your help and ideas are much appreciated!

Check your build log

grep -i error /mnt/ramdisk/workdir/buildloop.d/loop-0/build.log

I bet all the logs show the same error:

./md-unwind-support.h:65:47: error: dereferencing pointer to incomplete type 'struct ucontext'

The kill-ryzen test is broken when attempting to build with newer gcc like 7.2.0 and the associated system dependencies.
You can manually patch the build script to download and build gcc 7.2.0 instead.
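
To check every loop’s log at once rather than only loop-0, something like the following Python sketch could be used. The `loop-*/build.log` layout is inferred from the paths quoted in this thread, so treat it as an assumption, and the function names are made up for illustration.

```python
from pathlib import Path


def error_lines(log_text):
    """Return the lines of a build log that mention 'error' (case-insensitive)."""
    return [line for line in log_text.splitlines() if "error" in line.lower()]


def scan_build_logs(build_root="/mnt/ramdisk/workdir/buildloop.d"):
    """Collect error lines from every loop-*/build.log under build_root."""
    found = {}
    for log in sorted(Path(build_root).glob("loop-*/build.log")):
        hits = error_lines(log.read_text(errors="replace"))
        if hits:
            found[log.parent.name] = hits
    return found
```

If every loop reports the same `struct ucontext` error, the failures come from the broken build and not from the CPU.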