I run ESXi 7 with no issues.
It seems I have finally found a working configuration. These are the BIOS settings I changed. Now I have to check one by one which ones are really needed.
- Advanced
  - CPU Configuration
    - PSS Support -> Disabled (Performance State, Cool&Quiet)
    - CPB Mode -> Disabled (Core Performance Boost)
    - C6 Mode -> Disabled
  - AMD CBS
    - CPU Common Options
      - Platform First Error Handling -> Disabled
      - Core Performance Boost -> Disabled
      - Power Supply Idle Control -> Typical Current Idle
      - ACPI _CST C1 Declaration -> Disabled
    - NBIO Common Options
      - IOMMU -> Enabled
      - ACS Enable -> Enable
      - PCIE ARI Support -> Enable
      - Enable AER Cap -> Enable
Thanks again for the help!
@RoLee What PSU/PSUs did you use when you were experiencing issues?
Maybe we are seeing something similar to the 2013 Intel Haswell situation here:
- https://techreport.com/news/24738/few-psus-support-haswells-c6c7-low-power-states/
- https://techreport.com/review/24897/the-big-haswell-psu-compatibility-list/
At that time many PSUs started appearing with a ‘Haswell-ready’ sticker.
Or it’s just what @Mastic_Warrior mentioned in this old thread: Ryzen C-State related problems -- what is the root cause?
@Tenrag Hmm, interesting.
Currently it is running with a "Seasonic Prime Ultra Platinum 550W ATX 2.4 (SSR-550PD2)"
/seasonic.com/prime-ultra-platinum
The other model I tested it with was an "Enermax Modu 82+ (EMD425AWT)"
/www.enermax.com/home.php?fn=eng/product_a1_1_1&lv0=1&lv1=54&no=7
Drop a support email to Seasonic and ASRock Rack, they know about PSU compatibility issues with this motherboard (just search the first half of this thread), but I seem to be black-listed and never received a resolution regarding the issues I had reported.
In the meantime I’m using SFX-L SilverStone Titanium PSUs, they don’t trigger anything (so far).
With BMC version 1.90 it seems to have become much better, but I still don’t trust the X470D4Us with my Seasonic PSUs.
I’m happy to report that, after disabling “Platform First Error Handling (PFEH)” in the BIOS, (corrected) single-bit errors are properly reported to the OS, also when overclocking / undervolting! So I’m now getting the same results as Diversity with his memory pin shorting method…
The reason I was failing to detect this earlier was:
- Memtest86 v8.2 reported “unknown” for ECC support. Memtest86 v8.3 reported “enabled” for ECC support. So I assumed, if it was working, Memtest86 v8.3 should be able to detect them. However, a couple of days ago I figured out that Memtest86 v8.4 beta had Zen2 ECC support in its changelog. So after testing I figured out that Memtest86 v8.3 does NOT support Zen2 ECC, but Memtest86 v8.4 beta DOES support Zen2 ECC.
- I only discovered the BIOS option “Platform First Error Handling (PFEH)” very recently. During all my previous testing, except for the very last couple short tests, it was set to the default “enabled”. I probably did too little testing with Linux / Windows after disabling it.
So in short:
1) (corrected) single-bit memory errors -> motherboard (BIOS) -> OS ==> works 100%
2) (corrected) single-bit memory errors -> motherboard (BIOS) -> OS -> IPMI ==> not sure if the OS properly forwards the error to the IPMI
3) (corrected) single-bit memory errors -> motherboard (BIOS) -> IPMI ==> 100% broken
4) (corrected) single-bit memory errors -> motherboard (BIOS) -> IPMI -> OS ==> 100% broken
5) (uncorrected) multi-bit memory errors -> * ==> I’m not sure if it is broken (or perhaps not even possible on Zen2) or if we just haven’t been able to trigger them yet. I’ve run Memtest86 v8.4 with unstable memory for many hours now. In doing so, I’ve triggered about 3000 “ECC Correctable Errors” (=single-bit) and about 100 CPU errors, but 0 “ECC Uncorrectable Errors” (=multi-bit). Also using the shorting method, we haven’t achieved any “ECC Uncorrectable Errors” (=multi-bit) yet. We are currently in contact with the people who wrote the paper (see link above for details) that explained the shorting method, to see how to trigger multi-bit errors reliably.
So if I understand it correctly:
1) = ok
2) I think we can only validate this once 3) is fixed
3) Is actually a bug and should be fixed by Asrock Rack (with help of AMD and perhaps the IPMI-chip manufacturer Aspeed)
4) Can only work / be fixed once 3) gets fixed
5) Suggestions are welcome. Perhaps AMD can confirm whether Zen2 properly supports this? Hopefully not the way AMD TW did, claiming that “reporting is not supported”, which we have now clearly proven to be false.
I’ve sent this information to Asrock Rack + AMD. Asrock Rack, on the same day, confirmed that they, together with AMD, had come to the exact same conclusion (using error injection in Linux) and that they had asked AMD for assistance to report these errors in the IPMI as well. So hopefully we’ll someday get this important feature on this motherboard!!
In the meantime, Diversity (especially) and I are still trying to trigger (uncorrected) multi-bit errors as well. We’re in contact with the interesting folks of ECCploit for this, who have very profound knowledge on this matter… (Check out Lucian’s talk at OffensiveCon19 https://www.youtube.com/watch?v=R2aPo_wwmZw).
Finally some real progress on this matter! Thanks to Diversity for getting my hopes up again, cause I almost gave up on this…
PFEH setting in BIOS
Screenshot taken after 1m33sec with memory overclocked / undervolted
Screenshot taken after almost 2h, after ending the run with memory overclocked / undervolted
And screenshot in Linux with memory overclocked / undervolted (during memtester run in the background):
3rd working method
This is how Asrock Rack and AMD tested it
The BIOS
Everything default except
- “Platform First Error Handling” was changed from the default “Enabled” to “Disabled”
- “Disable Memory Error Injection”, strangely enough, was (accidentally) set to the default “True”. I think Asrock Rack fell for their own double-negation confusion. I haven’t tried it yet with it set to “False”. I also haven’t retried Memtest86 error injection with these settings…
The OS
I used a fresh install of “Fedora-Server-dvd-x86_64-32-1.6.iso” for this. I might have selected a few additional package groups during the install, not sure if it will make a difference to the below instructions.
[root@localhost mce-inject-master]# cat /etc/fedora-release
Fedora release 32 (Thirty Two)
[root@localhost ~]# uname -r
5.6.8-300.fc32.x86_64
Installing / configuring additional packages / tools
edac-utils
[root@localhost ~]# yum install edac-utils
Fedora 32 openh264 (From Cisco) - x86_64 4.8 kB/s | 5.1 kB 00:01
Fedora Modular 32 - x86_64 2.2 MB/s | 4.9 MB 00:02
Fedora Modular 32 - x86_64 - Updates 881 kB/s | 1.4 MB 00:01
Fedora 32 - x86_64 - Updates 4.1 MB/s | 7.8 MB 00:01
Fedora 32 - x86_64 4.3 MB/s | 70 MB 00:16
Dependencies resolved.
========================================================================================================================
Package Architecture Version Repository Size
========================================================================================================================
Installing:
edac-utils x86_64 0.16-22.fc32 fedora 49 k
Installing dependencies:
sysfsutils x86_64 2.1.0-28.fc32 fedora 44 k
Transaction Summary
========================================================================================================================
Install 2 Packages
Total download size: 93 k
Installed size: 238 k
Is this ok [y/N]: y
Downloading Packages:
(1/2): edac-utils-0.16-22.fc32.x86_64.rpm 406 kB/s | 49 kB 00:00
(2/2): sysfsutils-2.1.0-28.fc32.x86_64.rpm 337 kB/s | 44 kB 00:00
------------------------------------------------------------------------------------------------------------------------
Total 115 kB/s | 93 kB 00:00
warning: /var/cache/dnf/fedora-558931b5e76b51a7/packages/edac-utils-0.16-22.fc32.x86_64.rpm: Header V3 RSA/SHA256 Signature, key ID 12c944d0: NOKEY
Fedora 32 - x86_64 1.6 MB/s | 1.6 kB 00:00
Importing GPG key 0x12C944D0:
Userid : "Fedora (32) <[email protected]>"
Fingerprint: 97A1 AE57 C3A2 372C CA3A 4ABA 6C13 026D 12C9 44D0
From : /etc/pki/rpm-gpg/RPM-GPG-KEY-fedora-32-x86_64
Is this ok [y/N]: y
Key imported successfully
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
Preparing : 1/1
Installing : sysfsutils-2.1.0-28.fc32.x86_64 1/2
Installing : edac-utils-0.16-22.fc32.x86_64 2/2
Running scriptlet: edac-utils-0.16-22.fc32.x86_64 2/2
Verifying : edac-utils-0.16-22.fc32.x86_64 1/2
Verifying : sysfsutils-2.1.0-28.fc32.x86_64 2/2
Installed:
edac-utils-0.16-22.fc32.x86_64 sysfsutils-2.1.0-28.fc32.x86_64
Complete!
Bison
[root@localhost mce-inject-master]# yum install bison
Last metadata expiration check: 0:06:26 ago on Fri 08 May 2020 12:45:14 AM CEST.
Dependencies resolved.
=============================================================================================================================================================================================================================================
Package Architecture Version Repository Size
=============================================================================================================================================================================================================================================
Installing:
bison x86_64 3.5-2.fc32 fedora 818 k
Installing dependencies:
m4 x86_64 1.4.18-12.fc32 fedora 218 k
Transaction Summary
=============================================================================================================================================================================================================================================
Install 2 Packages
Total download size: 1.0 M
Installed size: 3.0 M
Is this ok [y/N]: y
Downloading Packages:
(1/2): m4-1.4.18-12.fc32.x86_64.rpm 946 kB/s | 218 kB 00:00
(2/2): bison-3.5-2.fc32.x86_64.rpm 2.0 MB/s | 818 kB 00:00
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Total 1.2 MB/s | 1.0 MB 00:00
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
Preparing : 1/1
Installing : m4-1.4.18-12.fc32.x86_64 1/2
Installing : bison-3.5-2.fc32.x86_64 2/2
Running scriptlet: bison-3.5-2.fc32.x86_64 2/2
Verifying : bison-3.5-2.fc32.x86_64 1/2
Verifying : m4-1.4.18-12.fc32.x86_64 2/2
Installed:
bison-3.5-2.fc32.x86_64 m4-1.4.18-12.fc32.x86_64
Complete!
Flex
[root@localhost mce-inject-master]# yum install flex
Last metadata expiration check: 0:06:41 ago on Fri 08 May 2020 12:45:14 AM CEST.
Dependencies resolved.
=============================================================================================================================================================================================================================================
Package Architecture Version Repository Size
=============================================================================================================================================================================================================================================
Installing:
flex x86_64 2.6.4-4.fc32 fedora 318 k
Transaction Summary
=============================================================================================================================================================================================================================================
Install 1 Package
Total download size: 318 k
Installed size: 927 k
Is this ok [y/N]: y
Downloading Packages:
flex-2.6.4-4.fc32.x86_64.rpm 1.5 MB/s | 318 kB 00:00
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Total 482 kB/s | 318 kB 00:00
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
Preparing : 1/1
Installing : flex-2.6.4-4.fc32.x86_64 1/1
Running scriptlet: flex-2.6.4-4.fc32.x86_64 1/1
Verifying : flex-2.6.4-4.fc32.x86_64 1/1
Installed:
flex-2.6.4-4.fc32.x86_64
Complete!
Rasdaemon
[root@localhost mce-inject-master]# yum install rasdaemon
Last metadata expiration check: 0:02:51 ago on Fri 08 May 2020 12:52:01 AM CEST.
Dependencies resolved.
=============================================================================================================================================================================================================================================
Package Architecture Version Repository Size
=============================================================================================================================================================================================================================================
Installing:
rasdaemon x86_64 0.6.4-1.fc32 fedora 117 k
Installing dependencies:
perl-DBD-SQLite x86_64 1.64-4.fc32 fedora 196 k
perl-DBI x86_64 1.643-2.fc32 fedora 707 k
perl-Math-BigInt noarch 1:1.9998.18-2.fc32 fedora 190 k
perl-Math-Complex noarch 1.59-452.fc32 fedora 56 k
Transaction Summary
=============================================================================================================================================================================================================================================
Install 5 Packages
Total download size: 1.2 M
Installed size: 3.5 M
Is this ok [y/N]: y
Downloading Packages:
(1/5): perl-Math-BigInt-1.9998.18-2.fc32.noarch.rpm 497 kB/s | 190 kB 00:00
(2/5): perl-DBD-SQLite-1.64-4.fc32.x86_64.rpm 498 kB/s | 196 kB 00:00
(3/5): perl-Math-Complex-1.59-452.fc32.noarch.rpm 1.3 MB/s | 56 kB 00:00
(4/5): rasdaemon-0.6.4-1.fc32.x86_64.rpm 1.1 MB/s | 117 kB 00:00
(5/5): perl-DBI-1.643-2.fc32.x86_64.rpm 1.1 MB/s | 707 kB 00:00
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Total 1.1 MB/s | 1.2 MB 00:01
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
Preparing : 1/1
Installing : perl-Math-Complex-1.59-452.fc32.noarch 1/5
Installing : perl-Math-BigInt-1:1.9998.18-2.fc32.noarch 2/5
Installing : perl-DBI-1.643-2.fc32.x86_64 3/5
Installing : perl-DBD-SQLite-1.64-4.fc32.x86_64 4/5
Installing : rasdaemon-0.6.4-1.fc32.x86_64 5/5
Running scriptlet: rasdaemon-0.6.4-1.fc32.x86_64 5/5
Verifying : perl-DBD-SQLite-1.64-4.fc32.x86_64 1/5
Verifying : perl-DBI-1.643-2.fc32.x86_64 2/5
Verifying : perl-Math-BigInt-1:1.9998.18-2.fc32.noarch 3/5
Verifying : perl-Math-Complex-1.59-452.fc32.noarch 4/5
Verifying : rasdaemon-0.6.4-1.fc32.x86_64 5/5
Installed:
perl-DBD-SQLite-1.64-4.fc32.x86_64 perl-DBI-1.643-2.fc32.x86_64 perl-Math-BigInt-1:1.9998.18-2.fc32.noarch perl-Math-Complex-1.59-452.fc32.noarch rasdaemon-0.6.4-1.fc32.x86_64
Complete!
[root@localhost machinecheck0]# rasdaemon -e
rasdaemon: ras:mc_event event enabled
rasdaemon: ras:aer_event event enabled
rasdaemon: mce:mce_record event enabled
rasdaemon: Can't write to set_event
rasdaemon: devlink:devlink_health_report event enabled
rasdaemon: block:block_rq_complete event enabled
[root@localhost machinecheck0]# systemctl start rasdaemon
[root@localhost machinecheck0]# systemctl enable rasdaemon
Created symlink /etc/systemd/system/multi-user.target.wants/rasdaemon.service → /usr/lib/systemd/system/rasdaemon.service.
[root@localhost machinecheck0]# systemctl status rasdaemon.service
● rasdaemon.service - RAS daemon to log the RAS events
Loaded: loaded (/usr/lib/systemd/system/rasdaemon.service; enabled; vendor preset: disabled)
Active: active (running) since Fri 2020-05-08 00:57:46 CEST; 23s ago
Main PID: 33914 (rasdaemon)
Tasks: 1 (limit: 38389)
Memory: 7.1M
CPU: 10ms
CGroup: /system.slice/rasdaemon.service
└─33914 /usr/sbin/rasdaemon -f -r
May 08 00:57:46 localhost.localdomain rasdaemon[33914]: rasdaemon: diskerror_eventstore: 0x564510eb9918
May 08 00:57:46 localhost.localdomain rasdaemon[33914]: rasdaemon: register inserted at db
May 08 00:57:46 localhost.localdomain rasdaemon[33914]: overriding event (1360) ras:mc_event with new print handler
May 08 00:57:46 localhost.localdomain rasdaemon[33914]: overriding event (1357) ras:aer_event with new print handler
May 08 00:57:46 localhost.localdomain rasdaemon[33914]: overriding event (114) mce:mce_record with new print handler
May 08 00:57:46 localhost.localdomain rasdaemon[33914]: overriding event (1441) net:net_dev_xmit_timeout with new print handler
May 08 00:57:46 localhost.localdomain rasdaemon[33914]: overriding event (1449) devlink:devlink_health_report with new print handler
May 08 00:57:46 localhost.localdomain rasdaemon[33914]: overriding event (1154) block:block_rq_complete with new print handler
May 08 00:57:46 localhost.localdomain rasdaemon[33914]: Calling ras_mc_event_opendb()
May 08 00:57:46 localhost.localdomain rasdaemon[33914]: <...>-36 [005] 0.000095: block_rq_complete: 2020-05-08 00:57:45 +0200
Development Tools (for make)
[root@localhost mce-inject-master]# yum groupinstall "Development Tools"
Last metadata expiration check: 0:07:50 ago on Fri 08 May 2020 12:52:01 AM CEST.
Dependencies resolved.
=============================================================================================================================================================================================================================================
Package Architecture Version Repository Size
=============================================================================================================================================================================================================================================
Installing group/module packages:
diffstat x86_64 1.63-2.fc32 fedora 43 k
...
xorg-x11-server-utils x86_64 7.7-34.fc32 fedora 188 k
Installing weak dependencies:
kernel-devel x86_64 5.6.8-300.fc32 updates 13 M
Installing Groups:
Development Tools
Transaction Summary
=============================================================================================================================================================================================================================================
Install 79 Packages
Total download size: 124 M
Installed size: 448 M
Is this ok [y/N]: y
Downloading Packages:
(1/79): git-2.26.2-1.fc32.x86_64.rpm 787 kB/s | 126 kB 00:00
...
(79/79): xorg-x11-fonts-ISO8859-1-100dpi-7.5-24.fc32.noarch.rpm 2.6 MB/s | 1.0 MB 00:00
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Total 6.9 MB/s | 124 MB 00:18
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
Preparing : 1/1
Installing : urw-base35-fonts-common-20170801-14.fc32.noarch 1/79
...
Running scriptlet: diffstat-1.63-2.fc32.x86_64 79/79
Verifying : cpp-10.0.1-0.14.fc32.x86_64 1/79
...
Verifying : xorg-x11-server-utils-7.7-34.fc32.x86_64 79/79
Installed:
adobe-mappings-cmap-20171205-7.fc32.noarch adobe-mappings-cmap-deprecated-20171205-7.fc32.noarch adobe-mappings-pdf-20180407-5.fc32.noarch binutils-2.34-2.fc32.x86_64
binutils-gold-2.34-2.fc32.x86_64 boost-filesystem-1.69.0-15.fc32.x86_64 boost-system-1.69.0-15.fc32.x86_64 boost-thread-1.69.0-15.fc32.x86_64
cpp-10.0.1-0.14.fc32.x86_64 diffstat-1.63-2.fc32.x86_64 doxygen-1:1.8.17-2.fc32.x86_64 dyninst-10.1.0-5.fc32.x86_64
gcc-10.0.1-0.14.fc32.x86_64 gd-2.3.0-1.fc32.x86_64 git-2.26.2-1.fc32.x86_64 git-core-2.26.2-1.fc32.x86_64
git-core-doc-2.26.2-1.fc32.noarch glibc-devel-2.31-2.fc32.x86_64 glibc-headers-2.31-2.fc32.x86_64 google-droid-sans-fonts-20200215-3.fc32.noarch
graphviz-2.42.4-1.fc32.x86_64 gtk2-2.24.32-7.fc32.x86_64 gts-0.7.6-37.20121130.fc32.x86_64 guile22-2.2.6-4.fc32.x86_64
isl-0.16.1-10.fc32.x86_64 jbig2dec-libs-0.17-4.fc32.x86_64 kernel-devel-5.6.8-300.fc32.x86_64 kernel-headers-5.6.7-300.fc32.x86_64
lasi-1.1.3-2.fc32.x86_64 libXaw-1.0.13-14.fc32.x86_64 libXmu-1.1.3-3.fc32.x86_64 libXpm-3.5.13-2.fc32.x86_64
libXt-1.2.0-1.fc32.x86_64 libfontenc-1.1.3-12.fc32.x86_64 libgs-9.52-1.fc32.x86_64 libidn-1.35-7.fc32.x86_64
libijs-0.35-11.fc32.x86_64 libimagequant-2.12.6-2.fc32.x86_64 libmcpp-2.7.2-25.fc32.x86_64 libmpc-1.1.0-8.fc32.x86_64
libpaper-1.1.24-26.fc32.x86_64 libraqm-0.7.0-5.fc32.x86_64 librsvg2-2.48.4-1.fc32.x86_64 libserf-1.3.9-15.fc32.x86_64
libwebp-1.1.0-2.fc32.x86_64 libxcrypt-devel-4.4.16-3.fc32.x86_64 make-1:4.2.1-16.fc32.x86_64 mcpp-2.7.2-25.fc32.x86_64
netpbm-10.90.00-1.fc32.x86_64 openjpeg2-2.3.1-6.fc32.x86_64 patch-2.7.6-12.fc32.x86_64 patchutils-0.3.4-15.fc32.x86_64
perl-Error-1:0.17029-1.fc32.noarch perl-Git-2.26.2-1.fc32.noarch perl-TermReadKey-2.38-6.fc32.x86_64 subversion-1.12.2-7.fc32.x86_64
subversion-libs-1.12.2-7.fc32.x86_64 systemtap-4.3-0.20200211git91ffb97ad335.fc32.x86_64 systemtap-client-4.3-0.20200211git91ffb97ad335.fc32.x86_64 systemtap-devel-4.3-0.20200211git91ffb97ad335.fc32.x86_64
systemtap-runtime-4.3-0.20200211git91ffb97ad335.fc32.x86_64 tbb-2020.2-1.fc32.x86_64 urw-base35-bookman-fonts-20170801-14.fc32.noarch urw-base35-c059-fonts-20170801-14.fc32.noarch
urw-base35-d050000l-fonts-20170801-14.fc32.noarch urw-base35-fonts-20170801-14.fc32.noarch urw-base35-fonts-common-20170801-14.fc32.noarch urw-base35-gothic-fonts-20170801-14.fc32.noarch
urw-base35-nimbus-mono-ps-fonts-20170801-14.fc32.noarch urw-base35-nimbus-roman-fonts-20170801-14.fc32.noarch urw-base35-nimbus-sans-fonts-20170801-14.fc32.noarch urw-base35-p052-fonts-20170801-14.fc32.noarch
urw-base35-standard-symbols-ps-fonts-20170801-14.fc32.noarch urw-base35-z003-fonts-20170801-14.fc32.noarch utf8proc-2.4.0-3.fc32.x86_64 xapian-core-libs-1.4.14-1.fc32.x86_64
xorg-x11-font-utils-1:7.5-44.fc32.x86_64 xorg-x11-fonts-ISO8859-1-100dpi-7.5-24.fc32.noarch xorg-x11-server-utils-7.7-34.fc32.x86_64
Complete!
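Before building mce-inject it can save a round trip to confirm that the tools the Makefile calls (bison, flex and cc, as seen in the make output further down) are actually on the PATH. A minimal sketch; the function name `missing_tools` is made up for illustration and is not part of any package:

```shell
#!/bin/sh
# missing_tools TOOL...
# Prints the name of every listed command that cannot be found on PATH.
# Prints nothing when all of them are available.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example for this build:
# missing_tools bison flex make cc
```

Empty output means everything is in place; any name that is printed still needs to be installed with yum as shown above.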
mce-inject
[root@localhost ~]# wget https://github.com/andikleen/mce-inject/archive/master.zip
--2020-05-08 00:49:09-- https://github.com/andikleen/mce-inject/archive/master.zip
Resolving github.com (github.com)... 140.82.118.3
Connecting to github.com (github.com)|140.82.118.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://codeload.github.com/andikleen/mce-inject/zip/master [following]
--2020-05-08 00:49:09-- https://codeload.github.com/andikleen/mce-inject/zip/master
Resolving codeload.github.com (codeload.github.com)... 140.82.114.9
Connecting to codeload.github.com (codeload.github.com)|140.82.114.9|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘master.zip’
master.zip [ <=> ] 13.21K --.-KB/s in 0.09s
2020-05-08 00:49:10 (139 KB/s) - ‘master.zip’ saved [13530]
[root@localhost ~]# unzip master.zip
Archive: master.zip
4cbe46321b4a81365ff3aafafe63967264dbfec5
creating: mce-inject-master/
inflating: mce-inject-master/Makefile
inflating: mce-inject-master/README
inflating: mce-inject-master/inject.h
inflating: mce-inject-master/mce-inject.8
inflating: mce-inject-master/mce-inject.c
inflating: mce-inject-master/mce.h
inflating: mce-inject-master/mce.lex
inflating: mce-inject-master/mce.y
inflating: mce-inject-master/parser.h
creating: mce-inject-master/test/
inflating: mce-inject-master/test/corrected
inflating: mce-inject-master/test/fatal
inflating: mce-inject-master/test/uncorrected
inflating: mce-inject-master/util.c
inflating: mce-inject-master/util.h
[root@localhost ~]# cd mce-inject-master/
[root@localhost mce-inject-master]# ls -la
total 48
drwxr-xr-x. 3 root root 189 Jan 19 2013 .
drwxr-xr-x. 3 root root 49 May 8 00:49 ..
-rw-r--r--. 1 root root 193 Jan 19 2013 inject.h
-rw-r--r--. 1 root root 904 Jan 19 2013 Makefile
-rw-r--r--. 1 root root 3863 Jan 19 2013 mce.h
-rw-r--r--. 1 root root 3793 Jan 19 2013 mce-inject.8
-rw-r--r--. 1 root root 6506 Jan 19 2013 mce-inject.c
-rw-r--r--. 1 root root 3487 Jan 19 2013 mce.lex
-rw-r--r--. 1 root root 3822 Jan 19 2013 mce.y
-rw-r--r--. 1 root root 385 Jan 19 2013 parser.h
-rw-r--r--. 1 root root 1460 Jan 19 2013 README
drwxr-xr-x. 2 root root 55 Jan 19 2013 test
-rw-r--r--. 1 root root 364 Jan 19 2013 util.c
-rw-r--r--. 1 root root 290 Jan 19 2013 util.h
[root@localhost mce-inject-master]# make
bison -d mce.y
flex mce.lex
cc -MM -DDEPS_RUN -I. mce-inject.c util.c mce.tab.c lex.yy.c > .depend.X && \
mv .depend.X .depend
cc -Os -g -Wall -c -o mce-inject.o mce-inject.c
cc -Os -g -Wall -c -o mce.tab.o mce.tab.c
cc -Os -g -Wall -c -o lex.yy.o lex.yy.c
cc -Os -g -Wall -c -o util.o util.c
cc -pthread mce-inject.o mce.tab.o lex.yy.o util.o -o mce-inject
[root@localhost mce-inject-master]# ls -la
total 400
drwxr-xr-x. 3 root root 4096 May 8 01:01 .
drwxr-xr-x. 3 root root 49 May 8 00:49 ..
-rw-r--r--. 1 root root 45 May 8 00:54 correct
-rw-r--r--. 1 root root 185 May 8 01:01 .depend
-rw-r--r--. 1 root root 193 Jan 19 2013 inject.h
-rw-r--r--. 1 root root 47534 May 8 01:01 lex.yy.c
-rw-r--r--. 1 root root 73320 May 8 01:01 lex.yy.o
-rw-r--r--. 1 root root 904 Jan 19 2013 Makefile
-rw-r--r--. 1 root root 3863 Jan 19 2013 mce.h
-rwxr-xr-x. 1 root root 84584 May 8 01:01 mce-inject
-rw-r--r--. 1 root root 3793 Jan 19 2013 mce-inject.8
-rw-r--r--. 1 root root 6506 Jan 19 2013 mce-inject.c
-rw-r--r--. 1 root root 38960 May 8 01:01 mce-inject.o
-rw-r--r--. 1 root root 3487 Jan 19 2013 mce.lex
-rw-r--r--. 1 root root 56619 May 8 01:01 mce.tab.c
-rw-r--r--. 1 root root 2922 May 8 01:01 mce.tab.h
-rw-r--r--. 1 root root 25552 May 8 01:01 mce.tab.o
-rw-r--r--. 1 root root 3822 Jan 19 2013 mce.y
-rw-r--r--. 1 root root 385 Jan 19 2013 parser.h
-rw-r--r--. 1 root root 1460 Jan 19 2013 README
drwxr-xr-x. 2 root root 55 Jan 19 2013 test
-rw-r--r--. 1 root root 364 Jan 19 2013 util.c
-rw-r--r--. 1 root root 290 Jan 19 2013 util.h
-rw-r--r--. 1 root root 8128 May 8 01:01 util.o
[root@localhost mce-inject-master]# modprobe mce_inject
[root@localhost mce-inject-master]# vi correct
[root@localhost mce-inject-master]# cat correct
CPU 1 BANK 2
STATUS corrected
RIP 0x12341234
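The injection file above was typed in with vi. For repeated tests it could also be generated from a small helper. This is only a sketch: the function name `write_corrected_mce` and its parameters are hypothetical, and the three lines it writes are simply the ones shown in the `cat correct` output above.

```shell
#!/bin/sh
# write_corrected_mce FILE CPU BANK
# Writes an mce-inject input file describing one corrected error on the
# given CPU and bank. The RIP value is the placeholder used in this thread.
write_corrected_mce() {
    file=$1
    cpu=$2
    bank=$3
    cat > "$file" <<EOF
CPU $cpu BANK $bank
STATUS corrected
RIP 0x12341234
EOF
}

# Example, reproducing the file used in this thread:
# write_corrected_mce correct 1 2
```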
Prevent the machine from crashing
[root@localhost mce-inject-master]# cd /sys/devices/system/machinecheck/machinecheck0
[root@localhost machinecheck0]# cat tolerant
1
[root@localhost machinecheck0]# vi tolerant
[root@localhost machinecheck0]# cat tolerant
3
[root@localhost machinecheck0]#
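The same change can be made without an editor by writing the value straight into the sysfs file. A minimal sketch, assuming the `tolerant` file exists on the running kernel (later kernels reportedly dropped this knob, so the helper checks first); the function name `set_tolerant` is made up for illustration:

```shell
#!/bin/sh
# set_tolerant LEVEL [FILE]
# Writes LEVEL to the machine-check "tolerant" knob and prints the value
# read back. FILE defaults to the sysfs path used in the session above.
# Returns 1 if the file is missing or not writable.
set_tolerant() {
    level=$1
    knob=${2:-/sys/devices/system/machinecheck/machinecheck0/tolerant}
    if [ ! -w "$knob" ]; then
        echo "tolerant knob not writable: $knob" >&2
        return 1
    fi
    printf '%s\n' "$level" > "$knob"
    cat "$knob"
}

# Example (as root), matching the session above:
# set_tolerant 3
```

The setting does not survive a reboot, so it has to be applied again before each injection session.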
Check edac status
[root@localhost ~]# ls /sys/devices/system/edac/mc
mc0 power subsystem uevent
[root@localhost ~]# find /lib/modules/$(uname -r) -name '*edac*'
/lib/modules/5.6.8-300.fc32.x86_64/kernel/drivers/edac
/lib/modules/5.6.8-300.fc32.x86_64/kernel/drivers/edac/amd64_edac_mod.ko.xz
/lib/modules/5.6.8-300.fc32.x86_64/kernel/drivers/edac/e752x_edac.ko.xz
/lib/modules/5.6.8-300.fc32.x86_64/kernel/drivers/edac/edac_mce_amd.ko.xz
/lib/modules/5.6.8-300.fc32.x86_64/kernel/drivers/edac/i10nm_edac.ko.xz
/lib/modules/5.6.8-300.fc32.x86_64/kernel/drivers/edac/i3000_edac.ko.xz
/lib/modules/5.6.8-300.fc32.x86_64/kernel/drivers/edac/i3200_edac.ko.xz
/lib/modules/5.6.8-300.fc32.x86_64/kernel/drivers/edac/i5000_edac.ko.xz
/lib/modules/5.6.8-300.fc32.x86_64/kernel/drivers/edac/i5100_edac.ko.xz
/lib/modules/5.6.8-300.fc32.x86_64/kernel/drivers/edac/i5400_edac.ko.xz
/lib/modules/5.6.8-300.fc32.x86_64/kernel/drivers/edac/i7300_edac.ko.xz
/lib/modules/5.6.8-300.fc32.x86_64/kernel/drivers/edac/i7core_edac.ko.xz
/lib/modules/5.6.8-300.fc32.x86_64/kernel/drivers/edac/i82975x_edac.ko.xz
/lib/modules/5.6.8-300.fc32.x86_64/kernel/drivers/edac/ie31200_edac.ko.xz
/lib/modules/5.6.8-300.fc32.x86_64/kernel/drivers/edac/pnd2_edac.ko.xz
/lib/modules/5.6.8-300.fc32.x86_64/kernel/drivers/edac/sb_edac.ko.xz
/lib/modules/5.6.8-300.fc32.x86_64/kernel/drivers/edac/skx_edac.ko.xz
/lib/modules/5.6.8-300.fc32.x86_64/kernel/drivers/edac/x38_edac.ko.xz
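Note that the find output lists the EDAC modules shipped with the kernel, not the ones that are loaded. A minimal sketch for checking the latter, assuming the usual /proc/modules layout with the module name in the first column; the function name `loaded_edac_modules` is hypothetical:

```shell
#!/bin/sh
# loaded_edac_modules [FILE]
# Prints the names of loaded kernel modules whose name contains "edac".
# FILE defaults to /proc/modules (one module per line, name first).
loaded_edac_modules() {
    awk '$1 ~ /edac/ { print $1 }' "${1:-/proc/modules}"
}

# Example:
# loaded_edac_modules
```

On this board one would expect an AMD module from the list above to show up; if nothing is printed, the counters queried with edac-util are unlikely to be meaningful.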
[root@localhost ~]# edac-util -rfull
mc0:csrow2:mc#0csrow#2channel#0:CE:0
mc0:csrow2:mc#0csrow#2channel#1:CE:0
mc0:csrow3:mc#0csrow#3channel#0:CE:0
mc0:csrow3:mc#0csrow#3channel#1:CE:0
mc0:noinfo:all:UE:0
mc0:noinfo:all:CE:0
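The totals that edac-util reports can also be read directly from sysfs, which is handy for scripting a before/after comparison around a test run. A minimal sketch, assuming each memory controller directory exposes `ce_count` and `ue_count` files; the function name `edac_totals` is made up for illustration:

```shell
#!/bin/sh
# edac_totals [BASE]
# Prints "CE=<n> UE=<n>", summed over every memory controller directory
# (mc0, mc1, ...) under BASE. BASE defaults to the sysfs path listed above.
edac_totals() {
    base=${1:-/sys/devices/system/edac/mc}
    ce=0
    ue=0
    for mc in "$base"/mc[0-9]*; do
        # Skip silently if a counter file is missing or unreadable.
        [ -r "$mc/ce_count" ] && ce=$((ce + $(cat "$mc/ce_count")))
        [ -r "$mc/ue_count" ] && ue=$((ue + $(cat "$mc/ue_count")))
    done
    echo "CE=$ce UE=$ue"
}

# Example:
# edac_totals
```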
Inject the error and observe the result
[root@localhost mce-inject-master]# modprobe mce_inject
[root@localhost mce-inject-master]# ./mce-inject correct
[root@localhost mce-inject-master]#
Message from syslogd@localhost at May 8 01:02:10 ...
kernel:[Hardware Error]: Corrected error, no action required.
Message from syslogd@localhost at May 8 01:02:10 ...
kernel:[Hardware Error]: CPU:1 (17:71:0) MC2_STATUS[-|CE|-|-|-|-|-|-|-|-]: 0x9000000000000000
Message from syslogd@localhost at May 8 01:02:10 ...
kernel:[Hardware Error]: IPID: 0x0000000000000000
Message from syslogd@localhost at May 8 01:02:10 ...
kernel:[Hardware Error]: L2 Cache Ext. Error Code: 0, L2M Tag Multiple-Way-Hit error.
Message from syslogd@localhost at May 8 01:02:10 ...
kernel:[Hardware Error]: cache level: RESV, tx: INSN
[root@localhost mce-inject-master]# ras-mc-ctl --summary
No Memory errors.
No PCIe AER errors.
No Extlog errors.
No devlink errors.
Disk errors summary:
0:0 has 1 errors
MCE records summary:
1 Corrected error, no action required. errors
[root@localhost mce-inject-master]#
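To check the result without watching the console, the kernel log can be searched for the “Corrected error” line shown in the syslog messages above. A minimal sketch that counts such lines from standard input; the function name `count_corrected_mce` is hypothetical, and the pattern is taken from the messages in this session, so it may need adjusting on other kernels:

```shell
#!/bin/sh
# count_corrected_mce
# Reads kernel log text on stdin and prints how many lines report a
# corrected machine-check error. Prints 0 when there are none.
count_corrected_mce() {
    # grep -c exits non-zero on zero matches, so mask that status.
    grep -c '\[Hardware Error\]: Corrected error' || true
}

# Example on a systemd machine (journalctl -k shows kernel messages):
# journalctl -k | count_corrected_mce
```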
Thanks, that was a good read.
I also was digging a little into how ECC errors should be propagated and there are a few things:
- ECC errors are reported with Machine Check Exceptions (MCE). Those exceptions are essentially just an event in which the CPU populates the MCA registers. From what I gathered, the kernel MCA handler periodically polls for a change in those registers and will report any errors (or panic, for example). I also read that MCEs are essentially interrupts similar to NMIs, so I am not sure how that fits with the polling strategy.
- AGESA decides if ECC should be enabled on a specific CPU, and depending on that the BIOS can allow the OS/kernel to register an MCA handler.
(old agesa code: https://github.com/coreboot/coreboot/tree/master/src/vendorcode/amd/agesa)
Now the most interesting part:
- AFAIK the BMC and the host OS do not talk to each other when it comes to MCEs. I contacted ASPEED support and they informed me that normally MCEs are reported to the BMC chip through APML, which is an I2C bus (on AMD CPUs).
According to the APML spec, the CPU exports the same set of MCA registers to the BMC as it does to the host OS. So when it comes to MCAs, the host OS and IPMI detect/log/report them independently.
AMD docs: https://developer.amd.com/resources/developer-guides-manuals/
APML spec: https://developer.amd.com/wordpress/media/2012/10/41918.pdf
It could be that APML is simply disabled for AM4 CPUs. I can’t find definitive information, but most of the APML marketing seems to point to EPYC exclusivity. At the same time, it’s only 2 CPU pins (SIC and SID) and even AM2 had them:
https://en.wikichip.org/wiki/amd/packages/socket_am2
If this is true and we are simply missing the APML link, then ECC errors should not be the only thing missing from the IPMI logs, because critical temperature events are also reported through this bus. Has anyone seen overheat events in the IPMI log?
It is also possible that the X470D4U simply did not wire those 2 pins.
- A reminder about mce-inject: looking at the documentation, it is only for testing the MCA handlers in the kernel, not for testing the whole platform. Be sure not to rely on it when it comes to simulating real ECC errors. The mce-inject code seems to suggest it is using EDAC driver inject points. You can find more info here: https://www.kernel.org/doc/html/latest/admin-guide/ras.html#edac-error-detection-and-correction
I believe ECC inject in the BIOS is a separate feature.
Edit: According to the previous arch BKDG (http://support.amd.com/TechDocs/50742_15h_Models_60h-6Fh_BKDG.pdf) the EDAC driver can support error injection using dedicated CPU registers. I don’t know if the logs differ compared to injecting errors just at the kernel driver level.
- The host OS can communicate with IPMI (/dev/ipmi), so it should be possible, for example, to bypass the APML requirement in the MCA handler and communicate with IPMI directly (with an “ipmitool event”-like mechanism). I do not believe something like this is being done currently (at least not in the MCA handler kernel code). So I think what you wrote about OS forwarding is simply not supposed to happen now:
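To make the “ipmitool event”-like idea concrete, here is a minimal sketch, assuming ipmitool is installed on the host and the in-band IPMI driver is loaded. It uses ipmitool's predefined test event 3 (a generic correctable memory ECC entry) purely as a stand-in for a real MCA record. The helper names are made up for illustration, and nothing like this is wired into the kernel MCA handler as far as I can tell.

```python
from __future__ import annotations

import shutil
import subprocess

# ipmitool defines three predefined test events; number 3 is the
# "Memory: Correctable ECC" one. It is a generic SEL test entry,
# not a decoded MCA record from the CPU.
PREDEFINED_ECC_EVENT = 3


def build_sel_event_command(event_number: int = PREDEFINED_ECC_EVENT) -> list[str]:
    """Hypothetical helper: argv that asks the local BMC to log a predefined event."""
    if event_number not in (1, 2, 3):
        raise ValueError("ipmitool only defines predefined events 1, 2 and 3")
    # Without -I/-H, ipmitool uses its default local interface
    # (normally the OpenIPMI "open" interface on Linux).
    return ["ipmitool", "event", str(event_number)]


def forward_ecc_event(dry_run: bool = True) -> list[str]:
    """Build the command and, unless dry_run is set, actually run it."""
    cmd = build_sel_event_command()
    if not dry_run and shutil.which("ipmitool"):
        subprocess.run(cmd, check=True)
    return cmd
```

Calling `forward_ecc_event(dry_run=False)` as root should add a test entry to the SEL, which only proves the host-to-BMC path works. It says nothing about whether real ECC errors reach the BMC.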
12 boards available at the time of writing this
Edit: Price seems to be going up, it started at ~247 Euro
Edit2: another 12 boards at itboost.de
Edit3: Aaand It’s gone
Edit4: The stock seems more stable now. Several retailers list it, and those that sold out a few days ago seem to be getting more units.
Hey guys,
I’m having some weird issues with this board that I was hoping you could help me figure out. I didn’t notice anyone with this exact issue but apologies if it’s already been mentioned.
To make a long story short, I don’t think I can get to the BIOS, and my machine isn’t POSTing.
The computer starts automatically as soon as it has power. If the computer is off and I plug in the PSU and switch it on, the BMC light goes solid, then after about 10 seconds there is a click, the BMC heartbeat starts and the machine starts to power up. Once it starts booting I can’t power it off by holding the power button or anything; I have to switch off the PSU.
I’ve connected a monitor over VGA and spammed Delete during this whole process, but can’t seem to access BIOS.
I am able to look at the machine over IPMI…kind of. I can log in and see the dashboard, but my MB model and BIOS version are blank. There are no alerts, just some system sensor/fan data. When I go to system inventory I see, “System information will be refreshed when the system POST, please restart the system if you see nothing on screen.”
I can’t KVM into the machine, either Java or web, and I have no power control over the machine over IPMI. I can’t turn the machine on or off over IPMI, but if I try the KVM I can see the correct server power status at the top right. If I manually turn off the server while on IPMI, I see the power switch go red before I lose connection.
I was able to update to the most recent BMC version (1.9) but I can’t remote flash the BIOS. I am able to upload the file and get it to the point where it says, “preparing to flash bios” but it just hangs on processing. I think it’s because the IPMI can’t power on or off the computer.
I’m going to probably take it all apart tonight and reconnect every component, but curious if there’s anything you all can think of with this.
Thanks!!
Did it stop POSTing, or has it never POSTed since purchase?
Quite normal - there is a BIOS option for this behavior
How does Dr. Debug look? (the boot codes)
I don’t really know what you mean by a click. I don’t have anything like it.
About the BMC light: yup, first few seconds solid, then pulsing - sounds normal to me
Bad BIOS flash? Try flashing through IPMI again, but using a clean browser without any addons/adblockers. And be patient.
If the BIOS is busted then KVM won’t have much to show.
This is concerning, this should work even without functioning BIOS. Did it work in the past?
I had similar issues, and your suspicion aligns with my experience a little. I once tried updating the BIOS through the IPMI while the machine was ON. Make sure it is OFF when flashing.
But since holding the power button doesn’t work, I have only one suggestion: disconnect any and all front panel connectors for power/reset/LEDs and use a screwdriver to short the pins.
I once had a similar, very frustrating problem with a PC that was boot looping. It was a broken power button in a chassis.
Additional question: What components? CPU/RAM/PCIe cards/Drives
Edit:
2 more things:
- Try resetting the CMOS - instruction in the manual
- Make sure there are no misaligned standoffs or stray screws shorting anything on the back of the board
No, it’s never worked. This is a new comp so all components are new.
No boot codes come up, the display on the MB doesn’t light up or anything. The BMC and power ready green LEDs are on though, which I think is normal.
I let it run for like an hour and a half last night. I think the issue is since I can’t turn the machine power on or off (request times out) the IPMI can’t shut down the computer to flash the BIOS. I can’t try to flash the BIOS with the machine off because it will auto start if it’s connected to the PSU.
As far as I understand it, it is working. The cable is connected into the IPMI port, and then hits my router. I’ve been controlling it from my PC also hard wired into the router. I was able to update the BMC today to the current version, so I think it’s hopefully at least connected right. I have admittedly not heard of IPMI until I bought this board haha.
After work today I’m planning on resetting the CMOS and trying to mess with ram variations to see if it will boot. If it doesn’t work I’ll just have to take it apart and reseat everything and go from there.
Comp parts are…
PSU - EVGA SuperNova 650 G3 80+ Gold
RAM - 16GB Corsair Vengeance LPX SDRAM DDR4 3000
Case - Fractal Design 804
CPU - AMD Ryzen 5 1600
MB - ASRock Rack X470D4U
NAS Storage - 2x 6gb WD RED
SSD/Cache - Kingston 240gb SSD
I meant the power control over IPMI.
If it were DDR fault you would most likely see it boot-loop a few times while cycling error codes. My current bet is shorted power/reset button, or just incorrectly connected front panel.
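To check power control outside the BMC web UI, something like the following could be used. This is only a sketch, assuming ipmitool is installed on another machine on the LAN and IPMI-over-LAN is enabled on the BMC. The helper name, host address and user name are placeholders, not values from this thread.

```python
from __future__ import annotations

# Sub-commands accepted by "ipmitool chassis power".
ALLOWED_ACTIONS = ("status", "on", "off", "cycle", "reset", "soft")


def chassis_power_command(host: str, user: str, action: str = "status") -> list[str]:
    """Hypothetical helper: argv for a remote chassis power request."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported power action: {action!r}")
    # -I lanplus : IPMI v2.0 over LAN
    # -E         : take the password from the IPMI_PASSWORD environment variable
    return [
        "ipmitool", "-I", "lanplus",
        "-H", host, "-U", user, "-E",
        "chassis", "power", action,
    ]
```

Running the result with `subprocess.run(chassis_power_command("192.0.2.10", "admin"), check=True, timeout=30)` should either print the power state or time out. A timeout here would suggest the problem is in the BMC or board rather than in the browser.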
Nope, power over IPMI has never worked.
Hopefully it’s something silly like the power button being connected wrong.
Hopefully.
One more comment about:
CPU - AMD Ryzen 5 1600
1st gen isn’t on the supported CPU list but STH posted an article with tests including 1600 (non AF). So it shouldn’t be an issue: https://www.servethehome.com/amd-ryzen-5-1600-af-review-a-wildcard-option/2/
Also: X470D4U review on STH is finally up: https://www.servethehome.com/asrock-rack-x470d4u-review-amd-ryzen-meet-server/
Has anyone been able to get AMD-RAID with the two NVMe slots to work when installing CentOS/Red Hat 8? It simply shows both NVMe disks, but not the amdraid I created in the BIOS.
Thanks
In my thread with the Intel P4500 NVMe issues @wendell mentions that there is no Linux driver support for an AMD NVMe RAID array so you’ll probably have to use general software RAID solutions from the intended distributions to get there.
AMD doesn’t have a native raid driver for Linux.
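For the software RAID route, a minimal sketch of what an mdadm mirror across the two NVMe drives could look like. The device names `/dev/nvme0n1` and `/dev/nvme1n1`, the array name `/dev/md0` and the helper itself are assumptions for illustration. The command wipes whatever is on the drives, and for a root filesystem the installer's own software RAID option is probably the easier path.

```python
from __future__ import annotations


def mdadm_mirror_command(devices: list[str], md_device: str = "/dev/md0") -> list[str]:
    """Hypothetical helper: argv that creates a RAID1 md array from the given devices."""
    if len(devices) < 2:
        raise ValueError("a mirror needs at least two devices")
    return [
        "mdadm", "--create", md_device,
        "--level=1",                        # RAID1 (mirror)
        f"--raid-devices={len(devices)}",   # number of active members
        *devices,
    ]


if __name__ == "__main__":
    # Only prints the command; run it manually as root once the device names are verified.
    print(" ".join(mdadm_mirror_command(["/dev/nvme0n1", "/dev/nvme1n1"])))
```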
STH is going to be releasing a review on these motherboards soon.
I’d expect a very detailed review with a totally helpful score of 9.7
Was way off, only got an 8.9
The review doesn’t really touch on the various bugs of the motherboard, so I guess we’ll see a few new users here soon
OP rises from the grave
Ugh… wha… STH finally reviewed this? Cool.
reads review
Well that was a useless waste of time…
For the moment, until more widespread support materializes for the Ryzen platform, our recommendation for the X470D4U will come with a huge caveat; we like the platform, but you will have to test it for yourself to make sure the platform works for your organization (or you if this is a lab environment) and your particular set of applications.
They should have just hotlinked this thread and called it a day.