ASRock Rack has created the first AM4 socket server boards, X470D4U, X470D4U2-2T

Depends on the BIOS version it came with. There should be a sticker with the version on the chip itself, next to the M.2 slots. (At least I have one on my board.)
Ryzen 3600 is supported since version P3.10
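
As a rough illustration of that check, here is a minimal Python sketch that compares the version on the sticker against P3.10. The function names are made up for this example, and it assumes versions always look like "P3.10" or "3.30" (an optional "P", then major.minor).

```python
def parse_bios_version(text):
    """Turn a string like 'P3.10' or '3.30' into a tuple such as (3, 10)."""
    major, minor = text.strip().lstrip("Pp").split(".")
    return int(major), int(minor)


# Minimum version for the Ryzen 3600, as stated in the post above.
MIN_FOR_RYZEN_3600 = parse_bios_version("P3.10")


def supports_ryzen_3600(sticker_version):
    """True if the given BIOS version is P3.10 or newer (hypothetical helper)."""
    return parse_bios_version(sticker_version) >= MIN_FOR_RYZEN_3600


if __name__ == "__main__":
    for version in ("P3.10", "3.30", "3.33"):
        print(version, supports_ryzen_3600(version))
```

Comparing the parsed tuples rather than the raw strings avoids surprises from plain text comparison.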

I have been seeing this kind of comment for some time now. Maybe it’s just not flashing properly and people don’t know what it should look like.

This is what the procedure should look like, in this case going from 3.30 to 3.33:
(Running in a clean Chromium: just installed, clean profile, no addons, adblock, custom settings, etc.)

Server is off - it must be off!
see the ‘Host is currently off’ note in the power control tab


Current version is 3.30

the update page

choose the 3.33 BIOS

start uploading, for me it was ‘stuck’ at 99% for about a minute

the upload part finished, prompt to continue

flashing in progress

finished, click ok and login again

IPMI still shows 3.30, that’s OK, we didn’t POST yet

power on, go into the BIOS - it will show 3.33,
then power off and go back to the IPMI, now it shows 3.33:

@JediAcolyte Thank you.

Hello! Do you still have the X470D4U available? :slight_smile:

Sure do, shoot me a PM.


@cybrnook Either I’m too dumb to find the PM option, or I actually can’t PM you (probably because I’m new in the forums?) :thinking:

Hi everyone,

It’s been a while, but I have another update…

I’ll start with a summary and then post all “evidence”. This is all regarding the platform ASRock Rack X470D4U2-2T + AMD Ryzen 3x00 (Zen 2 cores) + ECC memory (see my signature for more specifics), using the latest stable BIOS. I am using the “overclock the memory until it is barely stable” method, as described in earlier posts.

  1. Memory error injection on Linux using mce-inject, as described some posts earlier, does not inject memory errors on a platform level, but only on an OS level. So it is not suitable for testing whether the IPMI / BMC properly handles memory error detection. We discovered this because the “Platform First Error Handling” toggle in the BIOS has no effect on this method.
  2. ECC correction works!
  • Already confirmed / proven earlier in this thread.
  • When using default BIOS settings.
  3. (Corrected) single-bit ECC memory error detection by “the OS” works (if correctly implemented)!
  • Already confirmed / proven earlier in this thread.
  • But only when setting “Platform First Error Handling” to disabled in the BIOS.
  • Works on, for example:
    • Memtest86 v8.4 or higher
    • Linux kernel 5.6 or higher
    • TrueNAS 12.0 beta 1 (not on FreeNAS 11.3)
  4. (Uncorrected) multi-bit ECC memory error detection by “the OS” works (if correctly implemented)!
  • This is a new discovery.
  • But only when setting “Platform First Error Handling” to disabled in the BIOS.
  • Works on, for example:
    • Memtest86 (unreleased version - fixes will be included in next release)
    • Linux kernel 5.7 (probably also on 5.6, but I didn’t try it)
    • Not sure about TrueNAS 12.0 beta 1. I haven’t been able to trigger or recognize it yet.
  5. IPMI / BMC is unable to detect any kind of memory error.
  • Confirmed once more.
  • Even when setting “Platform First Error Handling” to enabled in the BIOS.
  • ASRock Rack is (hopefully) still working on getting this fixed?

(Uncorrected) multi-bit ECC memory error detection by "the OS"

Memtest86

After notifying Passmark that Linux is able to detect (uncorrected) multi-bit ECC memory errors and Memtest86 v8.4 isn’t, they’ve asked me to send the log files. They then provided me a new version (so far still unreleased) which fixes the issue and can properly detect (uncorrected) multi-bit ECC memory errors!

Sorry, forgot to take a screenshot of this one. I do still have the log file. Here is the summary of the report:

Test Start Time 2020-05-25 08:19:11
Elapsed Time 1:47:46
Memory Range Tested 0x0 - 80F380000 (33011MB)
CPU Selection Mode Parallel (All CPUs)
ECC Polling Enabled
# Tests Passed 7/19 (36%)
Lowest Error Address 0x489128C48 (18577MB)
Highest Error Address 0x73D8367A0 (29656MB)
Bits in Error Mask 00000000FDDFFFFF
Bits in Error 30
Max Contiguous Errors 2
ECC Correctable Errors 2689
ECC Uncorrectable Errors 244
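
The numbers in this summary can be cross-checked with a little arithmetic. The following is only a sketch in Python with made-up helper names. It assumes the "MB" figures are byte addresses divided by 2^20 and rounded down, and that "Bits in Error" is the number of set bits in the error mask.

```python
# Values copied from the Memtest86 summary above.
ERROR_MASK = 0x00000000FDDFFFFF
LOWEST_ERROR_ADDR = 0x489128C48
HIGHEST_ERROR_ADDR = 0x73D8367A0
RANGE_END = 0x80F380000


def bits_set(mask):
    """Count the 1 bits in the mask (hypothetical helper)."""
    return bin(mask).count("1")


def to_mb(addr):
    """Convert a byte address to whole MB, assuming 1 MB = 2**20 bytes."""
    return addr // (1 << 20)


if __name__ == "__main__":
    print("bits in error:", bits_set(ERROR_MASK))
    print("lowest error address (MB):", to_mb(LOWEST_ERROR_ADDR))
    print("highest error address (MB):", to_mb(HIGHEST_ERROR_ADDR))
    print("end of tested range (MB):", to_mb(RANGE_END))
```

Under those assumptions the results agree with the report: 30 bits in error, error addresses at 18577MB and 29656MB, and a tested range of 33011MB.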

Linux

Maybe I have triggered these earlier already, but I didn’t notice them till recently.

[root@localhost ~]# edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: mc#0csrow#2channel#0: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#1: 3 Corrected Errors
mc0: csrow3: 1 Uncorrected Errors
mc0: csrow3: mc#0csrow#3channel#0: 3 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#1: 0 Corrected Errors

[root@localhost ~]# ras-mc-ctl --summary

Memory controller events summary:
        Corrected on DIMM Label(s): 'mc#0csrow#2channel#1' location: 0:2:1:-1 errors: 3
        Corrected on DIMM Label(s): 'mc#0csrow#3channel#0' location: 0:3:0:-1 errors: 3
        Fatal on DIMM Label(s): 'mc#0csrow#3channel#0' location: 0:3:0:-1 errors: 1
No PCIe AER errors.
No Extlog errors.
No devlink errors.

Disk errors summary:
        0:0 has 17 errors
        0:2048 has 147 errors
        0:2816 has 4 errors

MCE records summary:
 12 Corrected error, no action required. errors
 1 Deferred error, no action required. errors
  2 Uncorrected, software containable error. errors

[root@localhost ~]#
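
To total the per-channel counters, the edac-util -v output above can be summed with a few lines of Python. This is just a sketch: the regular expression is based only on the line format seen in this particular output, and the function name is made up.

```python
import re

# Sample copied from the edac-util -v output above.
EDAC_UTIL_OUTPUT = """\
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: mc#0csrow#2channel#0: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#1: 3 Corrected Errors
mc0: csrow3: 1 Uncorrected Errors
mc0: csrow3: mc#0csrow#3channel#0: 3 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#1: 0 Corrected Errors
"""

# Matches the "<count> Corrected|Uncorrected Errors" part that follows a colon.
LINE_RE = re.compile(r":\s*(\d+) (Corrected|Uncorrected) Errors")


def total_errors(text):
    """Sum corrected and uncorrected counts over all lines (hypothetical helper)."""
    totals = {"Corrected": 0, "Uncorrected": 0}
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match:
            totals[match.group(2)] += int(match.group(1))
    return totals


if __name__ == "__main__":
    print(total_errors(EDAC_UTIL_OUTPUT))
```

For this sample it gives 6 corrected and 1 uncorrected error, which lines up with the memory controller events in the ras-mc-ctl summary (3 + 3 corrected, 1 fatal).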

[root@localhost ~]# cat /var/log/messages
...
May 20 00:08:59 localhost rasdaemon[995]: rasdaemon: mce_record store: 0x55aaea8a19e8
May 20 00:08:59 localhost kernel: mce_notify_irq: 1 callbacks suppressed

May 20 00:08:59 localhost kernel: mce: [Hardware Error]: Machine check events logged
May 20 00:08:59 localhost kernel: [Hardware Error]: Corrected error, no action required.
May 20 00:08:59 localhost kernel: [Hardware Error]: CPU:0 (17:71:0) MC17_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
May 20 00:08:59 localhost kernel: [Hardware Error]: Error Addr: 0x00000003080ccb40
May 20 00:08:59 localhost kernel: [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0xf79c00000b800003
May 20 00:08:59 localhost kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
May 20 00:08:59 localhost kernel: EDAC MC0: 1 CE Cannot decode normalized address on mc#0csrow#3channel#0 (csrow:3 channel:0 page:0x0 offset:0x0 grain:64 syndrome:0x0)
May 20 00:08:59 localhost kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD

May 20 00:08:59 localhost kernel: mce: [Hardware Error]: Machine check events logged
May 20 00:08:59 localhost kernel: [Hardware Error]: Corrected error, no action required.
May 20 00:08:59 localhost kernel: [Hardware Error]: CPU:0 (17:71:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
May 20 00:08:59 localhost kernel: [Hardware Error]: Error Addr: 0x00000003095cc100
May 20 00:08:59 localhost kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x510600800a800302
May 20 00:08:59 localhost kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
May 20 00:08:59 localhost kernel: EDAC MC0: 1 CE Cannot decode normalized address on mc#0csrow#2channel#1 (csrow:2 channel:1 page:0x0 offset:0x0 grain:64 syndrome:0x80)
May 20 00:08:59 localhost kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD

May 20 00:08:59 localhost rasdaemon[995]: rasdaemon: register inserted at db
May 20 00:08:59 localhost rasdaemon[995]:           <...>-661   [000]     0.000066: mce_record:           2020-04-01 19:34:33 +0200 Unified Memory Controller (bank=17), status= dc2040000000011b, Corrected error, no action required., mci=Error_overflow CECC, mca= DRAM ECC error.
May 20 00:08:59 localhost rasdaemon[995]: Memory Error 'mem-tx: generic read, tx: generic, level: L3/generic', memory_channel=0,csrow=3, cpu_type= AMD Family 17h Zen1, cpu= 0, socketid= 0, misc= d01a0f7c01000000, addr= 3080ccb40, synd= f79c00000b800003, ipid= 9600050f00, mcgstatus=0, mcgcap= 11c, apicid= 0
May 20 00:08:59 localhost rasdaemon[995]: rasdaemon: mc_event store: 0x55aaea8a4418
May 20 00:08:59 localhost rasdaemon[995]: rasdaemon: register inserted at db
May 20 00:08:59 localhost rasdaemon[995]:           <...>-661   [000]     0.000066: mc_event:             2020-04-01 19:34:33 +0200 1 Corrected error: Cannot decode normalized address on mc#0csrow#3channel#0 (mc: 0 location: 3:0 grain: 6)
May 20 00:08:59 localhost rasdaemon[995]: rasdaemon: mce_record store: 0x55aaea8a19e8
May 20 00:08:59 localhost rasdaemon[995]: rasdaemon: register inserted at db
May 20 00:08:59 localhost rasdaemon[995]:           <...>-661   [000]     0.000066: mce_record:           2020-04-01 19:34:33 +0200 Unified Memory Controller (bank=18), status= dc2040000000011b, Corrected error, no action required., mci=Error_overflow CECC, mca= DRAM ECC error.
May 20 00:08:59 localhost rasdaemon[995]: Memory Error 'mem-tx: generic read, tx: generic, level: L3/generic', memory_channel=1,csrow=2, cpu_type= AMD Family 17h Zen1, cpu= 0, socketid= 0, misc= d01a01d301000000, addr= 3095cc100, synd= 510600800a800302, ipid= 9600150f00, mcgstatus=0, mcgcap= 11c, apicid= 0
May 20 00:08:59 localhost rasdaemon[995]: rasdaemon: mc_event store: 0x55aaea8a4418
May 20 00:08:59 localhost rasdaemon[995]: rasdaemon: register inserted at db
May 20 00:08:59 localhost rasdaemon[995]:           <...>-661   [000]     0.000066: mc_event:             2020-04-01 19:34:33 +0200 1 Corrected error: Cannot decode normalized address on mc#0csrow#2channel#1 (mc: 0 location: 2:1 grain: 6 syndrome: 0x00000080)
May 20 00:08:59 localhost abrt-server[1611]: Not saving repeating crash in '/boot/vmlinuz-5.6.8-300.fc32.x86_64'
May 20 00:08:59 localhost abrt-server[1614]: Not saving repeating crash in '/boot/vmlinuz-5.6.8-300.fc32.x86_64'
May 20 00:08:59 localhost abrt-server[1618]: Not saving repeating crash in '/boot/vmlinuz-5.6.8-300.fc32.x86_64'
May 20 00:08:59 localhost systemd[1]: Started dbus-:1.3-org.freedesktop.problems@2.service.
May 20 00:08:59 localhost audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=dbus-:1.3-org.freedesktop.problems@2 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
May 20 00:09:00 localhost abrt-dump-journal-oops[1036]: abrt-dump-journal-oops: Found oopses: 1
May 20 00:09:00 localhost abrt-dump-journal-oops[1036]: abrt-dump-journal-oops: Creating problem directories
May 20 00:09:00 localhost abrt-notification[1657]: System encountered a non-fatal error in ??()
May 20 00:09:01 localhost abrt-dump-journal-oops[1036]: Reported 1 kernel oopses to Abrt
May 20 00:11:12 localhost systemd[1]: dbus-:1.3-org.freedesktop.problems@2.service: Succeeded.
May 20 00:11:12 localhost audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=dbus-:1.3-org.freedesktop.problems@2 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
May 20 00:12:15 localhost kernel: mce: [Hardware Error]: Machine check events logged
May 20 00:12:15 localhost kernel: [Hardware Error]: Corrected error, no action required.
May 20 00:12:15 localhost kernel: [Hardware Error]: CPU:0 (17:71:0) MC17_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
May 20 00:12:15 localhost kernel: [Hardware Error]: Error Addr: 0x0000000301a4ef80
May 20 00:12:15 localhost kernel: [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0xf79c00000b800003
May 20 00:12:15 localhost kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
May 20 00:12:15 localhost kernel: EDAC MC0: 1 CE Cannot decode normalized address on mc#0csrow#3channel#0 (csrow:3 channel:0 page:0x0 offset:0x0 grain:64 syndrome:0x0)
May 20 00:12:15 localhost kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
May 20 00:12:15 localhost rasdaemon[995]: rasdaemon: mce_record store: 0x55aaea8a19e8
May 20 00:12:15 localhost rasdaemon[995]: rasdaemon: register inserted at db
May 20 00:12:15 localhost rasdaemon[995]:           <...>-661   [000]     0.000086: mce_record:           2020-04-01 19:37:49 +0200 Unified Memory Controller (bank=17), status= dc2040000000011b, Corrected error, no action required., mci=Error_overflow CECC, mca= DRAM ECC error.
May 20 00:12:15 localhost rasdaemon[995]: Memory Error 'mem-tx: generic read, tx: generic, level: L3/generic', memory_channel=0,csrow=3, cpu_type= AMD Family 17h Zen1, cpu= 0, socketid= 0, misc= d01b0fff01000000, addr= 301a4ef80, synd= f79c00000b800003, ipid= 9600050f00, mcgstatus=0, mcgcap= 11c, apicid= 0
May 20 00:12:15 localhost rasdaemon[995]: rasdaemon: mc_event store: 0x55aaea8a4418
May 20 00:12:15 localhost rasdaemon[995]: rasdaemon: register inserted at db
May 20 00:12:15 localhost rasdaemon[995]:           <...>-661   [000]     0.000086: mc_event:             2020-04-01 19:37:49 +0200 1 Corrected error: Cannot decode normalized address on mc#0csrow#3channel#0 (mc: 0 location: 3:0 grain: 6)
May 20 00:12:15 localhost abrt-server[1674]: Not saving repeating crash in '/boot/vmlinuz-5.6.8-300.fc32.x86_64'
May 20 00:12:15 localhost systemd[1]: Started dbus-:1.3-org.freedesktop.problems@3.service.
May 20 00:12:15 localhost audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=dbus-:1.3-org.freedesktop.problems@3 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
May 20 00:12:17 localhost abrt-dump-journal-oops[1036]: abrt-dump-journal-oops: Found oopses: 1
May 20 00:12:17 localhost abrt-dump-journal-oops[1036]: abrt-dump-journal-oops: Creating problem directories
May 20 00:12:17 localhost abrt-notification[1710]: System encountered a non-fatal error in ??()
May 20 00:12:18 localhost abrt-dump-journal-oops[1036]: Reported 1 kernel oopses to Abrt
May 20 00:12:59 localhost systemd[1]: Starting Cleanup of Temporary Directories...
May 20 00:12:59 localhost systemd-tmpfiles[1712]: /usr/lib/tmpfiles.d/BackupPC.conf:1: Line references path below legacy directory /var/run/, updating /var/run/BackupPC → /run/BackupPC; please update the tmpfiles.d/ drop-in file accordingly.
May 20 00:12:59 localhost systemd-tmpfiles[1712]: /etc/tmpfiles.d/tpm2-tss-fapi.conf:3: Line references path below legacy directory /var/run/, updating /var/run/tpm2-tss/eventlog → /run/tpm2-tss/eventlog; please update the tmpfiles.d/ drop-in file accordingly.
May 20 00:12:59 localhost systemd[1]: systemd-tmpfiles-clean.service: Succeeded.
May 20 00:12:59 localhost systemd[1]: Finished Cleanup of Temporary Directories.
May 20 00:12:59 localhost audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-tmpfiles-clean comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
May 20 00:12:59 localhost audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-tmpfiles-clean comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
May 20 00:14:26 localhost rasdaemon[995]: rasdaemon: mce_record store: 0x55aaea8a19e8

May 20 00:14:26 localhost kernel: mce: [Hardware Error]: Machine check events logged
May 20 00:14:26 localhost kernel: [Hardware Error]: Corrected error, no action required.
May 20 00:14:26 localhost kernel: [Hardware Error]: CPU:0 (17:71:0) MC17_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
May 20 00:14:26 localhost kernel: [Hardware Error]: Error Addr: 0x0000000395164300
May 20 00:14:26 localhost kernel: [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0xf79c00000b800003
May 20 00:14:26 localhost kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
May 20 00:14:26 localhost kernel: EDAC MC0: 1 CE Cannot decode normalized address on mc#0csrow#3channel#0 (csrow:3 channel:0 page:0x0 offset:0x0 grain:64 syndrome:0x0)
May 20 00:14:26 localhost kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD

May 20 00:14:26 localhost kernel: mce: [Hardware Error]: Machine check events logged
May 20 00:14:26 localhost kernel: [Hardware Error]: Corrected error, no action required.
May 20 00:14:26 localhost kernel: [Hardware Error]: CPU:0 (17:71:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
May 20 00:14:26 localhost kernel: [Hardware Error]: Error Addr: 0x000000030088c100
May 20 00:14:26 localhost kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x510600800a800302
May 20 00:14:26 localhost kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
May 20 00:14:26 localhost kernel: EDAC MC0: 1 CE Cannot decode normalized address on mc#0csrow#2channel#1 (csrow:2 channel:1 page:0x0 offset:0x0 grain:64 syndrome:0x80)
May 20 00:14:26 localhost kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD

May 20 00:14:26 localhost rasdaemon[995]: rasdaemon: register inserted at db
May 20 00:14:26 localhost rasdaemon[995]:           <...>-661   [000]     0.000099: mce_record:           2020-04-01 19:40:01 +0200 Unified Memory Controller (bank=17), status= dc2040000000011b, Corrected error, no action required., mci=Error_overflow CECC, mca= DRAM ECC error.
May 20 00:14:26 localhost rasdaemon[995]: Memory Error 'mem-tx: generic read, tx: generic, level: L3/generic', memory_channel=0,csrow=3, cpu_type= AMD Family 17h Zen1, cpu= 0, socketid= 0, misc= d01b0fff01000000, addr= 395164300, synd= f79c00000b800003, ipid= 9600050f00, mcgstatus=0, mcgcap= 11c, apicid= 0
May 20 00:14:26 localhost rasdaemon[995]: rasdaemon: mc_event store: 0x55aaea8a4418
May 20 00:14:26 localhost rasdaemon[995]: rasdaemon: register inserted at db
May 20 00:14:26 localhost rasdaemon[995]:           <...>-661   [000]     0.000099: mc_event:             2020-04-01 19:40:01 +0200 1 Corrected error: Cannot decode normalized address on mc#0csrow#3channel#0 (mc: 0 location: 3:0 grain: 6)
May 20 00:14:26 localhost rasdaemon[995]: rasdaemon: mce_record store: 0x55aaea8a19e8
May 20 00:14:26 localhost rasdaemon[995]: rasdaemon: register inserted at db
May 20 00:14:26 localhost rasdaemon[995]:           <...>-661   [000]     0.000099: mce_record:           2020-04-01 19:40:01 +0200 Unified Memory Controller (bank=18), status= dc2040000000011b, Corrected error, no action required., mci=Error_overflow CECC, mca= DRAM ECC error.
May 20 00:14:26 localhost rasdaemon[995]: Memory Error 'mem-tx: generic read, tx: generic, level: L3/generic', memory_channel=1,csrow=2, cpu_type= AMD Family 17h Zen1, cpu= 0, socketid= 0, misc= d01a033c01000000, addr= 30088c100, synd= 510600800a800302, ipid= 9600150f00, mcgstatus=0, mcgcap= 11c, apicid= 0
May 20 00:14:26 localhost rasdaemon[995]: rasdaemon: mc_event store: 0x55aaea8a4418
May 20 00:14:26 localhost rasdaemon[995]: rasdaemon: register inserted at db
May 20 00:14:26 localhost rasdaemon[995]:           <...>-661   [000]     0.000099: mc_event:             2020-04-01 19:40:01 +0200 1 Corrected error: Cannot decode normalized address on mc#0csrow#2channel#1 (mc: 0 location: 2:1 grain: 6 syndrome: 0x00000080)
May 20 00:14:26 localhost abrt-server[1729]: Not saving repeating crash in '/boot/vmlinuz-5.6.8-300.fc32.x86_64'
May 20 00:14:26 localhost abrt-server[1732]: Not saving repeating crash in '/boot/vmlinuz-5.6.8-300.fc32.x86_64'
May 20 00:14:26 localhost abrt-server[1735]: Not saving repeating crash in '/boot/vmlinuz-5.6.8-300.fc32.x86_64'
May 20 00:14:28 localhost abrt-dump-journal-oops[1036]: abrt-dump-journal-oops: Found oopses: 1
May 20 00:14:28 localhost abrt-dump-journal-oops[1036]: abrt-dump-journal-oops: Creating problem directories
May 20 00:14:28 localhost abrt-notification[1772]: System encountered a non-fatal error in ??()
May 20 00:14:28 localhost systemd[1]: dbus-:1.3-org.freedesktop.problems@3.service: Succeeded.
May 20 00:14:28 localhost audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=dbus-:1.3-org.freedesktop.problems@3 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
May 20 00:14:29 localhost abrt-dump-journal-oops[1036]: Reported 1 kernel oopses to Abrt
May 20 00:17:03 localhost rasdaemon[995]: rasdaemon: mce_record store: 0x55aaea8a19e8

May 20 00:17:03 localhost kernel: mce: Uncorrected hardware memory error in user-access at 621211640
May 20 00:17:03 localhost kernel: mce: [Hardware Error]: Machine check events logged
May 20 00:17:03 localhost kernel: [Hardware Error]: Uncorrected, software restartable error.
May 20 00:17:03 localhost kernel: [Hardware Error]: CPU:9 (17:71:0) MC0_STATUS[-|UE|MiscV|AddrV|-|-|-|UECC|-|Poison|-]: 0xbc002800000c0135
May 20 00:17:03 localhost kernel: [Hardware Error]: Error Addr: 0x0000000621211640
May 20 00:17:03 localhost kernel: [Hardware Error]: IPID: 0x000000b000000000
May 20 00:17:03 localhost kernel: [Hardware Error]: Load Store Unit Ext. Error Code: 12, DC Data error type 1 and poison consumption.
May 20 00:17:03 localhost kernel: [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
May 20 00:17:03 localhost kernel: Memory failure: 0x621211: Sending SIGBUS to memtester:1666 due to hardware memory corruption
May 20 00:17:03 localhost kernel: Memory failure: 0x621211: recovery action for dirty LRU page: Recovered
May 20 00:17:03 localhost audit[1666]: ANOM_ABEND auid=0 uid=0 gid=0 ses=1 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 pid=1666 comm="memtester" exe="/usr/bin/memtester" sig=7 res=1
May 20 00:17:03 localhost rasdaemon[995]: rasdaemon: register inserted at db
May 20 00:17:03 localhost rasdaemon[995]:           <...>-213   [009]     0.000114: mce_record:           2020-04-01 19:42:37 +0200 Load Store Unit (bank=0), status= bc002800000c0135, Uncorrected, software containable error., mci=UECC Poison consumed, mca= DC data error type 1 (poison consumption).
May 20 00:17:03 localhost rasdaemon[995]: Memory Error 'mem-tx: data read, tx: data, level: L1', cpu_type= AMD Family 17h Zen1, cpu= 9, socketid= 0, ip= 401e81, cs= 33, misc= d01a000000000000, addr= 621211640, ipid= b000000000, mcgstatus=7 RIPV EIPV MCIP, mcgcap= 11c, apicid= 9
May 20 00:17:03 localhost audit: BPF prog-id=44 op=LOAD
May 20 00:17:03 localhost audit: BPF prog-id=45 op=LOAD
May 20 00:17:03 localhost audit: BPF prog-id=46 op=LOAD
May 20 00:17:03 localhost systemd[1]: Started Process Core Dump (PID 1790/UID 0).
May 20 00:17:03 localhost audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-coredump@1-1790-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
May 20 00:17:03 localhost systemd[1]: Started dbus-:1.3-org.freedesktop.problems@4.service.
May 20 00:17:03 localhost audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=dbus-:1.3-org.freedesktop.problems@4 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
May 20 00:17:04 localhost abrt-dump-journal-oops[1036]: abrt-dump-journal-oops: Found oopses: 1
May 20 00:17:04 localhost abrt-dump-journal-oops[1036]: abrt-dump-journal-oops: Creating problem directories
May 20 00:17:05 localhost abrt-dump-journal-oops[1036]: Reported 1 kernel oopses to Abrt
May 20 00:17:06 localhost abrt-notification[1833]: System encountered a non-fatal error in ??()
May 20 00:17:07 localhost systemd-coredump[1792]: Core file was truncated to 2147483648 bytes.
May 20 00:17:08 localhost abrt-dump-journal-core[1035]: Failed to obtain all required information from journald
May 20 00:17:12 localhost systemd-coredump[1792]: Process 1666 (memtester) of user 0 dumped core.#012#012Stack trace of thread 1666:#012#0  0x0000000000401e81 compare_regions (/usr/bin/memtester + 0x1e81)
May 20 00:17:12 localhost systemd[1]: systemd-coredump@1-1790-0.service: Succeeded.
May 20 00:17:12 localhost audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-coredump@1-1790-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
May 20 00:17:12 localhost systemd[1]: systemd-coredump@1-1790-0.service: Consumed 1.976s CPU time.
May 20 00:17:12 localhost audit: BPF prog-id=46 op=UNLOAD
May 20 00:17:12 localhost audit: BPF prog-id=45 op=UNLOAD
May 20 00:17:12 localhost audit: BPF prog-id=44 op=UNLOAD
May 20 00:17:17 localhost abrtd[1003]: Size of '/var/spool/abrt' >= 5000 MB (MaxCrashReportsSize), deleting old directory 'oops-2020-05-20-00:17:04-1036-0'
May 20 00:17:17 localhost abrtd[1003]: Size of '/var/spool/abrt' >= 5000 MB (MaxCrashReportsSize), deleting old directory 'oops-2020-05-20-00:14:28-1036-0'
May 20 00:17:17 localhost abrtd[1003]: Size of '/var/spool/abrt' >= 5000 MB (MaxCrashReportsSize), deleting old directory 'oops-2020-05-20-00:12:17-1036-0'
May 20 00:17:17 localhost abrtd[1003]: Size of '/var/spool/abrt' >= 5000 MB (MaxCrashReportsSize), deleting old directory 'oops-2020-05-20-00:09:00-1036-0'
May 20 00:17:17 localhost abrtd[1003]: Size of '/var/spool/abrt' >= 5000 MB (MaxCrashReportsSize), deleting old directory 'oops-2020-05-20-00:03:33-1036-0'
May 20 00:17:17 localhost abrtd[1003]: Size of '/var/spool/abrt' >= 5000 MB (MaxCrashReportsSize), deleting old directory 'ras-2020-05-20-00:03:31-995'
May 20 00:17:17 localhost abrtd[1003]: Size of '/var/spool/abrt' >= 5000 MB (MaxCrashReportsSize), deleting old directory 'ras-2020-05-20-00:17:03-995'
May 20 00:17:17 localhost abrtd[1003]: Size of '/var/spool/abrt' >= 5000 MB (MaxCrashReportsSize), deleting old directory 'ras-2020-05-20-00:12:15-995'
May 20 00:17:17 localhost abrtd[1003]: Size of '/var/spool/abrt' >= 5000 MB (MaxCrashReportsSize), deleting old directory 'ras-2020-05-20-00:14:26-995'
May 20 00:17:17 localhost abrtd[1003]: Size of '/var/spool/abrt' >= 5000 MB (MaxCrashReportsSize), deleting old directory 'ras-2020-05-20-00:08:59-995'
May 20 00:17:17 localhost abrt-server[1844]: Error: No segments found in coredump './coredump'
May 20 00:17:17 localhost abrt-server[1844]: Can't open file 'core_backtrace' for reading: No such file or directory
May 20 00:17:17 localhost abrt-notification[1889]: Process 1666 (memtester) crashed in ??()
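
The flags the kernel prints between the square brackets of MCx_STATUS can be reproduced from the raw status value. Below is a small Python sketch of that decoding. The bit positions are taken, as far as I understand them, from the MCI_STATUS_* definitions in the Linux kernel's arch/x86/include/asm/mce.h. The table only covers the flags relevant here and the function name is made up.

```python
# Bit positions of selected MCA status flags (assumed from the kernel's
# MCI_STATUS_* definitions; not a complete list).
STATUS_BITS = {
    "Val": 63,       # register contains a valid error
    "Over": 62,      # error overflow
    "UC": 61,        # uncorrected error
    "En": 60,        # error reporting enabled
    "MiscV": 59,     # MISC register valid
    "AddrV": 58,     # ADDR register valid
    "PCC": 57,       # processor context corrupt
    "TCC": 55,       # task context corrupt
    "SyndV": 53,     # syndrome valid
    "CECC": 46,      # corrected by ECC
    "UECC": 45,      # uncorrected, detected by ECC
    "Deferred": 44,  # deferred error
    "Poison": 43,    # poisoned data consumed
}


def decode_status(status):
    """Return the names of the flags set in a 64-bit status value."""
    return [name for name, bit in STATUS_BITS.items() if (status >> bit) & 1]


if __name__ == "__main__":
    # Status values copied from the log above.
    for value in (0xDC2040000000011B, 0xBC002800000C0135):
        print(hex(value), decode_status(value))
```

The first value (the corrected errors on banks 17 and 18) has Over, MiscV, AddrV, SyndV and CECC set, and the second (the memtester crash) has UC, MiscV, AddrV, UECC and Poison set. That agrees with what the kernel printed for both.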

FreeNAS / TrueNAS testing

I’ve also done some brief testing in FreeNAS / TrueNAS. For this I’ve created a Fedora 32 virtual machine inside FreeNAS / TrueNAS, allocating 20GB of the 32GB of RAM to the VM and then ran “memtester 18gb” in the Fedora VM to stress the memory. Below are the results:

  • FreeNAS 11.3 U3.2 (and probably earlier as well) does not detect anything at all. It just crashes after a while (probably when an uncorrected error occurs). I couldn’t find anything in the logs.
  • TrueNAS 12.0 beta 1 properly detects the corrected errors and shows the following on the console and in /var/log/messages:
Jul  7 13:08:50 data MCA: Bank 17, Status 0xdc2040000000011b
Jul  7 13:08:50 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:08:50 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:08:50 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:08:50 data MCA: Address 0x400000326059a00
Jul  7 13:08:50 data MCA: Misc 0xd01b0fff01000000
Jul  7 13:08:50 data MCA: Bank 18, Status 0x9c2040000000011b
Jul  7 13:08:50 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:08:50 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:08:50 data MCA: CPU 0 COR GCACHE LG RD error
Jul  7 13:08:50 data MCA: Address 0x40000031dc09ae0
Jul  7 13:08:50 data MCA: Misc 0xd01a0ffc01000000
Jul  7 13:08:54 data MCA: Bank 17, Status 0xdc2040000000011b
Jul  7 13:08:54 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:08:54 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:08:54 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:08:54 data MCA: Address 0x40000032772a880
Jul  7 13:08:54 data MCA: Misc 0xd01b0fff01000000
Jul  7 13:08:54 data MCA: Bank 18, Status 0x9c2040000000011b
Jul  7 13:08:54 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:08:54 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:08:54 data MCA: CPU 0 COR GCACHE LG RD error
Jul  7 13:08:54 data MCA: Address 0x400000323044240
Jul  7 13:08:54 data MCA: Misc 0xd01a0ffc01000000
Jul  7 13:08:56 data kernel: ix1: link state changed to UP
Jul  7 13:09:51 data MCA: Bank 17, Status 0xdc2040000000011b
Jul  7 13:09:51 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:09:51 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:09:51 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:09:51 data MCA: Address 0x4000003254bf4c0
Jul  7 13:09:51 data MCA: Misc 0xd01b0fff01000000
Jul  7 13:09:51 data MCA: Bank 18, Status 0xdc2040000000011b
Jul  7 13:09:51 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:09:51 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:09:51 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:09:51 data MCA: Address 0x4000003254b8240
Jul  7 13:09:51 data MCA: Misc 0xd01a0ffd01000000
Jul  7 13:12:52 data MCA: Bank 17, Status 0xdc2040000000011b
Jul  7 13:12:52 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:12:52 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:12:52 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:12:52 data MCA: Address 0x4000003242494c0
Jul  7 13:12:52 data MCA: Misc 0xd01b0fff01000000
Jul  7 13:12:52 data MCA: Bank 18, Status 0x9c2040000000011b
Jul  7 13:12:52 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:12:52 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:12:52 data MCA: CPU 0 COR GCACHE LG RD error
Jul  7 13:12:52 data MCA: Address 0x40000031dc09ac0
Jul  7 13:12:52 data MCA: Misc 0xd01a0ffc01000000
Jul  7 13:13:03 data MCA: Bank 17, Status 0xdc2040000000011b
Jul  7 13:13:03 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:13:03 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:13:03 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:13:03 data MCA: Address 0x400000275f39e00
Jul  7 13:13:03 data MCA: Misc 0xd01b0fff01000000
Jul  7 13:13:20 data MCA: Bank 17, Status 0xdc2040000000011b
Jul  7 13:13:20 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:13:20 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:13:20 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:13:20 data MCA: Address 0x40000026edd1e00
Jul  7 13:13:20 data MCA: Misc 0xd01b0fff01000000
Jul  7 13:13:20 data MCA: Bank 18, Status 0x9c2040000000011b
Jul  7 13:13:20 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:13:20 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:13:20 data MCA: CPU 0 COR GCACHE LG RD error
Jul  7 13:13:20 data MCA: Address 0x4000002c7adc4c0
Jul  7 13:13:20 data MCA: Misc 0xd01a0ffc01000000
Jul  7 13:14:17 data MCA: Bank 17, Status 0xdc2040000000011b
Jul  7 13:14:17 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:14:17 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:14:17 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:14:17 data MCA: Address 0x4000002b303e9c0
Jul  7 13:14:17 data MCA: Misc 0xd01a0fac01000000
Jul  7 13:14:17 data MCA: Bank 18, Status 0xdc2040000000011b
Jul  7 13:14:17 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:14:17 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:14:17 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:14:17 data MCA: Address 0x4000002c2c024c0
Jul  7 13:14:17 data MCA: Misc 0xd01b0fff01000000
Jul  7 13:14:44 data MCA: Bank 17, Status 0xdc2040000000011b
Jul  7 13:14:44 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:14:44 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:14:44 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:14:44 data MCA: Address 0x400000293281500
Jul  7 13:14:44 data MCA: Misc 0xd01b0fff01000000
Jul  7 13:14:44 data MCA: Bank 18, Status 0xdc2040000000011b
Jul  7 13:14:44 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:14:44 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:14:44 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:14:44 data MCA: Address 0x400000327290240
Jul  7 13:14:44 data MCA: Misc 0xd01a0ffb01000000
Jul  7 13:14:56 data MCA: Bank 17, Status 0xdc2040000000011b
Jul  7 13:14:56 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:14:56 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:14:56 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:14:56 data MCA: Address 0x40000023a430300
Jul  7 13:14:56 data MCA: Misc 0xd01a0f3a01000000
Jul  7 13:14:56 data MCA: Bank 18, Status 0xdc2040000000011b
Jul  7 13:14:56 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:14:56 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:14:56 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:14:56 data MCA: Address 0x4000002ab5afb00
Jul  7 13:14:56 data MCA: Misc 0xd01b0fff01000000
Jul  7 13:15:08 data MCA: Bank 17, Status 0xdc2040000000011b
Jul  7 13:15:08 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:15:08 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:15:08 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:15:08 data MCA: Address 0x400000217793440
Jul  7 13:15:08 data MCA: Misc 0xd01a0f9301000000
Jul  7 13:15:08 data MCA: Bank 18, Status 0xdc2040000000011b
Jul  7 13:15:08 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:15:08 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:15:08 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:15:08 data MCA: Address 0x40000029ef7f880
Jul  7 13:15:08 data MCA: Misc 0xd01b0fff01000000
Jul  7 13:15:23 data MCA: Bank 17, Status 0xdc2040000000011b
Jul  7 13:15:23 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:15:23 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:15:23 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:15:23 data MCA: Address 0x4000002186c7440
Jul  7 13:15:23 data MCA: Misc 0xd01b0fff01000000
Jul  7 13:15:23 data MCA: Bank 18, Status 0xdc2040000000011b
Jul  7 13:15:23 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:15:23 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:15:23 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:15:23 data MCA: Address 0x400000327290240
Jul  7 13:15:23 data MCA: Misc 0xd01a0ff701000000
Jul  7 13:15:38 data MCA: Bank 17, Status 0xdc2040000000011b
Jul  7 13:15:38 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:15:38 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:15:38 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:15:38 data MCA: Address 0x400000264398f40
Jul  7 13:15:38 data MCA: Misc 0xd01a0e2301000000
Jul  7 13:15:38 data MCA: Bank 18, Status 0xdc2040000000011b
Jul  7 13:15:38 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:15:38 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:15:38 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:15:38 data MCA: Address 0x40000029aba7880
Jul  7 13:15:38 data MCA: Misc 0xd01b0fff01000000
Jul  7 13:16:05 data MCA: Bank 17, Status 0xdc2040000000011b
Jul  7 13:16:05 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:16:05 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:16:05 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:16:05 data MCA: Address 0x400000218d41440
Jul  7 13:16:05 data MCA: Misc 0xd01b0fff01000000
Jul  7 13:16:05 data MCA: Bank 18, Status 0xdc2040000000011b
Jul  7 13:16:05 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:16:05 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:16:05 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:16:05 data MCA: Address 0x4000002c11e64c0
Jul  7 13:16:05 data MCA: Misc 0xd01a0fed01000000
Jul  7 13:24:40 data MCA: Bank 17, Status 0xdc2040000000011b
Jul  7 13:24:40 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:24:40 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:24:40 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:24:40 data MCA: Address 0x4000002beace500
Jul  7 13:24:40 data MCA: Misc 0xd01a085001000000
Jul  7 13:24:40 data MCA: Bank 18, Status 0xdc2040000000011b
Jul  7 13:24:40 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:24:40 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:24:40 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:24:40 data MCA: Address 0x4000002c43a04c0
Jul  7 13:24:40 data MCA: Misc 0xd01b0fff01000000
Jul  7 13:36:00 data MCA: Bank 17, Status 0xdc2040000000011b
Jul  7 13:36:00 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:36:00 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:36:00 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:36:00 data MCA: Address 0x4000003242494e0
Jul  7 13:36:00 data MCA: Misc 0xd01a08ab01000000
Jul  7 13:36:00 data MCA: Bank 18, Status 0xdc2040000000011b
Jul  7 13:36:00 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:36:00 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:36:00 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:36:00 data MCA: Address 0x4000002cd0a6200
Jul  7 13:36:00 data MCA: Misc 0xd01b0fff01000000
Jul  7 13:38:35 data MCA: Bank 17, Status 0xdc2040000000011b
Jul  7 13:38:35 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:38:35 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:38:35 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:38:35 data MCA: Address 0x4000002b50ea9c0
Jul  7 13:38:35 data MCA: Misc 0xd01a0e2301000000
Jul  7 13:38:35 data MCA: Bank 18, Status 0xdc2040000000011b
Jul  7 13:38:35 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:38:35 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:38:35 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:38:35 data MCA: Address 0x40000032517d6c0
Jul  7 13:38:35 data MCA: Misc 0xd01b0fff01000000
Jul  7 13:39:18 data MCA: Bank 17, Status 0xdc2040000000011b
Jul  7 13:39:18 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:39:18 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:39:18 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:39:18 data MCA: Address 0x4000002c4bfd800
Jul  7 13:39:18 data MCA: Misc 0xd01a0c5201000000
Jul  7 13:39:18 data MCA: Bank 18, Status 0xdc2040000000011b
Jul  7 13:39:18 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:39:18 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:39:18 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:39:18 data MCA: Address 0x4000002c16784c0
Jul  7 13:39:18 data MCA: Misc 0xd01b0fff01000000
Jul  7 13:40:35 data MCA: Bank 17, Status 0xdc2040000000011b
Jul  7 13:40:35 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:40:35 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:40:35 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:40:35 data MCA: Address 0x400000292f99500
Jul  7 13:40:35 data MCA: Misc 0xd01b0fff01000000
Jul  7 13:40:35 data MCA: Bank 18, Status 0xdc2040000000011b
Jul  7 13:40:35 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:40:35 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:40:35 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:40:35 data MCA: Address 0x4000002c6f2c4c0
Jul  7 13:40:35 data MCA: Misc 0xd01a0fe401000000
Jul  7 13:41:45 data MCA: Bank 17, Status 0xdc2040000000011b
Jul  7 13:41:45 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:41:45 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:41:45 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:41:45 data MCA: Address 0x40000027d3c2cc0
Jul  7 13:41:45 data MCA: Misc 0xd01b0fff01000000
Jul  7 13:41:45 data MCA: Bank 18, Status 0xdc2040000000011b
Jul  7 13:41:45 data MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Jul  7 13:41:45 data MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Jul  7 13:41:45 data MCA: CPU 0 COR OVER GCACHE LG RD error
Jul  7 13:41:45 data MCA: Address 0x4000002c535a4c0
Jul  7 13:41:45 data MCA: Misc 0xd01a0fd801000000

I tried to decode them using mcelog, but it seems the CPU currently isn’t supported. I am in contact with the maintainer of the mcelog code to get this fixed…

> mcelog: Unknown CPU type vendor 2 family 23 model 1
> Hardware event. This is not a software error.
> MCE 0
> CPU 0 BANK 17 TSC 5be1c86174 
> MISC d01b0fff01000000 ADDR 4000003085c6b40 
> TIME 1594121663 Tue Jul  7 13:34:23 2020
> STATUS 9c2040000000011b MCGSTATUS 0
> MCGCAP 11c APICID 0 SOCKETID 0 
> CPUID Vendor AMD Family 23 Model 1 Step 0
> mcelog: Unknown CPU type vendor 2 family 23 model 1
> Hardware event. This is not a software error.
> MCE 1
> CPU 0 BANK 17 TSC 5bf3eabd20 
> MISC d01b0fff01000000 ADDR 40000031f3bab00 
> TIME 1594121663 Tue Jul  7 13:34:23 2020
> STATUS dc2040000000011b MCGSTATUS 0
> MCGCAP 11c APICID 0 SOCKETID 0 
> CPUID Vendor AMD Family 23 Model 1 Step 0
> mcelog: Unknown CPU type vendor 2 family 23 model 1
> Hardware event. This is not a software error.
> MCE 2
> CPU 0 BANK 17 TSC 5c1e4b68ec 
> MISC d01b0fff01000000 ADDR 400000305340b40 
> TIME 1594121663 Tue Jul  7 13:34:23 2020
> STATUS dc2040000000011b MCGSTATUS 0
> MCGCAP 11c APICID 0 SOCKETID 0 
> CPUID Vendor AMD Family 23 Model 1 Step 0
> mcelog: Unknown CPU type vendor 2 family 23 model 1
> Hardware event. This is not a software error.
> MCE 3
> CPU 0 BANK 17 TSC 5d1e06bb8c 
> MISC d01a0ffd01000000 ADDR 40000032212fe80 
> TIME 1594121663 Tue Jul  7 13:34:23 2020
> STATUS dc2040000000011b MCGSTATUS 0
> MCGCAP 11c APICID 0 SOCKETID 0 
> CPUID Vendor AMD Family 23 Model 1 Step 0
> mcelog: Unknown CPU type vendor 2 family 23 model 1
> Hardware event. This is not a software error.
> MCE 4
> CPU 0 BANK 18 TSC 5d1e06fb40 
> MISC d01b0fff01000000 ADDR 400000321c6e8c0 
> TIME 1594121663 Tue Jul  7 13:34:23 2020
> STATUS 9c2040000000011b MCGSTATUS 0
> MCGCAP 11c APICID 0 SOCKETID 0 
> CPUID Vendor AMD Family 23 Model 1 Step 0
> mcelog: Unknown CPU type vendor 2 family 23 model 1
> Hardware event. This is not a software error.
> MCE 5
> CPU 0 BANK 17 TSC 5db84458c8 
> MISC d01a0ffa01000000 ADDR 400000321cb62c0 
> TIME 1594121663 Tue Jul  7 13:34:23 2020
> STATUS dc2040000000011b MCGSTATUS 0
> MCGCAP 11c APICID 0 SOCKETID 0 
> CPUID Vendor AMD Family 23 Model 1 Step 0
> mcelog: Unknown CPU type vendor 2 family 23 model 1
> Hardware event. This is not a software error.
> MCE 6
> CPU 0 BANK 18 TSC 5db8448fc4 
> MISC d01b0fff01000000 ADDR 400000326096240 
> TIME 1594121663 Tue Jul  7 13:34:23 2020
> STATUS dc2040000000011b MCGSTATUS 0
> MCGCAP 11c APICID 0 SOCKETID 0 
> CPUID Vendor AMD Family 23 Model 1 Step 0
> mcelog: Unknown CPU type vendor 2 family 23 model 1
> Hardware event. This is not a software error.
> MCE 7
> CPU 0 BANK 17 TSC 6525f2fee0 
> MISC d01b0fff01000000 ADDR 400000326059a00 
> TIME 1594121663 Tue Jul  7 13:34:23 2020
> STATUS dc2040000000011b MCGSTATUS 0
> MCGCAP 11c APICID 0 SOCKETID 0 
> CPUID Vendor AMD Family 23 Model 1 Step 0
> mcelog: Unknown CPU type vendor 2 family 23 model 1
> Hardware event. This is not a software error.
> MCE 8
> CPU 0 BANK 18 TSC 6525f33840 
> MISC d01a0ffc01000000 ADDR 40000031dc09ae0 
> TIME 1594121663 Tue Jul  7 13:34:23 2020
> STATUS 9c2040000000011b MCGSTATUS 0
> MCGCAP 11c APICID 0 SOCKETID 0 
> CPUID Vendor AMD Family 23 Model 1 Step 0
> mcelog: Unknown CPU type vendor 2 family 23 model 1
> Hardware event. This is not a software error.
> MCE 9
> CPU 0 BANK 17 TSC 68c46a10b8 
> MISC d01b0fff01000000 ADDR 40000032772a880 
> TIME 1594121663 Tue Jul  7 13:34:23 2020
> STATUS dc2040000000011b MCGSTATUS 0
> MCGCAP 11c APICID 0 SOCKETID 0 
> CPUID Vendor AMD Family 23 Model 1 Step 0
> mcelog: Unknown CPU type vendor 2 family 23 model 1
> Hardware event. This is not a software error.
> MCE 10
> CPU 0 BANK 18 TSC 68c46a4670 
> MISC d01a0ffc01000000 ADDR 400000323044240 
> TIME 1594121663 Tue Jul  7 13:34:23 2020
> STATUS 9c2040000000011b MCGSTATUS 0
> MCGCAP 11c APICID 0 SOCKETID 0 
> CPUID Vendor AMD Family 23 Model 1 Step 0
> mcelog: Unknown CPU type vendor 2 family 23 model 1
> Hardware event. This is not a software error.
> MCE 11
> CPU 0 BANK 17 TSC 97f381ab20 
> MISC d01b0fff01000000 ADDR 4000003254bf4c0 
> TIME 1594121663 Tue Jul  7 13:34:23 2020
> STATUS dc2040000000011b MCGSTATUS 0
> MCGCAP 11c APICID 0 SOCKETID 0 
> CPUID Vendor AMD Family 23 Model 1 Step 0
> mcelog: Unknown CPU type vendor 2 family 23 model 1
> Hardware event. This is not a software error.
> MCE 12
> CPU 0 BANK 18 TSC 97f381e0fc 
> MISC d01a0ffd01000000 ADDR 4000003254b8240 
> TIME 1594121663 Tue Jul  7 13:34:23 2020
> STATUS dc2040000000011b MCGSTATUS 0
> MCGCAP 11c APICID 0 SOCKETID 0 
> CPUID Vendor AMD Family 23 Model 1 Step 0
> mcelog: Unknown CPU type vendor 2 family 23 model 1
> Hardware event. This is not a software error.
> MCE 13
> CPU 0 BANK 17 TSC 12f2bbd5708 
> MISC d01b0fff01000000 ADDR 4000003242494c0 
> TIME 1594121663 Tue Jul  7 13:34:23 2020
> STATUS dc2040000000011b MCGSTATUS 0
> MCGCAP 11c APICID 0 SOCKETID 0 
> CPUID Vendor AMD Family 23 Model 1 Step 0
> mcelog: Unknown CPU type vendor 2 family 23 model 1
> Hardware event. This is not a software error.
> MCE 14
> CPU 0 BANK 18 TSC 12f2bbd8dbc 
> MISC d01a0ffc01000000 ADDR 40000031dc09ac0 
> TIME 1594121663 Tue Jul  7 13:34:23 2020
> STATUS 9c2040000000011b MCGSTATUS 0
> MCGCAP 11c APICID 0 SOCKETID 0 
> CPUID Vendor AMD Family 23 Model 1 Step 0
> mcelog: Unknown CPU type vendor 2 family 23 model 1
> Hardware event. This is not a software error.
> MCE 15
> CPU 0 BANK 17 TSC 138ab97bb58 
> MISC d01b0fff01000000 ADDR 400000275f39e00 
> TIME 1594121663 Tue Jul  7 13:34:23 2020
> STATUS dc2040000000011b MCGSTATUS 0
> MCGCAP 11c APICID 0 SOCKETID 0 
> CPUID Vendor AMD Family 23 Model 1 Step 0
> mcelog: Unknown CPU type vendor 2 family 23 model 1
> Hardware event. This is not a software error.
> MCE 16
> CPU 0 BANK 17 TSC 1476695ee60 
> MISC d01b0fff01000000 ADDR 40000026edd1e00 
> TIME 1594121663 Tue Jul  7 13:34:23 2020
> STATUS dc2040000000011b MCGSTATUS 0
> MCGCAP 11c APICID 0 SOCKETID 0 
> CPUID Vendor AMD Family 23 Model 1 Step 0
> mcelog: Unknown CPU type vendor 2 family 23 model 1
> Hardware event. This is not a software error.
> MCE 17
> CPU 0 BANK 18 TSC 147669621d8 
> MISC d01a0ffc01000000 ADDR 4000002c7adc4c0 
> TIME 1594121663 Tue Jul  7 13:34:23 2020
> STATUS 9c2040000000011b MCGSTATUS 0
> MCGCAP 11c APICID 0 SOCKETID 0 
> CPUID Vendor AMD Family 23 Model 1 Step 0
> mcelog: Unknown CPU type vendor 2 family 23 model 1
> Hardware event. This is not a software error.
> MCE 18
> CPU 0 BANK 17 TSC 17684f22960 
> MISC d01a0fac01000000 ADDR 4000002b303e9c0 
> TIME 1594121663 Tue Jul  7 13:34:23 2020
> STATUS dc2040000000011b MCGSTATUS 0
> MCGCAP 11c APICID 0 SOCKETID 0 
> CPUID Vendor AMD Family 23 Model 1 Step 0
> mcelog: Unknown CPU type vendor 2 family 23 model 1
> Hardware event. This is not a software error.
> MCE 19
> CPU 0 BANK 18 TSC 17684f25bdc 
> MISC d01b0fff01000000 ADDR 4000002c2c024c0 
> TIME 1594121663 Tue Jul  7 13:34:23 2020
> STATUS dc2040000000011b MCGSTATUS 0
> MCGCAP 11c APICID 0 SOCKETID 0 
> CPUID Vendor AMD Family 23 Model 1 Step 0
> mcelog: Unknown CPU type vendor 2 family 23 model 1
> Hardware event. This is not a software error.
> MCE 20
> CPU 0 BANK 17 TSC 18d341a1658 
> MISC d01b0fff01000000 ADDR 400000293281500 
> TIME 1594121663 Tue Jul  7 13:34:23 2020
> STATUS dc2040000000011b MCGSTATUS 0
> MCGCAP 11c APICID 0 SOCKETID 0 
> CPUID Vendor AMD Family 23 Model 1 Step 0
> mcelog: Unknown CPU type vendor 2 family 23 model 1
> Hardware event. This is not a software error.
> MCE 21
> CPU 0 BANK 18 TSC 18d341a4fdc 
> MISC d01a0ffb01000000 ADDR 400000327290240 
> TIME 1594121663 Tue Jul  7 13:34:23 2020
> STATUS dc2040000000011b MCGSTATUS 0
> MCGCAP 11c APICID 0 SOCKETID 0 
> CPUID Vendor AMD Family 23 Model 1 Step 0
> mcelog: Unknown CPU type vendor 2 family 23 model 1
> Hardware event. This is not a software error.
> MCE 22
> CPU 0 BANK 17 TSC 196d9d34680 
> MISC d01a0f3a01000000 ADDR 40000023a430300 
> TIME 1594121663 Tue Jul  7 13:34:23 2020
> STATUS dc2040000000011b MCGSTATUS 0
> MCGCAP 11c APICID 0 SOCKETID 0 
> CPUID Vendor AMD Family 23 Model 1 Step 0
> mcelog: Unknown CPU type vendor 2 family 23 model 1
> Hardware event. This is not a software error.
> MCE 23
> CPU 0 BANK 18 TSC 196d9d389b8 
> MISC d01b0fff01000000 ADDR 4000002ab5afb00 
> TIME 1594121663 Tue Jul  7 13:34:23 2020
> STATUS dc2040000000011b MCGSTATUS 0
> MCGCAP 11c APICID 0 SOCKETID 0 
> CPUID Vendor AMD Family 23 Model 1 Step 0
> mcelog: Unknown CPU type vendor 2 family 23 model 1
> Hardware event. This is not a software error.
> MCE 24
> CPU 0 BANK 17 TSC 1a11a4a49c4 
> MISC d01a0f9301000000 ADDR 400000217793440 
> TIME 1594121663 Tue Jul  7 13:34:23 2020
> STATUS dc2040000000011b MCGSTATUS 0
> MCGCAP 11c APICID 0 SOCKETID 0 
> CPUID Vendor AMD Family 23 Model 1 Step 0
> mcelog: Unknown CPU type vendor 2 family 23 model 1
> Hardware event. This is not a software error.
> MCE 25
> CPU 0 BANK 18 TSC 1a11a4a8348 
> MISC d01b0fff01000000 ADDR 40000029ef7f880 
> TIME 1594121663 Tue Jul  7 13:34:23 2020
> STATUS dc2040000000011b MCGSTATUS 0
> MCGCAP 11c APICID 0 SOCKETID 0 
> CPUID Vendor AMD Family 23 Model 1 Step 0
> mcelog: Unknown CPU type vendor 2 family 23 model 1
> Hardware event. This is not a software error.
> MCE 26
> CPU 0 BANK 17 TSC 1ae230113e8 
> MISC d01b0fff01000000 ADDR 4000002186c7440 
> TIME 1594121663 Tue Jul  7 13:34:23 2020
> STATUS dc2040000000011b MCGSTATUS 0
> MCGCAP 11c APICID 0 SOCKETID 0 
> CPUID Vendor AMD Family 23 Model 1 Step 0
> mcelog: Unknown CPU type vendor 2 family 23 model 1
> Hardware event. This is not a software error.
> MCE 27
> CPU 0 BANK 18 TSC 1ae23014dd8 
> MISC d01a0ff701000000 ADDR 400000327290240 
> TIME 1594121663 Tue Jul  7 13:34:23 2020
> STATUS dc2040000000011b MCGSTATUS 0
> MCGCAP 11c APICID 0 SOCKETID 0 
> CPUID Vendor AMD Family 23 Model 1 Step 0
> mcelog: Unknown CPU type vendor 2 family 23 model 1
> Hardware event. This is not a software error.
> MCE 28
> CPU 0 BANK 17 TSC 1b9faf273e0 
> MISC d01a0e2301000000 ADDR 400000264398f40 
> TIME 1594121663 Tue Jul  7 13:34:23 2020
> STATUS dc2040000000011b MCGSTATUS 0
> MCGCAP 11c APICID 0 SOCKETID 0 
> CPUID Vendor AMD Family 23 Model 1 Step 0
> mcelog: Unknown CPU type vendor 2 family 23 model 1
> Hardware event. This is not a software error.
> MCE 29
> CPU 0 BANK 18 TSC 1b9faf2b034 
> MISC d01b0fff01000000 ADDR 40000029aba7880 
> TIME 1594121663 Tue Jul  7 13:34:23 2020
> STATUS dc2040000000011b MCGSTATUS 0
> MCGCAP 11c APICID 0 SOCKETID 0 
> CPUID Vendor AMD Family 23 Model 1 Step 0
> mcelog: Unknown CPU type vendor 2 family 23 model 1
> Hardware event. This is not a software error.
> MCE 30
> CPU 0 BANK 17 TSC 1d0fa35ac30 
> MISC d01b0fff01000000 ADDR 400000218d41440 
> TIME 1594121663 Tue Jul  7 13:34:23 2020
> STATUS dc2040000000011b MCGSTATUS 0
> MCGCAP 11c APICID 0 SOCKETID 0 
> CPUID Vendor AMD Family 23 Model 1 Step 0
> mcelog: Unknown CPU type vendor 2 family 23 model 1
> Hardware event. This is not a software error.
> MCE 31
> CPU 0 BANK 18 TSC 1d0fa35e83c 
> MISC d01a0fed01000000 ADDR 4000002c11e64c0 
> TIME 1594121663 Tue Jul  7 13:34:23 2020
> STATUS dc2040000000011b MCGSTATUS 0
> MCGCAP 11c APICID 0 SOCKETID 0 
> CPUID Vendor AMD Family 23 Model 1 Step 0
> mcelog: Unknown CPU type vendor 2 family 23 model 1
> Hardware event. This is not a software error.
> MCE 32
> CPU 0 BANK 17 TSC 3802329c0a8 
> MISC d01a085001000000 ADDR 4000002beace500 
> TIME 1594121663 Tue Jul  7 13:34:23 2020
> STATUS dc2040000000011b MCGSTATUS 0
> MCGCAP 11c APICID 0 SOCKETID 0 
> CPUID Vendor AMD Family 23 Model 1 Step 0
> mcelog: Unknown CPU type vendor 2 family 23 model 1
> Hardware event. This is not a software error.
> MCE 33
> CPU 0 BANK 18 TSC 3802329f588 
> MISC d01b0fff01000000 ADDR 4000002c43a04c0 
> TIME 1594121663 Tue Jul  7 13:34:23 2020
> STATUS dc2040000000011b MCGSTATUS 0
> MCGCAP 11c APICID 0 SOCKETID 0 
> CPUID Vendor AMD Family 23 Model 1 Step 0
  • I don’t think I have been able to trigger uncorrected errors on TrueNAS 12.0 beta 1 yet. The corrected errors have the same status code on FreeBSD as on Linux, but I couldn’t find any error on FreeBSD with the status code that the uncorrected errors have on Linux. I suspect that uncorrected errors do occur when the system crashes after a while, but I couldn’t find anything about this in the log files. I’m not sure whether this is a bug. Please let me know if I need to submit this…
  • TrueNAS 12.0 beta 1 doesn’t seem to send any email when MCA errors occur. If I remember correctly, this is a “known bug” (please correct me if I’m wrong). This seems very troublesome, and I hope it gets fixed soon!
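
As a side note on the status codes: since mcelog can't decode them here, the status words from the logs above can be picked apart by hand. The following is only a minimal sketch, assuming the standard x86 layout of the MCi_STATUS register (flag bits 57 to 63, and the "memory hierarchy" error code in the low 16 bits). The function and table names are made up for illustration, and the vendor-specific bits in between are ignored.

```python
# Architectural flag bits in the upper part of an x86 MCi_STATUS register.
STATUS_FLAGS = {
    63: "VAL",    # the register holds a valid error record
    62: "OVER",   # overflow: an earlier error record was overwritten
    61: "UC",     # the error was not corrected by hardware
    60: "EN",     # reporting of this error was enabled
    59: "MISCV",  # the MISC register holds valid data
    58: "ADDRV",  # the ADDR register holds a valid address
    57: "PCC",    # processor context may be corrupt
}

# Low 16 bits, "memory hierarchy" layout: 0000 0001 RRRR TTLL.
REQUEST = {0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
           5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}
CACHE_TYPE = {0: "ICACHE", 1: "DCACHE", 2: "GCACHE"}
CACHE_LEVEL = {0: "L0", 1: "L1", 2: "L2", 3: "LG"}


def decode_status(status: int) -> dict:
    """Split an MCi_STATUS value into flag names and a short summary."""
    flags = [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]
    words = ["UNCOR" if "UC" in flags else "COR"]
    if "OVER" in flags:
        words.append("OVER")
    code = status & 0xFFFF
    if (code & 0xFF00) == 0x0100:
        # Memory hierarchy error: cache type, cache level, request type.
        words += [
            CACHE_TYPE.get((code >> 2) & 0x3, "?"),
            CACHE_LEVEL[code & 0x3],
            REQUEST.get((code >> 4) & 0xF, "?"),
        ]
    else:
        # Other error code layouts are not handled in this sketch.
        words.append(f"code=0x{code:04x}")
    return {"flags": flags, "summary": " ".join(words)}


# The two status values that appear in the logs above.
for value in (0xDC2040000000011B, 0x9C2040000000011B):
    print(hex(value), decode_status(value))
```

Under those assumptions, the only difference between the two values is the OVER bit, which lines up with the "COR OVER" versus "COR" wording in the FreeBSD log lines. Neither value has the UC bit set, so both are corrected errors.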

I’m going to build a NAS and finally got an ASRock Rack X470D4U with 32 GB ECC memory sticks. Should I go with a Ryzen 2700 or a 3600?

I’m going with FreeNAS.

I will be using 10 TB IronWolf Pro HDDs. Also, what SAS breakout card would y’all recommend? I forget the term: not RAID cards, but the cards recommended for ZFS.

This is my first time building a NAS and using Linux, but I’m tech savvy. I’m finally building a NAS now that my computer’s OS drive is failing. Don’t worry, I have already backed up what is on that drive.

I would probably go with the 3600 due to its better power / perf, and even that is probably overkill.

What case is it going in, and how many drives do you need?

FreeNAS was, and still is, BSD-based (they are moving to Linux at some point, if I am not mistaken). Just an FYI: you can run ZFS on Linux builds too, if you want to go that route.

Thanks for a good post again - it will be a valuable resource to point to when it comes to ECC handling on this board.

News about multi-bit error reporting is very welcome.
While reading your logs I see that they do not cause a reboot/panic.
Is this the default behavior or something you changed/configured?

If you read /var/log/messages in more detail, you can see that Linux is able to differentiate between different kinds of “Uncorrectable errors”:

e.g.:
“Uncorrected hardware memory error in user-access at 621211640”

Memory errors in “user” parts of the memory don’t cause a kernel panic.
Memory errors in “kernel” parts of the memory will cause a kernel panic.
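
To get a quick overview of which kind of uncorrected error was logged, the messages can be counted per context. This is just a rough sketch: the pattern is built from the single example line quoted above, so other kernel versions may word the message differently, and the function name is made up for illustration.

```python
import re
from collections import Counter

# Pattern derived from the one example message quoted in this post.
UNCORRECTED = re.compile(
    r"Uncorrected hardware memory error in (?P<context>\S+) at (?P<address>\S+)"
)


def count_uncorrected(lines):
    """Count uncorrected memory errors per context (e.g. 'user-access')."""
    counts = Counter()
    for line in lines:
        match = UNCORRECTED.search(line)
        if match:
            counts[match.group("context")] += 1
    return counts


# Illustrative input: the message from this post plus an unrelated line.
sample = [
    "Uncorrected hardware memory error in user-access at 621211640",
    "some unrelated log line",
]
print(count_uncorrected(sample))
```

On a real system you would pass it the lines of /var/log/messages instead of the sample list.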

At first sight, FreeBSD does not make this kind of differentiation (or is unable to do so with my AMD CPU) and will panic if it occurs.

I found a bit more detail on how FreeBSD handles it here:
http://freebsd.1045724.x6.nabble.com/Interpreting-MCA-error-output-td4859347.html

I also just received a response from the person maintaining the mcelog code in FreeBSD:

Hello Mastakilla,

After some debugging, it turns out my initial analysis was incorrect. The K8 CPU line was the last AMD CPU line supported by mcelog. AMD needs to supply patches to mcelog in order to gain this support.

Mcelog is an Intel-driven project; the devs that contribute are Intel employees, so it is possible AMD doesn’t want to be associated with the project, but I’m not really positive. There is an upstream bug report[1] that may make things a bit clearer.

[1] https://github.com/andikleen/mcelog/issues/62

Hope this helps,

Richard Gallamore

I am more concerned about whether the CPU works with ECC memory without much trouble. From what I am reading in this thread, Zen 2 doesn’t like ECC as much as the 1.5 gen does.
Am I right?

Except for the IPMI not reporting ECC memory errors, ECC seems to work fully on my Ryzen 3600 CPU with this motherboard. So I’m not sure what you mean by “Zen 2 not liking ECC”?

Hi all! First off, I’ve found this thread very useful, so thank you!
I’ve been using the X470D4U for a couple of months now for an Unraid setup and it’s been working fine.
A few days ago I added an Intel I350-T4 NIC and set up a pfSense VM, which is also working just fine.
My problem is that since I added the NIC, the IPMI doesn’t appear on my network at all. The lights on the switch port light up as if it is working, but it doesn’t appear anywhere on my network. The IPMI plugin in Unraid seems to be able to communicate with it just fine.
Any ideas what is wrong and how I can fix it?
Thanks!

No idea why that would happen, but I guess the bonding is misbehaving again. One of the IPMI versions already tried to fix it (check the release notes for IPMI/BMC version 01.60.00: https://www.asrockrack.com/general/productdetail.asp?Model=X470D4U#Download)

First I would check in the BIOS > Server Mgmt > ‘network (or something like that)’ and verify that the IP is being set properly. If you are using a static IP, I would check that it isn’t duplicated and that the subnet is set correctly.

Other than that, check whether you are able to access the IPMI through any of the 3 onboard NICs. (I assume you are currently using the dedicated port above the USB ports.)

What BIOS and IPMI versions are you running?

I’m on 1.90 for the BMC.
Getting to the BIOS without the IPMI is going to be tricky, but I’ll try it if I manage to get a GPU from another machine.
How do I access the IPMI from the other onboard NICs? I usually just go to the IP address that is assigned to the dedicated NIC.

That’s exactly what I meant: just connect the cable to the other ports and check whether you can access the IPMI normally through the IP you set previously. Maybe it is now visible on a non-dedicated port. (You can try disconnecting the dedicated port at the same time.)

If getting to the BIOS is difficult, then try using ipmitool from the host OS running on the board.
For example:
ipmitool -U admin -P admin lan print
It should have access through /dev/ipmi0.

Edit:
Advice for the future: do yourself a favor and set up console redirection in the BIOS settings. You will be able to access the BIOS without a display or a web KVM.

It will use serial-over-LAN (accessible through SSH or ipmitool).
You can see what it will look like in one of my previous posts:

Thanks a lot for helping!

I’m trying this, but I’m getting the following:
Cipher Suite Priv Max : Not Available
Bad Password Threshold : Not Available

I assume -U is username and -P is password. I tried both the values I’ve set and admin/admin, but I get the same with both. I haven’t used ipmitool before and I’m a bit lost I’m afraid

Edit:
I managed to get the IPMI to show up again on my network by turning it off (turning off the PSU), removing the Intel NIC, turning on the PSU and then plugging the NIC in again after the IPMI had booted up. I bet if the board powers off and the IPMI shuts down again it will be lost, but I can’t be arsed to try it now.
Annoyingly, I can’t get into the BIOS via the IPMI. For some reason, when my pfSense VM is down, I can’t reach the IPMI, even though it’s on the same switch and subnet as the computer I’m trying to access it from.
I’ll need to find a better way to manage this or move pfSense to a physical box.
Thanks again for the help!

There was an issue with IPMI failing to respond until power cycling:

AFAIK it was fixed in one of the BMC versions and since you are running the latest 1.90 I guess this should no longer happen.

@nx2l @Waishon
Can you confirm if this was fixed?



That is correct: -U is the username and -P is the password, and using the credentials you have set should work. You can also omit -P; it should then prompt you for the password separately.

Also:

Those are part of the lan print output. So I guess it worked; it just seems like the LAN could be disabled?

For me the lan print shows something like this right now:

Set in Progress : Set Complete
Auth Type Support : MD5
Auth Type Enable : Callback : MD5
: User : MD5
: Operator : MD5
: Admin : MD5
: OEM : MD5
IP Address Source : DHCP Address
IP Address : 192.168.1.168
Subnet Mask : 255.255.255.0
MAC Address : d0:50:99:e3:44:d9
SNMP Community String : AMI
IP Header : TTL=0x40 Flags=0x40 Precedence=0x00 TOS=0x10
BMC ARP Control : ARP Responses Enabled, Gratuitous ARP Disabled
Gratituous ARP Intrvl : 0.0 seconds
Default Gateway IP : 192.168.1.1
Default Gateway MAC : 00:26:55:e3:05:c1
Backup Gateway IP : 0.0.0.0
Backup Gateway MAC : 00:00:00:00:00:00
802.1q VLAN ID : Disabled
802.1q VLAN Priority : 0
RMCP+ Cipher Suites : 0,1,2,3,6,7,8,11,12,15,16,17
Cipher Suite Priv Max : caaaaaaaaaaaXXX
: X=Cipher Suite Unused
: c=CALLBACK
: u=USER
: o=OPERATOR
: a=ADMIN
: O=OEM
Bad Password Threshold : 0
Invalid password disable: no
Attempt Count Reset Int.: 0
User Lockout Interval : 0
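
If you want to compare a working and a non-working BMC, it can help to turn this output into something searchable. Below is a minimal sketch that assumes the "Name : Value" layout shown above, where a line starting with ":" continues the previous field. The function name and the shortened sample are made up for illustration.

```python
def parse_lan_print(text: str) -> dict:
    """Map each field name to the list of values printed for it."""
    fields = {}
    current = None
    for raw in text.splitlines():
        if ":" not in raw:
            continue
        # Split on the first colon only, since values may contain colons.
        key, _, value = raw.partition(":")
        key, value = key.strip(), value.strip()
        if key:
            current = key
            fields[current] = [value]
        elif current is not None:
            # A line starting with ':' continues the previous field.
            fields[current].append(value)
    return fields


# A few lines copied from the output above.
sample = """\
Auth Type Enable : Callback : MD5
: User : MD5
IP Address Source : DHCP Address
IP Address : 192.168.1.168
MAC Address : d0:50:99:e3:44:d9
Bad Password Threshold : 0
"""

fields = parse_lan_print(sample)
print(fields["IP Address"], fields["MAC Address"])
```

Fields that come back as "Not Available" on the problem board can then be spotted with a simple loop over the dictionary.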

You could try checking if you can at least list available commands.
If the login worked then this should do it:
ipmitool -U admin -P admin help

Not sure if this is the same as what I’m getting. Power cycling alone doesn’t fix it; I need to remove the NIC and boot the BMC without it.

Seems so; it would explain why it wasn’t showing up on my network. Pretty weird that adding a PCIe card disables the BMC port.