Hi everyone,
It’s been a while, but I have another update…
I’ll start with a summary and then post all the “evidence”. Everything here concerns the platform ASRock Rack X470D4U2-2T + AMD Ryzen 3x00 (Zen 2 cores) + ECC memory (see my signature for more specifics), using the latest stable BIOS. I am using the “overclock the memory until it is barely stable” method, as described in earlier posts.
- Memory error injection on Linux using mce-inject, as described some posts earlier, does not inject memory errors at the platform level, only at the OS level. It is therefore not suitable for testing whether the IPMI / BMC properly handles memory error detection. We discovered this because the “Platform First Error Handling” toggle in the BIOS has no effect on this method.
- ECC correction works!
  - Already confirmed / proven earlier in this thread.
  - When using default BIOS settings.
- (Corrected) single-bit ECC memory error detection by “the OS” works (if correctly implemented)!
  - Already confirmed / proven earlier in this thread.
  - But only when “Platform First Error Handling” is set to disabled in the BIOS.
  - Works on, for example:
    - Memtest86 v8.4 or higher
    - Linux kernel 5.6 or higher
    - TrueNAS 12.0 beta 1 (not on FreeNAS 11.3)
- (Uncorrected) multi-bit ECC memory error detection by “the OS” works (if correctly implemented)!
  - This is a new discovery.
  - But only when “Platform First Error Handling” is set to disabled in the BIOS.
  - Works on, for example:
    - Memtest86 (unreleased version; the fixes will be included in the next release)
    - Linux kernel 5.7 (probably also on 5.6, but I didn’t try it)
  - Not sure about TrueNAS 12.0 beta 1. I haven’t been able to trigger or recognize it yet.
- IPMI / BMC is unable to detect any kind of memory error.
  - Confirmed once more.
  - Even when “Platform First Error Handling” is set to enabled in the BIOS.
  - ASRock Rack is (hopefully) still working on getting this fixed?
(Uncorrected) multi-bit ECC memory error detection by "the OS"
Memtest86
After I notified PassMark that Linux is able to detect (uncorrected) multi-bit ECC memory errors while Memtest86 v8.4 isn’t, they asked me to send the log files. They then provided me with a new version (so far still unreleased) that fixes the issue and properly detects (uncorrected) multi-bit ECC memory errors!
Sorry, forgot to take a screenshot of this one. I do still have the log file. Here is the summary of the report:
| Test Start Time | 2020-05-25 08:19:11 |
| Elapsed Time | 1:47:46 |
| Memory Range Tested | 0x0 - 80F380000 (33011MB) |
| CPU Selection Mode | Parallel (All CPUs) |
| ECC Polling | Enabled |
| # Tests Passed | 7/19 (36%) |
| Lowest Error Address | 0x489128C48 (18577MB) |
| Highest Error Address | 0x73D8367A0 (29656MB) |
| Bits in Error Mask | 00000000FDDFFFFF |
| Bits in Error | 30 |
| Max Contiguous Errors | 2 |
| ECC Correctable Errors | 2689 |
| ECC Uncorrectable Errors | 244 |
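As a quick sanity check on the report, the “Bits in Error” count and the MB figures can be recomputed from the raw hex values. The sketch below is not part of Memtest86; the helper names are made up for illustration, and it assumes the MB figures are the byte addresses divided by 2^20 and rounded down.

```python
# Values copied from the Memtest86 summary above.
ERROR_MASK = 0x00000000FDDFFFFF
RANGE_END = 0x80F380000
LOWEST_ERROR = 0x489128C48
HIGHEST_ERROR = 0x73D8367A0


def bits_set(mask: int) -> int:
    """Count how many bits are set to 1 in the error mask."""
    return bin(mask).count("1")


def to_mb(address: int) -> int:
    """Convert a byte address to whole megabytes (2**20 bytes), rounding down."""
    return address >> 20


print("bits in error:", bits_set(ERROR_MASK))
print("range end (MB):", to_mb(RANGE_END))
print("lowest error (MB):", to_mb(LOWEST_ERROR))
print("highest error (MB):", to_mb(HIGHEST_ERROR))
```

Under those assumptions the numbers line up with the report: the mask has 30 bits set, and the three addresses come out at 33011, 18577 and 29656 MB.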
Linux
I may have triggered these earlier already, but I didn’t notice them until recently.
[root@localhost ~]# edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: mc#0csrow#2channel#0: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#1: 3 Corrected Errors
mc0: csrow3: 1 Uncorrected Errors
mc0: csrow3: mc#0csrow#3channel#0: 3 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#1: 0 Corrected Errors
[root@localhost ~]# ras-mc-ctl --summary
Memory controller events summary:
Corrected on DIMM Label(s): 'mc#0csrow#2channel#1' location: 0:2:1:-1 errors: 3
Corrected on DIMM Label(s): 'mc#0csrow#3channel#0' location: 0:3:0:-1 errors: 3
Fatal on DIMM Label(s): 'mc#0csrow#3channel#0' location: 0:3:0:-1 errors: 1
No PCIe AER errors.
No Extlog errors.
No devlink errors.
Disk errors summary:
0:0 has 17 errors
0:2048 has 147 errors
0:2816 has 4 errors
MCE records summary:
12 Corrected error, no action required. errors
1 Deferred error, no action required. errors
2 Uncorrected, software containable error. errors
[root@localhost ~]#
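The per-controller totals that edac-util reports can also be read straight from the kernel’s EDAC sysfs files, which is handy for a small monitoring script. A minimal sketch, assuming the usual layout of /sys/devices/system/edac/mc/mcN/ce_count and ue_count; the function name is hypothetical, and the EDAC driver for the platform has to be loaded for these files to exist.

```python
from pathlib import Path

# Standard location of the EDAC memory-controller counters (assumed layout).
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root: Path = EDAC_ROOT) -> dict:
    """Return {controller: (corrected, uncorrected)} for each mc* directory."""
    counts = {}
    if not root.is_dir():
        # No EDAC driver loaded, or not a Linux system.
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            corrected = int((mc / "ce_count").read_text().strip())
            uncorrected = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            # Skip controllers whose counters are missing or unreadable.
            continue
        counts[mc.name] = (corrected, uncorrected)
    return counts


for name, (corrected, uncorrected) in read_edac_counts().items():
    print(f"{name}: {corrected} corrected, {uncorrected} uncorrected")
```

On a machine without EDAC support the function simply returns an empty dict, so the loop prints nothing.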
[root@localhost ~]# cat /var/log/messages
...
May 20 00:08:59 localhost rasdaemon[995]: rasdaemon: mce_record store: 0x55aaea8a19e8
May 20 00:08:59 localhost kernel: mce_notify_irq: 1 callbacks suppressed
May 20 00:08:59 localhost kernel: mce: [Hardware Error]: Machine check events logged
May 20 00:08:59 localhost kernel: [Hardware Error]: Corrected error, no action required.
May 20 00:08:59 localhost kernel: [Hardware Error]: CPU:0 (17:71:0) MC17_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
May 20 00:08:59 localhost kernel: [Hardware Error]: Error Addr: 0x00000003080ccb40
May 20 00:08:59 localhost kernel: [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0xf79c00000b800003
May 20 00:08:59 localhost kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
May 20 00:08:59 localhost kernel: EDAC MC0: 1 CE Cannot decode normalized address on mc#0csrow#3channel#0 (csrow:3 channel:0 page:0x0 offset:0x0 grain:64 syndrome:0x0)
May 20 00:08:59 localhost kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
May 20 00:08:59 localhost kernel: mce: [Hardware Error]: Machine check events logged
May 20 00:08:59 localhost kernel: [Hardware Error]: Corrected error, no action required.
May 20 00:08:59 localhost kernel: [Hardware Error]: CPU:0 (17:71:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
May 20 00:08:59 localhost kernel: [Hardware Error]: Error Addr: 0x00000003095cc100
May 20 00:08:59 localhost kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x510600800a800302
May 20 00:08:59 localhost kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
May 20 00:08:59 localhost kernel: EDAC MC0: 1 CE Cannot decode normalized address on mc#0csrow#2channel#1 (csrow:2 channel:1 page:0x0 offset:0x0 grain:64 syndrome:0x80)
May 20 00:08:59 localhost kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
May 20 00:08:59 localhost rasdaemon[995]: rasdaemon: register inserted at db
May 20 00:08:59 localhost rasdaemon[995]: <...>-661 [000] 0.000066: mce_record: 2020-04-01 19:34:33 +0200 Unified Memory Controller (bank=17), status= dc2040000000011b, Corrected error, no action required., mci=Error_overflow CECC, mca= DRAM ECC error.
May 20 00:08:59 localhost rasdaemon[995]: Memory Error 'mem-tx: generic read, tx: generic, level: L3/generic', memory_channel=0,csrow=3, cpu_type= AMD Family 17h Zen1, cpu= 0, socketid= 0, misc= d01a0f7c01000000, addr= 3080ccb40, synd= f79c00000b800003, ipid= 9600050f00, mcgstatus=0, mcgcap= 11c, apicid= 0
May 20 00:08:59 localhost rasdaemon[995]: rasdaemon: mc_event store: 0x55aaea8a4418
May 20 00:08:59 localhost rasdaemon[995]: rasdaemon: register inserted at db
May 20 00:08:59 localhost rasdaemon[995]: <...>-661 [000] 0.000066: mc_event: 2020-04-01 19:34:33 +0200 1 Corrected error: Cannot decode normalized address on mc#0csrow#3channel#0 (mc: 0 location: 3:0 grain: 6)
May 20 00:08:59 localhost rasdaemon[995]: rasdaemon: mce_record store: 0x55aaea8a19e8
May 20 00:08:59 localhost rasdaemon[995]: rasdaemon: register inserted at db
May 20 00:08:59 localhost rasdaemon[995]: <...>-661 [000] 0.000066: mce_record: 2020-04-01 19:34:33 +0200 Unified Memory Controller (bank=18), status= dc2040000000011b, Corrected error, no action required., mci=Error_overflow CECC, mca= DRAM ECC error.
May 20 00:08:59 localhost rasdaemon[995]: Memory Error 'mem-tx: generic read, tx: generic, level: L3/generic', memory_channel=1,csrow=2, cpu_type= AMD Family 17h Zen1, cpu= 0, socketid= 0, misc= d01a01d301000000, addr= 3095cc100, synd= 510600800a800302, ipid= 9600150f00, mcgstatus=0, mcgcap= 11c, apicid= 0
May 20 00:08:59 localhost rasdaemon[995]: rasdaemon: mc_event store: 0x55aaea8a4418
May 20 00:08:59 localhost rasdaemon[995]: rasdaemon: register inserted at db
May 20 00:08:59 localhost rasdaemon[995]: <...>-661 [000] 0.000066: mc_event: 2020-04-01 19:34:33 +0200 1 Corrected error: Cannot decode normalized address on mc#0csrow#2channel#1 (mc: 0 location: 2:1 grain: 6 syndrome: 0x00000080)
May 20 00:08:59 localhost abrt-server[1611]: Not saving repeating crash in '/boot/vmlinuz-5.6.8-300.fc32.x86_64'
May 20 00:08:59 localhost abrt-server[1614]: Not saving repeating crash in '/boot/vmlinuz-5.6.8-300.fc32.x86_64'
May 20 00:08:59 localhost abrt-server[1618]: Not saving repeating crash in '/boot/vmlinuz-5.6.8-300.fc32.x86_64'
May 20 00:08:59 localhost systemd[1]: Started dbus-:1.3-org.freedesktop.problems@2.service.
May 20 00:08:59 localhost audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=dbus-:1.3-org.freedesktop.problems@2 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
May 20 00:09:00 localhost abrt-dump-journal-oops[1036]: abrt-dump-journal-oops: Found oopses: 1
May 20 00:09:00 localhost abrt-dump-journal-oops[1036]: abrt-dump-journal-oops: Creating problem directories
May 20 00:09:00 localhost abrt-notification[1657]: System encountered a non-fatal error in ??()
May 20 00:09:01 localhost abrt-dump-journal-oops[1036]: Reported 1 kernel oopses to Abrt
May 20 00:11:12 localhost systemd[1]: dbus-:1.3-org.freedesktop.problems@2.service: Succeeded.
May 20 00:11:12 localhost audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=dbus-:1.3-org.freedesktop.problems@2 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
May 20 00:12:15 localhost kernel: mce: [Hardware Error]: Machine check events logged
May 20 00:12:15 localhost kernel: [Hardware Error]: Corrected error, no action required.
May 20 00:12:15 localhost kernel: [Hardware Error]: CPU:0 (17:71:0) MC17_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
May 20 00:12:15 localhost kernel: [Hardware Error]: Error Addr: 0x0000000301a4ef80
May 20 00:12:15 localhost kernel: [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0xf79c00000b800003
May 20 00:12:15 localhost kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
May 20 00:12:15 localhost kernel: EDAC MC0: 1 CE Cannot decode normalized address on mc#0csrow#3channel#0 (csrow:3 channel:0 page:0x0 offset:0x0 grain:64 syndrome:0x0)
May 20 00:12:15 localhost kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
May 20 00:12:15 localhost rasdaemon[995]: rasdaemon: mce_record store: 0x55aaea8a19e8
May 20 00:12:15 localhost rasdaemon[995]: rasdaemon: register inserted at db
May 20 00:12:15 localhost rasdaemon[995]: <...>-661 [000] 0.000086: mce_record: 2020-04-01 19:37:49 +0200 Unified Memory Controller (bank=17), status= dc2040000000011b, Corrected error, no action required., mci=Error_overflow CECC, mca= DRAM ECC error.
May 20 00:12:15 localhost rasdaemon[995]: Memory Error 'mem-tx: generic read, tx: generic, level: L3/generic', memory_channel=0,csrow=3, cpu_type= AMD Family 17h Zen1, cpu= 0, socketid= 0, misc= d01b0fff01000000, addr= 301a4ef80, synd= f79c00000b800003, ipid= 9600050f00, mcgstatus=0, mcgcap= 11c, apicid= 0
May 20 00:12:15 localhost rasdaemon[995]: rasdaemon: mc_event store: 0x55aaea8a4418
May 20 00:12:15 localhost rasdaemon[995]: rasdaemon: register inserted at db
May 20 00:12:15 localhost rasdaemon[995]: <...>-661 [000] 0.000086: mc_event: 2020-04-01 19:37:49 +0200 1 Corrected error: Cannot decode normalized address on mc#0csrow#3channel#0 (mc: 0 location: 3:0 grain: 6)
May 20 00:12:15 localhost abrt-server[1674]: Not saving repeating crash in '/boot/vmlinuz-5.6.8-300.fc32.x86_64'
May 20 00:12:15 localhost systemd[1]: Started dbus-:1.3-org.freedesktop.problems@3.service.
May 20 00:12:15 localhost audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=dbus-:1.3-org.freedesktop.problems@3 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
May 20 00:12:17 localhost abrt-dump-journal-oops[1036]: abrt-dump-journal-oops: Found oopses: 1
May 20 00:12:17 localhost abrt-dump-journal-oops[1036]: abrt-dump-journal-oops: Creating problem directories
May 20 00:12:17 localhost abrt-notification[1710]: System encountered a non-fatal error in ??()
May 20 00:12:18 localhost abrt-dump-journal-oops[1036]: Reported 1 kernel oopses to Abrt
May 20 00:12:59 localhost systemd[1]: Starting Cleanup of Temporary Directories...
May 20 00:12:59 localhost systemd-tmpfiles[1712]: /usr/lib/tmpfiles.d/BackupPC.conf:1: Line references path below legacy directory /var/run/, updating /var/run/BackupPC → /run/BackupPC; please update the tmpfiles.d/ drop-in file accordingly.
May 20 00:12:59 localhost systemd-tmpfiles[1712]: /etc/tmpfiles.d/tpm2-tss-fapi.conf:3: Line references path below legacy directory /var/run/, updating /var/run/tpm2-tss/eventlog → /run/tpm2-tss/eventlog; please update the tmpfiles.d/ drop-in file accordingly.
May 20 00:12:59 localhost systemd[1]: systemd-tmpfiles-clean.service: Succeeded.
May 20 00:12:59 localhost systemd[1]: Finished Cleanup of Temporary Directories.
May 20 00:12:59 localhost audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-tmpfiles-clean comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
May 20 00:12:59 localhost audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-tmpfiles-clean comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
May 20 00:14:26 localhost rasdaemon[995]: rasdaemon: mce_record store: 0x55aaea8a19e8
May 20 00:14:26 localhost kernel: mce: [Hardware Error]: Machine check events logged
May 20 00:14:26 localhost kernel: [Hardware Error]: Corrected error, no action required.
May 20 00:14:26 localhost kernel: [Hardware Error]: CPU:0 (17:71:0) MC17_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
May 20 00:14:26 localhost kernel: [Hardware Error]: Error Addr: 0x0000000395164300
May 20 00:14:26 localhost kernel: [Hardware Error]: IPID: 0x0000009600050f00, Syndrome: 0xf79c00000b800003
May 20 00:14:26 localhost kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
May 20 00:14:26 localhost kernel: EDAC MC0: 1 CE Cannot decode normalized address on mc#0csrow#3channel#0 (csrow:3 channel:0 page:0x0 offset:0x0 grain:64 syndrome:0x0)
May 20 00:14:26 localhost kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
May 20 00:14:26 localhost kernel: mce: [Hardware Error]: Machine check events logged
May 20 00:14:26 localhost kernel: [Hardware Error]: Corrected error, no action required.
May 20 00:14:26 localhost kernel: [Hardware Error]: CPU:0 (17:71:0) MC18_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
May 20 00:14:26 localhost kernel: [Hardware Error]: Error Addr: 0x000000030088c100
May 20 00:14:26 localhost kernel: [Hardware Error]: IPID: 0x0000009600150f00, Syndrome: 0x510600800a800302
May 20 00:14:26 localhost kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
May 20 00:14:26 localhost kernel: EDAC MC0: 1 CE Cannot decode normalized address on mc#0csrow#2channel#1 (csrow:2 channel:1 page:0x0 offset:0x0 grain:64 syndrome:0x80)
May 20 00:14:26 localhost kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
May 20 00:14:26 localhost rasdaemon[995]: rasdaemon: register inserted at db
May 20 00:14:26 localhost rasdaemon[995]: <...>-661 [000] 0.000099: mce_record: 2020-04-01 19:40:01 +0200 Unified Memory Controller (bank=17), status= dc2040000000011b, Corrected error, no action required., mci=Error_overflow CECC, mca= DRAM ECC error.
May 20 00:14:26 localhost rasdaemon[995]: Memory Error 'mem-tx: generic read, tx: generic, level: L3/generic', memory_channel=0,csrow=3, cpu_type= AMD Family 17h Zen1, cpu= 0, socketid= 0, misc= d01b0fff01000000, addr= 395164300, synd= f79c00000b800003, ipid= 9600050f00, mcgstatus=0, mcgcap= 11c, apicid= 0
May 20 00:14:26 localhost rasdaemon[995]: rasdaemon: mc_event store: 0x55aaea8a4418
May 20 00:14:26 localhost rasdaemon[995]: rasdaemon: register inserted at db
May 20 00:14:26 localhost rasdaemon[995]: <...>-661 [000] 0.000099: mc_event: 2020-04-01 19:40:01 +0200 1 Corrected error: Cannot decode normalized address on mc#0csrow#3channel#0 (mc: 0 location: 3:0 grain: 6)
May 20 00:14:26 localhost rasdaemon[995]: rasdaemon: mce_record store: 0x55aaea8a19e8
May 20 00:14:26 localhost rasdaemon[995]: rasdaemon: register inserted at db
May 20 00:14:26 localhost rasdaemon[995]: <...>-661 [000] 0.000099: mce_record: 2020-04-01 19:40:01 +0200 Unified Memory Controller (bank=18), status= dc2040000000011b, Corrected error, no action required., mci=Error_overflow CECC, mca= DRAM ECC error.
May 20 00:14:26 localhost rasdaemon[995]: Memory Error 'mem-tx: generic read, tx: generic, level: L3/generic', memory_channel=1,csrow=2, cpu_type= AMD Family 17h Zen1, cpu= 0, socketid= 0, misc= d01a033c01000000, addr= 30088c100, synd= 510600800a800302, ipid= 9600150f00, mcgstatus=0, mcgcap= 11c, apicid= 0
May 20 00:14:26 localhost rasdaemon[995]: rasdaemon: mc_event store: 0x55aaea8a4418
May 20 00:14:26 localhost rasdaemon[995]: rasdaemon: register inserted at db
May 20 00:14:26 localhost rasdaemon[995]: <...>-661 [000] 0.000099: mc_event: 2020-04-01 19:40:01 +0200 1 Corrected error: Cannot decode normalized address on mc#0csrow#2channel#1 (mc: 0 location: 2:1 grain: 6 syndrome: 0x00000080)
May 20 00:14:26 localhost abrt-server[1729]: Not saving repeating crash in '/boot/vmlinuz-5.6.8-300.fc32.x86_64'
May 20 00:14:26 localhost abrt-server[1732]: Not saving repeating crash in '/boot/vmlinuz-5.6.8-300.fc32.x86_64'
May 20 00:14:26 localhost abrt-server[1735]: Not saving repeating crash in '/boot/vmlinuz-5.6.8-300.fc32.x86_64'
May 20 00:14:28 localhost abrt-dump-journal-oops[1036]: abrt-dump-journal-oops: Found oopses: 1
May 20 00:14:28 localhost abrt-dump-journal-oops[1036]: abrt-dump-journal-oops: Creating problem directories
May 20 00:14:28 localhost abrt-notification[1772]: System encountered a non-fatal error in ??()
May 20 00:14:28 localhost systemd[1]: dbus-:1.3-org.freedesktop.problems@3.service: Succeeded.
May 20 00:14:28 localhost audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=dbus-:1.3-org.freedesktop.problems@3 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
May 20 00:14:29 localhost abrt-dump-journal-oops[1036]: Reported 1 kernel oopses to Abrt
May 20 00:17:03 localhost rasdaemon[995]: rasdaemon: mce_record store: 0x55aaea8a19e8
May 20 00:17:03 localhost kernel: mce: Uncorrected hardware memory error in user-access at 621211640
May 20 00:17:03 localhost kernel: mce: [Hardware Error]: Machine check events logged
May 20 00:17:03 localhost kernel: [Hardware Error]: Uncorrected, software restartable error.
May 20 00:17:03 localhost kernel: [Hardware Error]: CPU:9 (17:71:0) MC0_STATUS[-|UE|MiscV|AddrV|-|-|-|UECC|-|Poison|-]: 0xbc002800000c0135
May 20 00:17:03 localhost kernel: [Hardware Error]: Error Addr: 0x0000000621211640
May 20 00:17:03 localhost kernel: [Hardware Error]: IPID: 0x000000b000000000
May 20 00:17:03 localhost kernel: [Hardware Error]: Load Store Unit Ext. Error Code: 12, DC Data error type 1 and poison consumption.
May 20 00:17:03 localhost kernel: [Hardware Error]: cache level: L1, tx: DATA, mem-tx: DRD
May 20 00:17:03 localhost kernel: Memory failure: 0x621211: Sending SIGBUS to memtester:1666 due to hardware memory corruption
May 20 00:17:03 localhost kernel: Memory failure: 0x621211: recovery action for dirty LRU page: Recovered
May 20 00:17:03 localhost audit[1666]: ANOM_ABEND auid=0 uid=0 gid=0 ses=1 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 pid=1666 comm="memtester" exe="/usr/bin/memtester" sig=7 res=1
May 20 00:17:03 localhost rasdaemon[995]: rasdaemon: register inserted at db
May 20 00:17:03 localhost rasdaemon[995]: <...>-213 [009] 0.000114: mce_record: 2020-04-01 19:42:37 +0200 Load Store Unit (bank=0), status= bc002800000c0135, Uncorrected, software containable error., mci=UECC Poison consumed, mca= DC data error type 1 (poison consumption).
May 20 00:17:03 localhost rasdaemon[995]: Memory Error 'mem-tx: data read, tx: data, level: L1', cpu_type= AMD Family 17h Zen1, cpu= 9, socketid= 0, ip= 401e81, cs= 33, misc= d01a000000000000, addr= 621211640, ipid= b000000000, mcgstatus=7 RIPV EIPV MCIP, mcgcap= 11c, apicid= 9
May 20 00:17:03 localhost audit: BPF prog-id=44 op=LOAD
May 20 00:17:03 localhost audit: BPF prog-id=45 op=LOAD
May 20 00:17:03 localhost audit: BPF prog-id=46 op=LOAD
May 20 00:17:03 localhost systemd[1]: Started Process Core Dump (PID 1790/UID 0).
May 20 00:17:03 localhost audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-coredump@1-1790-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
May 20 00:17:03 localhost systemd[1]: Started dbus-:1.3-org.freedesktop.problems@4.service.
May 20 00:17:03 localhost audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=dbus-:1.3-org.freedesktop.problems@4 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
May 20 00:17:04 localhost abrt-dump-journal-oops[1036]: abrt-dump-journal-oops: Found oopses: 1
May 20 00:17:04 localhost abrt-dump-journal-oops[1036]: abrt-dump-journal-oops: Creating problem directories
May 20 00:17:05 localhost abrt-dump-journal-oops[1036]: Reported 1 kernel oopses to Abrt
May 20 00:17:06 localhost abrt-notification[1833]: System encountered a non-fatal error in ??()
May 20 00:17:07 localhost systemd-coredump[1792]: Core file was truncated to 2147483648 bytes.
May 20 00:17:08 localhost abrt-dump-journal-core[1035]: Failed to obtain all required information from journald
May 20 00:17:12 localhost systemd-coredump[1792]: Process 1666 (memtester) of user 0 dumped core.#012#012Stack trace of thread 1666:#012#0 0x0000000000401e81 compare_regions (/usr/bin/memtester + 0x1e81)
May 20 00:17:12 localhost systemd[1]: systemd-coredump@1-1790-0.service: Succeeded.
May 20 00:17:12 localhost audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-coredump@1-1790-0 comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
May 20 00:17:12 localhost systemd[1]: systemd-coredump@1-1790-0.service: Consumed 1.976s CPU time.
May 20 00:17:12 localhost audit: BPF prog-id=46 op=UNLOAD
May 20 00:17:12 localhost audit: BPF prog-id=45 op=UNLOAD
May 20 00:17:12 localhost audit: BPF prog-id=44 op=UNLOAD
May 20 00:17:17 localhost abrtd[1003]: Size of '/var/spool/abrt' >= 5000 MB (MaxCrashReportsSize), deleting old directory 'oops-2020-05-20-00:17:04-1036-0'
May 20 00:17:17 localhost abrtd[1003]: Size of '/var/spool/abrt' >= 5000 MB (MaxCrashReportsSize), deleting old directory 'oops-2020-05-20-00:14:28-1036-0'
May 20 00:17:17 localhost abrtd[1003]: Size of '/var/spool/abrt' >= 5000 MB (MaxCrashReportsSize), deleting old directory 'oops-2020-05-20-00:12:17-1036-0'
May 20 00:17:17 localhost abrtd[1003]: Size of '/var/spool/abrt' >= 5000 MB (MaxCrashReportsSize), deleting old directory 'oops-2020-05-20-00:09:00-1036-0'
May 20 00:17:17 localhost abrtd[1003]: Size of '/var/spool/abrt' >= 5000 MB (MaxCrashReportsSize), deleting old directory 'oops-2020-05-20-00:03:33-1036-0'
May 20 00:17:17 localhost abrtd[1003]: Size of '/var/spool/abrt' >= 5000 MB (MaxCrashReportsSize), deleting old directory 'ras-2020-05-20-00:03:31-995'
May 20 00:17:17 localhost abrtd[1003]: Size of '/var/spool/abrt' >= 5000 MB (MaxCrashReportsSize), deleting old directory 'ras-2020-05-20-00:17:03-995'
May 20 00:17:17 localhost abrtd[1003]: Size of '/var/spool/abrt' >= 5000 MB (MaxCrashReportsSize), deleting old directory 'ras-2020-05-20-00:12:15-995'
May 20 00:17:17 localhost abrtd[1003]: Size of '/var/spool/abrt' >= 5000 MB (MaxCrashReportsSize), deleting old directory 'ras-2020-05-20-00:14:26-995'
May 20 00:17:17 localhost abrtd[1003]: Size of '/var/spool/abrt' >= 5000 MB (MaxCrashReportsSize), deleting old directory 'ras-2020-05-20-00:08:59-995'
May 20 00:17:17 localhost abrt-server[1844]: Error: No segments found in coredump './coredump'
May 20 00:17:17 localhost abrt-server[1844]: Can't open file 'core_backtrace' for reading: No such file or directory
May 20 00:17:17 localhost abrt-notification[1889]: Process 1666 (memtester) crashed in ??()
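To make the log above easier to read, here is a rough sketch that pulls the bank number and status value out of a kernel “MCx_STATUS” line and decodes the flag bits. The bit positions in the table are an assumption based on the AMD MCA status register layout as I understand it; they do reproduce the bracketed flags the kernel printed for the two status values seen in this log. The function names are hypothetical, and the page-frame helper assumes 4 KiB pages.

```python
import re

# Assumed bit positions in the AMD MCA status register (not taken from the log).
FLAGS = [
    (62, "Over"),      # an earlier error was overwritten
    (61, "UC"),        # uncorrected error
    (59, "MiscV"),
    (58, "AddrV"),
    (57, "PCC"),
    (55, "TCC"),
    (53, "SyndV"),
    (46, "CECC"),      # correctable ECC error
    (45, "UECC"),      # uncorrectable ECC error
    (44, "Deferred"),
    (43, "Poison"),
    (40, "Scrub"),
]

STATUS_RE = re.compile(r"MC(\d+)_STATUS\[[^\]]*\]: (0x[0-9a-fA-F]+)")


def decode_status(status: int) -> list:
    """Return the names of all flags set in a status value."""
    return [name for bit, name in FLAGS if (status >> bit) & 1]


def parse_status_line(line: str):
    """Return (bank, flags) for a kernel MCx_STATUS line, or None if no match."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), decode_status(int(match.group(2), 16))


def page_frame(address: int, page_shift: int = 12) -> int:
    """Page frame number for a physical address, assuming 4 KiB pages."""
    return address >> page_shift


line = ("May 20 00:17:03 localhost kernel: [Hardware Error]: CPU:9 (17:71:0) "
        "MC0_STATUS[-|UE|MiscV|AddrV|-|-|-|UECC|-|Poison|-]: 0xbc002800000c0135")
print(parse_status_line(line))
print(hex(page_frame(0x621211640)))
```

The last line shows where the 0x621211 in the “Memory failure” messages comes from: it is the error address 0x621211640 with the lowest 12 bits dropped.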