ECC not configured in Linux? memtest86 can see the errors, but ras-mc-ctl --layout says I have 0 MB

I’ve been using Linux as my daily driver for a few years, but this is a lot further into the guts than I normally go, and I have no practical experience with ECC.

I’m building a new system and wanted ECC this time around. I’m validating ECC by overclocking the RAM (the system POSTs about 50% of the time), and memtest86 found 518 single-bit errors in 1 hour. But after a night (7 hours or so) of prime95 blend stress testing, Linux reported no ECC errors. I also ran the Phoronix memory benchmark, to no avail.

It could just be that memtest86 is fricking fantastic at causing memory errors compared to most stuff inside an OS… but based on the output of ras-mc-ctl --layout, I tend to think something isn’t configured correctly.

I am familiar with the thread on introducing ECC errors via hardware. For now I’m too chicken to risk damaging my new equipment. Maybe I’ll be able to convince myself it’s safe and give it a shot later.

So a few questions:

On Linux, what do I have to check to be confident that any ECC errors the motherboard passes along are processed by Linux?

What does the ras-mc-ctl --layout output look like on a Linux system that has reported 1-bit and/or 2-bit ECC errors?

Do you think things are working fine and I just didn’t trigger a 1-bit ECC error inside of Linux?

Here’s some relevant output from the console:

[root@fedora ~]# dmidecode -t memory | grep ECC
	Error Correction Type: Multi-bit ECC


[root@fedora ~]# cat /etc/system-release
Fedora release 34 (Thirty Four)


[root@fedora ~]# ras-mc-ctl --mainboard
ras-mc-ctl: mainboard: ASRock model X570 Taichi


CPU: AMD 5600x
[root@fedora ~]# /usr/sbin/rasdaemon --version
rasdaemon 0.6.4


[root@fedora ~]# ras-mc-ctl --layout
Use of uninitialized value $max_pos[3] in modulus (%) at /usr/sbin/ras-mc-ctl line 868.
Use of uninitialized value $d in numeric ge (>=) at /usr/sbin/ras-mc-ctl line 869.
Use of uninitialized value $d in sprintf at /usr/sbin/ras-mc-ctl line 872.
Use of uninitialized value $pos[3] in join or string at /usr/sbin/ras-mc-ctl line 791.
Use of uninitialized value $pos[3] in join or string at /usr/sbin/ras-mc-ctl line 791.
Use of uninitialized value $pos[3] in join or string at /usr/sbin/ras-mc-ctl line 791.
Use of uninitialized value $pos[3] in join or string at /usr/sbin/ras-mc-ctl line 791.
Use of uninitialized value $pos[3] in join or string at /usr/sbin/ras-mc-ctl line 791.
Use of uninitialized value $pos[3] in join or string at /usr/sbin/ras-mc-ctl line 791.
    +-----------------------------------------------------------------------+
    |                                  mc0                                  |
    |        csrow0         |        csrow1         |        csrow2         |
    | channel0  | channel1  | channel0  | channel1  | channel0  | channel1  |
----+-----------------------------------------------------------------------+

0: |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |
----+-----------------------------------------------------------------------+


[root@fedora ~]# ras-mc-ctl --error-count
Label               	CE	UE
mc#0csrow#3channel#0	0	0
mc#0csrow#2channel#1	0	0
mc#0csrow#3channel#1	0	0
mc#0csrow#2channel#0	0	0


[root@fedora ~]# dnf info rasdaemon
Last metadata expiration check: 0:08:46 ago on Fri 22 Oct 2021 12:10:04 AM EDT.
Installed Packages
Name         : rasdaemon
Version      : 0.6.4
Release      : 4.fc34
Architecture : x86_64
Size         : 285 k
Source       : rasdaemon-0.6.4-4.fc34.src.rpm
Repository   : @System
From repo    : fedora
Summary      : Utility to receive RAS error tracings
URL          : http://git.infradead.org/users/mchehab/rasdaemon.git
License      : GPLv2
Description  : rasdaemon is a RAS (Reliability, Availability and Serviceability)
             : logging tool. It currently records memory errors, using the EDAC
             : tracing events. EDAC is drivers in the Linux kernel that handle
             : detection of ECC errors from memory controllers for most chipsets
             : on i386 and x86_64 architectures. EDAC drivers for other
             : architectures like arm also exists. This userspace component
             : consists of an init script which makes sure EDAC drivers and DIMM
             : labels are loaded at system startup, as well as an utility for
             : reporting current error counts from the EDAC sysfs files.

v0.6.4 is 2 years old, so I went to rasdaemon’s git and tried compiling v0.6.6 and v0.6.7 (0.6.7 is 4 months old). Some of the switches don’t work any more, but --layout has nearly identical output.

[root@fedora rasdaemon]# rasdaemon --version
rasdaemon 0.6.6


compile time options summary
============================

    Sqlite3             : no
    AER                 : no
    MCE                 : no
    EXTLOG              : no
    CPER non-standard   : no
    ABRT report         : no
    HIP07 SAS HW errors : no
    ARM events          : no
    DEVLINK             : no
    Disk I/O errors     : no
    Memory CE PFA       : no


[root@fedora rasdaemon]# ras-mc-ctl --layout
Use of uninitialized value $max_pos[3] in modulus (%) at /usr/local/sbin/ras-mc-ctl line 868.
Use of uninitialized value $d in numeric ge (>=) at /usr/local/sbin/ras-mc-ctl line 869.
Use of uninitialized value $d in sprintf at /usr/local/sbin/ras-mc-ctl line 872.
Use of uninitialized value $pos[3] in join or string at /usr/local/sbin/ras-mc-ctl line 791.
Use of uninitialized value $pos[3] in join or string at /usr/local/sbin/ras-mc-ctl line 791.
Use of uninitialized value $pos[3] in join or string at /usr/local/sbin/ras-mc-ctl line 791.
Use of uninitialized value $pos[3] in join or string at /usr/local/sbin/ras-mc-ctl line 791.
Use of uninitialized value $pos[3] in join or string at /usr/local/sbin/ras-mc-ctl line 791.
Use of uninitialized value $pos[3] in join or string at /usr/local/sbin/ras-mc-ctl line 791.
    +-----------------------------------------------------------------------+
    |                                  mc0                                  |
    |        csrow0         |        csrow1         |        csrow2         |
    | channel0  | channel1  | channel0  | channel1  | channel0  | channel1  |
----+-----------------------------------------------------------------------+

0: |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |
----+-----------------------------------------------------------------------+


[root@fedora rasdaemon]# ras-mc-ctl --errors
DBD::SQLite::db prepare failed: no such table: mc_event at /usr/local/sbin/ras-mc-ctl line 1243.
ras-mc-ctl: Error: mc_event table missing from /usr/local/var/lib/rasdaemon/ras-mc_event.db. Run 'rasdaemon --record'.


[root@fedora rasdaemon]# ras-mc-ctl --summary
DBD::SQLite::db prepare failed: no such table: mc_event at /usr/local/sbin/ras-mc-ctl line 1131.
Can't call method "execute" on an undefined value at /usr/local/sbin/ras-mc-ctl line 1132.


[root@fedora rasdaemon]# ras-mc-ctl --error-count
Label               	CE	UE
mc#0csrow#3channel#1	0	0
mc#0csrow#2channel#1	0	0
mc#0csrow#2channel#0	0	0
mc#0csrow#3channel#0	0	0
[root@fedora rasdaemon]# rasdaemon --version
rasdaemon 0.6.7


compile time options summary
============================

    Sqlite3             : no
    AER                 : no
    MCE                 : no
    EXTLOG              : no
    CPER non-standard   : no
    ABRT report         : no
    HISI Kunpeng errors : no
    ARM events          : no
    DEVLINK             : no
    Disk I/O errors     : no
    Memory Failure      : no
    Memory CE PFA       : no
    AMP RAS errors      : no



[root@fedora rasdaemon]# ras-mc-ctl --layout
Use of uninitialized value $max_pos[3] in modulus (%) at /usr/local/sbin/ras-mc-ctl line 905.
Use of uninitialized value $d in numeric ge (>=) at /usr/local/sbin/ras-mc-ctl line 906.
Use of uninitialized value $d in sprintf at /usr/local/sbin/ras-mc-ctl line 909.
Use of uninitialized value $pos[3] in join or string at /usr/local/sbin/ras-mc-ctl line 828.
Use of uninitialized value $pos[3] in join or string at /usr/local/sbin/ras-mc-ctl line 828.
Use of uninitialized value $pos[3] in join or string at /usr/local/sbin/ras-mc-ctl line 828.
Use of uninitialized value $pos[3] in join or string at /usr/local/sbin/ras-mc-ctl line 828.
Use of uninitialized value $pos[3] in join or string at /usr/local/sbin/ras-mc-ctl line 828.
Use of uninitialized value $pos[3] in join or string at /usr/local/sbin/ras-mc-ctl line 828.
Use of uninitialized value $pos[3] in join or string at /usr/local/sbin/ras-mc-ctl line 828.
Use of uninitialized value $pos[3] in join or string at /usr/local/sbin/ras-mc-ctl line 828.
    +-----------------------------------------------------------------------------------------------+
    |                                              mc0                                              |
    |        csrow0         |        csrow1         |        csrow2         |        csrow3         |
    | channel0  | channel1  | channel0  | channel1  | channel0  | channel1  | channel0  | channel1  |
----+-----------------------------------------------------------------------------------------------+

0: |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |     0 MB  |
----+-----------------------------------------------------------------------------------------------+


[root@fedora rasdaemon]# ras-mc-ctl --error-count
Label               	CE	UE
mc#0csrow#2channel#0	0	0
mc#0csrow#3channel#1	0	0
mc#0csrow#3channel#0	0	0
mc#0csrow#2channel#1	0	0


[root@fedora rasdaemon]# ras-mc-ctl --summary
DBD::SQLite::db prepare failed: no such table: mc_event at /usr/local/sbin/ras-mc-ctl line 1169.
Can't call method "execute" on an undefined value at /usr/local/sbin/ras-mc-ctl line 1170.


[root@fedora rasdaemon]# ras-mc-ctl --errors
DBD::SQLite::db prepare failed: no such table: mc_event at /usr/local/sbin/ras-mc-ctl line 1329.
ras-mc-ctl: Error: mc_event table missing from /usr/local/var/lib/rasdaemon/ras-mc_event.db. Run 'rasdaemon --record'.
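
The --errors and --summary failures above complain that the mc_event table is missing from the database, and both builds report "Sqlite3 : no" in their compile-time summary, so it seems plausible (though not confirmed here) that these builds never create that table. A minimal sketch for checking a database file directly, assuming only that it is an ordinary SQLite file; the function name has_table is made up for illustration, and the path in the usage note is the one quoted in the error message:

```python
import sqlite3


def has_table(db_path, table="mc_event"):
    """Return True if the SQLite file at db_path contains the named table."""
    try:
        # Open read-only so that probing a missing path does not create an empty database.
        conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    except sqlite3.OperationalError:
        return False  # the file does not exist or cannot be opened
    try:
        row = conn.execute(
            "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
            (table,),
        ).fetchone()
        return row is not None
    except sqlite3.DatabaseError:
        return False  # the file exists but is not a usable SQLite database
    finally:
        conn.close()
```

Calling has_table("/usr/local/var/lib/rasdaemon/ras-mc_event.db") as root would return False in the situation shown above. That would only tell you that event recording is not set up, not whether ECC reporting itself works.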

I think rasdaemon just uses the EDAC interface, so you could try querying that directly: edac-util -v.

On an old machine of mine where edac-util reports corrected errors, ras-mc-ctl throws errors about dimm_ce_count not being found in sysfs. I never looked into it myself.

@xzpfzxds
Is edac-util still in good working order for current distros?

I was hesitant to use it because I found two knowledgeable-seeming sources that suggest not using it.

The first was memtest86’s ECC article, which mentioned that mcelog and edac-utils were deprecated.

The one that gave me the most pause was the entry in the rasdaemon readme.

Its initial goal is to replace the edac-tools that got bitroted after the addition of the HERM (Hardware Events Report Method )patches[1] at the EDAC Kernel drivers.
source: git.infradead.org Git - users/mchehab/rasdaemon.git/blob - README

I’m iffy on exactly what that’s supposed to mean, but the best fit I found was from Wikipedia’s article on software rot:

Rarely updated code
Normal maintenance of software and systems may also cause software rot. In particular, when a program contains multiple parts which function at arm’s length from one another, failing to consider how changes to one part affect the others may introduce bugs.
In some cases, this may take the form of libraries that the software uses being changed in a way which adversely affects the software. If the old version of a library that previously worked with the software can no longer be used due to conflicts with other software or security flaws that were found in the old version, there may no longer be a viable version of a needed library for the program to use.
Source: Software rot - Wikipedia

Unfortunately that doesn’t really clear up whether it’s in good working order or just a pain to work with. The most I can point to is that the git repository doesn’t seem to have been updated since 2015.

Here’s the output of edac-util -v. It looks equivalent to the output of ras-mc-ctl --error-count.

[root@fedora rasdaemon]# edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: mc#0csrow#2channel#0: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#1: 0 Corrected Errors
mc0: csrow3: 0 Uncorrected Errors
mc0: csrow3: mc#0csrow#3channel#0: 0 Corrected Errors
mc0: csrow3: mc#0csrow#3channel#1: 0 Corrected Errors
edac-util: No errors to report.
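
If you want to watch for errors from a script rather than by eye, output like the above can be totalled with a few lines of Python. This is only a sketch that assumes the line format shown in this post ("<location>: <count> Corrected|Uncorrected Errors"); other edac-util versions may print something different, and the function name total_errors is made up for illustration.

```python
import re

# Matches lines such as "mc0: csrow2: mc#0csrow#2channel#0: 0 Corrected Errors".
LINE_RE = re.compile(
    r"^(?P<where>.+): (?P<count>\d+) (?P<kind>Corrected|Uncorrected) Errors"
)


def total_errors(text):
    """Sum the corrected and uncorrected counts found in edac-util -v style output."""
    totals = {"Corrected": 0, "Uncorrected": 0}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            totals[match.group("kind")] += int(match.group("count"))
    return totals
```

Lines that do not fit the pattern, such as the final "No errors to report." line, are simply skipped.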

I’ve never had a problem with edac-util. It’s really just a frontend for the EDAC interface that the kernel exposes via sysfs (/sys/devices/system/edac/), so there’s not much to go wrong as long as the EDAC modules are in working order.
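
To look at that sysfs interface without any frontend, something like the following sketch should do. It assumes the usual layout of one mcN directory per memory controller under /sys/devices/system/edac/mc, each with size_mb, ce_count and ue_count files. Those names are what I’d expect from the kernel’s EDAC documentation, but check them on your own kernel; the helper names read_int and edac_counts are made up for illustration.

```python
from pathlib import Path

# Assumed location of the per-controller directories in the EDAC sysfs tree.
EDAC_MC_ROOT = Path("/sys/devices/system/edac/mc")


def read_int(path):
    """Return the integer stored in a sysfs file, or None if it is missing or unreadable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def edac_counts(root=EDAC_MC_ROOT):
    """Collect size and error counters for every mcN directory under root."""
    root = Path(root)
    if not root.is_dir():
        return {}  # no EDAC driver loaded, or a different sysfs layout
    report = {}
    for mc in sorted(root.glob("mc[0-9]*")):
        report[mc.name] = {
            "size_mb": read_int(mc / "size_mb"),
            "ce_count": read_int(mc / "ce_count"),
            "ue_count": read_int(mc / "ue_count"),
        }
    return report


if __name__ == "__main__":
    counts = edac_counts()
    if not counts:
        print("No memory controllers registered under", EDAC_MC_ROOT)
    for name, values in counts.items():
        print(name, values)
```

An empty result would suggest that no EDAC driver registered a memory controller at all, which is a different problem from counters that exist but stay at 0.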

Seems to work fine here on the latest Arch (kernel 5.14.14, edac-utils 0.18-3) on EPYC Milan and Xeon E3 Haswell/Broadwell.

Replying to an old thread to point out this newer thread with more up-to-date resources on ECC in Linux. Anyone finding this old thread is probably better off starting there.