Just an update: I finished updating the hypervisor to 4.14.15, and updated oVirt to 4.2 along with all associated packages. I also checked the BIOS and couldn’t find any options regarding ECC. After some searching, though, it looks like ECC should be enabled even though dmesg never reports it as such: dmidecode -t memory reports a data width of 64 bits and a total width of 72 bits, and the extra 8 bits are the ECC check bits. I also recreated the RAID array and set up XFS on it directly. It’s still copying my data over, but more than 5TB has been copied so far with no errors.
# dmidecode 3.0
Getting SMBIOS data from sysfs.
SMBIOS 3.0.0 present.
Handle 0x002A, DMI type 16, 23 bytes
Physical Memory Array
Location: System Board Or Motherboard
Use: System Memory
Error Correction Type: Multi-bit ECC
Maximum Capacity: 128 GB
Error Information Handle: Not Provided
Number Of Devices: 2
Handle 0x002B, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x002A
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 16384 MB
Form Factor: DIMM
Set: None
Locator: P1-DIMMA1
Bank Locator: P0_Node0_Channel0_Dimm0
Type: DDR4
Type Detail: Synchronous
Speed: 2400 MHz
Manufacturer: Kingston
Serial Number: E80C3B3A
Asset Tag: P1-DIMMA1_AssetTag (date:16/49)
Part Number: 9965604-008.D00G
Rank: 2
Configured Clock Speed: 2400 MHz
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: Unknown
Handle 0x002C, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x002A
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 16384 MB
Form Factor: DIMM
Set: None
Locator: P1-DIMMB1
Bank Locator: P0_Node0_Channel1_Dimm0
Type: DDR4
Type Detail: Synchronous
Speed: 2400 MHz
Manufacturer: Kingston
Serial Number: ED0C3A3A
Asset Tag: P1-DIMMB1_AssetTag (date:16/49)
Part Number: 9965604-008.D00G
Rank: 2
Configured Clock Speed: 2400 MHz
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: Unknown
Handle 0x002D, DMI type 16, 23 bytes
Physical Memory Array
Location: System Board Or Motherboard
Use: System Memory
Error Correction Type: Multi-bit ECC
Maximum Capacity: 128 GB
Error Information Handle: Not Provided
Number Of Devices: 2
Handle 0x002E, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x002D
Error Information Handle: Not Provided
Total Width: Unknown
Data Width: Unknown
Size: No Module Installed
Form Factor: DIMM
Set: None
Locator: P1-DIMMC1
Bank Locator: P0_Node0_Channel2_Dimm0
Type: DDR4
Type Detail: Synchronous
Speed: Unknown
Manufacturer: NO DIMM
Serial Number: NO DIMM
Asset Tag: NO DIMM
Part Number: NO DIMM
Rank: Unknown
Configured Clock Speed: Unknown
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: Unknown
Handle 0x002F, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x002D
Error Information Handle: Not Provided
Total Width: Unknown
Data Width: Unknown
Size: No Module Installed
Form Factor: DIMM
Set: None
Locator: P1-DIMMD1
Bank Locator: P0_Node0_Channel3_Dimm0
Type: DDR4
Type Detail: Synchronous
Speed: Unknown
Manufacturer: NO DIMM
Serial Number: NO DIMM
Asset Tag: NO DIMM
Part Number: NO DIMM
Rank: Unknown
Configured Clock Speed: Unknown
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: Unknown
Handle 0x0030, DMI type 16, 23 bytes
Physical Memory Array
Location: System Board Or Motherboard
Use: System Memory
Error Correction Type: Multi-bit ECC
Maximum Capacity: 128 GB
Error Information Handle: Not Provided
Number Of Devices: 2
Handle 0x0031, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x0030
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 16384 MB
Form Factor: DIMM
Set: None
Locator: P2-DIMME1
Bank Locator: P1_Node1_Channel0_Dimm0
Type: DDR4
Type Detail: Synchronous
Speed: 2400 MHz
Manufacturer: Kingston
Serial Number: E90C463A
Asset Tag: P2-DIMME1_AssetTag (date:16/49)
Part Number: 9965604-008.D00G
Rank: 2
Configured Clock Speed: 2400 MHz
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: Unknown
Handle 0x0032, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x0030
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 16384 MB
Form Factor: DIMM
Set: None
Locator: P2-DIMMF1
Bank Locator: P1_Node1_Channel1_Dimm0
Type: DDR4
Type Detail: Synchronous
Speed: 2400 MHz
Manufacturer: Kingston
Serial Number: EE0C393A
Asset Tag: P2-DIMMF1_AssetTag (date:16/49)
Part Number: 9965604-008.D00G
Rank: 2
Configured Clock Speed: 2400 MHz
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: Unknown
Handle 0x0033, DMI type 16, 23 bytes
Physical Memory Array
Location: System Board Or Motherboard
Use: System Memory
Error Correction Type: Multi-bit ECC
Maximum Capacity: 128 GB
Error Information Handle: Not Provided
Number Of Devices: 2
Handle 0x0034, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x0033
Error Information Handle: Not Provided
Total Width: Unknown
Data Width: Unknown
Size: No Module Installed
Form Factor: DIMM
Set: None
Locator: P2-DIMMG1
Bank Locator: P1_Node1_Channel2_Dimm0
Type: DDR4
Type Detail: Synchronous
Speed: Unknown
Manufacturer: NO DIMM
Serial Number: NO DIMM
Asset Tag: NO DIMM
Part Number: NO DIMM
Rank: Unknown
Configured Clock Speed: Unknown
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: Unknown
Handle 0x0035, DMI type 17, 40 bytes
Memory Device
Array Handle: 0x0033
Error Information Handle: Not Provided
Total Width: Unknown
Data Width: Unknown
Size: No Module Installed
Form Factor: DIMM
Set: None
Locator: P2-DIMMH1
Bank Locator: P1_Node1_Channel3_Dimm0
Type: DDR4
Type Detail: Synchronous
Speed: Unknown
Manufacturer: NO DIMM
Serial Number: NO DIMM
Asset Tag: NO DIMM
Part Number: NO DIMM
Rank: Unknown
Configured Clock Speed: Unknown
Minimum Voltage: Unknown
Maximum Voltage: Unknown
Configured Voltage: Unknown
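For anyone wanting to repeat the width check without reading the whole dump, here is a minimal sketch. It assumes the field names shown in the output above (Total Width, Data Width, Locator); ecc_width_report is a made-up helper name, not part of dmidecode. Note that a 72/64 split only shows the modules carry the extra check bits, so treat it as a hint rather than proof that the memory controller is correcting errors.

```shell
#!/bin/sh
# Sketch: read `dmidecode -t memory` text on stdin and, for each populated
# DIMM, report whether Total Width is larger than Data Width.
# ecc_width_report is a hypothetical helper name, not a dmidecode feature.
ecc_width_report() {
    awk '
        /^[ \t]*Total Width:/ { total = $3 }
        /^[ \t]*Data Width:/  { data = $3 }
        # "Locator:" follows both width fields inside a Memory Device record;
        # the anchored pattern deliberately skips "Bank Locator:".
        /^[ \t]*Locator:/ {
            # Empty slots report "Unknown" widths and are skipped.
            if (total ~ /^[0-9]+$/ && data ~ /^[0-9]+$/) {
                extra = total - data
                printf "%s: total=%s data=%s %s\n", $2, total, data, (extra > 0 ? "ecc-width" : "no-ecc-width")
            }
            total = ""; data = ""
        }
    '
}

# Usage (dmidecode normally needs root):
#   dmidecode -t memory | ecc_width_report
```

On the output above this should print one line per installed Kingston module and nothing for the empty slots.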
You may need to load the EDAC (Error Detection & Correction) kernel modules to see ECC RAM stats.
On my AMD Ryzen & Phenom systems I have to run modprobe amd64_edac_mod.
You can run find /lib/modules/$(uname -r) -iname '*edac*' to see which EDAC modules you’ve got.
Thanks for mentioning that. I found the following in dmesg related to the sb_edac module.
[ 0.580152] EDAC MC: Ver: 3.0.0
[ 1.645096] ghes_edac: This EDAC driver relies on BIOS to enumerate memory and get error reports.
[ 1.645097] ghes_edac: Unfortunately, not all BIOSes reflect the memory layout correctly.
[ 1.645097] ghes_edac: So, the end result of using this driver varies from vendor to vendor.
[ 1.645098] ghes_edac: If you find incorrect reports, please contact your hardware vendor
[ 1.645098] ghes_edac: to correct its BIOS.
[ 1.645099] ghes_edac: This system has 8 DIMM sockets.
[ 1.645230] EDAC MC0: Giving out device to module ghes_edac.c controller ghes_edac: DEV ghes (INTERRUPT)
[ 1.645305] EDAC MC1: Giving out device to module ghes_edac.c controller ghes_edac: DEV ghes (INTERRUPT)
[ 7.406799] EDAC sbridge: Seeking for: PCI ID 8086:6fa0
[ 7.406819] EDAC sbridge: Seeking for: PCI ID 8086:6fa0
[ 7.406839] EDAC sbridge: Seeking for: PCI ID 8086:6fa0
[ 7.406859] EDAC sbridge: Seeking for: PCI ID 8086:6f60
[ 7.406865] EDAC sbridge: Seeking for: PCI ID 8086:6f60
[ 7.406874] EDAC sbridge: Seeking for: PCI ID 8086:6f60
[ 7.406880] EDAC sbridge: Seeking for: PCI ID 8086:6fa8
[ 7.406886] EDAC sbridge: Seeking for: PCI ID 8086:6fa8
[ 7.406894] EDAC sbridge: Seeking for: PCI ID 8086:6fa8
[ 7.406902] EDAC sbridge: Seeking for: PCI ID 8086:6f71
[ 7.406908] EDAC sbridge: Seeking for: PCI ID 8086:6f71
[ 7.406916] EDAC sbridge: Seeking for: PCI ID 8086:6f71
[ 7.406922] EDAC sbridge: Seeking for: PCI ID 8086:6faa
[ 7.406927] EDAC sbridge: Seeking for: PCI ID 8086:6faa
[ 7.406933] EDAC sbridge: Seeking for: PCI ID 8086:6faa
[ 7.406938] EDAC sbridge: Seeking for: PCI ID 8086:6fab
[ 7.406943] EDAC sbridge: Seeking for: PCI ID 8086:6fab
[ 7.406949] EDAC sbridge: Seeking for: PCI ID 8086:6fab
[ 7.406954] EDAC sbridge: Seeking for: PCI ID 8086:6fac
[ 7.406965] EDAC sbridge: Seeking for: PCI ID 8086:6fad
[ 7.406976] EDAC sbridge: Seeking for: PCI ID 8086:6f68
[ 7.406981] EDAC sbridge: Seeking for: PCI ID 8086:6f68
[ 7.406988] EDAC sbridge: Seeking for: PCI ID 8086:6f68
[ 7.406992] EDAC sbridge: Seeking for: PCI ID 8086:6f79
[ 7.407009] EDAC sbridge: Seeking for: PCI ID 8086:6f79
[ 7.407015] EDAC sbridge: Seeking for: PCI ID 8086:6f79
[ 7.407019] EDAC sbridge: Seeking for: PCI ID 8086:6f6a
[ 7.407025] EDAC sbridge: Seeking for: PCI ID 8086:6f6a
[ 7.407032] EDAC sbridge: Seeking for: PCI ID 8086:6f6a
[ 7.407036] EDAC sbridge: Seeking for: PCI ID 8086:6f6b
[ 7.407041] EDAC sbridge: Seeking for: PCI ID 8086:6f6b
[ 7.407048] EDAC sbridge: Seeking for: PCI ID 8086:6f6b
[ 7.407052] EDAC sbridge: Seeking for: PCI ID 8086:6f6c
[ 7.407063] EDAC sbridge: Seeking for: PCI ID 8086:6f6d
[ 7.407074] EDAC sbridge: Seeking for: PCI ID 8086:6ffc
[ 7.407078] EDAC sbridge: Seeking for: PCI ID 8086:6ffc
[ 7.407084] EDAC sbridge: Seeking for: PCI ID 8086:6ffc
[ 7.407090] EDAC sbridge: Seeking for: PCI ID 8086:6ffd
[ 7.407094] EDAC sbridge: Seeking for: PCI ID 8086:6ffd
[ 7.407100] EDAC sbridge: Seeking for: PCI ID 8086:6ffd
[ 7.407106] EDAC sbridge: Seeking for: PCI ID 8086:6faf
[ 7.407110] EDAC sbridge: Seeking for: PCI ID 8086:6faf
[ 7.407117] EDAC sbridge: Seeking for: PCI ID 8086:6faf
[ 7.407180] EDAC sbridge: Couldn't find mci handler
[ 7.407181] EDAC sbridge: Couldn't find mci handler
[ 7.407182] EDAC sbridge: Couldn't find mci handler
[ 7.407182] EDAC sbridge: Couldn't find mci handler
[ 7.407185] EDAC sbridge: Failed to register device with error -22.
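The log above is long mostly because the sbridge driver probes each PCI ID several times. A rough sketch for condensing it follows; edac_summary is a hypothetical helper name, and it only keeps lines whose message starts with "EDAC", so the ghes_edac banner lines are dropped. For reference, error -22 is EINVAL.

```shell
#!/bin/sh
# Sketch: read dmesg text on stdin, keep messages that start with "EDAC",
# strip the timestamp, and print each distinct message once with a count,
# in order of first appearance.
# edac_summary is a hypothetical helper name.
edac_summary() {
    sed -n 's/^\[[^]]*] *\(EDAC .*\)$/\1/p' | awk '
        !($0 in n) { order[++k] = $0 }   # remember first-seen order
        { n[$0]++ }                      # count every occurrence
        END { for (i = 1; i <= k; i++) printf "%d x %s\n", n[order[i]], order[i] }
    '
}

# Usage:
#   dmesg | edac_summary
```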
edac-util -vs
edac-util: EDAC drivers are loaded. 2 MCs detected:
mc0:ghes_edac
mc1:ghes_edac
edac-util -vr
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: mc#0memory#0: 0 Corrected Errors
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: mc#0memory#1: 0 Corrected Errors
mc0: csrow4: 0 Uncorrected Errors
mc0: csrow4: mc#0memory#4: 0 Corrected Errors
mc0: csrow5: 0 Uncorrected Errors
mc0: csrow5: mc#0memory#5: 0 Corrected Errors
mc1: 0 Uncorrected Errors with no DIMM info
mc1: 0 Corrected Errors with no DIMM info
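If edac-util is not installed, the per-controller totals can also be read straight from sysfs. This is a sketch that assumes the standard EDAC layout under /sys/devices/system/edac/mc (mc_name, ce_count and ue_count per controller); edac_counts is a made-up helper name, and the optional argument exists only so the function can be pointed at a different directory.

```shell
#!/bin/sh
# Sketch: print "<mc> <driver> ce=<corrected> ue=<uncorrected>" for each
# EDAC memory controller. edac_counts is a hypothetical helper name.
# $1 optionally overrides the sysfs base directory (assumed default below).
edac_counts() {
    base=${1:-/sys/devices/system/edac/mc}
    for mc in "$base"/mc[0-9]*; do
        [ -d "$mc" ] || continue          # no controllers registered
        name=$(cat "$mc/mc_name" 2>/dev/null || echo unknown)
        ce=$(cat "$mc/ce_count" 2>/dev/null || echo "?")
        ue=$(cat "$mc/ue_count" 2>/dev/null || echo "?")
        echo "${mc##*/} $name ce=$ce ue=$ue"
    done
}

# Usage:
#   edac_counts
```

Running it from cron or a monitoring check and alerting on a non-zero ce or ue value is one way to notice memory errors without watching dmesg.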
I’ve finished copying back all the data and so far there’s no corruption, but it’s almost a fresh start, so it may take a while to reoccur. I’m starting to use it like usual again and I’ll follow up if it does.
I found I did not have mcelog installed, so I’ve installed it now. It may prove useful.
From what I can find, chips with a built-in memory controller will throw machine check errors when a memory error comes up (I’ve seen these before when overclocking my main rig); EDAC is the older mechanism.
mcelog
Usage:
mcelog [options] [mcelogdevice]
Decode machine check error records from current kernel.
..........
--cpu CPU Set CPU type CPU to decode (see below for valid types)
..........
Valid CPUs: generic p6old core2 k8 p4 dunnington xeon74xx xeon7400 xeon5500 xeon5200 xeon5000 xeon5100 xeon3100 xeon3200 core_i7 core_i5 core_i3 nehalem westmere xeon71xx xeon7100 tulsa intel xeon75xx xeon7500 xeon7200 xeon7100 sandybridge sandybridge-ep ivybridge ivybridge-ep ivybridge-ex haswell haswell-ep haswell-ex broadwell broadwell-d broadwell-ep broadwell-ex knightslanding knightsmill xeon-v4 atom skylake skylake_server kabylake denverton
While they are ES CPUs (0000), they should still be Broadwell EP, which appears to be supported.