Continual XFS Metadata Corruption

Just an update: I finished updating the hypervisor to 4.14.15, and updated oVirt to 4.2 along with all associated packages. I also checked the BIOS and couldn’t find any options regarding ECC. I did some searching, though, and from what I can tell ECC should be enabled even though dmesg never reports it as such: dmidecode -t memory reports a data width of 64 and a total width of 72, along with the other information below. I also recreated the RAID array and set up XFS on it directly. It’s still copying my data over, but more than 5TB has been copied so far with no errors.

# dmidecode 3.0
Getting SMBIOS data from sysfs.
SMBIOS 3.0.0 present.

Handle 0x002A, DMI type 16, 23 bytes
Physical Memory Array
	Location: System Board Or Motherboard
	Use: System Memory
	Error Correction Type: Multi-bit ECC
	Maximum Capacity: 128 GB
	Error Information Handle: Not Provided
	Number Of Devices: 2

Handle 0x002B, DMI type 17, 40 bytes
Memory Device
	Array Handle: 0x002A
	Error Information Handle: Not Provided
	Total Width: 72 bits
	Data Width: 64 bits
	Size: 16384 MB
	Form Factor: DIMM
	Set: None
	Locator: P1-DIMMA1
	Bank Locator: P0_Node0_Channel0_Dimm0
	Type: DDR4
	Type Detail: Synchronous
	Speed: 2400 MHz
	Manufacturer: Kingston
	Serial Number: E80C3B3A
	Asset Tag: P1-DIMMA1_AssetTag (date:16/49)
	Part Number: 9965604-008.D00G   
	Rank: 2
	Configured Clock Speed: 2400 MHz
	Minimum Voltage: Unknown
	Maximum Voltage: Unknown
	Configured Voltage: Unknown

Handle 0x002C, DMI type 17, 40 bytes
Memory Device
	Array Handle: 0x002A
	Error Information Handle: Not Provided
	Total Width: 72 bits
	Data Width: 64 bits
	Size: 16384 MB
	Form Factor: DIMM
	Set: None
	Locator: P1-DIMMB1
	Bank Locator: P0_Node0_Channel1_Dimm0
	Type: DDR4
	Type Detail: Synchronous
	Speed: 2400 MHz
	Manufacturer: Kingston
	Serial Number: ED0C3A3A
	Asset Tag: P1-DIMMB1_AssetTag (date:16/49)
	Part Number: 9965604-008.D00G   
	Rank: 2
	Configured Clock Speed: 2400 MHz
	Minimum Voltage: Unknown
	Maximum Voltage: Unknown
	Configured Voltage: Unknown

Handle 0x002D, DMI type 16, 23 bytes
Physical Memory Array
	Location: System Board Or Motherboard
	Use: System Memory
	Error Correction Type: Multi-bit ECC
	Maximum Capacity: 128 GB
	Error Information Handle: Not Provided
	Number Of Devices: 2

Handle 0x002E, DMI type 17, 40 bytes
Memory Device
	Array Handle: 0x002D
	Error Information Handle: Not Provided
	Total Width: Unknown
	Data Width: Unknown
	Size: No Module Installed
	Form Factor: DIMM
	Set: None
	Locator: P1-DIMMC1
	Bank Locator: P0_Node0_Channel2_Dimm0
	Type: DDR4
	Type Detail: Synchronous
	Speed: Unknown
	Manufacturer: NO DIMM
	Serial Number: NO DIMM
	Asset Tag: NO DIMM
	Part Number: NO DIMM
	Rank: Unknown
	Configured Clock Speed: Unknown
	Minimum Voltage: Unknown
	Maximum Voltage: Unknown
	Configured Voltage: Unknown

Handle 0x002F, DMI type 17, 40 bytes
Memory Device
	Array Handle: 0x002D
	Error Information Handle: Not Provided
	Total Width: Unknown
	Data Width: Unknown
	Size: No Module Installed
	Form Factor: DIMM
	Set: None
	Locator: P1-DIMMD1
	Bank Locator: P0_Node0_Channel3_Dimm0
	Type: DDR4
	Type Detail: Synchronous
	Speed: Unknown
	Manufacturer: NO DIMM
	Serial Number: NO DIMM
	Asset Tag: NO DIMM
	Part Number: NO DIMM
	Rank: Unknown
	Configured Clock Speed: Unknown
	Minimum Voltage: Unknown
	Maximum Voltage: Unknown
	Configured Voltage: Unknown

Handle 0x0030, DMI type 16, 23 bytes
Physical Memory Array
	Location: System Board Or Motherboard
	Use: System Memory
	Error Correction Type: Multi-bit ECC
	Maximum Capacity: 128 GB
	Error Information Handle: Not Provided
	Number Of Devices: 2

Handle 0x0031, DMI type 17, 40 bytes
Memory Device
	Array Handle: 0x0030
	Error Information Handle: Not Provided
	Total Width: 72 bits
	Data Width: 64 bits
	Size: 16384 MB
	Form Factor: DIMM
	Set: None
	Locator: P2-DIMME1
	Bank Locator: P1_Node1_Channel0_Dimm0
	Type: DDR4
	Type Detail: Synchronous
	Speed: 2400 MHz
	Manufacturer: Kingston
	Serial Number: E90C463A
	Asset Tag: P2-DIMME1_AssetTag (date:16/49)
	Part Number: 9965604-008.D00G   
	Rank: 2
	Configured Clock Speed: 2400 MHz
	Minimum Voltage: Unknown
	Maximum Voltage: Unknown
	Configured Voltage: Unknown

Handle 0x0032, DMI type 17, 40 bytes
Memory Device
	Array Handle: 0x0030
	Error Information Handle: Not Provided
	Total Width: 72 bits
	Data Width: 64 bits
	Size: 16384 MB
	Form Factor: DIMM
	Set: None
	Locator: P2-DIMMF1
	Bank Locator: P1_Node1_Channel1_Dimm0
	Type: DDR4
	Type Detail: Synchronous
	Speed: 2400 MHz
	Manufacturer: Kingston
	Serial Number: EE0C393A
	Asset Tag: P2-DIMMF1_AssetTag (date:16/49)
	Part Number: 9965604-008.D00G   
	Rank: 2
	Configured Clock Speed: 2400 MHz
	Minimum Voltage: Unknown
	Maximum Voltage: Unknown
	Configured Voltage: Unknown

Handle 0x0033, DMI type 16, 23 bytes
Physical Memory Array
	Location: System Board Or Motherboard
	Use: System Memory
	Error Correction Type: Multi-bit ECC
	Maximum Capacity: 128 GB
	Error Information Handle: Not Provided
	Number Of Devices: 2

Handle 0x0034, DMI type 17, 40 bytes
Memory Device
	Array Handle: 0x0033
	Error Information Handle: Not Provided
	Total Width: Unknown
	Data Width: Unknown
	Size: No Module Installed
	Form Factor: DIMM
	Set: None
	Locator: P2-DIMMG1
	Bank Locator: P1_Node1_Channel2_Dimm0
	Type: DDR4
	Type Detail: Synchronous
	Speed: Unknown
	Manufacturer: NO DIMM
	Serial Number: NO DIMM
	Asset Tag: NO DIMM
	Part Number: NO DIMM
	Rank: Unknown
	Configured Clock Speed: Unknown
	Minimum Voltage: Unknown
	Maximum Voltage: Unknown
	Configured Voltage: Unknown

Handle 0x0035, DMI type 17, 40 bytes
Memory Device
	Array Handle: 0x0033
	Error Information Handle: Not Provided
	Total Width: Unknown
	Data Width: Unknown
	Size: No Module Installed
	Form Factor: DIMM
	Set: None
	Locator: P2-DIMMH1
	Bank Locator: P1_Node1_Channel3_Dimm0
	Type: DDR4
	Type Detail: Synchronous
	Speed: Unknown
	Manufacturer: NO DIMM
	Serial Number: NO DIMM
	Asset Tag: NO DIMM
	Part Number: NO DIMM
	Rank: Unknown
	Configured Clock Speed: Unknown
	Minimum Voltage: Unknown
	Maximum Voltage: Unknown
	Configured Voltage: Unknown
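
As a rough sketch of the width check described above, the dmidecode output can be piped through a small filter that compares Total Width against Data Width for each populated slot. The function name ecc_width_report and the labels it prints are made up for illustration, and it assumes the Locator line follows the two width lines, as it does in the output above. A total width larger than the data width only shows that the DIMM carries extra ECC bits, not that the memory controller is actually using them.

```shell
#!/bin/sh
# Hypothetical helper: reads `dmidecode -t memory` text on stdin and prints
# one line per populated DIMM slot. Empty slots report "Unknown" widths,
# which become 0 here and are skipped.
ecc_width_report() {
    awk -F': *' '
        /^[ \t]*Total Width:/ { total = $2 + 0 }
        /^[ \t]*Data Width:/  { data  = $2 + 0 }
        /^[ \t]*Locator:/ {
            # "Bank Locator:" does not match because the line starts with "Bank".
            if (data > 0) {
                printf "%s total=%d data=%d %s\n", $2, total, data,
                       (total > data ? "ecc-bits-present" : "no-ecc-bits")
            }
            total = 0; data = 0
        }
    '
}
```

It would be run as root with something like: dmidecode -t memory | ecc_width_report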

You may need to load the EDAC (Error Detection & Correction) kernel modules to see ECC RAM stats.
On my AMD Ryzen & Phenom systems I have to modprobe amd64_edac_mod

You can run find /lib/modules/$(uname -r) -iname '*edac*' to see which EDAC modules you’ve got.

Then dmesg | grep -i edac

Example:

root@deepthought:~# dmesg | grep -i edac
[    0.294103] EDAC MC: Ver: 3.0.0
[   17.751446] EDAC amd64: Node 0: DRAM ECC enabled.
[   17.751447] EDAC amd64: F17h detected (node 0).
[   17.751490] EDAC MC: UMC0 chip selects:
[   17.751491] EDAC amd64: MC: 0:     0MB 1:     0MB
[   17.751492] EDAC amd64: MC: 2: 16383MB 3: 16383MB
[   17.751492] EDAC amd64: MC: 4:     0MB 5:     0MB
[   17.751493] EDAC amd64: MC: 6:     0MB 7:     0MB
[   17.751496] EDAC MC: UMC1 chip selects:
[   17.751496] EDAC amd64: MC: 0:     0MB 1:     0MB
[   17.751497] EDAC amd64: MC: 2: 16383MB 3: 16383MB
[   17.751498] EDAC amd64: MC: 4:     0MB 5:     0MB
[   17.751499] EDAC amd64: MC: 6:     0MB 7:     0MB
[   17.751499] EDAC amd64: using x8 syndromes.
[   17.751500] EDAC amd64: MCT channel count: 2
[   17.751616] EDAC MC0: Giving out device to module amd64_edac controller F17h: DEV 0000:00:18.3 (INTERRUPT)
[   17.751627] EDAC PCI0: Giving out device to module amd64_edac controller EDAC PCI controller: DEV 0000:00:18.0 (POLLED)
[   17.751627] AMD64 EDAC driver v3.5.0
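
To see at a glance which EDAC driver ended up owning each memory controller, the "Giving out device to module" lines can be pulled out of dmesg. This is only a sketch: the function name edac_drivers is made up, and it assumes the message wording matches the example above, which may differ between kernel versions.

```shell
#!/bin/sh
# Hypothetical helper: reads dmesg text on stdin and prints "mcN <module>"
# for every memory controller an EDAC driver registered. The EDAC PCI
# lines are ignored because they do not contain "EDAC MC<number>:".
edac_drivers() {
    sed -n 's/.*EDAC MC\([0-9][0-9]*\): Giving out device to module \([^ ]*\) controller.*/mc\1 \2/p'
}
```

Usage would be: dmesg | edac_drivers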

Thanks for mentioning that. I found the following in dmesg, including messages related to the sb_edac module.

[    0.580152] EDAC MC: Ver: 3.0.0
[    1.645096] ghes_edac: This EDAC driver relies on BIOS to enumerate memory and get error reports.
[    1.645097] ghes_edac: Unfortunately, not all BIOSes reflect the memory layout correctly.
[    1.645097] ghes_edac: So, the end result of using this driver varies from vendor to vendor.
[    1.645098] ghes_edac: If you find incorrect reports, please contact your hardware vendor
[    1.645098] ghes_edac: to correct its BIOS.
[    1.645099] ghes_edac: This system has 8 DIMM sockets.
[    1.645230] EDAC MC0: Giving out device to module ghes_edac.c controller ghes_edac: DEV ghes (INTERRUPT)
[    1.645305] EDAC MC1: Giving out device to module ghes_edac.c controller ghes_edac: DEV ghes (INTERRUPT)
[    7.406799] EDAC sbridge: Seeking for: PCI ID 8086:6fa0
[    7.406819] EDAC sbridge: Seeking for: PCI ID 8086:6fa0
[    7.406839] EDAC sbridge: Seeking for: PCI ID 8086:6fa0
[    7.406859] EDAC sbridge: Seeking for: PCI ID 8086:6f60
[    7.406865] EDAC sbridge: Seeking for: PCI ID 8086:6f60
[    7.406874] EDAC sbridge: Seeking for: PCI ID 8086:6f60
[    7.406880] EDAC sbridge: Seeking for: PCI ID 8086:6fa8
[    7.406886] EDAC sbridge: Seeking for: PCI ID 8086:6fa8
[    7.406894] EDAC sbridge: Seeking for: PCI ID 8086:6fa8
[    7.406902] EDAC sbridge: Seeking for: PCI ID 8086:6f71
[    7.406908] EDAC sbridge: Seeking for: PCI ID 8086:6f71
[    7.406916] EDAC sbridge: Seeking for: PCI ID 8086:6f71
[    7.406922] EDAC sbridge: Seeking for: PCI ID 8086:6faa
[    7.406927] EDAC sbridge: Seeking for: PCI ID 8086:6faa
[    7.406933] EDAC sbridge: Seeking for: PCI ID 8086:6faa
[    7.406938] EDAC sbridge: Seeking for: PCI ID 8086:6fab
[    7.406943] EDAC sbridge: Seeking for: PCI ID 8086:6fab
[    7.406949] EDAC sbridge: Seeking for: PCI ID 8086:6fab
[    7.406954] EDAC sbridge: Seeking for: PCI ID 8086:6fac
[    7.406965] EDAC sbridge: Seeking for: PCI ID 8086:6fad
[    7.406976] EDAC sbridge: Seeking for: PCI ID 8086:6f68
[    7.406981] EDAC sbridge: Seeking for: PCI ID 8086:6f68
[    7.406988] EDAC sbridge: Seeking for: PCI ID 8086:6f68
[    7.406992] EDAC sbridge: Seeking for: PCI ID 8086:6f79
[    7.407009] EDAC sbridge: Seeking for: PCI ID 8086:6f79
[    7.407015] EDAC sbridge: Seeking for: PCI ID 8086:6f79
[    7.407019] EDAC sbridge: Seeking for: PCI ID 8086:6f6a
[    7.407025] EDAC sbridge: Seeking for: PCI ID 8086:6f6a
[    7.407032] EDAC sbridge: Seeking for: PCI ID 8086:6f6a
[    7.407036] EDAC sbridge: Seeking for: PCI ID 8086:6f6b
[    7.407041] EDAC sbridge: Seeking for: PCI ID 8086:6f6b
[    7.407048] EDAC sbridge: Seeking for: PCI ID 8086:6f6b
[    7.407052] EDAC sbridge: Seeking for: PCI ID 8086:6f6c
[    7.407063] EDAC sbridge: Seeking for: PCI ID 8086:6f6d
[    7.407074] EDAC sbridge: Seeking for: PCI ID 8086:6ffc
[    7.407078] EDAC sbridge: Seeking for: PCI ID 8086:6ffc
[    7.407084] EDAC sbridge: Seeking for: PCI ID 8086:6ffc
[    7.407090] EDAC sbridge: Seeking for: PCI ID 8086:6ffd
[    7.407094] EDAC sbridge: Seeking for: PCI ID 8086:6ffd
[    7.407100] EDAC sbridge: Seeking for: PCI ID 8086:6ffd
[    7.407106] EDAC sbridge: Seeking for: PCI ID 8086:6faf
[    7.407110] EDAC sbridge: Seeking for: PCI ID 8086:6faf
[    7.407117] EDAC sbridge: Seeking for: PCI ID 8086:6faf
[    7.407180] EDAC sbridge: Couldn't find mci handler
[    7.407181] EDAC sbridge: Couldn't find mci handler
[    7.407182] EDAC sbridge: Couldn't find mci handler
[    7.407182] EDAC sbridge: Couldn't find mci handler
[    7.407185] EDAC sbridge: Failed to register device with error -22.

edac-util -vs

edac-util: EDAC drivers are loaded. 2 MCs detected:
  mc0:ghes_edac
  mc1:ghes_edac

edac-util -vr

mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: mc#0memory#0: 0 Corrected Errors
mc0: csrow1: 0 Uncorrected Errors
mc0: csrow1: mc#0memory#1: 0 Corrected Errors
mc0: csrow4: 0 Uncorrected Errors
mc0: csrow4: mc#0memory#4: 0 Corrected Errors
mc0: csrow5: 0 Uncorrected Errors
mc0: csrow5: mc#0memory#5: 0 Corrected Errors
mc1: 0 Uncorrected Errors with no DIMM info
mc1: 0 Corrected Errors with no DIMM info
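
The same totals should also be readable straight from sysfs once an EDAC driver is loaded. The sketch below assumes the usual /sys/devices/system/edac/mc/mcN/ layout with ce_count and ue_count files; the function name edac_counts is made up, and the optional directory argument exists only so it can be pointed at a test tree.

```shell
#!/bin/sh
# Hypothetical helper: prints corrected (ce) and uncorrected (ue) error
# totals for each EDAC memory controller found under the given directory.
edac_counts() {
    base="${1:-/sys/devices/system/edac/mc}"
    for mc in "$base"/mc[0-9]*; do
        # Skip silently if nothing matched or the counter is unreadable.
        [ -r "$mc/ce_count" ] || continue
        printf '%s ce=%s ue=%s\n' "${mc##*/}" \
            "$(cat "$mc/ce_count")" "$(cat "$mc/ue_count")"
    done
}
```

Running edac_counts with no argument reads the live counters. Printing nothing means no memory controller is registered there.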

I’ve finished copying back all the data and so far there’s no corruption, but it’s almost a fresh start, so it may take a while to recur. I’m starting to use it as usual again and will follow up if it does.

I found I did not have mcelog installed, so I’ve installed it now; it may prove useful.

From what I can find, chips with a built-in memory controller will throw machine check errors when an error comes up (I’ve seen these before when overclocking my main rig), while EDAC is the older mechanism.

mcelog

Usage:

  mcelog [options]  [mcelogdevice]
Decode machine check error records from current kernel.
..........
--cpu CPU           Set CPU type CPU to decode (see below for valid types)
..........
Valid CPUs: generic p6old core2 k8 p4 dunnington xeon74xx xeon7400 xeon5500 xeon5200 xeon5000 xeon5100 xeon3100 xeon3200 core_i7 core_i5 core_i3 nehalem westmere xeon71xx xeon7100 tulsa intel xeon75xx xeon7500 xeon7200 xeon7100 sandybridge sandybridge-ep ivybridge ivybridge-ep ivybridge-ex haswell haswell-ep haswell-ex broadwell broadwell-d broadwell-ep broadwell-ex knightslanding knightsmill xeon-v4 atom skylake skylake_server kabylake denverton

While they are ES CPUs (0000), they should still be Broadwell-EP, which appears to be supported.