The manual seems to only describe channels by their physical labels: [A-F][1] = A1, B1, …, F1.
Is there any way I can reliably map the OS-reported “Channel 6” to the physical slot the memory is in?
I have the same board and I have wondered about the EDAC to DIMM slot mapping too.
So the three sources I know of:
EDAC - you have csrow:0 channel:6
IPMI - not sure what to make of the log you found, but there’s a “ch6(f)” in there - which may mean channel 6.
DMI - this reads the SPD from the DIMMs so gives you a DIMM serial number to slot name (provided by the board OEM) mapping.
But I’m not sure how to map the EDAC channel to the DMI channel or vice-versa.
Assuming they are both in the same order (so EDAC channel 0 = first DMI record), and the EDAC report is accurate (relies on the kernel driver), and the board DMI order is correct, then channel 6 would be DIMM slot G on that board.
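The index-to-letter arithmetic behind that guess can be sketched in shell as below. It relies on exactly the assumptions stated above (EDAC channels are zero-based and in the same order as the DMI records, A through H), and channel_to_slot is a made-up helper name for illustration, not part of any tool.

```shell
# Hypothetical helper: map a zero-based EDAC channel index to a slot letter.
# ASSUMPTION: EDAC channel 0 is the first DMI record (CHANNEL A), and so on.
channel_to_slot() {
    case "$1" in
        [0-7]) ;;   # a single digit 0-7 is the only accepted input
        *) echo "channel index out of range: $1" >&2; return 1 ;;
    esac
    # cut counts characters from 1, so shift the zero-based index by one.
    printf '%s\n' "ABCDEFGH" | cut -c "$(( $1 + 1 ))"
}

channel_to_slot 6   # prints G
```

If the board's DMI order turns out to differ from the EDAC order, the letter string is the only thing that needs changing.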
Example: All channels on my board:
# dmidecode --type 17 | grep "Bank Locator"
Bank Locator: P0 CHANNEL A
Bank Locator: P0 CHANNEL B
Bank Locator: P0 CHANNEL C
Bank Locator: P0 CHANNEL D
Bank Locator: P0 CHANNEL E
Bank Locator: P0 CHANNEL F
Bank Locator: P0 CHANNEL G
Bank Locator: P0 CHANNEL H
Hacky bash script to merge the DMI records with the EDAC channels in the same order:
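# Left column: one line per DIMM with its Bank Locator and Serial Number.
# Right column: the EDAC corrected-error count per channel, in glob order.
paste \
  <(dmidecode --type 17 | grep -E '(Bank Locator|Serial Number)' | paste -sd ' \n') \
  <(for f in /sys/devices/system/edac/mc/mc0/csrow0/ch*_ce_count; do
      echo "ce_count: $(cat "$f")"
    done)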
output:
Bank Locator: P0 CHANNEL A Serial Number: 466Cxxxx ce_count: 0
Bank Locator: P0 CHANNEL B Serial Number: 466Cxxxx ce_count: 0
Bank Locator: P0 CHANNEL C Serial Number: 466Cxxxx ce_count: 0
Bank Locator: P0 CHANNEL D Serial Number: 466Cxxxx ce_count: 0
Bank Locator: P0 CHANNEL E Serial Number: 466Cxxxx ce_count: 0
Bank Locator: P0 CHANNEL F Serial Number: 466Cxxxx ce_count: 0
Bank Locator: P0 CHANNEL G Serial Number: 466Cxxxx ce_count: 0
Bank Locator: P0 CHANNEL H Serial Number: 466Cxxxx ce_count: 0
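Since the merge above depends on line order alone, it may help to print the EDAC channel number next to each count so the ordering is explicit. The following is a minimal sketch, not a tested tool: list_ce_counts is a hypothetical helper, it takes the csrow directory as an optional argument, and it defaults to the mc0/csrow0 path used in the script above. It sorts numerically, so a ch10 would not land before ch2 the way it does in plain glob order.

```shell
# Hypothetical helper: print "channel N ce_count: X" for each channel counter
# in one csrow directory, ordered numerically by channel index.
list_ce_counts() {
    dir=${1:-/sys/devices/system/edac/mc/mc0/csrow0}
    for f in "$dir"/ch*_ce_count; do
        [ -e "$f" ] || continue      # the glob matched nothing
        name=${f##*/}                # e.g. ch6_ce_count
        idx=${name#ch}               # e.g. 6_ce_count
        idx=${idx%_ce_count}         # e.g. 6
        printf '%s %s\n' "$idx" "$(cat "$f")"
    done | sort -n | while read -r idx count; do
        printf 'channel %s ce_count: %s\n' "$idx" "$count"
    done
}
```

Calling list_ce_counts with no argument reads the default path; on a machine without that directory it simply prints nothing.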
Best guess I have - at least those letters are in the ASRockRack manual
In the 15 years I’ve used Linux personally and professionally, I’ve never once used paste. Thanks for the help in troubleshooting. I’m just trying to avoid running memtest86 on 8 individual memory sticks by themselves.
I would do a couple of things before going the replace route.
#1 Reseat and clean the two most likely DIMM modules. They can be surprisingly sensitive to debris and to not being seated well.
#2 While doing #1 above, swap the modules in the two slots and monitor whether the ec (corrected error) returns or goes away. If it went away, cleaning/reseating probably helped. If the error is still present but has moved to a new location, then it is pretty sure to be one of those two DIMMs.
Then you can just pull all the other modules and run memtest86 on two modules.
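For the monitoring part of the steps above, one possible approach is to save the per-channel counters to a file and compare two such snapshots later. This is only a sketch: snapshot_ce and changed_channels are hypothetical helper names, and the default path is the mc0/csrow0 directory from earlier in the thread. As far as I know the EDAC counters start again from zero at boot, so two snapshots are only comparable within a single uptime.

```shell
# Hypothetical helper: print "<counter file name> <count>" for each channel.
snapshot_ce() {
    dir=${1:-/sys/devices/system/edac/mc/mc0/csrow0}
    for f in "$dir"/ch*_ce_count; do
        [ -e "$f" ] || continue      # the glob matched nothing
        printf '%s %s\n' "${f##*/}" "$(cat "$f")"
    done
}

# Hypothetical helper: given an old and a new snapshot file, print every
# counter whose value differs, as "<name>: <old> -> <new>".
changed_channels() {
    awk 'FILENAME == ARGV[1] { old[$1] = $2; next }
         ($1 in old) && old[$1] != $2 { print $1 ": " old[$1] " -> " $2 }' \
        "$1" "$2"
}
```

A typical use would be snapshot_ce > before.txt after booting with the swapped modules, snapshot_ce > after.txt some days later, then changed_channels before.txt after.txt to see which channel, if any, picked up new errors.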
Remember that an occasional ec is not necessarily a sign of an issue. The chip is doing what it is supposed to. Many things can induce an error, including cosmic radiation and electrical crosstalk.
The point is, the occasional reported ec is not necessarily a bad sign.
That said, UC errors are not good and repeated errors of either type can be a sign of a chip having issues.
This same chip has thrown 50 or so corrected bits over 6 months, which seems high. I already have a spare RAM stick for it, but doing your troubleshooting steps seems smart.