AsRock Rack ROMED8-2T Map Memory Channel from OS to Physical Diagram

swein · February 7, 2024, 4:43pm

I have this motherboard: ROMED8-2T (won’t let me post link to manual)

populated with an Epyc 7302 and 8x32GB Crucial PC3200 RDIMMs.

syslog in Unraid is periodically giving me correctable errors:

Feb  7 09:24:27 Tower kernel: mce: [Hardware Error]: Machine check events logged
Feb  7 09:24:27 Tower kernel: [Hardware Error]: Corrected error, no action required.
Feb  7 09:24:27 Tower kernel: [Hardware Error]: CPU:3 (17:31:0) MC17_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
Feb  7 09:24:27 Tower kernel: [Hardware Error]: Error Addr: 0x0000000016042280
Feb  7 09:24:27 Tower kernel: [Hardware Error]: IPID: 0x0000009600650f00, Syndrome: 0x7b6a00020a800b00
Feb  7 09:24:27 Tower kernel: [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
Feb  7 09:24:27 Tower kernel: EDAC MC0: 1 CE on mc#0csrow#0channel#6 (csrow:0 channel:6 page:0x58108 offset:0xa80 grain:64 syndrome:0x2)
Feb  7 09:24:27 Tower kernel: [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
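To tally which csrow/channel the kernel blames over time, the EDAC lines can be pulled out of the log mechanically. This is a minimal sketch, assuming the line format shown above. `edac_ce_locations` is a hypothetical helper name, not part of any tool, and the regex has only been checked against this one sample line.

```shell
# Hypothetical helper: read kernel log lines on stdin and print
# "csrow=N channel=M" for every EDAC corrected-error (CE) report.
# Lines that are not EDAC CE reports are dropped.
edac_ce_locations() {
  sed -n 's/.*EDAC MC[0-9]*: [0-9]* CE on .*(csrow:\([0-9]*\) channel:\([0-9]*\) .*/csrow=\1 channel=\2/p'
}

# Sample line copied from the syslog excerpt above.
sample='Feb  7 09:24:27 Tower kernel: EDAC MC0: 1 CE on mc#0csrow#0channel#6 (csrow:0 channel:6 page:0x58108 offset:0xa80 grain:64 syndrome:0x2)'
printf '%s\n' "$sample" | edac_ce_locations
```

Piping a whole log through it and then through `sort | uniq -c` would give a per-channel count that can be compared with the sysfs counters.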

root@Tower:/var/log# grep "[0-9]" /sys/devices/system/edac/mc/mc*/csrow*/ch*_ce_count
/sys/devices/system/edac/mc/mc0/csrow0/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ch2_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ch3_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ch4_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ch5_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow0/ch6_ce_count:8
/sys/devices/system/edac/mc/mc0/csrow0/ch7_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow1/ch0_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow1/ch1_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow1/ch2_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow1/ch3_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow1/ch4_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow1/ch5_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow1/ch6_ce_count:0
/sys/devices/system/edac/mc/mc0/csrow1/ch7_ce_count:0

IPMI provides this:

EventID	Timestamp	GenID	Sensor Name	Sensor Number	Sensor Type	Sensor TypeCode	EvtDir Type	Event Data1	Event Data2	Event Data3	Description
0281	Wednesday, February 7th 2024, 9:24:12 am	0021h	BIOS	00h	memory	0ch	6fh	00h	08h	00h	Correctable ECC - Asserted
Event Data1 Correctable ECC
Event Data2 N/A
Event Data3 N/A

Manual topography:

The manual seems to only provide Channels by physical description [A-F][1] = A1, B1, …, F1
Is there any way I can reliably map the OS-reported “Channel 6” to the physical slot the memory is in?

xzpfzxds · February 7, 2024, 9:17pm

I have the same board and I have also wondered about the EDAC to DIMM slot mapping.

So the three sources I know of:

EDAC - you have csrow:0 channel:6
IPMI - not sure what to make of the log you found, but there’s a “ch6(f)” in there - which may mean channel 6.
DMI - this reads the SPD from the DIMMs so gives you a DIMM serial number to slot name (provided by the board OEM) mapping.

But I’m not sure how to map the EDAC channel to the DMI channel or vice-versa.

Assuming they are both in the same order (so EDAC channel 0 = first DMI record), and the EDAC report is accurate (relies on the kernel driver), and the board DMI order is correct, then channel 6 would be DIMM slot G on that board.

Example: All channels on my board:

# dmidecode --type 17 | grep "Bank Locator"
	Bank Locator: P0 CHANNEL A
	Bank Locator: P0 CHANNEL B
	Bank Locator: P0 CHANNEL C
	Bank Locator: P0 CHANNEL D
	Bank Locator: P0 CHANNEL E
	Bank Locator: P0 CHANNEL F
	Bank Locator: P0 CHANNEL G
	Bank Locator: P0 CHANNEL H

Hacky bash script to merge the DMI records with the EDAC channels in the same order:

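# Pair each DMI memory-device record (bank locator + serial number) with the
# EDAC counter at the same position. Assumes both lists are in the same order.
paste \
<(dmidecode --type 17 | grep -E '(Bank Locator|Serial Number)' | paste -sd ' \n') \
<(for f in /sys/devices/system/edac/mc/mc0/csrow0/ch*_ce_count ; do
  echo "ce_count: $(cat "$f")"
done)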

output:

 Bank Locator: P0 CHANNEL A Serial Number: 466Cxxxx ce_count: 0
 Bank Locator: P0 CHANNEL B Serial Number: 466Cxxxx ce_count: 0
 Bank Locator: P0 CHANNEL C Serial Number: 466Cxxxx ce_count: 0
 Bank Locator: P0 CHANNEL D Serial Number: 466Cxxxx ce_count: 0
 Bank Locator: P0 CHANNEL E Serial Number: 466Cxxxx ce_count: 0
 Bank Locator: P0 CHANNEL F Serial Number: 466Cxxxx ce_count: 0
 Bank Locator: P0 CHANNEL G Serial Number: 466Cxxxx ce_count: 0
 Bank Locator: P0 CHANNEL H Serial Number: 466Cxxxx ce_count: 0

Best guess I have - at least those letters are in the ASRockRack manual

swein · February 7, 2024, 10:45pm

In my 15 years of using Linux personally and professionally, I’ve never once used paste. Thanks for the help in troubleshooting. I’m just trying to avoid running memtest86 on each of the 8 memory sticks individually.

swein · February 9, 2024, 1:25am

Just by counting, I would have thought channel 6 would be F, but your script, which uses dmidecode, maps it to G.

I’ll try to replace G first and see if the errors continue, and fall back to trying F second if it still errors.

xzpfzxds · February 9, 2024, 1:26am

EDAC channels start at 0, so channel 6 is the 7th channel in that ordering.
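The off-by-one can be made explicit with a tiny helper. This is only a sketch: `edac_channel_to_letter` is a made-up name, and it assumes the EDAC channel order really does match the A..H order that dmidecode reports, which nobody in this thread has confirmed.

```shell
# Hypothetical helper: map a 0-based EDAC channel index to a slot letter.
# ASSUMPTION: EDAC channel order matches the A..H order from dmidecode.
edac_channel_to_letter() {
  # cut counts characters from 1, so add 1 to the 0-based index.
  printf '%s\n' ABCDEFGH | cut -c "$(( $1 + 1 ))"
}

edac_channel_to_letter 0   # prints A
edac_channel_to_letter 6   # prints G
```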

slidermike · February 9, 2024, 2:31am

I would do a couple of things before going the replace route.

#1 Reseat and clean the two most likely DIMM modules.
They can be surprisingly sensitive to debris and to not being seated well.
#2 While doing #1 above, swap the modules between the two slots and monitor whether the correctable errors return or go away. If they went away, cleaning/reseating probably helped. If the error is still present but has moved to a new location, then it is almost certainly one of those two DIMMs.
Then you can pull all the other modules and run memtest86 on just those two.
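For the "monitor" part of step #2, something like the following could be run before the swap and again a few days after it, to see which channel is accumulating errors. It is a rough sketch: `ce_nonzero` is a hypothetical helper, the default path is the sysfs location used earlier in the thread, and as far as I know these counters start again from zero after a reboot, so note the numbers before powering down.

```shell
# Hypothetical helper: print "<relative counter path> <count>" for every
# EDAC per-channel corrected-error counter that is greater than zero.
# $1: EDAC root directory (defaults to the sysfs path used in this thread).
ce_nonzero() {
  root=${1:-/sys/devices/system/edac/mc}
  for f in "$root"/mc*/csrow*/ch*_ce_count; do
    [ -r "$f" ] || continue          # skip if the glob matched nothing
    n=$(cat "$f")
    [ "$n" -gt 0 ] && printf '%s %s\n' "${f#"$root"/}" "$n"
  done
  return 0
}

# Prints nothing on a machine with no EDAC counters or no corrected errors.
ce_nonzero
```

If the non-zero counter follows the module to its new slot, that points at the DIMM. If it stays on the same channel, the slot or the channel itself becomes the suspect.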

Remember that an occasional correctable error is not necessarily a sign of an issue. The ECC is doing what it is supposed to. Many things can induce an error, including cosmic radiation.
Also electrical crosstalk, etc.
The point is, the occasional reported correctable error is not necessarily a bad sign.
That said, uncorrectable (UC) errors are not good, and repeated errors of either type can be a sign of a chip having issues.

swein · February 9, 2024, 4:08am

Thanks for the advice.

This same chip has thrown 50 or so corrected bits over 6 months, which seems high. I already have a spare RAM stick for it, but doing your troubleshooting steps seems smart.
