EPYC workstation kernel panics on Linux

I recently put together an EPYC workstation and I've been having some weird issues.
I have an ASRock Rack EPYCD8-2T (latest 2.60 BIOS), which I have paired with a 7451 and 128GB of Micron RDIMM (4 channel).
Every attempt to install or run Linux on this system has failed.
The system begins the boot process, then errors out and boot loops.
I heard this motherboard was picky about USB keys, so I tried 5 different keys.
I have also tried Ubuntu and Fedora, both desktop and server variants.
I took the NVMe drive out, installed Ubuntu onto it in a 1st-gen Ryzen desktop, and got the same behavior once the drive was back in the EPYC system.
When I return the NVMe drive to the Ryzen system, it boots fine with no signs of data corruption.
Checking dmesg, I do see these messages about ECC:

[ 5.309263] kernel: EDAC amd64: Node 0: DRAM ECC disabled.
[ 5.309264] kernel: EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
(Note that use of the override may cause unknown side effects.)
[ 5.309788] kernel: kvm: disabled by bios
[ 5.421029] kernel: EDAC amd64: Node 0: DRAM ECC disabled.
[ 5.421031] kernel: EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
(Note that use of the override may cause unknown side effects.)

I'm not sure whether these messages came from the EPYC system, or whether they were logged after I returned the NVMe drive to the Ryzen system, which does not have ECC.
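
One way to tell which machine produced a given kernel log is to look for the "DMI:" line that the kernel prints early in boot, since it names the board and BIOS version. Below is a minimal Python sketch of that idea. The function name `find_dmi_line` and the sample log are made up for illustration and are not output from either of the systems described here.

```python
import re
from typing import Optional

# Matches the "DMI:" line printed early in boot, with or without a
# "[ 0.000000] " timestamp or "kernel: " prefix in front of it.
DMI_RE = re.compile(r"\bDMI:\s*(.+)$")


def find_dmi_line(log_text: str) -> Optional[str]:
    """Return the hardware string from the first DMI line, or None if absent."""
    for line in log_text.splitlines():
        match = DMI_RE.search(line)
        if match:
            return match.group(1).strip()
    return None


# Hypothetical sample log, NOT taken from the systems in this thread.
SAMPLE_LOG = """\
[    0.000000] kernel: Linux version (hypothetical)
[    0.000000] kernel: DMI: ExampleVendor ExampleBoard/ExampleBoard, BIOS 1.00 01/01/2020
[    5.309263] kernel: EDAC amd64: Node 0: DRAM ECC disabled.
"""

print(find_dmi_line(SAMPLE_LOG))
```

To use it on a real log, save the kernel messages from the boot in question to a file (for example the output of `journalctl -k -b -1` for the previous boot, assuming the journal is persistent on that install) and pass the file contents to `find_dmi_line`. If the board named there is the Ryzen desktop's, the ECC messages were logged on that machine and not on the EPYC.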

However, an SSD with Windows 10 on it boots perfectly and runs without any issues.
Cinebench R15 and R20 and Blender all run perfectly and produce some great results.
The Windows Memory Diagnostic shows no issues with the memory.
It seems like it should be a BIOS configuration issue, or possibly something related to an early microcode setting that I have seen mentioned in some posts.
But I haven't been able to pinpoint the issue.
Has anyone run into this type of error?
Thanks for your help.

I think I may have solved my issue.
It was the NVMe drive all along! Or at least this NVMe drive with this motherboard.
With the NVMe drive still in the Ryzen system, the EPYC booted my Fedora Server key.
I tried the NVMe drive in both slots with the same result.
I then reseated the CPU, thinking it was a mounting pressure issue, with no improvement.
Finally, I pulled a different NVMe drive out of a different Linux PC, and the EPYC booted fine with that drive.
However, if both drives are installed, Linux will crash even when I'm booting from the working one.
I'm not sure why this didn't affect Windows.
Maybe the NVMe drive was screaming across the PCIe bus this whole time and Windows just ignored it…
This issue could have something to do with speed.
The problematic drive was an XPG GAMMIX S11 Pro, which is quite a bit faster than the working Crucial P1.
Either way, this NVMe drive works fine in other systems, just not here.
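
If anyone wants to check for the same thing, a rough way to test the "screaming across the PCIe bus" guess is to filter a saved kernel log for lines that mention NVMe or PCIe error reporting (AER). Here is a small Python sketch. The keyword list is my own assumption about what is worth looking at, and the sample lines are hypothetical placeholders apart from the `kvm` and `EDAC` lines quoted earlier in the thread.

```python
import re
from typing import List

# Keywords assumed to be relevant: the nvme driver name, the PCIe Advanced
# Error Reporting tag "AER", and the phrase "PCIe Bus Error".
SUSPECT_RE = re.compile(r"\bnvme|\bAER\b|PCIe Bus Error", re.IGNORECASE)


def suspect_lines(log_text: str) -> List[str]:
    """Return the log lines that mention NVMe or PCIe error reporting."""
    return [line for line in log_text.splitlines() if SUSPECT_RE.search(line)]


# Hypothetical sample log: the nvme and AER lines are placeholders,
# not real messages from the system in this thread.
SAMPLE_LOG = """\
[    5.309788] kernel: kvm: disabled by bios
[   12.000000] kernel: nvme nvme0: hypothetical controller message
[   12.100000] kernel: pcieport 0000:00:01.1: AER: hypothetical error report
[   13.000000] kernel: EDAC amd64: Node 0: DRAM ECC disabled.
"""

for line in suspect_lines(SAMPLE_LOG):
    print(line)
```

Comparing the filtered output from a boot with only the working drive against a boot with both drives installed would show whether the problematic drive is generating errors on the bus, assuming the system stays up long enough to write the log.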
