No POST on an ASRock X399 TaiChi after attempting to launch a libvirt VM with PCI passthrough

Long-time lurker, first-time poster. I recently built an X399 TaiChi based system with the hope of running a PCI passthrough Windows VM on it. I just got the board back from RMA and seem to have hit the same problem again. I wanted to post here and see if there is some way to recover it short of yet another RMA, since I have no idea what has caused it.

The system configuration is two Corsair Dominator Platinum 3200 64G kits, a Threadripper 2950X processor, a Samsung 970 EVO 1TB NVMe M.2 SSD in the M2_1 slot, a Crucial P1 1TB NVMe M.2 SSD in the M2_2 slot, an MSI Armor RX 590 in PCIE1, an XFX RX 570 in PCIE4, and an Inateck USB 3.0 host controller in PCIE5. Following directions on the Arch wiki, I had set up a boot script to bind the USB controller, the RX 570, and the Crucial drive to the vfio-pci driver for passthrough.
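For anyone checking a similar setup, here is a minimal sketch (not the boot script from the Arch wiki) that lists each IOMMU group and the driver currently bound to every device in it, so you can confirm the passthrough devices really ended up on vfio-pci. It assumes the standard Linux sysfs layout under /sys/kernel/iommu_groups. The function name list_iommu_groups and its sysfs_root parameter are made up for this example.

```python
from pathlib import Path


def list_iommu_groups(sysfs_root="/sys/kernel/iommu_groups"):
    """Return {group_number: [(pci_address, driver_or_None), ...]}.

    sysfs_root is a hypothetical parameter so the function can be
    pointed at a test directory instead of the real sysfs tree.
    """
    groups = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        # No such directory: IOMMU is disabled or this is not a Linux host.
        return groups
    for group_dir in root.iterdir():
        if not group_dir.name.isdigit():
            continue
        devices = []
        devices_dir = group_dir / "devices"
        if devices_dir.is_dir():
            for dev in sorted(devices_dir.iterdir()):
                # "driver" is a symlink to the bound driver; it is absent
                # when no driver has claimed the device.
                driver_link = dev / "driver"
                driver = driver_link.resolve().name if driver_link.exists() else None
                devices.append((dev.name, driver))
        groups[int(group_dir.name)] = devices
    return groups


if __name__ == "__main__":
    for group, devices in sorted(list_iommu_groups().items()):
        for address, driver in devices:
            print(f"group {group:3d}  {address}  driver={driver or '(none)'}")
```

Devices meant for the guest should show driver=vfio-pci, and anything sharing an IOMMU group with them generally has to be passed through (or left unbound) as well.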

I’m running Arch Linux, synced last night to the latest packages. I was running kernel 5.2.14.arch2-1 and was not having any issues. I tried to create a Windows VM using virt-manager and the operation failed with a BAR size error. I removed the GPU PCI devices from the VM and tried again, and libvirt hung. I attempted to restart the libvirtd process, and that did not respond either, but everything else worked. I decided to reboot and verify there were no configuration issues in the BIOS (I had already enabled IOMMU and SR-IOV, set DRAM timings from the Ryzen Calculator, etc). The system shut down normally but then did not POST after the kernel halted. The power button was also non-responsive, so I switched off the PSU.

At this point this may sound like a Linux problem, but what happens next is why I posted in the hardware thread.

Powering the system back on, the ASRock gear lights up, but the DRAM LEDs do not, and the system will not boot from the case or motherboard power button. I can load a BIOS image onto a USB drive and use the FlashBack option, which does work (the light on the board and the light on the FlashBack button both blink, as does the access light on my USB drive), but the system still will not POST. A similar thing happened when I first assembled the system; I went through an RMA with ASRock and just got the new motherboard yesterday. The telltale sign is that when power is applied to the system, the DRAM LEDs do not light up. Some earlier digging through various forum threads led me to a firmware issue where the firmware sometimes gets “stuck”, but I don’t see any way to correct this problem. I do not know how I’m causing this fault, or if there is any way to fix it, but figured I would reach out for help here before going down the RMA path again.

I’m mostly concerned that since I don’t know what’s causing the fault, I don’t know what needs to be done to prevent it. Lots of posts here and elsewhere show people running VMs on Linux without issue. Any pointers or suggestions would be appreciated!

-Carl

After leaving the system powered off all night with the CMOS battery removed, the RAM is initializing again (the DRAM LEDs light up when power is applied) and I can get it to work for one or two boots before it returns to the same bad state. I believe the issue at this point is the RAM. The Corsair Dominator Platinum memory I purchased (CMD64GX4M4C3200C16) isn’t on the QVL. To test this theory, I purchased a cheap G.SKILL FORTIS kit that is on the QVL (F4-2400C16D-16GFT) to see if the problem persists. If that kit works, then I’ll find a way to sell off the Dominator Platinum RAM (return windows have passed while I was waiting for RMAs) and replace it with the Corsair VENGEANCE 128G kit (CMK128GX4M8B3000C16) that is on the QVL. There is a specific version number on the QVL, and the part has a low rating on Newegg - but if this G.SKILL RAM works then it’s probably my best shot at having 128G installed with this motherboard. Otherwise I’ll probably wait for the next generation Threadripper boards to come out and try again. I’m going to be traveling for work the next two weeks, but I’ll post an update as soon as I am able to get the G.SKILL kit and test again.

Have you tried underclocking the ram?

Bring it down to 2933 with 16 or 18 main timings and see if that helps.

I’m also assuming you have the latest BIOS? As best I can tell it’s the best yet as far as RAM stability goes. Even my 1950X is stable with 8 sticks of 8GB ECC @3000 14/15/15/16.

Maybe you could try to set the soc voltage to 1.2V.

Since the RAM isn’t on the QVL, it’s defaulting to 2133 base timings. Stability is even worse at those values; I’m lucky to get two boots before the system falls back to the hung no-POST state. A full system power drain for several hours (with the CMOS battery removed) will recover it again for a short time.

I’m running the latest available BIOS, 3.60 - but I see that they just dropped a 3.80 a few days ago so I’ll give that a shot when I get home to that machine this weekend.

After watching the recent JayzTwoCents videos, I had the same thought. While Tcpu has been in the 40s, Tctl has been in the upper 70s, and the air moving through my radiator is much warmer than I expected for a system at idle.

Got a chance between work travel to try new components and BIOS combinations. The G.SKILL kit from the QVL didn’t make any difference, and neither did the 3.80 BIOS. I opened another support request with ASRock, but haven’t heard back yet. Since I’m out of town on business travel next week, I figure that’s when they will respond. At this point, I’m considering just holding out until the new Threadripper parts ship next month, to see if the 2950X will work with the new motherboards and try that. Any other thoughts or suggestions?

Final follow-up: as it turns out the RAM was not the problem. The CPU was. Got that back from the RMA process the other day and I’m writing this post from the original board with the Corsair Dominator RAM without issue. Everything seems to be working, and I just sent back the MSI board for a refund since that was not the issue. Sad it took this long to get here, but I’m happy everything is working as I had originally hoped. Now I just need to plan out the replacement for the ticking time bomb that is my Enermax Liqtech TR4 II cooler (probably a custom EK loop next year).