Vega 64 VFIO Reset Issue Threadripper

Yeah, having the same problem. I’ve tried the 4.15, 4.17 and 4.18 kernels with no luck. On the second boot of the card, it crashes the host. In dmesg I get a lot of “AMD-Vi completion-wait loop timed out” messages. I really want to solve this problem.
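For anyone comparing notes, a quick way to check for the same symptom is to count those timeout lines in the kernel log. This is only a minimal sketch: the function name is made up for illustration, and the match is case-insensitive because I'm not certain of the exact capitalisation the kernel prints.

```shell
# count_iommu_timeouts: hypothetical helper. Reads kernel log text on stdin
# and prints how many lines mention the AMD-Vi completion-wait timeout.
count_iommu_timeouts() {
    # -i ignores case, -c prints the number of matching lines.
    # grep exits non-zero when nothing matches, so "|| true" keeps the
    # function from failing a script that runs with "set -e".
    grep -ic 'completion-wait loop timed out' || true
}

# Usage (reading the kernel log may need root on some distros):
#   dmesg | count_iommu_timeouts
```

A non-zero count after the second VM start would suggest you are seeing the same failure described here.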

Setup: Ryzen 2700X, MSI X470 Gaming Pro, 16 GB DDR4 Corsair 3000 MHz, Host: Nvidia Quadro K4000, Guest: AMD Vega 64.

People that are not having this problem please help a brother out.

Am I asking something wrong? Do I need to provide more information?

I signed up just to say: I’m having the same issue. I can passthrough the first time but if the host reboots or I try and start it again, it never comes up. If I don’t suspend to RAM first, I get a 127 pci error BUT if I suspend to RAM first, it just sits there doing nothing - no errors and no VNC output.

I’ve tried everything at this point, even going up to 4.19. I don’t know whether that reset patch made it into the final kernel release, though.

Seems we are both on X399; I wonder if that has something to do with this? Which AIB is yours? Mine is an Asus Vega 64 Strix.

I cannot believe this, but I just found a workaround. (Ignore sudo if you are already root.)

Step 1:

Shutdown the VM.

Step 2:

Run these two commands (edit for your PCI topology!):

echo "1" | sudo tee -a /sys/bus/pci/devices/0000\:0a\:00.0/remove   # GPU
echo "1" | sudo tee -a /sys/bus/pci/devices/0000\:0a\:00.1/remove   # HDMI audio

Step 3 (I’m not sure this is needed):

Suspend to RAM

Step 4:

Run this command:

echo "1" | sudo tee -a /sys/bus/pci/rescan

Your VM will start normally if you have the same kind of bug I do. I’ve spent a solid 20 hours on this, can’t believe I didn’t try device removal. Rescan shouldn’t work with a device with no known reset method BUT it does, weird. My kernel is stock 4.15 Ubuntu, no patches or anything.
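The steps above can be wrapped into one script so they are not typed by hand each time. Treat this as a rough sketch rather than a tested fix. The PCI addresses in the usage line are the ones from this post and will differ on your machine. SYSFS_PCI and the function names are made up for illustration. Using rtcwake for a timed suspend is my assumption; the post only says "suspend to RAM", so swap in whatever suspend method your host uses.

```shell
# Sketch of the remove -> suspend -> rescan workaround.
# SYSFS_PCI is a hypothetical override so the functions can be pointed at a
# scratch directory for a dry run; it defaults to the real sysfs path.
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci}"

# Detach one PCI function from the kernel by writing 1 to its "remove" file.
pci_remove() {
    echo 1 > "$SYSFS_PCI/devices/$1/remove"
}

# Ask the kernel to re-enumerate the bus so the removed functions reappear.
pci_rescan() {
    echo 1 > "$SYSFS_PCI/rescan"
}

# Suspend to RAM and wake again after 15 seconds via the RTC alarm.
# Assumption: the board supports RTC wake; otherwise replace this body
# with your usual suspend command and wake the machine manually.
do_suspend() {
    rtcwake -m mem -s 15
}

# Full sequence. Run as root, with the VM already shut down.
# $1 = GPU function, $2 = HDMI audio function.
vega_reset_workaround() {
    pci_remove "$1" || return 1
    pci_remove "$2" || return 1
    do_suspend || return 1
    pci_rescan
}

# Usage:
#   vega_reset_workaround 0000:0a:00.0 0000:0a:00.1
```

The suspend sits between remove and rescan because that is the order reported to work here; I have not tried any other ordering.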

I tried without the suspend-to-RAM step, and it doesn’t work without it. I really hope this works for other people.

I’ll give this a shot tomorrow. I swapped in my RX580 for now and it works like a dream.

I came very close to buying an RX 580, but I just wanted the extra performance too much; couldn’t help myself. It’s a bit annoying having to suspend, but better than restarting and losing my desktop state. Hopefully AMD’s future cards don’t suffer from this issue; the RX 560+ cards certainly seem to show that the design has been revised in a way that fixes it.

Good luck with it all!

I’ve also managed to get the VM back up after a suspend (Ubuntu 4.15-38). Still pretty annoying. I usually only need the VM 10% of the time, so turning it off and back on is what I want. Every other kernel that is easily available still has the problem. Hope it can be fixed soon.

Well this fixed my inability to even get passthrough to work as my video card was always left in some invalid state despite all my efforts to prevent initialization.

Which RX 580 card was working? I am looking for a card that doesn’t have the reset bug.

As far as I understand it, the RX 570 and RX 580 don’t have the reset bug at all. The hardware revisions have corrected the problem.

Hardware or software? I have read that some cards work and others do not. However that could have been with different chipsets, GPUs.

Now I’m not confident. I’ve read several times on Reddit that all the RX 570 and RX 580 cards are immune to the hardware reset bug. My understanding is that some RX 550s are affected but not the higher-end models. Further, from what I’ve read, the Sapphire cards in general are the least likely to be affected by the bug.

Couldn’t hurt to try asking in https://www.reddit.com/r/VFIO/ and see which card vendors are good to go and if there is a kernel version in particular that works well.

Some people claim newer kernels fix the Vega bug for example but not for me.

Sorry I couldn’t give you a 100% definitive answer but I don’t have an RX 580 and I’d hate to give someone incorrect advice.

No problems, which is why I am searching for the answer. From everything I have read it works for some and not others. There doesn’t appear to be a pattern.

With my Asus Strix Vega 64 I was able to get it to pass through semi-reliably by disabling UEFI boot in the host OS (Linux 4.18.17-unRAID x86_64), dumping and loading the vbios, and not passing through the audio device on the card. It’s bone stock, no modded vbios etc. However, the drivers cause it to crash non-stop in Windows, making the guest unbootable after first setup and requiring a reboot of the host before the card can be passed through again. It works great with a Fedora guest, with no crashes or detectable performance degradation. Didn’t bother to run PTS before and after, though.

So for the Windows VM I’m using a 970 I “borrowed” from my little brother’s PC while he’s grounded lol

Here is the info I gathered about the Vega. It may not be accurate.

There is no working reset option, and the Vega will not survive a bus reset once it has been initialized. If CSM is disabled, if EFI boot is enabled for VGA, or if the Vega is the boot VGA, the card is initialized by the BIOS and will not survive a bus reset.
Basically, the Vega does not support reset at the bus level and must be reset via a PSP mode1 reset. The code exists in the amdgpu driver, but it would need to be ported to QEMU as a quirk, if that is even possible.
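To see where your own card stands, sysfs shows whether the kernel believes it has any reset method for the device and whether the firmware picked it as the boot VGA. This is a small sketch with made-up function names, and SYSFS_PCI is a hypothetical override for dry runs. Note that a "yes" only means the kernel has some reset method available. Going by the description above, that does not mean the reset actually recovers a Vega.

```shell
# SYSFS_PCI is a hypothetical override; it defaults to the real sysfs path.
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci}"

# Prints "yes" if the kernel exposes a "reset" file for the device, else "no".
pci_has_reset() {
    if [ -e "$SYSFS_PCI/devices/$1/reset" ]; then
        echo yes
    else
        echo no
    fi
}

# Prints 1 if this device is the boot VGA, 0 if it is not, or "unknown"
# when the attribute is absent (it only exists for VGA-class devices).
pci_is_boot_vga() {
    _f="$SYSFS_PCI/devices/$1/boot_vga"
    if [ -r "$_f" ]; then
        cat "$_f"
    else
        echo unknown
    fi
}

# Usage (address is an example, substitute your own):
#   pci_has_reset 0000:0a:00.0
#   pci_is_boot_vga 0000:0a:00.0
```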

Workarounds:
https://gist.github.com/numinit/1bbabff521e0451e5470d740e0eb82fd this is the patch that prevents the bus reset, so the Vega can be booted and Windows can be rebooted. It also works for pure EFI boot. It lets me pass through the Vega in the first slot, wired to NUMA1, as disabling CSM changes the boot order to the lowest-port VGA card.

“Linux Host, Windows Guest, GPU passthrough reinitialization fix”: this describes how to set up Windows so the VM can be shut down and started up. It simply forces Windows to shut down and start up the GPU, doing the reset for you. But it no longer works in the newest Win10.

If the host crashes or the DevCon no longer works you need to do the whole suspend to RAM to reset the GPU or do a reboot.

These two fixes are what I am currently using. LTSC and LTSB can use the DevCon fix perfectly if you’re willing to go that route. Otherwise you can get around the issue by putting your system to sleep before starting the VM.

None of this is needed with the vfio_pci option disable_idle_d3=Y, but this disables idle power saving, so when the VM is not running the card uses about 50W+. If I start the VM, it goes back to 5W idle.
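For anyone who wants to try that option, here is a rough sketch of how to check the current value and what the modprobe.d line looks like. The function names and the SYSFS_MODULE override are made up for illustration. The config file name in the usage note is just an example. If vfio-pci is loaded from your initramfs, I'd assume the initramfs needs regenerating before the option takes effect.

```shell
# SYSFS_MODULE is a hypothetical override; it defaults to the real sysfs path.
SYSFS_MODULE="${SYSFS_MODULE:-/sys/module}"

# Prints the current value of vfio-pci's disable_idle_d3 parameter (Y or N),
# or "not loaded" if the module's parameter file is missing.
vfio_idle_d3_state() {
    _p="$SYSFS_MODULE/vfio_pci/parameters/disable_idle_d3"
    if [ -r "$_p" ]; then
        cat "$_p"
    else
        echo "not loaded"
    fi
}

# Prints the modprobe.d line that sets the option when the module loads.
vfio_idle_d3_conf_line() {
    echo "options vfio-pci disable_idle_d3=1"
}

# Usage (as root; the file name vfio.conf is an example):
#   vfio_idle_d3_conf_line >> /etc/modprobe.d/vfio.conf
#   vfio_idle_d3_state
```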

Could “vfio_pci option disable_idle_d3=Y” be toggled before starting the VM and after VM shutdown to avoid the high power when not in use?

Of course not. The idle mode is what kills the ability to use the card. You must run the VM constantly.