Vega 64 VFIO Reset Issue Threadripper

So I’m trying to pass through a Vega 64 GPU to a Windows VM on my Threadripper system. I was able to do this without any major issues about a month ago with a 1080. Now, on the latest kernel, it seems to work fine, but if I reboot the VM the card never recovers. I saw there was a patch posted in the forum, but it won’t apply to the newer kernel. Any help would be appreciated. Here’s my system info:

[timb@mako-arch ~]$ uname -r
4.18.16-arch1-1-ARCH

[timb@mako-arch ~]$ lspci -nnk | grep VGA
0b:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XT [Radeon RX Vega 64] [1002:687f] (rev c1)
42:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X] [1002:67df] (rev e7)

[timb@mako-arch ~]$ sudo sh iommu.sh
[sudo] password for timb: 
IOMMU Group 0 00:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge [1022:1452]
IOMMU Group 10 00:08.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B [1022:1454]
IOMMU Group 11 00:14.0 SMBus [0c05]: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller [1022:790b] (rev 59)
IOMMU Group 11 00:14.3 ISA bridge [0601]: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge [1022:790e] (rev 51)
IOMMU Group 12 00:18.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 0 [1022:1460]
IOMMU Group 12 00:18.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 1 [1022:1461]
IOMMU Group 12 00:18.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 2 [1022:1462]
IOMMU Group 12 00:18.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 3 [1022:1463]
IOMMU Group 12 00:18.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 4 [1022:1464]
IOMMU Group 12 00:18.5 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 5 [1022:1465]
IOMMU Group 12 00:18.6 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 6 [1022:1466]
IOMMU Group 12 00:18.7 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 7 [1022:1467]
IOMMU Group 13 00:19.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 0 [1022:1460]
IOMMU Group 13 00:19.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 1 [1022:1461]
IOMMU Group 13 00:19.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 2 [1022:1462]
IOMMU Group 13 00:19.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 3 [1022:1463]
IOMMU Group 13 00:19.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 4 [1022:1464]
IOMMU Group 13 00:19.5 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 5 [1022:1465]
IOMMU Group 13 00:19.6 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 6 [1022:1466]
IOMMU Group 13 00:19.7 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Data Fabric: Device 18h; Function 7 [1022:1467]
IOMMU Group 14 01:00.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] X399 Series Chipset USB 3.1 xHCI Controller [1022:43ba] (rev 02)
IOMMU Group 14 01:00.1 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] X399 Series Chipset SATA Controller [1022:43b6] (rev 02)
IOMMU Group 14 01:00.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] X399 Series Chipset PCIe Bridge [1022:43b1] (rev 02)
IOMMU Group 14 02:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port [1022:43b4] (rev 02)
IOMMU Group 14 02:04.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port [1022:43b4] (rev 02)
IOMMU Group 14 02:05.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port [1022:43b4] (rev 02)
IOMMU Group 14 02:06.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port [1022:43b4] (rev 02)
IOMMU Group 14 02:07.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] 300 Series Chipset PCIe Port [1022:43b4] (rev 02)
IOMMU Group 14 04:00.0 Ethernet controller [0200]: Intel Corporation I211 Gigabit Network Connection [8086:1539] (rev 03)
IOMMU Group 14 05:00.0 Network controller [0280]: Intel Corporation Dual Band Wireless-AC 3168NGW [Stone Peak] [8086:24fb] (rev 10)
IOMMU Group 14 06:00.0 Ethernet controller [0200]: Intel Corporation I211 Gigabit Network Connection [8086:1539] (rev 03)
IOMMU Group 15 08:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981 [144d:a808]
IOMMU Group 16 09:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:1470] (rev c1)
IOMMU Group 17 0a:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Device [1022:1471]
IOMMU Group 18 0b:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XT [Radeon RX Vega 64] [1002:687f] (rev c1)
IOMMU Group 19 0b:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Device [1002:aaf8]
IOMMU Group 1 00:01.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [1022:1453]
IOMMU Group 20 0c:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Device [1022:145a]
IOMMU Group 21 0c:00.2 Encryption controller [1080]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Platform Security Processor [1022:1456]
IOMMU Group 22 0c:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) USB 3.0 Host Controller [1022:145c]
IOMMU Group 23 0d:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Device [1022:1455]
IOMMU Group 24 0d:00.2 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 51)
IOMMU Group 25 0d:00.3 Audio device [0403]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) HD Audio Controller [1022:1457]
IOMMU Group 26 40:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge [1022:1452]
IOMMU Group 27 40:01.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [1022:1453]
IOMMU Group 28 40:02.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge [1022:1452]
IOMMU Group 29 40:03.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge [1022:1452]
IOMMU Group 2 00:01.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [1022:1453]
IOMMU Group 30 40:03.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [1022:1453]
IOMMU Group 31 40:04.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge [1022:1452]
IOMMU Group 32 40:07.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge [1022:1452]
IOMMU Group 33 40:07.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B [1022:1454]
IOMMU Group 34 40:08.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge [1022:1452]
IOMMU Group 35 40:08.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B [1022:1454]
IOMMU Group 36 41:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981 [144d:a808]
IOMMU Group 37 42:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X] [1002:67df] (rev e7)
IOMMU Group 37 42:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 580] [1002:aaf0]
IOMMU Group 38 43:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Device [1022:145a]
IOMMU Group 39 43:00.2 Encryption controller [1080]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Platform Security Processor [1022:1456]
IOMMU Group 3 00:02.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge [1022:1452]
IOMMU Group 40 43:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) USB 3.0 Host Controller [1022:145c]
IOMMU Group 41 44:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Device [1022:1455]
IOMMU Group 42 44:00.2 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 51)
IOMMU Group 4 00:03.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge [1022:1452]
IOMMU Group 5 00:03.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe GPP Bridge [1022:1453]
IOMMU Group 6 00:04.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge [1022:1452]
IOMMU Group 7 00:07.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge [1022:1452]
IOMMU Group 8 00:07.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) Internal PCIe GPP Bridge 0 to Bus B [1022:1454]
IOMMU Group 9 00:08.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) PCIe Dummy Host Bridge [1022:1452]
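
The iommu.sh script itself was not posted. As a minimal sketch of how a listing in this format could be produced (an assumption, not the original script), the loop below walks /sys/kernel/iommu_groups and sorts by group number, so group 1 no longer lands after group 19. IOMMU_ROOT and describe_device are names made up here for illustration.

```shell
#!/bin/sh
# Hypothetical reconstruction; the original iommu.sh was not posted.
# IOMMU_ROOT can be overridden for a dry run; the default is the real sysfs path.
IOMMU_ROOT="${IOMMU_ROOT:-/sys/kernel/iommu_groups}"

# Describe one PCI device by address; falls back to the bare address
# when lspci is not installed.
describe_device() {
    if command -v lspci >/dev/null 2>&1; then
        lspci -nns "$1"
    else
        echo "$1"
    fi
}

# Print one "IOMMU Group N <device>" line per device, ordered by group number.
list_iommu_groups() {
    for dev in "$IOMMU_ROOT"/*/devices/*; do
        [ -e "$dev" ] || continue          # glob matched nothing
        group=${dev%/devices/*}            # strip the device part of the path
        group=${group##*/}                 # keep only the group number
        echo "IOMMU Group $group $(describe_device "${dev##*/}")"
    done | sort -k3,3n                     # third field is the group number
}

list_iommu_groups
```

A device that sits alone in its group (like the Vega in group 18 above) is the easy case for passthrough; devices sharing a group generally have to be passed through together.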

Final Edit: Figured out how to use code tags on this forum software yay!

I still haven’t been able to get Vega to reset properly either. I’m running Ubuntu 18.04 + QEMU 3.0.0 + kernel 4.18.16 (also tried 4.19-rc6) on a 1950X system.

Seen a few people say it’s working now, though. Maybe someone that has it working can share their settings so we’d be able to narrow down why it isn’t working for us.

Yeah, having the same problem. I’ve tried the 4.15, 4.17 and 4.18 kernels with no luck. On the second boot of the card it crashes the host. In dmesg I get a lot of “AMD-Vi completion-wait loop timed out” messages. I really want to solve this problem.

Setup: Ryzen 2700X, MSI X470 Gaming Pro, 16 GB DDR4 Corsair 3000 MHz. Host: Nvidia Quadro K4000. Guest: AMD Vega 64.

People that are not having this problem please help a brother out.

Am I asking something wrong? Do I need to provide more information?

I signed up just to say: I’m having the same issue. I can pass through the first time, but if the host reboots or I try to start the VM again, it never comes up. If I don’t suspend to RAM first, I get a 127 PCI error, BUT if I suspend to RAM first, it just sits there doing nothing: no errors and no VNC output.

I’ve tried everything at this point, even going up to 4.19. I don’t know if that reset patch made it into the final kernel release, though?

Seems we are both on X399. I wonder if that has something to do with this? Which AIB is yours? Mine is an Asus Vega 64 Strix.

I cannot believe this, but I just found a workaround that works for me. (Ignore sudo if you are already root.)

Step 1:

Shut down the VM.

Step 2:

Run these two commands (edit for your PCI topology!):

echo "1" | sudo tee -a /sys/bus/pci/devices/0000\:0a\:00.0/remove   # GPU
echo "1" | sudo tee -a /sys/bus/pci/devices/0000\:0a\:00.1/remove   # HDMI audio

Step 3 (I’m not sure this is needed):

Suspend to RAM

Step 4:

Run this command:

echo "1" | sudo tee -a /sys/bus/pci/rescan

Your VM will start normally if you have the same kind of bug I do. I’ve spent a solid 20 hours on this, can’t believe I didn’t try device removal. Rescan shouldn’t work with a device with no known reset method BUT it does, weird. My kernel is stock 4.15 Ubuntu, no patches or anything.
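
For anyone who wants to script the steps above, here is a rough sketch, not something from the original post. It assumes it is run as root with the VM already shut down, and it suspends by writing mem to /sys/power/state, because that write only returns after resume, so the rescan cannot run too early. PCI_SYSFS, POWER_STATE and the function names are made up for illustration.

```shell
#!/bin/sh
# Sketch of the remove / suspend / rescan workaround; run as root
# while the VM is shut down.
# PCI_SYSFS and POWER_STATE are overridable so the logic can be dry-run.
PCI_SYSFS="${PCI_SYSFS:-/sys/bus/pci}"
POWER_STATE="${POWER_STATE:-/sys/power/state}"

# Detach every listed PCI function from the host.
remove_devices() {
    for addr in "$@"; do
        echo 1 > "$PCI_SYSFS/devices/$addr/remove" || return 1
    done
}

# Suspend to RAM; on real sysfs this write blocks until the machine resumes.
suspend_to_ram() {
    echo mem > "$POWER_STATE"
}

# Ask the kernel to re-enumerate the bus, which re-adds the removed devices.
rescan_bus() {
    echo 1 > "$PCI_SYSFS/rescan"
}

# Full sequence: remove, suspend to RAM, rescan after resume.
reset_via_suspend() {
    remove_devices "$@" || return 1
    suspend_to_ram || return 1
    rescan_bus
}

# Example (addresses from the post above; edit for your PCI topology):
# reset_via_suspend 0000:0a:00.0 0000:0a:00.1
```

The machine still has to be woken manually after the suspend step; the rescan runs once it resumes.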

I tried without the suspend-to-RAM step; it doesn’t work without that. I really hope this works for other people.

I’ll give this a shot tomorrow. I swapped in my RX580 for now and it works like a dream.

I came very close to buying an RX 580, but I wanted the extra performance too much and couldn’t help myself. It’s a bit annoying having to suspend, but better than restarting and losing my desktop state. Hopefully AMD’s future cards don’t suffer from this issue; the RX 560+ cards certainly seem to show the hardware has been revised in a way that fixes it.

Good luck with it all!

I’ve also managed to get the VM back up after a suspend (Ubuntu 4.15-38). Still pretty annoying. I usually only need the VM 10% of the time, so turning it off and back on is what I want. Every other kernel that is easily available has the same problem. Hope it can be fixed soon.

Well this fixed my inability to even get passthrough to work as my video card was always left in some invalid state despite all my efforts to prevent initialization.

Which RX 580 card was working? I am looking for a card that doesn’t have the reset bug.

As far as I understand it, the RX 570 and RX 580 don’t have the reset bug at all. The hardware revisions have corrected the problem.

Hardware or software? I have read that some cards work and others do not. However that could have been with different chipsets, GPUs.

Now I’m not confident. I’ve read several times on Reddit that all the RX 570 and RX 580 cards are immune to the hardware reset bug. My understanding is that some RX 550s are affected but not the higher-end models. Further, from what I’ve read, the Sapphire cards in general are the least likely to be affected by the bug.

Couldn’t hurt to try asking in https://www.reddit.com/r/VFIO/ and see which card vendors are good to go and if there is a kernel version in particular that works well.

Some people claim newer kernels fix the Vega bug for example but not for me.

Sorry I couldn’t give you a 100% definitive answer but I don’t have an RX 580 and I’d hate to give someone incorrect advice.

No problems, which is why I am searching for the answer. From everything I have read it works for some and not others. There doesn’t appear to be a pattern.

With my Asus Strix Vega 64 I was able to get it to pass through semi-reliably by disabling UEFI boot in the host OS (Linux 4.18.17-unRAID x86_64), dumping and loading the vbios, and not passing through the audio device on the card. The card is bone stock, with no modded vbios etc. However, the drivers make it crash non-stop in Windows, leaving the guest unbootable after first setup and requiring a reboot of the host before the card can be passed through again. It works great with a Fedora guest, with no crashes or detectable performance degradation. I didn’t bother to run PTS before and after, though.
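
The post doesn’t say how the vbios was dumped or loaded. One common approach, shown here only as a sketch under that assumption, is to read the ROM through the device’s sysfs rom file and hand it to QEMU with the vfio-pci romfile property, passing only the VGA function. The PCI address and file path in the usage comments are placeholders, and PCI_DEVICES, dump_vbios and vfio_gpu_arg are illustrative names.

```shell
#!/bin/sh
# PCI_DEVICES is overridable for dry runs; the default is the real sysfs path.
PCI_DEVICES="${PCI_DEVICES:-/sys/bus/pci/devices}"

# Dump the option ROM of one PCI device to a file (run as root).
# usage: dump_vbios 0000:0b:00.0 /path/to/vega64.rom
dump_vbios() {
    rom="$PCI_DEVICES/$1/rom"
    [ -e "$rom" ] || { echo "no rom file for $1" >&2; return 1; }
    echo 1 > "$rom" || return 1     # enable reads of the ROM
    cat "$rom" > "$2"; status=$?
    echo 0 > "$rom"                 # disable reads again
    return "$status"
}

# Print a QEMU argument that passes only the VGA function (no audio
# function) and loads the dumped ROM.
# usage: vfio_gpu_arg 0b:00.0 /path/to/vega64.rom
vfio_gpu_arg() {
    echo "-device vfio-pci,host=$1,romfile=$2"
}
```

The sysfs read does not succeed on every setup (the boot GPU in particular can be a problem), so check that the dumped file is non-empty before pointing the VM at it.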

So for the Windows VM I’m using a 970 I “borrowed” from my little brother’s PC while he’s grounded lol

Here is the info I gathered about the Vega. It may not be accurate.

There is no working reset option, and the Vega will not survive a bus reset once it has been initialized. If CSM is disabled, EFI boot is enabled for VGA, or the Vega is the boot VGA, the card is initialized by the BIOS and will not survive a bus reset.
Basically, the Vega does not support reset at the bus level and must instead be reset via a PSP mode1 reset. The code exists in the amdgpu driver, but it would need to be ported to QEMU as a quirk, if that is even possible.

Workarounds:
https://gist.github.com/numinit/1bbabff521e0451e5470d740e0eb82fd is the patch that prevents the bus reset, so the Vega can be booted and Windows can be rebooted. It also works for pure EFI boot. It lets me pass through the Vega in the first slot, wired to NUMA1, since disabling CSM changes the boot order to the lowest-port VGA card.

“Linux Host, Windows Guest, GPU passthrough reinitialization fix”: this describes how to set up Windows so the VM can be shut down and started up. It simply forces Windows to shut down and start up the GPU, doing the reset for you. But it no longer works in the newest Windows 10.

If the host crashes or DevCon no longer works, you need to either do the whole suspend-to-RAM routine to reset the GPU or reboot.

These two fixes are what I am currently using. LTSC and LTSB can use the DevCon fix perfectly if you’re willing to go that route. Otherwise you can get around the issue by putting your system to sleep before starting the VM.

None of this is needed with the vfio_pci option disable_idle_d3=Y, but this disables idle power saving, so when the VM is not running the card uses about 50 W or more. When I start the VM, it goes back to 5 W idle.
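
In case it helps, here is a minimal sketch of one way to set that option persistently through modprobe.d. The file name and the helper names are assumptions made for illustration; a separate file is used so that existing vfio-pci options (such as ids=) are not overwritten.

```shell
#!/bin/sh
# MODPROBE_DIR is overridable for a dry run; the default is the usual location.
MODPROBE_DIR="${MODPROBE_DIR:-/etc/modprobe.d}"

# Write the module option to its own (hypothetically named) file. Run as root.
write_vfio_d3_option() {
    printf 'options vfio-pci disable_idle_d3=1\n' \
        > "$MODPROBE_DIR/vfio-disable-idle-d3.conf"
}

# Print the value the loaded module is actually using, if it is loaded.
show_vfio_d3_option() {
    param="${VFIO_PARAM:-/sys/module/vfio_pci/parameters/disable_idle_d3}"
    if [ -r "$param" ]; then
        cat "$param"
    else
        echo "vfio_pci not loaded"
    fi
}

# Example:
# write_vfio_d3_option && show_vfio_d3_option
```

The option only takes effect the next time vfio-pci is loaded, and if the module is loaded from the initramfs, the initramfs probably needs to be regenerated as well.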