RX 470 stuck in D3: single GPU passthrough

#1

Hi, I’m looking for advice with my VFIO setup. I want to run a headless Ubuntu installation with a Windows 10 VM taking my graphics card.

When trying to boot my VM, I am getting an error message:

QEMU 3.1.0 monitor - type 'help' for more information
(qemu) qemu-system-x86_64: vfio: Unable to power on device, stuck in D3
qemu-system-x86_64: vfio: Unable to power on device, stuck in D3
qemu-system-x86_64: vfio_err_notifier_handler(0000:0a:00.1) Unrecoverable error detected. Please collect any data possible and then kill the guest
qemu-system-x86_64: vfio_err_notifier_handler(0000:0a:00.0) Unrecoverable error detected. Please collect any data possible and then kill the guest
qemu-system-x86_64: vfio_err_notifier_handler(0000:0a:00.1) Unrecoverable error detected. Please collect any data possible and then kill the guest
qemu-system-x86_64: vfio_err_notifier_handler(0000:0a:00.0) Unrecoverable error detected. Please collect any data possible and then kill the guest
qemu-system-x86_64: terminating on signal 2

dmesg gives me this:

[ 42.814189] vfio-pci 0000:0a:00.0: enabling device (0002 -> 0003)
[ 42.814472] vfio_ecap_init: 0000:0a:00.0 hiding ecap [email protected]
[ 42.814478] vfio_ecap_init: 0000:0a:00.0 hiding ecap [email protected]
[ 42.814482] vfio_ecap_init: 0000:0a:00.0 hiding ecap [email protected]
[ 42.816335] vfio-pci 0000:0a:00.1: enabling device (0000 -> 0002)
[ 44.068953] vfio_bar_restore: 0000:0a:00.1 reset recovery - restoring bars
[ 44.085097] vfio_bar_restore: 0000:0a:00.0 reset recovery - restoring bars
[ 44.090600] pcieport 0000:00:03.1: AER: Uncorrected (Non-Fatal) error received: 0000:00:00.0
[ 44.090607] pcieport 0000:00:03.1: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
[ 44.090619] pcieport 0000:00:03.1: device [1022:1453] error status/mask=00200000/04400000
[ 44.090625] pcieport 0000:00:03.1: [21] ACSViol (First)
[ 44.090693] pcieport 0000:00:03.1: AER: Device recovery successful
[ 44.255687] AMD-Vi: Completion-Wait loop timed out
[ 44.388945] AMD-Vi: Completion-Wait loop timed out
[ 45.088249] iommu ivhd0: Event logged [IOTLB_INV_TIMEOUT device=0a:00.0 address=0x40e197ef0]
[ 45.093462] pcieport 0000:00:03.1: AER: Uncorrected (Non-Fatal) error received: 0000:00:00.0
[ 45.093470] pcieport 0000:00:03.1: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
[ 45.093480] pcieport 0000:00:03.1: device [1022:1453] error status/mask=00200000/04400000
[ 45.093486] pcieport 0000:00:03.1: [21] ACSViol (First)
[ 45.093566] pcieport 0000:00:03.1: AER: Device recovery successful
[ 46.090192] iommu ivhd0: Event logged [IOTLB_INV_TIMEOUT device=0a:00.0 address=0x40e197f20]
[ 64.721511] pcieport 0000:00:03.1: AER: Uncorrected (Non-Fatal) error received: 0000:00:00.0
[ 64.721519] pcieport 0000:00:03.1: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
[ 64.721529] pcieport 0000:00:03.1: device [1022:1453] error status/mask=00200000/04400000
[ 64.721535] pcieport 0000:00:03.1: [21] ACSViol (First)
[ 64.721607] pcieport 0000:00:03.1: AER: Device recovery successful
[ 64.878907] AMD-Vi: Completion-Wait loop timed out
[ 65.048436] AMD-Vi: Completion-Wait loop timed out
[ 65.218326] AMD-Vi: Completion-Wait loop timed out
[ 65.387928] AMD-Vi: Completion-Wait loop timed out
[ 65.714486] iommu ivhd0: Event logged [IOTLB_INV_TIMEOUT device=0a:00.0 address=0x40e1962f0]
[ 65.724345] pcieport 0000:00:03.1: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:00:00.0
[ 65.724353] pcieport 0000:00:03.1: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
[ 65.724362] pcieport 0000:00:03.1: device [1022:1453] error status/mask=00200000/04400000
[ 65.724368] pcieport 0000:00:03.1: [21] ACSViol (First)
[ 65.724434] pcieport 0000:00:03.1: AER: Device recovery successful
[ 66.716387] iommu ivhd0: Event logged [IOTLB_INV_TIMEOUT device=0a:00.0 address=0x40e196330]
[ 66.716399] iommu ivhd0: Event logged [IOTLB_INV_TIMEOUT device=0a:00.0 address=0x40e196350]
[ 66.716589] pcieport 0000:00:03.1: AER: Uncorrected (Non-Fatal) error received: 0000:00:00.0
[ 66.716597] pcieport 0000:00:03.1: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
[ 66.716607] pcieport 0000:00:03.1: device [1022:1453] error status/mask=00200000/04400000
[ 66.716614] pcieport 0000:00:03.1: [21] ACSViol (First)
[ 66.716688] pcieport 0000:00:03.1: AER: Device recovery successful
[ 67.718274] iommu ivhd0: Event logged [IOTLB_INV_TIMEOUT device=0a:00.0 address=0x40e1963c0]

It looks like the AMD reset bug to me, but the suspend/rescan trick doesn’t change anything. Nor does vfio-pci.disable_idle_d3=1.
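
For readers who have not seen it, the suspend/rescan trick is usually some variant of the sketch below: remove the card's functions from the PCI bus, suspend the host briefly, then rescan the bus. The device addresses are the ones from the logs above; the 5-second rtcwake suspend, the DRY_RUN switch and the helper names sysfs_write and reset_by_suspend are assumptions for illustration, not the exact commands used here.

```shell
#!/bin/bash
# Sketch of the remove / suspend / rescan workaround (assumed form).
# With DRY_RUN=1 (the default) the steps are only printed, not executed.
DRY_RUN="${DRY_RUN:-1}"
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"   # overridable, e.g. for testing

# Write a value to a sysfs file, or print what would be written.
sysfs_write() {
    local value="$1" path="$2"
    if [ "$DRY_RUN" = "1" ]; then
        echo "would write $value to $path"
    else
        echo "$value" > "$path"
    fi
}

# Usage: reset_by_suspend 0000:0a:00.0 0000:0a:00.1
reset_by_suspend() {
    local dev
    # 1. Detach every function of the card from the PCI bus.
    for dev in "$@"; do
        sysfs_write 1 "$SYSFS_ROOT/bus/pci/devices/$dev/remove"
    done
    # 2. Suspend to RAM and wake up again after 5 seconds (needs root).
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run rtcwake -m mem -s 5"
    else
        rtcwake -m mem -s 5
    fi
    # 3. Ask the kernel to rediscover the removed devices.
    sysfs_write 1 "$SYSFS_ROOT/bus/pci/rescan"
}
```

Running `reset_by_suspend 0000:0a:00.0 0000:0a:00.1` with the default DRY_RUN=1 only lists the steps; with DRY_RUN=0 it has to run as root before the VM is started.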

Is there anything I can do to make this work?

EDIT: should have added my grub config:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash amd_iommu=on iommu=pt vfio-pci.ids=1002:67df,1002:aaf0 nofb video=efifb:off,vesafb:off vfio-pci.disable_idle_d3=1"
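
A quick way to confirm that these options actually reached the running kernel (for instance, that the GRUB config was regenerated after editing) is to compare them against /proc/cmdline. This is only a minimal sketch; missing_params is a hypothetical helper, not an existing tool.

```shell
#!/bin/bash
# Print every wanted parameter that is absent from a kernel command line.
# Usage: missing_params "<cmdline>" param1 param2 ...
missing_params() {
    local cmdline="$1" want
    shift
    for want in "$@"; do
        # Pad with spaces so only whole-word matches count.
        case " $cmdline " in
            *" $want "*) ;;          # present, nothing to report
            *) echo "$want" ;;       # missing
        esac
    done
}

# Example against the running kernel:
# missing_params "$(cat /proc/cmdline)" amd_iommu=on iommu=pt \
#     vfio-pci.ids=1002:67df,1002:aaf0 vfio-pci.disable_idle_d3=1
```

Empty output means every listed option is present on the command line; it does not prove that the kernel or the vfio-pci module honoured it.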

and my start script:

#!/bin/bash
# Start the Windows 10 VM with the GPU (0a:00.0) and its audio function (0a:00.1) passed through.
vmname="windows10vm"
# Refuse to start if a passthrough VM is already running.
if ps -ef | grep qemu-system-x86_64 | grep -q multifunction=on; then
    echo "A passthrough VM is already running."
    exit 1
else
    # Work on a fresh copy of the UEFI variable store.
    cp /usr/share/OVMF/OVMF_VARS.fd /tmp/my_vars.fd
    qemu-system-x86_64 \
        -name windowsvm,process=windowsvm \
        -machine type=q35,accel=kvm \
        -cpu host \
        -smp 8 \
        -m 12G \
        -rtc clock=host,base=localtime \
        -vga none \
        -nographic \
        -serial none \
        -parallel none \
        -usb \
        -device usb-host,vendorid=0x046a,productid=0xc52b \
        -device vfio-pci,host=0a:00.0,multifunction=on \
        -device vfio-pci,host=0a:00.1 \
        -drive if=pflash,format=raw,readonly,file=/usr/share/OVMF/OVMF_CODE.fd \
        -drive if=pflash,format=raw,file=/tmp/my_vars.fd \
        -boot order=dc \
        -drive id=disk0,if=virtio,cache=none,format=raw,file=/home/fishnchips2/win.img \
        -drive file=/home/fishnchips2/Win10.iso,index=1,media=cdrom \
        -drive file=/home/fishnchips2/virtio-win.iso,index=2,media=cdrom \
        -netdev type=tap,id=net0,ifname=vmtap0,vhost=on \
        -device virtio-net-pci,netdev=net0,mac=00:16:3e:00:01:01
    exit 0
fi
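
A common sanity check for any passthrough setup like the one above is to see which devices share an IOMMU group with the GPU, since the whole group is handed to the guest together. The sketch below reads that from sysfs; iommu_group_peers is a hypothetical helper name, and the optional second argument (an alternative sysfs root) exists only to make the function easy to try out.

```shell
#!/bin/bash
# Print the IOMMU group of a PCI device, then every device in that group.
# Usage: iommu_group_peers 0000:0a:00.0 [sysfs_root]
iommu_group_peers() {
    local dev="$1" root="${2:-/sys}" link group
    link="$root/bus/pci/devices/$dev/iommu_group"
    if [ ! -L "$link" ]; then
        echo "no IOMMU group found for $dev" >&2
        return 1
    fi
    # The symlink target ends in the group number.
    group="$(basename "$(readlink "$link")")"
    echo "group $group"
    ls "$root/kernel/iommu_groups/$group/devices"
}
```

Ideally the list contains only the two functions of the card, 0000:0a:00.0 and 0000:0a:00.1.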

#2

Unfortunately, I have some bad news for you. This is a common problem known as the “AMD reset bug”: the card cannot be reset properly, and that stops single GPU passthrough from working. There is no fix for the AMD reset bug yet.

#3

Ok thanks, do you know if progress is being made towards a fix?

#4

As far as I know, only AMD has the firmware source code needed to write a kernel patch for the reset bug, and they have shown no interest in doing so.

#5

Oh, that’s a shame, thanks anyway.