Trying to set up KVM with GPU passthrough and failing

Hi.
Been trying for the last few weekends to get GPU passthrough to work but no success.

My setup:
CPU: Ryzen 1400
MB: Asrock fatal1ty AB350 gaming g4
GPU1: Sapphire Nitro 580 4g
GPU2: Asus turbo 1070.
OS: Manjaro

Mainly following an online tutorial, but it looks like i can’t post the link. The GPU is not using the normal driver, it is using vfio.

This MB has 2 PCI-e slots:
upper one is x16, currently contains the AMD card.
lower one is x4, currently contains the nvidia card.

Trying to pass the upper one to the vm (from what i understand, that should be the correct slot since it has its own IOMMU group, even though linux seems to say both GPUs are in separate IOMMU groups with no other devices).
The MB is using the latest BIOS for this CPU (5.80). A weird thing i noticed after updating to the latest BIOS is that with 2 GPUs installed, the 2nd one (x4 slot) displays the BIOS information, not the x16 one as i would have expected (and how it was with the previous 5.40 BIOS).
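
For anyone wanting to double check the IOMMU grouping described above, here is a minimal sketch that walks the standard sysfs tree and prints each group with the PCI addresses in it. The function name and the optional directory argument are made up for illustration; the argument only exists so the function can be pointed at a test tree.

```shell
#!/bin/bash
# List every IOMMU group (in numeric order) and the PCI devices inside it.
# $1: directory to scan (assumption: defaults to the usual sysfs location)
list_iommu_groups() {
    local root="${1:-/sys/kernel/iommu_groups}"
    local n dev
    for n in $(ls "$root" 2>/dev/null | sort -n); do
        echo "IOMMU group $n:"
        for dev in "$root/$n"/devices/*; do
            [ -e "$dev" ] || continue      # skip empty groups
            echo "  ${dev##*/}"            # keep only the PCI address
        done
    done
}
```

Running `list_iommu_groups` with no argument and then `lspci -nns 0a:00.0` on an address of interest shows what the device is. If the directory is empty, the IOMMU is most likely not enabled in the BIOS or on the kernel command line.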

When starting the machine up, dmesg says this:

[ 1303.424279] audit: type=1100 audit(1586008955.236:127): pid=4003 uid=1000 auid=1000 ses=2 subj==unconfined msg=‘op=PAM:authentication grantors=pam_unix acct=“bugsbunny” exe="/usr/bin/sudo" hostname=? addr=? terminal=/dev/pts/0 res=success’
[ 1303.424508] audit: type=1101 audit(1586008955.236:128): pid=4003 uid=1000 auid=1000 ses=2 subj==unconfined msg=‘op=PAM:accounting grantors=pam_unix,pam_permit,pam_time acct=“bugsbunny” exe="/usr/bin/sudo" hostname=? addr=? terminal=/dev/pts/0 res=success’
[ 1303.424859] audit: type=1110 audit(1586008955.236:129): pid=4003 uid=0 auid=1000 ses=2 subj==unconfined msg=‘op=PAM:setcred grantors=pam_unix acct=“root” exe="/usr/bin/sudo" hostname=? addr=? terminal=/dev/pts/0 res=success’
[ 1303.430713] audit: type=1105 audit(1586008955.243:130): pid=4003 uid=0 auid=1000 ses=2 subj==unconfined msg=‘op=PAM:session_open grantors=pam_limits,pam_unix,pam_permit acct=“root” exe="/usr/bin/sudo" hostname=? addr=? terminal=/dev/pts/0 res=success’
[ 1304.070778] pcieport 0000:00:03.1: AER: Uncorrected (Non-Fatal) error received: 0000:00:00.0
[ 1304.070785] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
[ 1304.070788] pcieport 0000:00:03.1: AER: device [1022:1453] error status/mask=00200000/04400000
[ 1304.070792] pcieport 0000:00:03.1: AER: [21] ACSViol (First)
[ 1304.070870] pcieport 0000:00:03.1: AER: Device recovery successful
[ 1304.193408] AMD-Vi: Completion-Wait loop timed out
[ 1304.322797] AMD-Vi: Completion-Wait loop timed out
[ 1304.451120] AMD-Vi: Completion-Wait loop timed out
[ 1304.578720] AMD-Vi: Completion-Wait loop timed out
[ 1304.706641] AMD-Vi: Completion-Wait loop timed out
[ 1304.834088] AMD-Vi: Completion-Wait loop timed out
[ 1304.963111] AMD-Vi: Completion-Wait loop timed out
[ 1305.072410] audit: type=1101 audit(1586008956.883:131): pid=4033 uid=1000 auid=1000 ses=2 subj==unconfined msg=‘op=PAM:accounting grantors=pam_unix,pam_permit,pam_time acct=“bugsbunny” exe="/usr/bin/sudo" hostname=? addr=? terminal=/dev/pts/1 res=success’
[ 1305.072543] audit: type=1110 audit(1586008956.883:132): pid=4033 uid=0 auid=1000 ses=2 subj==unconfined msg=‘op=PAM:setcred grantors=pam_env,pam_fprintd acct=“root” exe="/usr/bin/sudo" hostname=? addr=? terminal=/dev/pts/1 res=success’
[ 1305.073796] pcieport 0000:00:03.1: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:00:00.0
[ 1305.073805] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
[ 1305.073808] pcieport 0000:00:03.1: AER: device [1022:1453] error status/mask=00200000/04400000
[ 1305.073811] pcieport 0000:00:03.1: AER: [21] ACSViol (First)
[ 1305.073900] pcieport 0000:00:03.1: AER: Device recovery successful
[ 1305.076379] audit: type=1105 audit(1586008956.890:133): pid=4033 uid=0 auid=1000 ses=2 subj==unconfined msg=‘op=PAM:session_open grantors=pam_limits,pam_unix,pam_permit acct=“root” exe="/usr/bin/sudo" hostname=? addr=? terminal=/dev/pts/1 res=success’
[ 1305.194551] AMD-Vi: Completion-Wait loop timed out
[ 1305.323174] AMD-Vi: Completion-Wait loop timed out
[ 1305.451263] AMD-Vi: Completion-Wait loop timed out
[ 1305.580915] AMD-Vi: Completion-Wait loop timed out
[ 1305.581451] iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0a:00.0 address=0x6235a21a0]
[ 1305.708645] AMD-Vi: Completion-Wait loop timed out
[ 1305.841362] AMD-Vi: Completion-Wait loop timed out
[ 1305.971683] AMD-Vi: Completion-Wait loop timed out
[ 1306.068199] iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0a:00.0 address=0x6235a21e0]
[ 1306.068203] iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0a:00.0 address=0x6235a2200]
[ 1306.076700] pcieport 0000:00:03.1: AER: Uncorrected (Non-Fatal) error received: 0000:00:00.0
[ 1306.076710] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
[ 1306.076714] pcieport 0000:00:03.1: AER: device [1022:1453] error status/mask=00200000/04400000
[ 1306.076718] pcieport 0000:00:03.1: AER: [21] ACSViol (First)
[ 1306.076810] pcieport 0000:00:03.1: AER: Device recovery successful
[ 1306.195940] AMD-Vi: Completion-Wait loop timed out
[ 1306.323691] AMD-Vi: Completion-Wait loop timed out
[ 1306.416321] vfio-pci 0000:0a:00.0: Refused to change power state, currently in D3
[ 1306.433110] vfio-pci 0000:0a:00.0: Refused to change power state, currently in D3
[ 1306.449649] vfio-pci 0000:0a:00.0: Refused to change power state, currently in D3
[ 1306.451510] AMD-Vi: Completion-Wait loop timed out
[ 1306.579069] AMD-Vi: Completion-Wait loop timed out
[ 1306.706676] AMD-Vi: Completion-Wait loop timed out
[ 1306.834355] AMD-Vi: Completion-Wait loop timed out
[ 1306.962034] AMD-Vi: Completion-Wait loop timed out
[ 1307.070062] iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0a:00.0 address=0x6235a22b0]
[ 1307.079473] pcieport 0000:00:03.1: AER: Uncorrected (Non-Fatal) error received: 0000:00:00.0
[ 1307.079481] pcieport 0000:00:03.1: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
[ 1307.079484] pcieport 0000:00:03.1: AER: device [1022:1453] error status/mask=00200000/04400000
[ 1307.079487] pcieport 0000:00:03.1: AER: [21] ACSViol (First)
[ 1307.192990] vfio-pci 0000:0a:00.0: timed out waiting for pending transaction; performing function level reset anyway
[ 1307.197646] AMD-Vi: Completion-Wait loop timed out
[ 1307.325138] AMD-Vi: Completion-Wait loop timed out
[ 1307.452855] AMD-Vi: Completion-Wait loop timed out
[ 1307.580452] AMD-Vi: Completion-Wait loop timed out
[ 1307.708163] AMD-Vi: Completion-Wait loop timed out
[ 1307.835943] AMD-Vi: Completion-Wait loop timed out
[ 1307.963560] AMD-Vi: Completion-Wait loop timed out
[ 1308.071939] iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0a:00.0 address=0x6235a22e0]
[ 1308.199634] AMD-Vi: Completion-Wait loop timed out
[ 1308.327578] AMD-Vi: Completion-Wait loop timed out
[ 1308.419649] vfio-pci 0000:0a:00.0: not ready 1023ms after FLR; waiting
[ 1308.455595] AMD-Vi: Completion-Wait loop timed out
[ 1308.583425] AMD-Vi: Completion-Wait loop timed out
[ 1308.711005] AMD-Vi: Completion-Wait loop timed out
[ 1308.838619] AMD-Vi: Completion-Wait loop timed out
[ 1308.966189] AMD-Vi: Completion-Wait loop timed out
[ 1309.073819] iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0a:00.0 address=0x6235a23c0]
[ 1309.073821] iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0a:00.0 address=0x6235a23e0]

when running the vm, i get this:

QEMU 4.2.0 monitor - type ‘help’ for more information
(qemu) pulseaudio: Wrong context state
pulseaudio: Reason: Access denied
pulseaudio: Failed to initialize PA context
audio: warning: Using timer based audio emulation
qemu-system-x86_64: vfio_err_notifier_handler(0000:0a:00.1) Unrecoverable error detected. Please collect any data possible and then kill the guest
qemu-system-x86_64: vfio_err_notifier_handler(0000:0a:00.0) Unrecoverable error detected. Please collect any data possible and then kill the guest

Also tried it with LinuxMint 19.2 (since that is what is used in the tutorial) and it behaved pretty much the same, only thing is that after 10 seconds it would freeze the entire PC :P.

Tried switching GPUs (so passing the nvidia one, on the x16 slot) but that resulted in it being stuck in D3. Tried disabling D3 state with a boot flag but it would still give me that error…

Using this script to start it:

#!/bin/bash

vmname="windows10vm"

# Refuse to start if a passthrough VM is already running.
if ps -ef | grep qemu-system-x86_64 | grep -q multifunction=on; then
    echo "A passthrough VM is already running."
    exit 1
else
    # use pulseaudio
    export QEMU_AUDIO_DRV=pa
    export QEMU_PA_SAMPLES=8192
    export QEMU_AUDIO_TIMER_PERIOD=99
    export QEMU_PA_SERVER=/run/user/1000/pulse/native

    # start from a fresh copy of the UEFI variable store
    cp /usr/share/ovmf/x64/OVMF_VARS.fd /tmp/my_vars.fd

    # every option line except the last needs a trailing backslash
    qemu-system-x86_64 \
        -name "$vmname",process="$vmname" \
        -machine type=q35,accel=kvm \
        -cpu host,kvm=off \
        -smp 1,sockets=1,cores=1,threads=1 \
        -m 4G \
        -rtc clock=host,base=localtime \
        -vga none \
        -nographic \
        -serial none \
        -parallel none \
        -soundhw hda \
        -usb \
        -device usb-host,vendorid=0x24ae,productid=0x1001 \
        -device vfio-pci,host=0a:00.0,multifunction=on \
        -device vfio-pci,host=0a:00.1 \
        -drive if=pflash,format=raw,readonly,file=/usr/share/ovmf/x64/OVMF_CODE.fd \
        -drive if=pflash,format=raw,file=/tmp/my_vars.fd \
        -boot order=dc \
        -drive id=disk0,if=virtio,cache=none,format=raw,file=/home/bugsbunny/windows_vm/win.img \
        -drive file=/home/bugsbunny/Downloads/Win10_1909_English_x64.iso,index=1,media=cdrom \
        -drive file=/home/bugsbunny/Downloads/virtio-win-0.1.173.iso,index=2,media=cdrom \
        -netdev type=tap,id=net0,ifname=vmtap0,vhost=on \
        -device virtio-net-pci,netdev=net0,mac=02:68:b3:29:da:98

    exit 0
fi

Does anyone have any ideas on what i could try? Or is my hardware not compatible with it? (heard rx580 is not the best card, but 1070 should have worked… maybe newer BIOS is to blame?)

Thanks!

I spent hours on an nvidia passthrough issue for my windows VM.
I solved my problem by passing a correct romfile= in my -device vfio-pci.

The rom file that works was extracted with TechPowerUp GPU-Z, then patched with a hex editor.

Maybe it can help.

Uriel.

Did you try passing through the one on the X4 slot? With the few boards I performed this on, the top PCI slot is usually defaulted to the primary video adapter. I always had to use the lower slot to pass through a GPU. This was my experience with the AMD 970, 990FX and X370 chipset.

I do believe there are boards that allow you to select the primary video controller so you can check your BIOS to see if there is such an option.

Can you also confirm that the GPU you are trying to pass through is claimed by the vfio driver?

lspci -v sample:

2f:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Hawaii PRO [Radeon R9 290/390] (rev 80) (prog-if 00 [VGA controller])
Subsystem: Gigabyte Technology Co., Ltd Hawaii PRO [Radeon R9 290/390]
Flags: fast devsel, IRQ 47
Memory at d0000000 (64-bit, prefetchable) [size=256M]
Memory at e0000000 (64-bit, prefetchable) [size=8M]
I/O ports at e000 [size=256]
Memory at f7a00000 (32-bit, non-prefetchable) [size=256K]
Expansion ROM at f7a40000 [disabled] [size=128K]
Capabilities:
Kernel driver in use: vfio-pci
Kernel modules: radeon, amdgpu

2f:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Hawaii HDMI Audio [Radeon R9 290/290X / 390/390X]
Subsystem: Gigabyte Technology Co., Ltd Hawaii HDMI Audio [Radeon R9 290/290X / 390/390X]
Flags: fast devsel, IRQ 65
Memory at f7a60000 (64-bit, non-prefetchable) [size=16K]
Capabilities:
Kernel driver in use: vfio-pci
Kernel modules: snd_hda_intel
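
The same "Kernel driver in use" check can be scripted without lspci by reading the driver symlink in sysfs. This is only a sketch: the function name is invented here, and the second argument exists purely so it can be run against a fake directory.

```shell
#!/bin/bash
# Print the kernel driver currently bound to a PCI device, or "none".
# $1: full PCI address, e.g. 0000:0a:00.0
# $2: sysfs device directory (assumption: defaults to the real one)
pci_driver_in_use() {
    local addr="$1"
    local root="${2:-/sys/bus/pci/devices}"
    local link="$root/$addr/driver"
    if [ -L "$link" ]; then
        # the symlink target ends in the driver name, e.g. .../drivers/vfio-pci
        basename "$(readlink "$link")"
    else
        echo "none"
    fi
}
```

For passthrough, both functions of the card (for example `0000:0a:00.0` and `0000:0a:00.1`) should report `vfio-pci` before the VM is started.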

@urielch4: I will look into patching the bios. I managed to get a dump of the rom once after manually setting some bits on the card (cat on the rom file of the pci device) but didn’t know i also had to modify it. Will look into that. Is there any risk of bricking the card, or is the ROM not actually flashed to the device, just used?
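
A rough sketch of the sysfs ROM dump mentioned above, plus a sanity check on the result. A PCI option ROM starts with the bytes 0x55 0xAA (the kernel reports this as 0xaa55), so a dump that starts with 0xff 0xff is not usable. Both function names are hypothetical. As far as I know, `romfile=` only makes QEMU present the file to the guest as the card's option ROM and nothing is written to the card itself.

```shell
#!/bin/bash
# Succeed if the file starts with the PCI option ROM signature (0x55 0xAA).
# $1: path to a dumped ROM file
rom_has_valid_signature() {
    local sig
    sig=$(od -An -tx1 -N2 "$1" | tr -d ' \n')   # first two bytes as hex
    [ "$sig" = "55aa" ]
}

# Dump the ROM of a PCI device through sysfs. Needs root.
# $1: full PCI address, e.g. 0000:0a:00.0   $2: output file
dump_pci_rom() {
    local rom="/sys/bus/pci/devices/$1/rom"
    echo 1 > "$rom"      # enable reads of the ROM
    cat "$rom" > "$2"
    echo 0 > "$rom"      # disable them again
}
```

Something like `dump_pci_rom 0000:0a:00.0 /tmp/gpu.rom && rom_has_valid_signature /tmp/gpu.rom` would then tell whether the dump is worth passing to `romfile=`.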

@Windforce:
With the latest BIOS, the POST display defaults to the x4 card (weird), so i assumed it is now being used as the primary GPU. Also, Level1Techs did a video on this MB called "ASRock AB350 Gaming K4 Motherboard Review + Linux Test" where Wendell said that the 1st slot is isolated and can easily be passed through. He explicitly mentioned running the 1st (x16) slot with a GPU for the VM (since that has its own IOMMU group) and the 2nd (x4) slot for the host. It might not be relevant anymore though because of changes in the BIOS.
Also checked the BIOS, and i didn’t notice anything that lets me select the default GPU. There is an option (called Primary Video Adaptor), but it is greyed out (value is Ext Graphics (PEG)).
Also have the following values there:
PCIe x16/2x8 switch -> 1x16
PCIe x16 switch -> Auto
Promotory PCIe Switch -> Auto
Unused GPP Clocks Off -> Disabled
Above 4GB MMIO Enable -> Disable

Not sure what those mean, but i doubt they are related to this…

lspci -v output:

0a:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] (rev e7) (prog-if 00 [VGA controller])
Subsystem: Sapphire Technology Limited Nitro+ Radeon RX 570/580/590
Flags: fast devsel, IRQ 10
Memory at e0000000 (64-bit, prefetchable) [disabled] [size=256M]
Memory at f0000000 (64-bit, prefetchable) [disabled] [size=2M]
I/O ports at e000 [disabled] [size=256]
Memory at f7900000 (32-bit, non-prefetchable) [disabled] [size=256K]
Expansion ROM at f7940000 [disabled] [size=128K]
Capabilities:
Kernel driver in use: vfio-pci
Kernel modules: amdgpu

0a:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere HDMI Audio [Radeon RX 470/480 / 570/580/590]
Subsystem: Sapphire Technology Limited Ellesmere HDMI Audio [Radeon RX 470/480 / 570/580/590]
Flags: bus master, fast devsel, latency 0, IRQ 5
Memory at f7960000 (64-bit, non-prefetchable) [size=16K]
Capabilities:
Kernel driver in use: vfio-pci
Kernel modules: snd_hda_intel

Nice that you can use the X16 slot for pass through. I do recall getting the “AMD-Vi: Completion-Wait loop timed out” errors on the FX boards but can’t recall what I did back then.

I am aware of the reset bug with the 580 but the 1070 should work, curious to see the outcome.

So you think the best shot i have is to go back to trying to pass the 1070 on the x16 slot, and try to figure out the D3 issue?
Btw, tried using the x16 slot (rx580) for host and the x4 slot (gtx1070) for VM, but when i set the x4 GPU to use VFIO (update /etc/modprobe.d/local.conf and /etc/mkinitcpio.conf), it cuts the display on both cards… the x16 still outputs a signal, but it is just a black screen. X4 card outputs nothing (as expected). Double checked it 2 times to make sure i wasn’t using the wrong IDs…
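
For reference, the modprobe side of that setup is normally a single `options` line listing the vendor:device IDs vfio-pci should claim. The helper below is only an illustrative sketch (its name is made up) that builds the line from a list of IDs, which makes it harder to mistype the comma-separated list.

```shell
#!/bin/bash
# Build the modprobe option line that tells vfio-pci which IDs to claim.
# Arguments: one or more vendor:device IDs as printed by `lspci -nn`
vfio_ids_option() {
    local ids
    ids=$(printf '%s,' "$@")               # join with commas, trailing comma included
    echo "options vfio-pci ids=${ids%,}"   # strip the trailing comma
}
```

With the IDs that appear in the dmesg output later in the thread, `vfio_ids_option 10de:1b81 10de:10f0` gives a line that would go in /etc/modprobe.d/local.conf. The initramfs has to be regenerated afterwards (on Arch-based systems, `mkinitcpio -P`) for the change to take effect at boot.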

After a break, i came back to try it again :)
Currently, trying to pass the nvidia 1070 to the windows vm. It is located in the 1st slot (x16).
Using the 2nd slot (x4) as my main linux gpu (rx 580).

Using an asrock MB with BIOS version 5.70, which comes with AMD AGESA 0.0.7.2. There is 1 more BIOS update that i can install that pushes AGESA to Combo-AM4 1.0.0.1. Any newer AGESA version is not recommended since i have a gen1 ryzen CPU (1400).

This time, i followed the arch tutorial for this, basically step by step (PCI_passthrough_via_OVMF).

While this time i am not stuck in D3 states (at least from what i can tell), i still don’t see my GPU inside of windows. Windows does boot up just fine though.

dmesg logs:

sudo dmesg | grep vfio
[ 0.745488] vfio-pci 0000:1f:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
[ 0.762781] vfio_pci: add [10de:1b81[ffffffff:ffffffff]] class 0x000000/00000000
[ 0.779470] vfio_pci: add [10de:10f0[ffffffff:ffffffff]] class 0x000000/00000000
[ 89.503706] vfio-pci 0000:1f:00.0: enabling device (0000 -> 0003)
[ 89.504003] vfio-pci 0000:1f:00.0: vfio_ecap_init: hiding ecap [email protected]
[ 90.733520] vfio-pci 0000:1f:00.1: vfio_bar_restore: reset recovery - restoring BARs
[ 90.746944] vfio-pci 0000:1f:00.0: vfio_bar_restore: reset recovery - restoring BARs
[ 92.608459] vfio-pci 0000:1f:00.0: vfio_bar_restore: reset recovery - restoring BARs
[ 92.613254] vfio-pci 0000:1f:00.1: vfio_bar_restore: reset recovery - restoring BARs
[ 92.633363] vfio-pci 0000:1f:00.0: vfio_bar_restore: reset recovery - restoring BARs
[ 92.637301] vfio-pci 0000:1f:00.1: vfio_bar_restore: reset recovery - restoring BARs
[ 92.652110] vfio-pci 0000:1f:00.0: vfio_bar_restore: reset recovery - restoring BARs
[ 92.654731] vfio-pci 0000:1f:00.1: vfio_bar_restore: reset recovery - restoring BARs
[ 92.720408] vfio-pci 0000:1f:00.0: vfio_bar_restore: reset recovery - restoring BARs
[ 92.730682] vfio-pci 0000:1f:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[ 92.730893] vfio-pci 0000:1f:00.0: vfio_bar_restore: reset recovery - restoring BARs
[ 92.769787] vfio-pci 0000:1f:00.0: vfio_bar_restore: reset recovery - restoring BARs
[ 92.770025] vfio-pci 0000:1f:00.0: vfio_bar_restore: reset recovery - restoring BARs
[ 92.770063] vfio-pci 0000:1f:00.0: vfio_bar_restore: reset recovery - restoring BARs
[ 92.770220] vfio-pci 0000:1f:00.0: vfio_bar_restore: reset recovery - restoring BARs
[ 92.771121] vfio-pci 0000:1f:00.1: vfio_bar_restore: reset recovery - restoring BARs
[ 92.771300] vfio-pci 0000:1f:00.1: vfio_bar_restore: reset recovery - restoring BARs
[ 92.771338] vfio-pci 0000:1f:00.1: vfio_bar_restore: reset recovery - restoring BARs
[ 92.771501] vfio-pci 0000:1f:00.1: vfio_bar_restore: reset recovery - restoring BARs
[ 92.831288] vfio-pci 0000:1f:00.0: vfio_bar_restore: reset recovery - restoring BARs
[ 92.832601] vfio-pci 0000:1f:00.1: vfio_bar_restore: reset recovery - restoring BARs
[ 92.872575] vfio-pci 0000:1f:00.0: vfio_bar_restore: reset recovery - restoring BARs
[ 92.873209] vfio-pci 0000:1f:00.0: vfio_bar_restore: reset recovery - restoring BARs
[ 92.874504] vfio-pci 0000:1f:00.1: vfio_bar_restore: reset recovery - restoring BARs
[ 92.874757] vfio-pci 0000:1f:00.1: vfio_bar_restore: reset recovery - restoring BARs
[ 92.931010] vfio-pci 0000:1f:00.0: vfio_bar_restore: reset recovery - restoring BARs
[ 92.935116] vfio-pci 0000:1f:00.1: vfio_bar_restore: reset recovery - restoring BARs

Also, after shutting down the VM and starting it up again, i get
“Error starting domain: internal error: Unknown PCI header type ‘127’ for device ‘0000:1f:00.0’”. Restarting the entire PC allows me to boot into it again (i see this is a relatively common issue, and i don’t think it is directly related to my main problem right now… dmesg does complain about being stuck in D3 in this scenario…)
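
Some background on the "header type 127" message, with a small sketch to check it by hand. The header type sits at offset 0x0e of PCI config space, and the top bit of that byte is the multi-function flag, so only the low 7 bits are the type. A card that has stopped responding typically reads back as all ones, and 0xff masked to 7 bits is 127. The function name below is made up for illustration.

```shell
#!/bin/bash
# Print the PCI header type (config space offset 0x0e, low 7 bits) in decimal.
# $1: config space file, e.g. /sys/bus/pci/devices/0000:1f:00.0/config
pci_header_type() {
    local byte
    byte=$(od -An -tu1 -j14 -N1 "$1" | tr -d ' \n')   # read one byte at offset 14
    [ -n "$byte" ] || return 1                        # file too short or unreadable
    echo $(( byte & 127 ))                            # drop the multi-function bit
}
```

A healthy GPU should print 0 here. If it prints 127 after the VM shuts down, the card is no longer answering config reads, which fits the D3 complaints in dmesg.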

Any ideas on what might be the issue?

Another update:
The ASRock support team helped me to downgrade my bios to 5.40 and this fixed my issue.
Sadly the MB went back to preferring to use the x16 slot as the main one, so i need to pass the x4 slot to the VM (if i use vfio on the x16 slot, both GPUs output a black screen) but at least it works now! :)
My 580 is affected by the reset bug (after using the VM once, i need to set my PC to sleep for 1-2 seconds). Does anyone know of a better workaround? I can only find stuff for newer GPUs (XT)
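
The sleep workaround described above can at least be scripted so it runs after the VM exits. This is only a sketch of the same workaround, not a better fix: it assumes `rtcwake` from util-linux is installed and that it runs as root. The function name and the `DRY_RUN` variable are invented for this example, and by default it only prints the command instead of suspending.

```shell
#!/bin/bash
# Suspend to RAM for a few seconds and wake up again via the RTC alarm.
# $1: seconds to stay suspended (default 3)
# DRY_RUN=1 (the default here) prints the command instead of running it.
brief_suspend() {
    local seconds="${1:-3}"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "rtcwake -m mem -s ${seconds}"
    else
        rtcwake -m mem -s "${seconds}"
    fi
}
```

Calling `DRY_RUN=0 brief_suspend 2` at the end of the VM start script would put the host to sleep for about 2 seconds once QEMU exits.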

I had issues with using my x16 slot also. I had no option in BIOS to specify which slot to boot from, it just assumes the x16 slot. I managed to resolve it by adding the kernel parameter video=efifb:off to my grub config and telling X11 to use the host GPU (x4 slot) like so…

In /etc/X11/xorg.conf.d/10-radeon.conf

Section "Device"
	Identifier "Radeon GPU"
	Driver "radeon"
	BusID "PCI:08:00"
EndSection
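
After editing the grub config it is easy to forget to regenerate grub.cfg, so it helps to confirm that the parameter actually reached the running kernel. A small sketch, with a made-up function name and an optional file argument that is only there for testing:

```shell
#!/bin/bash
# Succeed if the given parameter appears on the kernel command line.
# $1: parameter to look for, e.g. video=efifb:off
# $2: command line file (assumption: defaults to /proc/cmdline)
has_kernel_param() {
    local param="$1"
    local file="${2:-/proc/cmdline}"
    local word
    for word in $(cat "$file"); do      # split the command line on whitespace
        [ "$word" = "$param" ] && return 0
    done
    return 1
}
```

Usage would be something like `has_kernel_param video=efifb:off && echo "efifb is disabled"`. The parameter itself usually goes into GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, followed by `grub-mkconfig -o /boot/grub/grub.cfg` and a reboot.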

As for the reset bug… I wasn’t aware that Polaris cards were affected. I wish AMD would get their shit together. 3 architectures and this problem still exists!!!

Thanks, i will try it later.
Regarding the reset, not sure if it is the same reset bug, but it does behave similarly…
I don’t get an ‘Unknown PCI header type ‘127’ for device…’ error. The VM does start up. Windows then freezes for 1 minute while loading, goes to BSOD, reboots, and starts up just fine, but without the GPU working. I can see it in device manager, but it says that it has a problem…
Rebooting PC (or going to sleep) fixes the issue… Also no weird logs on dmesg even when it fails to start properly… weird that one.