Ryzen/Vega passthrough issues

Hey all, I recently swapped from Intel (a nice old HP Z820) to a Ryzen 3900X, basically porting all the core hardware across. However, I can’t get my passthrough working on Windows, even though it was flawless before on the Intel chip.

So, I’m running:

I get the following error from QEMU on startup, but not every time (?):

Failed to mmap 0000:0d:00.0 BAR 0. Performance may be slow

Windows just hangs on boot (I can see that via VNC), and nothing appears on the screen. I have no idea how to debug this, so I installed Ubuntu in a VM and added the GPU. It doesn’t crash, but it doesn’t use the GPU either. The card is listed in lspci in the VM. In dmesg in that VM, I get the following:

[ 2.704245] amdgpu 0000:01:00.0: remove_conflicting_pci_framebuffers: bar 0: 0xd0000000 -> 0xdfffffff
[ 2.704246] amdgpu 0000:01:00.0: remove_conflicting_pci_framebuffers: bar 2: 0xe0000000 -> 0xe01fffff
[ 2.704246] amdgpu 0000:01:00.0: remove_conflicting_pci_framebuffers: bar 5: 0xfe800000 -> 0xfe87ffff

then… (https://gist.github.com/ukd1/055d33eb72a4ae78be0a4051018d1d9e#file-vm-dmesg-txt-L828)

[ 7.777788] [drm:atom_op_jump [amdgpu]] ERROR atombios stuck in loop for more than 5secs aborting
[ 7.777826] [drm:amdgpu_atom_execute_table_locked [amdgpu]] ERROR atombios stuck executing DA12 (len 279, WS 16, PS 4) @ 0xDB16
[ 7.777860] [drm:amdgpu_atom_execute_table_locked [amdgpu]] ERROR atombios stuck executing AA02 (len 228, WS 8, PS 4) @ 0xAAD6
[ 7.777894] [drm:amdgpu_atom_execute_table_locked [amdgpu]] ERROR atombios stuck executing 9DE4 (len 368, WS 0, PS 8) @ 0x9E5A
[ 7.777896] amdgpu 0000:01:00.0: gpu post error!
[ 7.777898] amdgpu 0000:01:00.0: Fatal error during GPU init
[ 7.777947] [drm] amdgpu: finishing device.

Pretty stuck. I’m thinking of trying to disable ACPI, but other than that I’m googled-out. Any suggestions welcome!

Thanks

So, disabling ACPI caused all kinds of problems; the machine didn’t boot right as the K600 didn’t get an IRQ somehow. Reverted. Now trying to play with efifb=off / CSM mode.

Well, finally figured out that this card was being put in a D3 state, which stopped things working. I didn’t apply the gnif patch yet - I’m running 5.4…

I’m using a workaround: removing the GPU and its sound card function, suspending to RAM, then rescanning. It’s kinda annoying…but it “works”.

Any better ideas? Any idea why I didn’t experience this on Intel?

That failed-to-mmap BAR error is the clue here. I hit the same problem with my RTX and GTX cards being in the first slot on an Asus mobo. Basically it’s the framebuffer’s fault. It is mapped to your first-slot GPU, and you should be seeing the Asus logo on the screen connected to that GPU. Add video=efifb:off in grub or whatever boot parameters you use. After that you shouldn’t see the logo anymore, and the card should pass to the VM without a problem. There are qemu hooks and scripts you can run to unbind the framebuffer on VM start if you are interested. This is basically the “passing boot GPU” problem.
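
For the unbind-on-VM-start route, here is a minimal sketch of what such a hook script might do, assuming the usual sysfs platform-driver path for efifb. The function name and the SYSFS_ROOT override are made up for illustration (the override only exists so the function can be dry-run against a fake directory), and it needs root on a real host.

```shell
#!/bin/sh
# Sketch: release the EFI framebuffer so the boot GPU can be handed over.
# SYSFS_ROOT is a hypothetical override for dry runs; it defaults to /sys.
unbind_efifb() {
    root="${SYSFS_ROOT:-/sys}"
    drv="$root/bus/platform/drivers/efi-framebuffer"
    # Nothing to do if efifb is not bound (e.g. video=efifb:off took effect).
    if [ ! -e "$drv/efi-framebuffer.0" ]; then
        echo "efifb not bound, nothing to unbind"
        return 0
    fi
    # Writing the device name to "unbind" detaches the driver from it.
    echo efi-framebuffer.0 > "$drv/unbind"
}
```

Calling unbind_efifb from a hook before the VM starts would be the idea; if video=efifb:off is already on the kernel command line, it should just report that there is nothing to unbind.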

Thanks; trying that now. Think swapping a slot would avoid this? A day ago, I changed to running the K600 in slot 1 and the Vega in slot 2 - it didn’t change anything for me.

Just tried that;

[ 0.000000] Linux version 5.4.0-31-generic (buildd@lgw01-amd64-059) (gcc version 9.3.0 (Ubuntu 9.3.0-10ubuntu2)) #35-Ubuntu SMP Thu May 7 20:20:34 UTC 2020 (Ubuntu 5.4.0-31.35-generic 5.4.34)
[ 0.000000] Command line: BOOT_IMAGE=/BOOT/ubuntu_u61kek@/vmlinuz-5.4.0-31-generic root=ZFS=rpool/ROOT/ubuntu_u61kek ro quiet splash amd_iommu=on iommu=pt vfio-pci.ids=1002:aaf8,1002:687f,1043:87c0 video=efifb:off vt.handoff=1

But the status (fresh boot) is:

   Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
           Status: D3 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-

I’m now stuck again - the remove -

echo 1 > /sys/bus/pci/devices/0000:0d:00.0/remove
echo 1 > /sys/bus/pci/devices/0000:0d:00.1/remove
systemctl suspend

then rescan + status just returns D3 all the time;

echo 1 | sudo tee /sys/bus/pci/rescan
sudo lspci -vv -s 0d:00.0 | grep Status

Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Status: D3 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
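
To make that check scriptable, a small sketch that pulls just the power state out of lspci -vv output. The function name is made up for illustration; it only relies on the "Status: D<n>" line of the Power Management capability shown above, and lspci needs root to print capabilities at all.

```shell
#!/bin/sh
# Sketch: read `lspci -vv` output on stdin and print the PCI power state.
pci_power_state() {
    # The device status line ("Status: Cap+ ...") does not match; only the
    # Power Management line ("Status: D3 NoSoftRst+ ...") does.
    sed -n 's/^[[:space:]]*Status: \(D[0-3]\)[[:space:]].*/\1/p' | head -n 1
}
```

Usage would be something like: sudo lspci -vv -s 0d:00.0 | pci_power_state, which makes it easy to loop over the remove/suspend/rescan steps and see whether the card ever leaves D3.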

So…I’m stuck again. I guess I must have done something else that helped, which I’ve now forgotten. Ugh.

One last suggestion is to dump the vbios from the card and provide that to the VM. It would also help if you could post your .xml config. Your dmesg shows something about VM BIOSes, and it looks like the wrong one, but I could be wrong. Also, I just remembered: did you update your Asus mobo to the latest BIOS? There was a period when AGESA broke passthrough.
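
For the vbios dump, a rough sketch using the sysfs rom attribute, which is the usual route on Linux. The function name and both arguments are placeholders for illustration; it needs root, and the read can still fail depending on the state the card is in.

```shell
#!/bin/sh
# Sketch: dump a card's video BIOS through the sysfs "rom" attribute.
# $1 = device directory (e.g. /sys/bus/pci/devices/0000:0d:00.0)
# $2 = output file (any path you like; hypothetical)
dump_vbios() {
    dev="$1"
    out="$2"
    if [ ! -e "$dev/rom" ]; then
        echo "no rom attribute under $dev" >&2
        return 1
    fi
    echo 1 > "$dev/rom" || return 1   # enable reads of the ROM
    cat "$dev/rom" > "$out"
    rc=$?
    echo 0 > "$dev/rom"               # disable again, whatever happened
    return "$rc"
}
```

The resulting file is what you would then point the VM config at as the ROM file for the passed-through GPU.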

I did update to the latest. Trying to dump the card bios via linux resulted in an error - I’ll try again.

It’s super weird as a) it booted for an hour or so yesterday b) it’s been booted with that bios on windows for months…

Wish I knew what I did to get it out of D3 properly.

Any idea if gnif’s patch is gonna help?

I don’t know. If it really is a D3 issue, the patch could help, but I can’t say for sure. The way it behaves all sounds very weird. The latest kernel update on my host somehow broke initialization of my Nvidia cards. Now I need to do a PCI rescan and vfio driver assignment on VM start. Otherwise the cards don’t even appear in lspci.
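
Roughly what that rescan plus vfio assignment looks like, as a sketch rather than my exact script. It assumes the standard sysfs rescan, driver_override and drivers_probe interfaces; the function name and the SYSFS_ROOT override (only there for dry runs against a fake directory) are made up for illustration, and it needs root.

```shell
#!/bin/sh
# Sketch: rescan the PCI bus, then steer one device to vfio-pci.
# SYSFS_ROOT is a hypothetical override for dry runs; it defaults to /sys.
rebind_to_vfio() {
    dev="$1"                                   # e.g. 0000:0d:00.0
    root="${SYSFS_ROOT:-/sys}"
    echo 1 > "$root/bus/pci/rescan"            # make missing devices reappear
    devdir="$root/bus/pci/devices/$dev"
    if [ ! -d "$devdir" ]; then
        echo "$dev not found after rescan" >&2
        return 1
    fi
    # Detach whatever driver grabbed the device, if any.
    if [ -e "$devdir/driver" ]; then
        echo "$dev" > "$devdir/driver/unbind"
    fi
    echo vfio-pci > "$devdir/driver_override"  # only vfio-pci may claim it now
    echo "$dev" > "$root/bus/pci/drivers_probe"
}
```

You would call it once per function of the card before the VM starts, e.g. rebind_to_vfio 0000:0d:00.0 and then rebind_to_vfio 0000:0d:00.1 for the audio part.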

Ya; that’s the only other difference (other than intel -> amd), I’m now running 5.4 not 5.3. I’m gonna try a downgrade and see if that helps - seems weird this bug is specific to Ryzen, and not the GPU.

Will try the patch after that…

Yes, downgrading to a known-good Linux build is probably a good idea. The latest kernel is absolutely bonkers with its issues. I am running the latest Pop!_OS, which introduced weird PCI problems, while the previous version had absolutely zero issues. I am also running Ryzen (a 3950X in this case) on X570 (Asus Hero 8), FYI.
