Ryzen/Vega passthrough issues

isitaboat · May 22, 2020, 3:02am

Hey all, I recently swapped from Intel (a nice old HP Z820) to a Ryzen 3900X, basically porting all the core hardware across. However, I can’t get my passthrough working with Windows, even though it was flawless before on the Intel chip.

So, I’m running:

3900x / Vega 56 (Sapphire) / Asus WS X570 Pro / Nvidia K600 for host
I’m running a K600 in slot 3, and the Vega 56 in slot 1. The bios is set to slot 3 in the only place I can find.
Qemu 5.0.0 (and rc0, same results so far)
Enabled IOMMU (VFIO is grabbing the cards - https://gist.github.com/ukd1/055d33eb72a4ae78be0a4051018d1d9e#file-host-dmesg-txt-L2 + https://gist.github.com/ukd1/055d33eb72a4ae78be0a4051018d1d9e#file-host-dmesg-txt-L885)
Enabled Virt in bios (can boot a vm fine)
My QEMU config - https://gist.github.com/ukd1/055d33eb72a4ae78be0a4051018d1d9e#file-win-sh-L1
host lspci output - https://gist.github.com/ukd1/055d33eb72a4ae78be0a4051018d1d9e#file-lspci-nnk
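
The host-side items in the list above (driver binding, IOMMU) can also be read straight from sysfs. This is only a minimal sketch assuming the usual sysfs layout; the function names and the optional sysfs-root argument are hypothetical helpers, not part of any tool, and 0000:0d:00.0 is the Vega's address taken from the lspci output.

```shell
#!/bin/sh
# Print the kernel driver currently bound to a PCI device, or "none".
# $1 = PCI address (e.g. 0000:0d:00.0), $2 = sysfs root (defaults to /sys).
pci_driver() {
    link="${2:-/sys}/bus/pci/devices/$1/driver"
    if [ -L "$link" ]; then
        basename "$(readlink "$link")"
    else
        echo none
    fi
}

# Print the IOMMU group number of a PCI device, or "none" if it has no group
# (for example because the IOMMU is disabled).
pci_iommu_group() {
    link="${2:-/sys}/bus/pci/devices/$1/iommu_group"
    if [ -L "$link" ]; then
        basename "$(readlink "$link")"
    else
        echo none
    fi
}
```

For example, `pci_driver 0000:0d:00.0` should print vfio-pci when the binding worked, and `pci_iommu_group 0000:0d:00.0` shows which group the card landed in.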

I get the following error from Qemu on startup, but not every time (?):

Failed to mmap 0000:0d:00.0 BAR 0. Performance may be slow

Windows just hangs on boot (I can see this via VNC), and nothing appears on the screen. I have no idea how to debug this, so I installed Ubuntu in a VM and added the GPU. It doesn’t crash, but it doesn’t use the GPU either. The card is listed in lspci in the VM. In dmesg in that VM, I get the following:

https://gist.github.com/ukd1/055d33eb72a4ae78be0a4051018d1d9e#file-vm-dmesg-txt-L763

host dmesg.txt

[    0.000000] Linux version 5.4.0-31-generic (buildd@lgw01-amd64-059) (gcc version 9.3.0 (Ubuntu 9.3.0-10ubuntu2)) #35-Ubuntu SMP Thu May 7 20:20:34 UTC 2020 (Ubuntu 5.4.0-31.35-generic 5.4.34)
[    0.000000] Command line: BOOT_IMAGE=/BOOT/ubuntu_u61kek@/vmlinuz-5.4.0-31-generic root=ZFS=rpool/ROOT/ubuntu_u61kek ro quiet splash amd_iommu=on iommu=pt vfio-pci.ids=1002:aaf8,1002:687f,1043:87c0 vt.handoff=1
[    0.000000] KERNEL supported cpus:
[    0.000000]   Intel GenuineIntel
[    0.000000]   AMD AuthenticAMD
[    0.000000]   Hygon HygonGenuine
[    0.000000]   Centaur CentaurHauls
[    0.000000]   zhaoxin   Shanghai  
[    0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'

lspci -nnk

0d:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 XL/XT [Radeon RX Vega 56/64] [1002:687f] (rev c3)
        Subsystem: Sapphire Technology Limited Vega 10 XL/XT [Radeon RX Vega 56/64] [1da2:e376]
        Kernel driver in use: vfio-pci
        Kernel modules: amdgpu
0d:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 HDMI Audio [Radeon Vega 56/64] [1002:aaf8]
        Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 HDMI Audio [Radeon Vega 56/64] [1002:aaf8]
        Kernel driver in use: vfio-pci
        Kernel modules: snd_hda_intel

vm dmesg.txt

[    0.000000] Linux version 5.4.0-31-generic (buildd@lgw01-amd64-059) (gcc version 9.3.0 (Ubuntu 9.3.0-10ubuntu2)) #35-Ubuntu SMP Thu May 7 20:20:34 UTC 2020 (Ubuntu 5.4.0-31.35-generic 5.4.34)
[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.4.0-31-generic root=UUID=5af85d7a-a950-420b-a4d9-819a3b7d5343 ro quiet splash vt.handoff=7
[    0.000000] KERNEL supported cpus:
[    0.000000]   Intel GenuineIntel
[    0.000000]   AMD AuthenticAMD
[    0.000000]   Hygon HygonGenuine
[    0.000000]   Centaur CentaurHauls
[    0.000000]   zhaoxin   Shanghai  
[    0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'

[ 2.704245] amdgpu 0000:01:00.0: remove_conflicting_pci_framebuffers: bar 0: 0xd0000000 -> 0xdfffffff
[ 2.704246] amdgpu 0000:01:00.0: remove_conflicting_pci_framebuffers: bar 2: 0xe0000000 -> 0xe01fffff
[ 2.704246] amdgpu 0000:01:00.0: remove_conflicting_pci_framebuffers: bar 5: 0xfe800000 -> 0xfe87ffff

then… (https://gist.github.com/ukd1/055d33eb72a4ae78be0a4051018d1d9e#file-vm-dmesg-txt-L828)

[ 7.777788] [drm:atom_op_jump [amdgpu]] ERROR atombios stuck in loop for more than 5secs aborting
[ 7.777826] [drm:amdgpu_atom_execute_table_locked [amdgpu]] ERROR atombios stuck executing DA12 (len 279, WS 16, PS 4) @ 0xDB16
[ 7.777860] [drm:amdgpu_atom_execute_table_locked [amdgpu]] ERROR atombios stuck executing AA02 (len 228, WS 8, PS 4) @ 0xAAD6
[ 7.777894] [drm:amdgpu_atom_execute_table_locked [amdgpu]] ERROR atombios stuck executing 9DE4 (len 368, WS 0, PS 8) @ 0x9E5A
[ 7.777896] amdgpu 0000:01:00.0: gpu post error!
[ 7.777898] amdgpu 0000:01:00.0: Fatal error during GPU init
[ 7.777947] [drm] amdgpu: finishing device.

Thinking of trying to disable ACPI, but other than that I’m pretty stuck / googled-out. Any suggestions welcome!

Thanks

isitaboat · May 22, 2020, 10:05pm

So, disabling ACPI caused all kinds of problems; the machine didn’t boot properly because the K600 somehow didn’t get an IRQ. Reverted. Now trying to play with efifb=off / CSM mode.

isitaboat · May 22, 2020, 10:44pm

Well, finally figured out that this card was being put in a D3 state, which stopped things working. I didn’t apply the gnif patch yet - I’m running 5.4…

I’m using a workaround: removing the GPU and its sound function, suspending to RAM, then rescanning. It’s kinda annoying… but it “works”.

Any better ideas? Any idea why I didn’t experience this on Intel?

Wolf_vx · May 22, 2020, 11:26pm

The “Failed to mmap 0000:0d:00.0 BAR 0” message is the clue here. I hit the same problem with my RTX and GTX cards in the first slot on an Asus board. Basically it’s the framebuffer’s fault: it is mapped to your first-slot GPU, and you should be seeing the Asus logo on the screen connected to that GPU. Add video=efifb:off to your kernel boot parameters in GRUB (or whatever bootloader you use). After that you shouldn’t see the logo anymore, and the card should pass through to the VM without a problem. There are also QEMU hooks and scripts you can run to unbind the framebuffer on VM start, if you are interested. This is basically the “passing the boot GPU” problem.
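
For the "unbind the framebuffer on VM start" route, a hook script usually boils down to something like the sketch below. It is a rough illustration, not a tested hook: the function name and the sysfs-root argument are hypothetical, and efi-framebuffer.0 is the usual device name but is an assumption that may differ on a given host.

```shell
#!/bin/sh
# Release the host console and the EFI framebuffer so the boot GPU can be
# handed to the VM. Run as root before starting the VM.
# $1 = sysfs root (defaults to /sys).
release_boot_framebuffer() {
    sys="${1:-/sys}"
    # Detach every virtual console from its console driver.
    for vt in "$sys"/class/vtconsole/vtcon*; do
        [ -e "$vt/bind" ] && echo 0 > "$vt/bind"
    done
    # Unbind the EFI framebuffer if the driver currently holds the device.
    fb="$sys/bus/platform/drivers/efi-framebuffer"
    if [ -e "$fb/efi-framebuffer.0" ]; then
        echo efi-framebuffer.0 > "$fb/unbind"
    fi
    return 0
}
```

Calling `release_boot_framebuffer` from a VM start script does at runtime roughly what video=efifb:off does permanently at boot, so only one of the two should be needed.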

isitaboat · May 23, 2020, 1:46am

Thanks; trying that now. Think swapping slots would avoid this? A day ago, I changed to running the K600 in slot 1 and the Vega in slot 2, and it didn’t change anything for me.

isitaboat · May 23, 2020, 1:50am

Just tried that;

[ 0.000000] Linux version 5.4.0-31-generic (buildd@lgw01-amd64-059) (gcc version 9.3.0 (Ubuntu 9.3.0-10ubuntu2)) #35-Ubuntu SMP Thu May 7 20:20:34 UTC 2020 (Ubuntu 5.4.0-31.35-generic 5.4.34)
[ 0.000000] Command line: BOOT_IMAGE=/BOOT/ubuntu_u61kek@/vmlinuz-5.4.0-31-generic root=ZFS=rpool/ROOT/ubuntu_u61kek ro quiet splash amd_iommu=on iommu=pt vfio-pci.ids=1002:aaf8,1002:687f,1043:87c0 video=efifb:off vt.handoff=1

But the status (fresh boot) is:

   Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
           Status: D3 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-

isitaboat · May 23, 2020, 2:01am

I’m now stuck again - the remove -

echo 1 > /sys/bus/pci/devices/0000:0d:00.0/remove
echo 1 > /sys/bus/pci/devices/0000:0d:00.1/remove
systemctl suspend

then rescan + status just returns D3 all the time;

echo 1 | sudo tee /sys/bus/pci/rescan
lspci -vv -s 0d:00.0 | grep Status

Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Status: D3 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-

So…I’m stuck again. I guess I must have done something else that helped, which I’ve now forgotten. Ugh.
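
To make the D-state check less noisy between attempts, the grep can be narrowed to just the power state. A small sketch that only parses `lspci -vv` text; the function name is a hypothetical helper, and it assumes the Power Management capability prints a line of the form "Status: D3 NoSoftRst+ ..." as in the output above.

```shell
#!/bin/sh
# Read `lspci -vv` output on stdin and print the device power state
# (D0..D3) taken from the Power Management capability's "Status:" line.
# Prints nothing if no such line is present.
pci_power_state() {
    sed -n 's/^[[:space:]]*Status: \(D[0-3]\).*/\1/p' | head -n 1
}
```

Usage would be `lspci -vv -s 0d:00.0 | pci_power_state`, run as root so that lspci can read the capability list.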

Wolf_vx · May 23, 2020, 1:53pm

One last suggestion is to dump the vBIOS from the card and provide that to the VM. It would also help if you could post your .xml config. Your dmesg shows something about VM BIOSes, and it looks like the wrong one, but I could be wrong. Also, I just remembered: did you update your Asus board to the latest BIOS? There was a period when AGESA broke passthrough.
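
Dumping the vBIOS on Linux is normally done through the device's `rom` attribute in sysfs. The sketch below shows that sequence under the assumption that the card exposes a `rom` file; the function name and the output filename are hypothetical, and the read can fail on some setups.

```shell
#!/bin/sh
# Copy a PCI device's option ROM (the vBIOS for a GPU) to a file via sysfs.
# Run as root. $1 = sysfs device directory, $2 = output file.
dump_pci_rom() {
    rom="$1/rom"
    [ -e "$rom" ] || { echo "no rom attribute under $1" >&2; return 1; }
    echo 1 > "$rom" || return 1      # enable reading the ROM
    cat "$rom" > "$2"; status=$?     # copy the image out
    echo 0 > "$rom"                  # disable reading again
    return "$status"
}
```

For example, `dump_pci_rom /sys/bus/pci/devices/0000:0d:00.0 vega56.rom`. With a plain QEMU command line, the file can then be handed to the guest through the `romfile=` property of the vfio-pci device, e.g. `-device vfio-pci,host=0d:00.0,romfile=/path/to/vega56.rom` (the path is a placeholder).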

isitaboat · May 23, 2020, 4:37pm

I did update to the latest. Trying to dump the card bios via linux resulted in an error - I’ll try again.

It’s super weird as a) it booted for an hour or so yesterday b) it’s been booted with that bios on windows for months…

Wish I knew what I did to get it out of D3 properly.

Any idea if gnif’s patch is gonna help?

Wolf_vx · May 23, 2020, 5:09pm

I don’t know. If it is actually D3, the patch could help, but I can’t say for sure. It all sounds very weird in how it behaves. The latest kernel update on my host somehow broke Nvidia card initialization for me. Now I need to do a PCI rescan and vfio driver assignment on VM start. Otherwise the cards don’t even appear in lspci.
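
A rescan plus vfio assignment on VM start can be sketched roughly as follows. This is an illustration of the general sysfs sequence, not the exact script in use here: the function name and the sysfs-root argument are hypothetical, and it assumes the vfio-pci module is already loaded.

```shell
#!/bin/sh
# Rescan the PCI bus and bind one device to vfio-pci via driver_override.
# Run as root. $1 = PCI address (e.g. 0000:0d:00.0), $2 = sysfs root (defaults to /sys).
rescan_and_bind_vfio() {
    sys="${2:-/sys}"
    dev="$sys/bus/pci/devices/$1"
    echo 1 > "$sys/bus/pci/rescan" || return 1
    [ -d "$dev" ] || { echo "$1 not found after rescan" >&2; return 1; }
    # Release whatever driver currently holds the device, if any.
    [ -e "$dev/driver/unbind" ] && echo "$1" > "$dev/driver/unbind"
    # Force vfio-pci for this device, then ask the kernel to re-probe it.
    echo vfio-pci > "$dev/driver_override" || return 1
    echo "$1" > "$sys/bus/pci/drivers_probe"
}
```

It would be called once per function of the card, for example `rescan_and_bind_vfio 0000:0d:00.0` followed by `rescan_and_bind_vfio 0000:0d:00.1` for the HDMI audio device.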

isitaboat · May 23, 2020, 5:28pm

Ya; that’s the only other difference (other than intel -> amd), I’m now running 5.4 not 5.3. I’m gonna try a downgrade and see if that helps - seems weird this bug is specific to Ryzen, and not the GPU.

Will try the patch after that…

Wolf_vx · May 24, 2020, 12:37am

Yes, downgrading to a known-good Linux build is probably a good idea. The latest kernel is absolutely bonkers with its issues. I am running the latest Pop!_OS, which introduced weird PCI problems, while the previous version had absolutely zero issues. FYI, I am also running Ryzen (a 3950X in this case) on X570 (Asus Hero 8).