Lspci returning "! Unknown header type 7f" for navi amdgpu passthrough after shutting down win10 guest

Hey guys! I don’t know if I’m a newbie or not, but the last couple of weeks of human malware & unemployment have caused me to pick up new-ish (stuff that was on sale) hardware and Linux again, with the intent of using QEMU/KVM for Windows 10 entertainment and Linux for everything else.

I’m running a Threadripper 1920x on ASRock x399 Fata1ity w/ 64G RAM, 3x500G NVME (for the OS & VMs), 2x4T Raid0 (data), amdgpu 590XT (host), amdgpu 5700xt (passthrough). Host OS is Ubuntu Server 18.04 with 5.5.11 kernel and guest is standard Win10 1909-Pro.

I’ve been able to get everything working pretty nicely; however, when I shut down and then restart my Win10 VM I get:

Error starting domain: internal error: Unknown PCI header type ‘127’

and when I check with lspci -vvn, the entry for the Sapphire Nitro+ card goes from showing much more info to just:

0b:00.0 0300: 1002:731f (rev ff) (prog-if ff)
!!! Unknown header type 7f
Kernel driver in use: vfio-pci
Kernel modules: amdgpu

I’ve tried the navi reset script (I think gnif/wendel) for giggles, and it definitely does power them off, but they don’t come back on, so I was wondering if there was something that maybe I could do after the guest shuts down to try and get it back to a readable state?

Thanks in advance!
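
A note on the lspci output above: as far as I understand it, lspci masks off the multi-function bit of the header type byte, so a device that answers every config-space read with 0xff shows up as header type 7f (along with rev ff and prog-if ff). Below is a minimal Python sketch for checking that state from sysfs. It assumes the standard /sys/bus/pci/devices/<address>/config file and the 0000:0b:00.0 address from this post; the function names are made up for illustration.

```python
from pathlib import Path

# Offset of the header type byte in standard PCI configuration space.
HEADER_TYPE_OFFSET = 0x0E


def header_type(config: bytes) -> int:
    """Return the header layout field with bit 7 (multi-function flag) masked off."""
    return config[HEADER_TYPE_OFFSET] & 0x7F


def looks_unresponsive(config: bytes) -> bool:
    """True when the first 64 config bytes all read back as 0xff."""
    head = config[:64]
    return len(head) == 64 and all(b == 0xFF for b in head)


def read_config(bdf: str) -> bytes:
    """Read raw config space for a device such as '0000:0b:00.0' (assumed address)."""
    return Path(f"/sys/bus/pci/devices/{bdf}/config").read_bytes()
```

On the host you would call looks_unresponsive(read_config("0000:0b:00.0")) after the guest shuts down. If it returns True, the card is not answering config reads at all, which matches the "Unknown header type 7f" line rather than pointing at a libvirt problem.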

As far as I know the reset fix is only for Vega cards, because each card has a different initialisation sequence and @gnif didn’t have a Navi card to play with. Not sure if he ever received one.

So to my knowledge it’s currently not possible to get the card back to the Host after the Guest shuts down, unfortunately.

Darn. At least it works on an operable, albeit non-ideal level (rebooting host).

Thanks!

Yes I did, which resulted in this patch: Navi Reset Kernel Patch

Nice, seems I’m not up to date then :eyes:

Sorry to necro this thread, but I’m also experiencing this issue (i.e., “!!! Unknown header type 7f”). The GPU I’m trying this with is an MSI “Airboost” Vega 56, which from the posts above shouldn’t still be experiencing the reset bug. Code regression or something else?

Additional info: CPU is a Ryzen 7 1700, and I’m running Fedora 33 with kernel version 5.13.8-100.fc33.x86_64.

More info: This does not appear to happen with Proxmox 7. I’m leaning towards a kernel issue, though there could be something different about the way the VMs are configured in each environment.

What makes you think this? The entire Vega and Navi series are bugged, even the earlier Polaris GPUs have issues. Only the newer Big Navi has been fixed.

Um, this quote below.

If that’s wrong, well then I suppose @mihawk90 is misinformed. In any event, updating to kernel 5.13.10-200.fc34.x86_64 by way of Fedora 34 seems to have resolved the issue. I presume there’s a workaround in either the more recent kernel or the libvirt code.

Well for one that quoted post is more than a year old…

And for another, no one said it wasn’t an issue anymore. I don’t know whether the reset fix ever got into the mainline kernel (I think there was some talk back then that it was “too dirty” to bring it into mainline, but Gnif can probably clarify).
The reset fixes are in a repository here:

Nope, never upstreamed: too dirty and hacky, and never 100% stable.

Referring to: Vega 10 and 12 reset application

Vega == Vega?

But yes, the repo linked does supersede all these patches/tools I wrote.

My misunderstanding. I thought certainly something so fundamentally important to using GPU pass through successfully would’ve been incorporated into the mainline kernel. I had no idea there was an entirely separate DKMS module for that express purpose.

And of course after I made that post and shut down my VM, the fan on my Vega started roaring, like it does when it needs a reboot, so the new kernel did not fix the problem. I’ve built the vendor-reset DKMS module and put a config file in /etc/modules-load.d/. It’s loaded successfully, but I won’t know for a while if it’s working on my funky hardware.
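
For anyone else checking whether the module actually got loaded, here is a small Python sketch that parses /proc/modules. It assumes the module shows up there as vendor_reset (the kernel lists module names with underscores in place of hyphens); the function names are hypothetical. It only confirms that the module is loaded, not that the reset works on a given card.

```python
from pathlib import Path


def loaded_modules(proc_modules_text: str) -> set:
    """Return the module names listed in the text of /proc/modules."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def is_module_loaded(name: str, proc_modules_text: str) -> bool:
    """Check for a module by name, treating '-' and '_' as equivalent."""
    return name.replace("-", "_") in loaded_modules(proc_modules_text)


def read_proc_modules() -> str:
    """Read the live module list from the running kernel."""
    return Path("/proc/modules").read_text()
```

Usage on the host would be is_module_loaded("vendor-reset", read_proc_modules()).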

This is only needed on last-gen and earlier AMD GPUs, as they are bugged and AMD refused to fix it for VFIO passthrough. vendor-reset is an effort to port the resets they use in amdgpu into a separate module for VFIO usage.

AMD never advertised support for device resets nor do they openly support VGA passthrough. GPU passthrough works fine with other vendors that provide a means to reset their GPUs, it’s not on the kernel to have workarounds for vendor bugs.

…and one of the reasons I went with AMD was because of the Code 43 debacle with Nvidia. I know there are ways around Code 43, but I thought the AMD reset bug was easier to deal with.

Too bad all the players seem to want to kill GPU pass through.

Well it works on newer gen AMD and nvidia disabled the Code 43 BS sooooo.

Yes I’m finding that the scenery has changed since I last tried out GPU passthrough 2 or 3 years ago. I gave up partly because of the Code 43 and AMD reset nonsense.

I’m hoping to get Looking Glass working eventually but I’ve only just now gotten basic pass through working. The weekend is almost done, so I’ll probably just rest on my laurels and play a few games until next weekend.