AMD Polaris, Vega & Navi Reset Project - vendor-reset

Would anyone be able to assist me in building this? I’m trying to use this on Fedora Silverblue. As far as I know, there’s no easy way to do custom kernels on Fedora Silverblue, so I was excited to see that this can be used just with modprobe (I don’t think I can use dkms on Silverblue either).

As there’s no release provided to simply download, I’m trying to build this in a Fedora Toolbox. I’ve installed make, gone to the source code directory, run make, and got:

make -C /lib/modules/5.9.11-200.fc33.x86_64/build M=/var/home/myusername/vendor-reset modules
make[1]: *** /lib/modules/5.9.11-200.fc33.x86_64/build: No such file or directory.  Stop.
make: *** [Makefile:8: build] Error 2

In a Toolbox, /lib/modules/ doesn’t seem to exist; the host’s copy is under /run/host/usr/lib/modules/ instead. I’ve verified that if I run ls /run/host/usr/lib/modules/ in the Toolbox I can see the 5.9.11-200.fc33.x86_64 folder that it’s looking for. So I edited the Makefile and changed KDIR accordingly… Now when I run make, I get:

make -C /run/host/usr/lib/modules/5.9.11-200.fc33.x86_64/build M=/var/home/myusername/vendor-reset modules
make[1]: *** /run/host/usr/lib/modules/5.9.11-200.fc33.x86_64/build: No such file or directory.  Stop.
make: *** [Makefile:8: build] Error 2

Seems like changing the path didn’t help anything.

Could anyone give me any assistance in getting this thing compiled? Thanks.
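
A likely reason the path change made no difference: on Fedora, the build entry under a kernel's modules directory is normally a symlink into /usr/src/kernels/<release>, and that tree is only populated when the matching kernel-devel package is installed. The modules directory existing does not mean the build tree does. The sketch below just checks a few candidate locations and prints the first one that is really there. find_kdir and its optional root-prefix argument are names made up for illustration, not part of vendor-reset.

```shell
# Print the first existing kernel build directory for a kernel release.
# Usage: find_kdir <kernel-release> [root-prefix]
# root-prefix is a hypothetical knob so the search can be pointed at
# another tree; it defaults to the real filesystem root.
find_kdir() {
    krel=$1
    root=${2:-}
    for dir in \
        "$root/lib/modules/$krel/build" \
        "$root/usr/src/kernels/$krel" \
        "$root/run/host/usr/lib/modules/$krel/build"
    do
        # -d follows symlinks, so a dangling "build" link is skipped.
        if [ -d "$dir" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    echo "no kernel build tree found for $krel (is kernel-devel installed?)" >&2
    return 1
}
```

Inside the Toolbox, after installing a kernel-devel package that matches the host kernel, something like make KDIR="$(find_kdir "$(uname -r)")" should point the build at a tree that exists, assuming the Makefile picks up KDIR from the command line.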

strange, after i had to replace my motherboard (old one MSI B450 Tomahawk, new one MSI B450 Tomahawk MAX) i can't get any output from my guest GPU anymore :(

it always shows error code 43, and the AMD driver software shows me a message that the AMD driver is not installed or does not work properly.


this is my dmesg output:

[ 1874.346696] vfio-pci 0000:28:00.0: BAR 0: can't reserve [mem 0xd0000000-0xdfffffff 64bit pref]
[ 1874.346703] vfio-pci 0000:28:00.0: BAR 0: can't reserve [mem 0xd0000000-0xdfffffff 64bit pref]
[ 1874.346708] vfio-pci 0000:28:00.0: BAR 0: can't reserve [mem 0xd0000000-0xdfffffff 64bit pref]
[ 1874.346715] vfio-pci 0000:28:00.0: BAR 0: can't reserve [mem 0xd0000000-0xdfffffff 64bit pref]
[ 1874.346720] vfio-pci 0000:28:00.0: BAR 0: can't reserve [mem 0xd0000000-0xdfffffff 64bit pref]
[ 1874.346727] vfio-pci 0000:28:00.0: BAR 0: can't reserve [mem 0xd0000000-0xdfffffff 64bit pref]
[ 1874.346732] vfio-pci 0000:28:00.0: BAR 0: can't reserve [mem 0xd0000000-0xdfffffff 64bit pref]
[ 1874.346739] vfio-pci 0000:28:00.0: BAR 0: can't reserve [mem 0xd0000000-0xdfffffff 64bit pref]
[ 1874.346744] vfio-pci 0000:28:00.0: BAR 0: can't reserve [mem 0xd0000000-0xdfffffff 64bit pref]
[ 1874.346751] vfio-pci 0000:28:00.0: BAR 0: can't reserve [mem 0xd0000000-0xdfffffff 64bit pref]
[ 1874.346756] vfio-pci 0000:28:00.0: BAR 0: can't reserve [mem 0xd0000000-0xdfffffff 64bit pref]
[ 1874.346763] vfio-pci 0000:28:00.0: BAR 0: can't reserve [mem 0xd0000000-0xdfffffff 64bit pref]
[ 1874.346768] vfio-pci 0000:28:00.0: BAR 0: can't reserve [mem 0xd0000000-0xdfffffff 64bit pref]
[ 1874.346775] vfio-pci 0000:28:00.0: BAR 0: can't reserve [mem 0xd0000000-0xdfffffff 64bit pref]
[ 1874.346785] vfio-pci 0000:28:00.0: BAR 0: can't reserve [mem 0xd0000000-0xdfffffff 64bit pref]
[ 1874.346792] vfio-pci 0000:28:00.0: BAR 0: can't reserve [mem 0xd0000000-0xdfffffff 64bit pref]
[ 1874.346799] vfio-pci 0000:28:00.0: BAR 0: can't reserve [mem 0xd0000000-0xdfffffff 64bit pref]
[ 1874.346805] vfio-pci 0000:28:00.0: BAR 0: can't reserve [mem 0xd0000000-0xdfffffff 64bit pref]
[ 1874.346812] vfio-pci 0000:28:00.0: BAR 0: can't reserve [mem 0xd0000000-0xdfffffff 64bit pref]
[ 1874.346819] vfio-pci 0000:28:00.0: BAR 0: can't reserve [mem 0xd0000000-0xdfffffff 64bit pref]
[ 1874.346825] vfio-pci 0000:28:00.0: BAR 0: can't reserve [mem 0xd0000000-0xdfffffff 64bit pref]
[ 1874.346832] vfio-pci 0000:28:00.0: BAR 0: can't reserve [mem 0xd0000000-0xdfffffff 64bit pref]
[ 1874.346839] vfio-pci 0000:28:00.0: BAR 0: can't reserve [mem 0xd0000000-0xdfffffff 64bit pref]
[ 1874.346846] vfio-pci 0000:28:00.0: BAR 0: can't reserve [mem 0xd0000000-0xdfffffff 64bit pref]
[ 1874.346853] vfio-pci 0000:28:00.0: BAR 0: can't reserve [mem 0xd0000000-0xdfffffff 64bit pref]
[ 1874.346859] vfio-pci 0000:28:00.0: BAR 0: can't reserve [mem 0xd0000000-0xdfffffff 64bit pref]
[ 1874.346866] vfio-pci 0000:28:00.0: BAR 0: can't reserve [mem 0xd0000000-0xdfffffff 64bit pref]
[ 1874.346896] vfio-pci 0000:28:00.0: BAR 0: can't reserve [mem 0xd0000000-0xdfffffff 64bit pref]
[ 1943.797100] vfio-pci 0000:28:00.1: refused to change power state from D0 to D3hot
[ 1949.083798] vfio-pci 0000:28:00.1: refused to change power state from D0 to D3hot
[ 1949.083883] vfio-pci 0000:28:00.0: AMD_NAVI10: version 1.1
[ 1949.083883] vfio-pci 0000:28:00.0: AMD_NAVI10: performing pre-reset
[ 1949.083955] vfio-pci 0000:28:00.0: AMD_NAVI10: performing reset
[ 1949.085045] vfio-pci 0000:28:00.0: No more image in the PCI ROM
[ 1949.304722] vfio-pci 0000:28:00.0: AMD_NAVI10: bus reset disabled? yes
[ 1949.304727] vfio-pci 0000:28:00.0: AMD_NAVI10: SMU response reg: 0, sol reg: 0, mp1 intr enabled? no, bl ready? no
[ 1949.304728] vfio-pci 0000:28:00.0: AMD_NAVI10: Clearing scratch regs 6 and 7
[ 1949.304729] vfio-pci 0000:28:00.0: AMD_NAVI10: begin psp mode 1 reset
[ 1949.518353] vfio-pci 0000:28:00.0: AMD_NAVI10: timed out waiting for PSP to reach valid state, but continuing anyway
[ 1950.040432] vfio-pci 0000:28:00.0: AMD_NAVI10: mode1 reset succeeded
[ 1960.773754] vfio-pci 0000:28:00.0: AMD_NAVI10: timed out waiting for PSP bootloader to respond after reset
[ 1960.773760] vfio-pci 0000:28:00.0: AMD_NAVI10: failed to reset device
[ 1960.773761] vfio-pci 0000:28:00.0: AMD_NAVI10: performing post-reset
[ 1960.773851] vfio-pci 0000:28:00.0: AMD_NAVI10: reset result = 0
[ 1960.790420] vfio-pci 0000:28:00.0: refused to change power state from D0 to D3hot

The patch has solved the hang on reboot/restart problem. However, it seems that my 5700XT becomes unusable. Windows 10 guest still detects it but it is disabled with an error code 43. Enabling it does not have an effect. Tried using virsh destroy instead of Windows shutdown but no difference. Using Fedora 32. Reboot fixes the problem and the 5700XT works again in Windows guest.

dmesg output below. Any ideas how should I troubleshoot this?

[406184.964879] vfio-pci 0000:03:00.0: enabling device (0400 -> 0403)
[406184.965039] vfio-pci 0000:03:00.0: AMD_NAVI10: version 1.1
[406184.965040] vfio-pci 0000:03:00.0: AMD_NAVI10: performing pre-reset
[406184.965156] vfio-pci 0000:03:00.0: AMD_NAVI10: performing reset
[406185.081982] ATOM BIOS: 111
[406185.081984] vendor-reset-drm: atomfirmware: bios_scratch_reg_offset initialized to 4c
[406185.081988] vfio-pci 0000:03:00.0: AMD_NAVI10: bus reset disabled? yes
[406185.081994] vfio-pci 0000:03:00.0: AMD_NAVI10: SMU response reg: 1, sol reg: 22a94f32, mp1 intr enabled? yes, bl ready? no
[406185.081995] vfio-pci 0000:03:00.0: AMD_NAVI10: Clearing scratch regs 6 and 7
[406185.082044] vfio-pci 0000:03:00.0: AMD_NAVI10: begin psp mode 1 reset
[406185.348274] vfio-pci 0000:03:00.0: AMD_NAVI10: timed out waiting for PSP to reach valid state, but continuing anyway
[406185.855440] vfio-pci 0000:03:00.0: AMD_NAVI10: mode1 reset succeeded
[406186.622484] virbr0: port 2(vnet0) entered forwarding state
[406186.622498] virbr0: topology change detected, propagating
[406196.246567] vfio-pci 0000:03:00.0: AMD_NAVI10: timed out waiting for PSP bootloader to respond after reset
[406196.246580] vfio-pci 0000:03:00.0: AMD_NAVI10: failed to reset device
[406196.246581] vfio-pci 0000:03:00.0: AMD_NAVI10: performing post-reset
[406196.246741] vfio-pci 0000:03:00.0: AMD_NAVI10: reset result = 0
[406196.246953] vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0x19@0x270
[406196.246969] vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0x1b@0x2d0
[406196.246975] vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0x25@0x400
[406196.246978] vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0x26@0x410
[406196.246981] vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0x27@0x440
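
One thing worth noting in the logs quoted so far: "reset result = 0" is printed even when the line just before it says "failed to reset device", so the result line alone does not tell you whether the reset worked. A minimal sketch for pulling the failing devices out of a kernel log, assuming the message format stays as it appears in this thread (reset_failures is a made-up helper name):

```shell
# Read kernel log text on stdin and print the PCI address of every device
# for which vendor-reset logged "failed to reset device".
reset_failures() {
    sed -n 's/.*vfio-pci \([0-9a-f:.]*\): AMD_[A-Z0-9]*: failed to reset device.*/\1/p' |
        sort -u    # one line per device, even if it failed several times
}
```

It could be fed with dmesg | reset_failures, or with a saved log file on stdin.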

Note that the problem does not occur after a reboot, so I’m not sure if it is simply a one-off. I will keep it in view for now.

Normal dmesg for reference

[ 314.669955] usb 1-4: reset full-speed USB device number 2 using xhci_hcd
[ 791.076006] vfio-pci 0000:03:00.0: AMD_NAVI10: version 1.1
[ 791.076010] vfio-pci 0000:03:00.0: AMD_NAVI10: performing pre-reset
[ 791.076152] vfio-pci 0000:03:00.0: AMD_NAVI10: performing reset
[ 791.198767] ATOM BIOS: 111
[ 791.198769] vendor-reset-drm: atomfirmware: bios_scratch_reg_offset initialized to 4c
[ 791.198773] vfio-pci 0000:03:00.0: AMD_NAVI10: bus reset disabled? yes
[ 791.198779] vfio-pci 0000:03:00.0: AMD_NAVI10: SMU response reg: 1, sol reg: 1397bf05, mp1 intr enabled? yes, bl ready? yes
[ 791.198779] vfio-pci 0000:03:00.0: AMD_NAVI10: Clearing scratch regs 6 and 7
[ 791.198903] vfio-pci 0000:03:00.0: AMD_NAVI10: begin psp mode 1 reset
[ 791.704662] vfio-pci 0000:03:00.0: AMD_NAVI10: mode1 reset succeeded
[ 793.680348] vfio-pci 0000:03:00.0: AMD_NAVI10: PSP mode1 reset successful
[ 793.680359] vfio-pci 0000:03:00.0: AMD_NAVI10: performing post-reset
[ 793.692356] vfio-pci 0000:03:00.0: AMD_NAVI10: reset result = 0
[ 794.011611] vfio-pci 0000:03:00.1: vfio_bar_restore: reset recovery - restoring BARs
[ 795.335078] usb 1-4: reset full-speed USB device number 2 using xhci_hcd
[ 815.328926] usb 1-4: reset full-speed USB device number 2 using xhci_hcd

For some reason, my Win10 VM stopped working after a dnf update. Starting it will hang the host system.
Removing PCI host devices will allow it to boot so I decided to update and upgrade to Fedora 33 while fixing it.

Downloaded the latest version of vendor-reset (0.1.0) and installed it with dkms (make install fails with an SSL error: no such file/directory crypto/bio/bss_file.c). Seems to be working.

dmesg | grep vendor
[ 2.815006] vendor_reset: loading out-of-tree module taints kernel.
[ 2.815050] vendor_reset: module verification failed: signature and/or required key missing - tainting kernel
[ 2.831382] vendor_reset_hook: installed

vfio-pci driver grabbed the devices

03:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT] [1002:731f] (rev c1)
Subsystem: Tul Corporation / PowerColor Device [148c:2398]
Kernel driver in use: vfio-pci
Kernel modules: amdgpu
03:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 HDMI Audio [1002:ab38]
Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 HDMI Audio [1002:ab38]
Kernel driver in use: vfio-pci
Kernel modules: snd_hda_intel

The qemu error seems to be that it does not own the IOMMU group, but all devices in the group are passed to VFIO.

2021-02-09 12:21:17.528+0000: Domain id=1 is tainted: custom-argv
2021-02-09 12:21:17.528+0000: Domain id=1 is tainted: host-cpu
char device redirected to /dev/pts/1 (label charserial0)
2021-02-09T12:21:18.597872Z qemu-system-x86_64: warning: This family of AMD CPU doesn’t support hyperthreading(2)
Please configure -smp options properly or try enabling topoext feature.
2021-02-09T12:21:21.446390Z qemu-system-x86_64: vfio: Cannot reset device 0000:03:00.1, depends on group 10 which is not owned.
2021-02-09T12:21:21.452383Z qemu-system-x86_64: vfio: Cannot reset device 0000:03:00.1, depends on group 10 which is not owned.
2021-02-09T12:29:03.901272Z qemu-system-x86_64: terminating on signal 15 from pid 2748 (/usr/sbin/libvirtd)
2021-02-09 12:29:05.357+0000: shutting down, reason=shutdown
2021-02-09 12:29:49.349+0000: shutting down, reason=failed
2021-02-09 15:15:30.535+0000: shutting down, reason=failed

Relevant IOMMU listing

IOMMU Group 0 00:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-1fh) PCIe Dummy Host Bridge [1022:1452]
IOMMU Group 10 03:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT] [1002:731f] (rev c1)
IOMMU Group 11 03:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 HDMI Audio [1002:ab38]
IOMMU Group 12 04:00.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] 400 Series Chipset USB 3.1 XHCI Controller [1022:43d5] (rev 01)
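
In the listing above, 03:00.0 sits in group 10 and 03:00.1 in group 11, while the QEMU message says resetting 03:00.1 depends on group 10. To double-check which driver each device in each group is bound to, a sketch along these lines reads the same information straight from sysfs. iommu_groups and its optional sysfs-root argument are illustrative names, and the output order simply follows the shell's glob sorting.

```shell
# List every device that has an IOMMU group, with the group number and
# the driver it is currently bound to.
# Usage: iommu_groups [sysfs-root]   (sysfs-root defaults to /sys)
iommu_groups() {
    sys=${1:-/sys}
    for dev in "$sys"/kernel/iommu_groups/*/devices/*; do
        [ -e "$dev" ] || continue      # glob did not match: no groups exposed
        addr=${dev##*/}                # e.g. 0000:03:00.0
        group=${dev%/devices/*}
        group=${group##*/}             # e.g. 10
        drv=none
        # The "driver" entry is a symlink to the bound driver, if any.
        if [ -L "$sys/bus/pci/devices/$addr/driver" ]; then
            drv=$(basename "$(readlink "$sys/bus/pci/devices/$addr/driver")")
        fi
        printf 'group %s %s driver=%s\n' "$group" "$addr" "$drv"
    done
}
```

Any device in a group that is not bound to vfio-pci (bridges aside) would be the first thing to look at when QEMU reports a group as not owned.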

Any advice or suggestions?

A BIOS update fixed the host freezing problem. Not sure how/why since it was working fine before the OS update.
However, sadly, I’m back to square one with the guest not able to boot up properly. Stuck at TianoCore screen and vfio reset fails.

[ 463.310753] vfio-pci 0000:03:00.0: AMD_NAVI10: version 1.1
[ 463.310757] vfio-pci 0000:03:00.0: AMD_NAVI10: performing pre-reset
[ 463.310950] vfio-pci 0000:03:00.0: AMD_NAVI10: performing reset
[ 463.423873] ATOM BIOS: 111
[ 463.423876] vendor-reset-drm: atomfirmware: bios_scratch_reg_offset initialized to 4c
[ 463.698226] vfio-pci 0000:03:00.0: AMD_NAVI10: bus reset disabled? yes
[ 463.698233] vfio-pci 0000:03:00.0: AMD_NAVI10: SMU response reg: 0, sol reg: 0, mp1 intr enabled? no, bl ready? yes
[ 463.698237] vfio-pci 0000:03:00.0: AMD_NAVI10: performing post-reset
[ 463.722074] vfio-pci 0000:03:00.0: AMD_NAVI10: reset result = 0

libvirt log says no available reset mechanism

2021-02-11T19:38:53.310941Z qemu-system-x86_64: vfio: Cannot reset device 0000:03:00.1, no available reset mechanism.
2021-02-11T19:42:13.518717Z qemu-system-x86_64: vfio: Cannot reset device 0000:03:00.1, no available reset mechanism.
2021-02-11T19:42:13.957957Z qemu-system-x86_64: vfio: Cannot reset device 0000:03:00.1, no available reset mechanism.

any ETA on a working version of this module?

It is working now, as far as I am aware.
You will have to be a lot more specific with your question.


Linux 5.8.10
Navi10 bound to vfio-pci at 0000:48:00.x

here is my full dmesg log from recreating the failure. venderresetbroke.txt (148.7 KB)

i know that my host machine is a real mess in many ways. the messages relevant here are around timestamps:
127.293974
168.704222
230.400959
330.345184

if you need any more information about my setup please let me know and i will be happy to provide it.

Hi everyone.
I’ve been having some issues with Navi10 where after booting some guest OS the GPU hangs.

dmesg shows a timeout waiting for PSP to reach valid state:

[  271.594549] vfio-pci 0000:28:00.0: AMD_NAVI10: version 1.1
[  271.594550] vfio-pci 0000:28:00.0: AMD_NAVI10: performing pre-reset
[  271.594679] vfio-pci 0000:28:00.0: AMD_NAVI10: performing reset
[  271.699254] vendor-reset-drm: atomfirmware: bios_scratch_reg_offset initialized to 4c
[  271.699256] vfio-pci 0000:28:00.0: AMD_NAVI10: bus reset disabled? yes
[  271.699261] vfio-pci 0000:28:00.0: AMD_NAVI10: SMU response reg: 1, sol reg: ad50d73, mp1 intr enabled? yes, bl ready? yes
[  271.699261] vfio-pci 0000:28:00.0: AMD_NAVI10: Clearing scratch regs 6 and 7
[  271.699289] vfio-pci 0000:28:00.0: AMD_NAVI10: begin psp mode 1 reset
[  271.930135] vfio-pci 0000:28:00.0: AMD_NAVI10: timed out waiting for PSP to reach valid state, but continuing anyway
[  272.446438] vfio-pci 0000:28:00.0: AMD_NAVI10: mode1 reset succeeded
[  272.446543] vfio-pci 0000:28:00.0: AMD_NAVI10: PSP mode1 reset successful
[  272.446546] vfio-pci 0000:28:00.0: AMD_NAVI10: performing post-reset
[  272.486458] vfio-pci 0000:28:00.0: AMD_NAVI10: reset result = 0

After some digging, I’ve found the cause and decided to share here, hoping it will be useful for someone.

Turns out that it’s related to the amdgpu firmware loaded in the guest OS, specifically version 20.45, which was pushed upstream on 2020-11-19.
I’ve confirmed that the issue isn’t present in older versions (<=20.40), and it’s also fixed in the newer version 20.50, upstreamed on 2021-03-22.

If you have this issue, check your distro’s linux-firmware package version and update/downgrade to avoid version 20.45 of the navi10 amdgpu firmware.

In Arch guests avoid the following linux-firmware package versions:

20201120.bc9cd0b-1-any
20201218.646f159-1-any
20210208.b79d239-1-any
20210315.3568f96-1-any

In Ubuntu Hirsute (21.04) guests avoid the following linux-firmware package versions:

1.193
1.194
1.195
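
For anyone who wants to script the check, here is a minimal sketch that compares a package version string against the two lists above. It only knows the versions named in this post, and is_affected_firmware is a made-up helper name.

```shell
# Return success (0) if the given linux-firmware package version is one of
# the versions listed above as shipping the 20.45 navi10 firmware.
is_affected_firmware() {
    ver=${1%-any}    # tolerate the "-any" architecture suffix used above
    case "$ver" in
        # Arch package versions from the list above
        20201120.bc9cd0b-1|20201218.646f159-1|20210208.b79d239-1|20210315.3568f96-1)
            return 0 ;;
        # Ubuntu Hirsute package versions from the list above
        1.193|1.194|1.195)
            return 0 ;;
        *)
            return 1 ;;
    esac
}
```

In an Arch guest the version could come from pacman -Q linux-firmware (second field), and in an Ubuntu guest from dpkg-query -W -f='${Version}' linux-firmware, assuming the reported string matches the form used in the lists.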

i have updated my guest’s amdgpu-firmware files to 20.50, and i still cannot reboot the VM without rebooting the host.
here is what happens when the guest gracefully shuts down.
dmesg log:
navi10rstbrk.txt (122.0 KB)

and here is me trying multiple cycles of virsh destroy and then virsh start, with no success.
dmesg log:
navi10rstbrk2.txt (237.5 KB)

UPDATE: turns out i just forgot to update the guest’s initrd.
i find that gracefully shutting down the guest and then trying virsh start fails to reset navi10, but “virsh reset” while the guest is booted in that broken state works fine. i will run more tests.

UPDATE2: i've found that the only way to reset the guest is to allow it into a broken state, then to use virsh reset. here is the dmesg from me demonstrating this: Navi10rstwrkg.txt (151.5 KB)
the relevant parts can be found around the timestamps: 181.670569, 212.341250, 225.632272, 348.511247, and maybe 429.401944.
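
The workaround described above (start the guest, let it come up in the broken state, then virsh reset it) could be wrapped roughly like this. This is only a sketch: the 30 second default delay is a guess rather than a measured value, and start_then_reset and the VIRSH override are names invented for illustration.

```shell
# Start a libvirt domain, wait, then issue "virsh reset" on it.
# Usage: start_then_reset <domain> [delay-seconds]
# VIRSH may be set to another command (hypothetical knob, handy for dry runs).
start_then_reset() {
    dom=$1
    delay=${2:-30}    # assumed wait for the guest to reach the broken state
    ${VIRSH:-virsh} start "$dom" || return 1
    sleep "$delay"
    ${VIRSH:-virsh} reset "$dom"
}
```

Setting VIRSH=echo before calling it prints the two virsh commands instead of running them.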

it's been a few months. still no stable fix for the reset bug?