6700XT - reset bug?

Hello all,

I just bought a new GPU (Gigabyte Aorus 6700XT) and installed it alongside my old Sapphire Nitro RX480. I used to do single-GPU passthrough of the RX480 and work in a Fedora VM which was quite stable (I use Proxmox on the host with the vendor-reset patch).

Now that I have both GPUs installed, I want to use the 6700XT on the VM.

The 6700XT is in PCI slot 1 and the RX480 is in PCI slot 2. I’ve set my BIOS to initialize PCI slot 2 (so BIOS and boot loader send their output to the RX480). I still have the kernel param to disable the EFI framebuffer output:

root@pve:~# cat /efi/sda/loader/loader.conf
timeout 3
default nogpu-proxmox-*
root@pve:~# cat /efi/sda/loader/entries/nogpu-proxmox-5.13.19-2-pve.conf
title    Proxmox Virtual Environment (headless)
version  5.13.19-2-pve
options  root=ZFS=rpool/ROOT/pve-1 boot=zfs amd_iommu=on video=efifb:off
linux    /EFI/proxmox/5.13.19-2-pve/vmlinuz-5.13.19-2-pve
initrd   /EFI/proxmox/5.13.19-2-pve/initrd.img-5.13.19-2-pve
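
A boot entry only shows what the loader is asked to pass; it can be worth confirming what the running kernel actually received. The following is a minimal sketch, assuming the options of interest are `amd_iommu=on` and `video=efifb:off` as in the entry above. The helper name `has_kernel_opt` and its optional second argument (a file to read instead of `/proc/cmdline`, useful for trying it out) are illustrative and not part of any tool.

```shell
#!/bin/sh
# has_kernel_opt OPTION [CMDLINE_FILE]
# Succeeds if OPTION appears as a whole word on the kernel command line.
# CMDLINE_FILE defaults to /proc/cmdline; the override is a hypothetical
# convenience for testing against a saved copy.
has_kernel_opt() {
    # Put each whitespace-separated word on its own line, then look for an
    # exact, fixed-string, whole-line match.
    tr -s ' \t' '\n\n' < "${2:-/proc/cmdline}" | grep -Fxq -- "$1"
}

# Example (run on the host):
#   for o in amd_iommu=on video=efifb:off; do
#       if has_kernel_opt "$o"; then echo "$o: present"; else echo "$o: MISSING"; fi
#   done
```

Matching whole words avoids false positives such as `iommu=on` matching inside `amd_iommu=on`.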

Once running, I can attach my RX480 to Fedora and restart it at will, it works fine.

However, if I attach the 6700XT, it works once, but then if I reboot the VM it never comes back. The screen goes into power saving mode and never turns on again.

Does the 6000 series still have the reset bug? I thought it was fixed for the latest Navi chips?

What can I do to debug this?

Here is my lspci output:

root@pve:~# lspci -nnk
00:00.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Root Complex [1022:1480]
        Subsystem: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Root Complex [1022:1480]
00:00.2 IOMMU [0806]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse IOMMU [1022:1481]
        Subsystem: Advanced Micro Devices, Inc. [AMD] Starship/Matisse IOMMU [1022:1481]
00:01.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
00:01.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge [1022:1483]
        Kernel driver in use: pcieport
00:02.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
00:03.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
00:03.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge [1022:1483]
        Kernel driver in use: pcieport
00:03.2 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge [1022:1483]
        Kernel driver in use: pcieport
00:04.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
00:05.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
00:07.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
00:07.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B] [1022:1484]
        Kernel driver in use: pcieport
00:08.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge [1022:1482]
00:08.1 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B] [1022:1484]
        Kernel driver in use: pcieport
00:14.0 SMBus [0c05]: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller [1022:790b] (rev 61)
        Subsystem: Gigabyte Technology Co., Ltd FCH SMBus Controller [1458:5001]
        Kernel modules: i2c_piix4, sp5100_tco
00:14.3 ISA bridge [0601]: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge [1022:790e] (rev 51)
        Subsystem: Gigabyte Technology Co., Ltd FCH LPC Bridge [1458:5001]
00:18.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 0 [1022:1440]
00:18.1 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 1 [1022:1441]
00:18.2 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 2 [1022:1442]
00:18.3 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 3 [1022:1443]
        Kernel driver in use: k10temp
        Kernel modules: k10temp
00:18.4 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 4 [1022:1444]
00:18.5 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 5 [1022:1445]
00:18.6 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 6 [1022:1446]
00:18.7 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD] Matisse Device 24: Function 7 [1022:1447]
01:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Matisse Switch Upstream [1022:57ad]
        Kernel driver in use: pcieport
02:03.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Matisse PCIe GPP Bridge [1022:57a3]
        Kernel driver in use: pcieport
02:04.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Matisse PCIe GPP Bridge [1022:57a3]
        Kernel driver in use: pcieport
02:05.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Matisse PCIe GPP Bridge [1022:57a3]
        Kernel driver in use: pcieport
02:08.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Matisse PCIe GPP Bridge [1022:57a4]
        Kernel driver in use: pcieport
02:09.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Matisse PCIe GPP Bridge [1022:57a4]
        Kernel driver in use: pcieport
02:0a.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD] Matisse PCIe GPP Bridge [1022:57a4]
        Kernel driver in use: pcieport
03:00.0 USB controller [0c03]: ASMedia Technology Inc. Device [1b21:3241]
        Subsystem: Gigabyte Technology Co., Ltd Device [1458:5007]
        Kernel driver in use: xhci_hcd
        Kernel modules: xhci_pci
04:00.0 Network controller [0280]: Intel Corporation Wi-Fi 6 AX200 [8086:2723] (rev 1a)
        Subsystem: Intel Corporation Wi-Fi 6 AX200 [8086:0084]
        Kernel driver in use: iwlwifi
        Kernel modules: iwlwifi
05:00.0 Ethernet controller [0200]: Intel Corporation Ethernet Controller I225-V [8086:15f3] (rev 01)
        Subsystem: Gigabyte Technology Co., Ltd Ethernet Controller I225-V [1458:e000]
        Kernel driver in use: igc
        Kernel modules: igc
06:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Reserved SPP [1022:1485]
        Subsystem: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Reserved SPP [1022:1485]
06:00.1 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Matisse USB 3.0 Host Controller [1022:149c]
        Subsystem: Advanced Micro Devices, Inc. [AMD] Matisse USB 3.0 Host Controller [1022:1486]
        Kernel driver in use: xhci_hcd
        Kernel modules: xhci_pci
06:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Matisse USB 3.0 Host Controller [1022:149c]
        Subsystem: Advanced Micro Devices, Inc. [AMD] Matisse USB 3.0 Host Controller [1022:148c]
        Kernel driver in use: xhci_hcd
        Kernel modules: xhci_pci
07:00.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 51)
        Subsystem: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901]
        Kernel driver in use: ahci
        Kernel modules: ahci
08:00.0 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 51)
        Subsystem: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901]
        Kernel driver in use: ahci
        Kernel modules: ahci
09:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Upstream Port of PCI Express Switch [1002:1478] (rev c1)
        Kernel driver in use: pcieport
0a:00.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 10 XL Downstream Port of PCI Express Switch [1002:1479]
        Kernel driver in use: pcieport
0b:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 22 [1002:73df] (rev ff)
        Kernel driver in use: vfio-pci
        Kernel modules: amdgpu
0b:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Device [1002:ab28] (rev ff)
        Kernel driver in use: vfio-pci
        Kernel modules: snd_hda_intel
0c:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] [1002:67df] (rev c7)
        Subsystem: PC Partner Limited / Sapphire Technology Radeon RX 470/480 [174b:e347]
        Kernel driver in use: vfio-pci
        Kernel modules: amdgpu
0c:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere HDMI Audio [Radeon RX 470/480 / 570/580/590] [1002:aaf0]
        Subsystem: PC Partner Limited / Sapphire Technology Ellesmere HDMI Audio [Radeon RX 470/480 / 570/580/590] [174b:aaf0]
        Kernel driver in use: vfio-pci
        Kernel modules: snd_hda_intel
0d:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Function [1022:148a]
        Subsystem: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Function [1022:148a]
0e:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Reserved SPP [1022:1485]
        Subsystem: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Reserved SPP [1022:1485]
0e:00.1 Encryption controller [1080]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Cryptographic Coprocessor PSPCPP [1022:1486]
        Subsystem: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Cryptographic Coprocessor PSPCPP [1022:1486]
        Kernel driver in use: ccp
        Kernel modules: ccp
0e:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Matisse USB 3.0 Host Controller [1022:149c]
        Subsystem: Gigabyte Technology Co., Ltd Matisse USB 3.0 Host Controller [1458:5007]
        Kernel driver in use: xhci_hcd
        Kernel modules: xhci_pci
0e:00.4 Audio device [0403]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse HD Audio Controller [1022:1487]
        Subsystem: Gigabyte Technology Co., Ltd Starship/Matisse HD Audio Controller [1458:a0d2]
        Kernel modules: snd_hda_intel
root@pve:~#

6xxx series cards don’t need the reset module; you likely have a BIOS setting that isn’t configured properly.

Hi @robbbot

Thanks for confirming. I was sort of panicking, as I skipped the 5000 series due to this bug, and then prices went sky-high and I had to save until I could get this card.

So at least it should be fixable. Anything in particular I should be looking for in my BIOS? The PCI slots 1 and 2 are bifurcated so both the 6700XT (slot1) and RX480 (slot2) run with x8 lanes. I’m not sure what speeds are used as one is capable of Gen4 and the other Gen3. Should I try forcing both to Gen3 speeds? Is it something else I should look for?
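
The negotiated link speed and width do not have to be guessed: the kernel exposes them per device in sysfs. Below is a small sketch, assuming a kernel recent enough to provide the `current_link_speed`, `max_link_speed`, `current_link_width` and `max_link_width` attributes. The function name `pcie_link_info` and the optional sysfs-root argument are illustrative only. Since the 6700XT sits behind its own on-board switch (09:00.0 and 0a:00.0 in the lspci output above), the slot link is probably the one reported for 0000:09:00.0 rather than for the GPU function itself.

```shell
#!/bin/sh
# pcie_link_info BDF [SYSFS_ROOT]
# Prints the negotiated and maximum PCIe link speed/width for one device.
# SYSFS_ROOT defaults to /sys/bus/pci/devices (override is for testing only).
pcie_link_info() {
    dev="${2:-/sys/bus/pci/devices}/$1"
    for attr in current_link_speed max_link_speed current_link_width max_link_width; do
        if [ -r "$dev/$attr" ]; then
            printf '%s: %s\n' "$attr" "$(cat "$dev/$attr")"
        else
            # Attribute missing: older kernel, or not a PCIe device
            printf '%s: unavailable\n' "$attr"
        fi
    done
}

# Example (run on the host):
#   pcie_link_info 0000:09:00.0
#   pcie_link_info 0000:0c:00.0
```

Many GPUs lower their link speed when idle, so a reading below the maximum is not necessarily a problem.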

Appreciate any suggestions…

I passthrough a 6700XT without issues. Note that it requires a vbios to be set, unlike my old RX 460 I was using previously.

Not sure what your issue is. One thing to try is turning off “Link State Power Management” in the BIOS and possibly ASPM in your boot settings. While not ideal for power consumption, historically doing this has resolved weird crashes, so it’s worth a try.

Can you elaborate on what this is?

I have a window to return the card until the 24th of February. What make/model of 6700XT are you using?

I will go into the BIOS and try to search for those settings.

So, looking at my BIOS settings I found ASPM was already disabled. So I’ve left it like that.

I also found the following message in the kernel logs when I shut down the VM using the card:

Feb 10 21:28:36 pve kernel: vfio-pci 0000:0b:00.1: refused to change power state from D0 to D3hot
Feb 10 21:28:36 pve kernel: vfio-pci 0000:0b:00.0: refused to change power state from D0 to D3hot

The second time I try to use the GPU by starting the VM without rebooting the host I see this when starting:

Feb 10 21:29:16 pve QEMU[23002]: kvm: vfio-pci: Cannot read device rom at 0000:0b:00.0
Feb 10 21:29:16 pve QEMU[23002]: Device option ROM contents are probably invalid (check dmesg).
Feb 10 21:29:16 pve QEMU[23002]: Skip option ROM probe with rombar=0, or load from file with romfile=
Feb 10 21:29:16 pve kernel: vfio-pci 0000:0b:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
Feb 10 21:29:16 pve kernel: vfio-pci 0000:0b:00.0: vfio_bar_restore: reset recovery - restoring BARs
Feb 10 21:29:16 pve kernel: vfio-pci 0000:0b:00.0: vfio_bar_restore: reset recovery - restoring BARs
...

When I try to shut down the 2nd incarnation of the VM I see:

Feb 10 21:29:16 pve QEMU[23002]: kvm: vfio-pci: Cannot read device rom at 0000:0b:00.0
Feb 10 21:29:16 pve QEMU[23002]: Device option ROM contents are probably invalid (check dmesg).
Feb 10 21:29:16 pve QEMU[23002]: Skip option ROM probe with rombar=0, or load from file with romfile=
Feb 10 21:29:16 pve kernel: vfio-pci 0000:0b:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
Feb 10 21:29:16 pve kernel: vfio-pci 0000:0b:00.0: vfio_bar_restore: reset recovery - restoring BARs
Feb 10 21:29:16 pve kernel: vfio-pci 0000:0b:00.0: vfio_bar_restore: reset recovery - restoring BARs

I was able to try and start it a 3rd time but this time it failed to start and got completely stuck (requiring a host reboot):

Feb 10 21:31:50 pve pvedaemon[27188]: start VM 101: UPID:pve:00006A34:0000E7B7:62058446:qmstart:101:[email protected]:
Feb 10 21:31:50 pve kernel: vfio-pci 0000:0b:00.0: timed out waiting for pending transaction; performing function level >
Feb 10 21:31:52 pve kernel: vfio-pci 0000:0b:00.0: not ready 1023ms after FLR; waiting
Feb 10 21:31:53 pve kernel: vfio-pci 0000:0b:00.0: not ready 2047ms after FLR; waiting
Feb 10 21:31:55 pve kernel: vfio-pci 0000:0b:00.0: not ready 4095ms after FLR; waiting
Feb 10 21:31:59 pve kernel: vfio-pci 0000:0b:00.0: not ready 8191ms after FLR; waiting
Feb 10 21:32:08 pve kernel: vfio-pci 0000:0b:00.0: not ready 16383ms after FLR; waiting
...
Feb 10 21:34:11 pve kernel: vfio-pci 0000:0b:00.0: not ready 65535ms after FLR; giving up
Feb 10 21:34:11 pve kernel: vfio-pci 0000:0b:00.0: vfio_cap_init: hiding cap [email protected]
Feb 10 21:34:11 pve kernel: vfio-pci 0000:0b:00.0: vfio_cap_init: hiding cap [email protected]
Feb 10 21:34:11 pve kernel: vfio-pci 0000:0b:00.0: vfio_cap_init: hiding cap [email protected]
...
Feb 10 21:34:11 pve kernel: vfio-pci 0000:0b:00.0: can't change power state from D3cold to D0 (config space inaccessible)
Feb 10 21:34:11 pve pvestatd[4032]: VM 101 qmp command failed - VM 101 not running
Feb 10 21:34:12 pve kernel: vfio-pci 0000:0b:00.1: can't change power state from D0 to D3hot (config space inaccessible)
Feb 10 21:34:12 pve kernel: vfio-pci 0000:0b:00.0: invalid power transition (from D3cold to D3hot)

Edit: I use a reference RX 6700; I forget the actual make.
Some vbios references

https://pve.proxmox.com/wiki/Pci_passthrough#The_.27romfile.27_Option

https://forum.proxmox.com/threads/gpu-passthrough-tutorial-reference.34303

https://www.reddit.com/r/homelab/comments/b5xpua/the_ultimate_beginners_guide_to_gpu_passthrough/ See Step 4a (Optional): ROM File Issues

Thank you for the pointers. That first link looks quite promising. The first sign of trouble (when I boot the VM the second time) is related to the ROM:

I will try and move the card to the second slot this weekend to capture its BIOS into a file and then move it back.

From the reference though:

Some motherboards can’t passthrough GPUs on the first PCI(e) slot by default, because its vbios is shadowed during bootup. You need to capture its vBIOS when its working “normally” (i.e. installed in a different slot), then you can move the card to slot 1 and start the vm using the dumped vBIOS.

This makes it sound like leaving the card in the second slot and moving the RX480 to the first slot may also work (even without dumping the vBIOS). I will try both approaches. I wonder if having the RX480 on the first slot will cause that to stop working and the 6700XT to get fixed…

First of all, I have installed the RX480 in the first PCI slot and the 6700XT in the second one.

The behaviour is the same:

  • I can still use the RX480 in VMs which I can start/stop at will (vendor-reset works for it) and even switch from one VM using it to another and back (e.g. use it in a Windows VM, then stop that and use it in another Linux VM and then back again).
  • I can only use the 6700XT once, after that it becomes unusable (even if simply restarting the VM where it is working).

At this point I wanted to try and grab the BIOS of the 6700XT, so with a fresh boot I went to the host system’s command line (the host was displaying on the RX480 via the text console) and tried:

root@pve:~# lspci -nnk | grep 0c:
0c:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Navi 22 [1002:73df] (rev c1)
0c:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Device [1002:ab28]
root@pve:/sys/bus/pci/devices/0000:0c:00.0# cd /sys/bus/pci/devices/0000\:0c\:00.0/
root@pve:/sys/bus/pci/devices/0000:0c:00.0# echo 1 > rom
root@pve:/sys/bus/pci/devices/0000:0c:00.0# cat rom > /usr/share/kvm/aorus-vbios.bin
cat: rom: Input/output error
root@pve:/sys/bus/pci/devices/0000:0c:00.0# dmesg | tail -n 1
[   52.574915] vfio-pci 0000:0c:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff

When trying to access it I noticed that last message on the screen so I used dmesg|tail -n 1 to show it here.

Any idea why I can’t save that BIOS?

@Log update: I booted from a Fedora 35 USB stick and was able to save my BIOS. I now use the card in the second slot with:

root@pve:/etc/pve/qemu-server# grep aorus 101.conf
hostpci0: 0000:0c:00,pcie=1,x-vga=1,romfile=aorus-6700xt-elite.bin
root@pve:/etc/pve/qemu-server# ls -l /usr/share/kvm/aorus-6700xt-elite.bin
-rw-r--r-- 1 root root 121856 Feb 11 13:09 /usr/share/kvm/aorus-6700xt-elite.bin
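
Before pointing `romfile=` at a dump, a quick sanity check is to confirm the file starts with the PCI option ROM signature. The kernel message earlier in the thread expects the 16-bit value 0xaa55, which on disk is the byte 0x55 followed by 0xaa. The sketch below only checks those two bytes; it says nothing about whether the rest of the image is intact. The function name `rom_has_pci_signature` is made up for illustration.

```shell
#!/bin/sh
# rom_has_pci_signature FILE
# Succeeds if FILE begins with the PCI option ROM signature bytes 0x55 0xAA.
rom_has_pci_signature() {
    # Dump the first two bytes as hex and strip od's spacing, giving e.g. "55aa"
    sig=$(od -An -tx1 -N2 "$1" | tr -d ' \n')
    [ "$sig" = "55aa" ]
}

# Example (path taken from the listing above):
#   if rom_has_pci_signature /usr/share/kvm/aorus-6700xt-elite.bin; then
#       echo "signature OK"
#   else
#       echo "not a valid option ROM header"
#   fi
```

A dump taken while the ROM reads back as all ones (the 0xffff case from dmesg) would fail this check.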

Still the same weird behavior… The “function level reset” does not work on the card. When I restart it:

Feb 11 13:18:24 pve pvedaemon[4297]: <[email protected]> starting task UPID:pve:000080BF:00012F0A:62066220:qmstart:101:[email protected]:
Feb 11 13:18:25 pve kernel: vfio-pci 0000:0c:00.0: timed out waiting for pending transaction; performing function level >
Feb 11 13:18:26 pve kernel: vfio-pci 0000:0c:00.0: not ready 1023ms after FLR; waiting
Feb 11 13:18:27 pve kernel: vfio-pci 0000:0c:00.0: not ready 2047ms after FLR; waiting
Feb 11 13:18:30 pve kernel: vfio-pci 0000:0c:00.0: not ready 4095ms after FLR; waiting
Feb 11 13:18:34 pve kernel: vfio-pci 0000:0c:00.0: not ready 8191ms after FLR; waiting
Feb 11 13:18:42 pve kernel: vfio-pci 0000:0c:00.0: not ready 16383ms after FLR; waiting
Feb 11 13:19:00 pve kernel: vfio-pci 0000:0c:00.0: not ready 32767ms after FLR; waiting
Feb 11 13:19:35 pve kernel: vfio-pci 0000:0c:00.0: not ready 65535ms after FLR; giving up

@robbbot :

By the way, the card is a Gigabyte AORUS Radeon RX 6700 XT ELITE 12G. I can’t believe my luck - does this really still have the FLR bug?

I have until 24th Feb to return this…

Like I said, you probably have something screwy set in your BIOS. Unset/toggle bifurcation, resizable BAR, above 4G decoding and ACS, along with some others I’ve documented here: Odds and Ends - Dev Oops - All things Arch, Debian and Python

The problem I imagine isn’t with the card but your configuration. You likely don’t need to mess with the vbios for any of this.

Sadly all my current options match what is on that page.

I have disabled:

  • ASPM
  • resizable BAR
  • above 4g decoding
  • ACS
  • ARI
  • 10-bit tag support
  • GMI/xGMI encryption control

I even removed the vendor-reset extension and left just one card (the 6700XT) installed alone on the machine…

It keeps throwing this error at shutdown:

Feb 11 18:09:28 pve kernel: vfio-pci 0000:0b:00.1: refused to change power state from D0 to D3hot
Feb 11 18:09:28 pve kernel: vfio-pci 0000:0b:00.0: refused to change power state from D0 to D3hot

Once that happens, any attempt to restart a VM using the 6700XT results in the failures to perform FLR…
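
To compare runs while toggling BIOS settings, it may help to reduce the kernel log to a count per symptom. The sketch below matches only message fragments quoted in this thread. The function name `classify_reset_symptoms` and the output labels are invented for illustration, and the patterns may need adjusting for other kernel versions.

```shell
#!/bin/sh
# classify_reset_symptoms
# Reads kernel log text on stdin and prints how often each symptom appears.
classify_reset_symptoms() {
    awk '
        /refused to change power state/     { d3++ }    # card would not enter D3hot
        /Invalid PCI ROM header signature/  { rom++ }   # ROM read back as garbage
        /reset recovery - restoring BARs/   { bars++ }  # vfio restoring BARs
        /not ready [0-9]+ms after FLR/      { flr++ }   # function level reset stalled
        END {
            printf "power_state_refused=%d\n", d3
            printf "rom_signature_invalid=%d\n", rom
            printf "bar_restore=%d\n", bars
            printf "flr_not_ready=%d\n", flr
        }'
}

# Example (kernel messages from the current boot):
#   journalctl -k -b | classify_reset_symptoms
```

Non-zero counts after a setting change mean the change did not remove that symptom.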

This is frustrating. Out of desperation I have:

  1. Updated my Gigabyte Aero G X570S motherboard to the latest F3 BIOS
  2. Uninstalled the “vendor-reset” kernel patch which I was using for the RX480

I have no idea what to try next…

Maybe try different combos of the settings being enabled, disabled and auto (which is different from enabled). None of the 6XXX cards need vendor-reset at all, which just leaves us with your configuration and your BIOS.

It also definitely didn’t work for me with bifurcation “enabled”, I think I had to leave on auto, but that likely is different between vendors / boards.

For my 6800xt I had to reset my bios to default and re-apply the settings to get it working properly. However I didn’t have the vendor-reset module in there since I was using my Vega64 as my host card.

Thanks @robbbot I will try different combinations but I am wondering why the 6700XT would have an issue with some BIOS setting when the RX480 just works.

At the moment I have removed the vendor-reset module and also the RX480, so all testing is now done with just the 6700XT.

Regarding bifurcation: my motherboard has 3 PCI slots, the first two share 16 lanes so when I install a card in those two I would get x8 lanes in each. This however is irrelevant now as I am using only the 6700XT with no other card installed.

Having said that, when using both cards along with vendor-reset I tried both Auto (original setting) and manually setting 2x8 (which is what Auto ends up doing really) and kept having the same issue with the 6700xt (whereas the RX480 just works every time)…

I will keep trying combinations and maybe ask on reddit as well and see if the audience there has any ideas…

Thanks again!

I have made some progress (maybe)?

While searching for the message “refused to change power state from” I came across this post in the Proxmox forums which suggested enabling the disable_idle_d3 option to avoid this, so I’ve now added this:

root@pve:~# cat /etc/modprobe.d/vfio.conf
options vfio-pci ids=1002:73df,1002:ab28,1002:67df,1002:aaf0 disable_idle_d3=1

This improved the situation in that the card no longer gets stuck in the power state, but still can only boot once. It’s just that all the messages regarding power-related failures are no longer showing up, so none of this anymore:
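
A modprobe option only helps if the loaded module actually picked it up, so it may be worth reading the value back. This is a sketch under the assumption that vfio-pci exposes its parameters under `/sys/module/vfio_pci/parameters`, where boolean parameters read as `Y` or `N`. The function name `vfio_idle_d3_disabled` and its optional directory argument are illustrative.

```shell
#!/bin/sh
# vfio_idle_d3_disabled [PARAM_DIR]
# Prints "yes", "no" or "unknown" depending on the loaded module's setting.
# PARAM_DIR defaults to /sys/module/vfio_pci/parameters (override for testing).
vfio_idle_d3_disabled() {
    param="${1:-/sys/module/vfio_pci/parameters}/disable_idle_d3"
    if [ ! -r "$param" ]; then
        # Module not loaded, or parameter not exposed on this kernel
        echo "unknown"
        return 0
    fi
    case "$(cat "$param")" in
        Y|1) echo "yes" ;;
        *)   echo "no"  ;;
    esac
}

# Example (run on the host after a reboot):
#   vfio_idle_d3_disabled
```

If this prints "no" despite the modprobe.d entry, one possible cause on Debian-based hosts is that vfio-pci is loaded from the initramfs, in which case the initramfs would need regenerating before the option takes effect.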

Feb 11 21:41:08 pve kernel: vfio-pci 0000:0b:00.0: refused to change power state from D0 to D3hot

Feb 11 21:44:38 pve QEMU[71457]: kvm: vfio: Unable to power on device, stuck in D3

Feb 11 21:47:28 pve kernel: vfio-pci 0000:0b:00.1: can't change power state from D0 to D3hot (config space inaccessible)

Now, I still see these the second time I try to start the VM:

Feb 12 15:22:04 pve kernel: vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap [email protected]
Feb 12 15:22:04 pve kernel: vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap [email protected]
Feb 12 15:22:04 pve kernel: vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap [email protected]
Feb 12 15:22:04 pve kernel: vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap [email protected]
Feb 12 15:22:05 pve kernel: vfio-pci 0000:0b:00.1: vfio_bar_restore: reset recovery - restoring BARs
Feb 12 15:22:05 pve kernel: vfio-pci 0000:0b:00.0: vfio_bar_restore: reset recovery - restoring BARs
# ... restoring BARs repeats several more times

And I still see these the third time I try to start the VM:

Feb 12 15:24:03 pve kernel: vfio-pci 0000:0b:00.0: timed out waiting for pending transaction; performing function level reset anyway
Feb 12 15:24:05 pve kernel: vfio-pci 0000:0b:00.0: not ready 1023ms after FLR; waiting
...

I will try downgrading my motherboard BIOS as suggested in that thread to see if it works. I had version F1 and updated to F3 but there is F2 to try.

…and the ugly hack seems to allow me to restart the VM successfully without a full reboot of the host!

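# Detach both functions of the 6700XT (VGA and HDMI audio) from the PCI bus
echo "1" | tee -a /sys/bus/pci/devices/0000\:0b\:00.0/remove
echo "1" | tee -a /sys/bus/pci/devices/0000\:0b\:00.1/remove
echo COMPUTER WILL GO TO SLEEP, PRESS POWER BUTTON TO RESUME
sleep 2
# Suspend to RAM; execution continues here after resume
echo -n mem > /sys/power/state
# Re-enumerate the bus so the card is detected again
echo "1" | tee -a /sys/bus/pci/rescan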

Basically you detach the 6700XT, do a fast sleep/resume (I need to press the power button) then rescan the PCI bus to re-detect the 6700XT. This works, I can then start another VM that uses it.

This is not ideal but it seems to indicate I need to try different motherboard BIOS versions in the hope they don’t exhibit this, and also look into anything power-related in the BIOS to try and tinker with…

Nice persistence! I’ll have to bookmark this for others that may encounter this.

At this point I am also treating this as a notepad.

Another interesting thing I found was this post about how to power-cycle PCIe devices from the perspective of an FPGA developer. I tried the script to see if I could power-cycle the 6700XT without suspending the entire machine, but unfortunately it did not work (the rescan does not re-attach it)…

EDIT: also this nice reference of the kernel’s sysfs filesystem interface to the PCI bus and a reference to the kernel’s power management and the D0-D3 states.

Can confirm that you’re not the only one with this issue. I have an Asus Dual RX 6700XT (Navi 22) with the exact same problem on an Asus Prime X570-Pro. The card refuses to go into the D3hot state again after shutdown. Turning the VM on again after a VM shutdown causes a black screen (code 43 behind the scenes) due to the “restoring BARs” problem. On the third attempt to start the VM it goes into the classic Unknown PCI header type ‘127’ for device ‘0000:0b:00.0’ problem and stays completely locked until I detach the devices, sleep the computer, wake it back up and rescan.
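
A header type of 127 (0x7f) is what results when every config-space read returns all ones, which is consistent with the “config space inaccessible” messages earlier in the thread. One way to test for that state before starting the VM is to read the vendor ID from the device's sysfs `config` file. The sketch below assumes that file is readable; the function name `pci_device_responds` and the optional sysfs-root argument are illustrative.

```shell
#!/bin/sh
# pci_device_responds BDF [SYSFS_ROOT]
# Succeeds if the first two config-space bytes (the vendor ID) are not 0xffff.
# Returns 2 if the device has no readable config file at all.
pci_device_responds() {
    cfg="${2:-/sys/bus/pci/devices}/$1/config"
    [ -r "$cfg" ] || return 2
    # Vendor ID is stored little-endian, e.g. "0210" for AMD/ATI (1002)
    vendor=$(od -An -tx1 -N2 "$cfg" | tr -d ' \n')
    [ -n "$vendor" ] && [ "$vendor" != "ffff" ]
}

# Example (run on the host before starting the VM):
#   if pci_device_responds 0000:0b:00.0; then
#       echo "card answers on the bus"
#   else
#       echo "card is not responding; a reset workaround is needed first"
#   fi
```

This only detects the fully locked state; a card stuck in the earlier “restoring BARs” stage may still answer config reads.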

I’ve attempted multiple things, most of the same stuff you have: BIOS settings, a different PCI slot, kernel parameters, setting all the different VBIOS ROMs on the VM, blacklisting amdgpu, and disabling/unloading the driver in Windows before shutting down, plus several other “fix” projects that do the same. I’ve even attempted to force the card into the D3 state using setpci, which won’t work as AMD has some sort of special way of putting the card into that mode (hence vendor-reset).

After scouring the internet for hours I keep seeing this about the 6000 series not having this issue and denying it promptly. Yet people are having this issue with different 6000 vendors and the symptoms are identical to the classic reset-bug of the previous Navi iterations. I’m beginning to sincerely disbelieve that the issue is BIOS configuration or anything other than the reset bug being present in certain vendors’ cards, as my GTX 1080 passed completely fine on this motherboard until it died.

I am, just like you, willing to take any tips or hints. But at this point I’ve more or less given up as I cannot find a single point of reference anymore after combing the net for help. At least I can give one thing which is a mildly better reset script:

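#!/bin/bash
# Detach both functions of the card from the PCI bus
echo 1 > /sys/bus/pci/devices/0000:0b:00.0/remove
echo 1 > /sys/bus/pci/devices/0000:0b:00.1/remove
echo "Suspending..."
# Arm the RTC alarm to wake the machine in 4 seconds without suspending yet
rtcwake -m no -s 4
systemctl suspend
# Give the system time to suspend and resume before rescanning
sleep 5s
# Re-enumerate the bus so the card is detected again
echo 1 > /sys/bus/pci/rescan
echo "Reset done"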

This one will wake itself up after suspending, so you don’t have to press any buttons. It’s an ugly hack, but it is the only solution I’ve come up with unfortunately.
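
For anyone adapting the workaround to a different slot, here is a parameterised sketch of the same remove, suspend, rescan sequence. It is not a tested fix. It swaps in `rtcwake -m mem`, which suspends to RAM and arms the RTC alarm in one call, in place of the separate `rtcwake -m no` and `systemctl suspend` steps. The function names, the `DRY_RUN` and `WAKE_AFTER` variables, and the 5-second default are all assumptions made for illustration.

```shell
#!/bin/sh
# suspend_reset_gpu BDF [BDF...]
#   DRY_RUN=1     print the steps instead of executing them
#   WAKE_AFTER=N  seconds until the RTC wakes the host (default 5, arbitrary)

# Run one step, or just print it when DRY_RUN=1
_gpu_step() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        sh -c "$*"
    fi
}

suspend_reset_gpu() {
    if [ "$#" -eq 0 ]; then
        echo "usage: suspend_reset_gpu BDF [BDF...]" >&2
        return 2
    fi
    # Detach every listed function (e.g. VGA and HDMI audio) from the bus
    for bdf in "$@"; do
        _gpu_step "echo 1 > /sys/bus/pci/devices/$bdf/remove" || return 1
    done
    # Suspend to RAM; returns once the RTC alarm has woken the host
    _gpu_step "rtcwake -m mem -s ${WAKE_AFTER:-5}" || return 1
    # Re-enumerate the bus so the card is detected again
    _gpu_step "echo 1 > /sys/bus/pci/rescan" || return 1
    if [ "${DRY_RUN:-0}" = 1 ]; then
        return 0
    fi
    # Confirm every function reappeared after the rescan
    for bdf in "$@"; do
        if [ ! -e "/sys/bus/pci/devices/$bdf" ]; then
            echo "$bdf did not reappear after rescan" >&2
            return 1
        fi
    done
}

# Example (as root, with the VM stopped):
#   DRY_RUN=1 suspend_reset_gpu 0000:0b:00.0 0000:0b:00.1   # preview first
#   suspend_reset_gpu 0000:0b:00.0 0000:0b:00.1
```

Previewing with `DRY_RUN=1` first is a cheap way to double-check the device addresses, since removing the wrong device (a storage or network controller, say) from a running host could be disruptive.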

Thank you so much for sharing. If you ever do find some way of resetting the card please post back on this thread. I will do the same.

To be frank though, I am seriously considering just returning the card. It was too expensive, so I’d expect it to work properly.