6700XT - reset bug?

@Log update: I booted from a Fedora 35 USB stick and was able to save my BIOS. I now use the card in the second slot with:

root@pve:/etc/pve/qemu-server# grep aorus 101.conf
hostpci0: 0000:0c:00,pcie=1,x-vga=1,romfile=aorus-6700xt-elite.bin
root@pve:/etc/pve/qemu-server# ls -l /usr/share/kvm/aorus-6700xt-elite.bin
-rw-r--r-- 1 root root 121856 Feb 11 13:09 /usr/share/kvm/aorus-6700xt-elite.bin
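For reference, dumping the vBIOS from the live environment can be sketched via the sysfs `rom` attribute. This is a hedged reconstruction, not my exact commands: the device address and output filename are placeholders for my card, and DRY_RUN=1 (the default here) only prints the commands instead of running them.

```shell
#!/bin/bash
# Hypothetical sketch of dumping a GPU vBIOS via the sysfs "rom" attribute
# (run as root from a live environment). Address and filename are assumptions.
dump_vbios() {
    local dev="${1:-0000:0c:00.0}" out="${2:-aorus-6700xt-elite.bin}"
    local rom="/sys/bus/pci/devices/$dev/rom"
    # DRY_RUN=1 (default) prints the command; DRY_RUN=0 executes it
    run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "+ $*"; else eval "$*"; fi; }
    run "echo 1 > $rom"    # make the ROM BAR readable
    run "cat $rom > $out"  # copy the ROM image out
    run "echo 0 > $rom"    # switch it back off
}
dump_vbios "$@"
```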

Still the same weird behavior… The “function level reset” does not work on the card. When I restart it:

Feb 11 13:18:24 pve pvedaemon[4297]: <root@pam> starting task UPID:pve:000080BF:00012F0A:62066220:qmstart:101:root@pam:
Feb 11 13:18:25 pve kernel: vfio-pci 0000:0c:00.0: timed out waiting for pending transaction; performing function level >
Feb 11 13:18:26 pve kernel: vfio-pci 0000:0c:00.0: not ready 1023ms after FLR; waiting
Feb 11 13:18:27 pve kernel: vfio-pci 0000:0c:00.0: not ready 2047ms after FLR; waiting
Feb 11 13:18:30 pve kernel: vfio-pci 0000:0c:00.0: not ready 4095ms after FLR; waiting
Feb 11 13:18:34 pve kernel: vfio-pci 0000:0c:00.0: not ready 8191ms after FLR; waiting
Feb 11 13:18:42 pve kernel: vfio-pci 0000:0c:00.0: not ready 16383ms after FLR; waiting
Feb 11 13:19:00 pve kernel: vfio-pci 0000:0c:00.0: not ready 32767ms after FLR; waiting
Feb 11 13:19:35 pve kernel: vfio-pci 0000:0c:00.0: not ready 65535ms after FLR; giving up
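Worth noting: whether a card even advertises Function Level Reset shows up in the DevCap line of `lspci -vv` as `FLReset+` or `FLReset-`. A small check, using an illustrative sample line (not captured from my card):

```shell
#!/bin/bash
# Checks an `lspci -vv` DevCap line for the FLReset flag. Real usage would
# be: lspci -vv -s 0c:00.0 | grep DevCap
has_flr() {
    case "$1" in
        *FLReset+*) echo "FLR advertised" ;;
        *FLReset-*) echo "FLR not advertised" ;;
        *)          echo "no FLReset flag found" ;;
    esac
}
# Sample line for illustration only:
sample='DevCap: MaxPayload 256 bytes, PhantFunc 0, ExtTag+ RBE+ FLReset+'
has_flr "$sample"
```

On newer kernels, `/sys/bus/pci/devices/<addr>/reset_method` also lists which reset mechanisms the kernel will try for a device.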

@robbbot :

By the way, the card is a Gigabyte AORUS Radeon RX 6700 XT ELITE 12G. I can’t believe my luck - does this really still have the FLR bug?

I have until 24th Feb to return this…


Like I said, you probably have something screwy set in your BIOS. Unset/toggle bifurcation, resizable BAR, above 4G decoding, and ACS, along with some others I've documented here: Odds and Ends - Dev Oops - All things Arch, Debian and Python

The problem, I imagine, isn't with the card but with your configuration. You likely don't need to mess with the vBIOS for any of this.

Sadly all my current options match what is on that page.

I have disabled:

  • ASPM
  • resizable BAR
  • above 4g decoding
  • ACS
  • ARI
  • 10-bit tag support
  • GMI/xGMI encryption control

I even removed the vendor-reset extension and left just one card (the 6700XT) installed alone on the machine…

It keeps throwing this error at shutdown:

Feb 11 18:09:28 pve kernel: vfio-pci 0000:0b:00.1: refused to change power state from D0 to D3hot
Feb 11 18:09:28 pve kernel: vfio-pci 0000:0b:00.0: refused to change power state from D0 to D3hot

Once that happens, any attempt to restart a VM using the 6700XT results in the failed FLRs shown above…
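To see which power state the card is actually stuck in, recent kernels expose a per-device `power_state` attribute in sysfs. A small sketch (the address is my card's; the helper falls back gracefully if the kernel doesn't expose the attribute):

```shell
#!/bin/bash
# Prints the current PCI power state (D0, D3hot, ...) of a device, if the
# kernel exposes the sysfs attribute. Device address is an assumption.
pci_power_state() {
    local f="/sys/bus/pci/devices/${1}/power_state"
    if [ -r "$f" ]; then
        cat "$f"
    else
        echo "power_state not available for ${1}"
    fi
}
pci_power_state "${1:-0000:0b:00.0}"
```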

This is so frustrating. Out of desperation I have:

  1. Updated my Gigabyte Aero G X570S motherboard to the latest F3 BIOS
  2. Uninstalled the “vendor-reset” kernel patch which I was using for the RX480

I have no idea what to try next…

Maybe try different combos of the settings being enabled, disabled, and auto (which is different from enabled). None of the 6XXX cards need vendor-reset at all, which just leaves us with your configuration and your BIOS.

It also definitely didn't work for me with bifurcation “enabled”; I think I had to leave it on Auto, but that likely differs between vendors/boards.

For my 6800XT I had to reset my BIOS to defaults and re-apply the settings to get it working properly. However, I didn't have the vendor-reset module loaded, since I was using my Vega 64 as my host card.

Thanks @robbbot, I will try different combinations, but I am wondering why the 6700XT would have an issue with some BIOS setting when the RX480 just works.

At the moment I have removed the vendor-reset module and also the RX480, so all testing is now done with just the 6700XT.

Regarding bifurcation: my motherboard has three PCIe slots; the first two share 16 lanes, so with a card in each of those two I get x8 lanes per slot. This is irrelevant now, however, as I am using only the 6700XT with no other card installed.

Having said that, when using both cards along with vendor-reset I tried both Auto (original setting) and manually setting 2x8 (which is what Auto ends up doing really) and kept having the same issue with the 6700xt (whereas the RX480 just works every time)…

I will keep trying combinations and maybe ask on reddit as well and see if the audience there has any ideas…

Thanks again!

I have made some progress (maybe)?

While searching for the message “refused to change power state from” I came across a post on the Proxmox forums which suggested setting the disable_idle_d3 option to avoid this, so I've now added it:

root@pve:~# cat /etc/modprobe.d/vfio.conf
options vfio-pci ids=1002:73df,1002:ab28,1002:67df,1002:aaf0 disable_idle_d3=1
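Note that the option only takes effect once the module reloads; on Proxmox that means rebuilding the initramfs (`update-initramfs -u -k all`) and rebooting. A small check that the running module actually picked it up (falls back gracefully if vfio-pci isn't loaded):

```shell
#!/bin/bash
# Verify that vfio-pci picked up disable_idle_d3 by reading the live module
# parameter from sysfs.
check_idle_d3() {
    local p=/sys/module/vfio_pci/parameters/disable_idle_d3
    if [ -r "$p" ]; then
        echo "disable_idle_d3=$(cat "$p")"
    else
        echo "vfio-pci module not loaded"
    fi
}
check_idle_d3
```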

This improved the situation in that the card no longer gets stuck in the power state, but the VM can still only be booted once. It's just that the power-related failure messages no longer show up, so no more of these:

Feb 11 21:41:08 pve kernel: vfio-pci 0000:0b:00.0: refused to change power state from D0 to D3hot

Feb 11 21:44:38 pve QEMU[71457]: kvm: vfio: Unable to power on device, stuck in D3

Feb 11 21:47:28 pve kernel: vfio-pci 0000:0b:00.1: can't change power state from D0 to D3hot (config space inaccessible)

Now, I still see these the second time I try to start the VM:

Feb 12 15:22:04 pve kernel: vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0x19@0x270
Feb 12 15:22:04 pve kernel: vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0x1b@0x2d0
Feb 12 15:22:04 pve kernel: vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0x26@0x410
Feb 12 15:22:04 pve kernel: vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0x27@0x440
Feb 12 15:22:05 pve kernel: vfio-pci 0000:0b:00.1: vfio_bar_restore: reset recovery - restoring BARs
Feb 12 15:22:05 pve kernel: vfio-pci 0000:0b:00.0: vfio_bar_restore: reset recovery - restoring BARs
# ... restoring BARs repeats several more times

And I still see these the third time I try to start the VM:

Feb 12 15:24:03 pve kernel: vfio-pci 0000:0b:00.0: timed out waiting for pending transaction; performing function level reset anyway
Feb 12 15:24:05 pve kernel: vfio-pci 0000:0b:00.0: not ready 1023ms after FLR; waiting
...

I will try downgrading my motherboard BIOS as suggested in that thread to see if it works. I had version F1 and updated to F3 but there is F2 to try.

…and the ugly hack seems to allow me to restart the VM successfully without a full reboot of the host!

echo "1" | tee -a /sys/bus/pci/devices/0000\:0b\:00.0/remove
echo "1" | tee -a /sys/bus/pci/devices/0000\:0b\:00.1/remove
echo COMPUTER WILL GO TO SLEEP, PRESS POWER BUTTON TO RESUME
sleep 2 
echo -n mem > /sys/power/state
echo "1" | tee -a /sys/bus/pci/rescan

Basically you detach the 6700XT, do a quick sleep/resume (I need to press the power button), then rescan the PCI bus to re-detect the 6700XT. This works; I can then start another VM that uses it.

This is not ideal, but it seems to indicate I need to try different motherboard BIOS versions in the hope one doesn't exhibit this, and also look into anything power-related in the BIOS to tinker with…

Nice persistence! I’ll have to bookmark this for others that may encounter this.

At this point I am also treating this as a notepad.

Another interesting thing I found was this post about how to power-cycle PCIe devices, from the perspective of an FPGA developer. I tried the script to see if I could power-cycle the 6700XT without suspending the entire machine, but unfortunately it did not work (the rescan does not re-attach it)…
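For reference, the no-suspend attempt was essentially just remove + rescan without the sleep. A rough reconstruction of the idea, not the exact script from that post (addresses are my card's; DRY_RUN=1, the default here, only prints the commands):

```shell
#!/bin/bash
# Remove the GPU functions and rescan the bus WITHOUT suspending the host.
# As noted above, this did NOT bring the 6700XT back for me.
pci_cycle() {
    # DRY_RUN=1 (default) prints the command; DRY_RUN=0 executes it
    run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "+ $*"; else eval "$*"; fi; }
    run "echo 1 > /sys/bus/pci/devices/0000:0b:00.0/remove"
    run "echo 1 > /sys/bus/pci/devices/0000:0b:00.1/remove"
    run "sleep 1"
    run "echo 1 > /sys/bus/pci/rescan"
}
pci_cycle
```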

EDIT: also this nice reference on the kernel's sysfs interface to the PCI bus, and a reference on the kernel's power management and the D0-D3 states.

Can confirm that you're not the only one with this issue. I have an Asus Dual RX 6700XT (Navi 22) with the exact same problem on an Asus Prime X570-Pro. The card refuses to go into the D3hot state again after shutdown. Turning the VM on again after a VM shutdown causes a black screen (Code 43 behind the scenes) due to the restoring-BARs problem. On the third attempt to start the VM it'll run into the classic “Unknown PCI header type '127' for device '0000:0b:00.0'” problem and be completely locked until detaching the devices, sleeping the computer, waking it back up, and rescanning.

I've attempted multiple things, most of the same stuff you have: BIOS settings, a different PCIe slot, kernel parameters, setting all the different VBIOS ROMs on the VM, blacklisting amdgpu, disabling/unloading the driver in Windows before shutting down, and several other “fix” projects that do the same. I've even attempted to force the card into the D3 state using setpci, which won't work as AMD has some special way of putting the card into that mode (hence vendor-reset).
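For completeness, forcing D3hot by hand means writing the two PowerState bits of the power-management control/status register (PMCSR, at CAP_PM+4) with setpci from pciutils. A hedged sketch, needing root; DRY_RUN=1 (the default here) only prints the commands, and as said the write doesn't stick on these cards:

```shell
#!/bin/bash
# Attempt to force a device into D3hot via the PM capability's PMCSR
# register. Device address is an assumption; adjust to your card.
force_d3hot() {
    local dev="${1:-0000:0b:00.0}"
    # DRY_RUN=1 (default) prints the command; DRY_RUN=0 executes it
    run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "+ $*"; else eval "$*"; fi; }
    run "setpci -s $dev CAP_PM+4.b=3"  # PowerState bits 1:0 = 3 (D3hot)
    run "setpci -s $dev CAP_PM+4.b"    # read back the PMCSR low byte
}
force_d3hot "$@"
```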

After scouring the internet for hours I keep seeing the claim that the 6000 series doesn't have this issue, promptly denying any reports. Yet people are hitting it with 6000-series cards from different vendors, and the symptoms are identical to the classic reset bug of the previous Navi iterations. I'm beginning to sincerely doubt that the issue is BIOS configuration or anything other than the reset bug being present in certain vendors' cards, as my GTX 1080 passed through completely fine on this motherboard until it died.

I am, just like you, willing to take any tips or hints. But at this point I've more or less given up, as I cannot find a single point of reference anymore after combing the net for help. At least I can offer one thing, a mildly better reset script:

#!/bin/bash
echo 1 > /sys/bus/pci/devices/0000:0b:00.0/remove
echo 1 > /sys/bus/pci/devices/0000:0b:00.1/remove
echo "Suspending..."
rtcwake -m no -s 4
systemctl suspend
sleep 5s
echo 1 > /sys/bus/pci/rescan   
echo "Reset done"

This one will wake itself up after suspending, so you don’t have to press any buttons. It’s an ugly hack, but it is the only solution I’ve come up with unfortunately.

Thank you so much for sharing. If you ever do find some way of resetting the card please post back on this thread. I will do the same.

To be frank though, I am seriously considering just returning the card. It was too expensive for me not to expect it to work properly.

Just located a post about a Sapphire Nitro+ 6700XT also not working. This is now a total of 3 models exhibiting the bug:

  • Gigabyte Aorus 6700XT Elite
  • Asus Dual RX 6700XT
  • Sapphire RX 6700XT Nitro+

It seems to me that Wendell needs to retract that video where he cheered about the 6000 series not having the bug. This is too many samples now from people who seem to know what they're doing (all three of us have other working cards but can't get the 6700XT specifically to work).

Hope Wendell is watching this forum, or somebody brings it to his attention.

An interesting patch has landed which mentions some ASPM fixes for Navi:

Mario Limonciello (4):
      drm/amd: smu7: downgrade voltage error to info
      drm/amd: Check if ASPM is enabled from PCIe subsystem
      drm/amd: Refactor `amdgpu_aspm` to be evaluated per device
      drm/amd: Use amdgpu_device_should_use_aspm on navi umd pstate switching

I may look into compiling a custom kernel for Proxmox but if anyone can try it out and report if they’ve had any luck, please do…

Hey, I encountered the exact same bug, so you can probably extend your list.
I switched from a Sapphire Pulse Vega 56 to a PowerColor Red Devil 6700XT. Like you, I don't know what to try, especially because most people simply say that there's no reset bug on the AMD 6000 generation anymore. Hopefully there either already is some sort of fix or there will be one in the near future.

Chiming in here to add, I encountered the exact same bug when I try to power cycle a VM which was using my Radeon RX 5600XT.

The first boot up works well, then the first poweroff throws these entries:

[13003.863261] vfio-pci 0000:43:00.1: can't change power state from D0 to D3hot (config space inaccessible)
[13003.863323] vfio-pci 0000:43:00.0: can't change power state from D0 to D3hot (config space inaccessible)

The VM shuts down successfully, and Proxmox continues running.
If I try to power up the VM again, I get the following dmesg stream until my Proxmox host goes unresponsive:

[14146.485263] vfio-pci 0000:43:00.0: timed out waiting for pending transaction; performing function level reset anyway
[14147.733256] vfio-pci 0000:43:00.0: not ready 1023ms after FLR; waiting
[14148.789238] vfio-pci 0000:43:00.0: not ready 2047ms after FLR; waiting
[14150.901247] vfio-pci 0000:43:00.0: not ready 4095ms after FLR; waiting
[14155.253193] vfio-pci 0000:43:00.0: not ready 8191ms after FLR; waiting
[14163.701139] vfio-pci 0000:43:00.0: not ready 16383ms after FLR; waiting
[14181.621003] vfio-pci 0000:43:00.0: not ready 32767ms after FLR; waiting
[14216.436746] vfio-pci 0000:43:00.0: not ready 65535ms after FLR; giving up
[14217.445396] device tap101i0 entered promiscuous mode
[14217.484515] fwbr101i0: port 1(fwln101i0) entered blocking state
[14217.484522] fwbr101i0: port 1(fwln101i0) entered disabled state
[14217.484594] device fwln101i0 entered promiscuous mode
[14217.484703] fwbr101i0: port 1(fwln101i0) entered blocking state
[14217.484706] fwbr101i0: port 1(fwln101i0) entered forwarding state
[14217.489338] vmbr0: port 2(fwpr101p0) entered blocking state
[14217.489343] vmbr0: port 2(fwpr101p0) entered disabled state
[14217.489403] device fwpr101p0 entered promiscuous mode
[14217.489460] vmbr0: port 2(fwpr101p0) entered blocking state
[14217.489463] vmbr0: port 2(fwpr101p0) entered forwarding state
[14217.494011] fwbr101i0: port 2(tap101i0) entered blocking state
[14217.494017] fwbr101i0: port 2(tap101i0) entered disabled state
[14217.494119] fwbr101i0: port 2(tap101i0) entered blocking state
[14217.494122] fwbr101i0: port 2(tap101i0) entered forwarding state
[14217.599078] DMAR: VT-d detected Invalidation Time-out Error: SID 0
[14217.599085] DMAR: QI HEAD: Device-TLB Invalidation qw0 = 0x430000000003, qw1 = 0x289ff001
[14217.599131] DMAR: DRHD: handling fault status reg 40
[14217.599176] DMAR: QI PRIOR: Invalidation Wait qw0 = 0x200000025, qw1 = 0x100051ab4
[14217.599261] DMAR: Invalidation Time-out Error (ITE) cleared
[14217.599334] DMAR: VT-d detected Invalidation Time-out Error: SID 0
[14217.599335] DMAR: QI HEAD: IOTLB Invalidation qw0 = 0x200f2, qw1 = 0x28c0000c
[14217.599387] DMAR: DRHD: handling fault status reg 40
[14217.599400] DMAR: QI PRIOR: Invalidation Wait qw0 = 0x200000025, qw1 = 0x100051abc
[14217.599404] DMAR: Invalidation Time-out Error (ITE) cleared
[14217.599540] DMAR: VT-d detected Invalidation Time-out Error: SID 0
[14217.599541] DMAR: QI HEAD: Device-TLB Invalidation qw0 = 0x430000000003, qw1 = 0x28fff001
[14217.599543] DMAR: DRHD: handling fault status reg 40
[14217.599572] DMAR: QI PRIOR: Invalidation Wait qw0 = 0x200000025, qw1 = 0x100051ac4
[14217.599575] DMAR: Invalidation Time-out Error (ITE) cleared
[14217.599745] DMAR: VT-d detected Invalidation Time-out Error: SID 0
[14217.599746] DMAR: QI HEAD: IOTLB Invalidation qw0 = 0x200f2, qw1 = 0x29c0000b
[14217.599747] DMAR: DRHD: handling fault status reg 40
[14217.599787] DMAR: QI PRIOR: Invalidation Wait qw0 = 0x200000025, qw1 = 0x100051acc
[14217.599790] DMAR: Invalidation Time-out Error (ITE) cleared
[14217.600002] DMAR: VT-d detected Invalidation Time-out Error: SID 0
[14217.600002] DMAR: QI HEAD: Device-TLB Invalidation qw0 = 0x430000000003, qw1 = 0x29fff001
[14217.600004] DMAR: DRHD: handling fault status reg 40
[14217.600047] DMAR: QI PRIOR: Invalidation Wait qw0 = 0x200000025, qw1 = 0x100051ad4
[14217.600058] DMAR: Invalidation Time-out Error (ITE) cleared
[14217.600156] DMAR: VT-d detected Invalidation Time-out Error: SID 0
[14217.600202] DMAR: QI HEAD: IOTLB Invalidation qw0 = 0x200f2, qw1 = 0x2a40000b
[14217.600209] DMAR: DRHD: handling fault status reg 40
[14217.600244] DMAR: QI PRIOR: Invalidation Wait qw0 = 0x200000025, qw1 = 0x100051adc
[14217.600357] DMAR: Invalidation Time-out Error (ITE) cleared
[14217.600415] DMAR: VT-d detected Invalidation Time-out Error: SID 0
[14217.600416] DMAR: QI HEAD: Device-TLB Invalidation qw0 = 0x430000000003, qw1 = 0x2a7ff001
[14217.600416] DMAR: DRHD: handling fault status reg 40
[14217.600445] DMAR: QI PRIOR: Invalidation Wait qw0 = 0x200000025, qw1 = 0x100051ae4
[14217.600448] DMAR: Invalidation Time-out Error (ITE) cleared
[14217.600669] DMAR: VT-d detected Invalidation Time-out Error: SID 0
[14217.600670] DMAR: QI HEAD: IOTLB Invalidation qw0 = 0x200f2, qw1 = 0x2ac0000a
[14217.600671] DMAR: DRHD: handling fault status reg 40
[14217.600721] DMAR: QI PRIOR: Invalidation Wait qw0 = 0x200000025, qw1 = 0x100051aec
[14217.600723] DMAR: Invalidation Time-out Error (ITE) cleared
[14217.600925] DMAR: VT-d detected Invalidation Time-out Error: SID 0
[14217.600926] DMAR: QI HEAD: Device-TLB Invalidation qw0 = 0x430000000003, qw1 = 0x2adff001
[14217.600957] DMAR: DRHD: handling fault status reg 40
[14217.600996] DMAR: QI PRIOR: Invalidation Wait qw0 = 0x200000025, qw1 = 0x100051af4
[14217.600999] DMAR: Invalidation Time-out Error (ITE) cleared
[14217.601131] DMAR: VT-d detected Invalidation Time-out Error: SID 0
[14217.601131] DMAR: QI HEAD: IOTLB Invalidation qw0 = 0x200f2, qw1 = 0x2b00000a
[14217.601196] DMAR: QI PRIOR: Invalidation Wait qw0 = 0x200000025, qw1 = 0x100051afc
[14217.601208] DMAR: DRHD: handling fault status reg 40
<... these continue repeating...>

This may not be directly applicable because you're on a 6700XT, but it might seed some ideas…
I fixed my instance of this issue on my Radeon RX 5600XT with the help of the GitHub project gnif/vendor-reset. (I can't post links, likely my account is too new.)

I did an apt install -y dkms pve-headers, then followed the instructions in the README, and it all worked for me.

I used the latest version of the project from the master branch (master is currently pointing to commit 7d43285 from 2021-10-19), and it worked flawlessly for me.
The “released” version (v0.1.0 from 2021-01-23) did not work for me.

I now have no issues power cycling the VM which receives the passthrough Radeon RX 5600XT GPU.
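The install steps above, sketched as a script for anyone following along. This assumes the gnif/vendor-reset repo layout; DRY_RUN=1 (the default here) only prints the commands rather than running them:

```shell
#!/bin/bash
# Sketch of the vendor-reset install on Proxmox, per the project README.
# DRY_RUN=1 (default) prints the command; DRY_RUN=0 executes it.
run() { if [ "${DRY_RUN:-1}" = 1 ]; then echo "+ $*"; else eval "$*"; fi; }

run "apt install -y dkms pve-headers git"
run "git clone https://github.com/gnif/vendor-reset"
run "cd vendor-reset && dkms install ."          # build/install via DKMS
run "sh -c 'echo vendor-reset >> /etc/modules'"  # load at boot
run "modprobe vendor-reset"                      # load now
```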

The vendor-reset module only works for older cards, as far as I know. I used it for my Vega 56 before upgrading to a 6700XT and it worked. I have already tested the 6700XT with and without vendor-reset, but it had no impact at all. If you want more information about my situation, you can check this Reddit thread where I explained everything in a bit more detail. I'm actually using Arch instead of Proxmox, FYI: https://www.reddit.com/r/VFIO/comments/tcxhkw/6700xt_no_ouput_after_reboot/
I would be really grateful if you can actually help me :slight_smile:

What is the motherboard make/model that you are using?

Also, do you see ANY output at all on the screen when starting up with “video=efifb:off” or is the screen completely blank?
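(For anyone unfamiliar: that parameter goes on the kernel command line; on a GRUB-based install that's typically /etc/default/grub, followed by update-grub. The IOMMU options in the example are typical passthrough settings, not taken from this thread.)

```sh
# /etc/default/grub — run `update-grub` after editing
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt video=efifb:off"
```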

Hey, I’m using a B550 Aorus Pro from Gigabyte.
The monitor connected to the guest GPU is completely blank until I start a VM, even at boot time. Any other monitor connected to my host GPU obviously shows some stuff.

I’m also using a Gigabyte motherboard, the X570S Aero G.

I am thinking this may be related to the motherboard and not the GPU. As others have pointed out, BIOS settings can be important, but everything I tried has failed.

When I boot up I see “EFI stub: loaded initrd from command line option” but nothing else after that. I see it across both screens (I have the 6700XT alongside an RX550, each with a screen connected).
