6700XT - reset bug?

Just located a post about a Sapphire Nitro+ 6700XT also not working. This is now a total of 3 models exhibiting the bug:

  • Gigabyte Aorus 6700XT Elite
  • Asus Dual RX 6700XT
  • Sapphire RX 6700XT Nitro+

It seems to me that Wendell needs to retract that video where he cheered about the 6000 series not having the bug. That is now too many samples from people who seem to know what they’re doing (all three of us have other working cards but can’t get the 6700XT specifically to work).

Hope Wendell is watching this forum, or somebody brings it to his attention.

An interesting patch series has landed which mentions some ASPM fixes for Navi:

Mario Limonciello (4):
      drm/amd: smu7: downgrade voltage error to info
      drm/amd: Check if ASPM is enabled from PCIe subsystem
      drm/amd: Refactor `amdgpu_aspm` to be evaluated per device
      drm/amd: Use amdgpu_device_should_use_aspm on navi umd pstate switching

I may look into compiling a custom kernel for Proxmox, but if anyone can try it out and report whether they’ve had any luck, please do…

Hey, I encountered the exact same bug, so you can probably extend your list.
I switched from a Vega 56 Sapphire Pulse to a 6700XT PowerColor Red Devil. Like you, I don’t know what to try, especially because most people simply say that there’s no reset bug on the AMD 6000 generation anymore. Hopefully there either already is some sort of fix or there will be one in the near future.

Chiming in here to add that I hit the exact same bug when I power cycle a VM that uses my Radeon RX 5600XT.

The first boot works fine, but the first poweroff logs these entries:

[13003.863261] vfio-pci 0000:43:00.1: can't change power state from D0 to D3hot (config space inaccessible)
[13003.863323] vfio-pci 0000:43:00.0: can't change power state from D0 to D3hot (config space inaccessible)

The VM shuts down successfully, and Proxmox continues running.
If I try to power up the VM again, I get the following dmesg stream until my Proxmox host becomes unresponsive:

[14146.485263] vfio-pci 0000:43:00.0: timed out waiting for pending transaction; performing function level reset anyway
[14147.733256] vfio-pci 0000:43:00.0: not ready 1023ms after FLR; waiting
[14148.789238] vfio-pci 0000:43:00.0: not ready 2047ms after FLR; waiting
[14150.901247] vfio-pci 0000:43:00.0: not ready 4095ms after FLR; waiting
[14155.253193] vfio-pci 0000:43:00.0: not ready 8191ms after FLR; waiting
[14163.701139] vfio-pci 0000:43:00.0: not ready 16383ms after FLR; waiting
[14181.621003] vfio-pci 0000:43:00.0: not ready 32767ms after FLR; waiting
[14216.436746] vfio-pci 0000:43:00.0: not ready 65535ms after FLR; giving up
[14217.445396] device tap101i0 entered promiscuous mode
[14217.484515] fwbr101i0: port 1(fwln101i0) entered blocking state
[14217.484522] fwbr101i0: port 1(fwln101i0) entered disabled state
[14217.484594] device fwln101i0 entered promiscuous mode
[14217.484703] fwbr101i0: port 1(fwln101i0) entered blocking state
[14217.484706] fwbr101i0: port 1(fwln101i0) entered forwarding state
[14217.489338] vmbr0: port 2(fwpr101p0) entered blocking state
[14217.489343] vmbr0: port 2(fwpr101p0) entered disabled state
[14217.489403] device fwpr101p0 entered promiscuous mode
[14217.489460] vmbr0: port 2(fwpr101p0) entered blocking state
[14217.489463] vmbr0: port 2(fwpr101p0) entered forwarding state
[14217.494011] fwbr101i0: port 2(tap101i0) entered blocking state
[14217.494017] fwbr101i0: port 2(tap101i0) entered disabled state
[14217.494119] fwbr101i0: port 2(tap101i0) entered blocking state
[14217.494122] fwbr101i0: port 2(tap101i0) entered forwarding state
[14217.599078] DMAR: VT-d detected Invalidation Time-out Error: SID 0
[14217.599085] DMAR: QI HEAD: Device-TLB Invalidation qw0 = 0x430000000003, qw1 = 0x289ff001
[14217.599131] DMAR: DRHD: handling fault status reg 40
[14217.599176] DMAR: QI PRIOR: Invalidation Wait qw0 = 0x200000025, qw1 = 0x100051ab4
[14217.599261] DMAR: Invalidation Time-out Error (ITE) cleared
[14217.599334] DMAR: VT-d detected Invalidation Time-out Error: SID 0
[14217.599335] DMAR: QI HEAD: IOTLB Invalidation qw0 = 0x200f2, qw1 = 0x28c0000c
[14217.599387] DMAR: DRHD: handling fault status reg 40
[14217.599400] DMAR: QI PRIOR: Invalidation Wait qw0 = 0x200000025, qw1 = 0x100051abc
[14217.599404] DMAR: Invalidation Time-out Error (ITE) cleared
[14217.599540] DMAR: VT-d detected Invalidation Time-out Error: SID 0
[14217.599541] DMAR: QI HEAD: Device-TLB Invalidation qw0 = 0x430000000003, qw1 = 0x28fff001
[14217.599543] DMAR: DRHD: handling fault status reg 40
[14217.599572] DMAR: QI PRIOR: Invalidation Wait qw0 = 0x200000025, qw1 = 0x100051ac4
[14217.599575] DMAR: Invalidation Time-out Error (ITE) cleared
[14217.599745] DMAR: VT-d detected Invalidation Time-out Error: SID 0
[14217.599746] DMAR: QI HEAD: IOTLB Invalidation qw0 = 0x200f2, qw1 = 0x29c0000b
[14217.599747] DMAR: DRHD: handling fault status reg 40
[14217.599787] DMAR: QI PRIOR: Invalidation Wait qw0 = 0x200000025, qw1 = 0x100051acc
[14217.599790] DMAR: Invalidation Time-out Error (ITE) cleared
[14217.600002] DMAR: VT-d detected Invalidation Time-out Error: SID 0
[14217.600002] DMAR: QI HEAD: Device-TLB Invalidation qw0 = 0x430000000003, qw1 = 0x29fff001
[14217.600004] DMAR: DRHD: handling fault status reg 40
[14217.600047] DMAR: QI PRIOR: Invalidation Wait qw0 = 0x200000025, qw1 = 0x100051ad4
[14217.600058] DMAR: Invalidation Time-out Error (ITE) cleared
[14217.600156] DMAR: VT-d detected Invalidation Time-out Error: SID 0
[14217.600202] DMAR: QI HEAD: IOTLB Invalidation qw0 = 0x200f2, qw1 = 0x2a40000b
[14217.600209] DMAR: DRHD: handling fault status reg 40
[14217.600244] DMAR: QI PRIOR: Invalidation Wait qw0 = 0x200000025, qw1 = 0x100051adc
[14217.600357] DMAR: Invalidation Time-out Error (ITE) cleared
[14217.600415] DMAR: VT-d detected Invalidation Time-out Error: SID 0
[14217.600416] DMAR: QI HEAD: Device-TLB Invalidation qw0 = 0x430000000003, qw1 = 0x2a7ff001
[14217.600416] DMAR: DRHD: handling fault status reg 40
[14217.600445] DMAR: QI PRIOR: Invalidation Wait qw0 = 0x200000025, qw1 = 0x100051ae4
[14217.600448] DMAR: Invalidation Time-out Error (ITE) cleared
[14217.600669] DMAR: VT-d detected Invalidation Time-out Error: SID 0
[14217.600670] DMAR: QI HEAD: IOTLB Invalidation qw0 = 0x200f2, qw1 = 0x2ac0000a
[14217.600671] DMAR: DRHD: handling fault status reg 40
[14217.600721] DMAR: QI PRIOR: Invalidation Wait qw0 = 0x200000025, qw1 = 0x100051aec
[14217.600723] DMAR: Invalidation Time-out Error (ITE) cleared
[14217.600925] DMAR: VT-d detected Invalidation Time-out Error: SID 0
[14217.600926] DMAR: QI HEAD: Device-TLB Invalidation qw0 = 0x430000000003, qw1 = 0x2adff001
[14217.600957] DMAR: DRHD: handling fault status reg 40
[14217.600996] DMAR: QI PRIOR: Invalidation Wait qw0 = 0x200000025, qw1 = 0x100051af4
[14217.600999] DMAR: Invalidation Time-out Error (ITE) cleared
[14217.601131] DMAR: VT-d detected Invalidation Time-out Error: SID 0
[14217.601131] DMAR: QI HEAD: IOTLB Invalidation qw0 = 0x200f2, qw1 = 0x2b00000a
[14217.601196] DMAR: QI PRIOR: Invalidation Wait qw0 = 0x200000025, qw1 = 0x100051afc
[14217.601208] DMAR: DRHD: handling fault status reg 40
<... these continue repeating...>
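
For anyone comparing their own logs against the one above, here is a rough Python sketch that pulls the "not ready ... after FLR" lines out of dmesg text and reports, per PCI address, the longest wait seen and whether the kernel gave up. The function name `summarize_flr` and the regular expression are my own illustration, matched only against the line format shown in this post; other kernel versions may word these messages differently.

```python
import re

# Matches lines in the format pasted above, for example:
# [14147.733256] vfio-pci 0000:43:00.0: not ready 1023ms after FLR; waiting
FLR_LINE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<driver>\S+)\s+"
    r"(?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]):\s+"
    r"not ready (?P<ms>\d+)ms after FLR; (?P<outcome>waiting|giving up)$"
)


def summarize_flr(log_text):
    """Return {pci_address: {"max_wait_ms": int, "gave_up": bool}}.

    Only lines that match FLR_LINE are considered; everything else
    (bridge messages, DMAR errors, ...) is ignored.
    """
    summary = {}
    for line in log_text.splitlines():
        match = FLR_LINE.match(line.strip())
        if match is None:
            continue
        entry = summary.setdefault(
            match["bdf"], {"max_wait_ms": 0, "gave_up": False}
        )
        entry["max_wait_ms"] = max(entry["max_wait_ms"], int(match["ms"]))
        if match["outcome"] == "giving up":
            entry["gave_up"] = True
    return summary
```

To use it, save the output of dmesg to a file, read the file into a string, and pass that string to `summarize_flr`. A device that shows `gave_up` as true never came back from the function level reset, which is the symptom described in this post.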

This may not be directly applicable because you’re on a 6700XT, but it might seed some ideas…
I fixed my instance of this issue on my Radeon RX 5600XT with the help of the GitHub project gnif/vendor-reset. (I can’t post links, likely because my account is too new.)

I ran apt install -y dkms pve-headers, then followed the instructions in the README, and it all worked for me.

I used the latest version of the project from the master branch (master is currently pointing to commit 7d43285 from 2021-10-19), and it worked flawlessly for me.
The “released” version (v0.1.0 from 2021-01-23) did not work for me.

I now have no issues power cycling the VM which receives the passthrough Radeon RX 5600XT GPU.
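
A quick sanity check after installing is to confirm the module is actually loaded before testing a VM power cycle. The sketch below parses text in the format of /proc/modules. It assumes the module built from gnif/vendor-reset shows up under the name `vendor_reset` (the kernel lists module names with underscores); the helper names are mine, not part of the project.

```python
def loaded_modules(proc_modules_text):
    """Return the set of module names found in /proc/modules-style text.

    Each line of /proc/modules starts with the module name, followed by
    size, reference count and other fields separated by spaces.
    """
    names = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields:
            names.add(fields[0])
    return names


def vendor_reset_loaded(proc_modules_text):
    # Assumption: the module appears as "vendor_reset" in /proc/modules.
    return "vendor_reset" in loaded_modules(proc_modules_text)
```

On the host you would call it as `vendor_reset_loaded(open("/proc/modules").read())`. If it returns False, the DKMS build or the module autoload step probably did not take effect.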

vendor-reset only works for older cards, as far as I know. I used it for my Vega 56 before upgrading to a 6700XT and it worked. I already tested the 6700XT with and without vendor-reset, but it had no impact at all. If you want more information about my situation, you can check this Reddit thread where I explained everything in a bit more detail. I’m actually using Arch instead of Proxmox, FYI: https://www.reddit.com/r/VFIO/comments/tcxhkw/6700xt_no_ouput_after_reboot/
I would be really grateful if you can actually help me :)

What is the motherboard make/model that you are using?

Also, do you see ANY output at all on the screen when starting up with “video=efifb:off” or is the screen completely blank?

Hey, I’m using a B550 Aorus Pro from Gigabyte.
The monitor connected to the guest GPU is completely blank until I start a VM, even at boot time. Any other monitor connected to my host GPU shows output as expected.

I’m also using a Gigabyte motherboard, the X570S Aero G.

I am thinking this may be related to the motherboard and not the GPU. As others have pointed out, BIOS settings can be important but no matter what I tried it has failed.

When I boot up I see “EFI stub: loaded initrd from command line option” but nothing else after that. I see it across both screens (I have the 6700XT alongside an RX550, each with a screen connected).

If anyone is still following this thread, I have a new development. An X.org developer on Phoronix claimed that the 6000 series supports SBR (secondary bus reset) and that it should be possible to properly reset 6700XTs using that method.
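
Before trying SBR, it may help to see which reset methods the kernel thinks the card supports. On recent kernels (5.15 and later, as far as I know) each PCI device has a `reset_method` file in sysfs listing the enabled methods in order, with `bus` standing for a secondary bus reset. The Python sketch below reads that file; the PCI address in the usage note is just the one from the 5600XT log earlier in the thread, and the function names are my own.

```python
from pathlib import Path


def parse_reset_methods(text):
    """Split the space-separated contents of a reset_method file."""
    return text.split()


def read_reset_methods(bdf, sysfs_root="/sys/bus/pci/devices"):
    """Return the reset methods listed for a PCI device, or None.

    None means the file could not be read, for example because the
    kernel is too old to expose reset_method or the address is wrong.
    """
    path = Path(sysfs_root) / bdf / "reset_method"
    try:
        return parse_reset_methods(path.read_text())
    except OSError:
        return None


def supports_bus_reset(methods):
    """True if 'bus' (secondary bus reset) is among the listed methods."""
    return methods is not None and "bus" in methods
```

For example, `supports_bus_reset(read_reset_methods("0000:43:00.0"))` checks the device at that address; substitute the address of your own GPU from lspci. Note that a listed method only means the kernel will attempt it, not that the card recovers afterwards.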

I have tried (unsuccessfully so far), but perhaps someone else will be luckier and get it to work. If you want to give it a go, I’ve posted instructions on this Reddit thread to reach a wider audience. I’d appreciate feedback from people who try it on whether it resolved the problem for them.

Hope? AMD Looking To Improve The GPU Reset Experience Under Linux

Looks like that’s for amdgpu, not vfio.

Yeah, it’s also about reporting on a reset after a problem occurred due to some buggy application. Not what we’re after…

@Mechanical Do you by any chance have a post in forums (reddit / level1techs) where you discussed your issue with the Asus Dual RX 6700XT along with any steps you took to debug/fix it? Something like this thread, but that you created to discuss your own case?

Got some attention from AMD employees that have generally been very active in Reddit forums.

Hopefully they’ll raise awareness internally in the company…

Further update. I contacted a store and we agreed that I can order cards, test them at home, return them if they don’t work, and then order another model.

If you are in the UK I highly recommend OCUK as explained here, where I have compiled a list of cards that are reported working, along with cards reported as NOT working.

In short: my configuration was 100% correct. I found two cards [EDIT: two 6000 series cards, plus my older RX550 and RX480 with vendor-reset patch] that “just work” as opposed to my Gigabyte Aorus Elite 6700XT which refused to work no matter what.

Clearly there’s something vendor/board-specific that causes this. Please refer to my Reddit post for a list of reported working/non-working cards before you purchase.

I personally highly recommend OCUK for a no-fuss experience if you are in the UK.

I can confirm the XFX 6700XT SWFT is also affected by this.

I have the same card, but the non-XL version, and the same problem.

For anyone else considering it: don’t get the XFX RX6700 SWFT 309 (Model Number RX-67XLKW).

I had a similar issue with my Gigabyte RX580 in the second slot of an Asus Strix X570-F. I’ve created a repo describing how I managed to fix it: Viktor Koteski / proxmox-gpu-reset · GitLab

I have a Sapphire Nitro+ 6800XT and no issues at all.
In the past you could flash a BIOS from another vendor, but I don’t think that works with RDNA2.

Edit: looks like there is a flash tool for RDNA2.

Maybe it’s worth a shot to test another BIOS just by loading it via libvirt:

https://wiki.archlinux.org/title/PCI_passthrough_via_OVMF

> <hostdev>
>      ...
>      <rom file='/path/to/your/gpu/bios.bin'/>
>      ...
>    </hostdev>
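
If you would rather script that change than edit the domain XML by hand, here is a small Python sketch using only the standard library. It takes the text of a single hostdev element, adds a rom child (or updates an existing one) and returns the modified XML. The function name `add_rom_file` is my own, and the ROM path is the same placeholder as in the quoted snippet, not a real file.

```python
import xml.etree.ElementTree as ET


def add_rom_file(hostdev_xml, rom_path):
    """Return hostdev XML with a <rom file='...'/> child pointing at rom_path.

    If the element already has a <rom> child, its file attribute is
    updated instead of adding a second one.
    """
    hostdev = ET.fromstring(hostdev_xml)
    if hostdev.tag != "hostdev":
        raise ValueError("expected a <hostdev> element")
    rom = hostdev.find("rom")
    if rom is None:
        rom = ET.SubElement(hostdev, "rom")
    rom.set("file", rom_path)
    return ET.tostring(hostdev, encoding="unicode")
```

You would paste the result back into the domain definition (for example through virsh edit) in place of the original hostdev element. This only changes which ROM the guest sees at boot; it does not flash anything onto the card.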