…and the ugly hack seems to allow me to restart the VM successfully without a full reboot of the host!
echo "1" | tee -a /sys/bus/pci/devices/0000\:0b\:00.0/remove
echo "1" | tee -a /sys/bus/pci/devices/0000\:0b\:00.1/remove
echo COMPUTER WILL GO TO SLEEP, PRESS POWER BUTTON TO RESUME
sleep 2
echo -n mem > /sys/power/state
echo "1" | tee -a /sys/bus/pci/rescan
Basically you detach the 6700XT, do a fast sleep/resume (I need to press the power button) then rescan the PCI bus to re-detect the 6700XT. This works, I can then start another VM that uses it.
This is not ideal but it seems to indicate I need to try different motherboard BIOS versions in the hope they don’t exhibit this, and also look into anything power-related in the BIOS to try and tinker with…
Can confirm that you’re not the only one with this issue. I have an Asus Dual RX 6700XT (Navi 22) with the exact same problem on an Asus Prime X570-Pro. The card refuses to go into the D3hot state again after VM shutdown. Turning the VM on again after shutdown causes a black screen (code 43 behind the scenes) due to the Restoring BARs problem. On any third attempt to start the VM it goes into the classic Unknown PCI header type ‘127’ for device ‘0000:0b:00.0’ problem and stays completely locked until I detach the devices, sleep the computer, wake it back up and rescan.
I’ve attempted multiple things, most of the same stuff you have: BIOS settings, a different PCI slot, kernel parameters, setting all the different VBIOS ROMs on the VM, blacklisting amdgpu, disabling/unloading the driver in Windows before shutting down, and several other “fix” projects that do the same. I’ve even attempted to force the card into the D3 state using setpci, which won’t work as AMD has some sort of special way of putting the card into that mode (hence vendor-reset).
After scouring the internet for hours I keep seeing claims that the 6000 series doesn’t have this issue, with any report of it promptly denied. Yet people are having this issue with cards from different 6000 series vendors, and the symptoms are identical to the classic reset bug of the previous Navi iterations. I’m beginning to sincerely disbelieve that the issue is BIOS configuration or anything other than the reset bug being present in certain vendors’ cards, as my GTX 1080 passed through completely fine on this motherboard until it died.
I am, just like you, willing to take any tips or hints. But at this point I’ve more or less given up as I cannot find a single point of reference anymore after combing the net for help. At least I can give one thing which is a mildly better reset script:
This one will wake itself up after suspending, so you don’t have to press any buttons. It’s an ugly hack, but it is the only solution I’ve come up with unfortunately.
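A self-waking variant of the earlier hack can be sketched as below. This is only an illustration of the idea and not necessarily the script referred to above: it assumes `rtcwake` from util-linux is available and swaps the manual write to /sys/power/state for an RTC wake alarm. The PCI addresses are the ones from the first post; the function name `reset_gpu`, the `DRY_RUN` switch and the 5-second `WAKE_AFTER` default are made up for the example.

```shell
#!/bin/sh
# Sketch: detach the GPU, suspend with an RTC wake alarm, then rescan.
# Assumptions: rtcwake (util-linux) is installed and the script runs as root.
# DRY_RUN defaults to 1, so the commands are only printed, not executed.

DRY_RUN="${DRY_RUN:-1}"
WAKE_AFTER="${WAKE_AFTER:-5}"   # seconds to stay asleep (hypothetical default)

# Write a value to a sysfs file, or just print the action in dry-run mode.
sysfs_write() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "write $1 -> $2"
    else
        echo "$1" > "$2"
    fi
}

reset_gpu() {
    # Remove every function of the card (VGA and HDMI audio) from the bus.
    for dev in "$@"; do
        sysfs_write 1 "/sys/bus/pci/devices/$dev/remove"
    done

    # Suspend to RAM; the RTC alarm resumes the host after WAKE_AFTER seconds.
    if [ "$DRY_RUN" = "1" ]; then
        echo "rtcwake -m mem -s $WAKE_AFTER"
    else
        rtcwake -m mem -s "$WAKE_AFTER"
    fi

    # Rescan the PCI bus so the card is detected again.
    sysfs_write 1 /sys/bus/pci/rescan
}
```

Calling `reset_gpu 0000:0b:00.0 0000:0b:00.1` prints the planned steps; setting `DRY_RUN=0` first and running as root would perform them for real.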
It seems to me that Wendell needs to retract that video where he cheered about the 6000 series not having the bug. This is too many samples now from people that seem to know what they’re doing (all three of us have other working cards but can’t get the 6700XT specifically to work).
Hope Wendell is watching this forum, or somebody brings it to his attention.
Mario Limonciello (4):
drm/amd: smu7: downgrade voltage error to info
drm/amd: Check if ASPM is enabled from PCIe subsystem
drm/amd: Refactor `amdgpu_aspm` to be evaluated per device
drm/amd: Use amdgpu_device_should_use_aspm on navi umd pstate switching
I may look into compiling a custom kernel for Proxmox but if anyone can try it out and report if they’ve had any luck, please do…
Hey, I encountered the exact same bug so you can probably extend your list.
I switched from a Vega 56 Sapphire Pulse to a 6700XT PowerColor Red Devil. Like you, I don’t know what to try, especially because most people simply say that there’s no reset bug for the AMD 6000 gen anymore. Hopefully there either already is some sort of fix or there will be one in the near future.
Chiming in here to add, I encountered the exact same bug when I try to power cycle a VM which was using my Radeon RX 5600XT.
The first boot up works well, then the first poweroff throws these entries:
[13003.863261] vfio-pci 0000:43:00.1: can't change power state from D0 to D3hot (config space inaccessible)
[13003.863323] vfio-pci 0000:43:00.0: can't change power state from D0 to D3hot (config space inaccessible)
The VM is offline successfully, and Proxmox continues running.
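The "config space inaccessible" message, like the header type ‘127’ error reported earlier in the thread, is what you would expect when reads from the card return all ones (0xff masked to 7 bits is 127). A rough way to check whether a device is in that state is to read the first bytes of its config space through sysfs. This is a minimal sketch; the function name `pci_config_alive` and the `SYSFS_PCI` override are hypothetical and only there for illustration.

```shell
#!/bin/sh
# Sketch: report whether a PCI device still answers config space reads.
# SYSFS_PCI can be pointed elsewhere for testing (assumption for this example).
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci/devices}"

pci_config_alive() {
    cfg="$SYSFS_PCI/$1/config"
    if [ ! -r "$cfg" ]; then
        echo "missing"          # device not present (for example after a remove)
        return 2
    fi

    # The first 4 bytes of config space hold the vendor ID and device ID.
    id="$(od -An -tx1 -N4 "$cfg" | tr -d ' \n')"

    # All ones means the read did not reach a working device.
    if [ "$id" = "ffffffff" ]; then
        echo "dead"
        return 1
    fi
    echo "alive"
}
```

For example, `pci_config_alive 0000:43:00.0` after a failed VM shutdown would be expected to print `dead` if the card has dropped off the bus.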
If I try to power up the VM again, I get the following dmesg stream until my Proxmox host goes unresponsive:
This may not be directly applicable because you’re on a 6700XT, but it might seed some ideas…
I fixed my instance of this issue on my Radeon RX 5600XT with the help of the GitHub project gnif/vendor-reset. (I can’t post links, likely because my account is too new.)
I did an apt install -y dkms pve-headers, and then followed the instructions on the readme, and it all worked for me.
I used the latest version of the project from the master branch (master is currently pointing to commit 7d43285 from 2021-10-19), and it worked flawlessly for me.
The “released” version (v0.1.0 from 2021-01-23) did not work for me.
I now have no issues power cycling the VM which receives the passthrough Radeon RX 5600XT GPU.
The vendor-reset only works for older cards, as far as I know. I used it for my Vega 56 before upgrading to a 6700XT and it worked. I already tested the 6700XT with and without vendor-reset but it had no impact at all. If you want more information about my situation, you can check this reddit thread where I explained everything in a bit more detail. I’m actually using Arch instead of Proxmox FYI: https://www.reddit.com/r/VFIO/comments/tcxhkw/6700xt_no_ouput_after_reboot/
I would be really grateful if you can actually help me
Hey, I’m using a B550 Aorus Pro from Gigabyte.
The monitor which is connected to the guest GPU is completely blank until I start a VM, even at boot time. Any other monitor that is connected to my host GPU obviously shows some stuff.
I’m also using a Gigabyte motherboard, the X570S Aero G.
I am thinking this may be related to the motherboard and not the GPU. As others have pointed out, BIOS settings can be important but no matter what I tried it has failed.
When I boot up I see “EFI stub: loaded initrd from command line option” but nothing else after that. I see it across both screens (I have the 6700XT alongside an RX550, each with a screen connected).
If anyone is still following this thread, I have a new development. An X.org developer on Phoronix claimed the 6000 series supports SBR (secondary bus reset) and that it should be possible to properly reset 6700XTs using that method.
I have tried (unsuccessfully so far), but perhaps someone else will be luckier and get it to work for them? If you want to give it a go, I’ve posted instructions on THIS reddit thread to help reach more of an audience. I’d appreciate some feedback from people that try it, on whether it helped resolve this for them.
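For anyone wanting to experiment, one way to request a bus reset on kernel 5.15 and newer is the per-device `reset_method` attribute in sysfs, followed by a write to `reset`. This is only a sketch under that assumption and is not necessarily what the linked instructions do. The function name `try_bus_reset` and the `SYSFS_PCI` override are hypothetical. Note that the kernel only lists `bus` for a device when it considers a bus reset possible for it, so the sketch checks for that first.

```shell
#!/bin/sh
# Sketch: ask the kernel for a bus reset of one PCI device via sysfs.
# Assumptions: kernel 5.15+ (reset_method attribute exists), run as root.
# SYSFS_PCI can be pointed elsewhere for testing (assumption for this example).
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci/devices}"

try_bus_reset() {
    dev="$SYSFS_PCI/$1"
    if [ ! -e "$dev/reset_method" ]; then
        echo "no reset_method attribute"
        return 2
    fi

    # reset_method holds a space-separated list, e.g. "flr pm bus".
    methods="$(cat "$dev/reset_method")"
    case " $methods " in
        *" bus "*) ;;   # bus reset is offered for this device
        *) echo "bus reset not offered (available: $methods)"; return 1 ;;
    esac

    # Restrict the device to the bus method, then trigger the reset.
    echo bus > "$dev/reset_method"
    echo 1 > "$dev/reset"
    echo "bus reset requested"
}
```

Usage would look like `try_bus_reset 0000:0b:00.0` with the VM shut down. If it reports that the bus method is not offered, this particular route is not available for that device.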
@Mechanical Do you by any chance have a post in forums (reddit / level1techs) where you discussed your issue with the Asus Dual RX 6700XT along with any steps you took to debug/fix it? Something like this thread, but that you created to discuss your own case?