Vega Reset Bug... Still an issue?

I am looking to upgrade my R9 290x. It’s still going pretty strong, but will reach really high temps in some games and will sometimes start artifacting.

I game in a Windows 10 VM with VFIO passthrough and have seen that this can be a problem with Vega cards. There is a deal on the Vega 56 at the moment for £300 with 3 AAA games.

I was originally planning on waiting for Navi. But who knows when that will arrive. And, with this deal, I am very tempted.

My main question is: is the reset bug still a problem with VFIO?
If so, how much of a problem?

I can safely say yes, it is still a problem, but the impression I have gotten from all the posts I have read is that it seems to be hit and miss, and related to the card manufacturer and maybe the card BIOS.
I have a PowerColor Red Devil Vega 64 and it suffers from the bug.

I did find something with a Vega 64 (ASUS Strix) on Reddit where someone was able to reset their Vega by removing it and then re-scanning, kind of like removing a device in Device Manager in Windows and then re-scanning for hardware changes to reinstall the device.

These are the steps the person took

Power off the Vega GPU
echo "1" | sudo tee -a /sys/bus/pci/devices/0000\:0a\:00.0/remove # <-GPU
echo "1" | sudo tee -a /sys/bus/pci/devices/0000\:0a\:00.1/remove # <-HDMI/DP audio device
where "0a" is the PCI address of the card, which will be different from system to system, e.g.
echo "1" | sudo tee -a /sys/bus/pci/devices/0000\:04\:00.0/remove
echo "1" | sudo tee -a /sys/bus/pci/devices/0000\:04\:00.1/remove

Suspend to RAM

sudo systemctl suspend
other systemctl sleep commands may work, but I haven't tried them yet
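
If you would rather not have to wake the machine by hand, rtcwake (from util-linux) can suspend and then wake it automatically after a delay. Just a suggestion, I have not tried it myself as part of this process:

# suspend to RAM and wake automatically after 15 seconds
sudo rtcwake -m mem -s 15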

Log back in to your desktop environment and rescan PCIe devices by entering the following:

echo "1" | sudo tee -a /sys/bus/pci/rescan

or the following two commands (the chmod is needed because the shell redirection in the second command runs as your normal user, not as root):

sudo chmod 777 /sys/bus/pci/rescan
sudo echo 1 > /sys/bus/pci/rescan

Check it has reset
lspci -vv | grep vfio -B 12
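
Another way to check is to ask lspci directly which kernel driver a given function is bound to (0a:00.0 here is just an example address, use your own):

# show the device and the kernel driver currently bound to it
lspci -nnk -s 0a:00.0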

Restart libvirt so that virt-manager can see the GPU again

sudo systemctl stop libvirt-bin
sudo systemctl stop libvirt-bin.socket
sudo systemctl start libvirt-bin
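
The service names above are distro-specific; on many current distros the libvirt daemon is called libvirtd instead, so the equivalent would be something like:

sudo systemctl restart libvirtd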

The person said this has to be done as soon as the VM is powered off, otherwise it will not reset.

EDIT: I have checked again on my V64 and this process works. You shouldn’t have to restart libvirt. I had to the first time because I deleted the V64 from my VM when it didn’t output video.
You have to run the process as soon as you shut down the VM.
I would still suggest not getting the PowerColor Red Devil; it has a tendency to shut itself off. I may sell my card and get a different brand.

Here is a script I have just put together, so I still need to test it to make sure I can run everything from the command line in one step.

#!/bin/bash

# copy this file to /usr/bin/reset_vega.sh
# This script must be run immediately after you shut down the VM. It doesn't work if the GPU has been left for too long
# following a shut down. It doesn't work if the VM is rebooted.
# to run, simply open a terminal and run: bash /usr/bin/reset_vega.sh
# Remove/Power off the Vega GPU like uninstalling devices in Windows device manager
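# NOTE: 0000:0d:00.0 / 0000:0d:00.1 are the GPU and its HDMI/DP audio function on my system; change them to match your card's address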
echo "1" | sudo tee -a /sys/bus/pci/devices/0000:0d:00.0/remove
echo "1" | sudo tee -a /sys/bus/pci/devices/0000:0d:00.1/remove

# Suspend to RAM
systemctl suspend

# Pause here; once the system has woken from suspend, press Enter to continue
read input

# Change permissions on the PCI rescan node so the redirection below works without root
sudo chmod 777 /sys/bus/pci/rescan

# Rescan PCI devices to reinitialise the Vega GPU
sudo echo 1 > /sys/bus/pci/rescan

# This line is replaced by the last 2 because it throws invalid argument errors
# echo “1” | sudo tee -a /sys/bus/pci/rescan
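
One idea for never missing that window, which I have not tested yet, is a libvirt qemu hook: libvirt runs /etc/libvirt/hooks/qemu (if it exists and is executable, and libvirtd has been restarted once after creating it) with the guest name and an operation such as "release" after a guest has fully stopped. A minimal sketch of a dispatcher, where the guest name win10 is just a placeholder, and bearing in mind that the suspend and the read pause in the reset script would need rethinking for non-interactive use:

#!/bin/bash
# /etc/libvirt/hooks/qemu
# libvirt calls this with: $1 = guest name, $2 = operation (prepare/start/started/stopped/release), $3 = sub-operation
GUEST="$1"
OPERATION="$2"

# "release" fires once the guest is stopped and its resources have been freed
if [ "$GUEST" = "win10" ] && [ "$OPERATION" = "release" ]; then
    /usr/bin/reset_vega.sh
fi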

Many thanks for your reply. I'm gonna wait and see what AMD announce at CES :pray:. This is great info, and the workaround looks pretty painless.

No problem,
I am waiting for CES too. I want to know what that rumored RX 3080 can do, especially if AdoredTV's leak was accurate and it is Vega 64 + 15%. If so, I will be replacing my V64 with it.
My guess is that the reset bug is in some way linked to HBM/HBM2, since the only cards people have consistently had the issue with are the Fury and Vega.

Since the latest update I can reboot my Windows 10 guest with the Vega 64 passed through, without any config change.
And since kernel 4.19 introduced AMD GPU reset fixes, I think they apply to VM resets too (not just to the amdgpu driver).
Or it could have been due to updating the Windows guest AMD driver (19.1), or to the Windows 10 Fall Creators Update…

Anyway it works and that’s good.


No, kernel 4.20.x doesn't resolve the issue. If it is working for you without additional changes, it is either luck or you are mistaken.

Running kernel 4.20.4 here and still having reset issues without disabling the D3 idle state on the GPU.

If I do the suspend-to-RAM trick, then I can shut down the VM and rebind to amdgpu in the host… but I can't unbind from amdgpu again. So I don't see how its PCIe reset fixes work at all.

If you see other threads where gnif has commented, he says the reset code in the amdgpu driver is incomplete, and comments in the source indicate as much.
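
For reference, the bind/unbind dance I mean is the usual sysfs one, roughly along these lines (0000:0a:00.0 is just an example address; the unbind from amdgpu is the part that fails for me):

# detach the GPU from amdgpu on the host
echo 0000:0a:00.0 | sudo tee /sys/bus/pci/drivers/amdgpu/unbind

# point the device back at vfio-pci and re-probe it
echo vfio-pci | sudo tee /sys/bus/pci/devices/0000:0a:00.0/driver_override
echo 0000:0a:00.0 | sudo tee /sys/bus/pci/drivers_probe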


I used this script, inserting ‘20’ where you have ‘0d’ (that’s where my guest card is).

This is what came up:

“1”
tee: '/sys/bus/pci/devices/0000:20:00.0/remove': Invalid argument
“1”
tee: '/sys/bus/pci/devices/0000:20:00.1/remove': Invalid argument

Also, no user input woke it up from suspend - I had to push the power button.

Thoughts?

I am having a slightly different experience on an RX470 (MSI). I can issue the remove command without error, and it does indeed work. But rescan either does nothing or returns the “invalid argument” error. Even after suspending and waking, rescan does not bring back the card according to lspci.
I started a thread about my particular issues here: Yet another AMD reset bug thread (RX470)

I guess I should put here that GPU passthrough is working, and I have played some games in the VM.

I have not yet shut down the VM, used this script, and then tried to restart it. I will report back when I have.

Sorry, I’ve been off the grid for a while.
Yes, it seems to still be an issue. I've not tried with a Radeon VII yet, mind you; maybe AMD fixed the issue in that. That said, rebooting the VM seems to work with my reference Sapphire V64, but a complete shutdown of the VM followed by a start-up will not.

I have been meaning to go back and look at the user input in the script to see what instruction I can pass it. Right now it puts the system to sleep when sudo systemctl suspend is executed.
I need to figure out how to get the system to wake up on mouse or keyboard input. Something must be turned off in the settings that's preventing user input from waking the system.
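
In case it helps, wake-on-USB is usually controlled per device via ACPI and sysfs. Something like this (just a sketch; the 1-3 device path is only an example, yours will differ) shows what is enabled and turns wakeup on for a keyboard:

# list ACPI wakeup sources and whether they are enabled
cat /proc/acpi/wakeup

# enable wakeup for a specific USB device; find yours under /sys/bus/usb/devices/
echo enabled | sudo tee /sys/bus/usb/devices/1-3/power/wakeup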

You may be getting the error

“1”
tee: ‘/sys/bus/pci/devices/0000:20:00.0/remove’: Invalid argument
“1”
tee: ‘/sys/bus/pci/devices/0000:20:00.1/remove’: Invalid argument

because the VM was shut down for too long before the calls were made. There is a time limit on those calls before the card goes into the state that causes the reset bug. You would literally need to run the script as soon as the VM is shut down for it to work.

I would need to double-check this because I haven't used the VM on my workstation with the Vega card in quite some time. I also had to change the motherboard in my gaming/HTPC rig and have switched to using the Vega card in Linux and passing the Nvidia card to the VM on that system.

I suspect GPU passthrough will slowly stop being as popular, because WINE, Steam Proton, DXVK, and Lutris are making a large and growing collection of Windows games work in Linux.