AMD Polaris, Vega & Navi Reset Project - vendor-reset

We have a similar report on Discord for out-of-tree builds; I am investigating.

I have made some changes in git that may help, but really I can’t see why they would. It’s getting stuck searching the device db, which is a simple array search. Can you please update and try again?

doesn’t seem to have made a difference.
here’s the image of the panic after the new update though.

As much as I hate to ask this, since I doubt it will help, can you please run make clean before you build, just in case we have an old .o hanging around? The code here is extremely simple and I can’t explain what could cause it to get stuck here.

i have been cleaning my builds before each attempt. that's just a habit i have.

Can you please try with tail call optimisation disabled? -fno-optimize-sibling-calls. If this works we have a solution :slight_smile:

Edit: just pushed this change in anyway as we are pretty sure it’s the solution. Simply update and rebuild to see :slight_smile:

A gcc bug? I’ve tripped over that one too elsewhere!

i can confirm, this new update allows the kernel to boot up with the module added in /etc/modules.
now i can finally test the actual reset behavior.

it seems to reset my Navi10 just fine.
however, somehow, virsh autostart no longer works. the VM boots, but there is no output from the GPU on the guest. the only way i get GPU output from my guest is by running virsh destroy and then virsh start.
rather annoying, as i'm so used to my passthrough working fine from bootup.

here are some results from my testing.
virsh destroy and then virsh start: works perfectly.
shutdown called from guest OS: reset fails.
virsh shutdown and then virsh start: reset fails.
reboot called from guest OS: reset succeeds, but no GPU output.
virsh reboot: reset succeeds, but no GPU output.

these tests were done with a guest running Debian 11 Bullseye on a debianized Linux 5.7.6 with Valve’s futex-wait-multiple patch applied.

No, not a “gcc bug”, but an assumption about how the hooks work. TCO replaces a call with a jump when it can, to avoid stack usage for return addresses. Normally this is fine, but since we are patching a kernel symbol we need the return instruction pointer kept intact.

@mathew2214 can you please provide the dmesg for this?

here is the log from all of these tests. my apologies, i didn't think to record the timestamps of when each test was performed, so it may be impossible to tell which test caused which dmesg output here.
yes, i'm aware my ECC setup is broken.

VM autostart seems to begin at line 2014.
tessa0dmesg.txt (223.1 KB)
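
For anyone repeating these tests, one way to avoid the ambiguity described above is to write a marker before each test so later log lines can be matched to it. This is only a sketch: `mark_test` and the `vendor-reset-test:` prefix are hypothetical names, not part of vendor-reset, and writing to /dev/kmsg (the default destination here) needs root.

```shell
# Hypothetical helper: append a timestamped marker so that subsequent log
# lines can be attributed to the test that produced them.
#   $1 = test label
#   $2 = destination file (defaults to /dev/kmsg, which requires root)
mark_test() {
    label="$1"
    dest="${2:-/dev/kmsg}"
    printf 'vendor-reset-test: %s at %s\n' \
        "$label" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" >> "$dest"
}

# Example use (run as root; "guest" is a placeholder domain name):
#   mark_test "virsh destroy + start"
#   virsh destroy guest; virsh start guest
```

Passing a regular file as the second argument keeps the markers in a separate log instead of the kernel ring buffer.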

What is your guest OS?

Debian 11 Bullseye with debianized Linux 5.7.6 with Valve’s futex-wait-multiple patch applied.

Thanks. We have not tested Linux as a guest much with these resets. We do know that between Windows, OSX, and Linux the GPU state after shutdown is lacking… consistency…

I don’t see anything obvious but these sequences are going to see improvements over the coming weeks as we tweak and tune them, so stay tuned! :smiley:

I would love to give this project a try.
Can you tell whether this would work on a stock openSuSE kernel?

When I try to install it using dkms, the output I get is as follows:

sudo dkms install .
Creating symlink /var/lib/dkms/vendor-reset/0.0.17/source ->
/usr/src/vendor-reset-0.0.17

DKMS: add completed.
Error! echo
Your kernel headers for kernel 5.8.14-1-default cannot be found at
/lib/modules/5.8.14-1-default/build or /lib/modules/5.8.14-1-default/source.

install your kernel headers. there should be a package for them; consult your distribution's documentation.
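
Before retrying dkms, the two paths named in the error above can be checked directly. A minimal sketch, assuming a POSIX shell; `have_kernel_headers` is a hypothetical helper name, and the second argument exists only so the check can be pointed at another modules root.

```shell
# Hypothetical helper: succeed if a build or source tree exists for a
# kernel release, i.e. the two locations dkms complained about.
#   $1 = kernel release (defaults to the running kernel)
#   $2 = modules root   (defaults to /lib/modules)
have_kernel_headers() {
    release="${1:-$(uname -r)}"
    root="${2:-/lib/modules}"
    [ -e "$root/$release/build" ] || [ -e "$root/$release/source" ]
}

if have_kernel_headers; then
    echo "kernel headers found for $(uname -r)"
else
    echo "kernel headers missing for $(uname -r)"
fi
```

If it reports them missing, install the headers or kernel development package that matches the running kernel release exactly, then run dkms again.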

Thank you, I’ll look into it :eyes:

Just tried this on Fedora 33 with latest un-patched kernel 5.9.8-200.fc33.x86_64 for a Navi 10 RX5700XT, and can report that it works great.

Thank you!

Just in case anyone is looking for the extra steps required to ensure early load on Fedora, I used:

# echo "vendor-reset" > /etc/modules-load.d/vendor-reset.conf
# dracut --add-drivers "vfio vfio-pci vfio_iommu_type1 vendor-reset" --force
# lsinitrd | grep -e vfio -e vendor

Last command should show .ko.xz files for both vfio and vendor-reset.
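
The visual check in the last step can also be scripted. A rough sketch, assuming the lsinitrd listing is piped in on stdin; `initrd_has_modules` is a hypothetical helper name, and it only looks for the substrings `vfio` and `vendor-reset` rather than exact module file names.

```shell
# Hypothetical helper: read an initramfs listing on stdin and succeed only
# if it mentions both a vfio module and the vendor-reset module.
initrd_has_modules() {
    listing=$(cat)
    case "$listing" in *vfio*) ;; *) return 1 ;; esac
    case "$listing" in *vendor-reset*) ;; *) return 1 ;; esac
    return 0
}

# Example use (as root):
#   lsinitrd | initrd_has_modules && echo "both modules are in the initramfs"
```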

While belfrypossum's first version did not work for me, this finished module works perfectly.

Thanks a lot for this!

Edit: Manjaro with 5.9 Kernel and Powercolor 5700XT

Very nice :slight_smile:

So with this project, there is no need to patch the distro's kernel anymore?

I will give it a try on my GNU/Linux Ubuntu 20.04.1 :grin: