AMD Polaris, Vega & Navi Reset Project - vendor-reset

gnif · November 14, 2020, 8:27pm

The project https://github.com/gnif/vendor-reset is a collaboration between @belfrypossum and myself. It aims to provide an avenue for easily adding complex reset sequences to the kernel without needing to upstream them into the kernel itself.

Today @belfrypossum and I agreed that the project is ready for use by the general public, and we would like to announce that it completely supersedes the previously released patches for AMD GPU resets. Currently the project targets the following ASICs (this is not an exhaustive list; only a few example GPUs are listed for each):

Polaris 10, 11 & 12
Vega 10 (Vega56/64/FE)
Vega 20 (Radeon VII)
Navi 10 (5600XT, 5700, 5700XT)
Navi 12 (Pro 5600M)
Navi 14 (Pro 5300, RX 5300, 5500XT)

Usage is very simple: just build the module and modprobe it, or use dkms to manage it directly (configuration is included). Nothing more is needed.

There are still conditions under which the GPUs will not reset; however, we are working to improve this as time permits.

This entirely removes the need to patch your kernel, and it is required that any patches you have applied for GPU resets be removed when using this module.

Due to the amount of time and hardware that @belfrypossum has invested, I will not be accepting donations for this project at this time. If you would like to support it, you can instead show support to @belfrypossum on KoFi here for his amazing work:

mathew2214 · November 14, 2020, 8:54pm

does this module have to be modprobed? will it not function if built directly into the kernel at build time?

i’m one of those crazy people that likes to have everything i need in my kernel from the start, so i don’t have to deal with dkms every time i build myself a new kernel.

gnif · November 14, 2020, 8:55pm

I have not tested an in-tree build so YMMV, but it should be fine.

mathew2214 · November 14, 2020, 10:29pm

i have built this module into my kernel and attempted to reboot my Navi10 VM. it failed.
here is my dmesg immediately after shutting down the VM:

[  292.251722] AMD-Vi: Completion-Wait loop timed out
[  292.396401] AMD-Vi: Completion-Wait loop timed out
[  292.525060] AMD-Vi: Completion-Wait loop timed out
[  292.653521] AMD-Vi: Completion-Wait loop timed out
[  292.781893] AMD-Vi: Completion-Wait loop timed out
[  292.910239] AMD-Vi: Completion-Wait loop timed out
[  293.038503] AMD-Vi: Completion-Wait loop timed out
[  293.108857] iommu ivhd1: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=45:00.0 address=0x103de03540]
[  293.195431] vfio-pci 0000:45:00.1: can't change power state from D3cold to D0 (config space inaccessible)
[  293.195917] ixgbevf 0000:0a:10.6: enabling device (0000 -> 0002)
[  293.196099] ixgbe 0000:0a:00.0 enp10s0f0: VF Reset msg received from vf 3
[  293.206989] ixgbevf 0000:0a:10.6: MAC address not assigned by administrator.
[  293.206993] ixgbevf 0000:0a:10.6: Assigning random MAC address
[  293.207933] ixgbevf 0000:0a:10.6: 3a:64:b2:5a:2d:0b
[  293.207938] ixgbevf 0000:0a:10.6: MAC: 3
[  293.207940] ixgbevf 0000:0a:10.6: Intel(R) X550 Virtual Function
[  293.209885] ixgbevf 0000:0a:10.6 enp10s0f0v3: renamed from eth0
[  294.110723] iommu ivhd1: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=45:00.0 address=0x103de035a0]
[  295.112599] iommu ivhd1: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=45:00.0 address=0x103de035d0]
[  296.114484] iommu ivhd1: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=45:00.0 address=0x103de03610]
[  296.114497] iommu ivhd1: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=45:00.0 address=0x103de03630]

attempting to restart the VM yields the unknown PCI header 127 error, confirming a failed reset.
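For anyone comparing their own logs against the one above, here is a rough Python sketch that counts the symptoms seen in this dmesg. Treating these three substrings as "failed reset" markers is my assumption based only on this one log, and the names `SIGNATURES` and `count_reset_failures` are made up for illustration; this is not part of vendor-reset.

```python
import re
from collections import Counter

# Matches lines such as "[  292.396401] AMD-Vi: Completion-Wait loop timed out"
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s(.*)$")

# Substrings copied from the log above. Treating them as failure markers
# is an assumption, not something the kernel or vendor-reset guarantees.
SIGNATURES = {
    "completion_wait_timeout": "Completion-Wait loop timed out",
    "iotlb_inv_timeout": "IOTLB_INV_TIMEOUT",
    "config_space_inaccessible": "config space inaccessible",
}


def count_reset_failures(dmesg_text):
    """Count how many dmesg lines contain each known failure substring."""
    counts = Counter()
    for raw in dmesg_text.splitlines():
        match = LINE_RE.match(raw.strip())
        if not match:
            continue  # skip lines without a "[ seconds ]" prefix
        message = match.group(2)
        for name, needle in SIGNATURES.items():
            if needle in message:
                counts[name] += 1
    return counts


# Three lines taken verbatim from the log above.
sample = """\
[  292.396401] AMD-Vi: Completion-Wait loop timed out
[  293.108857] iommu ivhd1: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=45:00.0 address=0x103de03540]
[  293.195431] vfio-pci 0000:45:00.1: can't change power state from D3cold to D0 (config space inaccessible)
"""

print(dict(count_reset_failures(sample)))
```

To use it on a real log, feed it the saved output of `dmesg` as a string. A non-zero count only means the same symptoms appeared; it does not say why.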

Debian 10 (Buster) running on a debianized Linux 5.8-10 with the vendor-reset module built-in.
AMD Threadripper 1950x
ASUS Prime X399-a with firmware 1002.

it should be noted that this does not hang all VMs on the host, which the BACO patch did. this module only seems to break the VM attempting a reset.

gnif · November 15, 2020, 12:25am

The patch did not operate or is not loaded. Can you please check your dmesg for the text “hook installed”?

gnif · November 15, 2020, 1:11am

The withdrawn post is the installation message; it was correct. However, I can see you loaded it way too late. The module must be loaded as early as possible: the default reset the kernel performs breaks the GPU completely, so you must have vendor-reset loaded first.
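A quick way to see the "loaded too late" problem in a saved log is to compare timestamps. The Python sketch below is only an illustration: it assumes the installation message contains the text "hook installed" (as mentioned earlier in the thread) and uses the first "Completion-Wait loop timed out" line as a stand-in for the moment the GPU broke. Both the function names and the sample log lines are hypothetical.

```python
import re

# Matches lines such as "[  292.396401] AMD-Vi: Completion-Wait loop timed out"
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s(.*)$")


def first_timestamp(dmesg_text, needle):
    """Return the timestamp (seconds) of the first line containing needle, or None."""
    for raw in dmesg_text.splitlines():
        match = LINE_RE.match(raw.strip())
        if match and needle in match.group(2):
            return float(match.group(1))
    return None


def hook_loaded_before_failure(
    dmesg_text,
    hook_text="hook installed",                     # assumed message text
    failure_text="Completion-Wait loop timed out",  # assumed failure marker
):
    """True if the hook message appears and precedes the first failure line."""
    hook_time = first_timestamp(dmesg_text, hook_text)
    failure_time = first_timestamp(dmesg_text, failure_text)
    if hook_time is None:
        return False  # hook never reported at all
    if failure_time is None:
        return True   # hook present, no failure marker seen
    return hook_time < failure_time


# Hypothetical logs: the "example:" prefix and timestamps are invented.
loaded_late = """\
[  292.396401] AMD-Vi: Completion-Wait loop timed out
[  900.000000] example: hook installed
"""
loaded_early = """\
[    5.000000] example: hook installed
[  292.396401] AMD-Vi: Completion-Wait loop timed out
"""

print(hook_loaded_before_failure(loaded_late))
print(hook_loaded_before_failure(loaded_early))
```

The timed-out line is a symptom rather than the reset itself, so treat the result as a hint about ordering, not as proof.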

mathew2214 · November 15, 2020, 1:19am

i withdrew the post because i noticed a possible error on my end. vendor_reset loads AFTER vfio. i will report my results once i reconfigure to have vendor_reset load BEFORE vfio.

gnif · November 15, 2020, 1:23am

It doesn’t need to load before vfio_pci, it just has to be loaded before a reset is attempted, which is at VM power on or reboot.
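Since the only requirement is that the module is present before the first reset, a pre-flight check before starting the VM is one option. The Python sketch below parses text in the `/proc/modules` format and looks for the module by name; the function name and the sample contents are made up for illustration. Two caveats: loadable modules are listed with hyphens turned into underscores (so `vendor-reset` shows up as `vendor_reset`), and code built directly into the kernel does not appear in `/proc/modules` at all, so this check only makes sense for a loadable build.

```python
def is_module_loaded(proc_modules_text, name):
    """Check whether a module name appears in /proc/modules-style text."""
    wanted = name.replace("-", "_")  # the kernel lists names with underscores
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and fields[0] == wanted:
            return True
    return False


# Invented sample in the /proc/modules layout:
# name, size, refcount, dependencies, state, address
sample_modules = """\
vendor_reset 110592 0 - Live 0x0000000000000000
vfio_pci 53248 0 - Live 0x0000000000000000
"""

print(is_module_loaded(sample_modules, "vendor-reset"))
print(is_module_loaded(sample_modules, "amdgpu"))

# On a real host the text would come from reading /proc/modules, e.g.
#   with open("/proc/modules") as f:
#       is_module_loaded(f.read(), "vendor-reset")
```

A wrapper script could refuse to launch the VM when this returns False, which avoids letting the kernel's default reset touch the GPU first.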

mathew2214 · November 15, 2020, 1:35am

adding vendor_reset to /etc/modules panics the kernel on boot.
adding modprobe vendor_reset to /etc/rc.local also panics the kernel on boot.

I’m all out of ways to automatically ensure this module gets loaded.

gnif · November 15, 2020, 1:36am

any panic is not good… can you show the panic please? or ideally the entire dmesg output?

mathew2214 · November 15, 2020, 1:40am

i have no means to get the dmesg of a panicked kernel, as i never could get kdump to work.
the best i could do is take a picture of the screen with my cellphone. so here’s the tail end of the panic.

gnif · November 15, 2020, 1:42am

Thanks, that is helpful even still, will get back to you

gnif · November 15, 2020, 1:44am

Sorry, what GPU is this?

mathew2214 · November 15, 2020, 1:46am

AMD Radeon RX5700 (not XT). [1002:731f] revision c4.

gnif · November 15, 2020, 1:48am

can you please provide the file hook.o from your build?

mathew2214 · November 15, 2020, 1:53am

linux-5.8-10-source-root/drivers/vendor_reset/src/hook.o
i used google drive as the forums forbid the uploading of a .o file.

mathew2214 · November 15, 2020, 2:07am

looking through my stuff. it seems i have used vendor_reset instead of vendor-reset in a few places. i will correct all instances of this, recompile my kernel and report back.

gnif · November 15, 2020, 2:09am

Oh wait, this is in-tree, isn’t it? I.e., not a module. From what I can tell, the symbols were not resolved and it’s stuck in an infinite loop.

mathew2214 · November 15, 2020, 2:26am

well, i tried everything i could think of. can’t get the panicking to stop. it seems this module simply doesn’t work when built in-tree.