The project https://github.com/gnif/vendor-reset is a collaboration between @belfrypossum and myself. It aims to provide an avenue for easily adding complex reset sequences to the kernel without needing to upstream them into the kernel itself.
Today @belfrypossum and I agreed that the project is ready for use by the general public, and we would like to announce that it completely supersedes the previously released patches for AMD GPU resets. The project currently targets the following ASICs (this is not an exhaustive list; only a few example GPUs are listed for each ASIC):
Polaris 10, 11 & 12
Vega 10 (Vega56/64/FE)
Vega 20 (Radeon VII)
Navi 10 (5600XT, 5700, 5700XT)
Navi 12 (Pro 5600M)
Navi 14 (Pro 5300, RX 5300, 5500XT)
Usage is very simple: just build the module and modprobe it, or use dkms to manage it directly (configuration is included). Nothing more is needed.
There are still conditions under which the GPUs will not reset; however, we are working to improve this as time permits.
This entirely removes the need to patch your kernel, and it is required that any patches you have applied for GPU resets be removed when using this module.
If you would like to support this project: due to the amount of time and hardware that @belfrypossum has invested, I will not be accepting donations for this project at this time. However, you can show your support for @belfrypossum's amazing work on Ko-fi here:
does this module have to be modprobed? will it not function if built directly into the kernel at build time?
i'm one of those crazy people that likes to have everything i need in my kernel from the start, so i don't have to deal with dkms every time i build myself a new kernel.
i have built this module into my kernel and attempted to reboot my Navi10 VM. it failed.
here is my dmesg immediately after shutting down the VM:
[ 292.251722] AMD-Vi: Completion-Wait loop timed out
[ 292.396401] AMD-Vi: Completion-Wait loop timed out
[ 292.525060] AMD-Vi: Completion-Wait loop timed out
[ 292.653521] AMD-Vi: Completion-Wait loop timed out
[ 292.781893] AMD-Vi: Completion-Wait loop timed out
[ 292.910239] AMD-Vi: Completion-Wait loop timed out
[ 293.038503] AMD-Vi: Completion-Wait loop timed out
[ 293.108857] iommu ivhd1: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=45:00.0 address=0x103de03540]
[ 293.195431] vfio-pci 0000:45:00.1: can't change power state from D3cold to D0 (config space inaccessible)
[ 293.195917] ixgbevf 0000:0a:10.6: enabling device (0000 -> 0002)
[ 293.196099] ixgbe 0000:0a:00.0 enp10s0f0: VF Reset msg received from vf 3
[ 293.206989] ixgbevf 0000:0a:10.6: MAC address not assigned by administrator.
[ 293.206993] ixgbevf 0000:0a:10.6: Assigning random MAC address
[ 293.207933] ixgbevf 0000:0a:10.6: 3a:64:b2:5a:2d:0b
[ 293.207938] ixgbevf 0000:0a:10.6: MAC: 3
[ 293.207940] ixgbevf 0000:0a:10.6: Intel(R) X550 Virtual Function
[ 293.209885] ixgbevf 0000:0a:10.6 enp10s0f0v3: renamed from eth0
[ 294.110723] iommu ivhd1: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=45:00.0 address=0x103de035a0]
[ 295.112599] iommu ivhd1: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=45:00.0 address=0x103de035d0]
[ 296.114484] iommu ivhd1: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=45:00.0 address=0x103de03610]
[ 296.114497] iommu ivhd1: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=45:00.0 address=0x103de03630]
attempting to restart the VM yields the unknown PCI header 127 error, confirming a failed reset.
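For anyone comparing logs, the failure signatures in the dmesg output above can be tallied with a short script. This is only a minimal sketch: the two patterns are copied from the lines in this post, the function name `summarize_reset_failures` is made up for illustration, and other kernels or IOMMU drivers may word these messages differently.

```python
import re
from collections import Counter

# Patterns taken from the dmesg lines quoted in this post (assumed stable wording).
TIMEOUT_RE = re.compile(r"IOTLB_INV_TIMEOUT device=([0-9a-fA-F:.]+)")
D3COLD_RE = re.compile(
    r"vfio-pci ([0-9a-fA-F:.]+): can't change power state from D3cold to D0"
)

def summarize_reset_failures(dmesg_text):
    """Count IOTLB invalidation timeouts per device and list devices stuck in D3cold."""
    return {
        "iotlb_timeouts": Counter(TIMEOUT_RE.findall(dmesg_text)),
        "stuck_in_d3cold": sorted(set(D3COLD_RE.findall(dmesg_text))),
    }

# Three lines from the log above, used as sample input.
sample = """\
[ 293.108857] iommu ivhd1: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=45:00.0 address=0x103de03540]
[ 293.195431] vfio-pci 0000:45:00.1: can't change power state from D3cold to D0 (config space inaccessible)
[ 294.110723] iommu ivhd1: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=45:00.0 address=0x103de035a0]
"""

summary = summarize_reset_failures(sample)
for device, hits in summary["iotlb_timeouts"].items():
    print(f"{device}: {hits} IOTLB invalidation timeout(s)")
for device in summary["stuck_in_d3cold"]:
    print(f"{device}: config space inaccessible after D3cold")
```

To run it against a live system, the output of `dmesg` can be passed in place of `sample`.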
Debian 10 (Buster) running on a debianized Linux 5.8-10 with the vendor-reset module built-in.
AMD Threadripper 1950X
ASUS Prime X399-A with firmware 1002.
it should be noted that this does not hang all VMs on the host, which the BACO patch did. this module only seems to break the VM attempting a reset.
The withdrawn post was the installation message; it was correct. However, I can see you loaded the module far too late: it must be loaded as early as possible. The default reset the kernel performs breaks the GPU completely, so you must have vendor-reset loaded first.
i withdrew the post because i noticed a possible error on my end. vendor_reset loads AFTER vfio. i will report my results once i reconfigure to have vendor_reset load BEFORE vfio.
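When vendor-reset is used as a loadable module rather than built in, the ordering requirement can be sanity-checked against a boot-time module list. The sketch below assumes a one-module-per-line file with `#` comments (the format Debian uses for /etc/modules); the function names are hypothetical, and it says nothing about a module compiled into the kernel image. It treats `-` and `_` as interchangeable, as modprobe does.

```python
def normalize(name):
    # modprobe treats '-' and '_' in module names as equivalent.
    return name.strip().replace("-", "_")

def parse_module_list(text):
    """Return module names in file order, skipping blank lines and '#' comments."""
    names = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            # The first token is the module name; anything after it is options.
            names.append(normalize(line.split()[0]))
    return names

def vendor_reset_loads_first(text):
    """True if vendor-reset is listed before every vfio* module in the list."""
    names = parse_module_list(text)
    target = normalize("vendor-reset")
    if target not in names:
        return False
    vfio_positions = [i for i, n in enumerate(names) if n.startswith("vfio")]
    return not vfio_positions or names.index(target) < min(vfio_positions)

good = "# GPU reset quirks first\nvendor-reset\nvfio\nvfio_pci\n"
bad = "vfio\nvfio_pci\nvendor_reset\n"
print(vendor_reset_loads_first(good))
print(vendor_reset_loads_first(bad))
```

A list order is only a hint about actual load order, so checking the running system is still worthwhile.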
i have no means to get the dmesg of a panicked kernel, as i never could get kdump to work.
the best i could do is take a picture of the screen with my cellphone. so here’s the tail end of the panic.
looking through my stuff, it seems i have used vendor_reset instead of vendor-reset in a few places. i will correct all instances of this, recompile my kernel and report back.