Navi Reset Kernel Patch

This is known, please see:

What kernel version did you use? And which method of rebooting the VM successfully resets your Navi?

5.5.9-arch1-2-vfio; it’s from the AUR with the ACS patch pre-applied.

I just boot and shutdown using virt-manager.

Not sure if this information helps in any way, but I have some results to share.

My host machine is running Debian 10 with Linux 4.19.98 (with this patch applied).
With this patch, trying to reboot a Windows 10 VM crashes every VM that has a GPU passed through, not just the Windows one in question.

Rebooting a Linux VM succeeds, but the guest then has very corrupt output.

Here are the dmesg errors when trying to reboot a Windows VM:

[  635.877423] vfio-pci 0000:45:00.0: Navi10: performing BACO reset
[  637.147436] AMD-Vi: Completion-Wait loop timed out
[  637.278569] AMD-Vi: Completion-Wait loop timed out
[  637.406478] AMD-Vi: Completion-Wait loop timed out
[  637.533778] AMD-Vi: Completion-Wait loop timed out
[  637.664053] AMD-Vi: Completion-Wait loop timed out
[  638.005654] iommu ivhd1: Event logged [IOTLB_INV_TIMEOUT device=45:00.0 address=0x000000203e4424e0]
[  639.007530] iommu ivhd1: Event logged [IOTLB_INV_TIMEOUT device=45:00.0 address=0x000000203e442510]
[  639.137244] AMD-Vi: Completion-Wait loop timed out
[  639.747377] AMD-Vi: Completion-Wait loop timed out
[  639.897399] AMD-Vi: Completion-Wait loop timed out
[  640.009407] iommu ivhd1: Event logged [IOTLB_INV_TIMEOUT device=45:00.0 address=0x000000203e442540]
[  640.136746] AMD-Vi: Completion-Wait loop timed out
[  640.264107] AMD-Vi: Completion-Wait loop timed out
[  640.391465] AMD-Vi: Completion-Wait loop timed out
[  640.518821] AMD-Vi: Completion-Wait loop timed out
[  640.646180] AMD-Vi: Completion-Wait loop timed out
[  640.773542] AMD-Vi: Completion-Wait loop timed out
[  640.906648] AMD-Vi: Completion-Wait loop timed out
[  641.011281] iommu ivhd1: Event logged [IOTLB_INV_TIMEOUT device=45:00.0 address=0x000000203e442570]
[  641.165749] AMD-Vi: Completion-Wait loop timed out
[  641.293643] AMD-Vi: Completion-Wait loop timed out
[  641.421099] AMD-Vi: Completion-Wait loop timed out
[  641.548583] AMD-Vi: Completion-Wait loop timed out
[  641.676547] AMD-Vi: Completion-Wait loop timed out
[  641.806746] AMD-Vi: Completion-Wait loop timed out
[  641.936562] AMD-Vi: Completion-Wait loop timed out
[  642.013160] iommu ivhd1: Event logged [IOTLB_INV_TIMEOUT device=45:00.0 address=0x000000203e4425a0]
[  642.140908] AMD-Vi: Completion-Wait loop timed out
[  642.273063] AMD-Vi: Completion-Wait loop timed out
[  642.403892] AMD-Vi: Completion-Wait loop timed out
[  642.532516] AMD-Vi: Completion-Wait loop timed out
[  642.660051] AMD-Vi: Completion-Wait loop timed out
[  642.787755] AMD-Vi: Completion-Wait loop timed out
[  642.915490] AMD-Vi: Completion-Wait loop timed out
[  643.015046] iommu ivhd1: Event logged [IOTLB_INV_TIMEOUT device=45:00.0 address=0x000000203e4425d0]
[  643.142850] AMD-Vi: Completion-Wait loop timed out
[  643.270299] AMD-Vi: Completion-Wait loop timed out
[  643.397679] AMD-Vi: Completion-Wait loop timed out
[  643.529444] AMD-Vi: Completion-Wait loop timed out
[  643.667902] AMD-Vi: Completion-Wait loop timed out
[  644.017057] iommu ivhd1: Event logged [IOTLB_INV_TIMEOUT device=45:00.0 address=0x000000203e442600]
[  645.018934] iommu ivhd1: Event logged [IOTLB_INV_TIMEOUT device=45:00.0 address=0x000000203e442630]
[  645.600143] AMD-Vi: Completion-Wait loop timed out
[  645.727897] AMD-Vi: Completion-Wait loop timed out
[  645.856112] AMD-Vi: Completion-Wait loop timed out
[  645.983732] AMD-Vi: Completion-Wait loop timed out
[  646.020698] iommu ivhd1: Event logged [IOTLB_INV_TIMEOUT device=45:00.0 address=0x000000203e442660]
[  646.021634] do_IRQ: 9.37 No irq handler for vector
[  646.148241] AMD-Vi: Completion-Wait loop timed out
[  646.275862] AMD-Vi: Completion-Wait loop timed out
[  647.022564] iommu ivhd1: Event logged [IOTLB_INV_TIMEOUT device=45:00.0 address=0x000000203e442690]
[  648.024461] iommu ivhd1: Event logged [IOTLB_INV_TIMEOUT device=45:00.0 address=0x000000203e4426c0]
[  649.026318] iommu ivhd1: Event logged [IOTLB_INV_TIMEOUT device=45:00.0 address=0x000000203e4426f0]
[  650.028214] iommu ivhd1: Event logged [IOTLB_INV_TIMEOUT device=45:00.0 address=0x000000203e442720]
[  651.030096] iommu ivhd1: Event logged [IOTLB_INV_TIMEOUT device=45:00.0 address=0x000000203e442750]
[  652.031978] iommu ivhd1: Event logged [IOTLB_INV_TIMEOUT device=45:00.0 address=0x000000203e442780]
[  653.033851] iommu ivhd1: Event logged [IOTLB_INV_TIMEOUT device=45:00.0 address=0x000000203e4427b0]
[  654.035737] iommu ivhd1: Event logged [IOTLB_INV_TIMEOUT device=45:00.0 address=0x000000203e4427e0]
[  655.037593] iommu ivhd1: Event logged [IOTLB_INV_TIMEOUT device=45:00.0 address=0x000000203e442810]
[  656.039478] iommu ivhd1: Event logged [IOTLB_INV_TIMEOUT device=45:00.0 address=0x000000203e442840]
[  657.041371] iommu ivhd1: Event logged [IOTLB_INV_TIMEOUT device=45:00.0 address=0x000000203e442870]
[  658.043256] iommu ivhd1: Event logged [IOTLB_INV_TIMEOUT device=45:00.0 address=0x000000203e4428a0]
[  659.045128] iommu ivhd1: Event logged [IOTLB_INV_TIMEOUT device=45:00.0 address=0x000000203e4428d0]
[  660.046994] iommu ivhd1: Event logged [IOTLB_INV_TIMEOUT device=45:00.0 address=0x000000203e442900]
[  661.048892] iommu ivhd1: Event logged [IOTLB_INV_TIMEOUT device=45:00.0 address=0x000000203e442930]
[  662.050748] iommu ivhd1: Event logged [IOTLB_INV_TIMEOUT device=45:00.0 address=0x000000203e442960]
[  663.052649] iommu ivhd1: Event logged [IOTLB_INV_TIMEOUT device=45:00.0 address=0x000000203e442990]
[  664.054508] iommu ivhd1: Event logged [IOTLB_INV_TIMEOUT device=45:00.0 address=0x000000203e4429c0]
[  665.056389] iommu ivhd1: Event logged [IOTLB_INV_TIMEOUT device=45:00.0 address=0x000000203e4429f0]
[  666.058268] iommu ivhd1: Event logged [IOTLB_INV_TIMEOUT device=45:00.0 address=0x000000203e442a20]
[  667.060165] iommu ivhd1: Event logged [IOTLB_INV_TIMEOUT device=45:00.0 address=0x000000203e442a50]
[  668.062047] iommu ivhd1: Event logged [IOTLB_INV_TIMEOUT device=45:00.0 address=0x000000203e442a80]
[  669.063929] iommu ivhd1: Event logged [IOTLB_INV_TIMEOUT device=45:00.0 address=0x000000203e442ab0]
[  670.065800] iommu ivhd1: Event logged [IOTLB_INV_TIMEOUT device=45:00.0 address=0x000000203e442ae0]
[  671.067669] iommu ivhd1: Event logged [IOTLB_INV_TIMEOUT device=45:00.0 address=0x000000203e442b10]
[  672.069564] iommu ivhd1: Event logged [IOTLB_INV_TIMEOUT device=45:00.0 address=0x000000203e442b40]
[  673.071445] iommu ivhd1: Event logged [IOTLB_INV_TIMEOUT device=45:00.0 address=0x000000203e442b70]
[  674.073307] iommu ivhd1: Event logged [IOTLB_INV_TIMEOUT device=45:00.0 address=0x000000203e442ba0]
[  675.075188] iommu ivhd1: Event logged [IOTLB_INV_TIMEOUT device=45:00.0 address=0x000000203e442bd0]

And here are the dmesg errors when trying to reboot a Linux VM:

[  100.916386] vfio-pci 0000:48:00.0: Navi10: performing BACO reset
[  100.918443] vfio-pci 0000:48:00.0: Navi10: SMU error 0xff (line 3894)
[  101.911999] vfio-pci 0000:48:00.0: Navi10: sol register = 0x76de255
[  101.912275] vfio_ecap_init: 0000:48:00.0 hiding ecap 0x19@0x270
[  101.912291] vfio_ecap_init: 0000:48:00.0 hiding ecap 0x1b@0x2d0
[  101.912298] vfio_ecap_init: 0000:48:00.0 hiding ecap 0x25@0x400
[  101.912301] vfio_ecap_init: 0000:48:00.0 hiding ecap 0x26@0x410
[  101.912304] vfio_ecap_init: 0000:48:00.0 hiding ecap 0x27@0x440
[  101.983009] vfio-pci 0000:48:00.0: Navi10: performing BACO reset
[  101.985073] vfio-pci 0000:48:00.0: Navi10: SMU error 0xff (line 3894)
[  102.978634] vfio-pci 0000:48:00.0: Navi10: sol register = 0x77b3966

I have no idea why the VMs have different issues; the graphics cards in question are the exact same model.
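For anyone wanting to compare logs like the two above without eyeballing hundreds of lines, here is a minimal sketch that tallies the reset-related messages per device. The patterns are copied from the log lines quoted in this thread; the function name `summarize_dmesg` is made up for illustration, and other kernels may word these messages differently.

```python
import re
from collections import Counter

# Patterns based only on the dmesg lines quoted in this thread.
TIMEOUT_RE = re.compile(r"AMD-Vi: Completion-Wait loop timed out")
IOTLB_RE = re.compile(r"IOTLB_INV_TIMEOUT device=([0-9a-f:.]+)")
SMU_RE = re.compile(r"vfio-pci (\S+): Navi10: SMU error (0x[0-9a-f]+)")
RESET_RE = re.compile(r"vfio-pci (\S+): Navi10: performing BACO reset")


def summarize_dmesg(text):
    """Tally reset-related events in dmesg text (hypothetical helper)."""
    summary = {
        "completion_wait_timeouts": 0,
        "iotlb_timeouts": Counter(),  # keyed by device, e.g. "45:00.0"
        "smu_errors": Counter(),      # keyed by (device, error code)
        "baco_resets": Counter(),     # keyed by device
    }
    for line in text.splitlines():
        if TIMEOUT_RE.search(line):
            summary["completion_wait_timeouts"] += 1
        match = IOTLB_RE.search(line)
        if match:
            summary["iotlb_timeouts"][match.group(1)] += 1
        match = SMU_RE.search(line)
        if match:
            summary["smu_errors"][(match.group(1), match.group(2))] += 1
        match = RESET_RE.search(line)
        if match:
            summary["baco_resets"][match.group(1)] += 1
    return summary
```

Feeding it the saved output of `dmesg` from each host should make it easier to see whether a given card fails with IOMMU timeouts or with SMU errors.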

I did the same thing; I am getting an RX 5700 XT to replace my R9 390X.

Yes, I can confirm this change is needed in 5.6 kernels:

mmio = ioremap_nocache(mmio_base, mmio_size);
should be replaced with
mmio = ioremap(mmio_base, mmio_size);

I needed this change to build linux-hardened 5.6.5.a-1 on Manjaro.
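If you would rather not hand-edit the patch, the same substitution can be scripted. This is only a sketch: the function name `port_patch_to_5_6` is invented here, and it assumes the only incompatibility with 5.6 is the `ioremap_nocache` call described above (as far as I know that function was dropped from the kernel around 5.6, and `ioremap` is the drop-in replacement).

```python
import re


def port_patch_to_5_6(patch_text):
    """Rewrite ioremap_nocache(...) calls to ioremap(...) in patch text.

    Hypothetical helper: it only performs the one-line substitution
    described in this thread and knows nothing else about the patch.
    """
    # \b keeps names such as devm_ioremap_nocache untouched, because
    # there is no word boundary between "_" and "ioremap".
    return re.sub(r"\bioremap_nocache\s*\(", "ioremap(", patch_text)
```

Read the patch file into a string, pass it through this function, write the result back out, and then apply the patch as usual.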

Has any progress been made since the 2nd patch was posted? Or is this project at a complete standstill unless AMD decides to help?

Will a 3rd patch be posted in the foreseeable future?

I was having problems compiling 5.6.3-arch1-1 a while ago and was getting errors relating to ioremap_nocache. I need to look into this again…

GPU reset fixes are set for the 5.8 kernel. Check Phoronix for further details and a source; the post is titled “Many AMD Radeon Graphics Driver Improvements Sent In For Linux 5.8”.

Ditto.
5.6.5 from git, using the elrepo 5.6.5 .config.
Applied the patch, compiled, got an error on ioremap, and made the edit above.

Got a “Bar 0: can’t reserve” error now (found fix here and reddit/sof):

video=efifb:off
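After adding the parameter to your bootloader config and rebooting, it is worth confirming the running kernel actually received it. A minimal sketch, assuming the standard `/proc/cmdline` file on the host; the helper names are made up for illustration.

```python
def has_kernel_param(cmdline, param):
    """Return True if param appears as a whole word on the kernel command line."""
    return param in cmdline.split()


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as handle:
        return handle.read()
```

For example, `has_kernel_param(read_cmdline(), "video=efifb:off")` should be true on the host once the change has taken effect.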

I had to power cycle after the prior crash and warm reboots. I discovered this may be related to my BIOS (Zenith Extreme II) auto-detecting that the 5700 XT was connected to a monitor while the Titan V in the first PCIe slot was not, and sending the console to the Navi despite modprobe.d/vfio.conf… (the console switched halfway through the boot). Hmm, I might have to add some initrd config to my setup to stop that…

Passthrough and reset appear to be working now.

Now the issue is that the kvm/vendor_id mechanisms to hide KVM from the guest all seem to have changed since the last time I played with this…

Again, this is totally useless for us. AMD driver fixes for resets do not apply to VFIO as we do not use the amdgpu driver in the host. AMD need to fix this properly in silicon instead of continuing to hack in “fixes” that are not fixes.

Running a Sapphire 5700 XT and the v2 patch on 5.4.28, reboots seem to be really broken.

[  747.627949] vfio-pci 0000:0c:00.0: enabling device (0400 -> 0403)
[  747.628073] vfio-pci 0000:0c:00.0: Navi10: performing BACO reset
[  747.630144] vfio-pci 0000:0c:00.0: Navi10: SMU error 0xff (line 3983)
[  748.631678] vfio-pci 0000:0c:00.0: Navi10: sol register = 0x52b51b0
[  748.631915] vfio-pci 0000:0c:00.0: vfio_ecap_init: hiding ecap 0x19@0x270
[  748.631926] vfio-pci 0000:0c:00.0: vfio_ecap_init: hiding ecap 0x1b@0x2d0
[  748.631930] vfio-pci 0000:0c:00.0: vfio_ecap_init: hiding ecap 0x25@0x400
[  748.631931] vfio-pci 0000:0c:00.0: vfio_ecap_init: hiding ecap 0x26@0x410
[  748.631938] vfio-pci 0000:0c:00.0: vfio_ecap_init: hiding ecap 0x27@0x440
[  748.716082] vfio-pci 0000:0c:00.0: Navi10: performing BACO reset
[  748.718154] vfio-pci 0000:0c:00.0: Navi10: SMU error 0xff (line 3983)
[  749.719682] vfio-pci 0000:0c:00.0: Navi10: sol register = 0x52b5260

I felt like it worked fine-ish before, but it’s messed up now. No idea what changed.

I’ve found a double reboot helps to reset things, possibly only when first done in a Windows VM (and then Linux after)? It’s stupidly intermittent, but I’ve had 0xfd SMU errors and the GPU still worked.

Hey everybody,
I’m trying to compile the Manjaro kernel with the patch and always end up with “saving rejects to file drivers/pci/quirks.c.rej”. The weird thing is that it compiled before with the same kernel repo (5.4), and now it’s not working anymore. Does it matter where I place the patch after prepare() in the PKGBUILD?

Tried that already? Navi Reset Kernel Patch

Thanks, that did the trick! I’m still confused how I got it to compile with linux54 yesterday and today it’s not working anymore. Maybe because I upgraded to 5.6 via GUI and then downgraded to 5.4 again. I don’t get it…

Is there a way to make this patch work in VMware ESXi?

No need to compile a kernel yourself on Manjaro. The patch has been included in the Manjaro kernel since 5.3.

The first version of the patch is included in the Manjaro kernel, not the second one.

With or without the patch, I get huge amounts of spam in dmesg after trying to reboot the VM:

[  639.747377] AMD-Vi: Completion-Wait loop timed out

and the VM won’t turn on. Whether or not I suspend to RAM first, starting X11 with the GPU afterwards doesn’t work (the logs say that there aren’t any screens), and if I suspend to RAM and then try to start the VM, I get 800x600 output with the color channels swapped.