Navi Reset Kernel Patch

I did the same thing; I'm getting an RX 5700 XT to replace my R9 390X.

Yes, I can confirm this change is needed in 5.6 kernels:

mmio = ioremap_nocache(mmio_base, mmio_size);
should be replaced with
mmio = ioremap(mmio_base, mmio_size);

I needed this change to build linux-hardened 5.6.5.a-1 on Manjaro.
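For anyone who would rather script the edit above than make it by hand, here is a minimal sketch in Python. It assumes the only change needed is renaming the call, as described in the post; the function name `port_ioremap` is made up for illustration and is not part of the patch or the kernel tree.

```python
import re

# Match ioremap_nocache only as a whole identifier, so longer names
# that merely contain it (e.g. my_ioremap_nocache_helper) are untouched.
_NOCACHE_CALL = re.compile(r"\bioremap_nocache\b")


def port_ioremap(text: str) -> str:
    """Return `text` with every ioremap_nocache call renamed to ioremap."""
    return _NOCACHE_CALL.sub("ioremap", text)


# Hypothetical usage: read the patch (or quirks.c), rewrite it, save it back.
#   with open("navi-reset.patch") as fh:
#       fixed = port_ioremap(fh.read())
```

Run it over the patch file before applying the patch, or over drivers/pci/quirks.c after applying it; either way, check the result with a diff before building.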


Has any progress been made since the 2nd patch was posted? Or is this project at a complete standstill unless AMD decides to help?

Will there be a 3rd patch posted in the foreseeable future?

I was having problems compiling 5.6.3-arch1-1 a while ago and was getting errors relating to ioremap_nocache. I need to look into this again…

GPU reset fixes are set for the 5.8 kernel. Check Phoronix for further details and a source; the post is titled “Many AMD Radeon Graphics Driver Improvements Sent In For Linux 5.8”.


Ditto.
I built 5.6.5 from git using the elrepo 5.6.5 .config.
Applied the patch, compiled, and got the error on ioremap. Made the edit above.

Then I got a “Bar 0: can’t reserve” error; the fix (found here and on reddit/sof) is this kernel parameter:

video=efifb:off
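A quick way to confirm the parameter actually reached the running kernel is to look at the kernel command line. The sketch below is only an illustration: the helper names are made up, and it assumes the option appears as the exact token `video=efifb:off` rather than combined with other `video=` settings.

```python
def has_kernel_param(cmdline: str, param: str) -> bool:
    """True if `param` appears as a whole whitespace-separated token."""
    return param in cmdline.split()


def efifb_disabled(cmdline: str) -> bool:
    """True if the kernel command line contains video=efifb:off."""
    return has_kernel_param(cmdline, "video=efifb:off")


# Hypothetical usage on the host:
#   with open("/proc/cmdline") as fh:
#       print(efifb_disabled(fh.read()))
```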

I had to power cycle after the prior crash and warm reboots. I discovered this may be related to my BIOS (Zenith Extreme II) auto-detecting that the 5700 XT was connected to a monitor while the Titan V in the first PCIe slot was not, and sending the console to the Navi card despite modprobe.d/vfio.conf (the console switched halfway through the boot). Hmm, I might have to add some initrd config to my setup to stop that…

Passthrough and reset appear to be working now.

Now the issue is that the kvm/vendor_id mechanisms to hide KVM from the guest seem to have all changed since the last time I played with this…

Again, this is totally useless for us. AMD driver fixes for resets do not apply to VFIO as we do not use the amdgpu driver in the host. AMD need to fix this properly in silicon instead of continuing to hack in “fixes” that are not fixes.


I'm running a Sapphire 5700 XT with the v2 patch on 5.4.28, and reboots seem to be really broken.

[  747.627949] vfio-pci 0000:0c:00.0: enabling device (0400 -> 0403)
[  747.628073] vfio-pci 0000:0c:00.0: Navi10: performing BACO reset
[  747.630144] vfio-pci 0000:0c:00.0: Navi10: SMU error 0xff (line 3983)
[  748.631678] vfio-pci 0000:0c:00.0: Navi10: sol register = 0x52b51b0
[  748.631915] vfio-pci 0000:0c:00.0: vfio_ecap_init: hiding ecap 0x19@0x270
[  748.631926] vfio-pci 0000:0c:00.0: vfio_ecap_init: hiding ecap 0x1b@0x2d0
[  748.631930] vfio-pci 0000:0c:00.0: vfio_ecap_init: hiding ecap 0x25@0x400
[  748.631931] vfio-pci 0000:0c:00.0: vfio_ecap_init: hiding ecap 0x26@0x410
[  748.631938] vfio-pci 0000:0c:00.0: vfio_ecap_init: hiding ecap 0x27@0x440
[  748.716082] vfio-pci 0000:0c:00.0: Navi10: performing BACO reset
[  748.718154] vfio-pci 0000:0c:00.0: Navi10: SMU error 0xff (line 3983)
[  749.719682] vfio-pci 0000:0c:00.0: Navi10: sol register = 0x52b5260

I felt like it worked fine-ish before, but it’s messed up now. No idea what changed.
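When comparing logs like the one above between runs, it can help to pull out just the reset attempts and their SMU error codes. This is a rough sketch that assumes the message format shown in that dmesg excerpt ("Navi10: performing BACO reset", "Navi10: SMU error 0x..."); the function name and the dictionary layout are invented for illustration.

```python
import re

# One vfio-pci "Navi10:" line: [timestamp] vfio-pci <pci address>: Navi10: <message>
_LINE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+vfio-pci\s+(?P<dev>[0-9a-f:.]+):"
    r"\s+Navi10:\s+(?P<msg>.*)$"
)
_SMU_ERROR = re.compile(r"SMU error (0x[0-9a-f]+)")


def summarize_resets(log_text: str) -> list:
    """Group SMU errors under the BACO reset attempt that precedes them."""
    attempts = []
    for raw in log_text.splitlines():
        match = _LINE.match(raw.strip())
        if not match:
            continue  # not a Navi10 reset message (e.g. vfio_ecap_init lines)
        msg = match.group("msg")
        if msg.startswith("performing BACO reset"):
            attempts.append({
                "time": float(match.group("ts")),
                "device": match.group("dev"),
                "smu_errors": [],
            })
        elif attempts:
            err = _SMU_ERROR.search(msg)
            if err:
                attempts[-1]["smu_errors"].append(int(err.group(1), 16))
    return attempts


# Hypothetical usage: feed it the saved output of `dmesg`.
#   for attempt in summarize_resets(open("dmesg.txt").read()):
#       print(attempt)
```

Each entry records when a reset was attempted, on which device, and which SMU error codes followed it, which makes it easier to see whether a given kernel or patch version changed the failure pattern.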


I’ve found a double reboot helps to reset things, possibly only when first done in a Windows VM (and then Linux after)? It’s stupidly intermittent, but I’ve had 0xfd SMU errors and the GPU still worked.

Hey everybody,
I’m trying to compile the Manjaro kernel with the patch and always end up with “saving rejects to file drivers/pci/quirks.c.rej”. The weird thing is that it compiled before with the same kernel repo (5.4), and now it’s not working anymore. Does it matter where I place the patch after prepare() in the PKGBUILD?

Tried that already? Navi Reset Kernel Patch

Thanks, that did the trick! I’m still confused how I got it to compile with linux54 yesterday and today it’s not working anymore. Maybe because I upgraded to 5.6 via GUI and then downgraded to 5.4 again. I don’t get it…

Is there a way to make this patch work in VMware ESXi?

No need to compile a kernel yourself on Manjaro. The patch has been included in the Manjaro kernel since 5.3.

The first version of the patch is included in the Manjaro kernel, not the second one.

With or without the patch, I get huge amounts of spam in dmesg after trying to reboot the VM:

[  639.747377] AMD-Vi: Completion-Wait loop timed out

and the VM won’t turn on. Whether or not I suspend to RAM, starting X11 with the GPU afterwards doesn’t work (the logs say that there aren’t any screens), and if I suspend to RAM and then try to start the VM, I get 800x600 output with the color channels swapped.

These are the after-effects of a failed reset.


Patch ESXi? Ask VMware to give the source or beg them to do it.

Here’s a complete dmesg (boot -> VM boot -> VM reboot -> VM shutdown). https://gist.github.com/leo60228/31ff8ac8be08a404678d21034b9bc79d

This is all useless to me… these are the effects of the failure, and without input from AMD there is nothing anyone can provide, short of a magic bullet (i.e. a magic register to poke/prod), that will progress this.
