Navi Reset Kernel Patch

KaneTW · April 30, 2020, 6:07pm

I’m running a Sapphire 5700 XT with the v2 patch on 5.4.28, and reboots seem to be really broken.

[  747.627949] vfio-pci 0000:0c:00.0: enabling device (0400 -> 0403)
[  747.628073] vfio-pci 0000:0c:00.0: Navi10: performing BACO reset
[  747.630144] vfio-pci 0000:0c:00.0: Navi10: SMU error 0xff (line 3983)
[  748.631678] vfio-pci 0000:0c:00.0: Navi10: sol register = 0x52b51b0
[  748.631915] vfio-pci 0000:0c:00.0: vfio_ecap_init: hiding ecap 0x19@0x270
[  748.631926] vfio-pci 0000:0c:00.0: vfio_ecap_init: hiding ecap 0x1b@0x2d0
[  748.631930] vfio-pci 0000:0c:00.0: vfio_ecap_init: hiding ecap 0x25@0x400
[  748.631931] vfio-pci 0000:0c:00.0: vfio_ecap_init: hiding ecap 0x26@0x410
[  748.631938] vfio-pci 0000:0c:00.0: vfio_ecap_init: hiding ecap 0x27@0x440
[  748.716082] vfio-pci 0000:0c:00.0: Navi10: performing BACO reset
[  748.718154] vfio-pci 0000:0c:00.0: Navi10: SMU error 0xff (line 3983)
[  749.719682] vfio-pci 0000:0c:00.0: Navi10: sol register = 0x52b5260

I felt like it worked fine-ish before, but it’s messed up now. No idea what changed.
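For anyone comparing logs like the one above across kernels or patch versions, here is a minimal Python sketch that groups the Navi10 lines by reset attempt. The regular expressions only assume the message shapes visible in this log; the function name `summarize_resets` and the `SAMPLE` string are made up for illustration, and other patch versions may print different messages.

```python
import re

# These patterns mirror the messages in the log above (an assumption:
# other versions of the patch may word them differently).
RESET_START = re.compile(r"Navi10: performing BACO reset")
SMU_ERROR = re.compile(r"Navi10: SMU error (0x[0-9a-fA-F]+)")
SOL_REGISTER = re.compile(r"Navi10: sol register = (0x[0-9a-fA-F]+)")

# Excerpt of the log above, used as sample input.
SAMPLE = """\
[  747.628073] vfio-pci 0000:0c:00.0: Navi10: performing BACO reset
[  747.630144] vfio-pci 0000:0c:00.0: Navi10: SMU error 0xff (line 3983)
[  748.631678] vfio-pci 0000:0c:00.0: Navi10: sol register = 0x52b51b0
[  748.716082] vfio-pci 0000:0c:00.0: Navi10: performing BACO reset
[  748.718154] vfio-pci 0000:0c:00.0: Navi10: SMU error 0xff (line 3983)
[  749.719682] vfio-pci 0000:0c:00.0: Navi10: sol register = 0x52b5260
"""


def summarize_resets(dmesg_text):
    """Return one dict per BACO reset attempt found in the text."""
    attempts = []
    for line in dmesg_text.splitlines():
        if RESET_START.search(line):
            # A new reset attempt starts here.
            attempts.append({"smu_errors": [], "sol": None})
            continue
        if not attempts:
            continue  # ignore anything before the first reset attempt
        match = SMU_ERROR.search(line)
        if match:
            attempts[-1]["smu_errors"].append(match.group(1))
            continue
        match = SOL_REGISTER.search(line)
        if match:
            attempts[-1]["sol"] = match.group(1)
    return attempts


for number, attempt in enumerate(summarize_resets(SAMPLE), start=1):
    print(number, attempt)
```

This only makes it easier to see which attempts hit an SMU error and what the sol register read afterwards; it says nothing about why the reset failed.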

KaneTW · May 1, 2020, 8:30am

I’ve found a double reboot helps to reset things, possibly only when first done in a Windows VM (and then Linux after)? It’s stupidly intermittent but I’ve had 0xfd SMU errors and GPU still worked.

cd39nm · May 2, 2020, 12:11pm

Hey everybody,
I’m trying to compile the Manjaro kernel with the patch and always end up with “saving rejects to file drivers/pci/quirks.c.rej”. The weird thing is that it compiled before with the same kernel repo (5.4), and now it’s not working anymore. Does it matter where I place the patch after prepare() in the PKGBUILD?

modzilla · May 2, 2020, 12:53pm

Tried that already? Navi Reset Kernel Patch

cd39nm · May 2, 2020, 1:11pm

Thanks, that did the trick! I’m still confused how I got it to compile with linux54 yesterday and today it’s not working anymore. Maybe because I upgraded to 5.6 via GUI and then downgraded to 5.4 again. I don’t get it…

NicKF · May 3, 2020, 7:30am

Is there a way to make this patch work in VMware ESXi?

BigBlueHouse · May 3, 2020, 12:46pm

No need to compile a kernel yourself on Manjaro. The patch has been included in the Manjaro kernel since 5.3.

cd39nm · May 4, 2020, 2:13pm

The first version of the patch is included in the Manjaro kernel, not the second one.

leo60228 · May 5, 2020, 7:30pm

With or without the patch, I get huge amounts of spam in dmesg after trying to reboot the VM:

[  639.747377] AMD-Vi: Completion-Wait loop timed out

and the VM won’t turn on. Whether or not I suspend to RAM first, starting X11 with the GPU afterwards doesn’t work (the logs say that there aren’t any screens), and if I suspend to RAM and then try to start the VM, I get 800x600 output with the color channels swapped.

gnif · May 6, 2020, 11:31am

These are the after-effects of a failed reset

PloniAlmoni · May 7, 2020, 9:55pm

Patch ESXi? Ask VMware to give the source or beg them to do it.

leo60228 · May 18, 2020, 2:29pm

Here’s a complete dmesg (boot -> VM boot -> VM reboot -> VM shutdown). https://gist.github.com/leo60228/31ff8ac8be08a404678d21034b9bc79d

gnif · May 19, 2020, 2:08am

This is all useless to me… these are the effects of the failure, and without input from AMD there is nothing anyone can provide, short of a magic bullet (i.e. a magic register to poke/prod), that will progress this.

leo60228 · May 19, 2020, 2:50pm

So there’s no way I could help debug this? Oh well…

cd39nm · May 21, 2020, 5:28pm

I just patched kernel 5.7 on Manjaro, and even though my host still crashes after shutting down the VM, I get no errors in journalctl. I got the usual errors when I used other patched kernels, so I’d like to know: how could this possibly be debugged any other way?

surxenberg · May 22, 2020, 12:22pm

Does this effectively mean that, at least for Navi10 GPUs, we’re stuck with an incomplete implementation of the reset routine for good?

No offense m8, I appreciate your work and I am very grateful for it; it’s just that macOS is still broken. macOS can boot once and everything even works just fine, but macOS’ driver puts the graphics card into a state it cannot recover from. If you want to reboot macOS or shut down macOS and boot into Windows you will have to perform a hard reset on the host. My current - and probably final - workaround to this is using a dedicated RX480 which does not exhibit this problem.

Log · May 22, 2020, 6:12pm

Correct.

molayab · May 22, 2020, 10:33pm

It seems that ioremap_nocache was changed to ioremap_cache in the latest kernel updates (tested on 5.6.0).

gnif · May 23, 2020, 2:30am

No, that is wrong and has been covered here further up. It has been removed as it was an alias of ‘ioremap’ which you should be using instead. Using caching will break things.
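If you are carrying an older copy of the patch onto a kernel where ioremap_nocache no longer exists, the rename described above can be applied to the patch file itself. Below is a minimal Python sketch, not part of the patch or of any kernel tooling: `port_patch` and the `SAMPLE` hunk are hypothetical and only illustrate the idea. It touches added lines only (those starting with a single `+`), so context and removed lines still match the tree being patched.

```python
import re

# Match the whole identifier only, so names such as
# devm_ioremap_nocache are left untouched.
_NOCACHE = re.compile(r"\bioremap_nocache\b")

# Hypothetical hunk for illustration; it is not taken from the real patch.
SAMPLE = (
    "--- a/drivers/pci/quirks.c\n"
    "+++ b/drivers/pci/quirks.c\n"
    "@@ -1,1 +1,2 @@\n"
    " context line\n"
    "+\tmmio = ioremap_nocache(base, size);\n"
)


def port_patch(patch_text):
    """Replace ioremap_nocache with ioremap on added lines of a unified diff."""
    out = []
    for line in patch_text.splitlines(keepends=True):
        # "+++ b/..." is a file header, not an added line.
        is_added = line.startswith("+") and not line.startswith("+++")
        out.append(_NOCACHE.sub("ioremap", line) if is_added else line)
    return "".join(out)


print(port_patch(SAMPLE), end="")
```

The replacement is plain ioremap, not ioremap_cache, for the reason given above: a cached mapping of the device registers will break things.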

aliaj00 · May 24, 2020, 1:40am

Hi Guys,

I am using CentOS 8 with an AMD 5700 XT. I did the following:

<features>
  <acpi/>
  <apic/>
  <hyperv>
    <relaxed state="on"/>
    <vapic state="on"/>
    <spinlocks state="on" retries="8191"/>
    <vendor_id state="on" value="1002"/>
  </hyperv>
  <kvm>
    <hidden state="on"/>
  </kvm>
  <vmport state="off"/>
</features>

and for the kernel:

set default_kernelopts="root=/dev/mapper/cl-root ro crashkernel=auto resume=/dev/mapper/cl-swap rd.lvm.lv=cl/root rd.lvm.lv=cl/swap quiet loglevel=7 systemd.log_level=debug nouveau.modeset=0 amd_iommu=on iommu=pt video=efifb:off "

It works, but on shutdown of Win10 it gives the error …127. I tried to install kernel-ml 5.6-14 and it does not even boot; it gets stuck on Starting switch Root … and hangs there.

Is there any patch I can apply to resolve the reset bug?

Regards,

Tim