Navi Reset Kernel Patch

Running a Sapphire 5700 XT and the v2 patch on 5.4.28, reboots seem to be really broken.

[  747.627949] vfio-pci 0000:0c:00.0: enabling device (0400 -> 0403)
[  747.628073] vfio-pci 0000:0c:00.0: Navi10: performing BACO reset
[  747.630144] vfio-pci 0000:0c:00.0: Navi10: SMU error 0xff (line 3983)
[  748.631678] vfio-pci 0000:0c:00.0: Navi10: sol register = 0x52b51b0
[  748.631915] vfio-pci 0000:0c:00.0: vfio_ecap_init: hiding ecap 0x19@0x270
[  748.631926] vfio-pci 0000:0c:00.0: vfio_ecap_init: hiding ecap 0x1b@0x2d0
[  748.631930] vfio-pci 0000:0c:00.0: vfio_ecap_init: hiding ecap 0x25@0x400
[  748.631931] vfio-pci 0000:0c:00.0: vfio_ecap_init: hiding ecap 0x26@0x410
[  748.631938] vfio-pci 0000:0c:00.0: vfio_ecap_init: hiding ecap 0x27@0x440
[  748.716082] vfio-pci 0000:0c:00.0: Navi10: performing BACO reset
[  748.718154] vfio-pci 0000:0c:00.0: Navi10: SMU error 0xff (line 3983)
[  749.719682] vfio-pci 0000:0c:00.0: Navi10: sol register = 0x52b5260

I felt like it worked fine-ish before, but it’s messed up now. No idea what changed.

I’ve found a double reboot helps to reset things, possibly only when first done in a Windows VM (and then Linux after)? It’s stupidly intermittent but I’ve had 0xfd SMU errors and the GPU still worked.

Hey everybody,
I’m trying to compile the Manjaro kernel with the patch and always end up with “saving rejects to file drivers/pci/quirks.c.rej”. The weird thing is that it compiled before with the same kernel repo (5.4), and now it’s not working anymore. Does it matter where I place the patch after prepare() in the PKGBUILD?

Tried that already? Navi Reset Kernel Patch

Thanks, that did the trick! I’m still confused how I got it to compile with linux54 yesterday and today it’s not working anymore. Maybe because I upgraded to 5.6 via GUI and then downgraded to 5.4 again. I don’t get it…

Is there a way to make this patch work in VMware ESXi?

No need to compile a Kernel yourself on Manjaro. The patch has been included in the Manjaro Kernel since 5.3.

The first version of the patch is included in the Manjaro kernel, not the second one.

With or without the patch, I get huge amounts of spam in dmesg after trying to reboot the VM:

[  639.747377] AMD-Vi: Completion-Wait loop timed out

and the VM won’t turn on. Whether or not I suspend to RAM, starting X11 with the GPU afterwards doesn’t work (the logs say that there aren’t any screens), and if I suspend to RAM and then try to start the VM, I get 800x600 output with the color channels swapped.

These are the after-effects of a failed reset
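
For anyone who wants to check quickly whether a saved dmesg shows these symptoms, here is a rough Python sketch. The patterns are copied from the log lines quoted earlier in this thread; the function name `summarize_reset_log` is made up for illustration and is not part of the patch. Treat the counts as a hint only: as noted above, an SMU error line does not always mean the GPU is unusable afterwards.

```python
import re
from collections import Counter

# Patterns copied from log lines quoted in this thread.
PATTERNS = {
    "baco_reset": re.compile(r"Navi10: performing BACO reset"),
    "smu_error": re.compile(r"Navi10: SMU error (0x[0-9a-fA-F]+)"),
    "iommu_timeout": re.compile(r"AMD-Vi: Completion-Wait loop timed out"),
}


def summarize_reset_log(text):
    """Count reset-related lines in dmesg text and collect SMU error codes.

    Hypothetical helper: returns (counts_by_pattern_name, list_of_smu_codes).
    """
    counts = Counter()
    smu_codes = []
    for line in text.splitlines():
        for name, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                counts[name] += 1
                if name == "smu_error":
                    smu_codes.append(match.group(1))
    return dict(counts), smu_codes
```

It could be fed the output of `dmesg` saved to a file, for example `summarize_reset_log(open("dmesg.txt").read())`, where `dmesg.txt` is just a placeholder name.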

Patch ESXi? Ask VMware to give the source or beg them to do it.

Here’s a complete dmesg (boot -> VM boot -> VM reboot -> VM shutdown). https://gist.github.com/leo60228/31ff8ac8be08a404678d21034b9bc79d

This is all useless to me… these are the effects of the failure, and without input from AMD there is nothing anyone can provide, short of a magic bullet (i.e. a magic register to poke/prod), that will progress this.

So there’s no way I could help debug this? Oh well…

I just patched kernel 5.7 on Manjaro, and even though my host still crashes after shutting down the VM, I get no errors in journalctl. I’m asking because I got the usual errors when I used other patched kernels, and I’d like to know how else this could be debugged.

Does this effectively mean that, at least for Navi10 GPUs, we’re stuck with an incomplete implementation of the reset routine for good?

No offense m8, I appreciate your work and I am very grateful for it; it’s just that macOS is still broken. macOS can boot once and everything even works just fine, but macOS’ driver puts the graphics card into a state it cannot recover from. If you want to reboot macOS or shut down macOS and boot into Windows you will have to perform a hard reset on the host. My current - and probably final - workaround to this is using a dedicated RX480 which does not exhibit this problem.

Correct.

It seems that ioremap_nocache was changed to ioremap_cache in the latest kernel updates (tested on 5.6.0).

No, that is wrong and has been covered here further up. It has been removed as it was an alias of ‘ioremap’ which you should be using instead. Using caching will break things.
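
If the patch you have still calls ioremap_nocache and you are building against a kernel where it is gone, one way to adapt it is a plain text substitution before applying it. Below is a minimal Python sketch of that; the function name `use_plain_ioremap` is hypothetical, and it assumes the patch only uses `ioremap_nocache(...)` directly, so `devm_ioremap_nocache` is deliberately left alone.

```python
import re

# Match ioremap_nocache only as a whole identifier, so that
# devm_ioremap_nocache (a different helper) is not rewritten.
_NOCACHE = re.compile(r"\bioremap_nocache\b")


def use_plain_ioremap(patch_text):
    """Return patch_text with ioremap_nocache replaced by ioremap.

    Hypothetical helper for illustration: the replacement stays on the
    same line, so the hunk line counts in the patch do not change.
    """
    return _NOCACHE.sub("ioremap", patch_text)
```

Read the patch file, pass its contents through the function, write the result back out, and then apply it as usual. Do not substitute ioremap_cache; as said above, caching will break things.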

Hi guys,

I am using CentOS 8 with an AMD 5700 XT. I used the following:

<features>
  <acpi/>
  <apic/>
  <hyperv>
    <relaxed state="on"/>
    <vapic state="on"/>
    <spinlocks state="on" retries="8191"/>
    <vendor_id state="on" value="1002"/>
  </hyperv>
  <kvm>
    <hidden state="on"/>
  </kvm>
  <vmport state="off"/>
</features>

and for the kernel:

set default_kernelopts="root=/dev/mapper/cl-root ro crashkernel=auto resume=/dev/mapper/cl-swap rd.lvm.lv=cl/root rd.lvm.lv=cl/swap quiet loglevel=7 systemd.log_level=debug nouveau.modeset=0 amd_iommu=on iommu=pt video=efifb:off "

It works, but on shutdown of Win10 it gives the error …127. I tried to install kernel-ml 5.6-14 and it does not even boot; it gets stuck on "Starting switch Root …" and hangs there.

Is there any patch I can apply to resolve the reset bug?

Regards,

Tim