Navi Reset Kernel Patch

wendell · September 13, 2019, 10:54pm

Navi Reset

What this does is use powerplay tables to turn the card off and back on again. (Insert IT crowd meme here, haha).

@gnif has done it again. Good work!!

Ok, here is the navi reset. Navi is easier to implement as the SMU has most of the logic in it.

Update 27-11-2019: Updated patch down below: Navi Reset Kernel Patch

From 69ea42207b544b6e3fa9755022bff09d2ce953d9 Mon Sep 17 00:00:00 2001
From: Geoffrey McRae <[email protected]>
Date: Thu, 12 Sep 2019 03:19:28 +1000
Subject: [PATCH] pci quirk: AMD Navi 10 series vendor specific reset

Signed-off-by: Geoffrey McRae <[email protected]>
---
 drivers/pci/quirks.c | 98 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 98 insertions(+)

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 44c4ae1abd00..d94ddb1c6832 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -3825,6 +3825,97 @@ static int delay_250ms_after_flr(struct pci_dev *dev, int probe)
 	return 0;
 }
 
+/*
+ * AMD Navi 10 series GPUs require a vendor specific reset procedure.
+ * According to AMD a PSP mode 2 reset should be enough however at this
+ * time the details of how to perform this are not available to us.
+ * Instead we can signal the SMU to enter and exit BACO which has the same
+ * desired effect.
+ */
+static int reset_amd_navi10(struct pci_dev *dev, int probe)
+{
+	const int mmMP0_SMN_C2PMSG_81 = 0x16091;
+	const int mmMP1_SMN_C2PMSG_66 = 0x16282;
+	const int mmMP1_SMN_C2PMSG_82 = 0x16292;
+	const int mmMP1_SMN_C2PMSG_90 = 0x1629a;
+
+	u16 cfg;
+	resource_size_t mmio_base, mmio_size;
+	uint32_t __iomem * mmio;
+	unsigned int sol;
+	unsigned int timeout;
+
+	/* bus resets still cause navi to flake out */
+	dev->dev_flags |= PCI_DEV_FLAGS_NO_BUS_RESET;
+
+	if (probe)
+		return 0;
+
+	/* save the PCI state and enable memory access */
+	pci_save_state(dev);
+	pci_read_config_word(dev, PCI_COMMAND, &cfg);
+	pci_write_config_word(dev, PCI_COMMAND, cfg | PCI_COMMAND_MEMORY);
+
+	/* map BAR5 */
+	mmio_base = pci_resource_start(dev, 5);
+	mmio_size = pci_resource_len(dev, 5);
+	mmio = ioremap_nocache(mmio_base, mmio_size);
+	if (mmio == NULL) {
+		pci_disable_device(dev);
+		pci_err(dev, "Navi10: cannot iomap device\n");
+		return 0;
+	}
+
+	/* check the sign of life indicator */
+	sol = readl(mmio + mmMP0_SMN_C2PMSG_81);
+	pci_info(dev, "Navi10: SOL 0x%x\n", sol);
+	if (sol == 0 || sol == 0xffffffff) {
+		pci_info(dev, "Navi10: device doesn't need to be reset\n");
+		goto out;
+	}
+
+	pci_info(dev, "Navi10: performing BACO reset\n");
+
+	/* the SMU might be busy already, wait for it */
+	for(timeout = 200; timeout && readl(mmio + mmMP1_SMN_C2PMSG_90) != 0; --timeout)
+		msleep(1);
+	readl(mmio + mmMP1_SMN_C2PMSG_90);
+
+	/* send PPSMC_MSG_ArmD3 */
+	writel(0x00, mmio + mmMP1_SMN_C2PMSG_90);
+	writel(0x46, mmio + mmMP1_SMN_C2PMSG_66);
+	for(timeout = 200; timeout && readl(mmio + mmMP1_SMN_C2PMSG_90) != 0; --timeout)
+		msleep(1);
+
+	/* send PPSMC_MSG_EnterBaco with param */
+	writel(0x00, mmio + mmMP1_SMN_C2PMSG_90);
+	writel(0x00, mmio + mmMP1_SMN_C2PMSG_82);
+	writel(0x18, mmio + mmMP1_SMN_C2PMSG_66);
+	for(timeout = 200; timeout && readl(mmio + mmMP1_SMN_C2PMSG_90) != 0; --timeout)
+		msleep(1);
+
+	/* wait for the regulators to shutdown */
+	msleep(400);
+
+	/* send PPSMC_MSG_ExitBaco */
+	writel(0x00, mmio + mmMP1_SMN_C2PMSG_90);
+	writel(0x19, mmio + mmMP1_SMN_C2PMSG_66);
+	for(timeout = 200; timeout && readl(mmio + mmMP1_SMN_C2PMSG_90) != 0; --timeout)
+		msleep(1);
+
+	/* wait for regulators to startup again */
+	msleep(400);
+
+out:
+	/* unmap BAR5 */
+	iounmap(mmio);
+
+	/* restore the PCI state and command register */
+	pci_restore_state(dev);
+	pci_write_config_word(dev, PCI_COMMAND, cfg);
+	return 0;
+}
+
 static const struct pci_dev_reset_methods pci_dev_reset_methods[] = {
 	{ PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82599_SFP_VF,
 		 reset_intel_82599_sfp_virtfn },
@@ -3836,6 +3927,13 @@ static const struct pci_dev_reset_methods pci_dev_reset_methods[] = {
 	{ PCI_VENDOR_ID_INTEL, 0x0953, delay_250ms_after_flr },
 	{ PCI_VENDOR_ID_CHELSIO, PCI_ANY_ID,
 		reset_chelsio_generic_dev },
+	{ PCI_VENDOR_ID_ATI, 0x7310, reset_amd_navi10 },
+	{ PCI_VENDOR_ID_ATI, 0x7312, reset_amd_navi10 },
+	{ PCI_VENDOR_ID_ATI, 0x7318, reset_amd_navi10 },
+	{ PCI_VENDOR_ID_ATI, 0x7319, reset_amd_navi10 },
+	{ PCI_VENDOR_ID_ATI, 0x731a, reset_amd_navi10 },
+	{ PCI_VENDOR_ID_ATI, 0x731b, reset_amd_navi10 },
+	{ PCI_VENDOR_ID_ATI, 0x731f, reset_amd_navi10 },
 	{ 0 }
 };
 
-- 
2.20.1
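
The quirk only fires for the vendor/device pairs listed in the patch's pci_dev_reset_methods additions. As a rough aid, assuming you read the pair from the third field of `lspci -n` output, the check can be sketched in shell. The function name is made up for illustration; the IDs are copied from the patch, and 0x1002 is the value of PCI_VENDOR_ID_ATI.

```shell
#!/bin/sh
# Device IDs copied from the pci_dev_reset_methods entries added by the patch.
NAVI10_QUIRK_IDS="7310 7312 7318 7319 731a 731b 731f"

# Hypothetical helper. Usage: is_navi10_quirk_id VENDOR:DEVICE (hex, no 0x prefix)
# Prints "yes" if the pair matches an entry added by the patch, otherwise "no".
is_navi10_quirk_id() {
    q_pair=$(printf '%s' "$1" | tr 'A-F' 'a-f')
    q_vendor=${q_pair%%:*}
    q_device=${q_pair##*:}
    # 1002 is PCI_VENDOR_ID_ATI.
    if [ "$q_vendor" != "1002" ]; then
        echo no
        return 0
    fi
    for q_id in $NAVI10_QUIRK_IDS; do
        if [ "$q_id" = "$q_device" ]; then
            echo yes
            return 0
        fi
    done
    echo no
}

is_navi10_quirk_id 1002:731f   # prints "yes"
is_navi10_quirk_id 1002:66af   # prints "no" (ID is not in the patch's list)
```

A card whose ID is not in the list will not get this reset method unless the table in the patch is extended.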

wendell · September 13, 2019, 11:00pm

reserved

gnif · September 13, 2019, 11:59pm

For those that would like to support my work on things like this please see the below options:

Ko-Fi
Patreon
Paypal
BTC - 14ZFcYjsKPiVreHqcaekvHGL846u3ZuT13

FurryJackman · September 14, 2019, 1:48pm

Any word about a Radeon VII reset patch? Would this work with Vega 20 if you changed the PCI IDs?

gnif · September 15, 2019, 12:46am

I don’t have a Radeon VII to implement this on, and it almost certainly won’t work on the Vega 20. The Vega reset implementation is far more complex, and some base register addresses have changed.

FurryJackman · September 15, 2019, 5:58pm

Would be the logical next step. Anyone have a Radeon VII to give to Geoff?

Novasty · September 15, 2019, 6:12pm

You could provide it to him as a temp and he sends it back after he is done.

FurryJackman · September 15, 2019, 6:43pm

I know someone from the forums with a Radeon VII locally. I could ask him about it.

anon85976236 · September 16, 2019, 12:17pm

Is there something I need to keep in mind regarding the setup? I took the 5.3-rc1 Manjaro kernel from here: https://gitlab.manjaro.org/packages/core/linux53 and edited the PKGBUILD so that it incorporates the new patch.

Now when I do a clean shutdown and try to restart the VM with my Navi passed through, the boot hangs: I can see the TianoCore boot logo and the endlessly rotating Windows loading icon below it.

I followed the directions in the Arch wiki to set up passthrough on Manjaro.

My VM Setup is the following (removed)

Edit: Don’t know what was wrong. I destroyed the VM, kept the disk image, and created a new VM with the very same image. Windows showed an error saying it needed to restart, and now everything works.

anon85976236 · September 21, 2019, 1:37pm

Hi,

PSA: As of 10.11.2019 and kernel 5.3.8, the patch is included in Manjaro stable. If you use Manjaro with the 5.3 kernel, you already have the patch!

These threads from the Manjaro forum explained it for me:

I compiled the kernel myself for the first time for this patch. It is not much work on Manjaro, or on Arch in general.

Find the correct sources: First, clone the repository with the sources for the kernel you want to use from here: https://gitlab.manjaro.org/packages/core. On this page I would recommend you go into the linux53 repository, hit the blue clone button, and copy the “Clone via https” link. You need the package “git” installed on your system.
Clone the code: In the terminal, navigate to a folder of your choice and run “git clone https://gitlab.manjaro.org/packages/core/linux53.git”. You will get a new linux53 folder. These are the sources for the 5.3 Manjaro kernel.
Copy the code from the Navi patch into a file in the linux53 folder and name it navi.patch or something similar.
Create a SHA256 checksum of the “navi.patch” file with “shasum -a 256 navi.patch” and keep the checksum it generates.
Open the PKGBUILD file in the linux53 folder in a text editor:
Change the line _kernelname=-MANJARO to something like _kernelname=-VFIO. You do this because it makes it easier to see that you are using your patched kernel when you run uname -r ~~and also avoids the case that when Manjaro publishes a newer kernel version that yours will be replaced with a newer but unpatched version.~~
At the beginning of the file there is a list of files beginning with “source=(”. At the end, after “‘0013-bootsplash.patch’”, create an additional line before the closing parenthesis with ‘navi.patch’ or whatever you named your patch file.
Directly below the sources part in the PKGBUILD file is the “sha256sums=(” part with the SHA256 checksums. Because you added navi.patch as the last file in the sources section, you need to add the checksum you generated as the last checksum before the closing parenthesis, in the same manner.
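
The PKGBUILD edits described so far can also be scripted. The following is only a sketch under assumptions: it works on a small stand-in PKGBUILD in a scratch directory, not the real linux53 one. It uses sha256sum from coreutils in place of shasum, and append_to_array is a hypothetical helper, not part of any Manjaro tooling.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: append one quoted entry to a multi-line PKGBUILD array.
# Usage: append_to_array FILE ARRAY_NAME VALUE
append_to_array() {
    awk -v name="$2" -v value="$3" '
        $0 ~ "^" name "=\\(" { in_array = 1 }
        in_array && /\)[ \t]*$/ {
            sub(/\)[ \t]*$/, "")                   # drop the closing parenthesis
            print
            printf "        \047%s\047)\n", value  # \047 is a single quote
            in_array = 0
            next
        }
        { print }
    ' "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

# Scratch directory with a stand-in patch file and a stand-in PKGBUILD.
workdir=$(mktemp -d)
cd "$workdir"
printf 'placeholder patch body\n' > navi.patch
cat > PKGBUILD <<'EOF'
_kernelname=-MANJARO
source=('0012-bootsplash.patch'
        '0013-bootsplash.patch')
sha256sums=('1111'
            '2222')
EOF

# Checksum of the patch, then the three edits from the steps above.
sum=$(sha256sum navi.patch | cut -d' ' -f1)
append_to_array PKGBUILD source navi.patch
append_to_array PKGBUILD sha256sums "$sum"
sed 's/^_kernelname=-MANJARO$/_kernelname=-VFIO/' PKGBUILD > PKGBUILD.tmp
mv PKGBUILD.tmp PKGBUILD

cat PKGBUILD
```

The helper only handles arrays that span several lines with the closing parenthesis at the end of the last entry, which is the layout the steps above describe. Check the result by eye before building.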

At this point the PKGBUILD will know about your patch and can verify it with the checksum you generated. Now we need to apply it.

After the last of the

  patch -Np1 -i "${srcdir}/vfs-ino.patch"

lines, add one more line like this:

  patch -Np1 -i "${srcdir}/navi.patch"

Save your edits.
Install manjaro-tools-pkg: sudo pacman -S manjaro-tools-pkg
From outside the linux53 folder, run buildpkg -p linux53
The compilation takes 2 hours for me, so wait.
In the terminal, cd /var/cache/manjaro-tools/pkg/stable/x86_64; your new kernel and its headers will be there.
Install both with sudo pacman -U <kernelfilename> <headerfilename>
Reboot and choose your new kernel in the GRUB menu, which you can bring up by holding Shift while booting.
?
Profit
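
After rebooting, a quick way to confirm that the custom kernel is the one running is to look at the suffix of the kernel release string. This is a minimal sketch that assumes the _kernelname value (here -VFIO) ends up at the end of the `uname -r` output. has_kernel_suffix is a made-up helper name.

```shell
#!/bin/sh
# Hypothetical helper. Usage: has_kernel_suffix RELEASE SUFFIX
# Succeeds if RELEASE ends in "-SUFFIX".
has_kernel_suffix() {
    case "$1" in
        *-"$2") return 0 ;;
        *)      return 1 ;;
    esac
}

release=$(uname -r)
if has_kernel_suffix "$release" VFIO; then
    echo "running the custom kernel: $release"
else
    echo "kernel release has no -VFIO suffix: $release"
fi
```

If the second message appears, the bootloader most likely started a different kernel than the one you just installed.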

Edit: I wanted to add that packages you install from the Arch User Repository (AUR) on Arch or Manjaro are built in the same way. You should familiarize yourself with the build steps, because you can verify and edit all the software from the AUR if you understand buildpkg.

modzilla · October 22, 2019, 9:01pm

Hey guys,

I have a RX 5700 and applied the patch to the kernel, but whenever I try to restart the VM, it always spits out that stuff to dmesg:

[53019.066294] vfio-pci 0000:0b:00.0: Navi10: SOL 0xffffffff
[53019.066295] vfio-pci 0000:0b:00.0: Navi10: device doesn't need to be reset
[53019.066689] vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0x19@0x270
[53019.066708] vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0x1b@0x2d0
[53019.066715] vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0x25@0x400
[53019.066717] vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0x26@0x410
[53019.066719] vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0x27@0x440
.
.
.
[53026.500510] AMD-Vi: Completion-Wait loop timed out
[53026.680801] AMD-Vi: Completion-Wait loop timed out
[53026.827330] AMD-Vi: Completion-Wait loop timed out
[53026.968287] AMD-Vi: Completion-Wait loop timed out
[53027.094991] AMD-Vi: Completion-Wait loop timed out
[53027.260391] AMD-Vi: Completion-Wait loop timed out
[53027.316206] iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x7fb5724b0]
[53027.563953] AMD-Vi: Completion-Wait loop timed out
[53028.047223] AMD-Vi: Completion-Wait loop timed out
[53028.318122] iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x7fb572500]
[53029.320002] iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x7fb5725d0]
[53064.854092] vfio-pci 0000:0b:00.0: Navi10: SOL 0x0
[53064.854095] vfio-pci 0000:0b:00.0: Navi10: device doesn't need to be reset

Do I need to use qemu 4.1? I’m using Debian Buster as my OS btw. (but with a 5.3.7 kernel).
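
For reading logs like the one above: in the patch, reset_amd_navi10 returns early with "device doesn't need to be reset" whenever the SOL register reads 0 or 0xffffffff, so in both attempts shown here the BACO sequence never ran. A small sketch that pulls the SOL lines out of a kernel log and applies the same comparison follows. The function names are made up, and the labels only mirror the branch in the patch; they do not say why the register held that value.

```shell
#!/bin/sh
# Mirror of the check in reset_amd_navi10: SOL of 0 or 0xffffffff skips the reset.
# Usage: classify_sol HEX_VALUE
classify_sol() {
    case "$(printf '%s' "$1" | tr 'A-F' 'a-f')" in
        0x0|0xffffffff) echo "skipped" ;;
        0x*)            echo "reset"   ;;
        *)              echo "unknown" ;;
    esac
}

# Read kernel log lines on stdin and report what the quirk did for each SOL line.
report_sol() {
    sed -n 's/.*Navi10: SOL \(0x[0-9a-fA-F]*\).*/\1/p' | while read -r sol; do
        printf '%s -> %s\n' "$sol" "$(classify_sol "$sol")"
    done
}

# Two lines taken from the log above.
report_sol <<'EOF'
[53019.066294] vfio-pci 0000:0b:00.0: Navi10: SOL 0xffffffff
[53064.854092] vfio-pci 0000:0b:00.0: Navi10: SOL 0x0
EOF
```

On a live system the same function can be fed with `dmesg | report_sol`.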

gnif · October 23, 2019, 3:56am

By any chance are you running a threadripper?

modzilla · October 23, 2019, 7:33am

Sorry I didn’t state that! It’s a R1600 on an AB350N-Gaming WIFI (mITX) (so it’s the primary and I have to use a vbios file)

awmath · October 23, 2019, 12:47pm

I’ve got the exact same problem with a Ryzen 2700. Since I’m running a 5.3.6 kernel with this patch, I wonder if it’s related to the kernel version.
I can, however, tell you that it’s not related to whether the RX5700 is the primary GPU. I tried both: initialized as the second GPU without a vbios file, and initialized as the primary GPU with a vbios file. Both work the first time and fail afterwards with the dmesg output mentioned above.

modzilla · October 24, 2019, 7:21am

Thanks, that is good to know! I tried 5.2.something with the patch too, but that’s not the cause either. However, whenever I apply the patch I get a message that the patch ended unexpectedly, or something like that. Is that to be expected?

What distro are you using?

awmath · October 24, 2019, 9:07pm

I’m using Fedora 30.

One really “funny” thing. I can indeed start a VM with the RX5700 repeatedly with the following procedure:
Let the UEFI claim the RX5700 on boot. Pass the GPU over to the vfio-pci driver after boot, but keep the EFI framebuffer running on the GPU. Then try to start a VM (in my case Win10). It will crash, as the framebuffer is still connected to the GPU. Disconnect the EFI framebuffer and start the VM.
It will start now, BUT the Windows AMD driver won’t load, even though the Windows device manager and tools like GPU-Z recognize the GPU as an RX5700.
I can now shut down and start the VM at will. But every time, the AMD driver won’t load and I’m stuck at 800x600.
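
The post does not say how the EFI framebuffer was disconnected. One common way, assumed here, is to write the device name efi-framebuffer.0 to the driver's unbind file in sysfs. The sketch below wraps that in a hypothetical function and, to stay harmless, runs it against a fake sysfs tree in a temporary directory.

```shell
#!/bin/sh
# Hypothetical helper. Usage: unbind_efifb [SYSFS_ROOT]   (defaults to /sys)
# Assumes the framebuffer device is named efi-framebuffer.0.
unbind_efifb() {
    unbind_file="${1:-/sys}/bus/platform/drivers/efi-framebuffer/unbind"
    if [ ! -w "$unbind_file" ]; then
        echo "cannot write $unbind_file (not root, or efifb not in use)" >&2
        return 1
    fi
    echo efi-framebuffer.0 > "$unbind_file"
}

# Demo against a fake sysfs tree so nothing on the real system is touched.
fake=$(mktemp -d)
mkdir -p "$fake/bus/platform/drivers/efi-framebuffer"
: > "$fake/bus/platform/drivers/efi-framebuffer/unbind"
unbind_efifb "$fake"
cat "$fake/bus/platform/drivers/efi-framebuffer/unbind"   # prints "efi-framebuffer.0"
```

On a real host the call would be `unbind_efifb` with no argument, run as root, after which the host console on that GPU goes away.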

McSlang · October 25, 2019, 2:41am

Would it be possible to get a .patch file for this update, as with the Vega fix?

gnif · October 25, 2019, 4:41am

When I get to it. As with all things open source and free, they happen as life and time allow.

McSlang · October 25, 2019, 6:09pm

I appreciate it! Just installed a Windows VM with Looking Glass, and am finally free of Microsoft! Very happy to know that it’s in the queue, even if it’s a bit far off lol

modzilla · October 26, 2019, 4:30pm

hm that’s indeed weird…

Do you have any idea what’s going on @gnif?