Navi Reset Kernel Patch

The patch is not complete and is pending insights from AMD. Recovery from a Linux guest driver unload at this time doesn’t work at all.

Ah, OK. Just as I suspected. I will keep a close eye on this forum for any updates. Really appreciate your work on this.

I can confirm that the same is the case with macOS. Booting into macOS once is fine. After a reboot I still get the TianoCore logo, Clover and all that, but once the actual driver is loaded in macOS I get a black screen. If I boot Windows after having booted and shut down macOS, I get stuck in a boot loop.

As long as it’s just Windows it’s fine, but once I have booted macOS I will have to reset the host if I want to go back to Windows or reboot macOS.

My interim solution to this problem is an RX 470 which I got off eBay. As long as amdgpu is blacklisted on the host, all Polaris cards will work very nicely, even on macOS. I am not sure what happens if the amdgpu driver is loaded on the host, as I am using the host exclusively as a headless machine.

In any case, thank you very much for your work. I try to avoid nVidia as far as I can and your patch made this possible.
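
On the blacklisting point above, here is a minimal sketch for checking which host driver (if any) is currently bound to the card. It assumes the usual Linux sysfs layout, where each PCI device directory has a `driver` symlink while a driver is bound. The address `0000:0a:00.0` is made up; use whatever `lspci` shows for your GPU.

```python
import os

def bound_driver(bdf, sysfs_root="/sys/bus/pci/devices"):
    """Return the name of the kernel driver bound to a PCI device, or None."""
    link = os.path.join(sysfs_root, bdf, "driver")
    if not os.path.islink(link):
        return None  # no driver bound, or no such device
    # The symlink points at .../bus/pci/drivers/<name>; keep only <name>.
    return os.path.basename(os.readlink(link))

if __name__ == "__main__":
    # Hypothetical address (an assumption, not taken from this thread).
    print(bound_driver("0000:0a:00.0"))
```

If this prints `amdgpu`, the blacklist did not take effect; for passthrough you would typically expect `vfio-pci` or nothing at all.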

Many thanks for the patch.
I just want to add that if you have permanent stutters and slowness, even when scrolling in Radeon Software, you could try adding

pcie_aspm=performance

to the kernel command line.
This also fixed the nearly unplayable stutters in Jedi Fallen Order on a 5700 XT with Adrenalin 2020 Edition 20.1.4.
Smaller issues are still there, possibly because the driver is not that good.
Well, at least it is playable now.
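
It may be worth verifying that the setting actually took effect after a reboot. As far as I know, the kernel documentation lists only `off` and `force` as values for `pcie_aspm=`, and spells the policy switch `pcie_aspm.policy=performance`, so treat the exact parameter name as something to double-check on your kernel. The sketch below reads the active policy from sysfs, assuming the file lists all policies with the active one in square brackets.

```python
from pathlib import Path

# Sysfs file exposing the ASPM policy (only present with PCIe ASPM support).
POLICY_PATH = Path("/sys/module/pcie_aspm/parameters/policy")

def active_aspm_policy(text):
    """Return the bracketed (active) policy from the sysfs text, or None."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None

if __name__ == "__main__":
    if POLICY_PATH.exists():
        print(active_aspm_policy(POLICY_PATH.read_text()))
    else:
        print("pcie_aspm policy file not found")
```

If the printed policy is not `performance`, the command-line option was probably ignored.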

Looks like good news!
https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.6-Navi-GPU-Reset

Unfortunately not. This is reset and recovery for the AMDGPU driver, not VFIO. AMD are refusing to implement a standard GPU reset, or to implement it as a PCI reset quirk independent of the AMDGPU driver.

Until you see a patch that affects quirks.c, these “improvements” are useless to us.
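
To make the distinction a bit more concrete: as far as I know, the kernel only exposes a `reset` file in a PCI device's sysfs directory when it has found some way to reset that function, whether a standard mechanism or a device-specific quirk. The sketch below just checks for that file. The address is hypothetical, and the presence of the file does not prove the reset actually recovers a Navi card.

```python
from pathlib import Path

def has_reset_method(bdf, sysfs_root="/sys/bus/pci/devices"):
    """True if the kernel exposes a function-level reset for this device."""
    return (Path(sysfs_root) / bdf / "reset").exists()

if __name__ == "__main__":
    # Hypothetical address (an assumption); take the real one from lspci.
    print(has_reset_method("0000:0a:00.0"))
```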

Ahh. Cr@p! And here I was thinking that AMD had woken up and was trying to win some of the GPU market too…

Quick question, are you working on another update of the Navi10 reset bug patch?

Always, however the progress is painstakingly slow due to the lack of information on technical details covered by NDAs.

A couple of questions…

  1. Why do they refuse to implement a ‘standard GPU reset’?
    a) Is it a hardware limitation, or
    b) is the hardware OK, but the software has architectural issues and it is not economical to fix them?
  2. How well do the workstation and server cards (the MxGPU ones) behave in this regard? Got any idea?

Ask AMD.

Just as bad, unless you’re splitting an SR-IOV card into functional devices. Those can be reset, just not the root device.

I can confirm that this patch is working perfectly fine on my Zenith Extreme II + Threadripper 3970X + RX 5700 XT Proxmox VE setup with Windows Server 2019 as guest. To get it working I just cloned the latest Proxmox kernel and applied the latest Navi patch, and everything just worked! I’m using this setup as my main PC, so getting that working was very important for me. Thank you so much for all your effort.

So the 5700 XT is fixed? Is the 5600 XT fixed too? I have been going back and forth on whether to buy an RTX 2060 or a 5600 XT, and the difference is at least 50 euros. The 5600 XT would be the better choice for me economically, but then there is the reset bug and the vBIOS update, which make me a bit nervous. Then again, the RTX 2060 doesn’t leave me any less undecided.

It’s fixed in some limited scenarios. It can work, but is unfortunately still a headache.

Okay. So what about a timetable? If I get a 5600 XT in, say, two months, do you think it will have been resolved by then? ONE BIG question though: are the Navi cards backwards compatible with PCIe 3.0? I’m still on an old AX370-Gaming K5 motherboard.

You say in some limited scenarios… So specific hardware and distros, then? Is there a specific thread that explains what the reset bug actually does? I’m just going by association with the word “reset” at the moment, but what I have heard so far is that the GPU memory isn’t fully emptied on reboot, so the next VM start crashes for lack of memory and you have to reboot the host system to resolve it. That sucks.

Well, thanks anyway.

Nope, there is no timetable. We are at the mercy of AMD’s Legal department at this point.

Yes, all PCIe 4.0 devices MUST be backwards compatible as part of the PCIe 4.0 specification, or they cannot claim to be a PCIe device.

It’s simple: the GPU won’t reset until you physically restart the entire PC. The net result varies. Either you can’t use the GPU for passthrough at all because the BIOS POSTs the GPU before you can give it to the VM, or you can start the VM once, but restarting the VM requires a GPU reset and you’re stuck again until you reset the entire PC.

Would it count as a reset if the 8-pin PCIe power connectors were physically removed and reinserted (while the physical machine is still running)?
Is there any risk of physical damage in doing this? I don’t want to risk destroying any hardware.
I could create a software-controlled device that breaks and reconnects that power circuit. I’m just asking if there’s any hope of that working, so that I don’t waste time on a fruitless effort.

Has anyone managed to slam this into Unraid?

I’m not finding much useful information that walks me through compiling a system.

In my experience a simple sleep/wake also works:

sudo rtcwake -m mem --seconds 10

No, do NOT do this. You can destroy your GPU, or your motherboard, or both. Even if you could fully turn off the GPU with a switch, it still would be of no use: the reset has to be negotiated with the PCIe bus controller so it can re-scan the device after the reset. Just cutting power would cause the PCIe bus controller to have a fit and flag the port as dead, or just hang the entire system.

Yes, people have been doing this for some time, as it’s effectively the same thing as resetting the PC. Note it’s not a simple sleep, it’s full hibernation, which can cause all sorts of issues for various devices under Linux (e.g., many laptops running Linux fail to wake up from hibernation correctly).

Also, if you, like me, run multiple VMs on the system, a sleep/wake event could never be a viable option, as it will affect all VMs and interrupt services.
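
For anyone who still wants to try the sleep/wake workaround on a single-VM box, it may help to check first which suspend variant the host will actually enter. For reference, the rtcwake man page describes `-m mem` as suspend-to-RAM and `-m disk` as suspend-to-disk, and the kernel reports what `mem` maps to in `/sys/power/mem_sleep`. My assumption (not confirmed anywhere in this thread) is that the workaround needs the `deep` variant, since `s2idle` keeps devices powered. A rough sketch:

```python
from pathlib import Path

MEM_SLEEP = Path("/sys/power/mem_sleep")

def suspend_modes(text):
    """Parse e.g. 's2idle [deep]' into (available modes, selected mode)."""
    modes, selected = [], None
    for token in text.split():
        name = token.strip("[]")
        modes.append(name)
        if token.startswith("["):
            selected = name  # the kernel brackets the currently selected mode
    return modes, selected

if __name__ == "__main__":
    if MEM_SLEEP.exists():
        modes, selected = suspend_modes(MEM_SLEEP.read_text())
        # Assumption: only 'deep' is likely to power-cycle the GPU.
        print("available:", modes, "selected:", selected)
    else:
        print("mem_sleep not available on this system")
```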
