Navi Reset Kernel Patch

skois · February 8, 2020, 10:05pm

Looks like good news!
https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.6-Navi-GPU-Reset

gnif · February 9, 2020, 1:16am

Unfortunately not. This is reset and recovery for the AMDGPU driver, not VFIO. AMD are refusing to implement a standard GPU reset, or to implement it as a PCI reset quirk independent of the AMDGPU driver.

Until you see a patch that affects quirks.c these “improvements” are useless to us.

skois · February 11, 2020, 6:09am

Ahh. Cr@p! And here I was thinking that AMD had woken up and was trying to win some of the GPU market too…

surxenberg · February 12, 2020, 10:50am

Quick question, are you working on another update of the Navi10 reset bug patch?

gnif · February 12, 2020, 11:07am

Always. However, progress is painstakingly slow due to the lack of information on technical details covered by NDAs.

RipClaw · February 14, 2020, 6:03pm

Couple of questions …

Why do they refuse to implement ‘standard GPU reset’?
a) Is it a hardware limitation, or
b) is the hardware OK, but the software has architectural issues and it is not economical to fix them?
How well do the workstation and server cards (the MxGPU ones) behave in this regard? Got any idea?

gnif · February 14, 2020, 11:36pm

Ask AMD.

Just as bad, unless you’re splitting an SR-IOV card into functional devices, which can be reset, just not the root device.

nicoboss · February 15, 2020, 11:31pm

I can confirm that this patch is working perfectly fine on my Zenith Extreme II + Threadripper 3970X + RX 5700 XT Proxmox VE setup with Windows Server 2019 as guest. To get it working I just cloned the latest Proxmox kernel and applied the latest Navi patch, and everything just worked! I’m using this setup as my main PC, so getting that working was very important for me. Thank you so much for all your effort.

Querzion · February 16, 2020, 4:32pm

So the 5700 XT is fixed? Is the 5600 XT fixed too? I have been going back and forth on whether to buy an RTX 2060 or a 5600 XT, and it’s a difference of at least 50 euros. The 5600 XT would be the better choice for me economically, but then there’s the reset bug and the vBIOS update, which make me a bit nervous, though the RTX 2060 doesn’t really leave me with any fewer doubts.

wendell · February 16, 2020, 5:46pm

It’s fixed in some limited scenarios. It can work, but unfortunately it is still a headache.

Querzion · February 16, 2020, 7:56pm

Okay. So is there a timetable? If I get a 5600 XT in, say, about 2 months, do you think it will have been resolved by then? ONE BIG question though: are the Navi cards backwards compatible with PCIe 3.0? I’m still on an old AX370-Gaming K5 motherboard.

You say “in some limited scenarios”. So specific hardware and distros, then? Is there a specific thread that explains what the reset bug actually does? I’m just going by association with the word “reset” at the moment, but what I have heard so far is that it doesn’t fully empty the GPU memory on reboot, so the next VM start crashes for lack of memory and you have to reboot the host system to resolve it. That sucks.

Well, thanks anyway.

gnif · February 18, 2020, 3:37am

Nope, there is no timetable. We are at the mercy of AMD’s Legal department at this point.

Yes, all PCIe 4.0 devices MUST be backwards compatible as part of the PCIe 4.0 specification, or they cannot claim to be PCIe devices.

It’s simple, the GPU won’t reset until you physically restart the entire PC. The net result varies, either you can’t use the GPU for passthrough at all as the BIOS posts the GPU before you can give it to the VM. Or you can post the VM once, but restarting the VM requires a GPU reset and you’re stuck again until you reset the entire PC.
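
For readers who want to check what their own kernel offers for a passthrough device, here is a minimal sketch in C. It assumes the standard sysfs layout, where a PCI device directory contains a `reset` file when the kernel knows at least one reset method for that device. The function name `pci_has_reset_attr` and the example address `0000:0a:00.0` are made up for illustration. Note that the file being present does not prove the reset actually works on Navi; it only shows that the kernel is willing to try one.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Report whether <sysfs_root>/bus/pci/devices/<bdf>/reset exists.
 * Returns 1 if the file is present, 0 if it is absent, and -1 if the
 * resulting path would not fit in the local buffer.
 *
 * sysfs_root is a parameter (normally "/sys") so the function can be
 * exercised against a fake directory tree. Hypothetical helper, not a
 * kernel or libpci API.
 */
int pci_has_reset_attr(const char *sysfs_root, const char *bdf)
{
    char path[256];
    int n;

    assert(sysfs_root != NULL && bdf != NULL);

    n = snprintf(path, sizeof(path), "%s/bus/pci/devices/%s/reset",
                 sysfs_root, bdf);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1;

    /* F_OK only tests for existence, not for write permission. */
    return access(path, F_OK) == 0 ? 1 : 0;
}
```

Calling `pci_has_reset_attr("/sys", "0000:0a:00.0")` with your GPU's real address (as shown by `lspci -D`) tells you whether a reset entry is exposed at all. A result of 1 on an affected card is consistent with the problem described above: the kernel attempts a reset, but the GPU does not come back in a usable state.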

mathew2214 · February 18, 2020, 1:43pm

Would it count as a reset if the 8-pin PCIe power connectors were physically removed and reinserted (while the physical machine is still running)?
Is there any risk of physical damage in doing this? I don’t want to risk destroying any hardware.
I could create a software-controlled device that breaks and reconnects that power circuit. I’m just asking if there’s any hope of that working, so that I don’t waste time on a fruitless effort.

BomB191 · February 18, 2020, 1:53pm

Has anyone managed to slam this into unraid?

I’m not finding the most useful information on walking me through compiling a system

jvolkman · February 18, 2020, 5:00pm

In my experience a simple sleep/wake also works: sudo rtcwake -m mem --seconds 10

gnif · February 18, 2020, 5:08pm

No, do NOT do this, you can destroy your GPU, or your motherboard, or both. Even if you could fully turn off the GPU with a switch it still would be of no use, the reset has to be negotiated with the PCIe bus controller so it can re-scan the device after the reset. Just cutting power would cause the PCIe bus controller to have a fit and flag the port as dead, or just hang the entire system.

Yes, people have been doing this for some time, as it’s effectively the same thing as resetting the PC. Note it’s not a simple sleep, it’s full hibernation, which can cause all sorts of issues for various devices under Linux (i.e., many laptops running Linux fail to wake up from hibernation correctly).

Also, if you, like me, run multiple VMs on the system, a sleep/wake event could never be a viable option, as it will affect all VMs and interrupt services.

jvolkman · February 19, 2020, 5:46am

Seems like this amdgpu patch from a few months ago added some logic to the Navi BACO reset that’s not captured in the quirks patch. @gnif do you know if either the mmBIF_DOORBELL_INT_CNTL or mmTHM_BACO_CNTL manipulation would be relevant?

gnif · February 19, 2020, 6:47am

No, I had missed that one. I will get it ported over tonight if I can and see how it goes.

lightfoot256 · February 19, 2020, 3:07pm

I have a B450-F Gaming board with an AMD 3600 and am trying to get an MSI RX 5700 to pass through. I have literally just tried the v2 patch on the 5.3.18 kernel, to no avail. I can confirm the “BACO reset” in the Proxmox logs, but the guest fails to boot until I restart the host (I didn’t know about the rtcwake hack, which does work). I haven’t tried the original patch yet or reviewed the changes posted a few hours ago.

Compiling and running takes about 40 minutes, so I’m happy to try any experiments. Any advice on a workflow for developing the patch? Is there any way to save part of the build process? It seems wasteful to rebuild the lot from scratch.

I tried the setup in ESXi first; and tried to force the d3d0 reset fix via “passthru.map” also to no avail.

Switched to AMD to get away from Nvidia’s annoying Code 43 issue and feel like I’ve jumped out of the frying pan into the fire.

jvolkman · February 19, 2020, 3:33pm

Cool.

Another thing I noticed is that amdgpu’s smu_send_smc_msg is really just smu_send_smc_msg_with_param with a parameter value 0. So to be equivalent to amdgpu’s BACO exit, the quirks patch would need an additional writel(0x00, mmio + mmMP1_SMN_C2PMSG_82) in the exit block. Maybe this doesn’t matter.
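
To make the point above concrete, here is a small user-space sketch in C of the difference between the two sequences. It uses a mock register file instead of real MMIO. The indices `REG_MSG` and `REG_ARG` are placeholders and NOT the real Navi offsets, `REG_ARG` only stands in for the argument register named above (mmMP1_SMN_C2PMSG_82), and the "argument first, then message" ordering is an assumption about how amdgpu behaves.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder indices into a mock register file (not real hardware offsets). */
enum { REG_MSG = 0, REG_ARG = 1, REG_COUNT = 2 };

/* Stand-in for the kernel's writel(): store a 32-bit value in the mock file. */
void mock_writel(uint32_t value, volatile uint32_t *regs, size_t index)
{
    assert(index < (size_t)REG_COUNT);
    regs[index] = value;
}

/* Parameterised send: write the argument register, then the message register. */
void send_msg_with_param(volatile uint32_t *regs, uint32_t msg, uint32_t param)
{
    mock_writel(param, regs, REG_ARG);
    mock_writel(msg, regs, REG_MSG);
}

/* Parameterless send: the parameterised form with a parameter of 0. */
void send_msg(volatile uint32_t *regs, uint32_t msg)
{
    send_msg_with_param(regs, msg, 0);
}

/*
 * A sequence that only writes the message register, as a quirk would if it
 * omitted the extra write. Whatever was left in the argument register stays.
 */
void send_msg_no_arg_write(volatile uint32_t *regs, uint32_t msg)
{
    mock_writel(msg, regs, REG_MSG);
}
```

If the argument register holds a stale value from an earlier message, `send_msg_no_arg_write` leaves it in place while `send_msg` clears it to 0. Whether the firmware actually reads the argument for the BACO exit message is not something this sketch can answer, so, as the post says, it may not matter.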