Looks like good news!
https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.6-Navi-GPU-Reset
Unfortunately not. This is reset and recovery for the AMDGPU driver, not VFIO. AMD are refusing to implement standard GPU reset, or to implement it as a PCI reset quirk independent of the AMDGPU driver.
Until you see a patch that affects quirks.c, these "improvements" are useless to us.
Ahh. Cr@p! And here I was thinking that AMD woke up and was trying to get some GPU market also…
Quick question, are you working on another update of the Navi10 reset bug patch?
Always; however, progress is painstakingly slow due to a lack of information on technical details covered by NDAs.
Couple of questions…
- Why do they refuse to implement "standard GPU reset"?
a) Is it a hardware limitation, or
b) Hardware is OK, but software has architectural issues & it is not economical to fix it?
- How well do the workstation & server cards (MxGPU ones) behave in this regard? Got any idea?
Ask AMD.
Just as bad unless you're splitting an SR-IOV card into functional devices, which can be reset, just not the root device.
I can confirm that this patch is working perfectly fine on my Zenith Extreme II + Threadripper 3970X + RX 5700 XT Proxmox VE setup with Windows Server 2019 as guest. To get it working I just cloned the latest Proxmox kernel, applied the latest Navi patch, and everything just worked! I'm using this setup as my main PC, so getting that working was very important for me. Thank you so much for all your effort.
So the 5700 XT is fixed? Is the 5600 XT fixed too? I have been going back and forth on whether to buy an RTX 2060 or a 5600 XT, and it's a difference of at least 50 euros. So the 5600 XT would be the better choice for me economically, but then there's the reset bug and the vBIOS update, which make me a bit nervous, and the RTX 2060 doesn't leave me any less unsure.
It's fixed in some limited scenarios. It can work but is still unfortunately a headache.
Okay. So what about a timetable? If I get a 5600 XT in about, say, 2 months, do you think it will have been resolved by then? ONE BIG question though: are the Navi cards backwards compatible with PCIe 3.0? I'm still on an old AX370-Gaming K5 motherboard.
You say "in some limited…" So specific hardware and distros then? Is there a specific thread that explains what the reset bug does specifically? Because I'm just going by association with the word "reset" at the moment, but what I have heard so far is that it doesn't fully empty the GPU memory on reboot, making it crash for lack of memory on the next VM start, so you have to reboot the host system in order to resolve it. That sucks.
Well, thanks anyway.
Nope, there is no timetable. We are at the mercy of AMD's legal department at this point.
Yes, all PCIe 4.0 devices MUST be backwards compatible as part of the PCIe 4.0 specification or they cannot claim to be a PCIe device.
It's simple: the GPU won't reset until you physically restart the entire PC. The net result varies: either you can't use the GPU for passthrough at all because the BIOS posts the GPU before you can give it to the VM, or you can post the VM once, but restarting the VM requires a GPU reset and you're stuck again until you reset the entire PC.
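For anyone wanting to see what their own kernel thinks about a card: the kernel only exposes a `reset` file in a PCI device's sysfs directory when it believes it has some way of resetting that function. Below is a minimal Python sketch that checks for that file. The device address is a made-up example, and the `sysfs_root` parameter is only there so the function can be pointed at a test directory. Note that the file being present does not mean the reset actually recovers the GPU, which is exactly the problem described above.

```python
from pathlib import Path


def has_reset_file(bdf: str, sysfs_root: str = "/sys/bus/pci/devices") -> bool:
    """Return True if sysfs exposes a 'reset' file for the given PCI address.

    bdf is the domain:bus:device.function address, e.g. "0000:0b:00.0".
    """
    return (Path(sysfs_root) / bdf / "reset").exists()


if __name__ == "__main__":
    # Hypothetical address; find the real one for your GPU with `lspci -D`.
    gpu = "0000:0b:00.0"
    print(f"{gpu}: reset file present = {has_reset_file(gpu)}")
```

If this prints `False`, the kernel knows of no reset method for the device at all; if it prints `True`, the kernel will attempt one, but whether the card comes back in a usable state is a separate question.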
Would it count as a reset if the 8-pin PCIe power connectors were physically removed and reinserted (while the physical machine is still running)?
Is there any risk of physical damage in doing this? I don't want to risk destroying any hardware.
I could create a software-controlled device that breaks and reconnects that power circuit. I'm just asking if there's any hope of that working, so that I don't waste time on a fruitless effort.
Has anyone managed to slam this into Unraid?
I'm not finding much useful information to walk me through compiling a system.
In my experience a simple sleep/wake also works: sudo rtcwake -m mem --seconds 10
No, do NOT do this; you can destroy your GPU, or your motherboard, or both. Even if you could fully turn off the GPU with a switch, it still would be of no use: the reset has to be negotiated with the PCIe bus controller so it can re-scan the device after the reset. Just cutting power would cause the PCIe bus controller to have a fit and flag the port as dead, or just hang the entire system.
Yes, people have been doing this for some time as it's effectively the same thing as resetting the PC. Note it's not a simple sleep, it's full hibernation, which can cause all sorts of issues for various devices under Linux (e.g., many laptops running Linux fail to wake up from hibernation correctly).
Also, if you, like me, run multiple VMs on the system, a sleep/wake event could never be a viable option, as it will affect all VMs and interrupt services.
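If you do want to script the rtcwake workaround, something along these lines would do it. This is only a sketch wrapping the exact command quoted earlier in the thread; the function name and the `--run` switch are my own inventions. It defaults to a dry run because, as noted, actually running it suspends the whole host and every VM on it.

```python
import subprocess
import sys
from typing import List


def build_rtcwake_cmd(seconds: int = 10) -> List[str]:
    """Build the suspend-then-wake command quoted in the thread."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return ["sudo", "rtcwake", "-m", "mem", "--seconds", str(seconds)]


if __name__ == "__main__":
    cmd = build_rtcwake_cmd(10)
    if "--run" in sys.argv[1:]:
        # Suspends the entire host; all running VMs are interrupted.
        subprocess.run(cmd, check=True)
    else:
        print("dry run:", " ".join(cmd))
```

Shut down or pause the other guests first if you go this route, since none of them will keep running across the suspend.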
Seems like this amdgpu patch from a few months ago added some logic to the Navi BACO reset that's not captured in the quirks patch. @gnif do you know if either the mmBIF_DOORBELL_INT_CNTL or mmTHM_BACO_CNTL manipulation would be relevant?
No, I had missed that one. I will get it ported over tonight if I can and see how it goes.
I have a B450-F Gaming board with an AMD 3600 and am trying to get an MSI RX 5700 to pass through. I have literally just tried the v2 patch on the 5.3.18 kernel to no avail; I can confirm the "BACO reset" in the Proxmox logs, but the guest fails to boot until I restart the host (I didn't know about the rtcwake hack, which does work). I haven't tried the original patch yet or reviewed the changes posted a few hours ago.
Compiling and running takes about 40 minutes, so I'm happy to try any experiments. Any advice on workflow for developing the patch? Is there any way to save part of the build process? It seems wasteful to rebuild the lot from scratch.
I tried the setup in ESXi first, and tried to force the d3d0 reset fix via "passthru.map", also to no avail.
Switched to AMD to get away from Nvidia's annoying Code 43 issue and feel like I've jumped out of the frying pan into the fire.
Cool.
Another thing I noticed is that amdgpu's smu_send_smc_msg is really just smu_send_smc_msg_with_param with a parameter value 0. So to be equivalent to amdgpu's BACO exit, the quirks patch would need an additional writel(0x00, mmio + mmMP1_SMN_C2PMSG_82) in the exit block. Maybe this doesn't matter.
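To make the difference concrete, here is a toy Python model of the two write sequences. It is not driver code: the registers are just dictionary entries, `mmMP1_SMN_C2PMSG_82` is the only name taken from the post, and the message register name and message ID are placeholders I made up. It only shows that sending a message without the extra write leaves whatever parameter a previous message put in the register.

```python
# Only PARAM_REG comes from the post; the other two values are placeholders.
PARAM_REG = "mmMP1_SMN_C2PMSG_82"
MSG_REG = "MSG_REG_PLACEHOLDER"  # hypothetical message register
EXIT_BACO_MSG = 0x19             # hypothetical message ID


def writel(regs, value, reg):
    """Stand-in for the kernel's writel(): store value in a fake register."""
    regs[reg] = value


def send_msg_with_param(regs, msg, param):
    """Write the parameter first, then the message."""
    writel(regs, param, PARAM_REG)
    writel(regs, msg, MSG_REG)


def send_msg(regs, msg):
    """Per the post: sending a plain message is a send with parameter 0."""
    send_msg_with_param(regs, msg, 0)


def send_msg_without_param_write(regs, msg):
    """A send that skips the parameter register entirely."""
    writel(regs, msg, MSG_REG)


if __name__ == "__main__":
    # Pretend an earlier message left 0x1 in the parameter register.
    stale = {PARAM_REG: 0x1, MSG_REG: 0}
    send_msg_without_param_write(stale, EXIT_BACO_MSG)
    print("without param write:", hex(stale[PARAM_REG]))

    clean = {PARAM_REG: 0x1, MSG_REG: 0}
    send_msg(clean, EXIT_BACO_MSG)
    print("with param write:   ", hex(clean[PARAM_REG]))
```

Whether the firmware actually looks at the parameter for the exit message is the open question; the model only shows that the two sequences are not byte-for-byte equivalent.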