Navi Reset Kernel Patch

It’s fixed in some limited scenarios. It can work, but it’s still unfortunately a headache.


Okay. So is there a timetable? If I get a 5600 XT in, say, about two months, do you think it will have been resolved by then? One big question though: are Navi cards backwards compatible with PCIe 3.0? I’m still on an old AX370-Gaming K5 motherboard.

You say in some limited scenarios… so specific hardware and distros, then? Is there a specific thread that explains what the reset bug actually does? At the moment I’m just going by association with the word “reset”; what I’ve heard so far is that the GPU memory isn’t fully cleared on reboot, so the next VM start crashes from lack of memory and you have to reboot the host to recover. That sucks.

Well, thanks anyway.

Nope, there is no timetable. We are at the mercy of AMD’s Legal department at this point.

Yes, all PCIe 4.0 devices MUST be backwards compatible as part of the PCIe 4.0 specification, or they cannot claim to be a PCIe device.

It’s simple: the GPU won’t reset until you physically restart the entire PC. The net result varies. Either you can’t use the GPU for passthrough at all, because the BIOS POSTs the GPU before you can give it to the VM; or you can boot the VM once, but restarting the VM requires a GPU reset and you’re stuck again until you restart the entire PC.


Would it count as a reset if the 8-pin PCIe power connectors were physically removed and reinserted (while the machine is still running)?
Is there any risk of physical damage in doing this? I don’t want to risk destroying any hardware.
I could build a software-controlled device that breaks and reconnects that power circuit. I’m just asking if there’s any hope of that working, so I don’t waste time on a fruitless effort.

Has anyone managed to slam this into Unraid?

I’m not finding much useful information to walk me through compiling a custom kernel.

In my experience, a simple sleep/wake also works: sudo rtcwake -m mem --seconds 10

No, do NOT do this; you can destroy your GPU, your motherboard, or both. Even if you could fully power off the GPU with a switch, it would still be of no use: the reset has to be negotiated with the PCIe bus controller so it can re-scan the device after the reset. Just cutting power would cause the PCIe bus controller to have a fit and flag the port as dead, or simply hang the entire system.

Yes, people have been doing this for some time, as it’s effectively the same thing as resetting the PC. Note it’s not a simple sleep: it’s a full suspend/resume cycle, which can cause all sorts of issues for various devices under Linux (e.g., many laptops running Linux fail to wake from suspend correctly).

Also, if you run multiple VMs on the system like I do, a sleep/wake event could never be a viable option, as it affects every VM and interrupts services.


Seems like this amdgpu patch from a few months ago added some logic to the Navi BACO reset that’s not captured in the quirks patch. @gnif, do you know if either the mmBIF_DOORBELL_INT_CNTL or mmTHM_BACO_CNTL manipulation would be relevant?
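
For context, here is roughly what that manipulation would look like translated into the raw-MMIO style the quirks patch uses. This is a sketch only: the register offsets and bit positions below are placeholders, not the real values from the amdgpu register headers.

/* Hedged sketch of the extra BACO-entry steps from the amdgpu patch,
 * rewritten in the quirks patch's raw-MMIO style.  Offsets and bit
 * positions are placeholders, not real register-header values. */
#define mmBIF_DOORBELL_INT_CNTL 0x00fe /* placeholder offset */
#define mmTHM_BACO_CNTL         0x0081 /* placeholder offset */

static void navi10_baco_enter_extras(void __iomem *mmio)
{
        u32 data;

        /* mask doorbell interrupts before putting the SOC into BACO */
        data = readl(mmio + mmBIF_DOORBELL_INT_CNTL);
        data |= BIT(30); /* DOORBELL_INTERRUPT_DISABLE (assumed bit) */
        writel(data, mmio + mmBIF_DOORBELL_INT_CNTL);

        /* set the BACO enable bit in the thermal block before the SMC message */
        data = readl(mmio + mmTHM_BACO_CNTL);
        data |= BIT(31); /* BACO_SOC_EN (assumed bit) */
        writel(data, mmio + mmTHM_BACO_CNTL);
}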


No, I had missed that one. I will get it ported over tonight if I can and see how it goes.

I have a B450-F Gaming board with an AMD Ryzen 3600 and am trying to pass through an MSI RX 5700. I’ve literally just tried the v2 patch on the 5.3.18 kernel, to no avail. I can confirm the “BACO reset” in the Proxmox logs, but the guest fails to boot until I restart the host (I didn’t know about the rtcwake hack, which does work). I haven’t tried the original patch yet, or reviewed the changes posted just a few hours ago.

Compiling and running takes about 40 minutes, so I’m happy to try any experiments. Any advice on a workflow for developing the patch? Is there any way to reuse part of the build? It seems wasteful rebuilding the lot from scratch.

I tried the setup in ESXi first, and also tried to force the d3d0 reset fix via “passthru.map”, again to no avail.

I switched to AMD to get away from Nvidia’s annoying Code 43 issue, and I feel like I’ve jumped out of the frying pan into the fire.

Cool.

Another thing I noticed is that amdgpu’s smu_send_smc_msg is really just smu_send_smc_msg_with_param with a parameter value of 0. So to be equivalent to amdgpu’s BACO exit, the quirks patch would need an additional writel(0x00, mmio + mmMP1_SMN_C2PMSG_82) in the exit block. Maybe this doesn’t matter.
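
A minimal sketch of what that exit block would then look like, assuming the same C2PMSG register names the patch already uses (amdgpu’s smu_v11_0 code treats C2PMSG_66 as the message register, C2PMSG_82 as the argument register, and C2PMSG_90 as the response register):

/* Sketch only: mimic smu_send_smc_msg(), i.e. _with_param(..., 0),
 * by clearing the argument register before issuing BACO exit. */
writel(0x00, mmio + mmMP1_SMN_C2PMSG_90); /* clear the response register */
writel(0x00, mmio + mmMP1_SMN_C2PMSG_82); /* message argument = 0 */
writel(PPSMC_MSG_ExitBaco, mmio + mmMP1_SMN_C2PMSG_66); /* issue the BACO exit message */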


Yes, but my contacts have confirmed that this parameter is ignored and can be omitted for the other SMC commands.


I added some extra logging and lengthened the delays to see what route it was taking, and it looks like it only does the BACO reset once… and then never tries again, since the sign-of-life indicator is then 0x0 and it just jumps past the reset each time…

When I power the guest off and on I can hear the card make a huff noise; I assume the fans are momentarily stopping and starting…

I didn’t print the first value, but I assume it’s >0 the first time, and not -1, since I don’t see the warning.

I’ll try to do more investigation later/tomorrow.


Did you add the video card’s ROM to your VM? I have to do that no matter what, even if I don’t pass the boot GPU through.

Got the ROM from GPU-Z and passed it through – it makes no difference. It works the first time after a host boot/wake and then never again. If I comment out the condition that jumps to the out label, I then get SMU error 0x0 (line 4029) for each of the waits. I guess the SMU isn’t ready :D.

A couple of things I noticed:

  1. The final exit doesn’t write 0x0 to mmMP1_SMN_C2PMSG_82 – but looking at other examples of this process elsewhere in the kernel, the send-without-wait doesn’t either.
  2. The pci_save_state happens inside the block that gets jumped over – so when SOL == 0x0, the pci_restore_state happens without a prior save. Does it matter? (See the flow sketch below.)
  3. What is the difference between SOC and SOL? They look like the same thing.

I’m still a bit confused about how this is working…

  1. Map the MMIO region so we can send SMU commands
  2. Set the PCI power state to D0 (full power, i.e. out of any low-power state)
  3. Wait until the SMU will accept messages
  4. If the sign-of-life (SOL) register is non-zero:
     - Save the PCI state
     - Send ArmD3 and wait
     - Send BACO enter and wait for a response
     - Long delay
     - Send BACO exit and wait for a response
     - Wait for the SOC register to become valid
  5. Restore the PCI state
In my case, it appears the subsequent VM starts don’t have the GPU in the required state to even reset it, and trying to reset it anyway results in SMU errors, since it’s not responding. I’ve added some more logging and am just waiting for a rebuild; I assume the guest itself has a lot to answer for, depending on what state it leaves the GPU in when it shuts down…

The log from the various steps:

# first boot of guest

Feb 20 11:48:17 amdtest kernel: vfio-pci 0000:0b:00.0: enabling device (0400 -> 0403)
Feb 20 11:48:17 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: pci_set_power_state
Feb 20 11:48:17 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: Waiting for the SOC to be ready.
Feb 20 11:48:17 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: SOC ready.
Feb 20 11:48:17 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: sol register = 0x0
Feb 20 11:48:17 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: SOL true; going to out
Feb 20 11:48:17 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: In out.
Feb 20 11:48:17 amdtest kernel: vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0x19@0x270
Feb 20 11:48:17 amdtest kernel: vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0x1b@0x2d0
Feb 20 11:48:17 amdtest kernel: vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0x25@0x400
Feb 20 11:48:17 amdtest kernel: vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0x26@0x410
Feb 20 11:48:17 amdtest kernel: vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0x27@0x440
Feb 20 11:48:17 amdtest kernel: vfio-pci 0000:0b:00.1: enabling device (0000 -> 0002)
Feb 20 11:48:18 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: pci_set_power_state
Feb 20 11:48:18 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: Waiting for the SOC to be ready.
Feb 20 11:48:18 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: SOC ready.
Feb 20 11:48:18 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: sol register = 0x0
Feb 20 11:48:18 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: SOL true; going to out
Feb 20 11:48:18 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: In out.

# stopping the guest:

Feb 20 11:49:36 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: pci_set_power_state
Feb 20 11:49:36 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: Waiting for the SOC to be ready.
Feb 20 11:49:36 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: SOC ready.
Feb 20 11:49:36 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: sol register = 0x7e1c805
Feb 20 11:49:36 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: performing BACO reset
Feb 20 11:49:36 amdtest pvedaemon[1618]: starting vnc proxy UPID:amdtest:00000652:000034B9:5E4E7250:vncproxy:100:root@pam:
Feb 20 11:49:36 amdtest pvedaemon[1142]: <root@pam> starting task UPID:amdtest:00000652:000034B9:5E4E7250:vncproxy:100:root@pam:
Feb 20 11:49:36 amdtest pvedaemon[1620]: starting vnc proxy UPID:amdtest:00000654:000034BA:5E4E7250:vncproxy:100:root@pam:
Feb 20 11:49:36 amdtest pvedaemon[1141]: <root@pam> starting task UPID:amdtest:00000654:000034BA:5E4E7250:vncproxy:100:root@pam:
Feb 20 11:49:36 amdtest qmeventd[793]:  OK
Feb 20 11:49:36 amdtest qmeventd[793]: Finished cleanup for 100
Feb 20 11:49:36 amdtest pvedaemon[1141]: <root@pam> end task UPID:amdtest:0000063D:0000340D:5E4E724E:qmstop:100:root@pam: OK
Feb 20 11:49:37 amdtest qm[1622]: VM 100 qmp command failed - VM 100 not running
Feb 20 11:49:37 amdtest pvedaemon[1618]: Failed to run vncproxy.
Feb 20 11:49:37 amdtest qm[1623]: VM 100 qmp command failed - VM 100 not running
Feb 20 11:49:37 amdtest pvedaemon[1142]: <root@pam> end task UPID:amdtest:00000652:000034B9:5E4E7250:vncproxy:100:root@pam: Failed to run vncproxy.
Feb 20 11:49:37 amdtest pvedaemon[1620]: Failed to run vncproxy.
Feb 20 11:49:37 amdtest pvedaemon[1141]: <root@pam> end task UPID:amdtest:00000654:000034BA:5E4E7250:vncproxy:100:root@pam: Failed to run vncproxy.
Feb 20 11:49:40 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: Waiting for the SOC register to become valid.
(previous line repeated 27 more times)
Feb 20 11:49:40 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: SOC Valid.
Feb 20 11:49:40 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: In out.
Feb 20 11:49:40 amdtest systemd[1]: 100.scope: Succeeded.

# and then starting the guest:

Feb 20 11:50:00 amdtest systemd[1]: Starting Proxmox VE replication runner...
Feb 20 11:50:01 amdtest systemd[1]: pvesr.service: Succeeded.
Feb 20 11:50:01 amdtest systemd[1]: Started Proxmox VE replication runner.
Feb 20 11:50:52 amdtest pvedaemon[1788]: starting vnc proxy UPID:amdtest:000006FC:00005288:5E4E729C:vncproxy:100:root@pam:
Feb 20 11:50:52 amdtest pvedaemon[1141]: <root@pam> starting task UPID:amdtest:000006FC:00005288:5E4E729C:vncproxy:100:root@pam:
Feb 20 11:50:53 amdtest qm[1790]: VM 100 qmp command failed - VM 100 not running
Feb 20 11:50:53 amdtest pvedaemon[1788]: Failed to run vncproxy.
Feb 20 11:50:53 amdtest pvedaemon[1141]: <root@pam> end task UPID:amdtest:000006FC:00005288:5E4E729C:vncproxy:100:root@pam: Failed to run vncproxy.
Feb 20 11:50:54 amdtest pvedaemon[1812]: start VM 100: UPID:amdtest:00000714:00005305:5E4E729E:qmstart:100:root@pam:
Feb 20 11:50:54 amdtest pvedaemon[1142]: <root@pam> starting task UPID:amdtest:00000714:00005305:5E4E729E:qmstart:100:root@pam:
Feb 20 11:50:54 amdtest pvedaemon[1812]: no efidisk configured! Using temporary efivars disk.
Feb 20 11:50:54 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: pci_set_power_state
Feb 20 11:50:54 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: Waiting for the SOC to be ready.
(previous line repeated 24 more times)
Feb 20 11:50:54 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: SOC ready.
Feb 20 11:50:54 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: sol register = 0x0
Feb 20 11:50:54 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: SOL true; going to out
Feb 20 11:50:54 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: In out.
Feb 20 11:50:54 amdtest systemd[1]: Started 100.scope.
Feb 20 11:50:54 amdtest systemd-udevd[1823]: Using default interface naming scheme 'v240'.
Feb 20 11:50:54 amdtest systemd-udevd[1823]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Feb 20 11:50:54 amdtest systemd-udevd[1823]: Could not generate persistent MAC address for tap100i0: No such file or directory
Feb 20 11:50:54 amdtest kernel: device tap100i0 entered promiscuous mode
Feb 20 11:50:54 amdtest systemd-udevd[1823]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Feb 20 11:50:54 amdtest systemd-udevd[1823]: Could not generate persistent MAC address for fwbr100i0: No such file or directory
Feb 20 11:50:54 amdtest systemd-udevd[1825]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Feb 20 11:50:54 amdtest systemd-udevd[1825]: Using default interface naming scheme 'v240'.
Feb 20 11:50:54 amdtest systemd-udevd[1825]: Could not generate persistent MAC address for fwpr100p0: No such file or directory
Feb 20 11:50:54 amdtest systemd-udevd[1822]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Feb 20 11:50:54 amdtest systemd-udevd[1822]: Using default interface naming scheme 'v240'.
Feb 20 11:50:54 amdtest systemd-udevd[1822]: Could not generate persistent MAC address for fwln100i0: No such file or directory
Feb 20 11:50:54 amdtest kernel: fwbr100i0: port 1(fwln100i0) entered blocking state
Feb 20 11:50:54 amdtest kernel: fwbr100i0: port 1(fwln100i0) entered disabled state
Feb 20 11:50:54 amdtest kernel: device fwln100i0 entered promiscuous mode
Feb 20 11:50:54 amdtest kernel: fwbr100i0: port 1(fwln100i0) entered blocking state
Feb 20 11:50:54 amdtest kernel: fwbr100i0: port 1(fwln100i0) entered forwarding state
Feb 20 11:50:54 amdtest kernel: vmbr0: port 2(fwpr100p0) entered blocking state
Feb 20 11:50:54 amdtest kernel: vmbr0: port 2(fwpr100p0) entered disabled state
Feb 20 11:50:54 amdtest kernel: device fwpr100p0 entered promiscuous mode
Feb 20 11:50:54 amdtest kernel: vmbr0: port 2(fwpr100p0) entered blocking state
Feb 20 11:50:54 amdtest kernel: vmbr0: port 2(fwpr100p0) entered forwarding state
Feb 20 11:50:54 amdtest kernel: fwbr100i0: port 2(tap100i0) entered blocking state
Feb 20 11:50:54 amdtest kernel: fwbr100i0: port 2(tap100i0) entered disabled state
Feb 20 11:50:54 amdtest kernel: fwbr100i0: port 2(tap100i0) entered blocking state
Feb 20 11:50:54 amdtest kernel: fwbr100i0: port 2(tap100i0) entered forwarding state
Feb 20 11:50:54 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:50:55 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:50:55 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:50:55 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:50:55 amdtest kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x42c9a4b30]
Feb 20 11:50:56 amdtest kernel: vfio-pci 0000:0b:00.0: enabling device (0400 -> 0403)
Feb 20 11:50:56 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: pci_set_power_state
Feb 20 11:50:56 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: Waiting for the SOC to be ready.
Feb 20 11:50:56 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: SOC ready.
Feb 20 11:50:56 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: sol register = 0x0
Feb 20 11:50:56 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: SOL true; going to out
Feb 20 11:50:56 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: In out.
Feb 20 11:50:56 amdtest kernel: vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0x19@0x270
Feb 20 11:50:56 amdtest kernel: vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0x1b@0x2d0
Feb 20 11:50:56 amdtest kernel: vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0x25@0x400
Feb 20 11:50:56 amdtest kernel: vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0x26@0x410
Feb 20 11:50:56 amdtest kernel: vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0x27@0x440
Feb 20 11:50:56 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:50:56 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:50:56 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:50:56 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:50:56 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:50:56 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:50:56 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:50:56 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:50:56 amdtest kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x42c9a4b70]
Feb 20 11:50:56 amdtest kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x42c9a4b90]
Feb 20 11:50:57 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:50:57 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:50:57 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:50:57 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:50:57 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:50:57 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:50:57 amdtest kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x42c9a4bc0]
Feb 20 11:50:57 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:50:58 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:50:58 amdtest pvedaemon[1889]: starting vnc proxy UPID:amdtest:00000761:0000548C:5E4E72A2:vncproxy:100:root@pam:
Feb 20 11:50:58 amdtest pvedaemon[1141]: <root@pam> starting task UPID:amdtest:00000761:0000548C:5E4E72A2:vncproxy:100:root@pam:
Feb 20 11:50:58 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:50:58 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:50:58 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:50:58 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:50:58 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: pci_set_power_state
Feb 20 11:50:58 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: Waiting for the SOC to be ready.
Feb 20 11:50:58 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: SOC ready.
Feb 20 11:50:58 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: sol register = 0x0
Feb 20 11:50:58 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: SOL true; going to out
Feb 20 11:50:58 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: In out.
Feb 20 11:50:58 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:50:58 amdtest kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x42c9a4c10]
Feb 20 11:50:58 amdtest kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x42c9a4c30]
Feb 20 11:50:58 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:50:59 amdtest pvedaemon[1142]: <root@pam> end task UPID:amdtest:00000714:00005305:5E4E729E:qmstart:100:root@pam: OK
Feb 20 11:50:59 amdtest pvedaemon[1142]: <root@pam> starting task UPID:amdtest:00000767:000054FB:5E4E72A3:vncproxy:100:root@pam:
Feb 20 11:50:59 amdtest pvedaemon[1895]: starting vnc proxy UPID:amdtest:00000767:000054FB:5E4E72A3:vncproxy:100:root@pam:
Feb 20 11:50:59 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:50:59 amdtest kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x42c9a4db0]
Feb 20 11:50:59 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:51:00 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:51:00 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:51:00 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:51:00 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:51:00 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:51:00 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:51:00 amdtest kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x42c9a4e40]
Feb 20 11:51:00 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:51:00 amdtest systemd[1]: Starting Proxmox VE replication runner...
Feb 20 11:51:01 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:51:01 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:51:01 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:51:01 amdtest systemd[1]: pvesr.service: Succeeded.
Feb 20 11:51:01 amdtest systemd[1]: Started Proxmox VE replication runner.
Feb 20 11:51:01 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:51:01 amdtest kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x42c9a4ff0]
Feb 20 11:51:02 amdtest kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x42c9a5020]
Feb 20 11:51:03 amdtest kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x42c9a5050]
Feb 20 11:51:04 amdtest kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x42c9a5080]
Feb 20 11:51:05 amdtest kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x42c9a50b0]
Feb 20 11:51:06 amdtest kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x42c9a50e0]
Feb 20 11:51:07 amdtest kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x42c9a5110]
Feb 20 11:51:08 amdtest kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x42c9a5140]
Feb 20 11:51:09 amdtest pvedaemon[1141]: <root@pam> end task UPID:amdtest:00000761:0000548C:5E4E72A2:vncproxy:100:root@pam: OK
Feb 20 11:51:09 amdtest kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x42c9a5170]
Feb 20 11:51:10 amdtest kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x42c9a51a0]
Feb 20 11:51:11 amdtest kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x42c9a51d0]
Feb 20 11:51:12 amdtest kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x42c9a5200]
Feb 20 11:51:13 amdtest kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x42c9a5230]
Feb 20 11:51:14 amdtest kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x42c9a5260]
Feb 20 11:51:15 amdtest kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x42c9a5290]
Feb 20 11:51:16 amdtest kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x42c9a52c0]

# stopping the guest, sleeping the host, and starting again:

Feb 20 11:56:38 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: pci_set_power_state
Feb 20 11:56:38 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: Waiting for the SOC to be ready.
Feb 20 11:56:38 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: SOC ready.
Feb 20 11:56:38 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: sol register = 0x0
Feb 20 11:56:38 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: SOL true; going to out
Feb 20 11:56:38 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: In out.

I just tried an older version of the driver (from 12/2/2019), which it sounded like some people had working – but that doesn’t work either. So many variables, so little time to test every permutation. Meh.


Oh well, FWIW I have the same problem on macOS. The patch currently appears to do its job under Windows only.

  1. For me, changing this made no difference.
  2. Looking at the kernel source, pci_save_state and pci_restore_state do nothing here anyway… (see below)
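
For reference, the restore path bails out when nothing was saved first; paraphrased from drivers/pci/pci.c (the body here is abbreviated, not the full function):

/* Paraphrased: pci_restore_state() is a no-op unless a matching
 * pci_save_state() ran first. */
void pci_restore_state(struct pci_dev *dev)
{
        if (!dev->state_saved)
                return;
        /* ... otherwise restore PCIe capability state, config space, etc. ... */
}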

I noticed the amdgpu driver doesn’t do the D0 power-state change, so I commented that out to see if it made any difference on my 5700; nope :D.

Looking at the recent amdgpu patches, I noticed there’s a way to do the reset via SMU messages that look different from the ones the patch uses. I guess SMU isn’t the same as PPSMC… I don’t know enough about these things to tell what they are; I’m just comparing what the amdgpu driver does against the patch.

Looking at smu_v11_0_baco_set_state in the kernel, they’re using SMU_MSG_EnterBaco.

Searching GitHub for SMU_MSG_EnterBaco reveals lots of enum definitions with similar messages but different values:

SMU_MSG_EnterBaco = 0x1E vs PPSMC_MSG_EnterBaco = 0x18
SMU_MSG_ExitBaco = 0x4F vs PPSMC_MSG_ExitBaco = 0x19
SMU_MSG_ArmD3 = 0x59 vs PPSMC_MSG_ArmD3 = 0x46

I’ve tried sending these messages instead, but it just results in SMC errors (0xFE). I then found navi10_message_map, which contains MSG_MAP(EnterBaco, PPSMC_MSG_EnterBaco), and after re-reviewing the send method I can now see the index = smu_msg_get_index(smu, msg); line that fetches the PPSMC index instead… doh… :smiley:
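
For anyone following along, a simplified sketch of how that mapping works (PPSMC values from the comparison above; the enum values and table types here are illustrative and vary between kernel versions):

/* Simplified sketch: the driver translates generic SMU_MSG_* IDs into
 * the ASIC-specific PPSMC_* indices it actually writes to the mailbox.
 * Enum values are illustrative; they differ between kernel versions. */
enum smu_message_type {
        SMU_MSG_EnterBaco,
        SMU_MSG_ExitBaco,
        SMU_MSG_ArmD3,
        SMU_MSG_MAX_COUNT
};

#define PPSMC_MSG_EnterBaco 0x18
#define PPSMC_MSG_ExitBaco  0x19
#define PPSMC_MSG_ArmD3     0x46

#define MSG_MAP(msg, index) [SMU_MSG_##msg] = (index)

static const int navi10_message_map[SMU_MSG_MAX_COUNT] = {
        MSG_MAP(EnterBaco, PPSMC_MSG_EnterBaco),
        MSG_MAP(ExitBaco,  PPSMC_MSG_ExitBaco),
        MSG_MAP(ArmD3,     PPSMC_MSG_ArmD3),
};

/* smu_send_smc_msg() resolves the index through this map first, which
 * is why writing the generic SMU_MSG_* values straight to the mailbox
 * came back with SMC error 0xFE. */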

Everything else looks very similar apart from the doorbell addition. Assuming there are two ways of talking to the SMU, maybe that’s the difference between the 5700 and the XT?

I tried sending 1 as the argument to the EnterBaco message, but again no difference.

I can’t remember if I saw this before, but now the GPU fan stops when the VM isn’t running. I was messing with some of the kernel parameters, so maybe that was involved. It hasn’t fixed the restart issue; just an observation I hadn’t made before…

Update: I’ve returned the 5700. I’m going back to evil Nvidia for the time being :D

I tried adding the mmBIF_DOORBELL_INT_CNTL and mmTHM_BACO_CNTL pieces but they didn’t seem to make a difference. I’ll play around some more as I find time.

Any idea how to get this working in ESXi? I’m running 6.7.

Hello all,

first post on the forum, be gentle :S

I have tried to apply the patch to the kernel, but I am having some problems, as gcc cannot find the ioremap_nocache function. I found an xtenda_ioremap_nocache function but cannot get it to compile into the kernel.

I am using the “mainline” 5.6-rc4 kernel as of 2020-03-01.
I only patched the quirks.c file from drivers/pci.
I used the config from my current kernel: 5.5.1-050501-generic.

I am running an Asus X470-F Gaming motherboard with an AMD Ryzen 2700X, an Asus Strix Vega 64 as the main GPU, and a Sapphire Pulse RX5700X as the pass-through GPU.

I cannot use the Vega for pass-through because the motherboard will only use the first PCI Express slot during boot.
And I cannot swap the cards, as the Vega card is 2.7 slots wide and there is no space for it at the bottom.

The error:

drivers/pci/quirks.c: In function ‘reset_amd_navi10’:
drivers/pci/quirks.c:3930:9: error: implicit declaration of function ‘ioremap_nocache’; did you mean ‘ioremap_cache’? [-Werror=implicit-function-declaration]
 3930 |  mmio = ioremap_nocache(mmio_base, mmio_size);
      |         ^~~~~~~~~~~~~~~
      |         ioremap_cache
drivers/pci/quirks.c:3930:7: warning: assignment to ‘uint32_t *’ {aka ‘unsigned int *’} from ‘int’ makes pointer from integer without a cast [-Wint-conversion]
 3930 |  mmio = ioremap_nocache(mmio_base, mmio_size);
      |       ^
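
For what it’s worth, ioremap_nocache() was removed from mainline during the 5.6 cycle, and on x86 plain ioremap() already returns an uncached mapping, so the likely fix is a one-line change to the patched quirks.c:

/* kernels >= 5.6: ioremap_nocache() is gone; plain ioremap() is the
 * uncached replacement on x86 */
mmio = ioremap(mmio_base, mmio_size);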