Navi Reset Kernel Patch

Yes, but my contacts have confirmed that this parameter is ignored and can be omitted for the other SMC commands.


I added some extra logging and lengthened the delays to see what route it was taking, and it looks like it only does the BACO reset once… and then never tries again, since the sign-of-life indicator is then 0x00 and it just jumps past it each time…

When I power the guest off and on I can hear the card make a huff noise; I assume the fans are momentarily stopping and starting…

I didn’t print the first value, but I assume it’s > 0 the first time, though not -1, since I don’t see the warning.

Will try and do more investigation later/tomorrow.


Did you add the video card’s ROM to your VM? I have to do that no matter what, even if I don’t pass the boot GPU through.

Got the ROM from GPU-Z and passed it through – it makes no difference; it works the first time after host boot/wake and then never again after that. If I comment out the condition that jumps to the out label, I then get SMU error 0x0 (line 4029) errors for each of the waits. I guess it isn’t ready :D.

A couple of things I noticed:

  1. The final exit doesn’t write 0x0 to mmMP1_SMN_C2PMSG_82 – but looking at other examples of this process elsewhere in the kernel, the send-without-wait variant doesn’t either (see the sketch after this list).
  2. pci_save_state happens inside the block that gets jumped over – so in the SOL == 0x0 case, pci_restore_state runs without the state having been saved first. Does that matter?
  3. What is the difference between SOC and SOL? They look like the same thing.
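
To make item 1 concrete, here is a minimal sketch of the two send variants as I understand them from the quirk and from amdgpu’s smu_v11_0 code. The register roles (C2PMSG_66 = message, C2PMSG_82 = argument, C2PMSG_90 = response) are taken from amdgpu; the numeric offsets, loop bound, and helper names are placeholders of mine, not the patch’s exact values:

```c
#include <linux/io.h>
#include <linux/types.h>
#include <linux/delay.h>
#include <linux/errno.h>

/* Dword indices into the mapped MP1 aperture. Register roles per
 * amdgpu's smu_v11_0; the numeric values below are placeholders. */
#define mmMP1_SMN_C2PMSG_66 0x282 /* message  */
#define mmMP1_SMN_C2PMSG_82 0x292 /* argument */
#define mmMP1_SMN_C2PMSG_90 0x29a /* response */

/* Fire-and-forget variant: note it touches neither the argument nor
 * the response register, which is the behaviour item 1 is about. */
static void smu_send_no_wait(u32 __iomem *mmio, u32 msg)
{
	writel(msg, mmio + mmMP1_SMN_C2PMSG_66);
}

/* Send-and-wait variant: clear the response register, write the
 * message, then poll until the SMU answers (0x1 = OK in amdgpu). */
static int smu_send_and_wait(u32 __iomem *mmio, u32 msg)
{
	u32 resp;
	int i;

	writel(0, mmio + mmMP1_SMN_C2PMSG_90);
	writel(msg, mmio + mmMP1_SMN_C2PMSG_66);

	for (i = 0; i < 1000; i++) {
		resp = readl(mmio + mmMP1_SMN_C2PMSG_90);
		if (resp)
			return resp == 0x1 ? 0 : -EIO;
		msleep(1);
	}
	return -ETIMEDOUT;
}
```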

I’m still a bit confused about how this is working… here’s my reading of the flow (rough C sketch after the list):

  1. Map the registers so we can send SMU commands
  2. Set the PCI power state to D0 (D0 is the fully-on state, not a low-power one)
  3. Wait until the SOC is ready to accept messages
  4. If the sign-of-life (SOL) register is non-zero:
     - save the PCI state
     - send ArmD3 and wait
     - send BACO enter and wait for a response
     - long delay
     - send BACO exit and wait for a response
     - wait for the SOC register to become valid again
  5. Restore the PCI state (the out label, reached even when nothing was saved)
In my case, it appears that subsequent VM starts don’t have the GPU in the required state to even reset it… and trying to reset it results in SMU errors, since it’s not responding. I’ve added some more logging and am just waiting for a rebuild; I assume the guest itself has a lot to answer for, depending on what state it leaves the GPU in when it shuts down…

The log from the various steps:

# first boot of guest

Feb 20 11:48:17 amdtest kernel: vfio-pci 0000:0b:00.0: enabling device (0400 -> 0403)
Feb 20 11:48:17 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: pci_set_power_state
Feb 20 11:48:17 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: Waiting for the SOC to be ready.
Feb 20 11:48:17 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: SOC ready.
Feb 20 11:48:17 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: sol register = 0x0
Feb 20 11:48:17 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: SOL true; going to out
Feb 20 11:48:17 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: In out.
Feb 20 11:48:17 amdtest kernel: vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0x19@0x270
Feb 20 11:48:17 amdtest kernel: vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0x1b@0x2d0
Feb 20 11:48:17 amdtest kernel: vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0x25@0x400
Feb 20 11:48:17 amdtest kernel: vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0x26@0x410
Feb 20 11:48:17 amdtest kernel: vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0x27@0x440
Feb 20 11:48:17 amdtest kernel: vfio-pci 0000:0b:00.1: enabling device (0000 -> 0002)
Feb 20 11:48:18 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: pci_set_power_state
Feb 20 11:48:18 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: Waiting for the SOC to be ready.
Feb 20 11:48:18 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: SOC ready.
Feb 20 11:48:18 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: sol register = 0x0
Feb 20 11:48:18 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: SOL true; going to out
Feb 20 11:48:18 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: In out.

# stopping the guest:

Feb 20 11:49:36 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: pci_set_power_state
Feb 20 11:49:36 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: Waiting for the SOC to be ready.
Feb 20 11:49:36 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: SOC ready.
Feb 20 11:49:36 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: sol register = 0x7e1c805
Feb 20 11:49:36 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: performing BACO reset
Feb 20 11:49:36 amdtest pvedaemon[1618]: starting vnc proxy UPID:amdtest:00000652:000034B9:5E4E7250:vncproxy:100:root@pam:
Feb 20 11:49:36 amdtest pvedaemon[1142]: <root@pam> starting task UPID:amdtest:00000652:000034B9:5E4E7250:vncproxy:100:root@pam:
Feb 20 11:49:36 amdtest pvedaemon[1620]: starting vnc proxy UPID:amdtest:00000654:000034BA:5E4E7250:vncproxy:100:root@pam:
Feb 20 11:49:36 amdtest pvedaemon[1141]: <root@pam> starting task UPID:amdtest:00000654:000034BA:5E4E7250:vncproxy:100:root@pam:
Feb 20 11:49:36 amdtest qmeventd[793]:  OK
Feb 20 11:49:36 amdtest qmeventd[793]: Finished cleanup for 100
Feb 20 11:49:36 amdtest pvedaemon[1141]: <root@pam> end task UPID:amdtest:0000063D:0000340D:5E4E724E:qmstop:100:root@pam: OK
Feb 20 11:49:37 amdtest qm[1622]: VM 100 qmp command failed - VM 100 not running
Feb 20 11:49:37 amdtest pvedaemon[1618]: Failed to run vncproxy.
Feb 20 11:49:37 amdtest qm[1623]: VM 100 qmp command failed - VM 100 not running
Feb 20 11:49:37 amdtest pvedaemon[1142]: <root@pam> end task UPID:amdtest:00000652:000034B9:5E4E7250:vncproxy:100:root@pam: Failed to run vncproxy.
Feb 20 11:49:37 amdtest pvedaemon[1620]: Failed to run vncproxy.
Feb 20 11:49:37 amdtest pvedaemon[1141]: <root@pam> end task UPID:amdtest:00000654:000034BA:5E4E7250:vncproxy:100:root@pam: Failed to run vncproxy.
Feb 20 11:49:40 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: Waiting for the SOC register to become valid.
[the line above repeats 28 times in total]
Feb 20 11:49:40 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: SOC Valid.
Feb 20 11:49:40 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: In out.
Feb 20 11:49:40 amdtest systemd[1]: 100.scope: Succeeded.

# and then starting the guest:

Feb 20 11:50:00 amdtest systemd[1]: Starting Proxmox VE replication runner...
Feb 20 11:50:01 amdtest systemd[1]: pvesr.service: Succeeded.
Feb 20 11:50:01 amdtest systemd[1]: Started Proxmox VE replication runner.
Feb 20 11:50:52 amdtest pvedaemon[1788]: starting vnc proxy UPID:amdtest:000006FC:00005288:5E4E729C:vncproxy:100:root@pam:
Feb 20 11:50:52 amdtest pvedaemon[1141]: <root@pam> starting task UPID:amdtest:000006FC:00005288:5E4E729C:vncproxy:100:root@pam:
Feb 20 11:50:53 amdtest qm[1790]: VM 100 qmp command failed - VM 100 not running
Feb 20 11:50:53 amdtest pvedaemon[1788]: Failed to run vncproxy.
Feb 20 11:50:53 amdtest pvedaemon[1141]: <root@pam> end task UPID:amdtest:000006FC:00005288:5E4E729C:vncproxy:100:root@pam: Failed to run vncproxy.
Feb 20 11:50:54 amdtest pvedaemon[1812]: start VM 100: UPID:amdtest:00000714:00005305:5E4E729E:qmstart:100:root@pam:
Feb 20 11:50:54 amdtest pvedaemon[1142]: <root@pam> starting task UPID:amdtest:00000714:00005305:5E4E729E:qmstart:100:root@pam:
Feb 20 11:50:54 amdtest pvedaemon[1812]: no efidisk configured! Using temporary efivars disk.
Feb 20 11:50:54 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: pci_set_power_state
Feb 20 11:50:54 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: Waiting for the SOC to be ready.
[the line above repeats 25 times in total]
Feb 20 11:50:54 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: SOC ready.
Feb 20 11:50:54 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: sol register = 0x0
Feb 20 11:50:54 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: SOL true; going to out
Feb 20 11:50:54 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: In out.
Feb 20 11:50:54 amdtest systemd[1]: Started 100.scope.
Feb 20 11:50:54 amdtest systemd-udevd[1823]: Using default interface naming scheme 'v240'.
Feb 20 11:50:54 amdtest systemd-udevd[1823]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Feb 20 11:50:54 amdtest systemd-udevd[1823]: Could not generate persistent MAC address for tap100i0: No such file or directory
Feb 20 11:50:54 amdtest kernel: device tap100i0 entered promiscuous mode
Feb 20 11:50:54 amdtest systemd-udevd[1823]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Feb 20 11:50:54 amdtest systemd-udevd[1823]: Could not generate persistent MAC address for fwbr100i0: No such file or directory
Feb 20 11:50:54 amdtest systemd-udevd[1825]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Feb 20 11:50:54 amdtest systemd-udevd[1825]: Using default interface naming scheme 'v240'.
Feb 20 11:50:54 amdtest systemd-udevd[1825]: Could not generate persistent MAC address for fwpr100p0: No such file or directory
Feb 20 11:50:54 amdtest systemd-udevd[1822]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.
Feb 20 11:50:54 amdtest systemd-udevd[1822]: Using default interface naming scheme 'v240'.
Feb 20 11:50:54 amdtest systemd-udevd[1822]: Could not generate persistent MAC address for fwln100i0: No such file or directory
Feb 20 11:50:54 amdtest kernel: fwbr100i0: port 1(fwln100i0) entered blocking state
Feb 20 11:50:54 amdtest kernel: fwbr100i0: port 1(fwln100i0) entered disabled state
Feb 20 11:50:54 amdtest kernel: device fwln100i0 entered promiscuous mode
Feb 20 11:50:54 amdtest kernel: fwbr100i0: port 1(fwln100i0) entered blocking state
Feb 20 11:50:54 amdtest kernel: fwbr100i0: port 1(fwln100i0) entered forwarding state
Feb 20 11:50:54 amdtest kernel: vmbr0: port 2(fwpr100p0) entered blocking state
Feb 20 11:50:54 amdtest kernel: vmbr0: port 2(fwpr100p0) entered disabled state
Feb 20 11:50:54 amdtest kernel: device fwpr100p0 entered promiscuous mode
Feb 20 11:50:54 amdtest kernel: vmbr0: port 2(fwpr100p0) entered blocking state
Feb 20 11:50:54 amdtest kernel: vmbr0: port 2(fwpr100p0) entered forwarding state
Feb 20 11:50:54 amdtest kernel: fwbr100i0: port 2(tap100i0) entered blocking state
Feb 20 11:50:54 amdtest kernel: fwbr100i0: port 2(tap100i0) entered disabled state
Feb 20 11:50:54 amdtest kernel: fwbr100i0: port 2(tap100i0) entered blocking state
Feb 20 11:50:54 amdtest kernel: fwbr100i0: port 2(tap100i0) entered forwarding state
Feb 20 11:50:54 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:50:55 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:50:55 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:50:55 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:50:55 amdtest kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x42c9a4b30]
Feb 20 11:50:56 amdtest kernel: vfio-pci 0000:0b:00.0: enabling device (0400 -> 0403)
Feb 20 11:50:56 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: pci_set_power_state
Feb 20 11:50:56 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: Waiting for the SOC to be ready.
Feb 20 11:50:56 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: SOC ready.
Feb 20 11:50:56 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: sol register = 0x0
Feb 20 11:50:56 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: SOL true; going to out
Feb 20 11:50:56 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: In out.
Feb 20 11:50:56 amdtest kernel: vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0x19@0x270
Feb 20 11:50:56 amdtest kernel: vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0x1b@0x2d0
Feb 20 11:50:56 amdtest kernel: vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0x25@0x400
Feb 20 11:50:56 amdtest kernel: vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0x26@0x410
Feb 20 11:50:56 amdtest kernel: vfio-pci 0000:0b:00.0: vfio_ecap_init: hiding ecap 0x27@0x440
Feb 20 11:50:56 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:50:56 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:50:56 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:50:56 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:50:56 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:50:56 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:50:56 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:50:56 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:50:56 amdtest kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x42c9a4b70]
Feb 20 11:50:56 amdtest kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x42c9a4b90]
Feb 20 11:50:57 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:50:57 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:50:57 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:50:57 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:50:57 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:50:57 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:50:57 amdtest kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x42c9a4bc0]
Feb 20 11:50:57 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:50:58 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:50:58 amdtest pvedaemon[1889]: starting vnc proxy UPID:amdtest:00000761:0000548C:5E4E72A2:vncproxy:100:root@pam:
Feb 20 11:50:58 amdtest pvedaemon[1141]: <root@pam> starting task UPID:amdtest:00000761:0000548C:5E4E72A2:vncproxy:100:root@pam:
Feb 20 11:50:58 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:50:58 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:50:58 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:50:58 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:50:58 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: pci_set_power_state
Feb 20 11:50:58 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: Waiting for the SOC to be ready.
Feb 20 11:50:58 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: SOC ready.
Feb 20 11:50:58 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: sol register = 0x0
Feb 20 11:50:58 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: SOL true; going to out
Feb 20 11:50:58 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: In out.
Feb 20 11:50:58 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:50:58 amdtest kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x42c9a4c10]
Feb 20 11:50:58 amdtest kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x42c9a4c30]
Feb 20 11:50:58 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:50:59 amdtest pvedaemon[1142]: <root@pam> end task UPID:amdtest:00000714:00005305:5E4E729E:qmstart:100:root@pam: OK
Feb 20 11:50:59 amdtest pvedaemon[1142]: <root@pam> starting task UPID:amdtest:00000767:000054FB:5E4E72A3:vncproxy:100:root@pam:
Feb 20 11:50:59 amdtest pvedaemon[1895]: starting vnc proxy UPID:amdtest:00000767:000054FB:5E4E72A3:vncproxy:100:root@pam:
Feb 20 11:50:59 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:50:59 amdtest kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x42c9a4db0]
Feb 20 11:50:59 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:51:00 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:51:00 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:51:00 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:51:00 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:51:00 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:51:00 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:51:00 amdtest kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x42c9a4e40]
Feb 20 11:51:00 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:51:00 amdtest systemd[1]: Starting Proxmox VE replication runner...
Feb 20 11:51:01 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:51:01 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:51:01 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:51:01 amdtest systemd[1]: pvesr.service: Succeeded.
Feb 20 11:51:01 amdtest systemd[1]: Started Proxmox VE replication runner.
Feb 20 11:51:01 amdtest kernel: AMD-Vi: Completion-Wait loop timed out
Feb 20 11:51:01 amdtest kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x42c9a4ff0]
Feb 20 11:51:02 amdtest kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x42c9a5020]
Feb 20 11:51:03 amdtest kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x42c9a5050]
Feb 20 11:51:04 amdtest kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x42c9a5080]
Feb 20 11:51:05 amdtest kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x42c9a50b0]
Feb 20 11:51:06 amdtest kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x42c9a50e0]
Feb 20 11:51:07 amdtest kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x42c9a5110]
Feb 20 11:51:08 amdtest kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x42c9a5140]
Feb 20 11:51:09 amdtest pvedaemon[1141]: <root@pam> end task UPID:amdtest:00000761:0000548C:5E4E72A2:vncproxy:100:root@pam: OK
Feb 20 11:51:09 amdtest kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x42c9a5170]
Feb 20 11:51:10 amdtest kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x42c9a51a0]
Feb 20 11:51:11 amdtest kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x42c9a51d0]
Feb 20 11:51:12 amdtest kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x42c9a5200]
Feb 20 11:51:13 amdtest kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x42c9a5230]
Feb 20 11:51:14 amdtest kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x42c9a5260]
Feb 20 11:51:15 amdtest kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x42c9a5290]
Feb 20 11:51:16 amdtest kernel: iommu ivhd0: AMD-Vi: Event logged [IOTLB_INV_TIMEOUT device=0b:00.0 address=0x42c9a52c0]

# Stop guest, sleep the host, and start again:

Feb 20 11:56:38 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: pci_set_power_state
Feb 20 11:56:38 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: Waiting for the SOC to be ready.
Feb 20 11:56:38 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: SOC ready.
Feb 20 11:56:38 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: sol register = 0x0
Feb 20 11:56:38 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: SOL true; going to out
Feb 20 11:56:38 amdtest kernel: vfio-pci 0000:0b:00.0: Navi10: In out.

Just tried an older version of the driver (from 12/2/2019) that it sounded like some people had working – but it doesn’t work either. So many variables, so little time to test every permutation. Meh.


Oh well, FWIW I have the same problem on macOS. The patch currently appears to do its job under Windows only.

  1. For me, changing this made no difference.
  2. Looking at the kernel source, pci_save_state and pci_restore_state do nothing anyway…

I noticed the amdgpu driver doesn’t set the D0 power state, so I commented that out to see if it made any difference on my 5700; nope :D.

Looking at the recent amdgpu patches, I noticed there’s a way to do the reset via SMU messages that look different from the ones the patch uses. I guess the SMU message IDs aren’t the same as the PPSMC ones… I don’t know enough about these things to spot why they differ; I’m just comparing what the amdgpu driver does with what the patch does.

Looking at smu_v11_0_baco_set_state in the kernel, they’re using SMU_MSG_EnterBaco…

Searching github for SMU_MSG_EnterBaco reveals lots of enum definitions with similar messages but different values.

SMU_MSG_EnterBaco = 0x1E vs PPSMC_MSG_EnterBaco = 0x18
SMU_MSG_ExitBaco = 0x4F vs PPSMC_MSG_ExitBaco = 0x19
SMU_MSG_ArmD3 = 0x59 vs PPSMC_MSG_ArmD3 = 0x46

I’ve tried sending these messages instead – but it just results in SMC errors (0xFE). I then found navi10_message_map, which contains MSG_MAP(EnterBaco, PPSMC_MSG_EnterBaco), and after re-reviewing the send method I can now see the index = smu_msg_get_index(smu, msg); line that fetches the PPSMC index instead… doh… :smiley:
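
In other words, the framework-level SMU_MSG_* values are just indices into a per-ASIC table that yields the PPSMC value actually written to the mailbox. Here is a reduced, userspace model of that translation; it mirrors the kernel’s naming but is illustrative only, covering just the three messages from above (the real SMU_MSG_* enum values differ, but only their role as table indices matters here):

```c
#include <stdio.h>

/* Reduced model of the kernel's navi10_message_map: the generic
 * message enum is only an index; the stored value is the ASIC's
 * PPSMC message ID, which is what actually hits the hardware. */
enum smu_message {
	SMU_MSG_EnterBaco,
	SMU_MSG_ExitBaco,
	SMU_MSG_ArmD3,
	SMU_MSG_MAX_COUNT,
};

#define MSG_MAP(msg, index) [SMU_MSG_##msg] = (index)

static const int navi10_message_map[SMU_MSG_MAX_COUNT] = {
	MSG_MAP(EnterBaco, 0x18), /* PPSMC_MSG_EnterBaco */
	MSG_MAP(ExitBaco,  0x19), /* PPSMC_MSG_ExitBaco  */
	MSG_MAP(ArmD3,     0x46), /* PPSMC_MSG_ArmD3     */
};

/* The send path looks the index up before touching the mailbox --
 * sending the raw SMU_MSG_* values instead is why the SMU answered
 * with error 0xFE. */
static int smu_msg_get_index(enum smu_message msg)
{
	return navi10_message_map[msg];
}

int main(void)
{
	printf("EnterBaco -> PPSMC 0x%02X\n", smu_msg_get_index(SMU_MSG_EnterBaco));
	printf("ExitBaco  -> PPSMC 0x%02X\n", smu_msg_get_index(SMU_MSG_ExitBaco));
	printf("ArmD3     -> PPSMC 0x%02X\n", smu_msg_get_index(SMU_MSG_ArmD3));
	return 0;
}
```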

Everything else looks very similar apart from the doorbell addition. I assume there are two ways of talking to them; maybe that’s the difference between the 5700 and the XT?

I tried sending 1 as the arg to the EnterBaco message, but that also made no difference.

I can’t remember if I saw this before, but now the GPU fan stops when the VM isn’t running; I was messing with some of the kernel parameters, so maybe that was involved. It hasn’t fixed the restart issue, just an observation I hadn’t made before now…

Update: I’ve returned the 5700. I’m going back to evil Nvidia for the time being :smiley:

I tried adding the mmBIF_DOORBELL_INT_CNTL and mmTHM_BACO_CNTL pieces but they didn’t seem to make a difference. I’ll play around some more as I find time.

Any idea how to get this working in ESXi? Running 6.7.

Hello all,

first post on the forum, be gentle :S

I have tried to apply the patch to the kernel, but I am having some problems, as gcc cannot find the ioremap_nocache function. I found an xtensa ioremap_nocache function but cannot get it to compile into the kernel.

I am using the “mainline” 5.6-rc4 kernel as of 2020-03-01.
I only patched the quirks.c file from drivers/pci.
I used the config from my current kernel: 5.5.1-050501-generic.

I am running an Asus X470-F Gaming motherboard with an AMD R2700X, an Asus Strix Vega 64 as the main GPU, and a Sapphire Pulse RX5700X as the pass-through GPU.

I cannot use the Vega as the pass-through card because the motherboard will only use the first PCI Express slot during boot, and I cannot swap the cards, as the Vega card is 2.7 slots wide and I cannot place it at the bottom (no space).

The error:

drivers/pci/quirks.c: In function ‘reset_amd_navi10’:
drivers/pci/quirks.c:3930:9: error: implicit declaration of function ‘ioremap_nocache’; did you mean ‘ioremap_cache’? [-Werror=implicit-function-declaration]
 3930 |  mmio = ioremap_nocache(mmio_base, mmio_size);
      |         ^~~~~~~~~~~~~~~
      |         ioremap_cache
drivers/pci/quirks.c:3930:7: warning: assignment to ‘uint32_t *’ {aka ‘unsigned int *’} from ‘int’ makes pointer from integer without a cast [-Wint-conversion]
 3930 |  mmio = ioremap_nocache(mmio_base, mmio_size);
      |       ^

Use ioremap directly instead of ioremap_nocache; the latter was removed because the former does exactly the same thing.
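
Concretely, the line the compiler flagged (3930 in the error output above) becomes the following; ioremap() has defaulted to uncached mappings for a long time, which is why ioremap_nocache() could be dropped in 5.6:

```c
mmio = ioremap(mmio_base, mmio_size);
```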

Thanks, that was it; it compiles now. I just had a problem with my kernel config, so I couldn’t test it out yet.

Is there anything that could be done in a guest’s shutdown script to make resetting Navi more reliable?
I have both a Linux and a Windows guest I can test with.

What can be done is what this patch already does; that’s it. I have also found that it’s random: sometimes shutting down the guest completely breaks the reset, while a hard reset of the guest (qemu monitor command system_reset) will often work, provided the guest had not crashed.

I have a machine with two identical Navi 10 cards and two VMs, each with one Navi passed through.

When I try to reboot one VM without the patch, the host throws an error and only the VM that attempted the reboot is disabled until the host reboots.

When I try this with the patch, the VM that is not trying to reboot at all gets killed and disabled along with the VM that is rebooting.

Is this the expected behavior of the patch, or is there some PEBCAK error in my system that’s causing this?

Please read through this thread; the patch is incomplete and still doesn’t function correctly in some situations/configurations, for unknown reasons.


Out of curiosity: what makes amdgpu able to BACO-reset Navi 10, but VFIO unable to?
What’s stopping VFIO from just using the same BACO reset procedure that amdgpu uses?

That’s exactly what this patch does… but even the amdgpu driver doesn’t always reset/recover the GPU. It also does a ton of extra stuff that would never be accepted as a PCI quirk, such as loading firmware into the SOC and booting it, etc.


I am following this thread to stay up to date on this.
@gnif: Out of curiosity: you say AMD is cooperative at your end, and yet the “ton of extra stuff” may make AMD’s intervention unavoidable.
Does something like a post-sales VBIOS upgrade “made by AMD” actually happen in reality? I’ve only ever seen those ripped from newer cards.
Thanks

Some of AMD’s staff stepped up and started to help with this, but their hands are tied by the NDAs they are under. AMD as a company don’t seem to care about fixing this properly with a firmware and/or silicon update.


I see two possible kinds of “not caring”: a) not wanting to throw money at it, and b) not wanting to let it fix itself just by allowing it.
Allowing somebody who is bound by an NDA to tell something is relatively cheap in money terms.
Is there a way the public can do something?
I bet it’s not hammering the support lines, because that will just hit the wrong wall.

To me it looks very strange. AMD is said to embrace Linux, but is miles behind Nvidia in anything SR-IOV. I think it is possible that virglrenderer could change the game to AMD’s advantage in the far future, but if they don’t care about people doing it (see Nvidia Error 43), they could just as well grab the opportunity and push all GPUs into guests. It’s a missed chance, given there are hardly any AMD cards that support VMs.