I have a Proxmox 7.0-11 system on kernel 5.11.22-3-pve, and I am suddenly having trouble using PCIe passthrough with a USB controller built into the motherboard. I am trying to boot a Windows 10 VM with an RX 580, the motherboard sound card, and the motherboard USB controller passed through. Everything works fine when the USB controller is not passed through, but when it is, the system crashes immediately.
This is a new issue; I suspect that a kernel update broke something. Everything was working fine until a power outage caused unexpected downtime (the machine is on a UPS, so it shut down gracefully) and the reboot brought up a new kernel. The bug first appeared on kernel 5.4 (Proxmox 6.4) and still affects the current Proxmox kernel.
Here are my complete specs:
Motherboard: ASUSTeK PRIME X470-PRO
CPU: AMD Ryzen 7 2700X
GPU: AMD Radeon RX 580 8GB
RAM: 3x G-Skill F4-3200C16-16GVK
The USB controller in question is 0b:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Zeppelin USB 3.0 Host controller [1022:145f]
Here is the relevant /var/log/syslog segment after running qm start 102
Aug 14 17:19:04 pve03 qm[86904]: start VM 102: UPID:pve03:00015378:0001A3C8:61184158:qmstart:102:[email protected]:
Aug 14 17:19:04 pve03 qm[86900]: <[email protected]> starting task UPID:pve03:00015378:0001A3C8:61184158:qmstart:102:[email protected]:
Aug 14 17:19:04 pve03 kernel: [ 1074.695912] xhci_hcd 0000:0b:00.3: USB bus 5 deregistered
Aug 14 17:19:04 pve03 systemd[1]: Stopped target Sound Card.
Aug 14 17:19:05 pve03 systemd[1]: Started 102.scope.
Aug 14 17:19:05 pve03 systemd-udevd[86909]: Using default interface naming scheme 'v247'.
Aug 14 17:19:05 pve03 systemd-udevd[86909]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Aug 14 17:19:05 pve03 kernel: [ 1075.594595] device tap102i0 entered promiscuous mode
Aug 14 17:19:05 pve03 ovs-vsctl: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl del-port tap102i0
Aug 14 17:19:05 pve03 ovs-vsctl: ovs|00002|db_ctl_base|ERR|no port named tap102i0
Aug 14 17:19:05 pve03 ovs-vsctl: ovs|00001|vsctl|INFO|Called as /usr/bin/ovs-vsctl del-port fwln102i0
Aug 14 17:19:05 pve03 ovs-vsctl: ovs|00002|db_ctl_base|ERR|no port named fwln102i0
Aug 14 17:19:05 pve03 systemd-udevd[86912]: Using default interface naming scheme 'v247'.
Aug 14 17:19:05 pve03 systemd-udevd[86912]: ethtool: autonegotiation is uns
The log cuts off abruptly at the point where the system crashes completely.
Bumping this because this is still a problem and I have zero clue what’s going wrong (and I’ve finally returned to this problem). I am now on Proxmox PVE 7.1-10 and on kernel version 5.13.19-5-pve.
Some more investigation I’ve done:
I created a tmux session and ran a curl every half second to a logging webserver. When starting the VM, the logs show that the requests stop coming in almost immediately. However, there are two more requests that come in 8 seconds apart after the system hangs. Not exactly sure what this means, other than that the system isn’t completely dead. Here is the tail end of the log with timestamps:
Additionally, here is the tail end of journalctl -o short-precise -k -b -1 for the logs of the last boot (other entries were minutes before this; these stop when the system hangs):
Mar 08 18:49:55.831184 pve03 kernel: xhci_hcd 0000:0b:00.3: remove, state 4
Mar 08 18:49:55.831385 pve03 kernel: usb usb6: USB disconnect, device number 1
Mar 08 18:49:55.831523 pve03 kernel: usb 6-1: USB disconnect, device number 2
Mar 08 18:49:55.843175 pve03 kernel: xhci_hcd 0000:0b:00.3: USB bus 6 deregistered
Mar 08 18:49:55.843335 pve03 kernel: xhci_hcd 0000:0b:00.3: remove, state 1
Mar 08 18:49:55.843448 pve03 kernel: usb usb5: USB disconnect, device number 1
Mar 08 18:49:55.843588 pve03 kernel: usb 5-1: USB disconnect, device number 2
Mar 08 18:49:55.843735 pve03 kernel: usb 5-1.1: USB disconnect, device number 3
Mar 08 18:49:55.955175 pve03 kernel: xhci_hcd 0000:0b:00.3: USB bus 5 deregistered
Mar 08 18:49:56.735192 pve03 kernel: device tap102i0 entered promiscuous mode
Mar 08 18:49:56.755193 pve03 kernel: fwbr102i0: port 1(tap102i0) entered blocking state
Mar 08 18:49:56.755289 pve03 kernel: fwbr102i0: port 1(tap102i0) entered disabled state
Mar 08 18:49:56.755312 pve03 kernel: fwbr102i0: port 1(tap102i0) entered blocking state
Mar 08 18:49:56.755326 pve03 kernel: fwbr102i0: port 1(tap102i0) entered forwarding state
Mar 08 18:49:56.763193 pve03 kernel: device fwln102o0 entered promiscuous mode
Mar 08 18:49:56.775180 pve03 kernel: fwbr102i0: port 2(fwln102o0) entered blocking state
Mar 08 18:49:56.775260 pve03 kernel: fwbr102i0: port 2(fwln102o0) entered disabled state
Mar 08 18:49:56.775283 pve03 kernel: fwbr102i0: port 2(fwln102o0) entered blocking state
Mar 08 18:49:56.775297 pve03 kernel: fwbr102i0: port 2(fwln102o0) entered forwarding state
Mar 08 18:49:57.215200 pve03 kernel: device tap102i1 entered promiscuous mode
Mar 08 18:49:57.227183 pve03 kernel: vmbr1: port 2(tap102i1) entered blocking state
Mar 08 18:49:57.227273 pve03 kernel: vmbr1: port 2(tap102i1) entered disabled state
Mar 08 18:49:57.227310 pve03 kernel: vmbr1: port 2(tap102i1) entered blocking state
Mar 08 18:49:57.227338 pve03 kernel: vmbr1: port 2(tap102i1) entered forwarding state
I really have no idea what’s happening here. As always, any help/pointers are appreciated.
IOMMU Group 16:
09:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] [1002:67df] (rev c7)
09:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere HDMI Audio [Radeon RX 470/480 / 570/580/590] [1002:aaf0]
...
IOMMU Group 20:
0b:00.3 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD] Zeppelin USB 3.0 Host controller [1022:145f]
...
IOMMU Group 21:
0c:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Zeppelin/Renoir PCIe Dummy Function [1022:1455]
IOMMU Group 22:
0c:00.2 SATA controller [0106]: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] [1022:7901] (rev 51)
IOMMU Group 23:
0c:00.3 Audio device [0403]: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 00h-0fh) HD Audio Controller [1022:1457]
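For anyone who wants to produce a listing in the same format, the groups can be read from /sys/kernel/iommu_groups. This is a minimal sketch, not necessarily the command used for the output above; the function name and the optional path argument are made up here so it can be pointed at a test directory.

```shell
#!/bin/sh
# Sketch: print every IOMMU group and the PCI devices in it.
# list_iommu_groups and its optional path argument are hypothetical helpers,
# not part of any Proxmox tooling.
list_iommu_groups() {
    root="${1:-/sys/kernel/iommu_groups}"
    # Sort numerically so group 2 is listed before group 10.
    for n in $(ls "$root" | sort -n); do
        echo "IOMMU Group $n:"
        for d in "$root/$n"/devices/*; do
            [ -e "$d" ] || continue
            addr="${d##*/}"               # e.g. 0000:0b:00.3
            if command -v lspci >/dev/null 2>&1; then
                lspci -nns "$addr"        # name plus [vendor:device] IDs
            else
                echo "$addr"
            fi
        done
    done
}

# Usage on the host:
#   list_iommu_groups
```

A device can only be passed through cleanly if everything else in its group goes with it, so the USB controller sitting alone in group 20 is the good case.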
I thought I was targeting the GPU and its audio device, the USB controller, and the sound card, but for some reason I had 1022:1455 listed instead of the sound card’s 1022:1457. I don’t remember why this happened or whether it was intentional, but it doesn’t seem to cause any problems and I’ve commented out the audio passthrough anyway.
This might be the cause of your problem. The kernel has a driver loaded for the USB controller, and if you try to pass it through regardless, unforeseen complications might arise, such as crashes. I don't know whether vfio-pci is unable to handle USB controllers, but you cannot have another driver loaded for the device. I suggest searching online for vfio-pci and USB controllers to see how to resolve this.
I figure it is a problem with vfio-pci on the USB controller specifically, and you need to handle that device differently, since vfio-pci has loaded correctly for your GPU:
Kernel driver in use: vfio-pci
Kernel modules: amdgpu
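The same check can be made for the USB controller by reading the driver symlink in sysfs, which is where the "Kernel driver in use" line comes from. A rough sketch; pci_driver and the SYSFS_ROOT override are hypothetical names used only for illustration.

```shell
#!/bin/sh
# Sketch: report which kernel driver currently owns a PCI device.
# pci_driver and SYSFS_ROOT are hypothetical; SYSFS_ROOT only exists so the
# function can be tried against a fake directory tree.
pci_driver() {
    dev="$1"                                   # full address, e.g. 0000:0b:00.3
    root="${SYSFS_ROOT:-/sys}"
    link="$root/bus/pci/devices/$dev/driver"
    if [ -L "$link" ]; then
        # The symlink points at the driver directory; its last component is the name.
        basename "$(readlink "$link")"
    else
        echo "none"                            # no driver bound
    fi
}

# Usage on the host:
#   pci_driver 0000:0b:00.3
```

If this reports xhci_hcd before the VM starts, the host is still using the controller and something has to rebind it before QEMU can take it over.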
Ok, so after some digging it seems you were right, thanks! I was able to temporarily resolve the issue by manually unbinding the controller from the xhci_hcd driver and binding it to the vfio-pci driver. For anyone else with the same problem, here’s what I did:
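The exact commands from this post are not reproduced here. As a generic sketch of the usual sysfs procedure for this kind of manual rebind (not necessarily what was actually run), assuming the controller at 0000:0b:00.3 and with rebind_to_vfio and SYSFS_ROOT being made-up names:

```shell
#!/bin/sh
# Sketch only: move one PCI device from its current driver to vfio-pci.
# rebind_to_vfio and SYSFS_ROOT are hypothetical; SYSFS_ROOT lets the function
# be dry-run against a fake directory instead of the real /sys.
rebind_to_vfio() {
    dev="$1"                                   # full address, e.g. 0000:0b:00.3
    root="${SYSFS_ROOT:-/sys}"
    devdir="$root/bus/pci/devices/$dev"

    [ -d "$devdir" ] || { echo "no such device: $dev" >&2; return 1; }

    # Detach whatever driver currently owns the device (xhci_hcd in this thread).
    if [ -e "$devdir/driver/unbind" ]; then
        echo "$dev" > "$devdir/driver/unbind"
    fi

    # Allow only vfio-pci to claim the device, then ask the PCI core to re-probe it.
    echo vfio-pci > "$devdir/driver_override"
    echo "$dev" > "$root/bus/pci/drivers_probe"
}

# Usage on the host (as root, before starting the VM):
#   modprobe vfio-pci
#   rebind_to_vfio 0000:0b:00.3
```

This does not survive a reboot, and anything plugged into that controller disappears from the host the moment it is unbound.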
Hmm, no, vfio can dynamically unbind xhci-pci when you assign the controller to a VM. It will unbind xhci, bind itself, and keep going.
The only case where starting a VM would crash Proxmox for me was when PCI IDs changed and I suddenly tried to pass through my Proxmox boot drive, thinking it was a USB controller.
Can you post /etc/pve/qemu-server/102.conf?