You are quite welcome. Learning goes both ways: I'm building a box right now to take on some more permanent VM-hosting duties, so I want to make sure I know more about how KVM/QEMU can fail and how to troubleshoot it.
If you don't find anything in the logs, and once you've really made sure you're looking at the right stuck process, about the only thing I can think of is to monitor the QEMU lifecycle events. There might be a clue right before it freezes:
virsh qemu-monitor-event <your domain name> --loop --pretty
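A rough sketch of what I'd leave running while waiting for the hang; the domain name "myvm" is a placeholder for yours, and the log path assumes libvirt's default layout:

```shell
# Stream QEMU lifecycle events for the guest, with timestamps, until interrupted
virsh qemu-monitor-event myvm --loop --pretty --timestamp &

# In parallel, follow the per-domain QEMU log and the libvirt daemon journal
tail -f /var/log/libvirt/qemu/myvm.log &
journalctl -f -u libvirtd
```

The last event printed before the freeze (and anything QEMU wrote to its own log at that moment) is what you'd want to capture for a bug report.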
Next step, now that you've done some homework, would be to find out where the QEMU devs hang out or what bug tracker they use and post there. They would likely ask you for the same info and more. A little googling led me here: Support - QEMU. Note that this would require you to reproduce the problem without the libvirt front end, so you would have to set up a test case using just the QEMU command line. There are plenty of guides on that.
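For a bare-QEMU repro, something along these lines is the usual starting point. This is a minimal sketch, not your actual config: the disk path, memory/CPU sizing, and the PCI address of the passed-through NIC are all placeholders you'd take from your libvirt domain XML:

```shell
# Minimal KVM guest without libvirt; adjust to mirror the libvirt-defined VM.
# 0000:03:00.0 is a hypothetical PCI address for the passed-through NIC.
qemu-system-x86_64 \
  -enable-kvm \
  -machine q35 -cpu host \
  -m 8G -smp 4 \
  -drive file=/path/to/disk.qcow2,if=virtio,format=qcow2 \
  -device vfio-pci,host=0000:03:00.0 \
  -nographic
```

If it still hangs under plain QEMU, that rules out libvirt and gives the devs a much simpler test case.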
You could also simply try booting a different distro than Arch (I think I saw that was what you were running) and run your VM under that. As a rolling distro, Arch exposes you to more upstream bugs. Use a spare drive, install a distro to it, copy the VM's domain XML and qcow files over, edit the paths for your qcow disk files, and you can quickly test whether Arch is the problem. Better to run something like CentOS, Debian, or Ubuntu if you need more stability.
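The copy-over steps look roughly like this; "myvm" and the paths are placeholders, and the sed line assumes the disk images simply moved from one directory to another:

```shell
# On the old host: export the domain definition
virsh dumpxml myvm > myvm.xml

# Copy myvm.xml and the qcow2 disk(s) to the new install, then fix the
# disk paths in the XML, e.g. if the images moved to /mnt/vms:
sed -i 's|/var/lib/libvirt/images|/mnt/vms|g' myvm.xml

# On the new host: register and start the VM
virsh define myvm.xml
virsh start myvm
```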
Good luck and post back to this thread if you resolve it.
So I'm currently waiting for a crash again, but I may have fixed the issue. My VM has been running fine with no issues for 12 hours now, since I switched from the passed-through NIC to the virtual network.
Last time I said I thought I fixed it, it crashed an hour later, so worst case scenario here I get to keep digging!
If it is the NIC, I’m going to try and replicate it and report to the qemu devs as it would be nice to get that resolved.
OR… I just happened across a package I never installed on my host: qemu-arch-extra
I don’t recall if I had that on my previous host OS install, but I installed it last night, so that might have been the missing piece.
I know it isn't good testing procedure to change two things at once, but it should be easy enough to change one variable back and try again.
Regardless, I'm going to keep diving into debugging and learning about system calls and that entire process.
No crash at all after 36 hours with the virtual NIC.
I disabled the virtual NIC, added back the physical Ethernet port with VFIO, and it hung after 3 hours.
So now I just need to figure out why.
I have a somewhat similar issue, where the kernel occasionally panics (on the host, not the VM guest) on an AMD X570 system when an HP NC552SFP NIC is present. If I remove the card, the system is stable, so it's an incompatibility somewhere between this old NIC, the new AMD chipset, and the IOMMU. The same NIC on an Intel Z370M system is fine.
Luckily, in my case the issue can only surface during the boot-up process. If boot succeeds and the kernel doesn't throw these random "amd-vi completion-wait loop timed out" errors, the system is stable. If it does happen, the system will freeze, either immediately or hours later, after a bunch of kernel stack dumps. I also found that I can reduce the probability of this happening at boot by installing the host OS in non-UEFI (CSM) mode.
When your VM hangs, do you see any kernel errors in dmesg?
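For reference, this is roughly how I'd check; with a passed-through NIC, IOMMU/VFIO-related lines are the ones to look for:

```shell
# Kernel messages at warning level or worse from the current boot
dmesg --level=err,warn

# Same via the journal (survives reboots if persistent logging is enabled)
journalctl -k -b -p warning

# Narrow down to IOMMU / passthrough related messages
dmesg | grep -iE 'amd-vi|iommu|vfio'
```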