VM broke after upgrade

So this has happened a couple of times over the past year, and I can't remember what I did to fix it. I'm not even sure whether virt-manager broke or something in my VM setup did.

It's entirely possible that the reason I'm having this problem in the first place is that I'm running Fedora 23 with a bleeding-edge kernel. I was running 4.9-rc3 for a week or two and it was very stable. This morning I decided to do an upgrade and it installed 4.9-rc5, plus it updated a bunch of other packages. I am tentatively planning on installing Fedora 25 next week and starting over, hopefully recreating my VM setup.

So here's what happens: I open Virtual Machine Manager, select the one I want to start (Windows 10 Pro), and hit the play button. The monitor connected to the video card that is passed through to the VM immediately goes to sleep, and three of the four cores assigned to the VM max out: two real cores and one of the two HT cores. Not sure why the last HT core isn't used.

I just want to learn how to troubleshoot this sort of problem. So if anyone could help me out that would be fantastic.

Virt-Manager is just a tool to control QEMU/KVM-created VMs and an easy way to build a virtual network setup without using commands, so that isn't what's broken. To troubleshoot this you have to read back through the logs and error reports, maybe using commands like dmesg -w and journalctl -f to follow the events on your computer as the bug happens. You can also look through the logs manually under /var/log; the var directory/partition is usually the place in Linux that houses the log files.
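A quick reference for the commands mentioned above. The two streaming ones block the terminal until interrupted, so this sketch just prints each command; run them for real in separate terminals:

```shell
# Print each command instead of executing it, since 'dmesg -w' and
# 'journalctl -f' both stream until you Ctrl-C them.
show() { echo "\$ $*"; }

show sudo dmesg -w          # follow the kernel ring buffer live
show journalctl -f          # follow the systemd journal live
show journalctl -b -p err   # only error-level messages from the current boot
show ls /var/log            # the classic flat log files
```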

But did this all happen when you moved to a new kernel?
Also, if you love bleeding edge, it's better to use Fedora's Rawhide. More developers in that area, and you get some support from the Fedora team.
Because for this issue you're kind of on your own.

The only reason I switched to the bleeding edge kernel when I made this system was for the Skylake support. I don't plan on doing the same thing with Fedora 25.

I used journalctl -f to capture the logs as I started the VM. Here's the output http://pastebin.com/Ux7rS5z4

I think what I may need to do is set up the passthrough again from scratch. I really don't want to, but I think that may be where it's hanging.

Did you do dnf update? (Not that you should; it might break your setup.)

But yeah, seeing this, it seems like one or two things aren't set right, and it's a maze to find the error. You really have to backtrack everything you've done since the last time it worked.

I always do dnf upgrade when updating my systems. Apparently both do the same thing.

Yeah, something isn't set up right. At the very end of the log it looks like something tries to map an address to a device. Then it just sits with the three cores maxed and the log file doesn't update.

I'll have to go through the passthrough guide and see where it broke. I may have to nuke it and start the passthrough over.

This is the kind of shit that pisses me off about Linux. Something as simple as updating the system shouldn't break things to the point that you have to be fucking Linus Torvalds to figure out what happened. I really, really hope this is happening because I am using the bleeding edge RC kernels. Hopefully this stops happening when I move to Fedora 25.

Don't get me wrong, though. I love Linux, and am happy to have ditched Windows. But sometimes it can be a pain in the ass.

Fedora indeed stopped distinguishing between update and upgrade and just made them do the same thing. In Debian you still update your repos and then upgrade your system with upgrade.

You ended up doing sudo apt-get update && sudo apt-get upgrade all the time there.

PS: You're doing passthrough stuff, which is not to be taken lightly. And second, updates in Windows will break your system too; it happens on every OS. But luckily in Linux you can do manual updates for just certain parts, which is what you should do in the future if you intend to keep an OS for your own personal use for a longer period.

Gentoo and Arch (I was told) are good for this.

Yeah, sorry for venting a bit. I'm usually pretty levelheaded, but this just frustrates me for some reason.

And yup, doing passthrough of hardware to a VM is (probably) an inherently risky, unstable business. I should lower my expectations a bit I suppose.

As an aside, I discovered the dnf history command. Holy crap, that's amazing. So I checked the upgrade I did this morning, and the only thing that would be directly related to this situation is that the UEFI firmware the VM uses was updated. It's called 'edk2.git-ovmf.' No idea if that would have any impact.
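For anyone else finding this: dnf history can also undo a whole transaction outright, assuming the older packages are still fetchable. A dry-run sketch (it only prints the commands, since the real ones modify the system):

```shell
# Print rather than execute; these commands change installed packages.
cmd() { echo "\$ $*"; }

cmd dnf history list             # numbered list of past transactions
cmd dnf history info last        # what the most recent transaction touched
cmd sudo dnf history undo last   # roll it back, if the old RPMs are still available
```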

Remove the passed-through card and add one of the virtual displays, and see if that works? Fedora has pretty much been like that since F23 (what happened to stability? I mean, Arch is more stable?).

Cool thing with Windows is that if you install it to the same storage, it'll move the old installation into a Windows.old folder.

OVMF is the UEFI firmware that QEMU uses; it's not part of Fedora.

Edit: that didn't really say what I meant.

You can probably go back to an old version and remove the repo for it?

I remember the ovmf stuff from when I set the VM up to use UEFI instead of classic BIOS.

Now that I have seen that, it reminds me of the last time this happened, and I want to say it was because of OVMF getting updated then too.

I'll have to figure out how to roll back that specific package, and maybe tell it not to update that package anymore.

For troubleshooting purposes, anything you've done since the last time it worked can have an influence, and the only way to find out is by reversing it.

And yes, there is a list you can make of packages not to upgrade automatically. I had to look it up.

Thanks, I'll use that to exclude that package.
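For the record, assuming the standard dnf setup, the skip list is an exclude= line in dnf's main config (globs are allowed):

```ini
# /etc/dnf/dnf.conf
[main]
exclude=edk2.git-ovmf*
```

If the versionlock plugin is installed, `dnf versionlock add edk2.git-ovmf` is an alternative that pins the current version rather than hiding the package entirely.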

When I tried to downgrade the package it said that it was already the lowest version available.

Yeah, from the current repo it is the lowest.
I think you'd have to downgrade your repo.

That is the downside of having upgrade and update combined into one.
Though I'm not 100% sure what I'm saying makes sense.

I guess I don't understand why the UEFI firmware that the VM uses getting updated would cause the VM not to boot. Where would I look to see if that is actually what is causing the hang? I don't see any evidence of it in the journalctl -f dump.

I have no idea how to downgrade to the previous version in any case.
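A sketch of the usual options, printed rather than run. `dnf downgrade` only helps if an older build is still in an enabled repo; otherwise you need a saved copy of the old RPM. OLDVERSION below is a placeholder, not a real filename:

```shell
# Print rather than execute; these change installed packages.
cmd() { echo "\$ $*"; }

cmd sudo dnf downgrade edk2.git-ovmf
# If no older build exists in any enabled repo, install a kept copy directly:
cmd sudo dnf install ./edk2.git-ovmf-OLDVERSION.rpm
# Or with plain rpm, forcing the older version over the newer one:
cmd sudo rpm -Uvh --oldpackage edk2.git-ovmf-OLDVERSION.rpm
```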

And dmesg?

But yeah, it's hard to debug this, and it might be harder to repair, at least in Fedora, because I also see some issues with SELinux and some config files that couldn't be read (probably due to security changes).

Did you follow a guide when setting up your passthrough?

Here's dmesg -w:

[ 8833.775524] xhci_hcd 0000:02:00.0: remove, state 4
[ 8833.775540] usb usb4: USB disconnect, device number 1
[ 8833.775543] usb 4-1: USB disconnect, device number 2
[ 8833.893377] xhci_hcd 0000:02:00.0: USB bus 4 deregistered
[ 8833.893490] xhci_hcd 0000:02:00.0: remove, state 1
[ 8833.893495] usb usb3: USB disconnect, device number 1
[ 8833.893495] usb 3-1: USB disconnect, device number 2
[ 8833.893496] usb 3-1.2: USB disconnect, device number 10
[ 8833.925888] usb 3-2: USB disconnect, device number 3
[ 8833.990556] usb 3-3: USB disconnect, device number 5
[ 8833.990661] usb 3-4: USB disconnect, device number 7
[ 8834.056412] xhci_hcd 0000:02:00.0: USB bus 3 deregistered
[ 8834.744368] vfio_ecap_init: 0000:01:00.0 hiding ecap [email protected]
[ 8834.744375] vfio_ecap_init: 0000:01:00.0 hiding ecap [email protected]

This is what shows up as soon as I attempt to start the VM.

I have two devices set up for passthrough: a GTX 970 and a USB3 PCIe card. They are using VFIO.

I tried removing the cards from the VM and adding a display. I got nothing. Black screen and same maxed cores.

I used this guide for the passthrough: vfio.blogspot.com/2015/05/vfio-gpu-how-to-series-part-3-host.html

Edit: the last two lines in the dmesg bug me. The video card has two components: the video card itself and an audio component. They are, respectively, 0000:01:00.0 and 0000:01:00.1, and the USB card is 0000:02:00.0. I only see the first half of the video card being referenced.
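For reference, the driver bound to each of those functions can be checked with lspci, using the addresses above. A sketch that just prints the commands:

```shell
# Print rather than execute; the real commands query live PCI state.
cmd() { echo "\$ $*"; }

cmd lspci -nnk -s 01:00.0   # GPU: 'Kernel driver in use' should read vfio-pci
cmd lspci -nnk -s 01:00.1   # its HDMI audio function, same expectation
cmd lspci -nnk -s 02:00.0   # the USB3 card
```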

Hmm, could it be your graphics card setup changed too during the upgrade?

Besides the black screen, does Windows start or not?
I mean, can you ping it or find it with Samba?

But if you do this sort of stuff, it's best not to upgrade automatically. Even in Fedora you can manually upgrade components by just using dnf upgrade <package>.

So I have gone through part three of the guide (here and also here) and made sure that everything relating to the hardware passthrough of the video card and the USB card is correct. And it is. It is set up exactly how it should be.

I checked that both the video card and the USB card are being assigned the vfio driver, and the video card definitely is. The USB card, for some reason, is showing 'xhci_hcd' as the driver in use. Not sure why that is; it is set up identically to the video card. I removed the USB card from the VM settings just to see if that helped. It didn't.

At boot I get four errors relating to ACPI:

ACPI Error: [\_SB_.PCI0.XHC_.RHUB.HS11] Namespace lookup failure, AE_NOT_FOUND (20160831/dswload-210)
ACPI Exception: AE_NOT_FOUND, During name lookup/catalog (20160831/psobject-227)
ACPI Exception: AE_NOT_FOUND, (SSDT:xh_rvp08) while loading table (20160831/tbxfload-228)
ACPI Error: 1 table load failures, 8 successful (20160831/tbxfload-246)

This shows up right after selecting the kernel to boot. I have no idea if this is relevant.

So, yeah. I have no idea where to go from here. It just doesn't work.

Does this thread maybe help you?
@GrayBoltWolf might be better suited to help you, since he did it recently.
It's been a while since I did this myself.

I just read through that thread. His walkthrough is extremely similar to the guide I followed, just geared more toward Debian. There are a couple of things I'd like to try that might streamline a couple of steps.

Right now I am trying to figure out how to properly transfer the VM to another system. I am copying the VM img files from my desktop to an external hard drive, which will then move to my server, but I am not sure how to transfer the settings. I want to see if the VM will boot on that system, where I have done absolutely zero hardware passthrough work (yet). I'm thinking it's done with the 'migrate' option, but that is grayed out.
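For moving a libvirt VM by hand (the 'migrate' option needs both hosts up and connected at once), the usual route is to copy the disk image plus the domain XML. A sketch that just prints the commands, with 'win10' standing in for whatever the domain is actually named:

```shell
# Print rather than execute; the real commands touch libvirt state.
cmd() { echo "\$ $*"; }

# On the source machine: export the domain definition.
cmd "virsh dumpxml win10 > win10.xml"
# Copy win10.xml and the disk image(s) it references to the target, then:
cmd virsh define win10.xml   # register the domain on the target host
```

Since this VM boots with OVMF, its NVRAM vars file (typically under /var/lib/libvirt/qemu/nvram/) has to come along too, and the target needs OVMF firmware installed at the path the XML expects.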

I have a very basic Windows 7 VM that boots fine. It doesn't do any passthrough, just video in the window. I also have an 8.1 VM that won't boot, and that one was created using OVMF, same as the Windows 10 VM.

This leads me to think that something happened with the OVMF UEFI stuff.

I don't know if I should just jump into that thread and link to my thread asking for help.