I am passing my old GTS 450 through to an Ubuntu VM, but the card suffers from a reset bug. I have tried detaching the card from the command line before shutting down the VM, and I have tried dumping the ROM and feeding it in through the config; nothing helps. Every time I shut down or reboot the VM, I get a black screen and have to reboot the host PC.
Is there any way to power cycle the PCIe slot from the host command line so the card can be initialized more than once?
BTW, I am on a Ryzen 1600 and an Asus X370-Pro with the latest BIOS (1001).
This is a hardware bug that can't be fixed. Some GPUs that have a dual BIOS (my Fury Nitro, for example) can be reset by switching to the other BIOS.
OK, and what about the patch that Wendell talks about in the stream starting @11:35? The kernel 4.15 patch that power cycles the PCIe device?
Basically this (as I understand at least):
I'm not super up to date on it, because I'm not using Vega or any other GPUs that suffer from this bug (at least, not for passthrough).
I'm going to defer to @mihawk90 and Wendell since they're clearly following it more closely.
actually not really, just reading a bit
Still following it more closely than I am. I just stay away from GPUs with the issue.
So, for my issue the options would be either to reboot or shut down the Ubuntu VM with some kind of script rather than sudo shutdown, or to power cycle the card from the host system.
I have tried to run this from host OS:
virsh detach-device Ubuntu /mnt/user/system/gpudev.xml
virsh detach-device Ubuntu /mnt/user/system/audiodev.xml
virsh destroy Ubuntu
where the XML files are:
GPU
gpudev.xml
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x000' bus='0x29' slot='0x00' function='0x0'/>
  </source>
</hostdev>
HDMI audio
audiodev.xml
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x000' bus='0x15' slot='0x00' function='0x1'/>
  </source>
</hostdev>
After that it reports that the devices were successfully detached and the VM was destroyed, but I still cannot use the GPU a second time without rebooting the host OS.
Any ideas?
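One thing that may be worth checking after the detach is which kernel driver each function ends up bound to, since with managed='yes' libvirt is expected to hand the device back to the host once the guest lets go of it. A minimal sketch, assuming the two XML files correspond to the sysfs addresses 0000:29:00.0 and 0000:15:00.1; the function name pci_driver and the SYSFS_ROOT override are made up for illustration and are not part of any tool:

```shell
#!/usr/bin/env bash
# Print the kernel driver currently bound to a PCI device, or "none".
# SYSFS_ROOT is a hypothetical override so the function can be tried
# against a fake directory tree; it defaults to the real /sys.
pci_driver() {
    local addr="$1"
    local link="${SYSFS_ROOT:-/sys}/bus/pci/devices/${addr}/driver"
    if [ -L "$link" ]; then
        # The "driver" entry is a symlink whose last component is the driver name.
        basename "$(readlink "$link")"
    else
        echo "none"
    fi
}

# Example usage (addresses assumed from the XML above):
#   pci_driver 0000:29:00.0
#   pci_driver 0000:15:00.1
```

If the GPU still shows vfio-pci here after the VM is destroyed, the detach did not hand it back to the host driver; if it shows the host driver and the card is still dead, the problem is more likely the card's own reset behaviour.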
I have been trying to get some support from AMD on this and so far nothing, not a peep, which is a shame because it would only benefit them. If people are willing to donate the hardware required to reproduce this problem, I would be willing to spend some time on it and try to resolve it. To be honest, I do not agree that it is a hardware bug; there is much we can do to poke at PCI devices on the software side that may yield a fix to this problem.
Also, in my case, if I hibernate the host PC and bring it back, I can use the GTS 450 for the VM again. Is there a way to power cycle PCIe from the command line, or do you think it is also related to the PC power supply (i.e. the power supply cycles the voltages when the machine is hibernated or rebooted)?
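For what it's worth, the closest thing to this from the host command line that I am sure exists is the sysfs remove/rescan pair. It is not a real power cycle (the slot stays powered; the kernel just forgets the device and enumerates it again), so it may well not clear a reset bug. A minimal sketch, assuming the GPU sits at 0000:29:00.0; the function name, SYSFS_ROOT and RESCAN_DELAY are made up for illustration:

```shell
#!/usr/bin/env bash
# Remove a PCI device from the kernel's view and then rescan the bus.
# Needs root on a real system, and the device must not be in use by a VM.
# SYSFS_ROOT and RESCAN_DELAY are hypothetical knobs for this sketch.
pci_remove_and_rescan() {
    local addr="$1"
    local root="${SYSFS_ROOT:-/sys}"
    local dev="$root/bus/pci/devices/$addr"

    if [ ! -e "$dev/remove" ]; then
        echo "no such device: $addr" >&2
        return 1
    fi

    # Ask the kernel to drop the device ...
    echo 1 > "$dev/remove"
    # ... give it a moment to settle ...
    sleep "${RESCAN_DELAY:-1}"
    # ... and ask it to re-enumerate the PCI buses.
    echo 1 > "$root/bus/pci/rescan"
}

# Example usage (as root, address assumed):
#   pci_remove_and_rescan 0000:29:00.0
```

The audio function at the other address would need the same treatment, and both should be removed before the rescan.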
That is honestly sad to hear…
If you have a funding campaign opened, I would be willing to donate some of my money toward the cause.
I think I remember hearing somewhere that ESXi does not suffer from this issue, but I am not sure that is true.
How have you been trying to contact them? Maybe we should raise hell on twitter in your name.
We've been finding that out. Lots of people are ejecting the GPU, killing the power to the PCIe slot, or flipping the BIOS selector switch on the physical card, and it's solving the problem. I think this can definitely be solved in the VFIO driver.
Directly via their support system and Reddit, and I believe Wendell has also tried to get them to come to the table.
Exactly, which is why I am fairly confident that with a bit of time and some hardware this could be resolved.
I'll try to contact them through my business partnership. We're running Intel products in our datacenters. I want to use EPYC and TR, but only if these problems on the kernel are fixed, so money may motivate them.
Thanks mate, that would be awesome!
The NPT is resolved and there are patches for qemu and kvm to pass SMT on Ryzen.
Now if only the PCIe "!!! Unknown header type 7f" error were resolved.
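For anyone who wants to check for that state without eyeballing lspci output: the header type is the byte at offset 0x0e of the device's config space, and a device that has stopped responding reads back as all ones, which lspci masks with 0x7f and prints as "Unknown header type 7f". A minimal sketch that reads that byte from sysfs; the function names and the SYSFS_ROOT override are made up for illustration:

```shell
#!/usr/bin/env bash
# Print the raw header-type byte (two hex digits) of a PCI device.
# SYSFS_ROOT is a hypothetical override for trying this on a fake tree.
pci_header_type() {
    local addr="$1"
    local cfg="${SYSFS_ROOT:-/sys}/bus/pci/devices/${addr}/config"
    # Offset 14 (0x0e) of PCI config space holds the header type.
    od -An -tx1 -j14 -N1 "$cfg" 2>/dev/null | tr -d ' \n'
}

# Succeed (exit 0) if the device looks like it has fallen off the bus.
pci_is_unresponsive() {
    local raw
    raw="$(pci_header_type "$1")"
    [ -n "$raw" ] || return 2          # config space could not be read
    # Bit 7 is the multi-function flag, so mask it off like lspci does.
    [ $(( 0x$raw & 0x7f )) -eq 127 ]
}

# Example usage (address assumed):
#   pci_is_unresponsive 0000:29:00.0 && echo "device reads back as 7f"
```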
When passing through the CPU-integrated audio I need to use the ACS patch to separate out the unused SATA controller and keep it bound to the ahci driver.
This prevents qemu from resetting that device (maybe the whole bus). If I don't do this, all the devices on that bus end up with a malformed PCI header.
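To see which devices share a group in the first place (and therefore whether something like the ACS patch is needed to split them), the IOMMU groups can be listed from sysfs. A minimal sketch; the function name and the SYSFS_ROOT override are made up for illustration:

```shell
#!/usr/bin/env bash
# List every PCI device together with the IOMMU group it belongs to.
# SYSFS_ROOT is a hypothetical override for trying this on a fake tree.
list_iommu_groups() {
    local root="${SYSFS_ROOT:-/sys}"
    local dev group
    for dev in "$root"/kernel/iommu_groups/*/devices/*; do
        # Skip the unexpanded glob when no groups exist (IOMMU disabled).
        [ -e "$dev" ] || continue
        group="${dev%/devices/*}"   # strip "/devices/<addr>"
        group="${group##*/}"        # keep only the group number
        echo "group ${group}: ${dev##*/}"
    done
}

# Example usage:
#   list_iommu_groups
```

Devices that print with the same group number have to be passed through (or left alone) together unless the grouping is overridden.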
Same for Vega. The only difference is that Vega ends up like that after the host reboots or shuts down.
Funny thing is that if I don't pass the GPU audio, I can reboot/power cycle the VM once before Vega stops booting.
And from what I have read, the reset problem seems more like a sloppy ROM than a hardware problem. At least one person reported he could reboot the VM with a Sapphire Vega 56.
Same problems with Polaris. Cards from some vendors can be reset and cards from others can't.
In the video, Wendell talks about a patch for Linux 4.15 that power cycles the PCI device. Does anyone have a link to the patch or the discussion thread (or mailing list)?
There is some work on this entering the kernel, specifically for Vega, but AFAIK it still doesn't fix the problem. I have been informed that the new AGESA may have fixed this problem.
Just so I understand, the future AGESA update for Threadripper might fix the reset bug, but the other bug affecting the Nvidia GPU and the RX 580 isn't fixed, right?
TR fix, correct. I am not sure which Nvidia GPU problem you are referring to; it works fine for me without any special fixes except for the NPT patch, but that's specific to the AMD CPU, not Nvidia.