Patch NPT on Ryzen for Better Performance | Level One Techs

gelmi · November 6, 2017, 5:01pm

I am using my old GTS 450 for passthrough to my Ubuntu VM, but this card suffers from a reset bug. I have tried detaching the card from the command line and then shutting down the VM, and I have tried dumping the ROM and feeding it in through the config; nothing helps. Every time I shut down or reboot the VM, I get a black screen and need to reboot the host PC.
Is there any way to power cycle the PCIe slot from the host command line so that the card can be initialized more than once?
BTW, I am on Ryzen 1600 and Asus X370-pro with the latest BIOS (1001).
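
For reference, the closest thing to a power cycle that the host exposes by default is the PCI sysfs interface: writing 1 to a device's remove file drops it from the kernel, and writing 1 to /sys/bus/pci/rescan rediscovers it. This is only a sketch and not a confirmed fix for the GTS 450; it does not cut power to the slot, so a card with a real reset bug may come back in the same state. The function name, SYSFS_ROOT and RESCAN_DELAY are made up for illustration, and the address in the usage comment is a placeholder.

```shell
#!/bin/sh
# Sketch: drop a PCI device from the host kernel, then rescan the bus.
# Assumptions (not from the thread): SYSFS_ROOT and RESCAN_DELAY are
# hypothetical overrides; on a real host the defaults apply.
pci_remove_and_rescan() {
    addr="$1"                                   # full address, e.g. 0000:01:00.0
    root="${SYSFS_ROOT:-/sys}"
    dev="$root/bus/pci/devices/$addr"

    if [ ! -e "$dev/remove" ]; then
        echo "no such PCI device: $addr" >&2
        return 1
    fi

    echo 1 > "$dev/remove"                      # detach the device from the kernel
    sleep "${RESCAN_DELAY:-1}"                  # give the bus a moment to settle
    echo 1 > "$root/bus/pci/rescan"             # rediscover devices on all buses
}

# Usage (as root, with the VM shut down; placeholder address):
#   pci_remove_and_rescan 0000:01:00.0
```

Both functions of the card (GPU and HDMI audio) would need the same treatment, and nothing should be using the device while it is removed.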

SgtAwesomesauce · November 6, 2017, 7:44pm

This is a hardware bug that can’t be fixed. Some GPUs that have a dual BIOS (my Fury Nitro, for example) can be reset by switching to the other BIOS.

gelmi · November 7, 2017, 5:48pm

OK, and what about the patch that Wendell talks about in the stream, starting at 11:35? The kernel 4.15 patch that power cycles the PCIe device?

mihawk90 · November 7, 2017, 6:13pm

Basically this (as I understand at least):

Linux Host, Windows Guest, GPU passthrough reinitialization fix

I noticed in Wendell’s recent GPU pass-through live stream that he mentioned that there isn’t a reliable solution for the AMD GPU reinitialization problem. He mentioned that once a Windows guest VM on Linux is shut down, the AMD GPU will refuse to reinitialize, and that this requires a reboot of the host machine to fix. In my personal experience, this actually seemed to be rather random. Sometimes the VM would be able to boot back up, sometimes not. I found that the trigger was not actually the shutdown of the VM, but the host machine entering a sleep state while the VM was shut down after having run at least once.

I suspect the root issue has less to do with how Linux is treating the GPU than with the behaviour of the Windows OS while it is a guest. Mainly, the Windows guest OS is not sending a signal to the GPU to shut down or start up while it is running in a VM. I assume this is because on bare metal, when a Windows OS shuts down, power to the GPU is actually cut by the motherboard/PSU, but in a guest VM this does not happen, leaving the GPU in an initialized state. Then, if/when the host machine enters a low-power state and power to the GPU is significantly reduced, the GPU is in a non-responsive state after the host machine wakes up.

Anyway, I found a fix/work-around. I’ve borrowed bits from around the web for this. I don’t write how-to’s often, so bear with me. (Note: all these steps take place inside the Windows guest OS.)

Step 1

We are going to need the Windows Device Console (DevCon). It is bundled with the Windows Driver Kit, which you can download and install like a chump, but you don’t require all the added bloat. So, instead, download and install Chocolatey, so you can install it from the terminal like a bad-ass. Follow the instructions on the Chocolatey website to install. Once Chocolatey is installed, open a Windows command prompt and run this command to install DevCon:

choco install devcon.portable

DevCon is rather straight fo…

SgtAwesomesauce · November 7, 2017, 6:58pm

I’m not super up to date on it, because I’m not using Vega or any other GPUs that suffer from this bug (at least, not for passthrough)

I’m going to defer to @mihawk90 and Wendell since they’re clearly following it more closely.

mihawk90 · November 7, 2017, 6:59pm

actually not really, just reading a bit

SgtAwesomesauce · November 7, 2017, 7:11pm

Still following it more closely than I am. I just stay away from GPUs with the issue.

gelmi · November 7, 2017, 10:37pm

So, for my issue, the options would be either to reboot or shut down the Ubuntu VM with some kind of script rather than sudo shutdown, or to power cycle the card from the host system.
I have tried to run this from host OS:

virsh detach-device Ubuntu /mnt/user/system/gpudev.xml
virsh detach-device Ubuntu /mnt/user/system/audiodev.xml
virsh destroy Ubuntu

where the XML files are:

GPU
gpudev.xml

  <hostdev mode='subsystem' type='pci' managed='yes'>
    <source>
      <address domain='0x000' bus='0x29' slot='0x00' function='0x0'/>
    </source>
  </hostdev>

HDMI audio
audiodev.xml

  <hostdev mode='subsystem' type='pci' managed='yes'>
    <source>
      <address domain='0x000' bus='0x15' slot='0x00' function='0x1'/>
    </source>
  </hostdev>

After that it says that the device was successfully detached and the VM was destroyed, but I still cannot use the GPU a second time without rebooting the host OS.
Any ideas?
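
One thing that can be checked from the host after the detach is which kernel driver currently owns each function. With managed='yes', libvirt is supposed to hand the device back to its host driver when it is detached, so seeing vfio-pci or no driver at all would narrow things down. A minimal sketch, assuming the addresses 0000:29:00.0 and 0000:15:00.1 as read off the XML above; pci_bound_driver and SYSFS_ROOT are hypothetical names used only here.

```shell
#!/bin/sh
# Sketch: print the kernel driver bound to a PCI function, or "none".
# SYSFS_ROOT is a hypothetical override; the default is the real /sys.
pci_bound_driver() {
    addr="$1"                                   # full address, e.g. 0000:29:00.0
    link="${SYSFS_ROOT:-/sys}/bus/pci/devices/$addr/driver"

    if [ -L "$link" ]; then
        # The driver entry is a symlink whose last component is the driver name.
        basename "$(readlink "$link")"
    else
        echo "none"
    fi
}

# Usage, after the virsh detach-device / destroy sequence:
#   pci_bound_driver 0000:29:00.0
#   pci_bound_driver 0000:15:00.1
```

This only reports ownership; it does not say whether the card itself has been reset.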

gnif · November 8, 2017, 8:32am

I have been trying to get some support from AMD on this and so far nothing, not a peep, which is a shame because it will only benefit them. If people are willing to donate the hardware required to reproduce this problem, I would be willing to spend some time on it and try to resolve it. To be honest, I do not agree that it is a hardware bug; there is much we can do to poke at PCI devices on the software side of things that may yield a fix to this problem.

gelmi · November 8, 2017, 4:30pm

Also, in my case, if I hibernate the host PC and bring it back, I can use the GTS 450 for the VM again. Is there a way to power cycle PCIe from the command line, or do you think it is also related to the PC power supply (i.e. the power supply cycles the voltages when the host is hibernated or rebooted)?
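
Since a hibernate and resume cycle brings the card back, that cycle can at least be scripted from the host with rtcwake (part of util-linux), which sets an RTC wake alarm and then puts the machine to sleep. This is a rough sketch that assumes the host suspends and resumes reliably and that the RTC alarm works on the board; suspend_cycle and DRY_RUN are made-up names, and the 15 second delay is arbitrary.

```shell
#!/bin/sh
# Sketch: sleep the host and wake it again after a few seconds.
# mode "mem" is suspend-to-RAM, "disk" is hibernate (needs working swap).
# DRY_RUN is a hypothetical switch that prints the command instead of running it.
suspend_cycle() {
    mode="${1:-mem}"
    secs="${2:-10}"

    case "$mode" in
        mem|disk) ;;
        *) echo "unsupported mode: $mode" >&2; return 1 ;;
    esac

    if [ -n "$DRY_RUN" ]; then
        echo "rtcwake -m $mode -s $secs"
    else
        rtcwake -m "$mode" -s "$secs"           # needs root
    fi
}

# Usage (as root, with the VM shut down):
#   suspend_cycle disk 15
```

Trying it with DRY_RUN=1 first shows the exact command without sleeping the machine.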

CuriousTommy · November 8, 2017, 5:46pm

That is honestly sad to hear …

If you have a funding campaign open, I would be willing to donate some of my money toward the cause.

I think I remember hearing somewhere that ESXi does not suffer from this issue, but I am not sure that is true.

SgtAwesomesauce · November 8, 2017, 6:07pm

How have you been trying to contact them? Maybe we should raise hell on twitter in your name.

We’ve been finding that out. Lots of people are ejecting the GPU, killing the power to the PCIe slot, flipping the BIOS selector switch on the physical card and it’s solving the problem. I think this can definitely be solved in the VFIO driver.

gnif · November 8, 2017, 6:31pm

Directly via their support system and Reddit, and I believe Wendell has also tried to get them to come to the table.

Exactly, which is why I am fairly confident that with a bit of time and some hardware this could be resolved.

SgtAwesomesauce · November 8, 2017, 7:30pm

I’ll try to contact them through my business partnership. We’re running Intel products in our datacenters. I want to use EPYC and TR, but only if these problems on the kernel are fixed, so money may motivate them.

gnif · November 9, 2017, 10:29am

Thanks mate, that would be awesome!

Pixo · November 11, 2017, 9:43am

The NPT issue is resolved, and there are patches for QEMU and KVM to pass SMT through on Ryzen.

Now if only the PCIe "!!! Unknown header type 7f" problem were resolved.
When passing through the CPU-integrated audio, I need to use the ACS patch to separate the unused SATA controller and keep it bound to the ahci driver.
This prevents QEMU from resetting the device (maybe the whole bus). If I don’t do this, the devices on that bus all end up with a malformed PCI header.
Same for Vega. The only difference is that Vega ends up like that after host reboots or shutdowns.
The funny thing is that if I don’t pass the GPU audio, I can reboot or power cycle the VM once before Vega stops booting.
And from what I have read, the reset problem seems more like a sloppy ROM than a hardware problem. At least one person reported that he could reboot the VM with a Sapphire Vega 56.
Same problems with Polaris. Cards from some vendors can be reset and cards from others can’t.
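
The "!!! Unknown header type 7f" line is what lspci typically prints when config space reads come back as all ones, which is how a device that has dropped off the bus looks to the host. The same check can be scripted by reading byte 0x0e (the header type) from the device's config file in sysfs. A minimal sketch; the function names and SYSFS_ROOT are hypothetical and not part of any tool.

```shell
#!/bin/sh
# Sketch: read the PCI header type of a device from sysfs config space.
# SYSFS_ROOT is a hypothetical override; the default is the real /sys.
pci_header_type() {
    addr="$1"                                   # full address, e.g. 0000:29:00.0
    cfg="${SYSFS_ROOT:-/sys}/bus/pci/devices/$addr/config"

    if [ ! -r "$cfg" ]; then
        echo "cannot read $cfg" >&2
        return 1
    fi

    # Offset 0x0e (14) holds the header type; bit 7 is the multi-function flag.
    byte=$(od -An -tx1 -j14 -N1 "$cfg" | tr -d ' \n')
    [ -n "$byte" ] || return 1

    # Mask off the multi-function bit and print the type as a decimal number.
    echo $(( 0x$byte & 0x7f ))
}

# Succeeds when the header type reads 0x7f (127), i.e. the device is not answering.
pci_is_unresponsive() {
    [ "$(pci_header_type "$1")" = "127" ]
}

# Usage:
#   pci_is_unresponsive 0000:29:00.0 && echo "device needs a host reboot"
```

A healthy endpoint reports type 0 and a bridge reports type 1, so anything else is worth a closer look with lspci -vv.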

CuriousTommy · November 12, 2017, 9:32pm

In the video, Wendell talks about a patch for Linux 4.15 that power cycles the PCI device. Does anyone have a link to the patch or the discussion thread (or mailing list)?

gnif · November 12, 2017, 10:04pm

There is some work on this entering the kernel, specifically for the Vega, but AFAIK it still doesn’t fix the problem. I have been informed that the new AGESA may have fixed this problem.

CuriousTommy · November 12, 2017, 10:35pm

Just so I understand: the future AGESA update for Threadripper might fix the reset bug, but the other bug affecting the Nvidia GPU and the RX 580 isn’t fixed, right?

gnif · November 12, 2017, 10:37pm

TR fix, correct. I am not sure which NVidia GPU problem you are referring to; it works fine for me without any special fixes except for the NPT patch, but that’s specific to the AMD CPU, not NVidia.