VM poor performance after update

I am using a Debian VM with a Navi10 GPU and PCI passthrough.
Navi10 suffers from the reset bug, so I installed vendor-reset on the host.
This required me to update the guest’s firmware-amd-graphics package.
Updating this package has ruined the guest’s performance. The host’s kernel now constantly complains about interrupts taking too long, and the guest VM runs atrociously slowly.
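
To put a number on how often the host complains, a small script can count the matching lines in the kernel log. This is only a sketch: it assumes the complaint is the kernel's perf warning containing the phrase "interrupt took too long", and the function name is made up for illustration.

```python
import re

# Assumption: the host's complaint is the kernel perf warning that contains
# the phrase "interrupt took too long". Adjust the pattern if yours differs.
PATTERN = re.compile(r"interrupt took too long", re.IGNORECASE)


def count_slow_interrupt_warnings(log_text):
    """Count kernel log lines that mention interrupts taking too long."""
    return sum(1 for line in log_text.splitlines() if PATTERN.search(line))
```

Feed it the text of the host's kernel log (for example the saved output of `dmesg` or `journalctl -k`) once with each firmware version installed in the guest, and compare the two counts.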

I can regain the guest VM’s performance by reverting the firmware package, but then the reset bug comes back.

Here is the XML of the guest VM, which has not changed at all since I updated the firmware package.

VM XML
<domain type='kvm' id='3'>
  <name>TESSA1</name>
  <uuid>50ec8055-258b-47e0-a20f-7730cf37c879</uuid>
  <memory unit='KiB'>33554432</memory>
  <currentMemory unit='KiB'>33554432</currentMemory>
  <memoryBacking>
    <nosharepages/>
    <locked/>
  </memoryBacking>
  <vcpu placement='static'>12</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='8'/>
    <vcpupin vcpu='1' cpuset='24'/>
    <vcpupin vcpu='2' cpuset='9'/>
    <vcpupin vcpu='3' cpuset='25'/>
    <vcpupin vcpu='4' cpuset='10'/>
    <vcpupin vcpu='5' cpuset='26'/>
    <vcpupin vcpu='6' cpuset='11'/>
    <vcpupin vcpu='7' cpuset='27'/>
    <vcpupin vcpu='8' cpuset='12'/>
    <vcpupin vcpu='9' cpuset='28'/>
    <vcpupin vcpu='10' cpuset='13'/>
    <vcpupin vcpu='11' cpuset='29'/>
    <emulatorsched scheduler='fifo' priority='1'/>
    <vcpusched vcpus='0' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='1' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='2' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='3' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='4' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='5' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='6' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='7' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='8' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='9' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='10' scheduler='fifo' priority='1'/>
    <vcpusched vcpus='11' scheduler='fifo' priority='1'/>
  </cputune>
  <numatune>
    <memory mode='strict' nodeset='1'/>
  </numatune>
  <resource>
    <partition>/machine</partition>
  </resource>
  <os>
    <type arch='x86_64' machine='pc-q35-7.0'>hvm</type>
    <loader readonly='no' type='rom'>/usr/share/ovmf/OVMFTESSA1.fd</loader>
  </os>
  <features>
    <acpi/>
    <apic/>
    <hyperv mode='custom'>
      <relaxed state='on'/>
      <vapic state='on'/>
      <spinlocks state='on' retries='8191'/>
      <vpindex state='on'/>
      <runtime state='on'/>
      <synic state='on'/>
      <stimer state='on'/>
      <reset state='on'/>
      <vendor_id state='on' value='randomid'/>
      <frequencies state='on'/>
      <reenlightenment state='on'/>
      <tlbflush state='on'/>
      <ipi state='on'/>
    </hyperv>
    <kvm>
      <hidden state='off'/>
      <hint-dedicated state='on'/>
      <poll-control state='on'/>
    </kvm>
    <vmport state='off'/>
    <ioapic driver='kvm'/>
  </features>
  <cpu mode='host-passthrough' check='none' migratable='off'>
    <topology sockets='1' dies='1' cores='6' threads='2'/>
    <cache mode='passthrough'/>
    <feature policy='require' name='topoext'/>
    <feature policy='require' name='svm'/>
    <feature policy='require' name='apic'/>
    <feature policy='require' name='hypervisor'/>
    <feature policy='require' name='invtsc'/>
  </cpu>
  <clock offset='localtime'>
    <timer name='rtc' tickpolicy='catchup'/>
    <timer name='pit' tickpolicy='delay'/>
    <timer name='hpet' present='yes'/>
    <timer name='hypervclock' present='yes'/>
    <timer name='tsc' present='yes' mode='native'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <pm>
    <suspend-to-mem enabled='no'/>
    <suspend-to-disk enabled='no'/>
  </pm>
  <devices>
    <emulator>/usr/bin/qemu-system-x86_64</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/media/primaryarray/tessa1mass.img' index='1'/>
      <backingStore/>
      <target dev='vda' bus='virtio'/>
      <alias name='virtio-disk0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x0b' function='0x0'/>
    </disk>
    <controller type='pci' index='0' model='pcie-root'>
      <alias name='pcie.0'/>
    </controller>
    <controller type='pci' index='1' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='1' port='0xf'/>
      <alias name='pci.1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x7'/>
    </controller>
    <controller type='pci' index='2' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='2' port='0x9'/>
      <alias name='pci.2'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
    </controller>
    <controller type='pci' index='3' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='3' port='0xa'/>
      <alias name='pci.3'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/>
    </controller>
    <controller type='pci' index='4' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='4' port='0xb'/>
      <alias name='pci.4'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x3'/>
    </controller>
    <controller type='pci' index='5' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='5' port='0xc'/>
      <alias name='pci.5'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x4'/>
    </controller>
    <controller type='pci' index='6' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='6' port='0xd'/>
      <alias name='pci.6'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x5'/>
    </controller>
    <controller type='pci' index='7' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='7' port='0xe'/>
      <alias name='pci.7'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x6'/>
    </controller>
    <controller type='pci' index='8' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='8' port='0x8'/>
      <alias name='pci.8'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0' multifunction='on'/>
    </controller>
    <controller type='pci' index='9' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='9' port='0x18'/>
      <alias name='pci.9'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </controller>
    <controller type='pci' index='10' model='pcie-to-pci-bridge'>
      <model name='pcie-pci-bridge'/>
      <alias name='pci.10'/>
      <address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
    </controller>
    <controller type='virtio-serial' index='0'>
      <alias name='virtio-serial0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
    </controller>
    <controller type='usb' index='0' model='qemu-xhci' ports='15'>
      <alias name='usb'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </controller>
    <controller type='sata' index='0'>
      <alias name='ide'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x1f' function='0x2'/>
    </controller>
    <serial type='pty'>
      <source path='/dev/pts/0'/>
      <target type='isa-serial' port='0'>
        <model name='isa-serial'/>
      </target>
      <alias name='serial0'/>
    </serial>
    <console type='pty' tty='/dev/pts/0'>
      <source path='/dev/pts/0'/>
      <target type='serial' port='0'/>
      <alias name='serial0'/>
    </console>
    <input type='mouse' bus='ps2'>
      <alias name='input0'/>
    </input>
    <input type='keyboard' bus='ps2'>
      <alias name='input1'/>
    </input>
    <audio id='1' type='none'/>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x0a' slot='0x10' function='0x2'/>
      </source>
      <alias name='hostdev0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x09' slot='0x00' function='0x0'/>
      </source>
      <boot order='1'/>
      <alias name='hostdev1'/>
      <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x48' slot='0x00' function='0x0'/>
      </source>
      <alias name='hostdev2'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0' multifunction='on'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x48' slot='0x00' function='0x1'/>
      </source>
      <alias name='hostdev3'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x1'/>
    </hostdev>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio'/>
      <source>
        <address domain='0x0000' bus='0x49' slot='0x00' function='0x3'/>
      </source>
      <alias name='hostdev4'/>
      <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
    </hostdev>
    <memballoon model='none'/>
    <shmem name='looking-glass'>
      <model type='ivshmem-plain'/>
      <size unit='M'>32</size>
      <alias name='shmem0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </shmem>
  </devices>
  <seclabel type='dynamic' model='dac' relabel='yes'>
    <label>+0:+64055</label>
    <imagelabel>+0:+64055</imagelabel>
  </seclabel>
</domain>

What have I done wrong here?

Set your CPU type to ‘host’, or pass through all of the host CPU’s instruction sets into the VM. It looks like you just have the generic KVM CPU in use in the VM, unless I missed it.

I’m using host-passthrough.
In my VM XML:
<cpu mode='host-passthrough' check='none' migratable='off'>

Have you checked inside the VM with lscpu or something similar to verify it is working?

Also, is this an X58 or 55xx Intel host by chance?
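
The lscpu check can also be scripted. Here is a minimal sketch, assuming a Linux guest and host where /proc/cpuinfo is readable; `parse_cpuinfo` is a made-up helper name, not part of any tool.

```python
from pathlib import Path


def parse_cpuinfo(text):
    """Return (model_name, flags) for the first CPU listed in /proc/cpuinfo text."""
    model_name = None
    flags = set()
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator line between processor entries
        key = key.strip()
        if key == "model name" and model_name is None:
            model_name = value.strip()
        elif key == "flags" and not flags:
            flags = set(value.split())
    return model_name, flags


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        name, flags = parse_cpuinfo(cpuinfo.read_text())
        print("model name:", name)
        # The 'hypervisor' flag is normally set inside a VM and absent on bare metal.
        print("hypervisor flag present:", "hypervisor" in flags)
        print("svm flag present:", "svm" in flags)
```

Run it on the host and in the guest. With host-passthrough the model name should match on both, and the guest should additionally report the hypervisor flag.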

Yes. The guest VM believes it’s using the same model CPU as the host. All of this worked just fine (except for the reset bug) before I had to upgrade my firmware package.

This host is an AMD Threadripper

Yeah, these GPU passthrough configs are finicky. They work great when they work, then I started using my mouse in my left hand and GPU passthrough failed.

Can you try a newer kernel with the newer firmware package?

Both host and guest are running Linux 6.0.
The host will not boot on any kernel below 5.16.
I have this same behavior on every major kernel in between. I keep backups of all my old kernels.

If only @gnif could make vendor-reset compatible with the older firmware packages. I’m pretty sure that’s impossible though.

Yeah, the only other thing I was going to ask is whether you have tried running a kernel patch instead of vendor-reset. It might be something to try if you can find a patch file for a recent kernel.

Is there any patch that solves the AMD reset bug?
I’m not aware of such a thing existing. I would imagine it would get mainlined very quickly if it existed.

As far as I know, vendor-reset is the best shot anybody has to solve the reset bug.

There was a navireset.patch at one time. I’m not sure what happened to it, whether it was mainlined or not.

What you describe sounds like a VFIO config issue, but if it only happens with a certain firmware package then maybe it is a driver version mismatch. I am at a loss here.

vendor-reset consists of the reset sequences lifted from the amdgpu driver itself. There is no “compatibility” with the firmware, nor does it depend on the firmware in any way.

The navireset patch was also mine, and it was deprecated in favor of vendor-reset. It should not be used: it’s incomplete and buggy, which is why it was never upstreamed.

Then why does vendor-reset simply not function when the guest is running the older firmware package?

Makes no sense to me either. I was of the belief that the ability to reset the GPU was entirely on the host’s configuration, and independent of the guest.

I stumbled upon this and it sounds similar. Does it look related?

Doesn’t seem to be related.

vendor-reset functions fine; the GPU is clearly behaving differently due to the different firmware. Again, vendor-reset does not depend on the firmware and should always work, but who knows what bugs or issues are in the closed-source binary blobs AMD pushes out.

Completely unrelated, sorry mate.

So while vendor-reset does not rely on the firmware package, the firmware package could still cause vendor-reset to not function, depending on the unknowns inside said firmware package. This is plausible and explains the randomness of getting a system to work, and then I sneeze 3 times on a Tuesday after eating a carrot, and it stops working.

Unfortunately yes, we are not privy to the internals of the binary firmware. While we are doing exactly what we should to issue the reset commands, there could be any one of a million reasons why it fails, and without the source to the firmware and the debugging hardware needed to find the cause of the GPU hang or crash, we are literally stuck.

All vendor-reset does is command the GPU to perform a BACO (Bus Active, Chip Off) reset, which is AMD-specific. It’s not even a true PCI reset (which is the main issue with these GPUs: they don’t support a standard reset), which is why it’s flaky.
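
To see which reset mechanisms the host kernel will actually try for the card, you can read the device's reset_method attribute in sysfs. This is a rough sketch with some assumptions: the attribute is only there on newer kernels (5.15 and later, as far as I know), the address 0000:48:00.0 is taken from the hostdev entries in the XML above and is assumed to be the GPU, and `reset_methods` is a made-up helper name.

```python
from pathlib import Path

# Assumption: 0000:48:00.0 is the passed-through GPU (from the hostdev XML).
GPU_ADDRESS = "0000:48:00.0"


def reset_methods(pci_address, sysfs_root="/sys/bus/pci/devices"):
    """Return the reset methods the kernel lists for a PCI device, in order."""
    path = Path(sysfs_root) / pci_address / "reset_method"
    try:
        return path.read_text().split()
    except OSError:
        # Attribute missing (older kernel, wrong address) or not readable.
        return []


if __name__ == "__main__":
    methods = reset_methods(GPU_ADDRESS)
    print("reset methods:", methods or "none reported")
    # A device-specific quirk such as vendor-reset should appear as "device_specific".
    print("device_specific available:", "device_specific" in methods)
```

If "device_specific" is missing from the list, the kernel is not using a device-specific reset for that card, whatever firmware the guest has loaded.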

Crossing fingers that the next gen GPUs finally have this fixed… but not holding my breath for it.

The unfortunate part is that it is so hard, nigh on impossible, to mix versions of things. I can fight for a week to find the perfect combo to get OpenCL working but not have vendor-reset, or maybe get OpenCL and vendor-reset but not actually have 3D acceleration, or have a fully working system and then a seemingly unrelated update borks the entire thing.

Honestly, I recently gave up and just use separate machines for each part, and I am not even doing GPU passthrough now. It has saved me a lot of frustration.

I am sorry, I don’t have a solution for you. As much as I wish AMD to beat the snot out of NVIDIA, until they fix this issue, I simply have to recommend people do not use AMD GPUs for VFIO pass-through and instead use NVIDIA.