Linux VM Performance Issues under TrueNAS SCALE

I’m running TrueNAS SCALE as a file server (mostly media), and I’m also running a Linux (Pop!_OS 22.04) VM as a companion “media PC.”

I recently upgraded the hardware from an E5-2667 v2 to an E5-2620 v4. I recreated the VM (didn’t migrate it), but I installed the same OS version, software, drivers, and updates. The VM has 4 cores, 8GB RAM, a 60GB SSD, and a GT 1030 (same specs as before; I’ve tried more cores/RAM, but it doesn’t change anything).

I’m seeing performance issues after 15 minutes to an hour of video playback:

  • TrueNAS shows >100% vCPU usage (usually climbing to 150-180%)
  • the guest shows Chrome using ~80-100% CPU and PipeWire using about 50%
  • guest video playback slows and becomes choppy, and it gets worse the longer you leave it (though so far nothing has actually crashed)
  • the guest is otherwise responsive
  • pausing playback and waiting 30-60 seconds lets CPU usage drop, after which everything is completely normal again

I haven’t been able to find much to help with troubleshooting. I’ve enabled MSI on the guest, but this doesn’t change anything.
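(For anyone following along: MSI status can be verified from inside the guest with lspci. The `01:00.0` address is just an example; find the real one with `lspci | grep -i nvidia`.)

```shell
# Check whether MSI is enabled on the passed-through GPU (run in the guest).
# "Enable+" means MSI is active, "Enable-" means it isn't. Run as root for
# full capability details. Guarded so it degrades quietly if lspci is absent.
lspci -vv -s 01:00.0 2>/dev/null | grep 'MSI:' || true
```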

Anyone have advice on what I should be looking into next?
Thanks!

Are you by any chance using USB passthrough with that guest VM? The fact that PipeWire is using lots of CPU suggests that audio may be a cause or a symptom here. I’ve heard of USB passthrough causing issues with low-latency, high-poll-rate devices like audio interfaces and mice.

So I’d suggest disabling any USB passthrough that you’re using and trying again. If you need USB, I’d suggest doing PCIe passthrough with a USB card (or seeing if the system USB hubs are in their own IOMMU group that can be passed through entirely). Otherwise you may be able to use HDMI audio from the GPU.
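For reference, the IOMMU grouping can be listed on the host with a standard sysfs walk (nothing TrueNAS-specific; `lspci` comes from pciutils):

```shell
#!/bin/sh
# List every IOMMU group and the devices in it (run on the host).
# If the USB controller is alone in its group, it can be passed through whole.
for group in /sys/kernel/iommu_groups/*; do
    [ -e "$group" ] || continue   # no IOMMU groups exposed on this system
    echo "IOMMU group ${group##*/}:"
    for dev in "$group"/devices/*; do
        echo "    $(lspci -nns "${dev##*/}")"
    done
done
```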

If you aren’t using USB passthrough could you have a look at dmesg on the guest and host?

I would start by diffing the original VM’s libvirt domain XML against the new one to see what has changed. You might have added some tuning options a few years ago and forgotten about them. What did the guest CPU utilization look like before the upgrades?
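Assuming SCALE exposes the domains through libvirt (the domain names below are placeholders; see `virsh list --all` for yours), the comparison is a one-liner per VM:

```shell
# Dump each domain definition and diff them; substitute your real domain
# names. Differences in tuning, topology, or device config will stand out.
virsh dumpxml old-media-vm > /tmp/old.xml
virsh dumpxml new-media-vm > /tmp/new.xml
diff -u /tmp/old.xml /tmp/new.xml
```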

I don’t know how much control TrueNAS SCALE gives you, but I think it uses libvirt/QEMU on the backend, which is what I use in plain ol’ Linux. For the best guest performance you should pin the cores and prevent the host from using them. There are several ways to accomplish the latter; the VFIO pages on the Arch wiki are helpful. You also want to make sure the topology lines up between the host and the guest. For example, if you decide to pin two cores with two threads each, make sure vcpu 0 and vcpu 1 line up with the first physical core and its logical processor, and vcpu 2 and vcpu 3 line up with the second physical core and its logical processor. And set the vcpu topology to sockets="1" cores="2" threads="2". The next step would be to consider pinning the emulator thread, and—depending on the guest’s storage configuration—possibly one or more I/O threads as well.
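As a sketch (not SCALE’s actual generated config), the relevant libvirt XML for pinning two cores plus their HT siblings might look like this; the host CPU numbers are examples, so check `lscpu -e` for your own core/sibling layout:

```xml
<!-- Hypothetical pinning sketch; host CPU numbers depend on your topology.
     On an 8-core E5-2620 v4, core N's HT sibling is typically CPU N+8. -->
<vcpu placement="static">4</vcpu>
<cputune>
  <vcpupin vcpu="0" cpuset="2"/>
  <vcpupin vcpu="1" cpuset="10"/>  <!-- HT sibling of CPU 2 (example) -->
  <vcpupin vcpu="2" cpuset="3"/>
  <vcpupin vcpu="3" cpuset="11"/>  <!-- HT sibling of CPU 3 (example) -->
  <emulatorpin cpuset="0-1"/>
</cputune>
<cpu mode="host-passthrough">
  <topology sockets="1" cores="2" threads="2"/>
</cpu>
```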

CPU usage is high in Chrome, which makes me wonder whether GPU rendering is working correctly. Since you’re passing through a GPU, make sure it’s properly excluded from the host OS. This usually means adding it to the vfio_pci.ids list, possibly blacklisting the nouveau module, and possibly adding vfio_pci.disable_vga=1 and/or video=efifb:off to the kernel command line. If the system is trying to initialize the GT 1030 as the primary graphics on boot, going into the BIOS and forcing the primary graphics to the onboard graphics (assuming you have that) can sometimes fix this.
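The general shape of that host-side setup looks like this (the PCI IDs below are placeholders; a GT 1030 exposes a video function and an HDMI-audio function, and both need binding):

```shell
# 1. Find the GPU's vendor:device IDs on the host. The card shows up as two
#    functions (VGA + its HDMI audio), e.g. [10de:xxxx] and [10de:yyyy]:
lspci -nn 2>/dev/null | grep -i nvidia || true

# 2. Bind both functions to vfio-pci at boot, either via modprobe config
#    (/etc/modprobe.d/vfio.conf):
#        options vfio-pci ids=10de:xxxx,10de:yyyy
#        softdep nouveau pre: vfio-pci
#    or on the kernel command line:
#        vfio_pci.ids=10de:xxxx,10de:yyyy vfio_pci.disable_vga=1 video=efifb:off
# 3. Optionally blacklist nouveau outright (/etc/modprobe.d/blacklist.conf):
#        blacklist nouveau
```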

Finally, I want to mention that while the IPC, efficiency, and features of the 2620v4 are all improved over the 2667v2, it might actually be a bit of a downgrade for you. The 2620v4 is a lower-TDP part and it doesn’t boost nearly as high as the 2667v2. I don’t think this is the cause of your problems, but it’s worth keeping in mind. Your performance expectation should be “the same or slightly less” rather than “greatly improved.”

There are lots of other things that could be contributing to the issue but this is where I’d start. If I have a VM that I want to be really fast and responsive I make sure that it has its own PCIe-passed-through GPU, NVMe drive, and NIC, and that its cores are pinned and isolcpu’d. Ideally the VM would have its own NUMA node or at least L3 cache region as well, but that’s not possible on your architecture (unless you have a dual-socket system and failed to mention it). Good luck!

ETA: Oh, there are more Meltdown-and-friends mitigation routines in play on Broadwell than on Ivy Bridge, and you might be running into slowdowns from those as well. If you’re in a secure enough environment that it’s safe to do so, you might consider adding mitigations=off to the kernel command line of the host and/or the guest. Please be aware that this carries a security risk.
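You can see what’s currently active before deciding (standard sysfs interface on any reasonably modern kernel):

```shell
# Show the kernel's current mitigation status (run on host and/or guest).
grep -r . /sys/devices/system/cpu/vulnerabilities/ 2>/dev/null \
    || echo "not exposed by this kernel"

# To turn them off, append to the kernel command line, e.g. via
# GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, then run update-grub:
#     mitigations=off
# Only do this on trusted, LAN-only systems.
```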

Are you by any chance using USB passthrough with that guest VM?

No, I’m passing through an entire PCIe USB card.

dmesg

On the host (note, however, that these messages are all from startup, not from when the slowdown happens):

[543893.177117] audit: type=1400 audit(1663443549.184:36): apparmor="STATUS" operation="profile_load" profile="unconfined" name="libvirt-29cf73ad-d820-4fdc-b7c0-6f7a5ec2048b" pid=1026176 comm="apparmor_parser"
[543893.379400] audit: type=1400 audit(1663443549.388:37): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="libvirt-29cf73ad-d820-4fdc-b7c0-6f7a5ec2048b" pid=1026179 comm="apparmor_parser"
[543893.578718] audit: type=1400 audit(1663443549.588:38): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="libvirt-29cf73ad-d820-4fdc-b7c0-6f7a5ec2048b" pid=1026184 comm="apparmor_parser"
[543893.776973] audit: type=1400 audit(1663443549.784:39): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="libvirt-29cf73ad-d820-4fdc-b7c0-6f7a5ec2048b" pid=1026188 comm="apparmor_parser"
[543893.989723] audit: type=1400 audit(1663443550.000:40): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="libvirt-29cf73ad-d820-4fdc-b7c0-6f7a5ec2048b" pid=1026200 comm="apparmor_parser"
[543905.676187] kvm_msr_ignored_check: 23 callbacks suppressed
[543905.676189] kvm [1026202]: ignored rdmsr: 0x4e data 0x0
[543905.688684] kvm [1026202]: ignored wrmsr: 0x4e data 0x2
[543905.694801] kvm [1026202]: ignored rdmsr: 0x4e data 0x0
[543905.705664] kvm [1026202]: ignored rdmsr: 0x1c9 data 0x0
[543905.711767] kvm [1026202]: ignored wrmsr: 0x1c9 data 0x3
[543905.717902] kvm [1026202]: ignored rdmsr: 0x1c9 data 0x0
[543905.724117] kvm [1026202]: ignored rdmsr: 0x1a6 data 0x0
[543905.730250] kvm [1026202]: ignored wrmsr: 0x1a6 data 0x11
[543905.736470] kvm [1026202]: ignored rdmsr: 0x1a6 data 0x0
[543905.742649] kvm [1026202]: ignored rdmsr: 0x1a7 data 0x0
[543911.622944] kvm_msr_ignored_check: 25 callbacks suppressed
[543911.622946] kvm [1026202]: ignored rdmsr: 0x8c data 0x0
[543911.635282] kvm [1026202]: ignored rdmsr: 0x8d data 0x0
[543911.646617] kvm [1026202]: ignored rdmsr: 0x8e data 0x0
[543911.657967] kvm [1026202]: ignored rdmsr: 0x8f data 0x0
[543911.687682] kvm [1026202]: ignored rdmsr: 0xc5 data 0x0
[543911.693736] kvm [1026202]: ignored rdmsr: 0xc6 data 0x0
[543911.699746] kvm [1026202]: ignored rdmsr: 0xc7 data 0x0
[543911.705744] kvm [1026202]: ignored rdmsr: 0xc8 data 0x0
[543911.711803] kvm [1026202]: ignored rdmsr: 0xc9 data 0x0
[543911.717864] kvm [1026202]: ignored rdmsr: 0xca data 0x0

I’ll have to reproduce it on the guest, but I don’t remember seeing anything. Will update.

I would start by diffing the original VM’s libvirt domain XML against the new one to see what has changed.

Scale maintains the XML (I can’t maintain it manually). It’s the exact same release version, so I’d expect the XML to be identical save for PCIe IDs, etc.

Edit: looks like it is exposed (just not via the GUI). Will dig into this!

What did the guest CPU utilization look like before the upgrades?

Performance seemed similar except for the slowdown. I never had reason to check top (specifically, re: PipeWire) on the old guest, so I can’t answer that.

you should pin the cores and prevent the host from using them.

I don’t believe Scale offers any way to do this, but I’ll look into it.

I do have it configured as sockets="1" cores="4" threads="1", which doesn’t match the actual CPU’s topology… but this is how the old guest was defined, too. Will try changing it.

GPU, make sure it’s properly excluded from the host OS

Scale does this automatically.
The host system has integrated graphics via its remote management and doesn’t use the GT 1030. Will double-check UEFI settings.

performance expectation should be “the same or slightly less” rather than “greatly improved.”

Yeah, I did expect it to be a horizontal upgrade.
Video playback is the most I’m asking of it.

you might consider adding mitigations=off

I hadn’t thought of that. Good idea. Yes, it’s a LAN-only server.