High kernel process usage on CPU 0 with isolcpus

So I've got my KVM setup and everything seems to be working fine, and my latency is in a good range for the standard LatencyMon test.

But when I do the in-depth test I get a few latency spikes on specific cores after running the program for a while.

I was reading that if you use isolcpus, QEMU is not able to balance its system processes across the threads, which is why all the usage goes to core 0?

Am I understanding this correctly? I was also reading that you have to use taskset to spread the QEMU processes across the isolated cores.

From my research, this seems to be something most people miss…

I did a test and tried the same setup but without isolcpus, and the kernel usage is spread across the CPU cores (the red bars in htop):

(screenshot: cpu1_kernel_processes)

My Windows 10 KVM is running on CPU0. I have done everything possible to make sure the host and guest share no hardware but the USB controller, to reduce latency as much as possible. Passing through an NVMe drive helps latency a lot from what I have seen, since you are not using a virtual hard drive file shared with the host. This also stopped my audio crackling issues.

From what I have seen, my latency is better than most on the standard latency test, and I have no issues playing games at bare-metal performance.

Here is my system setup.

Precision T7610
TSC Clocksource - Stable
HT Off
Kernel Compiled With Performance Governor
Spectre/Meltdown Disabled

Hardware on CPU1 (Gentoo Linux - Realtime kernel)
E5-2687W v2 Xeon
64GB RAM
4x 1.6TB HGST SAS SSDs, RAID 10, btrfs
Quadro M4400
10Gb Broadcom SFP fiber card

Hardware on CPU0 (Windows 10 KVM)
E5-2687W v2 Xeon
64GB RAM (hugepages)
2TB NVMe
GTX 1080
10Gb Broadcom SFP fiber card

I have spent a month benchmarking libvirt XML configs and reading Red Hat and libvirt documentation to create what is probably one of the best virtualized setups possible. I just seem to be stuck on the isolcpus issue of the QEMU host processes not being properly balanced…
My libvirt XML:

<domain type="kvm">
  <name>win10</name>
  <uuid>185f91df-a679-4672-9c11-27c3334762e0</uuid>
  <metadata>
    <libosinfo:libosinfo xmlns:libosinfo="http://libosinfo.org/xmlns/libvirt/domain/1.0">
      <libosinfo:os id="http://microsoft.com/win/10"/>
    </libosinfo:libosinfo>
  </metadata>
  <memory unit="KiB">50331648</memory>
  <currentMemory unit="KiB">50331648</currentMemory>
  <memoryBacking>
    <hugepages>
      <page size="2048" unit="KiB" nodeset="0"/>
    </hugepages>
    <nosharepages/>
    <locked/>
    <access mode="private"/>
    <allocation mode="immediate"/>
  </memoryBacking>
  <vcpu placement="static" cpuset="0-7">8</vcpu>
  <iothreads>1</iothreads>
  <cputune>
    <vcpupin vcpu="0" cpuset="0"/>
    <vcpupin vcpu="1" cpuset="1"/>
    <vcpupin vcpu="2" cpuset="2"/>
    <vcpupin vcpu="3" cpuset="3"/>
    <vcpupin vcpu="4" cpuset="4"/>
    <vcpupin vcpu="5" cpuset="5"/>
    <vcpupin vcpu="6" cpuset="6"/>
    <vcpupin vcpu="7" cpuset="7"/>
    <emulatorpin cpuset="4-5"/>
    <iothreadpin iothread="1" cpuset="6-7"/>
  </cputune>
  <numatune>
    <memory mode="strict" nodeset="0"/>
    <memnode cellid="0" mode="strict" nodeset="0"/>
  </numatune>
  <os>
    <type arch="x86_64" machine="pc-q35-5.1">hvm</type>
    <loader readonly="yes" type="pflash">/usr/share/qemu/edk2-x86_64-code.fd</loader>
    <nvram>/var/lib/libvirt/qemu/nvram/win10_VARS.fd</nvram>
  </os>
  <features>
    <acpi/>
    <apic/>
    <hyperv>
      <relaxed state="on"/>
      <vapic state="on"/>
      <spinlocks state="on" retries="16384"/>
      <vpindex state="on"/>
      <synic state="on"/>
      <stimer state="on"/>
      <reset state="on"/>
      <vendor_id state="on" value="elitekvm"/>
      <frequencies state="on"/>
      <reenlightenment state="on"/>
      <tlbflush state="on"/>
      <ipi state="off"/>
      <evmcs state="off"/>
    </hyperv>
    <kvm>
      <hidden state="on"/>
    </kvm>
    <pmu state="off"/>
    <vmport state="off"/>
    <ioapic driver="kvm"/>
  </features>
  <cpu mode="host-passthrough" check="full" migratable="on">
    <topology sockets="1" dies="1" cores="8" threads="1"/>
    <cache mode="passthrough"/>
    <feature policy="require" name="rdtscp"/>
    <feature policy="require" name="x2apic"/>
    <numa>
      <cell id="0" cpus="0-7" memory="50331648" unit="KiB"/>
    </numa>
  </cpu>
  <clock offset="localtime">
    <timer name="hypervclock" present="yes"/>
    <timer name="rtc" tickpolicy="catchup"/>
    <timer name="pit" tickpolicy="delay"/>
    <timer name="hpet" present="no"/>
    <timer name="tsc" present="yes"/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <pm>
    <suspend-to-mem enabled="no"/>
    <suspend-to-disk enabled="no"/>
  </pm>
  <devices>
    <emulator>/usr/bin/qemu-system-x86_64</emulator>
    <controller type="usb" index="0" model="qemu-xhci" ports="15">
      <address type="pci" domain="0x0000" bus="0x01" slot="0x00" function="0x0"/>
    </controller>
    <controller type="sata" index="0">
      <address type="pci" domain="0x0000" bus="0x00" slot="0x1f" function="0x2"/>
    </controller>
    <controller type="pci" index="0" model="pcie-root"/>
    <controller type="pci" index="1" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="1" port="0x8"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x01" function="0x0" multifunction="on"/>
    </controller>
    <controller type="pci" index="2" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="2" port="0x9"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x01" function="0x1"/>
    </controller>
    <controller type="pci" index="3" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="3" port="0xa"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x01" function="0x2"/>
    </controller>
    <controller type="pci" index="4" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="4" port="0xb"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x01" function="0x3"/>
    </controller>
    <controller type="pci" index="5" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="5" port="0xc"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x01" function="0x4"/>
    </controller>
    <controller type="pci" index="6" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="6" port="0xd"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x01" function="0x5"/>
    </controller>
    <controller type="pci" index="7" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="7" port="0xe"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x01" function="0x6"/>
    </controller>
    <controller type="pci" index="8" model="pcie-root-port">
      <model name="pcie-root-port"/>
      <target chassis="8" port="0xf"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x01" function="0x7"/>
    </controller>
    <input type="mouse" bus="ps2"/>
    <input type="keyboard" bus="ps2"/>
    <hostdev mode="subsystem" type="usb" managed="yes">
      <source>
        <vendor id="0x046d"/>
        <product id="0xc077"/>
      </source>
      <address type="usb" bus="0" port="1"/>
    </hostdev>
    <hostdev mode="subsystem" type="usb" managed="yes">
      <source>
        <vendor id="0x0c45"/>
        <product id="0x5004"/>
      </source>
      <address type="usb" bus="0" port="2"/>
    </hostdev>
    <hostdev mode="subsystem" type="pci" managed="yes">
      <source>
        <address domain="0x0000" bus="0x03" slot="0x00" function="0x0"/>
      </source>
      <address type="pci" domain="0x0000" bus="0x02" slot="0x00" function="0x0"/>
    </hostdev>
    <hostdev mode="subsystem" type="pci" managed="yes">
      <source>
        <address domain="0x0000" bus="0x03" slot="0x00" function="0x1"/>
      </source>
      <address type="pci" domain="0x0000" bus="0x03" slot="0x00" function="0x0"/>
    </hostdev>
    <hostdev mode="subsystem" type="pci" managed="yes">
      <source>
        <address domain="0x0000" bus="0x04" slot="0x00" function="0x0"/>
      </source>
      <address type="pci" domain="0x0000" bus="0x04" slot="0x00" function="0x0"/>
    </hostdev>
    <hostdev mode="subsystem" type="pci" managed="yes">
      <source>
        <address domain="0x0000" bus="0x04" slot="0x00" function="0x1"/>
      </source>
      <address type="pci" domain="0x0000" bus="0x05" slot="0x00" function="0x0"/>
    </hostdev>
    <hostdev mode="subsystem" type="pci" managed="yes">
      <source>
        <address domain="0x0000" bus="0x02" slot="0x00" function="0x0"/>
      </source>
      <boot order="1"/>
      <address type="pci" domain="0x0000" bus="0x06" slot="0x00" function="0x0"/>
    </hostdev>
    <memballoon model="none"/>
  </devices>
</domain>

I've encountered problems like this with IRQ balancing before, mostly networking-related in my case, though it could still be relevant to you. Drivers sometimes prefer the first CPU for IRQs for some reason.

There's a daemon called irqbalance that's available on basically all distros; try that, it might work for you. If it doesn't, you might need to dive into some configuration yourself.
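If you do end up tweaking it by hand, the per-IRQ affinity files are the usual knob. A rough sketch (the IRQ number and CPU list below are placeholders, and writing the files needs root):

# find the IRQs your passthrough devices / NICs are using
grep -E 'vfio|enp|eth' /proc/interrupts

# allow IRQ 29 (example number) to fire only on host CPUs 2-3
echo 2-3 > /proc/irq/29/smp_affinity_list

# check where it is currently allowed to run
cat /proc/irq/29/smp_affinity_list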

If you've spent a month on it, I doubt you will hear anything new.
However, I came across the cpu-pm flag for QEMU; not sure if you've tried that. It goes in a <qemu:commandline> block, which also needs the qemu XML namespace (xmlns:qemu="http://libvirt.org/schemas/domain/qemu/1.0") declared on the <domain> element:

   <qemu:commandline>
     <qemu:arg value='-overcommit'/>
     <qemu:arg value='cpu-pm=on'/>
   </qemu:commandline>

It's supposed to decrease latency for the VM at the cost of increased latency for the host.
Here's info on the commit: https://www.reddit.com/r/VFIO/comments/cmgmt0/a_new_option_for_decreasing_guest_latency_cpupmon/

Also, you described your latency test as "standard" twice.

I would very much like to hear what you regard as "standard" in this case. Is there an RFC for that or something?
Or did you just build your own toolstack for testing? If so, please share.

overcommit, I believe, helps when you are running more than one VM, and it ensures resources are properly pooled to the "primary" VM. At least that is my understanding. Thanks for the info though.

The issue I am talking about is isolcpus-related. If you're pinning your CPUs with isolcpus, you most likely aren't doing it correctly.

The issue I am talking about is mostly overlooked when using the isolcpus kernel option. Based on my research, it requires a small script using taskset and/or chrt to properly pin the QEMU processes across the guest CPUs and host CPUs. The problem arises from the Linux task scheduler being disabled on those cores when you use the kernel option. Most people don't notice because of the performance improvement they are still getting from isolcpus, but there is still one more step to take. If you look at my htop picture, the yellow bars mean virtualization process usage, the blue is low-priority usage, and the red is kernel usage. If you look at CPU 0, the kernel work of QEMU is not being properly spread out among the other cores. I can almost guarantee that nearly everyone on this forum is running with the same issue. It is commonly overlooked and ignored…
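A quick way to see it for yourself (a minimal sketch, assuming a single qemu-system-x86_64 process is running):

# list every QEMU thread with the CPU it last ran on (PSR column) and its name
QEMU_PID=$(pidof qemu-system-x86_64)
ps -T -o tid,psr,comm -p "$QEMU_PID"

# print the current affinity list of each QEMU thread
for TID in /proc/"$QEMU_PID"/task/*; do taskset -pc "$(basename "$TID")"; done

The PSR column shows where each thread last ran, which makes it easy to spot when they are all stacked on one core.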

Here is a small SUSE comment that mentions the issue. (Can't post links, not sure why, considering this is a tech-based website that relies on information sharing.)

As far as latency testing goes, I use LatencyMon. I am trying to create a guide for everyone on how to properly set up a real-time-kernel-based KVM using the correct FIFO scheduler options, and how to properly maintain the TSC clocksource for the lowest possible latency.

Almost every KVM guide I have gone over is missing vital information.

I have two questions I need resolved before I start working on the guide.

One is the small script for taskset and chrt. The other is properly configuring the real-time kernel scheduler to maintain the lowest possible latency.

The script needs to find all the QEMU processes and then properly assign them across the guest cores; since the scheduler is disabled on those cores when you use isolcpus, this is not done automatically. If you don't have a script properly assigning the QEMU processes across the guest cores, you aren't doing it right and should use cset shield instead of isolcpus.
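For reference, a minimal cset sketch of that alternative (the CPU list is only an example, and it needs the cpuset/cset tool plus root):

# shield CPUs 0-7: user tasks and movable kernel threads are pushed off them
cset shield --cpu 0-7 --kthread on

# show what is currently running inside and outside the shield
cset shield

# tear the shield down when you are done
cset shield --reset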

This requires some fairly advanced bash scripting skills, which I do not have. I was able to locate a libvirt hooks script for this, but it doesn't seem to be executing correctly.

I am currently running a real-time kernel and was able to get another latency drop. But the configuration is still not complete; I need help with the real-time scheduler configuration.

I did a test and configured it for Google Chrome, and let me tell you, a properly configured real-time kernel applied to a specific process is a sight to see.
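What I mean is just something like this (a sketch only; the priority value and the process are examples, and it needs root):

# give the first chrome PID the SCHED_FIFO policy with priority 10
chrt -f -p 10 "$(pidof chrome | awk '{print $1}')"

# verify the scheduling policy and priority afterwards
chrt -p "$(pidof chrome | awk '{print $1}')"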

I have been looking over some Red Hat documentation for real-time virtual machine setups, and their latency is multiple times lower than what we consider "good".

This is because they are using the real-time kernel's scheduler correctly and spreading the virtual machine's processes correctly across the guest cores when using isolcpus.

Then again, these are paid professionals who deploy virtual machines in the datacenter industry on a daily basis.

My proofreading skills suck XD sorry.

Usually, maybe, but in this case the documentation specifically says it helps in the single-VM case and works best when the CPU is NOT overcommitted. It doesn't say anything about the impact on other VMs.

I feel your pain. In my case it's usually Xen when I try to find documentation on some obscure feature, and docs for Xen are even more scarce than for KVM.

If the script does what you need with core assignment (using taskset and chrt?) but just doesn't work from hooks, then maybe post it here. That's half the job done, and plenty of people can probably help with bash but don't really want to figure out from scratch what the script needs to do.

Thanks for the active replies brother.

The main problem with using isolcpus:

Tasks are not load-balanced on isolated CPUs, including qemu-system-x86.

Quoted from another website

The thing that always catches people out is that it's easy to end up with all of your vCPU tasks running on the same CPU!

I need a script that can be used as a libvirt hook, which searches for all the QEMU processes and uses taskset to spread them across the isolated CPU cores.

Like I said, most people who have isolated their CPUs for their KVM have failed to do this part.

The Windows virtual machine might be running across your isolated CPUs, BUT QEMU IS NOT.

Most of the QEMU processes run on one core by default when using the isolcpus kernel option.

The Linux scheduler usually does this automatically.

This also causes issues with interrupt handling.

Most people are happy with the isolated CPUs because of the lowered latency, but they aren't finished yet.

Like I said, this step is usually overlooked, even by the pros.

What the script needs to do:
Start as a libvirt hook (in the libvirt hooks folder).

Search for the QEMU processes.
Evenly pin the QEMU processes with taskset across the isolated CPUs.

Example (rough sketch below):
nano /etc/libvirt/hooks/qemu
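Something along these lines is what I have in mind (a rough, untested sketch; the domain name, the CPU list, and the single-VM assumption are all placeholders, not a finished solution):

#!/bin/bash
# /etc/libvirt/hooks/qemu -- libvirt calls this with: $1=domain name, $2=operation, $3=sub-operation
DOMAIN="$1"
OPERATION="$2"

# isolated host CPUs to spread the QEMU threads over (example list)
CPUS=(0 1 2 3 4 5 6 7)

if [ "$DOMAIN" = "win10" ] && [ "$OPERATION" = "started" ]; then
    QEMU_PID=$(pidof qemu-system-x86_64)   # assumes only one VM is running
    i=0
    for TID in /proc/"$QEMU_PID"/task/*; do
        # round-robin each QEMU thread onto the next isolated CPU
        taskset -pc "${CPUS[i % ${#CPUS[@]}]}" "$(basename "$TID")"
        # optionally also bump it to SCHED_FIFO, e.g.: chrt -f -p 10 "$(basename "$TID")"
        i=$((i + 1))
    done
fi

Note that this would also re-pin the vCPU threads that libvirt already pinned via cputune, so you would either want to skip those or make the CPU list match your vcpupin entries.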

If I can get help with this part, I promise you guys a KVM guide you will love me for.


Oh, you meant that you found the location where to put the script; I thought you'd found actual script code that does something and just doesn't work as a QEMU hook.

What I meant was: if there's no script, show how you manually pin those processes "properly". Then it will be pretty easy to translate that into a script.
Can you do it manually? If yes, then do that and post the commands you're using.

The problem with this is that I can't post a link to the original site. Also, because most of us are using virt-manager, our script needs to be a bit different. We can put the script in the hooks folder under /etc/libvirt/hooks.

This way it is automatically run when you launch the VM.

This script resolves the issue of the missing task scheduler on the isolated CPU cores and properly pins the QEMU threads.

This is a simple shell script which uses the debug-threads QEMU argument and taskset to find the vCPU threads and pin them to the CPUs in an affinity list set elsewhere in the script.

There is a chance I can find a script that does what we need, but it's going to take a lot of Googling. The issue with isolcpus is already pretty deep to begin with.

Below is code taken from null-src. Credit to them

#!/bin/bash

# clear options
OPTS=""

# set vm name
NAME="PARASITE"

# host affinity list
THREAD_LIST="8,9,10,11,12,13,14,15,24,25,26,27,28,29,30,31"

# qemu options
OPTS="$OPTS -name $NAME,debug-threads=on"
OPTS="$OPTS -enable-kvm"
OPTS="$OPTS -cpu host"
OPTS="$OPTS -smp 16,cores=8,sockets=1,threads=2"
OPTS="$OPTS -m 32G"
OPTS="$OPTS -drive if=virtio,format=raw,aio=threads,file=/vms/disk-images/windows-10.img"

function run-vm {
# specify which host threads to run QEMU parent and worker processes on
taskset -c 0-7,16-32 qemu-system-x86_64 $OPTS
}

function set-affinity {
# sleep for 20 seconds while QEMU VM boots and vCPU threads are created
sleep 20 &&
HOST_THREAD=0
# for each vCPU thread PID
for PID in $(pstree -pa $(pstree -pa $(pidof qemu-system-x86_64) | grep $NAME | awk -F',' '{print $2}' | awk '{print $1}') | grep CPU | sort | awk -F',' '{print $2}')
do
    let HOST_THREAD+=1
    # set each vCPU thread PID to next host CPU thread in THREAD_LIST
    echo "taskset -pc $(echo $THREAD_LIST | cut -d',' -f$HOST_THREAD) $PID" | bash
done
}

set-affinity &
run-vm

Yeah, that's exactly what I meant earlier. Also, I can see why this wouldn't work as a libvirt hook as-is.

The first problem to solve is finding out whether you can pin CPUs like taskset does here:

taskset -c 0-7,16-32 qemu-system-x86_64 

but do it AFTER the VM is already running. It might be possible to set that up in libvirt's XML; I can't remember now.
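If I remember right, libvirt can also repin a running domain from the host side, something like this (domain name taken from your XML, CPU numbers purely as an example):

# pin vCPU 0 of the running win10 domain to host CPU 0
virsh vcpupin win10 0 0 --live

# move the emulator threads of the running domain to host CPUs 4-5
virsh emulatorpin win10 4-5 --live

# show the current vCPU pinning of the running domain
virsh vcpupin win10 --live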

The rest seems pretty straightforward. I will have a look at this, unless someone beats me to it :slight_smile: But don't get your hopes up, because it seems way too easy, so there may be some quirk that prevents it from working.

NVM, just remembered that you posted your XML. So I'm gonna just use that.

The taskset needs to be used outside of libvirt; the vCPUs are not able to properly pin themselves across the guest CPUs along with the QEMU system process. The command you posted is what we need, but I believe it needs all the PIDs of the process to work. I haven't tried this as a libvirt hook though… so I think I am going to give it a try. But you see where I am stuck now! We need more fellow forum members on this issue, considering most people aren't doing it properly with isolcpus.

Yeah, I know; that's the first thing I said when I saw your script.
But you posted pinning in your XML:

<cputune>
    <vcpupin vcpu="0" cpuset="0"/>
    <vcpupin vcpu="1" cpuset="1"/>
    <vcpupin vcpu="2" cpuset="2"/>
    <vcpupin vcpu="3" cpuset="3"/>
    <vcpupin vcpu="4" cpuset="4"/>
    <vcpupin vcpu="5" cpuset="5"/>
    <vcpupin vcpu="6" cpuset="6"/>
    <vcpupin vcpu="7" cpuset="7"/>

So you mean this doesn’t work?

It works halfway, but when using isolcpus to pin the CPU threads, you still need taskset to correctly pin the kernel threads. The Windows threads get pinned properly, but not the kernel threads. This is my understanding from the Red Hat documentation and Arch Wiki guides I have read.
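A quick way to check what actually got pinned where on a running domain (using my domain name just as an example):

# which host CPU each vCPU is on right now, plus its affinity
virsh vcpuinfo win10

# affinity of the emulator threads
virsh emulatorpin win10

# affinity of the iothreads
virsh iothreadinfo win10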

Most people aren't doing it correctly.

Let me see if I can give you a real-world example, one sec.

The emulator thread also needs the Linux scheduler, which is disabled on the isolated cores.

So yes, most people you see here are not doing it correctly.

OK, any example will be helpful, because now I have doubts:

Yet the CPU seems to be configured as smp=16, cores=8, and later taskset goes through that THREAD_LIST assigning a thread number to each PID. So I have a hard time reconciling why you would try to pin them to cores 8-15 and 24-31 when you only have 16 threads on the CPU…

If THREAD_LIST were something like
"4,5,6,7,12,13,14,15"
then it would all be clear.

This was an example I posted from another website to show you what I meant by using taskset to properly pin your CPUs. It's not specific to my CPUs, and the script can be written in many ways. I am just not a pro at bash, but I may give it a try here in a sec and see what I can come up with for us. Are you using isolcpus too? Have you finished getting the most performance out of your KVM that you possibly can? I could help you get more performance if need be, if you can help me find a solution to this :slight_smile:

Let's focus on one thing :slight_smile:

I almost have a working hook, but only for the second part, the second taskset I just posted:

I have the list of PIDs too and I can assign them. But the thread list in the example is confusing.

Without bash for now:
I made a test VM with 4 CPUs; the threads show up as:

qemu-system-x86,450625 -name guest=usbtest,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-11-usbtest/master-key.aes -machine pc-i440fx-5.1,accel=kvm,usb=off,vmport=off
  ├─{qemu-system-x86},450639
  ├─{qemu-system-x86},450640
  ├─{qemu-system-x86},450641
  ├─{qemu-system-x86},450643
  ├─{qemu-system-x86},450644
  ├─{qemu-system-x86},450645
  ├─{qemu-system-x86},450646
  ├─{qemu-system-x86},450647
  └─{qemu-system-x86},450649

So should I assign those PIDs to cores 4-7 and 12-15? Is that correct?

Let's keep focused on one thing at a time :slight_smile:

I deleted my questions because I found your source:

I'm gonna read it and get back to you later.

Nice job, dude. Yeah, you need to pin the tasks evenly across the cores.

Please run some benchmarks and give LatencyMon a go too.

The script needs to search for those PIDs and evenly pin them.

It can be confusing because the PIDs change. Do you have a Discord I can contact you on? It looks like we could work together on a nice guide for people.

No need to delete your thread, that was some good information. People could use that.

Yeah I undeleted that, since you already replied with quotes :slight_smile: