
Memory and Core configuration for Passthrough on Threadripper

I thought I’d document this real quick, since I’ve spent a bit of time researching due to a lack of readily available info.

Preface

First things first: this assumes you already have a passing knowledge of passthrough concepts. (If not, you can read my (slightly outdated) guide, found here)

For this guide, I’ll be assuming you’re using a 1950x or 2950x, with 2 numa nodes. I’m sure this guide can be applied to the 2990wx or other 4 node systems, but I don’t have access to that hardware to test. If someone wants to touch on how to apply that, I’d be happy to chat about that.

Goals

In order to get the most performance out of the system, it’s helpful to reduce the amount of Infinity Fabric communication, which comes at a latency cost. To do this, we want to tell the VM to use cores, memory and PCI devices that are all attached to the same numa node.

Understanding your hardware topology

The first step when setting up a passthrough VM on a multi-node system is to understand where all your hardware shows up. While lspci is helpful for identifying what’s what, it doesn’t give you a clear picture of which node your devices and memory are attached to. For that, you need lstopo. In the RHEL world, that’s provided by the hwloc-gui package.

Running lstopo on my system gives me the following:

So let’s talk a bit about understanding the topology diagram we’re looking at.

First things first, we have the NUMANode. Think of this as the die itself. If you were to delid your Threadripper, you’d see 4 of them. Only two of them (on the non-WX SKUs) are active.

From there, you can see the L3 cache. Each CCX (Compute Complex, or set of 4 cores on the Zeppelin die) has 8MB of L3 cache, that’s how you can identify your CCX. Just below your L3, you have L2 and L1 cache.

Below all the cache are your cores. Each core has two Processing Units, or PUs. Those are your threads. If you want to pass through both threads of a core, you need to use both of those PU numbers.
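If you'd rather confirm the PU numbering from a terminal, the kernel exposes the same information under sysfs. Here's a minimal sketch, assuming the standard Linux sysfs layout; the function names are my own, and the optional second argument only exists so the functions can be pointed at a different directory.

```shell
#!/bin/sh
# thread_siblings CPU [CPU_ROOT]
# Print the hardware threads (PUs) that share a physical core with CPU.
thread_siblings() {
    cpu=$1
    root=${2:-/sys/devices/system/cpu}
    cat "$root/cpu$cpu/topology/thread_siblings_list"
}

# node_cpulist NODE [NODE_ROOT]
# Print every PU that belongs to the given NUMA node.
node_cpulist() {
    node=$1
    root=${2:-/sys/devices/system/node}
    cat "$root/node$node/cpulist"
}

# Example usage (uncomment to run against the live system):
# thread_siblings 0
# node_cpulist 0
```

Whatever pair of numbers `thread_siblings` prints for a core is the pair you'd pin together later.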

Next is memory. Memory is a bit complex because the OS doesn’t show you the individual DIMMs or DRAM modules; it just shows how much memory is attached to each node. In the pink NUMANode object, the amount of memory attached to that node is shown in parentheses. We’ll talk about how to specifically target this memory later.

On to IO devices, PCI Express. To the right side of the Cache and Cores, you can see a tree diagram. The nodes on the tree have PCI IDs on them and if relevant, a device identifier as well. Network devices have an en or wl identifier, storage has a sd or nvme identifier and video cards that are attached to valid drivers have a renderD identifier. At least that’s how it works on my system.

As you can see by the 10de PCI devices, I have two Nvidia GPUs in my system. The 10de:1c02 GPU is a 1060, and the 10de:1b06 GPU is a 1080 Ti. Obviously, we’ll be attaching the 1080 Ti to the VM. That means we want to use cores and memory from node 0.
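To double-check which node a card hangs off without reading the diagram, sysfs has a numa_node file for each PCI device. This is a rough sketch, assuming the usual /sys/bus/pci/devices layout; `pci_numa_nodes` is a name I made up. Note that sysfs stores the vendor ID with a 0x prefix, and that numa_node can read -1 if the platform doesn't report a node.

```shell
#!/bin/sh
# pci_numa_nodes VENDOR [PCI_ROOT]
# Print "<pci address> <numa node>" for every device from VENDOR
# (e.g. 0x10de for Nvidia).
pci_numa_nodes() {
    vendor=$1
    root=${2:-/sys/bus/pci/devices}
    for dev in "$root"/*; do
        # Skip anything that doesn't look like a PCI device entry.
        [ -r "$dev/vendor" ] || continue
        if [ "$(cat "$dev/vendor")" = "$vendor" ]; then
            echo "$(basename "$dev") $(cat "$dev/numa_node")"
        fi
    done
}

# Example usage (uncomment to run against the live system):
# pci_numa_nodes 0x10de
```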

At first, I thought the numbers next to the lines indicated PCIe lanes, but after doing some investigating, that can’t be right. I’ll do more research and figure this one out. (or if someone knows, holla at a brotha!)

Core configuration

CPU core configuration is pretty simple. All you need to do is configure pinning. I use the following configuration. You may change this configuration to fit your needs:

  <vcpu placement='static'>16</vcpu>
  <iothreads>4</iothreads>
  <cputune>
    <vcpupin vcpu='0' cpuset='0'/>
    <vcpupin vcpu='1' cpuset='16'/>
    <vcpupin vcpu='2' cpuset='1'/>
    <vcpupin vcpu='3' cpuset='17'/>
    <vcpupin vcpu='4' cpuset='2'/>
    <vcpupin vcpu='5' cpuset='18'/>
    <vcpupin vcpu='6' cpuset='3'/>
    <vcpupin vcpu='7' cpuset='19'/>
    <vcpupin vcpu='8' cpuset='4'/>
    <vcpupin vcpu='9' cpuset='20'/>
    <vcpupin vcpu='10' cpuset='5'/>
    <vcpupin vcpu='11' cpuset='21'/>
    <vcpupin vcpu='12' cpuset='6'/>
    <vcpupin vcpu='13' cpuset='22'/>
    <vcpupin vcpu='14' cpuset='7'/>
    <vcpupin vcpu='15' cpuset='23'/>
    <emulatorpin cpuset='0-1,16-17'/>
    <iothreadpin iothread='1' cpuset='2,18'/>
    <iothreadpin iothread='2' cpuset='3,19'/>
    <iothreadpin iothread='3' cpuset='4,20'/>
    <iothreadpin iothread='4' cpuset='5,21'/>
  </cputune>

All pretty simple. We pin the vCPUs to physical cores, set up iothreads that share those physical cpus and threads, and tell the emulator to run on specific cores as well. Notice that I’ve set everything to run on threads 0-7 and 16-23? That’s because of the topology above, where it shows threads 0-7 and 16-23 on node 0, which is where my GPU is attached.
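Typing out sixteen vcpupin lines by hand is error-prone, so here's a small sketch that generates them from two CPU lists. It assumes the Nth entry of the first list and the Nth entry of the second list are the two threads of the same core, which matches my topology (0 with 16, 1 with 17, and so on) but may not match yours. Both function names are hypothetical helpers, not part of libvirt.

```shell
#!/bin/sh
# expand_cpulist LIST
# Expand a kernel-style CPU list such as "0-7,16-23" into one id per line.
expand_cpulist() {
    echo "$1" | tr ',' '\n' | while IFS=- read -r start end; do
        # A single id like "4" has no range end; treat it as "4-4".
        [ -n "$end" ] || end=$start
        seq "$start" "$end"
    done
}

# gen_vcpupin CORES SIBLINGS
# Emit <vcpupin> lines that interleave each core with its SMT sibling,
# so vCPUs 0 and 1 land on the two threads of the same physical core.
gen_vcpupin() {
    siblings=$(expand_cpulist "$2")
    vcpu=0
    i=1
    for core in $(expand_cpulist "$1"); do
        # Pick the i-th sibling to pair with the i-th core.
        sibling=$(echo "$siblings" | sed -n "${i}p")
        printf "<vcpupin vcpu='%d' cpuset='%d'/>\n" "$vcpu" "$core"
        printf "<vcpupin vcpu='%d' cpuset='%d'/>\n" "$((vcpu + 1))" "$sibling"
        vcpu=$((vcpu + 2))
        i=$((i + 1))
    done
}

# Example usage:
# gen_vcpupin 0-7 16-23
```

Running it with 0-7 and 16-23 should reproduce the vcpupin block above, which you can then paste into the cputune section.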

Memory configuration

Memory configuration was a beast to figure out. If you want to use hugepages, you need to somehow tell the system to assign those hugepages to the specific numa node you want to use. You can’t just sysctl -w vm.nr_hugepages=8192 to a specific numa node, but there’s another way.

/sys/devices/system/node is a runtime configuration directory (part of sysfs) that holds your numa nodes. Within each node, there’s a hugepages/hugepages-2048kB/nr_hugepages parameter file which you can write a number to.

In my case, I do the following. Replace node0 with whatever node you’re using.

echo 8400 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages

Side note: I have docker containers running on this system that like to claim approx 150 hugepages for themselves. If you’re running into issues with your VM not starting due to hugepage availability, you might need to add a few more. That’s why I allocate 8400.

I have this command in a /usr/local/bin file called by a systemd unit file on boot.
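For anyone who wants a starting point for that script, here's a minimal sketch of what it could look like. It is not my exact file; `set_node_hugepages` is a hypothetical name. It reads the value back after writing because the kernel can hand out fewer pages than you asked for if memory is already fragmented. It needs to run as root on a real system.

```shell
#!/bin/sh
# set_node_hugepages NODE COUNT [NODE_ROOT]
# Request COUNT 2MB hugepages on one NUMA node and confirm the result.
set_node_hugepages() {
    node=$1
    count=$2
    root=${3:-/sys/devices/system/node}
    file="$root/node$node/hugepages/hugepages-2048kB/nr_hugepages"
    if [ ! -w "$file" ]; then
        echo "cannot write $file (wrong node, or not root?)" >&2
        return 1
    fi
    echo "$count" > "$file"
    # Read back what the kernel actually allocated.
    actual=$(cat "$file")
    if [ "$actual" -lt "$count" ]; then
        echo "requested $count pages on node$node, got $actual" >&2
        return 1
    fi
    echo "$actual"
}

# Example usage (uncomment to run as root):
# set_node_hugepages 0 8400
```

Running this early in boot gives the kernel the best chance of finding enough contiguous memory on the node.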

Once this command is done, you can cat /proc/meminfo | grep Huge to verify that the system has allocated your pages. Should look something like this:

$ cat /proc/meminfo | grep Huge
AnonHugePages:      2048 kB
ShmemHugePages:        0 kB
HugePages_Total:    8400
HugePages_Free:      177
HugePages_Rsvd:      175
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:        17203200 kB
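One caveat: /proc/meminfo shows system-wide totals, so it won't tell you which node the pages ended up on. The per-node counters live next to the nr_hugepages file. A small sketch, assuming the standard sysfs layout (`node_hugepages_report` is a name I made up):

```shell
#!/bin/sh
# node_hugepages_report [NODE_ROOT]
# Print total and free 2MB hugepages for each NUMA node.
node_hugepages_report() {
    root=${1:-/sys/devices/system/node}
    for dir in "$root"/node*; do
        pages="$dir/hugepages/hugepages-2048kB"
        # Skip nodes that have no 2MB hugepage counters.
        [ -r "$pages/nr_hugepages" ] || continue
        echo "$(basename "$dir") total=$(cat "$pages/nr_hugepages") free=$(cat "$pages/free_hugepages")"
    done
}

# Example usage (uncomment to run against the live system):
# node_hugepages_report
```

If everything went to plan, the node you targeted shows the full count and the other node shows zero.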

Now that you’ve got your memory configuration set up, configure your VM to use hugepages:

  <memory unit='KiB'>16777216</memory>
  <currentMemory unit='KiB'>16777216</currentMemory>
  <memoryBacking>
    <hugepages/>
  </memoryBacking>
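As a sanity check on the numbers, here's the arithmetic as a tiny helper (the function name is hypothetical). A 16777216 KiB guest needs 16777216 / 2048 = 8192 pages of 2MB each, so allocating 8400 leaves 208 spare for things like the docker containers mentioned earlier.

```shell
#!/bin/sh
# hugepages_needed MEM_KIB [PAGE_KIB]
# Number of hugepages needed to back MEM_KIB of guest memory,
# rounded up. PAGE_KIB defaults to 2048 (2MB pages).
hugepages_needed() {
    mem_kib=$1
    page_kib=${2:-2048}
    echo $(( (mem_kib + page_kib - 1) / page_kib ))
}

# Example usage:
# hugepages_needed 16777216
```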

Once that’s done, you should have a system that runs everything on the same numa node and you shouldn’t suffer from the latency issues of cross-node communication.

There are many other optimizations that can be done as well. If you’re interested in that, let me know and I’d be happy to write a guide on that as well.


I am interested in the optimizations. Done a lot of it for my 1920x and would like to increase my collection.

Few questions:
Is there no need to mount the hugepages?
Why not use 1G hugepages? In your case it would also remove the overlap with docker.
How important is it to provide correct cache values & topology to the guest? I had to play a lot with it, and even wrote qemu code for topoext before AMD took over. They still left out the code for cache passthrough. It’s back in qemu 4.1. Someone mentioned that Far Cry 5 runs better with EPYC, which has realistic cache, instead of host-passthrough. The 1920x needed to be split into 2 sockets for L3 to align.

Okay, wow, I completely missed this post. I think I took a vacation after posting this. In the interest of completeness, I’m going to break the thread rules and answer these questions.

2MB hugepages don’t need to be mounted.

I chose 2MB pages to avoid mounting. I don’t like mounting them, it seems hacky. :confused:

Yes, it would avoid the issues with docker overlap, but I’m actually okay with docker using them as it needs.

I honestly didn’t really notice any difference, but I wanted to be sure that I was not doing unnecessary Infinity Fabric transfers. More an exercise in over-optimization than anything else. Some people might notice something, but I’m not particularly sensitive to latency, so I don’t really notice a few ms here and there. (this is also what makes me terrible at multiplayer shooter games lol)