7950X3D + Proxmox CPU Core Pinning = Gaming VM w/ V-Cache?

Yes, I just realized that this is your post.

Yes, with kernels 6.2 and 6.3. I’m also waiting for the fix.

It works for me with the kernel parameter “amdgpu.sg_display=0”.

Thanks for that, will try it today!

Do you use 16 cores, or 8 cores with 16 threads? I get 97% single-thread and exactly 50% multi-thread performance.

I use 4 cores per CCD. I also tried using only one CCD, but the latency doesn’t get any better. Currently I’m at about 60 ns memory latency in AIDA64.

  <vcpu placement='static' current='16'>32</vcpu>
  <iothreads>1</iothreads>
  <cputune>
    <vcpupin vcpu='0' cpuset='1'/>
    <vcpupin vcpu='1' cpuset='17'/>
    <vcpupin vcpu='2' cpuset='9'/>
    <vcpupin vcpu='3' cpuset='25'/>
    <vcpupin vcpu='4' cpuset='2'/>
    <vcpupin vcpu='5' cpuset='18'/>
    <vcpupin vcpu='6' cpuset='10'/>
    <vcpupin vcpu='7' cpuset='26'/>
    <vcpupin vcpu='8' cpuset='3'/>
    <vcpupin vcpu='9' cpuset='19'/>
    <vcpupin vcpu='10' cpuset='11'/>
    <vcpupin vcpu='11' cpuset='27'/>
    <vcpupin vcpu='12' cpuset='4'/>
    <vcpupin vcpu='13' cpuset='20'/>
    <vcpupin vcpu='14' cpuset='12'/>
    <vcpupin vcpu='15' cpuset='28'/>
    <emulatorpin cpuset='7,23'/>
    <iothreadpin iothread='1' cpuset='9,25'/>
  </cputune>
  <os firmware='efi'>
    <type arch='x86_64' machine='pc-q35-8.0'>hvm</type>
    <firmware>
      <feature enabled='no' name='enrolled-keys'/>
      <feature enabled='yes' name='secure-boot'/>
    </firmware>
    <loader readonly='yes' secure='yes' type='pflash'>/usr/share/edk2/x64/OVMF_CODE.secboot.4m.fd</loader>
    <nvram template='/usr/share/edk2/x64/OVMF_VARS.4m.fd'>/var/lib/libvirt/qemu/nvram/win11-offg_VARS.fd</nvram>
    <boot dev='hd'/>
  </os>
  <features>
    <acpi/>
    <apic/>
    <hyperv mode='custom'>
      <relaxed state='on'/>
      <vapic state='on'/>
      <spinlocks state='on' retries='8191'/>
      <vpindex state='on'/>
      <synic state='on'/>
      <stimer state='on'/>
      <reset state='on'/>
      <vendor_id state='on' value='1234567890ab'/>
      <frequencies state='on'/>
    </hyperv>
    <pmu state='off'/>
    <vmport state='off'/>
    <smm state='on'/>
  </features>
  <cpu mode='host-passthrough' check='none' migratable='off'>
    <topology sockets='1' dies='2' cores='8' threads='2'/>
    <cache mode='passthrough'/>
    <feature policy='require' name='topoext'/>
    <feature policy='require' name='invtsc'/>
  </cpu>
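
For reference, the pin list in this config follows a regular pattern: each host core is paired with its SMT sibling, and the cores alternate between the two CCDs. Below is a minimal sketch of how such a list could be generated. The helper name `vcpupin_lines` and the `sibling_offset=16` default are assumptions for illustration; the offset matches the sibling numbering reported later in this thread (16-31 are siblings of 0-15) and should be checked on your own host before use.

```python
def vcpupin_lines(host_cores, sibling_offset=16):
    """Build <vcpupin> elements that pin two consecutive vCPUs to each
    host core and its SMT sibling.

    host_cores: host CPU numbers of the first thread of each core, in the
                order the guest vCPUs should be assigned.
    sibling_offset: assumed distance between a core's first thread and its
                SMT sibling (16 on the hosts discussed in this thread).
    """
    lines = []
    vcpu = 0
    for core in host_cores:
        for cpu in (core, core + sibling_offset):
            lines.append(f"<vcpupin vcpu='{vcpu}' cpuset='{cpu}'/>")
            vcpu += 1
    return lines


# Four cores from each CCD, interleaved the same way as in the config above.
for line in vcpupin_lines([1, 9, 2, 10, 3, 11, 4, 12]):
    print("    " + line)
```

The output is only the `<vcpupin>` lines; `emulatorpin` and `iothreadpin` still have to be chosen by hand.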

Works! Finally running on Kernel 6.3 which supposedly has optimizations for X3D CPUs

Cinebench single thread always runs way slower in my VM. Same with Geekbench 6. 85% is optimistic in my case.
You get 60 ns latency inside the VM? That is amazing. I am at about 60 ns on bare metal.

Yes, inside the VM.
You’re right, I hadn’t tested R23 single thread yet. Single-thread R23 is not quite there yet.
I am using a 40-buck air cooler right now. It’s a good one, but when I get my AIO next week, I might get to 2000 points in single thread.

This is yet another configuration. It has the same 1% lows (Metro Exodus) as a single-CCD configuration but better multi-thread performance.

  <vcpu placement='static' current='16'>32</vcpu>
  <vcpus>
    <vcpu id='0' enabled='yes' hotpluggable='no'/>
    <vcpu id='1' enabled='yes' hotpluggable='yes'/>
    <vcpu id='2' enabled='yes' hotpluggable='yes'/>
    <vcpu id='3' enabled='yes' hotpluggable='yes'/>
    <vcpu id='4' enabled='yes' hotpluggable='yes'/>
    <vcpu id='5' enabled='yes' hotpluggable='yes'/>
    <vcpu id='6' enabled='yes' hotpluggable='yes'/>
    <vcpu id='7' enabled='yes' hotpluggable='yes'/>
    <vcpu id='8' enabled='no' hotpluggable='yes'/>
    <vcpu id='9' enabled='no' hotpluggable='yes'/>
    <vcpu id='10' enabled='no' hotpluggable='yes'/>
    <vcpu id='11' enabled='no' hotpluggable='yes'/>
    <vcpu id='12' enabled='no' hotpluggable='yes'/>
    <vcpu id='13' enabled='no' hotpluggable='yes'/>
    <vcpu id='14' enabled='no' hotpluggable='yes'/>
    <vcpu id='15' enabled='no' hotpluggable='yes'/>
    <vcpu id='16' enabled='no' hotpluggable='yes'/>
    <vcpu id='17' enabled='no' hotpluggable='yes'/>
    <vcpu id='18' enabled='no' hotpluggable='yes'/>
    <vcpu id='19' enabled='no' hotpluggable='yes'/>
    <vcpu id='20' enabled='no' hotpluggable='yes'/>
    <vcpu id='21' enabled='no' hotpluggable='yes'/>
    <vcpu id='22' enabled='no' hotpluggable='yes'/>
    <vcpu id='23' enabled='no' hotpluggable='yes'/>
    <vcpu id='24' enabled='yes' hotpluggable='yes'/>
    <vcpu id='25' enabled='yes' hotpluggable='yes'/>
    <vcpu id='26' enabled='yes' hotpluggable='yes'/>
    <vcpu id='27' enabled='yes' hotpluggable='yes'/>
    <vcpu id='28' enabled='yes' hotpluggable='yes'/>
    <vcpu id='29' enabled='yes' hotpluggable='yes'/>
    <vcpu id='30' enabled='yes' hotpluggable='yes'/>
    <vcpu id='31' enabled='yes' hotpluggable='yes'/>
  </vcpus>
  <iothreads>1</iothreads>
  <cputune>
    <emulatorpin cpuset='8,24'/>
    <iothreadpin iothread='1' cpuset='6,22'/>
  </cputune>
  <os firmware='efi'>
    <type arch='x86_64' machine='pc-q35-8.0'>hvm</type>
    <firmware>
      <feature enabled='no' name='enrolled-keys'/>
      <feature enabled='yes' name='secure-boot'/>
    </firmware>
    <loader readonly='yes' secure='yes' type='pflash'>/usr/share/edk2/x64/OVMF_CODE.secboot.4m.fd</loader>
    <nvram template='/usr/share/edk2/x64/OVMF_VARS.4m.fd'>/var/lib/libvirt/qemu/nvram/win11-offg-clone2_VARS.fd</nvram>
  </os>
  <features>
    <acpi/>
    <apic/>
    <hyperv mode='custom'>
      <relaxed state='on'/>
      <vapic state='on'/>
      <spinlocks state='on' retries='8191'/>
      <vpindex state='on'/>
      <synic state='on'/>
      <stimer state='on'/>
      <reset state='on'/>
      <vendor_id state='on' value='1234567890ab'/>
      <frequencies state='on'/>
    </hyperv>
    <pmu state='off'/>
    <vmport state='off'/>
    <smm state='on'/>
  </features>
  <cpu mode='host-passthrough' check='none' migratable='off'>
    <topology sockets='1' dies='2' cores='8' threads='2'/>
    <cache mode='passthrough'/>
    <feature policy='require' name='topoext'/>
    <feature policy='require' name='invtsc'/>
    <feature policy='disable' name='monitor'/>
  </cpu>
  <clock offset='localtime'>
    <timer name='hypervclock' present='yes'/>
    <timer name='rtc' tickpolicy='catchup'/>
    <timer name='pit' tickpolicy='delay'/>
    <timer name='hpet' present='no'/>
  </clock>
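
To see which guest cores the `<vcpus>` list in this config enables, here is a minimal sketch that maps a vCPU id to its position in the declared topology (`dies='2' cores='8' threads='2'`). It assumes vCPU ids are enumerated with threads varying fastest, then cores, then dies; the function name `vcpu_topology` is hypothetical. It describes the guest-visible topology only. This config has no `<vcpupin>` entries, so the sketch says nothing about where the host schedules those vCPUs.

```python
def vcpu_topology(vcpu_id, dies=2, cores=8, threads=2):
    """Return (die, core, thread) for a vCPU id.

    Assumes linear enumeration: threads vary fastest, then cores, then dies.
    """
    thread = vcpu_id % threads
    core = (vcpu_id // threads) % cores
    die = (vcpu_id // (threads * cores)) % dies
    return die, core, thread


# vCPU ids marked enabled='yes' in the config above: 0-7 and 24-31.
enabled = list(range(0, 8)) + list(range(24, 32))

cores_per_die = {}
for vcpu in enabled:
    die, core, _thread = vcpu_topology(vcpu)
    cores_per_die.setdefault(die, set()).add(core)

for die in sorted(cores_per_die):
    print(f"die {die}: cores {sorted(cores_per_die[die])}")
```

Under that assumption the enabled ids cover four cores, both threads each, on each of the two guest dies, which is consistent with the "4 cores per CCD" description earlier in the thread.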

Yes, it looks like you are severely thermal throttled.
I am only passing 16 threads, one from each physical core, yet I get better multi-thread performance.
I am on a $70 air cooler.

  <cputune>
    <vcpupin vcpu="0" cpuset="0"/>
    <vcpupin vcpu="1" cpuset="1"/>
    <vcpupin vcpu="2" cpuset="2"/>
    <vcpupin vcpu="3" cpuset="3"/>
    <vcpupin vcpu="4" cpuset="4"/>
    <vcpupin vcpu="5" cpuset="5"/>
    <vcpupin vcpu="6" cpuset="6"/>
    <vcpupin vcpu="7" cpuset="7"/>
    <vcpupin vcpu="8" cpuset="8"/>
    <vcpupin vcpu="9" cpuset="9"/>
    <vcpupin vcpu="10" cpuset="10"/>
    <vcpupin vcpu="11" cpuset="11"/>
    <vcpupin vcpu="12" cpuset="12"/>
    <vcpupin vcpu="13" cpuset="13"/>
    <vcpupin vcpu="14" cpuset="14"/>
    <vcpupin vcpu="15" cpuset="15"/>
    <iothreadpin iothread="1" cpuset="24-27"/>
    <iothreadpin iothread="2" cpuset="28-31"/>
  </cputune>

I cannot figure out why my single-thread score is so low compared to yours.

Someone mentioned that this high score is due to skew of the emulated clock. I’m posting it here anyway for your reference. GPU: RTX 4070 Ti

I tried to recreate your results. I only get slightly worse memory latency; everything else is about the same.
Have you disabled SMT in the BIOS, and which power management settings are you using in Windows?

test config

  <memoryBacking>
    <hugepages/>
  </memoryBacking>
  <vcpu placement='static'>16</vcpu>
  <iothreads>1</iothreads>
  <cputune>
    <vcpupin vcpu='0' cpuset='0'/>
    <vcpupin vcpu='1' cpuset='1'/>
    <vcpupin vcpu='2' cpuset='2'/>
    <vcpupin vcpu='3' cpuset='3'/>
    <vcpupin vcpu='4' cpuset='4'/>
    <vcpupin vcpu='5' cpuset='5'/>
    <vcpupin vcpu='6' cpuset='6'/>
    <vcpupin vcpu='7' cpuset='7'/>
    <vcpupin vcpu='8' cpuset='8'/>
    <vcpupin vcpu='9' cpuset='9'/>
    <vcpupin vcpu='10' cpuset='10'/>
    <vcpupin vcpu='11' cpuset='11'/>
    <vcpupin vcpu='12' cpuset='12'/>
    <vcpupin vcpu='13' cpuset='13'/>
    <vcpupin vcpu='14' cpuset='14'/>
    <vcpupin vcpu='15' cpuset='15'/>
    <iothreadpin iothread='1' cpuset='16'/>
  </cputune>
  <os firmware='efi'>
    <type arch='x86_64' machine='pc-q35-8.0'>hvm</type>
    <firmware>
      <feature enabled='no' name='enrolled-keys'/>
      <feature enabled='yes' name='secure-boot'/>
    </firmware>
    <loader readonly='yes' secure='yes' type='pflash'>/usr/share/edk2/x64/OVMF_CODE.secboot.4m.fd</loader>
    <nvram template='/usr/share/edk2/x64/OVMF_VARS.4m.fd'>/var/lib/libvirt/qemu/nvram/win11-offg-clone4_VARS.fd</nvram>
  </os>
  <features>
    <acpi/>
    <apic/>
    <hyperv mode='custom'>
      <relaxed state='on'/>
      <vapic state='on'/>
      <spinlocks state='on' retries='8191'/>
      <vpindex state='on'/>
      <synic state='on'/>
      <stimer state='on'/>
      <reset state='on'/>
      <vendor_id state='on' value='1234567890ab'/>
      <frequencies state='on'/>
    </hyperv>
    <pmu state='off'/>
    <vmport state='off'/>
    <smm state='on'/>
  </features>
  <cpu mode='host-passthrough' check='none' migratable='on'>
    <topology sockets='1' dies='1' cores='16' threads='1'/>
    <cache mode='passthrough'/>
    <feature policy='require' name='topoext'/>
    <feature policy='require' name='invtsc'/>
    <feature policy='require' name='x2apic'/>
  </cpu>
  <clock offset='localtime'>
    <timer name='hypervclock' present='yes'/>
    <timer name='rtc' tickpolicy='catchup'/>
    <timer name='pit' tickpolicy='delay'/>
    <timer name='hpet' present='no'/>
  </clock>

Result

I would have guessed that this config uses 16 cores, as AIDA64 shows, but the host only uses 8 cores.

Edit: scratch that, these are the cores and 16-31 are the siblings. But Coreinfo shows only 32 MB of L3 cache for your config…??!

Processor signature: 00A60F12

Logical to Physical Processor Map:
*---------------  Physical Processor 0
-*--------------  Physical Processor 1
--*-------------  Physical Processor 2
---*------------  Physical Processor 3
----*-----------  Physical Processor 4
-----*----------  Physical Processor 5
------*---------  Physical Processor 6
-------*--------  Physical Processor 7
--------*-------  Physical Processor 8
---------*------  Physical Processor 9
----------*-----  Physical Processor 10
-----------*----  Physical Processor 11
------------*---  Physical Processor 12
-------------*--  Physical Processor 13
--------------*-  Physical Processor 14
---------------*  Physical Processor 15

Logical Processor to Socket Map:
****************  Socket 0

Logical Processor to NUMA Node Map:
****************  NUMA Node 0

No NUMA nodes.

Logical Processor to Cache Map:
**--------------  Data Cache          0, Level 1,   32 KB, Assoc   8, LineSize  64
**--------------  Instruction Cache   0, Level 1,   32 KB, Assoc   8, LineSize  64
**--------------  Unified Cache       0, Level 2,    1 MB, Assoc   8, LineSize  64
****************  Unified Cache       1, Level 3,   32 MB, Assoc  16, LineSize  64
--**------------  Data Cache          1, Level 1,   32 KB, Assoc   8, LineSize  64
--**------------  Instruction Cache   1, Level 1,   32 KB, Assoc   8, LineSize  64
--**------------  Unified Cache       2, Level 2,    1 MB, Assoc   8, LineSize  64
----**----------  Data Cache          2, Level 1,   32 KB, Assoc   8, LineSize  64
----**----------  Instruction Cache   2, Level 1,   32 KB, Assoc   8, LineSize  64
----**----------  Unified Cache       3, Level 2,    1 MB, Assoc   8, LineSize  64
------**--------  Data Cache          3, Level 1,   32 KB, Assoc   8, LineSize  64
------**--------  Instruction Cache   3, Level 1,   32 KB, Assoc   8, LineSize  64
------**--------  Unified Cache       4, Level 2,    1 MB, Assoc   8, LineSize  64
--------**------  Data Cache          4, Level 1,   32 KB, Assoc   8, LineSize  64
--------**------  Instruction Cache   4, Level 1,   32 KB, Assoc   8, LineSize  64
--------**------  Unified Cache       5, Level 2,    1 MB, Assoc   8, LineSize  64
----------**----  Data Cache          5, Level 1,   32 KB, Assoc   8, LineSize  64
----------**----  Instruction Cache   5, Level 1,   32 KB, Assoc   8, LineSize  64
----------**----  Unified Cache       6, Level 2,    1 MB, Assoc   8, LineSize  64
------------**--  Data Cache          6, Level 1,   32 KB, Assoc   8, LineSize  64
------------**--  Instruction Cache   6, Level 1,   32 KB, Assoc   8, LineSize  64
------------**--  Unified Cache       7, Level 2,    1 MB, Assoc   8, LineSize  64
--------------**  Data Cache          7, Level 1,   32 KB, Assoc   8, LineSize  64
--------------**  Instruction Cache   7, Level 1,   32 KB, Assoc   8, LineSize  64
--------------**  Unified Cache       8, Level 2,    1 MB, Assoc   8, LineSize  64

Logical Processor to Group Map:
****************  Group 0

And this is the output for my config:

Logical to Physical Processor Map:
**--------------  Physical Processor 0 (Hyperthreaded)
--**------------  Physical Processor 1 (Hyperthreaded)
----**----------  Physical Processor 2 (Hyperthreaded)
------**--------  Physical Processor 3 (Hyperthreaded)
--------**------  Physical Processor 4 (Hyperthreaded)
----------**----  Physical Processor 5 (Hyperthreaded)
------------**--  Physical Processor 6 (Hyperthreaded)
--------------**  Physical Processor 7 (Hyperthreaded)

Logical Processor to Socket Map:
****************  Socket 0

Logical Processor to NUMA Node Map:
****************  NUMA Node 0

No NUMA nodes.

Logical Processor to Cache Map:
**--------------  Data Cache          0, Level 1,   32 KB, Assoc   8, LineSize  64
**--------------  Instruction Cache   0, Level 1,   32 KB, Assoc   8, LineSize  64
**--------------  Unified Cache       0, Level 2,    1 MB, Assoc   8, LineSize  64
********--------  Unified Cache       1, Level 3,   32 MB, Assoc  16, LineSize  64
--**------------  Data Cache          1, Level 1,   32 KB, Assoc   8, LineSize  64
--**------------  Instruction Cache   1, Level 1,   32 KB, Assoc   8, LineSize  64
--**------------  Unified Cache       2, Level 2,    1 MB, Assoc   8, LineSize  64
----**----------  Data Cache          2, Level 1,   32 KB, Assoc   8, LineSize  64
----**----------  Instruction Cache   2, Level 1,   32 KB, Assoc   8, LineSize  64
----**----------  Unified Cache       3, Level 2,    1 MB, Assoc   8, LineSize  64
------**--------  Data Cache          3, Level 1,   32 KB, Assoc   8, LineSize  64
------**--------  Instruction Cache   3, Level 1,   32 KB, Assoc   8, LineSize  64
------**--------  Unified Cache       4, Level 2,    1 MB, Assoc   8, LineSize  64
--------**------  Data Cache          4, Level 1,   32 KB, Assoc   8, LineSize  64
--------**------  Instruction Cache   4, Level 1,   32 KB, Assoc   8, LineSize  64
--------**------  Unified Cache       5, Level 2,    1 MB, Assoc   8, LineSize  64
--------********  Unified Cache       6, Level 3,   32 MB, Assoc  16, LineSize  64
----------**----  Data Cache          5, Level 1,   32 KB, Assoc   8, LineSize  64
----------**----  Instruction Cache   5, Level 1,   32 KB, Assoc   8, LineSize  64
----------**----  Unified Cache       7, Level 2,    1 MB, Assoc   8, LineSize  64
------------**--  Data Cache          6, Level 1,   32 KB, Assoc   8, LineSize  64
------------**--  Instruction Cache   6, Level 1,   32 KB, Assoc   8, LineSize  64
------------**--  Unified Cache       8, Level 2,    1 MB, Assoc   8, LineSize  64
--------------**  Data Cache          7, Level 1,   32 KB, Assoc   8, LineSize  64
--------------**  Instruction Cache   7, Level 1,   32 KB, Assoc   8, LineSize  64
--------------**  Unified Cache       9, Level 2,    1 MB, Assoc   8, LineSize  64

Logical Processor to Group Map:
****************  Group 0

Correct.
I don’t disable SMT in the BIOS, but I intentionally pass just 1 thread per physical core to the VM. I don’t need multi-core efficiency. I just use the VM for gaming.
Here is the core activity when I run multi-thread Cinebench in the Windows guest.
Here is the output of lscpu -e, which confirms that 16-31 are the siblings and are mostly idle.

CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE    MAXMHZ   MINMHZ       MHZ
  0    0      0    0 0:0:0:0          yes 6070.0000 400.0000 5224.9160
  1    0      0    1 1:1:1:0          yes 6070.0000 400.0000 5224.9209
  2    0      0    2 2:2:2:0          yes 6070.0000 400.0000 5224.9282
  3    0      0    3 3:3:3:0          yes 6070.0000 400.0000 5220.9941
  4    0      0    4 4:4:4:0          yes 6070.0000 400.0000 5220.9800
  5    0      0    5 5:5:5:0          yes 6070.0000 400.0000 5220.9932
  6    0      0    6 6:6:6:0          yes 6070.0000 400.0000 5220.9951
  7    0      0    7 7:7:7:0          yes 6070.0000 400.0000 5220.9771
  8    0      0    8 8:8:8:1          yes 6070.0000 400.0000 5090.8501
  9    0      0    9 9:9:9:1          yes 6070.0000 400.0000 5090.8262
 10    0      0   10 10:10:10:1       yes 6070.0000 400.0000 5090.8442
 11    0      0   11 11:11:11:1       yes 6070.0000 400.0000 5090.8462
 12    0      0   12 12:12:12:1       yes 6070.0000 400.0000 5090.8530
 13    0      0   13 13:13:13:1       yes 6070.0000 400.0000 5090.8408
 14    0      0   14 14:14:14:1       yes 6070.0000 400.0000 5090.8481
 15    0      0   15 15:15:15:1       yes 6070.0000 400.0000 5090.8379
 16    0      0    0 0:0:0:0          yes 6070.0000 400.0000 5214.4019
 17    0      0    1 1:1:1:0          yes 6070.0000 400.0000  400.0000
 18    0      0    2 2:2:2:0          yes 6070.0000 400.0000  400.0000
 19    0      0    3 3:3:3:0          yes 6070.0000 400.0000  400.0000
 20    0      0    4 4:4:4:0          yes 6070.0000 400.0000  400.0000
 21    0      0    5 5:5:5:0          yes 6070.0000 400.0000  400.0000
 22    0      0    6 6:6:6:0          yes 6070.0000 400.0000  400.0000
 23    0      0    7 7:7:7:0          yes 6070.0000 400.0000  400.0000
 24    0      0    8 8:8:8:1          yes 6070.0000 400.0000  400.0000
 25    0      0    9 9:9:9:1          yes 6070.0000 400.0000 5091.0811
 26    0      0   10 10:10:10:1       yes 6070.0000 400.0000  400.0000
 27    0      0   11 11:11:11:1       yes 6070.0000 400.0000 5092.7329
 28    0      0   12 12:12:12:1       yes 6070.0000 400.0000  400.0000
 29    0      0   13 13:13:13:1       yes 6070.0000 400.0000  400.0000
 30    0      0   14 14:14:14:1       yes 6070.0000 400.0000  400.0000
 31    0      0   15 15:15:15:1       yes 6070.0000 400.0000  400.0000
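
To read sibling pairs and CCD membership out of a table like the one above without counting rows by hand, here is a minimal sketch that groups logical CPUs by L3 cache id and core id. The function name `siblings_by_l3` and the four-row `SAMPLE` excerpt are for illustration only, and the sketch assumes the column layout shown above (a `CPU` column, a `CORE` column, and a cache column whose last field is the L3 id).

```python
from collections import defaultdict

# Four rows copied from the lscpu -e output above, plus its header line.
SAMPLE = """\
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE    MAXMHZ   MINMHZ       MHZ
  0    0      0    0 0:0:0:0          yes 6070.0000 400.0000 5224.9160
  8    0      0    8 8:8:8:1          yes 6070.0000 400.0000 5090.8501
 16    0      0    0 0:0:0:0          yes 6070.0000 400.0000 5214.4019
 24    0      0    8 8:8:8:1          yes 6070.0000 400.0000  400.0000
"""


def siblings_by_l3(text):
    """Return {l3_id: {core_id: [logical CPUs]}} from `lscpu -e` style text."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    cpu_col = header.index("CPU")
    core_col = header.index("CORE")
    # The cache column is the one whose name ends in "L3", e.g. L1d:L1i:L2:L3.
    cache_col = next(i for i, name in enumerate(header) if name.endswith("L3"))

    groups = defaultdict(lambda: defaultdict(list))
    for ln in lines[1:]:
        fields = ln.split()
        l3_id = int(fields[cache_col].split(":")[-1])
        groups[l3_id][int(fields[core_col])].append(int(fields[cpu_col]))

    # Convert to plain dicts with sorted CPU lists for stable output.
    return {l3: {core: sorted(cpus) for core, cpus in cores.items()}
            for l3, cores in groups.items()}


for l3, cores in sorted(siblings_by_l3(SAMPLE).items()):
    print(f"L3 {l3}: {cores}")
```

On a real host the text would come from running `lscpu -e` instead of the embedded sample. The sketch only reports which CPUs share an L3; it does not tell you which of the two L3 groups is the V-Cache CCD.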

Additionally, I use amd_pstate=passive in my boot parameters. I guess it may affect single-thread performance. I verified that the boost of the host threads is not affected, but the VM threads don’t seem to boost as high. I chose to keep this parameter as it does lower the idle temperature.

Geekbench 6 on the host.
https://browser.geekbench.com/v6/cpu/1334085

Geekbench 6 on the guest.
https://browser.geekbench.com/v6/cpu/1390536

PS: Curve Optimizer may play a role. What CO values do you use? How much Max Boost Frequency Offset do you use?
I set -15 on the best 4 cores and -10 on the rest of them, with a +200 MHz max offset.

Just PBO enabled and CPU voltage on auto. I will deal with that as soon as I get my AIO.
The highest load I saw in Ryzen Master with my cooler was 150 W, so there are still 20 watts left.
I found that going from the Ultimate power profile to Balanced in Windows is worth 100 more points of single-core performance in R23.

OK, it works, but not fully…
I no longer have that white overlay across my whole desktop once I log in, but I cannot run my VMs…
One that was working perfectly with 6.1 now does not start, with this message:

libvirt.libvirtError: internal error: qemu unexpectedly closed the monitor: 2023-05-28T13:02:33.049123Z qemu-system-x86_64: -device {"driver":"vfio-pci","host":"0000:0e:00.0","id":"hostdev4","bus":"pci.6","addr":"0x0","rombar":0}: vfio 0000:0e:00.0: group 25 is not viable
Please ensure all devices within the iommu_group are bound to their vfio bus driver.

I am passing through a NIC that is in a group with other devices, and there is no way I can pass those through too… It works with 6.1, though.
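
For anyone hitting the same "group is not viable" message, here is a minimal sketch for listing what else sits in an IOMMU group and which driver each device is currently bound to. It relies on the standard sysfs layout under `/sys/kernel/iommu_groups/<group>/devices/`; the function names `iommu_group_devices` and `not_on_vfio` are hypothetical, and the `root` parameter exists only so the code can be pointed at a different directory.

```python
import os


def iommu_group_devices(group, root="/sys/kernel/iommu_groups"):
    """Return {pci_address: driver_name_or_None} for one IOMMU group.

    Each entry under <root>/<group>/devices is a device directory (a symlink
    in real sysfs); its 'driver' symlink, if present, points at the bound
    driver, and the last path component is the driver name.
    """
    dev_dir = os.path.join(root, str(group), "devices")
    result = {}
    for addr in sorted(os.listdir(dev_dir)):
        driver_link = os.path.join(dev_dir, addr, "driver")
        if os.path.islink(driver_link):
            result[addr] = os.path.basename(os.readlink(driver_link))
        else:
            result[addr] = None  # no driver bound
    return result


def not_on_vfio(devices):
    """List addresses in the group that are not bound to vfio-pci."""
    return [addr for addr, drv in devices.items() if drv != "vfio-pci"]
```

Calling `iommu_group_devices(25)` on the affected host would show the neighbours of 0000:0e:00.0 in the group named in the error. The sketch does not try to decide which of them the kernel actually requires to be rebound, and comparing its output between the 6.1 kernel and the newer one would show whether the grouping itself changed.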

Sorry, no experience with that. I’m guessing you’ve already used the ACS override patch?
I got lucky with my board. It’s optimal for VFIO: everything is in a separate group. Thanks to @jxdking for the recommendation; I had only looked at X670E and didn’t have B650 on my radar.
