Hello again, I am trying to plan out an X670E upgrade of my VFIO host. Single Windows VM strictly used for gaming.
I am leaning towards the 7900X because of the price but I am willing to reconsider the 7950X depending on your answers. I am skipping the X3D parts because they don’t seem to make enough of a difference for 4K gaming and I think the higher clocks will benefit me more, since I am having issues with my 1% lows with poorly optimized games (looking at you Hell Let Loose).
I want help with clearing some things up because my last and current setup is with the TR 2920x and a lot of stuff regarding CCDs and topology has changed since my last upgrade.
On the TR 2920x, for optimal performance, I am limited to using only one of the dies for the guest, since using cores from both dies means data has to travel between the NUMA nodes, increasing latency. The same is true for RAM, I can only use half of the RAM, the part that is connected to the NUMA node I am using for the guest. How does this relate to the CCD/CCX topology of the 7000 series? Do I need to use one CCD for the guest (limiting myself to 6 cores on the 7900X) for optimal performance? What about RAM?
What if I go for the 7900X and then need 8 cores in the VM? Will that present latency issues or L3 cache “poisoning” by the host OS? Are there any workarounds? I read somewhere that you can set up different CCXs as different dies in the QEMU XML, so Windows has sort of an idea about the topology but is that effective and does it work with asymmetric dies e.g. die 1 = 6 cores, die 2 = 2 cores? Is this relevant for the 7000 series or was it just for the 5000 and its multiple CCXs per CCD?
Is there any chance any current game might need more than 6c/12t at 4K or am I busting my head for no reason?
Anything else I need to look out for while considering which CPU to buy?
While I do not have the hardware to test it myself, it seems QEMU has had a “die” option since 4.1 (ChangeLog/4.1 - QEMU).
I would assume this can be used to tell the guest OS about the different dies/CCXs, thereby helping the guest OS scheduler make better decisions.
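For what it's worth, libvirt exposes the same idea through a dies attribute on the <topology> element (the XML posted later in this thread uses it with dies="1"). Below is a minimal sketch, assuming a libvirt version that accepts dies, for a hypothetical 8-core/16-thread guest split evenly across two CCDs. The vCPU count and the 4+4 split are illustrative assumptions, not something anyone in the thread tested. As far as I can tell, the cores value applies to every die, so an uneven split such as 6+2 cannot be described this way.

```xml
<!-- Fragment of a libvirt domain definition, not a complete file. -->
<!-- Hypothetical guest: 8 cores / 16 threads, split evenly across two dies. -->
<vcpu placement="static">16</vcpu>
<cpu mode="host-passthrough" check="none" migratable="off">
  <!-- cores is counted per die: 1 socket x 2 dies x 4 cores x 2 threads = 16 vCPUs -->
  <topology sockets="1" dies="2" cores="4" threads="2"/>
  <cache mode="passthrough"/>
  <feature policy="require" name="topoext"/>
</cpu>
```

For the guest's view to mean anything, the vcpupin entries would have to follow the same layout, i.e. the vCPUs of the first guest die pinned to threads of one host CCD and the vCPUs of the second die pinned to the other.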
IIRC the 7800X3D hasn’t been released yet, or at least the embargo hasn’t lifted yet, and that should get better gaming performance.
Why not just go bare metal and isolate the network through a VLAN if it’s strictly all gaming? I am currently contemplating this for my current desktop, since I already have a Linux laptop for work. IIRC you only need the higher-end ones if you do concurrent game streaming.
Considering the AM5 platform cost is still high, the 7950X is still a better buy than the 7900X, unless you can get the $599 7900X deal from Microcenter, which includes the 7900X, a motherboard and 32GB of RAM. Nothing can beat that deal right now.
Also, don’t skip the 7950X3D. You can pass the entire stacked CCD through to the Windows VM. It would work out of the box.
I just completed my build and set up a Windows passthrough VM for gaming. For gaming it works better than I thought. The CPU score is better than what I got on bare metal.
My system:
CPU: AMD Ryzen 9 7950X 16-Core Processor
RAM: Kingston FURY Beast DDR5-5600 2x32GB
GPU: ZOTAC GAMING GeForce RTX 4070 Ti Trinity (12GB)
Main Board: MSI MPG X670E CARBON WIFI (MS-7D70)
Client OS: Windows 10 Pro
Host OS: Ubuntu 22.04.2 LTS
Storage: ZFS Storage Pool
Would you mind sharing your XML? I am also running 7950X3D passthrough and I am only getting a 4000 CPU score.
Edit: Managed to beef it up a lot. At least I get my 60 FPS in Hogwarts Legacy on ultra now. Switched around a few settings. Would still be interested in seeing more AM5/7950X3D XMLs.
Besides the XML, some boot parameters are important, too. “mitigations=off” gives a bit of a performance lift.
Below are the parameters that I use.
GRUB_CMDLINE_LINUX_DEFAULT="amd_iommu=on iommu=pt kvm.ignore_msrs=1 amd_pstate=passive mitigations=off pci=assign-busses vfio-pci.ids=10de:2782,10de:22bc"
For the VM, here are the things you need to look out for:
- Enable “hugepages”; google it if you don’t know how to enable it.
- CPU configuration: use mode=“host-passthrough”.
- CPU pinning: every vCPU actually represents a thread, not a core. The configuration below is how I pass the 1st CCD through. Note that cpuset 0 and 16 are actually the 2 threads of the same (1st) core.
NOTE: I use “i440fx” because I migrated this VM from an old machine. If you create a new VM, using the “q35” machine type is highly recommended.
<domain type="kvm">
  <name>win10gpu2</name>
  <uuid>e3b05049-8201-4447-a4b3-b310220cc5bd</uuid>
  <memory unit="KiB">40960000</memory>
  <currentMemory unit="KiB">40960000</currentMemory>
  <memoryBacking>
    <hugepages/>
  </memoryBacking>
  <vcpu placement="static">16</vcpu>
  <iothreads>2</iothreads>
  <cputune>
    <vcpupin vcpu="0" cpuset="0"/>
    <vcpupin vcpu="1" cpuset="16"/>
    <vcpupin vcpu="2" cpuset="1"/>
    <vcpupin vcpu="3" cpuset="17"/>
    <vcpupin vcpu="4" cpuset="2"/>
    <vcpupin vcpu="5" cpuset="18"/>
    <vcpupin vcpu="6" cpuset="3"/>
    <vcpupin vcpu="7" cpuset="19"/>
    <vcpupin vcpu="8" cpuset="4"/>
    <vcpupin vcpu="9" cpuset="20"/>
    <vcpupin vcpu="10" cpuset="5"/>
    <vcpupin vcpu="11" cpuset="21"/>
    <vcpupin vcpu="12" cpuset="6"/>
    <vcpupin vcpu="13" cpuset="22"/>
    <vcpupin vcpu="14" cpuset="7"/>
    <vcpupin vcpu="15" cpuset="23"/>
    <iothreadpin iothread="1" cpuset="8,24"/>
    <iothreadpin iothread="2" cpuset="9,25"/>
  </cputune>
  <os>
    <type arch="x86_64" machine="pc-i440fx-jammy">hvm</type>
    <boot dev="hd"/>
  </os>
  <features>
    <acpi/>
    <apic/>
    <hyperv mode="custom">
      <relaxed state="on"/>
      <vapic state="on"/>
      <spinlocks state="on" retries="8191"/>
      <vpindex state="on"/>
      <synic state="on"/>
      <stimer state="on"/>
      <reset state="on"/>
      <frequencies state="on"/>
    </hyperv>
    <vmport state="off"/>
  </features>
  <cpu mode="host-passthrough" check="none" migratable="off">
    <topology sockets="1" dies="1" cores="8" threads="2"/>
    <cache mode="passthrough"/>
    <feature policy="require" name="topoext"/>
  </cpu>
  <clock offset="utc">
    <timer name="rtc" tickpolicy="catchup"/>
    <timer name="pit" tickpolicy="delay"/>
    <timer name="hpet" present="no"/>
    <timer name="hypervclock" present="yes"/>
    <timer name="tsc" present="yes" mode="native"/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <pm>
    <suspend-to-mem enabled="no"/>
    <suspend-to-disk enabled="no"/>
  </pm>
  <devices>
    <emulator>/usr/bin/qemu-system-x86_64</emulator>
    <disk type="block" device="disk">
      <driver name="qemu" type="raw" cache="none" io="native" discard="unmap"/>
      <source dev="/var/lib/libvirt/images/win10-gpu.img"/>
      <target dev="vda" bus="virtio"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x08" function="0x0"/>
    </disk>
    <controller type="usb" index="0" model="ich9-ehci1">
      <address type="pci" domain="0x0000" bus="0x00" slot="0x05" function="0x7"/>
    </controller>
    <controller type="usb" index="0" model="ich9-uhci1">
      <master startport="0"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x05" function="0x0" multifunction="on"/>
    </controller>
    <controller type="usb" index="0" model="ich9-uhci2">
      <master startport="2"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x05" function="0x1"/>
    </controller>
    <controller type="usb" index="0" model="ich9-uhci3">
      <master startport="4"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x05" function="0x2"/>
    </controller>
    <controller type="pci" index="0" model="pci-root"/>
    <controller type="pci" index="1" model="pci-bridge">
      <model name="pci-bridge"/>
      <target chassisNr="1"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x0b" function="0x0"/>
    </controller>
    <controller type="virtio-serial" index="0">
      <address type="pci" domain="0x0000" bus="0x00" slot="0x06" function="0x0"/>
    </controller>
    <interface type="direct">
      <mac address="52:54:00:60:42:55"/>
      <source dev="enp15s0f1v7" mode="passthrough"/>
      <model type="virtio"/>
      <driver name="vhost"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x09" function="0x0"/>
    </interface>
    <serial type="pty">
      <target type="isa-serial" port="0">
        <model name="isa-serial"/>
      </target>
    </serial>
    <console type="pty">
      <target type="serial" port="0"/>
    </console>
    <input type="tablet" bus="usb">
      <address type="usb" bus="0" port="1"/>
    </input>
    <input type="mouse" bus="ps2"/>
    <input type="keyboard" bus="ps2"/>
    <graphics type="spice" autoport="yes">
      <listen type="address"/>
      <gl enable="no"/>
    </graphics>
    <sound model="ich6">
      <address type="pci" domain="0x0000" bus="0x00" slot="0x04" function="0x0"/>
    </sound>
    <audio id="1" type="spice"/>
    <video>
      <model type="qxl" ram="65536" vram="65536" vgamem="16384" heads="1" primary="yes"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x02" function="0x0"/>
    </video>
    <hostdev mode="subsystem" type="pci" managed="yes">
      <source>
        <address domain="0x0000" bus="0x01" slot="0x00" function="0x0"/>
      </source>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x03" function="0x0" multifunction="on"/>
    </hostdev>
    <hostdev mode="subsystem" type="pci" managed="yes">
      <source>
        <address domain="0x0000" bus="0x01" slot="0x00" function="0x1"/>
      </source>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x03" function="0x1"/>
    </hostdev>
    <hostdev mode="subsystem" type="pci" managed="yes">
      <source>
        <address domain="0x0000" bus="0x13" slot="0x00" function="0x0"/>
      </source>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x0a" function="0x0"/>
    </hostdev>
    <redirdev bus="usb" type="spicevmc">
      <address type="usb" bus="0" port="2"/>
    </redirdev>
    <redirdev bus="usb" type="spicevmc">
      <address type="usb" bus="0" port="3"/>
    </redirdev>
    <memballoon model="virtio">
      <address type="pci" domain="0x0000" bus="0x00" slot="0x07" function="0x0"/>
    </memballoon>
  </devices>
</domain>
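In case it helps with the hugepages bullet: as far as I know, the bare <hugepages/> element in the XML above leaves the page size to the host default, and libvirt also lets you name a page size explicitly. The fragment below is only a sketch. The 1 GiB page size and the 40 GiB memory figure are assumptions for illustration, not the poster's settings, and the host must already have enough pages of that size reserved before the VM starts.

```xml
<!-- Fragment of a libvirt domain definition, not a complete file. -->
<!-- Assumed values: 40 GiB of guest RAM backed by 1 GiB hugepages. -->
<memory unit="GiB">40</memory>
<currentMemory unit="GiB">40</currentMemory>
<memoryBacking>
  <hugepages>
    <!-- Ask for 1 GiB pages instead of the host's default hugepage size -->
    <page size="1" unit="GiB"/>
  </hugepages>
</memoryBacking>
```

Keeping the guest memory a whole multiple of the page size avoids any rounding surprises.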
Thanks for your notes!
I was already fairly close to your setup, but I’m only landing at about a 12800 CPU score.
Originally I was running 1 iothread and had also pinned the emulator thread, but none of that made a difference.
I have yet to test static hugepages. Currently I’m only using 2 MB THP. I did try static allocation back in 2018 but did not see much of a difference. Maybe I need to revisit this, but a 30% difference in CPU performance seems big.
I’ll do some more testing over the weekend, maybe I can find something
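For readers who have not used them, the emulator pin and single iothread mentioned above look roughly like this in libvirt. This is a sketch under assumptions: the host CPU numbers are hypothetical (8/24 and 9/25 taken as sibling threads on the CCD that is not given to the guest, matching the numbering in the XML earlier in the thread) and are not this poster's actual values.

```xml
<!-- Fragment of a libvirt domain definition, not a complete file. -->
<iothreads>1</iothreads>
<cputune>
  <!-- The existing vcpupin entries stay here, unchanged -->
  <!-- Assumed host CPUs: 8/24 and 9/25 are sibling threads outside the guest's CCD -->
  <emulatorpin cpuset="8,24"/>
  <iothreadpin iothread="1" cpuset="9,25"/>
</cputune>
```

The intent is to keep QEMU's own emulator and I/O threads off the cores the guest's vCPUs run on.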
The screenshot with the 16898 CPU score was taken with different CPU pinning from the XML I provided. I pinned every vCPU to its own physical core, 16 cores in total across both CCDs, so there is no SMT (hyperthreading) going on. The score is much higher, as it should be. What is interesting is that it is even higher than when I was running Windows on bare metal.
If I only pass the 1st CCD through, as in the XML I provided, I only get about a 14600 CPU score. The question is whether 3DMark would scale with 3D V-Cache. In real games, 3D V-Cache CPUs should give a boost of over 10 percent.
PS: I run Curve Optimizer at -15 on the 1st CCD and -10 on the 2nd CCD; Max Boost Frequency is set to a +200 offset in the BIOS; tuned DDR5-6000 profile.
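The one-vCPU-per-physical-core pinning described above was not posted, so here is a rough sketch of what it could look like. It assumes host CPUs 0-15 are the first thread of each physical core (consistent with the earlier XML, where 0 and 16 are siblings). The dies="2" topology is my assumption; the poster did not say what topology was used.

```xml
<!-- Fragment of a libvirt domain definition, not a complete file. -->
<vcpu placement="static">16</vcpu>
<cputune>
  <!-- One vCPU per physical core; sibling threads 16-31 are left unused by vCPUs -->
  <vcpupin vcpu="0" cpuset="0"/>
  <vcpupin vcpu="1" cpuset="1"/>
  <vcpupin vcpu="2" cpuset="2"/>
  <vcpupin vcpu="3" cpuset="3"/>
  <vcpupin vcpu="4" cpuset="4"/>
  <vcpupin vcpu="5" cpuset="5"/>
  <vcpupin vcpu="6" cpuset="6"/>
  <vcpupin vcpu="7" cpuset="7"/>
  <vcpupin vcpu="8" cpuset="8"/>
  <vcpupin vcpu="9" cpuset="9"/>
  <vcpupin vcpu="10" cpuset="10"/>
  <vcpupin vcpu="11" cpuset="11"/>
  <vcpupin vcpu="12" cpuset="12"/>
  <vcpupin vcpu="13" cpuset="13"/>
  <vcpupin vcpu="14" cpuset="14"/>
  <vcpupin vcpu="15" cpuset="15"/>
</cputune>
<cpu mode="host-passthrough" check="none" migratable="off">
  <!-- Assumed topology: 1 socket x 2 dies x 8 cores x 1 thread = 16 vCPUs -->
  <topology sockets="1" dies="2" cores="8" threads="1"/>
  <cache mode="passthrough"/>
</cpu>
```

With this layout the iothread pins from the earlier XML (8,24 and 9,25) would share physical cores with vCPUs 8 and 9, so they would need to move or be dropped.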
Thanks for the clarification! That makes sense, I thought the screenshot and the XML belonged together. Fortunately, it also means I’m quite satisfied with my setup performance-wise (I have not touched any OC yet apart from RAM).
Amazing difference from my 6-year-old CPU. Not regretting the upgrade at all; although pricey, I’m glad I went for the 7950X3D.
So you pinned the 1st core (core 0) to the VM? Isn’t that potentially a problem, since the host kernel would keep using it anyway? Did you experience any “stuttering” inside the VM as a result? How is your overall experience passing the 3D V-Cache cores to a gaming VM and keeping the rest for the host? Would you recommend it?
I plan to do something very similar, and that’s why I’m trying to do some research first to make sure that I don’t spend a significant amount of money on the “wrong” part (i.e. get a 7950X3D instead of a 7950X for QEMU VMs). Games do like 3D V-Cache, but running bare metal means latency when communicating across the two dies. That’s why I’d like to reserve the 3D V-Cache cores for a gaming VM, to potentially avoid that but still get the benefit of 3D V-Cache for gaming. If that doesn’t work well enough, I might as well just get a regular 7950X (or 9950X) CPU.
I’d really appreciate it if you could answer, thank you! Cheers!
I just demonstrated how to pin the cores. You may change it based on your testing. As of today, I would avoid the 9950X. It triples inter-CCD latency compared to the 7950X. It looks like a hardware-level bug.
In general, the inter-CCD latency of the 7950X is not very high. You don’t need Xbox Game Bar with the 7950X.
I see… thank you for replying! And one more question: how did you find out which of these cores have stacked V-Cache and which do not? Does lscpu -e show this information?