9950X Core Parking: Linux vs Windows

Enjoying the in-depth reviews of the new Ryzen CPUs! I know there’s some investigation being done regarding Linux gaming performance on the 9000 series; I would love to see some benchmarks comparing the uplift (if any) core parking provides on Linux with the 9950X/9900X vs what it provides on Windows.

The gamemoderun command (from GitHub - FeralInteractive/gamemode: Optimise Linux system performance on demand), commonly used to tweak Linux gaming performance, provides a configuration option that can take care of this by manually specifying core numbers, e.g. park_cores=16-31 for a 9950X (gamemode/example/gamemode.ini at master · FeralInteractive/gamemode · GitHub).
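
For reference, the relevant bit of the config would look roughly like this (the 16-31 list is just the example above; the file is typically ~/.config/gamemode.ini or /etc/gamemode.ini, and which logical CPUs map to the second CCD vs the SMT siblings depends on your topology, so it’s worth checking lscpu -e first):

  [cpu]
  ; park these logical CPUs while a game launched via gamemoderun is active
  park_cores=16-31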

Hope this is of interest, thanks for all you do!

1 Like

Don’t some mobos have a BIOS option that may cause some issues with the OS?

I think Jay found something on his ASRock board: a setting that is AUTO by default, where AUTO seems to actually resolve to something like Frequency instead of letting the OS handle it.

He found that you have to set it manually to the DRIVER option for it to work properly, at least on Windows.

Perhaps the Linux kernel is a bit smarter about this?

Hello, just curious: is anyone running a 7950X3D on Linux? If so, is the core parking setup similar to Windows?

1 Like

I don’t currently have a dual-CCD Ryzen processor, so take what I say with a grain of salt (I have been doing research for an upcoming purchase).

The BIOS option isn’t an issue with any particular OS; it’s that some motherboards’ default settings aren’t configured correctly, which is why Jay had to select the DRIVER option. The DRIVER option just means the driver has a set of programs it recognizes and knows which cores to choose for them; whether the Linux driver has programs configured this way, I have no idea. Either way, this would be the case on any mobo.
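
If it helps, on Linux you can at least see whether the kernel’s amd-pstate driver is getting preferred-core information from the firmware, independent of what the BIOS menu shows. A rough sketch (these sysfs paths assume a fairly recent kernel, roughly 6.9+, with amd-pstate active, so treat them as an assumption):

  # is amd-pstate driving the CPU, and is preferred-core support enabled?
  cat /sys/devices/system/cpu/amd_pstate/status
  cat /sys/devices/system/cpu/amd_pstate/prefcore
  # per-CPU ranking reported by the firmware; a higher value means a more preferred core
  grep . /sys/devices/system/cpu/cpu*/cpufreq/amd_pstate_prefcore_ranking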

1 Like

@sethlin

Thanks for the feedback :)

I have a 7950X (not the X3D), and I don’t bother with core parking. The 7950X3D won’t be worse than the 7950X in gaming: the high-clock CCD is identical between them, while the low-clock CCD of the 7950X3D has 3D V-Cache. In the worst case, without core parking, it will still be faster than the 7950X.

The 9950X is strange. Based on what I have heard, it has 2.5x the CCD-to-CCD latency of the 7950X. That is why AMD states the 9950X needs Xbox Game Bar to work properly with games.

1 Like

I just ran some core-to-core benchmarks on my Ryzen 9 9950X, and indeed the cross-chiplet latency is almost triple that of the Ryzen 9 7950X. I may try to set up KVM to pass through only the cores on the single faster CCD to give Windows a better chance at scheduling my gamez hah…

core-core latency

CPU              intra-CCD    cross-CCD
Ryzen 9 9950X    20 ns        210 ns
Ryzen 9 7950X    18 ns        70 ns
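
If anyone wants to reproduce the numbers, this is roughly how I ran it (the benchmark is nviennot’s core-to-core-latency tool installed via cargo; the virsh line at the end is only a sketch of the KVM idea, with a made-up domain name):

  # install and run the benchmark; it prints a pairwise latency matrix, so compare
  # pairs within one CCD against pairs that cross the CCD boundary
  cargo install core-to-core-latency
  core-to-core-latency

  # for the KVM idea, something like this pins a guest vCPU to a host CPU on the
  # faster CCD ("win11" is a hypothetical domain name)
  virsh vcpupin win11 0 0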

References

  • Dual GPU AI Gaming Workstation Build: pcpartpicker.com/b/tMsXsY
  • core-to-core-latency: github.com/nviennot/core-to-core-latency/issues/107

1 Like

For funzies I went back and ran the core-core latency tests again after changing a few variables, but none of these showed a significant impact:

  • Updating Linux Kernel, from 6.9.7-arch1-1 #1 SMP PREEMPT_DYNAMIC Fri, 28 Jun 2024 04:32:50 +0000 x86_64 GNU/Linux to 6.10.6-arch1-1 #1 SMP PREEMPT_DYNAMIC Mon, 19 Aug 2024 17:02:39 +0000 x86_64 GNU/Linux
  • Making the BIOS tweaks mentioned by Wendell (youtube.com/watch?v=0eY34dwpioQ&t=1917s)
  • Adding the mitigations=off kernel boot parameter in GRUB and confirming the change in lscpu for Spectre etc. (commands sketched below). This one possibly reduced max cross-CCD latency by a few percent, if anything.
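
The mitigations=off change looked roughly like this on my Arch install (assuming GRUB; adjust for other boot loaders):

  # add mitigations=off to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, then:
  sudo grub-mkconfig -o /boot/grub/grub.cfg
  # after a reboot, confirm it took effect:
  cat /proc/cmdline
  grep . /sys/devices/system/cpu/vulnerabilities/*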

FWIW, this was on the latest available BIOS version: 3.06, release date 2024/7/30.

2 Likes

The CCD-to-CCD latency might be patched very soon. The rumor says by the end of this month.

2 Likes

Interesting rumor!

I’m not sure what kinds of workloads are heavy on core-to-core I/O? I saw a Reddit thread mentioning the Phoronix PostgreSQL tests showing the 9950X falling behind expectations because of it.

reddit.com/r/hardware/comments/1ey25kv/zen_5_ccx_latency_only_9950x_not_9900x/

In my limited anecdotal local testing, llama.cpp GPU + CPU offload GGUF inferencing is bottlenecked by RAM bandwidth and probably isn’t doing much core-to-core I/O.

To test, I pinned the threads to specific cores/SMT siblings and measured performance and max CPU temps, e.g. (full commands sketched after the list):

  • taskset --cpu-list 0-15 # 7.90 tok/sec ~75degC
  • taskset --cpu-list 0-7,16-23 # 6.72 tok/sec ~83degC
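
The full commands looked something like this (llama-bench and the model path are stand-ins for whatever you’re running; -t should match the number of pinned CPUs):

  # one thread on each of the 16 cores, spread across both CCDs
  taskset --cpu-list 0-15 ./llama-bench -m ./models/model.gguf -t 16
  # both SMT threads of the 8 cores on a single CCD (per my numbering assumption below)
  taskset --cpu-list 0-7,16-23 ./llama-bench -m ./models/model.gguf -t 16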

I’m not 100% clear, but I believe threads 0-7 and 16-23 are the two hardware threads of the same eight cores on the same chiplet.
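
One way to check that assumption (the CACHE column in lscpu groups CPUs by shared caches, and the L3 grouping should correspond to a CCD on these parts):

  lscpu -e=CPU,CORE,CACHE
  # or, per CPU, list its SMT sibling(s):
  cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list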

So this test only shows that, for this specific use case, it’s better to spread the workload across real cores on both chiplets than to pin everything to the SMT threads of a single chiplet’s cores.

I’d have to get more creative and use only 8 threads, either all on a single CCD vs half on one CCD and half on the other, to look more closely at the core-to-core latency effect for a given workload, or else disable SMT.
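
Roughly this kind of comparison, as a sketch (CPU numbers assume one CCD is 0-7 and the other is 8-15, which I’d verify first):

  # 8 threads, all on one CCD
  taskset --cpu-list 0-7 ./llama-bench -m ./models/model.gguf -t 8
  # 8 threads, split 4 + 4 across the two CCDs
  taskset --cpu-list 0-3,8-11 ./llama-bench -m ./models/model.gguf -t 8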

I also noticed that toggling avx512f off vs on in the BIOS (there are two places to do this and I used both; confirm with lscpu | grep avx512f and recompile llama.cpp) doesn’t noticeably affect llama.cpp GGUF inferencing either, in my limited testing. This suggests to me that the big AVX-512 uplift for CPU workloads is likely only really applicable to limited use cases, e.g. a low-power CPU-only datacenter doing AI tasks that are not RAM-I/O bottlenecked. In most of my use cases I’m either using a GPU, which is much faster anyway, or I’m likely RAM-I/O bottlenecked…
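
For anyone repeating this, the check/rebuild step was basically the following (the exact CMake flag for forcing AVX-512 on or off varies between llama.cpp versions, so I’m only showing the generic native build here):

  # does the kernel still expose the instruction set after the BIOS change?
  lscpu | grep -o avx512f
  # rebuild llama.cpp; the default CMake build targets the host CPU, so it picks up
  # whatever instruction sets are currently exposed
  cmake -B build && cmake --build build --config Release -j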

Are there any actual common workloads where CPU AVX-512 will show a practical uplift, given the GPU resources available in a typical gaming desktop PC?

1 Like

Numerical software (HPC, simulations, linear algebra), but often only if you compile it yourself.

More mainstream: video encoding/decoding can make good use of AVX-512.
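
As far as I know x265 ships AVX-512 kernels but leaves them off by default, so you have to opt in explicitly; something like this (file names are placeholders, untested):

  ffmpeg -i input.mkv -c:v libx265 -x265-params asm=avx512 -crf 20 output.mkv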

llama.cpp does not benefit from AVX-512, last I checked; there are some GitHub issues about this if you google. So that is to be expected, and it seems to be the case for Intel CPUs as well.

Looking at the Phoronix benchmarks, it is clear that AVX-512 is much faster on the 9950X (up to 2x). You’re not necessarily always memory-bottlenecked (even if that is often the case).

1 Like

I’ve been enjoying the performance uplift of the ‘universal’ SIMD application (period_search_10220_x86_64-pc-linux-gnu) for the Asteroids@home project (http://asteroidsathome.net/boinc/). It automatically chooses the highest SIMD instruction set the CPU supports, and it has been working great on my 7950X and 9950X CPUs.

The application uses the AVX-512DQ instruction set on my Zen 4/5 CPUs.

elfx86exts ./period_search_10220_x86_64-pc-linux-gnu
File format and CPU architecture: Elf, X86_64
MODE64 (call)
CMOV (cmove)
SSE2 (movdqu)
SSE1 (movaps)
AVX (vmovsd)
NOVLX (vmovupd)
AVX512 (vbroadcastsd)
DQI (vpmovm2q)
SSE3 (movddup)
AVX2 (vbroadcastsd)
FMA4 (vfmaddsd)
NOT64BITMODE (xchg)
BMI (tzcnt)
SSE42 (pcmpistri)
BMI2 (sarx)
SSSE3 (pshufb)
VLX (vmovdqu64)
BWI (vpbroadcastb)
SSE41 (pminud)
RTM (xbegin)
Instruction set extensions used: AVX, AVX2, AVX512, BMI, BMI2, BWI, CMOV, DQI, FMA4, MODE64, NOT64BITMODE, NOVLX, RTM, SSE1, SSE2, SSE3, SSE41, SSE42, SSSE3, VLX
CPU Generation: Unknown
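
If anyone wants to inspect their own binaries the same way, elfx86exts is a small Rust tool (assuming you have cargo available):

  cargo install elfx86exts
  elfx86exts /path/to/some_binary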

2 Likes

I only run these in winter as a heat supplement for the house.

I just run the A/C all year round… and pay the insane costs.

Here we go: inter-CCD latency is fixed in AGESA 1.2.0.2.
https://videocardz.com/newz/amd-ryzen-9000-inter-core-latency-significantly-reduced-with-new-agesa-1202

2 Likes