What regulates the CPU frequency in the Linux kernel?

I have two AMD Ryzen 9 5950X systems. One has the Asus X570 TUF Gaming Pro WiFi motherboard and the other has an Asus ROG STRIX X570-E GAMING WIFI II motherboard.

On the X570 TUF Gaming system, I am able to use Linux kernel 5.14.15, but on the ROG STRIX X570-E GAMING WIFI II system, due to a problem with the USB (which I’ve still yet to finish diagnosing because the system is currently busy doing work for me), I had to roll back to Linux kernel 5.4.196-1.

What I have observed is that under full load, the system running kernel 5.14.15 holds an all-core boost clock of 4.5 GHz, whereas the system running kernel 5.4.196-1 holds an all-core boost clock of 4.6 GHz.

So, what is it about the two different kernels that regulates the frequencies differently?

(I am asking because I want to make sure that if I send an MPI job to both compute nodes, I won't run into a problem with one node finishing ever so slightly faster than the other as a result of this clock speed difference.)

Thanks.

Different boards may have different PBO/XFR behavior out of the box.

The same board may even have different PBO/XFR behavior with different BIOS settings.

Different cooling can also impact different boost speeds, as can silicon die quality.

Check what the PBO/XFR setting is in the BIOS. Most BIOSes seem to attempt to run faster than the AMD reference speeds, depending on how much voltage they feed the CPU.

If consistency is more important than throughput, go through the BIOS and try to make sure they are limited to “AMD Settings” for boost, cooling is the same, etc. Maybe even turn off XFR/PBO and lock the machines at a fixed frequency.

Comparing kernel behavior is only really valid on the same machine. The different kernel MAY have nothing to do with this; there are so many other variables.


There's a Linux package called 'cpupower'. Running
cpupower frequency-info
will let you check whether both systems use the same governor.
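If you want to run that check across every core at once, you can read the same cpufreq sysfs files that cpupower does. A minimal sketch, assuming the standard `/sys/devices/system/cpu/cpuN/cpufreq` layout:

```python
# Sketch: summarize which cpufreq governor each core is using, by reading
# the standard sysfs files (the same data `cpupower frequency-info` reports).
from collections import Counter
from pathlib import Path

def summarize_governors(values):
    """Count how many cores report each governor string."""
    return dict(Counter(v.strip() for v in values))

def read_governors(sysfs_root="/sys/devices/system/cpu"):
    """Read one governor string per core; returns [] where cpufreq is absent."""
    paths = sorted(Path(sysfs_root).glob("cpu[0-9]*/cpufreq/scaling_governor"))
    return [p.read_text() for p in paths]

if __name__ == "__main__":
    print(summarize_governors(read_governors()))
```

Run it on both nodes; if the dictionaries differ (say, one node on `schedutil` and the other on `ondemand`), that alone can explain different boost behavior.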

I've found cpupower only shows base clocks on Ryzen, not boost clocks. On the other hand, turbostat does show boost clocks. However, if you want all the info, take a look at [ryzen_monitor](https://github.com/hattedsquirrel/ryzen_monitor), which monitors power information of Ryzen processors via the PM table of the SMU.

When talking about clocks on modern CPUs, it's really the CPU itself that decides what clock to use. Ryzen specifically is known for being highly variable based on the physical sample (the so-called silicon lottery), motherboard/BIOS settings, and CPU temperature.

Also, don't look at CPU clocks to compare performance, even on identical CPUs. Run some kind of benchmark with the workload you're actually going to run (like processing X many records). Then determine whether there is a processing speed difference, and work from there.


I’ve found on EPYC that cpupower will show the correct boost clocks if run as non-root.

It will first try to ask the hardware, which gives only the base clock of the current P-state, i.e. not what we want. This succeeds only as root for me (it might have to do with device permissions, too). Only failing this will it ask the kernel, which gives the actual boost clock :)


I use cat /proc/cpuinfo | grep MHz to read my core clocks.
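For comparing two nodes, it can help to boil that grep output down to a summary. A small sketch that parses the same "cpu MHz" lines out of /proc/cpuinfo:

```python
# Sketch: extract per-core clock readings from /proc/cpuinfo text,
# the same lines `cat /proc/cpuinfo | grep MHz` shows.
import re

def core_mhz(cpuinfo_text):
    """Return one float per 'cpu MHz' line in the given text."""
    return [float(m) for m in re.findall(r"cpu MHz\s*:\s*([0-9.]+)", cpuinfo_text)]

if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            mhz = core_mhz(f.read())
    except OSError:
        mhz = []
    if mhz:
        print(f"min {min(mhz):.0f} MHz, max {max(mhz):.0f} MHz across {len(mhz)} cores")
```

Note that on Ryzen these readings are snapshots of a value the CPU changes on its own many times per second, so sample repeatedly under load rather than trusting one reading.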

So… I'm running the default settings, and I don't (explicitly) have PBO set or enabled, because both motherboard BIOSes display a great big warning that says something to the effect of "if you enable this, you will void your warranty".

So, I've left all of those features/controls at their factory defaults.

Yeah, I guess I'll know more once I'm able to finish testing the USB issue and can update that system's kernel to 5.14.15, to see whether the clocks line up with each other (given that, across the two systems, the clock speeds on the ROG Strix are about 100 MHz faster than on the TUF Gaming motherboard).

If they end up being the same after the kernel update, that will tell me that the cause of the difference is in the Linux kernel itself, and not any of the other aforementioned items and/or possible causes.

Thank you.

I know how to read the CPU clock speeds and show them.

What I am asking about is how the Linux kernel controls said CPU clock speeds, because I have two different Linux kernels running on two different, separate systems, and the system running the slightly older kernel (5.4.196) is consistently 100 MHz FASTER than the system running the slightly newer kernel (5.14.15).

So I'll update the kernel first to see if that lines both systems up; if it does, that will tell me the difference in the kernel is the cause of the clock speed difference, rather than any other potential variable.

(And assuming that's what it is, it still wouldn't explain why kernel 5.4.196 is consistently 100 MHz faster than the 5.14.15 kernel.)

I’m actually putting my systems through the actual workloads that I need the system to process and then looking at the CPU clock speeds during the course of the run.

Moreover, I am purposely oversubscribing my hardware/CPU at about a 3:1 ratio in order to make sure that the cores remain fully loaded while I check and compare the CPU clock speeds between the two systems. Both systems have SMT disabled, and sure enough, the ROG Strix system, when fully loaded, is about 60 MHz faster than the TUF Gaming system.

I guess I’ll know more when I run one of my HPC workloads on the system to see whether this clock speed difference is going to result in inconsistent results from said HPC application.

Thanks.

Does anybody know if there is a way for me to actually LIMIT the boost clock behaviour, i.e. set a maximum for the all-core boost, either in the BIOS or in the Linux kernel itself?

I'm certain that there's a CPU frequency regulator in the Linux kernel, because whenever I start up one of my HPC applications, it tells me that the current/idle clock speed is below what the base clock speed of the CPU is supposed to be. So SOMETHING has got to be telling the CPU that it can drop down to 2.2 GHz when the base clock of the 5950X is supposed to be 3.4 GHz, unless that's one of those P-state things that allows the CPU to clock DOWN when it is at low load/idle.
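On the kernel side, the cpufreq subsystem does expose a per-core ceiling via the `scaling_max_freq` sysfs file (it's what `cpupower frequency-set -u` writes), and `scaling_min_freq` is the matching floor that lets the CPU idle below base clock. A sketch of capping the ceiling, with the caveat that whether the cap is fully honored can depend on the cpufreq driver and BIOS settings:

```python
# Sketch: cap the cpufreq ceiling on every core by writing scaling_max_freq.
# Values are in kHz; the write needs root, and the driver/firmware must
# honor the cap (behavior varies across AMD cpufreq drivers).
from pathlib import Path

def ghz_to_khz(ghz):
    """scaling_max_freq takes kHz, e.g. 4.5 GHz -> 4500000."""
    return int(round(ghz * 1_000_000))

def cap_all_cores(ghz, sysfs_root="/sys/devices/system/cpu"):
    khz = str(ghz_to_khz(ghz))
    for p in Path(sysfs_root).glob("cpu[0-9]*/cpufreq/scaling_max_freq"):
        p.write_text(khz)

if __name__ == "__main__":
    print(ghz_to_khz(4.5))  # the value you'd write for a 4.5 GHz cap
    # Uncomment to actually apply (requires root):
    # cap_all_cores(4.5)
```

The equivalent one-liner would be `cpupower frequency-set -u 4500MHz` on each node.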

Bear in mind that Ryzen CPUs clock themselves based on power, thermals and voltage. If you swapped both 5950Xs in both of your machines they’re likely to boost differently anyway due to silicon quality.

I believe this link describes a method, using the “userspace” cpufreq governor:
https://metebalci.com/blog/epyc-energy-consumption-test/

(it’s about EPYC but I assume it would work as well for Ryzen)
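The core of the method in that link is switching to the `userspace` cpufreq governor and then writing a fixed speed per core. A minimal sketch, assuming the cpufreq driver in use supports `scaling_setspeed` (acpi-cpufreq does; the writes need root):

```python
# Sketch: pin every core to one fixed frequency via the `userspace` governor.
# Assumes the cpufreq driver supports scaling_setspeed; requires root to write.
from pathlib import Path

def target_khz(mhz):
    """scaling_setspeed takes kHz, e.g. 3400 MHz -> 3400000."""
    return int(mhz * 1000)

def pin_frequency(mhz, sysfs_root="/sys/devices/system/cpu"):
    for cpu in Path(sysfs_root).glob("cpu[0-9]*/cpufreq"):
        (cpu / "scaling_governor").write_text("userspace")
        (cpu / "scaling_setspeed").write_text(str(target_khz(mhz)))

if __name__ == "__main__":
    print(target_khz(3400))  # e.g. pin at the 5950X base clock of 3.4 GHz
    # Uncomment to actually apply (requires root):
    # pin_frequency(3400)
```

Applying the same pinned frequency on both nodes should remove the boost-behavior variable entirely, at the cost of throughput.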


Yeah, there's probably some combination of that at play, because the system with the marginally slower frequency is actually running cooler, by about 3 °C or so (give or take), than the system with the slightly higher frequency.

Thank you. I'll take a look at that. At first glance, it looks like what I am looking for, but it also looks like they are using a number of other tools to control it, rather than just listing what the CPU and/or motherboard and/or BIOS and/or kernel has set by default.

Thanks.

Is there a difference in the throughput between the two platforms in terms of processing done? Clock frequency alone isn’t really something worth measuring unless you’re overclocking.

Unknown at this point, because I am unpacking and re-packing some files and prepping them to be written to LTO-8 tape; since each system is working on a different dataset, I can't get an apples-to-apples comparison between them yet.

That'll likely come later, as I will need to test two different types of HPC applications which use entirely different solvers. There is a potential where one might not care so much and the other might care a LOT, so it will depend on whether I end up with inconsistent results when running both nodes on a single problem simultaneously, versus running the same load twice, once on each compute node, to see if there is any meaningful difference in the total wall clock time to solve the problem.

So we shall see…

Right now, the question is really more geared towards clock frequency induced inconsistencies for MPI results, so I’ll have to see how tolerant the application is in regards to that.

I know that one of the HPC/CFD applications HATED the 12900K's P-core and E-core arrangement, to the extent that MPI WOULD NOT spawn the remaining child compute processes during application launch/startup. My hypothesis was that, because of the difference in operating clock frequencies between the P-cores and the E-cores, it couldn't synchronise the computations and/or keep the system/application consistent. But that's just a theory, given that I didn't know of a good way to test it, and my 12900K then had a catastrophic failure (it couldn't even run memtest86). But that's an entirely different story altogether.

If you need absolutely the same performance between two systems, you’ll want to disable auto-overclocking and just run at base frequency. But you’ll be giving up a lot of performance so I doubt you’ll want or need to do this.

How is the memory config on both platforms? If they are running at different speeds or with different CAS latencies, that could also cause a performance difference.

The P-cores and E-cores of Alder Lake are basically entirely different CPUs, so I wouldn't pay much heed to those results. I would disable the E-cores for any CPU-heavy workloads (or change the CPU affinity of the specific processes).

Actually, I thought of a slightly different take on that: instead of purposely running both CPUs at the base clock speed, I would set both systems to run at the known all-core boost (e.g. 4.5 GHz) and keep them both there.

The memory config is the same on both systems:
4x Crucial 32GB DDR4-3200 unbuffered, non-ECC RAM, I think it’s CL22. (total of 128 GB per node)

Raw speed isn’t nearly quite as important as capacity.

re: 12900K
I ended up RMAing my 12900K back to Intel for a full refund. It seems crazy to spend all that money on said 12900K only to then purposely cripple the processor by disabling all of its E-cores, just so that software that wasn't written with that in mind won't throw a fit when you attempt to launch the application.

This was another reason why I ended up switching my 12900K system for a 5950X system. You're paying for a 12900K, but it seems kinda silly to only use about half of the processor.

(The HPC application actually sets the CPU affinity mask automatically on launch, and it failed to do so, i.e. failed to complete the launch/startup sequence. So even with that, it still failed.)