What kind of implications does a Hybrid CPU design have for performance?

In a discussion on the EEVBlog about the M1-Ultra, I looked at the CB R23 numbers both CPUs achieve at full power.
Take that napkin math with a grain of salt though! I would like some actual numbers, both for power and score/time to complete.

This was sparked by this amazingly misleading TomsHardware article, with these numbers from GB5:
[image: GB5 benchmark comparison from the article]


There is more to this though.
Amazing “mini-cores” meant to take on one kind of task are utterly worthless when the OS scheduler leaves all those chores to a normal core.

What I could see as easier to get working right quickly would be a chiplet CPU with big cores and small cores on separate packages, advertised as different CPUs. Then schedule ALL tasks on the small cores by default and only pull them to the big cores when they push the load of their current core to the max.
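
As a very rough sketch of that “small cores first, promote on saturation” idea - the threshold, core counts, and load numbers here are all invented for illustration, not how any real scheduler works:

```python
# Toy sketch of an "E-cores first" placement policy: every new task lands on
# the least-loaded small core, and a task only gets promoted to a big core
# once the small core it sits on is pegged. All numbers are invented.

PROMOTE_THRESHOLD = 0.90  # promote when the current core is ~maxed out

class Core:
    def __init__(self, name, big):
        self.name, self.big, self.load = name, big, 0.0

def place_new_task(cores):
    """New tasks always start on the least-loaded small core."""
    small = [c for c in cores if not c.big]
    return min(small, key=lambda c: c.load)

def maybe_promote(current, cores):
    """Pull a task to a big core only if its small core is saturated."""
    if current.big or current.load < PROMOTE_THRESHOLD:
        return current
    big = [c for c in cores if c.big]
    return min(big, key=lambda c: c.load)

cores = ([Core(f"E{i}", big=False) for i in range(8)] +
         [Core(f"P{i}", big=True) for i in range(8)])
c = place_new_task(cores)            # lands on an E-core
c.load = 0.95                        # pretend the task maxed that E-core out
print(maybe_promote(c, cores).name)  # -> "P0" (a big core)
```

The hard part, as the rest of this thread gets into, is knowing those load numbers accurately and early enough.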

2 Likes

From what I’ve seen, since the new kernel update Intel is seeing better performance in Linux than in Windows 11. Also tangentially related, while on the subject of Unix-like OSes: it’s worth remembering that MacOS is now designed for Apple Silicon, which is also based on an ARM big.little architecture.

I think what surprises me most is how the lower-end parts are going full-on P cores. I’d’ve done it the other way around: give the Celeron 4 E cores and no P cores; give the Pentium 2 P cores and 4 E cores. Something like that. If these are meant to be cheap, small, low-power CPUs at the expense of performance, then surely that’s what E cores are all about.

I am kind of blinkered to the point of them in a desktop application, personally. I get it for mobile and server, but idk. On the one hand, when people are like “you can game on the P cores and use the E cores for x264 encoding and the two won’t interfere with each other”, I get that, but I bought a CPU with 16 P cores for that express purpose and it works. In fact it’s even more literal than that: the original plan for getting a 3950X was that I could give the game and OBS their own entire 8-core CCD, which isn’t that different an idea from the 8+8 layout of the 12900K, except with them all being P cores. It doesn’t really explain what the E cores are for, except that Intel still has issues with power consumption and is still drawing over 200 W even with the E cores, which is way more than the all-P-core 5950X.

I think, cynically, Intel is losing market share in the server space to big.little ARM CPUs with many, many cores (not to mention having lost Macs to ARM too), and this is a response to that; what we’re seeing is their server machinations manifesting in desktop parts.

1 Like

The scheduling issue is an OS problem.

macOS does pretty well at it. That’s what you get with hardware and software integration.

I’m sure with a little more time Linux will as well.

Intel and Microsoft need to pull their collective fingers out and do better.

3 Likes

This is what confuses me: Linux has been using big.little architectures via ARM for years longer than MacOS. From what I’ve seen the 12900K does run better on Linux than on Windows since Intel’s updates reached the kernel. Windows, for that matter, has ARM variants that you would presume had been optimised for this kind of architecture. I’m not a software developer obviously, but it does strike me that there is a lot of reinventing the wheel for problems that I had assumed to be already solved.

1 Like

I honestly think that is where things are going. If you can intrinsically embed that information in an execution task, then I think you will win the mobile space. It takes cycles to spin up a core to max clock.

If you spend 10 ms spinning a P-core up to max clocks for a task that takes 1 ms to complete, you will spend, in the best case, 11 ms to complete the task. That does not include the spin-down. And if you are spinning up and down between tasks, that is wasted efficiency and power.

On the other hand, if it takes 5 ms to spin up an E-core and 3 ms to complete the task, you will at best spend 8 ms on the task. If the E-core is that much more efficient in power consumption than the P-core, then that is a net win all around. And if you can queue those tasks better, then you can skip the spin-up/spin-down process in between and get even better efficiency.
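
Putting that napkin math into code - the 10/1 ms and 5/3 ms figures are the ones above, the power numbers are completely made up:

```python
# Latency and energy for the two scenarios above. Only the spin-up and task
# times come from the post; the power numbers are invented placeholders.

def run(spin_up_ms, task_ms, active_power_w):
    total_ms = spin_up_ms + task_ms        # best case, ignoring spin-down
    energy_mj = active_power_w * total_ms  # W * ms = mJ, crudely assuming full power throughout
    return total_ms, energy_mj

p_core = run(spin_up_ms=10, task_ms=1, active_power_w=15)   # assumed 15 W while active
e_core = run(spin_up_ms=5,  task_ms=3, active_power_w=3)    # assumed 3 W while active

print(f"P-core: {p_core[0]} ms, ~{p_core[1]:.0f} mJ")   # 11 ms, ~165 mJ
print(f"E-core: {e_core[0]} ms, ~{e_core[1]:.0f} mJ")   # 8 ms,  ~24 mJ
```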

2 Likes

Mostly in Japan and the rest of Asia. Look at all of the hardware add-ons for the Famicom, Super Famicom, and N64. Most of these were in rental kiosks to play the games on the newer hardware. I don’t know if the GameCube got any of that love.

Basically, you had to buy hardware, and pay a subscription service to get BC for the previous consoles on the current consoles. They took a hybrid Sony/MS approach.

2 Likes

POWER/PPC and SPARC had been doing that since the early 2000s. RISC architectures lend themselves better to multi-way SMT pipelines. Even then, you will never achieve 100% performance on an SMT thread. This is what AMD tried (and arguably succeeded at in retrospect, given Intel’s Spectre and Meltdown mitigations) with the heavy-machinery cores. Unfortunately, as with the Cell BE, developers were not into making highly parallelized workloads, so x86 CMT did not really take off.

1 Like

Deflect, mislead, and then over-commit. This has been the Intel way for decades now. They can market these as X-core systems and only have to admit in the fine print that not all cores have the same capability. This is the same as AMD marketing the CMT heavy-machinery CPUs as X-core systems when they were, at worst, X/2-core systems (and on the MS Windows desktop side mostly operated as that). AMD had the physical cores, but the cores shared resources and could cause stalls if the scheduler did not load-balance properly to ensure that you were not trying to execute the same type of operation from neighboring cores on the shared FPU.

3 Likes

But… but… but WIntel is too big to fail!

Until recently, not everything that made it into the Android kernel got backported into the Linux kernel proper. The Android kernel is a fork of the Linux kernel.

On the MS side, I assume due to obligations to Intel and market share, the MS ARM initiative has been half-arsed at best.

2 Likes

I’m not talking about Android. Linux is on ARM outside of mobile, has been for ages. Ubuntu for ARM | Download | Ubuntu

They aren’t though. The E cores have been the most prominent part of Intel’s marketing of the 12th gen; I don’t think it’s accurate to say that the 12900K has been marketed as a 16-core with a caveat in the fine print at all. If anything it’s been a failing of Intel’s marketing in the opposite direction: so far every outlet I have seen has put the 5800X3D up against the 12900KS, said how much better value the AMD part is, and completely glossed over the fact that they are comparing an 8-core with a 16-core part, and that the 12900K actually competes with the 5950X.

1 Like

Sure, but BIG.little was purely an Android technology on an Android platform. MIPS has existed on GNU/Linux for much longer than ARM, but MIPS is dead now, even though it was the king of the low-power space, because the innovation was locked behind SDKs once Imagination bought them.

My point being that BIG.little is a flavor of an implementation of ARM, and even though it is ubiquitous in the mobile phone space, that does not hold for ARM on mobile as a whole. The new ARM-on-mobile approach from Qualcomm proves that, with their big-core-only implementation for the mobile laptop space and MS Windows support.

1 Like

Even though big.LITTLE is a patented ARM design (yes, it is the little that is capitalized), the term has become common usage to describe any hybrid design with both power and efficiency cores. Like many other commonly used terms, it may not be technically correct, but that isn’t going to stop the layman.

1 Like
  1. RE: more than 2-way SMT. There are diminishing returns there, I think mostly related to memory. 12-way SMT might seem like an effective way for a single core to do more things, but having real cores with their own real cache makes a difference - we will see this play out soon with Ampere Altra vs. AMD EPYC. Cache misses introduce an enormous performance penalty (there’s a toy model of this at the end of this post). In fact, it is recommended to disable SMT on AMD EPYC CPUs in high-performance computing environments because of this problem.

  2. I have seen and heard this point of view a lot in the wake of Alder Lake. I think there are clear desktop applications of hybrid CPU design, just as there are in mobile. Efficiency on desktop does matter to some consumers (e.g. me). Moreover, if CPU designers can delegate boring tasks like IO and scheduling to efficient cores that don’t use up precious die space, that leaves more room for making P cores more performant, for example by giving them more cache and higher memory bandwidth (to say nothing of the relationship between execution mechanisms and die space, which is complex and opaque to me).

So, I think Intel is actually on to something here. Combined with their super complex packaging for their upcoming HPC GPUs, I would also guess that hybrid designs allow for weird chiplet packaging solutions. You could slap together a bunch of E cores from tiny chiplets, plus a chiplet with a few performant P cores, and voila, product.
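
A rough way to see that cache-miss point from item 1: effective CPI is roughly base CPI plus miss rate times miss penalty, and every extra SMT thread shrinks the cache each thread effectively gets, so the per-thread miss rate climbs. A toy model with entirely invented numbers:

```python
# Back-of-envelope model of SMT scaling on one core. All numbers are invented
# for illustration: CPI_eff = CPI_base + miss_rate * miss_penalty, and the
# per-thread miss rate is assumed to grow as more threads share one cache.

CPI_BASE = 1.0          # assumed cycles per instruction with a perfect cache
MISS_PENALTY = 200      # assumed stall cycles per last-level cache miss
BASE_MISS_RATE = 0.002  # assumed misses per instruction with 1 thread

for threads in (1, 2, 4, 8, 12):
    miss_rate = BASE_MISS_RATE * threads       # crude: less cache per thread
    cpi = CPI_BASE + miss_rate * MISS_PENALTY  # effective cycles per instruction
    core_ipc = threads / cpi                   # whole-core instructions per cycle
    print(f"{threads:2d}-way SMT: ~{core_ipc:.2f} IPC for the core")
```

With these made-up numbers, going from 1 to 2 threads helps a lot, while going from 8-way to 12-way barely moves the needle - that’s the diminishing-returns shape, not a claim about any real CPU.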

1 Like

That would not work at all.
Data handling and Data processing are very separate tasks, from the lowest level up.
[image: Harvard architecture diagram]
The complexity of having an external ALU take control as a “supervisor” over neighbouring processors would negate any potential gains.
Edit: And you would still need one “bootstrap”-core with a Hardware Control Unit to start the entire misery.

I/O is another different beast, since it has to interface with wildly different busses. PCIe-attached RAM could maybe be made to work at some point in the future; DDR5-attached USB is just nightmare fuel for engineers.

4 Likes

That’s one hell of a task to put into a scheduler! The point is: what metrics of a process qualify it as being a main or a secondary task? There aren’t really many of them for a scheduler to use to figure out what goes where.

In really basic terms, processes can be split into two categories:

  • CPU bound (more time spent using the CPU crunching numbers)
  • I/O bound (more time spent interacting with I/O devices and bouncing in and out of the CPU)

This is, roughly, how a scheduler sees processes. Now, based on this definition, I think it’s clearer how difficult it can be to implement a smart scheduler that’s able to detect lower-CPU-impact processes and move them to other cores.
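
For illustration, here’s about the only signal a classic scheduler really has to work with - the ratio of CPU time to wall-clock time over some sampling window (the 0.7 threshold is an arbitrary number I picked):

```python
# Toy classifier: over a sampling window, a process that spent most of its
# wall-clock time actually executing is "CPU bound"; one that mostly sat
# waiting on I/O is "I/O bound". The 0.7 threshold is an arbitrary example.

def classify(cpu_time_s: float, wall_time_s: float) -> str:
    ratio = cpu_time_s / wall_time_s if wall_time_s else 0.0
    return "CPU bound" if ratio > 0.7 else "I/O bound"

print(classify(cpu_time_s=0.95, wall_time_s=1.0))  # number cruncher -> CPU bound
print(classify(cpu_time_s=0.05, wall_time_s=1.0))  # chat app        -> I/O bound
```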

What should also be taken into consideration is that moving processes across cores is not efficient; it takes almost as long as spinning up a new one. So, at what point is keeping a process on the P cores more efficient than moving it to the E cores, once you take the context-switching time into account?
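
As a napkin-math answer to that question: say the P-core draws p_P watts, the E-core draws p_E watts but runs the job s times slower, and a migration costs some fixed chunk of energy. Then the move only pays off if enough work remains. Every number in this sketch is invented:

```python
# Break-even sketch for migrating a task from a P-core to an E-core.
# Moving pays off only if  p_P * t  >  p_E * s * t + E_migration,
# i.e. t > E_migration / (p_P - p_E * s). Every figure here is made up.

def break_even_seconds(p_p_w, p_e_w, slowdown, migration_mj):
    saved_per_sec_w = p_p_w - p_e_w * slowdown  # watts saved per second of work moved
    if saved_per_sec_w <= 0:
        return float("inf")                     # the E-core never wins for this task
    return (migration_mj / 1000.0) / saved_per_sec_w

t = break_even_seconds(p_p_w=12, p_e_w=3, slowdown=2.0, migration_mj=5)
print(f"Worth migrating only if more than ~{t * 1000:.1f} ms of work remain")  # ~0.8 ms
```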

I think that most of the figuring out, at this point, should be done by the OS preemptively, or with the help of the user - like knowing that messaging apps running in the background will never need a P core, so they’ll never be assigned to one. This knowledge might also come from a process-behaviour analysis done over time as the user keeps using the machine.
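
On Linux you can already do that last part by hand today with CPU affinity. A minimal sketch - note the assumption that the E-cores show up as logical CPUs 16-23, which is typical for a 12900K but should be verified with lscpu on the actual machine:

```python
# Pin the current process (and its future threads) to the E-cores only.
# ASSUMPTION: the E-cores are logical CPUs 16-23, which is typical for a
# 12900K but not guaranteed; verify with `lscpu --extended` first. Linux only.

import os

E_CORES = set(range(16, 24))       # assumed E-core logical CPU IDs
os.sched_setaffinity(0, E_CORES)   # pid 0 = the calling process

print("Now allowed on CPUs:", sorted(os.sched_getaffinity(0)))
```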

ARM CPUs, which have been implementing this solution for a decade now, still sometimes struggle with power efficiency, exactly for the reasons I cited before.

4 Likes

Just as a refresher on what the Scheduler has to handle on a basic level:
[image: basic process scheduling diagram]
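
In lieu of the picture: the usual textbook five-state process model is roughly what the scheduler is already juggling for every process, before anyone asks it to also guess which core type a process deserves:

```python
# The classic five-state process model, as a tiny table of transitions.
# This is the textbook picture, not any specific OS implementation.

from enum import Enum, auto

class State(Enum):
    NEW = auto()
    READY = auto()
    RUNNING = auto()
    WAITING = auto()
    TERMINATED = auto()

TRANSITIONS = {
    (State.NEW, State.READY):          "admitted",
    (State.READY, State.RUNNING):      "dispatched by the scheduler",
    (State.RUNNING, State.READY):      "preempted (time slice expired)",
    (State.RUNNING, State.WAITING):    "blocked on I/O or an event",
    (State.WAITING, State.READY):      "I/O or event completed",
    (State.RUNNING, State.TERMINATED): "exited",
}

for (src, dst), why in TRANSITIONS.items():
    print(f"{src.name:>10} -> {dst.name:<10} : {why}")
```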

Loading all those high-level marketing-wonderland jobs of “intelligently put the messenger service on an E-core” onto it is WAY over the scheduler’s pay grade. And even then, on startup the chat program may cause a lot of I/O from disk and network while loading memes or whatever.

You could also run into “thrashing”, where all that moving of processes from one core to the other bogs down the system with housekeeping tasks.

4 Likes

Thanks for that clarification. I knew something was not right with the way that I was typing it and it was nagging me.

2 Likes

This.

The scheduler essentially needs a crystal ball to know which core is best, as even tagging processes doesn’t mean the tag will always hold true.

Then there’s the race-to-sleep factor, where it may not always hold true that a P-core will consume more than an E-core if the system can spin the RAM and CPU down faster.

If, for example, a process at this particular time finishes its work twice as fast thanks to the larger cache on a P-core, it will also cause less RAM thrashing.
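
A quick race-to-sleep sketch of that point, over a fixed 10 ms window with completely made-up power and timing numbers:

```python
# Race-to-sleep napkin math over a fixed 10 ms window: a hungrier core that
# finishes sooner can still use less energy if idle power is low enough.
# Every number here is invented for illustration.

WINDOW_MS = 10
IDLE_W = 0.5                       # assumed package idle power

def window_energy_mj(active_w, active_ms):
    idle_ms = WINDOW_MS - active_ms
    return active_w * active_ms + IDLE_W * idle_ms   # W * ms = mJ

print("P-core:", window_energy_mj(active_w=10, active_ms=2), "mJ")  # 10*2 + 0.5*8 = 24 mJ
print("E-core:", window_energy_mj(active_w=4,  active_ms=7), "mJ")  # 4*7  + 0.5*3 = 29.5 mJ
# With these made-up numbers the P-core wins; nudge them and it flips,
# which is exactly why the scheduler can't know this in advance.
```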

Or future models of P/E cores where the efficiency balance changes.

And then how do we tag browsers or code editors like Visual Studio or KDevelop? Sometimes they’re just editors idling, sometimes they’re processing files, parsing, …

A browser may be running heavy games, 3D visualisation, or a free online CAD tool (you want a P core); other times it’s an animated advertisement eating CPU that you don’t care about while reading the news (should all be on an E-core), but it’s still the same browser process.

And then there’s the problem of priority inversion (linky for more detail): say you’ve got a task on a P-core repeatedly stalled, actively waiting to access the GPU to draw something, while that messenger app on an E-core takes its sweet time finishing all those draw calls and releasing the GPU to animate all those gifs & memes.

That will/may cause the overall system energy efficiency to be lower than keeping that messenger app on another P-core. And that may change according to what else is active on the system.
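
For anyone who wants to see the shape of that stall, here’s a tiny illustration - no real priorities or core pinning (Python threads don’t have those), just the “important” task sitting and waiting while the slow one holds the shared resource:

```python
# Conceptual illustration of the stall described above: the "important"
# render task needs a shared resource (a stand-in for the GPU) that a slow
# background task is holding. Python threads have no real priorities or core
# pinning, so this only shows the waiting, not the cores.

import threading
import time

gpu = threading.Lock()

def messenger_on_slow_core():
    with gpu:                      # grabs the "GPU" first...
        time.sleep(0.05)           # ...and takes its sweet time (50 ms)

def render_on_fast_core():
    t0 = time.perf_counter()
    with gpu:                      # the high-priority work has to wait
        pass
    print(f"render stalled for ~{(time.perf_counter() - t0) * 1000:.0f} ms")

bg = threading.Thread(target=messenger_on_slow_core)
bg.start()
time.sleep(0.001)                  # make sure the messenger wins the lock
fg = threading.Thread(target=render_on_fast_core)
fg.start()
bg.join()
fg.join()                          # prints roughly 49 ms of stalled time
```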

It’s a giant dynamic unpredictable mess of a problem to solve.

3 Likes

I see your point. I think what I meant by “IO” and “boring things” is “anything that is not directly single-core number crunching”. Which is definitely true: keeping the load on the CPU low by offloading tasks to dedicated processors like FPGAs, GPUs, DPUs and so on does yield performance improvements. E cores could be analogous, but for tasks that mostly run on the CPU.

Whether or not a task can be reliably assigned to the proper core is, I guess, the real problem. But it seems to work well in mobile, so I assume it must be possible.

1 Like

But that is still an interesting Computer Science problem to attempt to solve. That’s one of the things that keeps me employed. I need to get my Ph.D or Sc.D so that I can solve those kinds of issues. It would also get us one step closer to real AI.

2 Likes