What kind of implications does a Hybrid CPU design have for performance?

Just a small note: big.LITTLE is not just an ARM trademark but a rather specific way of bringing different core designs together. In its original form, big.LITTLE wasn’t even really heterogeneous; that came a lot later under a new name (ARM DynamIQ). People just keep the name big.LITTLE alive.
big.LITTLE had a very specific purpose: reducing the idle power consumption of mobile devices.
All modern CPU architectures (at least high-performance ones) use out-of-order execution (OOE) to enable much better performance. But there is a problem: OOE increases power consumption by a huge margin, and that consumption doesn’t drop when the CPU does nothing. Therefore ARM created big cores with OOE and LITTLE cores without it (plus some minor changes like smaller caches, fewer execution units, etc.).
In something like a phone you want high-performance cores with OOE to give the user a great UX, while still being able to do nothing but update some notifications for a while.
big.LITTLE solved that by pairing a four-core big cluster with a four-core LITTLE cluster and switching between them at run time.

You always had either the big or the LITTLE cores active, never both at the same time. Therefore you didn’t even need a heterogeneous scheduler, since there was no hybrid mode. To the OS it was more like a power-state switch, nothing more.
Later big.LITTLE became more flexible, and eventually it was replaced by DynamIQ. But the new name never really caught on.

3 Likes

Thanks for the informative post. I have honestly never heard of DynamIQ. I also did not realize that the big.LITTLE concept was originally architecture-agnostic.

1 Like

Thought I’d throw in two items that caught my attention…

1.) I HOPE Zen 5 or any other future architecture does not try to recycle older architectures unless it’s for their mobile or lower-wattage versions, like the 35 W parts. I’d rather see the newer architecture powered down to be more efficient.

Along with this… I have a question, and it’s probably dumb… BUT here goes.

I’ve spent some time overclocking in the not-too-distant past with both AMD and Intel. Intel had an adaptive voltage function for its cores that could be set (which usually required an offset) and really drops idle power. Now, the Windows high-performance power plan keeps the cores at max clock for snappier response at the cost of more power, but the actual response difference is negligible, at least from a user standpoint. AMD is trickier; I have yet to decipher how they limit max power, or whether they have an adaptive voltage option on their chips. I’m not sure if it’s a conflict between the board’s BIOS and the software drivers. I’ve read up on the chip settings, but their chip drivers and the documentation of how they “boost” the chip are very vague. I know it uses temperatures and power draw, but it won’t seem to give you much control.

Now back to chiplets and big.LITTLE…
If a chip can down-power itself via the BIOS (i.e. from a 95 W TDP to, say, 35 W), and the BIOS can do per-core control, and AMD already has chiplets (i.e. two 8-core dies in the x900 series), why can’t one chiplet be power-limited for E-tasks (or even individual cores, say the ones that can’t clock as high) while the other die handles all the P-tasks, until more performance is needed and the rest of the cores chime in and power themselves up? I know this is probably an oversimplification, and I’m no tech guru, but it seems that would be possible even with a monolithic chip like older Intel parts.

Just my thoughts. I feel that just installing efficiency cores (at least for desktop systems) leaves performance on the table. I also think AMD could lower its power consumption if it had better adaptive power usage; as it stands, when you try to OC, the power is statically set to excess.

I know silicon performance is limited by manufacturing quality, and there is a silicon lottery, so to speak, for the power needed to hit a given performance level, but I think there’s more that could be done to optimize before just throwing the big/little architecture out there. It feels like a ploy by Intel to say “hey look, we have more cores, like AMD.” Again, I’m ignorant lol.

I don’t really agree with this. Intel is not utilising efficiency and performance cores because they want to get into phones; they’re doing so because ARM’s hybrid design is making massive leaps into the server space, and those servers are running Linux, which is the actual point I made. Again, you mentioned Android, but it’s not at all relevant to my point that standard desktop Linux supports these processors, so it doesn’t make sense that it wouldn’t be able to handle Intel doing the same thing.

1 Like

RedGamingTech is reporting that the Zen4C “little cores” on Zen5 “are likely for specific SKUs only, such as APUs/lower power. High performance chips (TR, Epyc, and desktop) will only feature Zen5 cores.”

This is a huge development, as it appears AMD is not going big.LITTLE for the entire Zen5 lineup, which makes sense because they’re not locked into a monolithic design like Intel. Also, Zen5 is going to use a shared L2 cache on the fabric, so while the per-core/chiplet amount may be lower, overall there should still be an improvement in at least efficiency.

2 Likes

Right now AMD is not really demonstrating a need for little cores. For workstation tasks the 5950X and 12900K are extremely close to each other in 16-core head-to-head tests, but the 5950X uses dramatically less power, even though eight of Intel’s cores are efficiency cores going up against eight more performance cores. I think in the long run, if AMD wants to stay competitive in the server space, then big.LITTLE is going to become necessary, but I don’t think the home desktop market is especially crying out for the design unless power usage gets out of control like Intel has allowed it to.

I’m not aware of any ARM server using a big.LITTLE or heterogeneous/hybrid architecture; Graviton/Annapurna, Cavium, Ampere, NXP, and NVIDIA Grace are all homogeneous, as has been every server chip I can think of.

You’ve hit the nail on the head. Intel was thinking this way back with Haswell, when they went with integrated voltage regulators on the CPU die itself, so that power-level switching (think C-states) could be accomplished orders of magnitude faster than when the CPU relied on the motherboard’s VRM to transition power levels. These on-die buck converters ran well north of 100 MHz (compared to a motherboard VRM switching frequency of <1 MHz), so it’s not hard to imagine how they got the power-level response times down. Overclockers were very upset with this development at the time because it limited overclocking potential: the VRMs on motherboards were relegated to an intermediate stage of power conditioning.

Downclocking and undervolting almost any CPU can result in incredible efficiency gains, making the need for another type of CPU core not very attractive. Remember when the ASUS ZenFone 2 came out and beat typical ARM Android phones (with similar battery capacity) on battery life? That phone used a downclocked x86 CPU (it wasn’t very performant though; it was about 15% slower than a comparable ARM phone at the time).
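If you want to play with the downclocking half of this on Linux, the cpufreq sysfs interface lets you cap a core’s max clock without touching the BIOS. Here’s a minimal, untested sketch in C; it assumes a scaling driver that exposes scaling_max_freq and needs root. (Undervolting itself lives in firmware, not in sysfs.)

```c
/* Minimal sketch: cap one core's max clock via the Linux cpufreq sysfs.
 * Assumes the scaling driver exposes scaling_max_freq (value in kHz).
 * Must run as root. */
#include <stdio.h>
#include <stdlib.h>

static int cap_core_khz(int cpu, long khz)
{
    char path[128];
    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_max_freq", cpu);
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return -1; }
    fprintf(f, "%ld\n", khz);
    return fclose(f);
}

int main(void)
{
    /* Example: limit core 0 to 2.2 GHz and watch power/temps drop. */
    return cap_core_khz(0, 2200000) ? EXIT_FAILURE : EXIT_SUCCESS;
}
```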

2 Likes

My best experience was my i7-8086K with overclocking/undervolting via adaptive voltage. I managed to get it down to 1.27 V max with all cores OC’d to 5 GHz. It would idle around 0.63 V. I did a delid/liquid-metal treatment and a custom loop. I can run Prime95, even with AVX, and stay stable without getting past 70-75 °C. Best chip I’ve ever had. It was really a gamble doing it, but fun. I haven’t been able to mess with anything newer besides Ryzen, and that’s been kind of a disappointment honestly. Probably from my lack of knowledge, but I can’t seem to figure out how to OC it properly.

You could, but the thing is there’s no point in doing that on Ryzen; they essentially do it on their own automatically.

They’re really aggressive on power saving and manage to be both efficient and performant at the same time.

It’s the reason Intel is desperate right now and had to add E-cores.
They rested on their laurels for years, closed their best R&D department, and AMD just zoomed right past them blowing a La Cucaracha car horn.

Each pack of four cores has an independent power-saving down-clock, and each core has an independent boost.

For example (and this is one of their desktop chips!), here’s idle on the left and one core loaded (#4) on the right:

All the other cores (except #12, which was busy updating zenmonitor + Xorg) are consuming nothing.
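You can watch this per-core behavior yourself on Linux without zenmonitor; each core’s current clock is exposed through the cpufreq sysfs. A rough sketch in C that polls every core once (assumes the driver exposes scaling_cur_freq):

```c
/* Rough sketch: print each core's current clock from cpufreq sysfs.
 * scaling_cur_freq reports kHz; idle Ryzen cores sit far below base clock. */
#include <stdio.h>

int main(void)
{
    for (int cpu = 0; ; cpu++) {
        char path[128];
        long khz;
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_cur_freq", cpu);
        FILE *f = fopen(path, "r");
        if (!f)
            break;                      /* no more cores */
        if (fscanf(f, "%ld", &khz) == 1)
            printf("cpu%-3d %8.1f MHz\n", cpu, khz / 1000.0);
        fclose(f);
    }
    return 0;
}
```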

Race-to-sleep is the proper strategy on Ryzen.

Package power includes the fabric, memory controller, caches, and other parts of the SoC.

And all that other stuff has to wake up (even on Intel) to support an active P-core, or on Ryzen to service a core running at full tilt.

So having one chiplet forced to a lower clock/power limit and one chiplet at full speed won’t save much (if anything), because the entire package overhead goes up as soon as one core runs hard (it needs RAM, cache, the PCIe bus, etc.).
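You can actually see that package overhead on Linux through the powercap/RAPL interface (recent kernels expose the same intel-rapl powercap zone on AMD Zen parts too, assuming the rapl driver is loaded). A sketch that samples the package energy counter over one second, so the microjoule delta divided by 1e6 is watts; the zone path is the usual default and may differ on your system:

```c
/* Sketch: measure package power over one second via Linux powercap/RAPL.
 * energy_uj is a monotonically increasing counter in microjoules.
 * The zone path below is the common default; adjust for your machine. */
#include <stdio.h>
#include <unistd.h>

static long long read_energy_uj(void)
{
    FILE *f = fopen(
        "/sys/class/powercap/intel-rapl/intel-rapl:0/energy_uj", "r");
    long long uj = -1;
    if (f) { fscanf(f, "%lld", &uj); fclose(f); }
    return uj;
}

int main(void)
{
    long long before = read_energy_uj();
    sleep(1);                         /* or run your workload here */
    long long after = read_energy_uj();
    if (before >= 0 && after >= before)
        printf("package power: %.2f W\n", (after - before) / 1e6);
    return 0;
}
```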

What I think happens here regarding overclocking (mostly speculation on my part) is that AMD is already “OCing” the cores for you to their stable limit, while still supporting power-saving adaptive clock and voltage changes that are carefully tuned via pre-calculated tables in the System Management Unit (SMU). When you push Ryzen past those numbers, there is simply no information available about which voltage/multiplier combinations each core can handle at, say, a 105 MHz base clock instead of the standard 100 MHz, so the SMU can’t adjust anything to save power.

But the above paragraph is just me guessing.

Because Ryzen is already super aggressive about voltages and clocks in relation to temperatures & power saving.

I got a 1950X that I tried OC’ing a bit when I got it, just for giggles, and I simply could not get it stable at the same performance level it reaches when left alone to pick the clocks it wants.
If I tried to OC, it would nope out at 3.7 GHz and crash and burn randomly.

But when left alone on stock settings it peaks at 4.1 GHz on a few cores (it’s supposed to max out at 4.0 GHz :man_shrugging: maybe it’s a reporting glitch? but the CPU frequency monitor says 4.1 GHz sometimes), launches all-core loads starting at 3.8 GHz, and gradually settles to 3.5 GHz.

And that’s a 1st-gen Ryzen.
The 2nd, 2+, and 3rd gens got even more granular/aggressive in how they adjust clock & voltage.

They’re already flying so close to the sun.

1 Like

Not to get too off topic, but I agree; you are correct about the problems with overclocking. I know the chips are “smart,” and I agree they do the work for you to get maximum performance on their own… My question is: if the user HAD the ability to undervolt while still allowing the chip to use variable power, what kind of performance could we really get from a Ryzen chip given adequate cooling, and would it lower the power actually needed to hit those clocks?

Take my Intel chip, for example… it believed it needed 1.35 to 1.45 V to hit 5.0 GHz, making the chip run VERY hot, up to the 85-90 °C mark in Prime95. I was able to manually set the ceiling to 1.27 V, and 5.0 GHz was completely stable (again, this depends on the quality of the silicon), which brought temps down to roughly 65-70 °C. All I was wondering is whether AMD’s algorithms shoot for the MAXIMUM power needed to plow through to a given clock target when less might actually be enough.

Am I mistaken? That might go some way to explaining why the Linux kernel is less adapted to hybrid CPU designs than I presumed.

3 Likes

Part of the problem with Windows and scheduling is this:

They have a scheduler. They have > 1 billion machines out there that will get updates if they push an update to the scheduler.

That’s 1 billion machines full of badly written third-party drivers (some of which will never be updated), software corner cases, etc.

Making ANY change to the Windows scheduler is going to be a freaking nightmare. This is why Threadripper was so badly supported, and that was just a minor tweak to the existing NUMA scheduler they had (i.e., treating a single socket sort of like two sockets with TR 1.0).

big.LITTLE-style scheduling is a huge paradigm shift for the way the Windows scheduler works, so expect Microsoft to be very reluctant to port it to the mainstream version of Windows. My bet is that this is why they’re limiting it to Windows 11 at the moment.

Apple may look like they’ve just pulled this out of their arse at short notice, but the work to make this happen in macOS has been in development for YEARS, literally a decade or more at this point.

I’d wager GCD (Grand Central Dispatch) is core to this, and that goes back to 2009. I.e., Apple started down this path (making the OS more responsible for thread scheduling) 13 freaking years ago. They’ve been shipping big.LITTLE in the mobile space (which essentially uses a kernel/OS very closely related to macOS) for what… 5-plus years now?

Microsoft and Intel are at least half a decade behind at the moment in terms of software/hardware support for this concept, and the ecosystem they play in is FAR more resistant to change and fraught with far more difficult circumstances. They can’t just rip out and replace the entire software stack like Apple can, because there’s so much reliance on shitty third-party software drivers.

GCD has built-in support for different work-unit priorities (QoS for threads, essentially), which lends itself very well to deciding whether to put the work on performance or efficiency cores, and it’s adaptable to current operating conditions, not baked into the application at compile time.

Software has been able to make use of GCD for other purposes for over a decade now, and now that hybrid processors are in the mix, GCD can effectively be tweaked to make use of them properly without any application changes…
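To make that concrete: the QoS classes are right there in libdispatch’s plain C API, and the OS decides where the work actually lands; on Apple silicon, background-QoS work tends to get steered toward the E-cores. A minimal sketch (builds with clang on macOS):

```c
/* Minimal libdispatch sketch in plain C. The QoS class is the work's
 * "tag"; the OS decides which core type services it. On Apple silicon,
 * QOS_CLASS_BACKGROUND work tends to land on the E-cores. */
#include <dispatch/dispatch.h>
#include <stdio.h>

static void heavy_work(void *ctx) { (void)ctx; puts("user-initiated: wants a P-core"); }
static void chore(void *ctx)      { (void)ctx; puts("background: fine on an E-core"); }

int main(void)
{
    dispatch_queue_t hot  = dispatch_get_global_queue(QOS_CLASS_USER_INITIATED, 0);
    dispatch_queue_t cold = dispatch_get_global_queue(QOS_CLASS_BACKGROUND, 0);

    dispatch_group_t done = dispatch_group_create();
    dispatch_group_async_f(done, hot,  NULL, heavy_work);
    dispatch_group_async_f(done, cold, NULL, chore);
    dispatch_group_wait(done, DISPATCH_TIME_FOREVER);  /* block until both ran */
    return 0;
}
```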

I swear, Apple’s software team (well, the core OS/library guys at least) are freaking amazing. Like NeXT-level amazing (see what I did there?). No, they aren’t perfect. Yes, there are bugs. But the software concepts they put out decades in advance are so far ahead of their time.

While Intel and Microsoft are playing checkers, the Apple OS team are playing fucking 4D chess, and what you see them doing in software today is setting up for 5-10 years out.

3 Likes

(CC @MazeFrame)

Depends on what you mean by real time. Hard real-time stuff generally runs on in-order CPUs, because it’s difficult to provide hard timing guarantees otherwise. And even then, the work is probably core-locked, with the cores communicating via mailboxes.
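For illustration, here’s a rough user-space approximation of that core-locked-plus-mailbox shape on Linux (GNU extensions, assumes at least two cores); real hard real-time systems do this on bare metal with hardware mailboxes, so treat it as a sketch of the pattern, not the practice:

```c
/* Rough user-space illustration of "core-locked + mailbox" (Linux/glibc).
 * Each thread is pinned to its own core, and they communicate only
 * through a single-slot atomic mailbox instead of sharing state freely. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int mailbox = -1;            /* -1 == slot empty */

static void pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof set, &set);
}

static void *producer(void *arg)
{
    (void)arg;
    pin_to_core(0);
    for (int msg = 0; msg < 3; msg++) {
        int empty = -1;
        /* busy-wait until the slot is free, then deposit the message */
        while (!atomic_compare_exchange_weak(&mailbox, &empty, msg))
            empty = -1;
    }
    return NULL;
}

static void *consumer(void *arg)
{
    (void)arg;
    pin_to_core(1);
    for (int got = 0; got < 3; got++) {
        int msg;
        while ((msg = atomic_exchange(&mailbox, -1)) == -1)
            ;                              /* poll until non-empty */
        printf("core 1 received %d\n", msg);
    }
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```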

For safety-critical systems you have two or more cores working in lockstep to detect faults in them (it’s a feature of ARM Cortex-M, Cortex-R, and some Cortex-A cores).

I believe the technically correct term here is “heterogeneous processors”, or as I like to shorten it, “hetero processors”.

3 Likes

The Fujitsu A64FX CPUs used in the Fugaku supercomputer (the most powerful worldwide) use a more-or-less hybrid design. It’s a 48+4 design, but all the cores use the same architecture. In this case it’s all about utilizing the SoC’s resources.

2 Likes

Ahh, I had forgotten all about that one!
Apparently the Chinese HPC CPUs (Sunway) use manager cores for each cluster of compute cores, similar-ish to how the Japanese one is architected.
There must be more benefits than I had originally thought to dedicating one whole core to managing the workloads of clusters of “working” cores, if multiple companies and countries gravitated toward that approach.

It’s late as hell, and I still don’t follow. So the hardware add-ons weren’t able to be purchased, only rented? You’ll have to post some pics.

Maybe they can do something analogous to how disclosing dyes or markers work in the world of medicine. In the metaphorical sense, of course. Could processes not carry their own tags and be instantly recognizable to the scheduler?
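Funnily enough, Linux grew something close to that: utilization clamps (uclamp, kernel 5.3+ with CONFIG_UCLAMP_TASK) let a task carry a hint the scheduler reads when placing it, and it’s part of how Android steers work between big and little cores. There’s no glibc wrapper for sched_setattr, so a sketch has to go through syscall(); struct layout per the sched_setattr(2) man page:

```c
/* Sketch: tag the current task with a low utilization clamp (uclamp),
 * a placement hint read by the Linux scheduler (5.3+, CONFIG_UCLAMP_TASK).
 * No glibc wrapper exists for sched_setattr, so we use syscall() and
 * declare the struct ourselves, per the sched_setattr(2) man page. */
#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

struct sched_attr {
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;
    uint32_t sched_priority;
    uint64_t sched_runtime, sched_deadline, sched_period;
    uint32_t sched_util_min, sched_util_max;
};

#define SCHED_FLAG_KEEP_ALL       0x18  /* keep current policy and params */
#define SCHED_FLAG_UTIL_CLAMP_MAX 0x40

int main(void)
{
    struct sched_attr attr = {
        .size           = sizeof attr,
        .sched_flags    = SCHED_FLAG_KEEP_ALL | SCHED_FLAG_UTIL_CLAMP_MAX,
        .sched_util_max = 128,   /* "I'm a background chore" (scale 0-1024) */
    };
    if (syscall(SYS_sched_setattr, 0 /* self */, &attr, 0) != 0)
        perror("sched_setattr");
    /* ... run the low-priority work here ... */
    return 0;
}
```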

And to go off on a tangent here, Nova Lake is looking very, VERY enticing; I think that’s the CPU architecture to wait for. I really, REALLY need to upgrade, and I’m in a super-shitty situation: 4 cores, 4 threads, and a 7-year-old GPU. But I don’t want to get something crazy, or anything at all right now. Zen 4 is already sounding interesting, too, but Intel is saying Nova Lake will be the greatest microarchitectural change since the Core product line. That statement alone sounds intriguing.

Yes, that’s exactly what I meant to say when I said

I think there’s surely work to be done before programs land in the process queue to increase the efficiency of heterogeneous CPU architectures.

I don’t want to sound too much like a Debbie Downer, but maybe they’re just trying to keep the investors happy after the spanking they suddenly got from AMD once Ryzen came out. But there must be some truth to it as well. Cautious skepticism is a good way to go about it, in my opinion.

No, the hardware you bought. The services that worked with those hardware add-ons were mostly rental services.

Example: the Famicom Disk System was an add-on that you bought. Where did you get the games? Rental kiosks at 7-Eleven and FamilyMart. You would sign in with your Nintendo account if you paid for the subscription service, or you bought the games à la carte and put them on the floppy disk. When the kiosk system went away, if your floppy went bad then you lost access to your games. And we are talking about a Japanese company; the disks were proprietary, so good luck hacking together your own hardware at the time to get cheap access to this system.

WOW. Sounds confusing and crazy. Well, at least that faded into obscurity, and they never made a console that cost $600 and suffered a total hardware failure within a couple of years of purchase. Not to mention glorified it as if it cured cancer just by sitting in your damn living room. Fuckin’ Sony/PS3.

But that is really interesting. WOW. Still, I remember the N64 had some b/c; someone showed it off in a YouTube video. I’m thinking the Famicom Disk System was more a case of shitty design thinking? At that time everything had to be physical-hardware-based, and the internet wasn’t where it is now, of course. I don’t know. All of the evidence, and Nintendo’s track record from the 5th generation onwards, shows that they’re very innovative and forward-thinking. They don’t like following anyone else.

You don’t have to tell me that. As a gamer, I’ve been exposed to rumors followed by contradictions, time and time again. Typical game-industry hype. It usually ends up as trash, or the polar opposite.

Not to keep the off-topic going, but there were annoyingly quite a few region-specific additions to games too, and this continues even now, with the likes of the Pokémon games getting Japan-only events with one-off Pokémon, additional areas, and such.

The rewritable disks above also featured entirely unique games that were never issued outside that system. So buying a Nintendo console and games anywhere but Japan literally means you were never getting the full console or library. It is really annoying, or it would be if Nintendo weren’t so awful that I avoid them anyway.

For preservationists this is maddening.

Thankfully The Cutting Room Floor wiki, a record of cut and unused content in games, has many if not all of the one-off games dumped to ROM now. Interesting for hardcore fans of Zelda and the like, as there was at least one download-only, temporarily available Zelda game.