Alder Lake as 2 nodes on a single die, split at the P-core/E-core boundary

Hi ladies and gents,

Warning, the following involves a software guy talking about hardware :wink: :

This morning I came across the PR from Intel to add support for more advanced hybrid scheduling as of kernel 5.18. I was also thinking the other day about Intel Alder Lake and its hybrid architecture, and what might be possible if we were to treat the 2 subcategories of cores as 2 different nodes on the same package.

I feel that some, if not many, of the lessons learned on AMD’s recent dual-chiplet systems might be applied to Alder Lake and future hybrid architectures to better support mixed ISAs.

What we saw with the likes of the 5900X was a dual-chiplet CPU with one high-bin chiplet and a second, lower-bin chiplet.

Through various subsystems, the scheduler can use knowledge of which cores are adjacent to each other or on the same CCX to make better use of L3 cache locality.
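
To see the cache topology the scheduler has to work with, Linux exposes which CPUs share each cache level under sysfs. Here is a minimal Python sketch, assuming the usual `/sys/devices/system/cpu/cpuN/cache/indexM/` layout with `level` and `shared_cpu_list` files. The function name `group_by_l3` is just mine for illustration.

```python
from collections import defaultdict
from pathlib import Path

def group_by_l3(sysfs_root="/sys/devices/system/cpu"):
    """Map each L3 'shared_cpu_list' string to the CPUs that report it."""
    groups = defaultdict(list)
    for cpu_dir in Path(sysfs_root).glob("cpu[0-9]*"):
        for idx in cpu_dir.glob("cache/index*"):
            level_file = idx / "level"
            # Only look at level-3 caches; skip entries without a level file.
            if not level_file.exists() or level_file.read_text().strip() != "3":
                continue
            shared = (idx / "shared_cpu_list").read_text().strip()
            groups[shared].append(int(cpu_dir.name[3:]))  # "cpu12" -> 12
    return {key: sorted(cpus) for key, cpus in groups.items()}

if __name__ == "__main__":
    for shared, cpus in sorted(group_by_l3().items()):
        print(f"L3 shared by {shared}: cpus {cpus}")
```

On a dual-chiplet part I would expect two groups, one per CCX. On a machine where every core shares one L3 there should be a single group, so L3 alone would not separate P-cores from E-cores.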

This is similar to what would happen in older dual-socket systems with the memory controller located outside of the CPUs themselves, giving a multi-node Uniform Memory Access system, but in this case on a single package and with 2 different installed CPUs.

Conceptually, I’m thinking of something from the LGA1366 days: something along the lines of one socket having a Nehalem X55X0 CPU and the other a Westmere X56X0 chip.

Same memory support, socket, IMC etc.

A slightly augmented instruction set (Westmere added some AES acceleration instructions over Nehalem, for example).

I don’t know if such a system existed (I doubt it did).

My question is, since we already deal with telling the scheduler to keep threads away from certain cores based on their location and performance characteristics, can we extend this idea to splitting out parts of the hybrid CPU into a form of local clustering?
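
As a starting point for that kind of clustering, the two core types can be told apart from user space. This is a rough Python sketch, assuming a hybrid-aware kernel that exposes the perf PMU nodes `/sys/devices/cpu_core/cpus` and `/sys/devices/cpu_atom/cpus`. The helper names `parse_cpu_list` and `hybrid_clusters` are hypothetical, and on a non-hybrid machine the result is simply empty.

```python
from pathlib import Path

def parse_cpu_list(text):
    """Expand a kernel CPU list such as '0-15,20' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return sorted(cpus)

def hybrid_clusters(devices_root="/sys/devices"):
    """Return {'P': [...], 'E': [...]} from the hybrid PMU nodes, if present."""
    clusters = {}
    for label, pmu in (("P", "cpu_core"), ("E", "cpu_atom")):
        cpus_file = Path(devices_root) / pmu / "cpus"
        if cpus_file.exists():
            clusters[label] = parse_cpu_list(cpus_file.read_text())
    return clusters

if __name__ == "__main__":
    print(hybrid_clusters())
```

The two lists could then be treated as the two "nodes" in whatever placement policy sits on top.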

Well, Raja from Intel messaged me back on Twitter about this being of interest. Maybe this goes somewhere?

Interesting idea; in theory it is possible. Really, though, what needs to happen is that background or low-priority tasks get farmed off to the E-cores, freeing up all the resources of the P-cores for maximum performance. On laptops you would then want to keep most things on the E-cores unless it’s something like a foreground task which benefits UX.

You could kind of set the perf levels in Windows so that the E-cores rank as the lowest cores too, but that won’t necessarily fix the problem, because that ranking is only about performance and not about power consumption or physical topology.

With the AMD chips, the “best” core is always on one die, but sometimes the second-best core isn’t on the same die. I think AMD tried to configure it so that the two “best cores” for Windows would be on the same die, but then each die has a “best” core as well.
There is more work to be done here as well, because placing tasks on cores next to each other is not always advantageous. Power consumption, heat, and total cache also matter. Some applications use quite a bit of cache, and two heavy tasks running at once might work better if they are on different chiplets, just like how a cluster of Atom cores (E-cores) might do better than 2 P-cores which are generating way more heat.
So everything is a trade-off. The question is how to make the subsystem intelligent enough to pick the proper task shifting and topology usage based on the type of task or even the application itself, since not all applications of the same “type” necessarily work the same way or use resources in the same manner.

Just some thoughts

Currently diagramming out what a POC would look like.

For a start, I’m currently considering having user-space apps bound to the P-cores only, with kernel space locked to the E-cores.
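
The user-space half of that can be prototyped without touching the kernel, using CPU affinity. A minimal Python sketch follows, assuming Linux (`os.sched_getaffinity` and `os.sched_setaffinity` are Linux-only) and assuming the P-core logical CPUs are 0-15, which is how I would expect an 8P+8E part with Hyper-Threading to enumerate; verify that on the actual machine. `pin_to` is a hypothetical helper name.

```python
import os

def pin_to(cpus, pid=0):
    """Restrict `pid` (0 = calling process) to the requested CPUs it may already use."""
    allowed = os.sched_getaffinity(pid)
    target = allowed & set(cpus)
    if not target:
        raise ValueError("none of the requested CPUs are available to this process")
    os.sched_setaffinity(pid, target)
    return sorted(target)

if __name__ == "__main__":
    # Assumption: logical CPUs 0-15 are the P-core threads on this machine.
    P_CORES = range(0, 16)
    print("now restricted to", pin_to(P_CORES))
```

Child processes inherit the affinity mask, so pinning a launcher or shell this way carries over to whatever it starts. Keeping kernel work (interrupts, kernel threads) on the E-cores is a separate problem that this sketch does not address.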

The reason for such a simplistic approach is that the kernel doesn’t use any AVX-512 code at the moment, outside of some CPU RAID calculations which can be disabled.

That could work as well. Kernel tasks though are quite simple in general and I doubt they could keep the E-Cores busy. Also if all I/O had to go through the E-Cores it might lead to lower performance since those are technically Atom cores which aren’t great for I/O.

Maybe Intel should just make a proper big.LITTLE arch (one with the same instruction set across cores) rather than raiding whatever 6-year-old crap they have in the parts bin for the E-cores.

Part of my thinking is that, even if they aren’t fully utilized, by shifting kernel tasks to the Gracemont E-cores we can avoid dealing with context switching on the P-cores.

We also get the added benefit of being able to remove some of the current Spectre/Meltdown mitigations, since kernel data isn’t on the P-cores in the first place.

Beyond that, the Gracemont design of the Alder Lake E-cores is a ridiculous 17-execution-port-wide design, allowing lots of simple, short tasks to be done in parallel, even without SMT.

Well, that would require too much time and cost. E-cores are just Atom cores, which are funnily enough very similar to Bulldozer cores. They use clustered multi-threading, and this is why they have such nice parallel performance while having such bad single-thread performance.
It’s basically actual high-end Intel performance cores combined with an FX-8350 at slightly lower clocks for the E-cores (on a more modern process), combined with the old instruction set.

You could try that and see, but I still wonder about the performance, considering the I/O for the whole system would go through the Atom cores. I know that many Windows I/O tasks are still single-threaded, and that could present a serious bottleneck.
I know that my J1800 Atom cores could only saturate 1 Gbit/s over Samba at the 2.6 GHz boost clock; otherwise they could only muster around 400 Mbit/s. When it comes to things like PCIe SSDs (NVMe, obviously) the I/O throughput is much higher. We aren’t talking about the Samba protocol itself here; it is just an example. All my Atom machines (and I have tested tons of them) have serious I/O issues, and this is part of what makes Atom machines FEEL so damned slow.

I can’t speak to Windows much, if at all; this is mainly a Linux-focused endeavor. (Perhaps BSD down the line, but that’s way out of my area of knowledge.)

Thanks for sharing the Atom data :thinking:

Data on the E-cores alone is hard to come by, but thankfully the frequency is significantly higher than it used to be.

Yes, at 3 GHz+ they should be able to sustain decent throughput, assuming that they don’t have the same I/O bottlenecks with NVMe, etc.
I have tested tons of mini PCs with modern Atom CPUs and they still seem to struggle with I/O in various ways.

Time will tell if they can be limited to processing background tasks to create a more secure barrier. With Intel doubling the E-cores for Raptor Lake, though, they definitely won’t want to support relegating them to background tasks, as they are the “foundation” of the gen-on-gen increase in “performance” that Intel is claiming. They need those E-cores to drive home the productivity wins that Intel is counting on.