Warning: the following involves a software guy talking about hardware.
This morning I came across the PR from Intel to add support for more advanced hybrid scheduling as of kernel 5.18. I was also thinking the other day about Intel Alder Lake and its hybrid architecture, and what might be possible if we were to treat the two subcategories of cores as two different nodes on the same package.
I feel that some, if not many, of the lessons learned on AMD’s recent dual-chiplet systems might be applied to Alder Lake and future hybrid architectures to better support mixed ISAs.
What we saw with the likes of the 5900X was a dual-chiplet CPU with one high-bin chiplet and a second, lower-bin chiplet.
Through various subsystems, the scheduler can use knowledge of which cores are adjacent to each other or on the same CCX to make better use of L3 cache locality.
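To make the CCX point a bit more concrete, here is a minimal sketch of how you could ask a Linux box which logical CPUs share an L3. It assumes the usual sysfs cache layout and that `index3` is the L3 (typical on x86, but not guaranteed; the `level` file next to it is the authoritative answer). The function names are my own and purely illustrative.

```python
import glob
import os


def parse_cpu_list(text):
    """Parse a sysfs CPU list such as '0-5,12-17' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def l3_groups(sysfs_root="/sys/devices/system/cpu"):
    """Return the distinct sets of logical CPUs that share one L3 cache.

    Assumption: cache/index3 is the L3. Returns an empty list if the
    files are missing (non-Linux system, or no L3 exposed).
    """
    pattern = os.path.join(
        sysfs_root, "cpu[0-9]*", "cache", "index3", "shared_cpu_list"
    )
    groups = set()
    for path in glob.glob(pattern):
        with open(path) as f:
            groups.add(tuple(parse_cpu_list(f.read())))
    return sorted(groups)
```

On a dual-chiplet part, calling `l3_groups()` should return two groups, one per chiplet, which is the same information the scheduler uses for cache locality.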
This is similar to what happened in older dual-socket systems where the memory controller sat outside the CPUs themselves, giving a multi-node Uniform Memory Access system, except that in this case it is on a single package with two different CPUs installed.
Conceptually, I’m thinking of something from the LGA1366 days: something along the lines of one socket holding a Nehalem X55X0 CPU and the other a Westmere X56X0 chip.
Same memory support, socket, IMC etc.
Slightly augmented instruction set (Westmere added AES acceleration instructions over Nehalem, for example).
I don’t know if such a system existed (I doubt it did).
My question is, since we already deal with telling the scheduler to keep threads away from certain cores based on their location and performance characteristics, can we extend this idea to splitting out parts of the hybrid CPU into a form of local clustering?
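A rough sketch of what I mean, using the Nehalem/Westmere thought experiment: model each part of the package as a "node" with its own CPU set and ISA features, and only consider nodes that can actually run the thread. Everything here is hypothetical (the `Node` type, the CPU numbering, and especially the `relative_perf` numbers, which are made up); it is a toy model, not how any real scheduler does it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    cpus: frozenset
    features: frozenset      # ISA extensions this node supports
    relative_perf: float     # made-up per-core performance score


def eligible_nodes(nodes, required_features):
    """Nodes whose ISA features cover everything the thread needs."""
    required = frozenset(required_features)
    return [n for n in nodes if required <= n.features]


def pick_node(nodes, required_features, prefer_fast=True):
    """Pick the fastest (or slowest) eligible node, or None if none fits."""
    candidates = eligible_nodes(nodes, required_features)
    if not candidates:
        return None
    choose = max if prefer_fast else min
    return choose(candidates, key=lambda n: n.relative_perf)


# Hypothetical two-node package, loosely modelled on the example above.
NODES = [
    Node("nehalem", frozenset(range(0, 4)), frozenset({"sse4.2"}), 1.0),
    Node("westmere", frozenset(range(4, 10)), frozenset({"sse4.2", "aes"}), 1.1),
]

if __name__ == "__main__":
    print(pick_node(NODES, {"aes"}).name)                  # needs AES
    print(pick_node(NODES, set(), prefer_fast=False).name) # background work
```

The point of the feature check is the mixed-ISA case: a thread that uses AES instructions can only ever land on the node that has them, while anything else is free to go wherever the policy prefers.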
Interesting idea. In theory it is possible. What really needs to happen, though, is that background/useless tasks get farmed off to the E-Cores, freeing up all the resources of the P cores for maximum performance. On laptops you would then want to keep most things on the E-Cores unless it’s something like a foreground task that benefits UX.
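If you want to try the "background stuff on E-Cores" idea by hand on Linux, something like the sketch below is one way to do it. It assumes the kernel exposes the hybrid PMU device at `/sys/devices/cpu_atom/cpus` (I believe recent kernels do this on Alder Lake, but treat the path as an assumption) and uses `os.sched_setaffinity`, which is Linux-only. The helper names are mine.

```python
import os


def parse_cpu_list(text):
    """Parse a sysfs CPU list such as '16-23' or '0-3,8' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if part:
            lo, _, hi = part.partition("-")
            cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


def read_e_cores(path="/sys/devices/cpu_atom/cpus"):
    """Return the E-core CPU numbers, or an empty set if the file is absent.

    Assumption: the kernel exposes the E-core PMU at this path.
    """
    try:
        with open(path) as f:
            return parse_cpu_list(f.read())
    except OSError:
        return set()


def pin_to_e_cores(pid=0):
    """Restrict a process (0 = the caller) to the E-cores. True on success."""
    e_cores = read_e_cores()
    if not e_cores or not hasattr(os, "sched_setaffinity"):
        return False  # not a hybrid CPU, or not Linux
    try:
        os.sched_setaffinity(pid, e_cores)
    except OSError:
        return False  # e.g. those CPUs are not allowed for this process
    return True
```

Note that this is a hard pin, which is blunter than what a hybrid-aware scheduler does: the task can never borrow a P core, even when the P cores are idle.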
You could also mark the E-cores in Windows as the lowest-performance cores, but that won’t necessarily fix the problem, because that setting only describes performance and not power consumption or physical topology.
With the AMD chips, the “best” core is always on one die, but sometimes the second-best core isn’t on the same die. I think AMD tried to configure it so that the two “best cores” for Windows would be on the same die, but then each die has a “best” core as well.
There is more work to be done here as well, because placing tasks on cores next to each other is not always advantageous. Power consumption, heat, and total cache also matter. Some applications use quite a bit of cache, and two heavy tasks running at once might work better if they are on different chiplets, just like how a cluster of Atom cores (E-Cores) might do better than 2 P cores that generate far more heat.
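The cache part of that trade-off can be boiled down to a toy rule: if two heavy tasks together need more cache than one chiplet has, put them on different chiplets. The sketch below does only that. The function name and the working-set inputs are hypothetical, the 32 MiB default is the per-chiplet L3 on something like the 5900X, and it deliberately ignores power and heat, which a real policy could not.

```python
def place_pair(ws_a_mib, ws_b_mib, l3_per_chiplet_mib=32):
    """Decide whether two tasks should share a chiplet.

    ws_a_mib / ws_b_mib: estimated cache working sets in MiB (hypothetical
    inputs; measuring these is the hard part in practice).
    Returns 'same' if both fit in one chiplet's L3, otherwise 'split'.
    """
    fits_together = ws_a_mib + ws_b_mib <= l3_per_chiplet_mib
    return "same" if fits_together else "split"


if __name__ == "__main__":
    print(place_pair(10, 12))  # small working sets
    print(place_pair(24, 20))  # would thrash a single 32 MiB L3
```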
So everything is a trade-off. The question is how to make the subsystem intelligent enough to pick the proper task shifting and topology usage based on the type of task or even the application itself, since not all applications of the same “type” necessarily work the same way or use resources in the same manner.
That could work as well. Kernel tasks though are quite simple in general and I doubt they could keep the E-Cores busy. Also if all I/O had to go through the E-Cores it might lead to lower performance since those are technically Atom cores which aren’t great for I/O.
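Well, that would require too much time and cost. E-Cores are just Atom cores, which funnily enough are laid out a bit like Bulldozer: they are grouped into four-core clusters that share an L2 cache. They are not true Clustered Multi-Threading modules (each core keeps its own front end and FPU), but packing in lots of small cores is why they have such nice parallel performance while having such weak single-thread performance.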
It’s basically actual high-end Intel performance cores combined with an FX-8350 at slightly lower clocks for the E-cores (on a more modern process), and with the older instruction set.
You could try that and see, but I still wonder about the performance, considering that the I/O for the whole system would go through the Atom cores. I know that many Windows I/O tasks are still single-threaded, and that could present a serious bottleneck.
I know that my J1800 Atom cores could only saturate 1 Gbit over Samba at the 2.6 GHz boost clock; otherwise they could only muster around 400 Mbit. With things like PCIe SSDs (NVMe, obviously) the I/O throughput is much higher. Samba itself isn’t the point here; it is just an example. All my Atom machines (and I have tested tons of them) have serious I/O issues, and this is part of what makes Atom machines FEEL so damned slow.
Yes, at 3 GHz+ they should be able to sustain decent throughput, assuming that they don’t have the same I/O bottlenecks with NVMe, etc.
I have tested tons of mini PCs with modern Atom CPUs and they still seem to struggle with I/O in various ways.
Time will tell if they can be limited to processing background tasks to create a more secure barrier. However, with Intel doubling the E-Cores for Raptor Lake, they definitely won’t want to support relegating them to background tasks, as they are the “foundation” of the gen-on-gen increase in “performance” that Intel is claiming. They need those E-Cores to drive home the productivity wins that Intel is counting on.