Differing CPU setup?

Note: This is more of a discussion piece; I'm not asking/suggesting to do this with the current OS/firmware/software.

I'm wondering how possible this would be, particularly with Linux and low-level (kernel-level, maybe?) nodes/code to make it happen. Basically I mean taking a CPU you would use now for gaming and adding another processor with a bunch of cores that run at a lower frequency.

This could take one of two approaches:

  1. A multi-CPU motherboard, with one high-clock/high-IPC single/dual/quad-core CPU and one (or more) with a high core/thread count (20+).

  2. An actual separate swarm machine made up of a cluster of Raspberry Pis or similar microcomputers, hooked up through Ethernet. I think the issue here would be actually integrating it into the system so that it sees the cluster as additional CPU cores/threads, and integrating it so they can be intelligently scheduled work and have it sent to them directly like a real piece of hardware rather than needing the same OS and program setup.

In either case... with something like rendering in Handbrake, the 'swarm' threads would get their own frames (likely from the end of the workload). Near the end of the job, if the faster threads can do them significantly faster (they likely would), those remaining frames would be given to them instead of the swarm threads.
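As a very rough sketch of that hand-off idea (nothing to do with how Handbrake actually schedules internally; encode_frame, the speed ratio and the threshold are made-up placeholders), it could look something like this in Python:

```python
# A minimal sketch of the frame-queue idea above, not Handbrake's real scheduler.
import threading, collections, time

frames = collections.deque(range(100))   # pending frame indices
lock = threading.Lock()
FAST_SPEEDUP = 4          # assume a fast core is ~4x quicker (illustrative)
HANDOVER_THRESHOLD = 8    # once this few frames remain, leave them to fast cores

def encode_frame(idx, slow):
    time.sleep(0.01 * (FAST_SPEEDUP if slow else 1))  # stand-in for real work

def worker(slow):
    while True:
        with lock:
            if not frames:
                return
            if slow:
                # swarm cores: only take work while plenty remains, otherwise
                # hand the tail of the job back to the fast cores
                if len(frames) <= HANDOVER_THRESHOLD:
                    return
                idx = frames.pop()        # take frames from the end of the workload
            else:
                idx = frames.popleft()    # fast cores chew through the front
        encode_frame(idx, slow)

threads = [threading.Thread(target=worker, args=(i >= 4,)) for i in range(12)]
for t in threads: t.start()
for t in threads: t.join()
```

The only 'smarts' needed is that the slow workers stop taking frames once the remaining tail is short enough for the fast cores to finish it quicker.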

Things like games would likely use the 'main' CPU, unless they could saturate all of its threads (and what was left over was small and not needed instantly), whereas the swarm could be used for anything highly parallelized (and again, in some cases BOTH could be used).
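On the 'games stick to the main CPU' part, you can already steer things by hand on Linux with CPU affinity. A minimal sketch, assuming cores 0-3 are the fast ones and 4-23 are the hypothetical slow 'swarm' cores (the HandBrakeCLI invocation is just an example batch job):

```python
# Rough sketch of manually steering work on Linux with CPU affinity.
# The core numbering is an assumption for this example.
import os, subprocess

FAST_CORES  = set(range(0, 4))
SWARM_CORES = set(range(4, 24))

# Pin the current (latency-sensitive) process to the fast cores only.
os.sched_setaffinity(0, FAST_CORES)

# Launch a highly parallel batch job restricted to the slow cores,
# e.g. via taskset (part of util-linux).
core_list = ",".join(str(c) for c in sorted(SWARM_CORES))
subprocess.Popen(["taskset", "-c", core_list,
                  "HandBrakeCLI", "-i", "in.mkv", "-o", "out.mkv"])
```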


Just some thoughts on getting both high thread count and high frequency/IPC in one system (my other thought being a single CPU with 2^n cores at successively lower clocks: 1 core at the top frequency, then 2, 4 and 8 cores each tier clocked lower), without the drawbacks of traditional clusters (like needing software recompiled to gain any benefit).

Of course it wouldn't be used to full capacity in most scenarios, but even simply getting the best of either world would be good. It'd all be up to how well applications support multithreading and how intelligent CPU scheduling could be (taking care to give jobs to the threads that are most suited to complete them).


I believe some cell phones use something similar to what you're talking about: they have a couple of low-power cores and a few high-power cores that are activated on the fly. It's the ARM big.LITTLE setup.

https://www.arm.com/products/processors/technologies/biglittleprocessing.php
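For what it's worth, on arm64 Linux the kernel already exposes a relative capacity value per core, so you can see the big/LITTLE split from userspace. A quick sketch (the sysfs files only exist on kernels/SoCs that publish them):

```python
# Print the relative capacity and max frequency of each core, as exposed in sysfs
# on arm64 kernels that support it; prints "n/a" where the files are missing.
import glob, os

for cpu in sorted(glob.glob("/sys/devices/system/cpu/cpu[0-9]*")):
    cap_path = os.path.join(cpu, "cpu_capacity")
    freq_path = os.path.join(cpu, "cpufreq", "cpuinfo_max_freq")
    cap = open(cap_path).read().strip() if os.path.exists(cap_path) else "n/a"
    freq = open(freq_path).read().strip() if os.path.exists(freq_path) else "n/a"
    print(f"{os.path.basename(cpu)}: capacity={cap} max_freq={freq} kHz")
```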


That's spot on!

I hope desktop CPUs start to go in this direction, particularly in place of dual-core CPUs (and quad-cores to a lesser extent), since those are starting to make less and less sense.

It would certainly make CPUs age better as more programs give minor tasks the ability to be on a dedicated thread (or any situation making CPUs more thread-bound and thus IPC-bound but NOT frequency-bound).

I don't think it will happen. In general, a desktop has the equivalent of infinite power as far as electricity is concerned, and the big.LITTLE arch is primarily for power saving. Desktop applications, while not overly stressing the CPU a lot of the time, gain nothing from that power saving and might even be slowed down by having to switch CPU context from slower to faster cores as demand changes. Right now what we have is slower and faster speeds on any given core: it will just run slower when not needing full power and then ramp the clocks as demand rises.
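You can watch that ramping happen yourself through the cpufreq state in sysfs; a quick sketch, assuming a cpufreq driver is loaded:

```python
# Poll each core's current frequency and governor once per second.
import glob, time

while True:
    for path in sorted(glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cpufreq")):
        gov = open(path + "/scaling_governor").read().strip()
        cur = int(open(path + "/scaling_cur_freq").read())   # reported in kHz
        print(f"{path.split('/')[-2]}: {cur // 1000} MHz ({gov})", end="  ")
    print()
    time.sleep(1)
```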

I think it would be a step backward in terms of performance to try this in x86-based machines. They have developed completely separately from ARM and use resources differently as a result; the programs are also written to take advantage of this, so it would be a step back there too.

Nice idea though, and had x86 started today it might well have ended up with this very model, but it started long before mobile was an idea. ARM deals with the limitations of mobile much better because it is newer, has had to deal with those issues from the start, and has a long history of tech to draw from and do differently.

Ad 1: In part that would basically be a return to the idea of the floating-point co-processor, which originally was a separate, optional chip on the board.

Basically you are talking more or less about the Xeon Phi coprocessor (though I think it has its own instruction set).

At some level of abstraction it could be said that a GPU is a co-processor too.

Ad 2:

and have it sent to them directly like a real piece of hardware rather than needing the same OS and program setup

with something like rendering in Handbrake, the 'swarm' threads would get their own frames (likely from the end of the workload). Near the end of the job, if the faster threads can do them significantly faster (they likely would), those remaining frames would be given to them instead of the swarm threads

Here is the big problem: latency in communication. It is one thing to have a server split the work between nodes via the network (where each node then does its own separate job). It's another to have a virtual CPU where cores need to communicate with each other (to pass the results of execution), and the further apart those cores are (e.g. across a network vs. on a single chip), the worse that communication gets.

Usually, the more you split the work into smaller pieces, the lower the communication latency your computing solution needs to have (and splitting work between the cores of a CPU is already very fine-grained, as far as general programming goes).
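A rough back-of-envelope of that point (all numbers here are assumptions - gigabit Ethernet, ~0.5 ms round trip - not measurements): the communication overhead only disappears when each piece of work is big enough.

```python
# Fraction of total time spent on communication when offloading one work item.
def comm_overhead_fraction(chunk_bytes, compute_secs,
                           rtt_secs=0.0005, bw_bytes_s=117e6):
    comm = rtt_secs + 2 * chunk_bytes / bw_bytes_s   # RTT + send input + return result
    return comm / (comm + compute_secs)

# ~1 MB video chunk that takes 2 s to encode on a Pi-class core: overhead is <1%.
print(comm_overhead_fraction(1e6, 2.0))      # ~0.009
# A 4 KB item that needs 50 microseconds of compute: ~92% of the time is communication.
print(comm_overhead_fraction(4e3, 50e-6))    # ~0.92
```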

So I would say 1 is the better approach, as 2 has already split off into dedicated, optimized applications (supercomputers, clusters, client-server/peer-to-peer applications).

You seem to misunderstand the issue here (EDIT: the one I'm talking about).

It's NOT energy efficiency, but hardware cost. You go with a dual-core or quad-core because you intend to game, and so it doesn't make sense to spend money on full extra cores.

In a way, it's similar to multiple threads per-core (like 4c/8t)... it's less performant than a higher number of cores (and 1 thread per core) but you save money.

Similarly, 'little' cores wouldn't cost as much, but would speed up thread-hungry applications that don't actually do much per thread but are highly parallelized. Or, as I've said, they could take smaller jobs when the main cores are already busy.

EDIT: It's also similar to L2 and L3 cache: you have more of it at a slower speed, but it gets used for things that don't need speed as much (or when the faster level is completely full).

I was talking more about a cluster, likely in the same room as your main computer, separated by one Ethernet cable and a switch. Your computer would act as the scheduler, controlling which cores get what... they shouldn't need to communicate with each other; each part of the cluster would be a core (and if inter-node communication was ever needed, maybe there would be some way to accomplish that without going back through the main computer?).

So would latency really be an issue? Especially considering that this would be more for operations where latency isn't important (think rendering, either pieces of a frame or a full frame itself), or for when all threads on the main machine are already occupied (and likely will be for a while) anyway... the result should be finishing something before a main core could even start doing it, and thus less overall latency.

But as I've said, 2 shouldn't need optimized applications, just a smart scheduler. Or even user input.
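Something like this is all I have in mind for the 'PC as scheduler' part. A hedged sketch - the hostnames and encode_chunk.sh are purely hypothetical placeholders - but the point is that nothing on the nodes needs recompiling, and faster nodes automatically come back for the next chunk sooner:

```python
# The desktop hands out whole, independent chunks to Pi nodes over plain ssh;
# the nodes never talk to each other and only run stock, locally installed tools.
import queue, subprocess, threading

SWARM = ["pi-node1", "pi-node2", "pi-node3", "pi-node4"]   # hypothetical hostnames
chunks = queue.Queue()
for i in range(40):
    chunks.put(f"chunk_{i:03d}")

def node_worker(host):
    while True:
        try:
            chunk = chunks.get_nowait()
        except queue.Empty:
            return
        # Run the (hypothetical) encode script remotely, then copy the result back.
        subprocess.run(["ssh", host, "./encode_chunk.sh", chunk], check=True)
        subprocess.run(["scp", f"{host}:{chunk}.out", "results/"], check=True)

threads = [threading.Thread(target=node_worker, args=(h,)) for h in SWARM]
for t in threads: t.start()
for t in threads: t.join()
```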

Apple do the low-power CPU thing to handle push messages and keep real-time data up to date.

On a desktop I want max power all the time; the more power the better, and I want control of it myself, not Apple or Google.

On a laptop, well, it's back to batteries and power usage. A baby CPU might work, but I want the same OS on my laptop as on my desktop, so rock and hard place.

Those statements kind of cancel each other out.

Yes, I can see that. I am approaching it from a cost standpoint too. To replace all that has been developed would be a step back and more expensive, as everything from hardware to software would need to be new. Hence x86 is old and the way it is, and ARM is what it is.

Yes, power and everything else. It is what the CPU does, so power is the first important metric to hit. If we were to go to a new system it would have to start at the levels of performance we have now while also being able to save power, and that means developing everything new. Thirty-odd years of development; it's very expensive to change that, and even if the chips are cheap no one would seriously go for it.

Same as I said to Zibob above.

Optimized for high parallelization, yes. Possibly with profiles for different software (enabled/disabled, auto-detected after a normal run, dev-added, etc.), but the same binary that would run on a normal 64-bit machine.

I was saying that optimization for clusters (like recompiling software specifically for a Beowulf cluster) should not be needed.

Basically, you get a benefit or you don't... but everything is installed normally. Obviously, not everything can be highly parallelized, so not everything would gain a benefit anyway.

I'm sorry, I cannot so lightly accept approach 2 in the way we're discussing it. Don't get me wrong, it is doable, but it isn't practical.

For example, I still see one particular problem. Handbrake, as an application running on multiple systems, does the work efficiently because it knows how to split the work; when splitting it, it takes into account that these will be separate processes that work best if the input (a fragment of the video stream) is sent, that "core" processes it for some time, and then it uploads its fragment of the solution to the main Handbrake instance (this is how I see it, in a simplified way - I've never used Handbrake like that).

I could see a special version of Handbrake working optimally in our approach 2, with one exception: it is the application's scheduler that does this; from the OS point of view it is yet another thread that the OS places wherever it can. Putting that scheduling responsibility on the OS is possible - but that's what application-specific OSes are for. So in my opinion it is highly non-beneficial to put that responsibility on a general-purpose OS.

Then we could think about letting the OS know that a specific thread is separate from the rest of the threads and could be put on a remote "core". But that's what processes are for. A process is basically a box that says to the system "hey, I'm quite independent from the others". So basically we are slowly going back to the cluster-of-independent-systems idea.
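To make that thread-vs-process distinction concrete, a tiny illustration (pure toy example): threads share one address space, so they really want to stay on one machine, while a process only exchanges data through explicit messages and could, in principle, live on a remote node just as well.

```python
# Each worker process shares no state with the parent; everything arrives
# and leaves as a message, which is exactly what makes it "relocatable".
import multiprocessing as mp

def independent_job(task, out_q):
    out_q.put((task, task * task))   # stand-in for real work

if __name__ == "__main__":
    q = mp.Queue()
    procs = [mp.Process(target=independent_job, args=(t, q)) for t in range(4)]
    for p in procs: p.start()
    results = [q.get() for _ in procs]
    for p in procs: p.join()
    print(sorted(results))
```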

The cost of a distributed OS is much, much higher than just buying a general OS license (and I think it is also fair to say that there would still be a higher cost even if you found a free distributed OS).

For me, the cons heavily outweigh the pros.

And, still, there are other complications that would need to be solved, as, let's say, half of multi-threaded applications would never benefit from it. E.g. transactional applications/services - absolutely no one would be happy with the CAP-theorem problems introduced by the simple fact that some threads might be remote, whereas on the same chip those problems can be mitigated.

So basically, after some experiments: "However, this opportunity comes at a very high cost in complexity."

So to summarize my opinion:
- either you have something like CPU + N*GPU, or more generally CPU + N*coprocessor (e.g. Xeon Phi),
- or a cluster of independent nodes - that is much cheaper,
- a solution in between (approach 2), which resembles a distributed OS, is too complex and costly.

The solution mentioned in this thread that Apple is doing is not as costly or complicated, but that's because it is still on the same board and the slave CPU runs only complementary applications (if not only when the main CPU is off).