Why high-end PCs don't have socket 1150 Intel K CPUs

Ah, is it? Thanks, didn't know :) Well, one step closer to my dream lol

This is true, but at the start it won't: an APU paired with expensive DDR4 RAM doesn't work for what the APU is designed for. It may be a while before we get a true DDR4 APU setup, and then let the console whooping begin. Oh wait, it already has... lol

Not with current APU platforms, of course. But once DDR4 becomes the new standard... :)

I wasn't talking about gaming. Anything that cannot fit into the cache goes to main memory. So if you are working with big data tables, it needs to access main memory.

 

It has to get the information from somewhere, and not everything can fit into the cache.
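A quick back-of-the-envelope example of why, with sizes that are just typical assumptions for illustration:

```cpp
#include <cstdio>

// Back-of-the-envelope: a "big data table" versus a typical L3 cache.
// Both sizes are assumptions picked for illustration, not measurements.
int main() {
    const double l3_cache_mb   = 8.0;        // typical desktop L3 (assumed)
    const double rows          = 1e7;        // 10 million rows
    const double bytes_per_row = 8.0;        // one 64-bit value per row
    const double table_mb      = rows * bytes_per_row / (1024.0 * 1024.0);

    std::printf("table size: %.1f MB, L3 cache: %.1f MB\n", table_mb, l3_cache_mb);
    std::printf("fits in cache: %s\n", table_mb <= l3_cache_mb ? "yes" : "no");
    // ~76 MB against 8 MB of L3: most accesses have to go to main memory.
}
```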

 

Also, I doubt HSA will be used for smaller workloads; that is where the SIMD cluster shines.

It doesn't matter if it's a bigger or smaller workload. It matters what type of workload it is and how it is handled. Here's a more in-depth explanation of HSA from AMD's website: 

"Since their earliest days, computers have contained central processing units (CPUs) designed to run general programming tasks very well. But in the last couple of decades, mainstream computer systems typically include other processing elements as well. The most prevalent is the graphics processing unit (GPU), originally designed to perform specialized graphics computations in parallel. Over time, GPUs have become more powerful and more generalized, allowing them to be applied to general purpose parallel computing tasks with excellent power efficiency.

Today, a growing number of mainstream applications require the high performance and power efficiency achievable only through such highly parallel computation. But current CPUs and GPUs have been designed as separate processing elements and do not work together efficiently – and are cumbersome to program. Each has a separate memory space, requiring an application to explicitly copy data from CPU to GPU and then back again.

A program running on the CPU queues work for the GPU using system calls through a device driver stack managed by a completely separate scheduler. This introduces significant dispatch latency, with overhead that makes the process worthwhile only when the application requires a very large amount of parallel computation. Further, if a program running on the GPU wants to directly generate work-items, either for itself or for the CPU, it is impossible today!

To fully exploit the capabilities of parallel execution units, it is essential for computer system designers to think differently. The designers must re-architect computer systems to tightly integrate the disparate compute elements on a platform into an evolved central processor while providing a programming path that does not require fundamental changes for software developers. This is the primary goal of the new HSA design.

HSA creates an improved processor design that exposes the benefits and capabilities of mainstream programmable compute elements, working together seamlessly. With HSA, applications can create data structures in a single unified address space and can initiate work items on the hardware most appropriate for a given task. Sharing data between compute elements is as simple as sending a pointer. Multiple compute tasks can work on the same coherent memory regions, utilizing barriers and atomic memory operations as needed to maintain data synchronization (just as multi-core CPUs do today).

The HSA team at AMD analyzed the performance of Haar Face Detect, a commonly used multi-stage video analysis algorithm used to identify faces in a video stream. The team compared a CPU/GPU implementation in OpenCL™ against an HSA implementation. The HSA version seamlessly shares data between CPU and GPU, without memory copies or cache flushes because it assigns each part of the workload to the most appropriate processor with minimal dispatch overhead. The net result was a 2.3x relative performance gain at a 2.4x reduced power level*. This level of performance is not possible using only multicore CPU, only GPU, or even combined CPU and GPU with today’s driver model. Just as important, it is done using simple extensions to C++, not a totally different programming model."

Hopefully that explains it a little better for you. As you can see, this actually eliminates significant redundancies and latency in memory reads/writes, and further reduces CPU overhead as well. 

Source: http://developer.amd.com/resources/heterogeneous-computing/what-is-heterogeneous-system-architecture-hsa/
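To make the "sharing data between compute elements is as simple as sending a pointer" part of that quote a bit more concrete, here is a rough C++ sketch. The helper functions are hypothetical stubs standing in for whatever runtime you'd actually use, not a real API; the only point is the difference in data movement between the two paths.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <vector>

// Hypothetical helpers -- stubs standing in for a real driver/runtime API,
// kept trivial so the sketch compiles. Only the data-movement pattern matters.
static void* gpu_alloc_buffer(std::size_t bytes)                            { return std::malloc(bytes); }
static void  gpu_copy_to_device(void* dst, const float* src, std::size_t n) { std::memcpy(dst, src, n); }
static void  gpu_copy_to_host(float* dst, const void* src, std::size_t n)   { std::memcpy(dst, src, n); }
static void  gpu_run_kernel(void*, std::size_t)  {}   // classic driver-dispatched kernel
static void  hsa_run_kernel(float*, std::size_t) {}   // HSA-style kernel on shared memory

// Classic CPU/GPU split: two address spaces, explicit staging copies both ways.
void classic_gpgpu_path(std::vector<float>& data) {
    const std::size_t bytes = data.size() * sizeof(float);
    void* dev = gpu_alloc_buffer(bytes);
    gpu_copy_to_device(dev, data.data(), bytes);   // extra write into GPU memory
    gpu_run_kernel(dev, data.size());
    gpu_copy_to_host(data.data(), dev, bytes);     // extra read back into system memory
    std::free(dev);
}

// HSA-style path: one coherent address space, so sharing data really is
// just handing the GPU the same pointer the CPU already uses.
void hsa_path(std::vector<float>& data) {
    hsa_run_kernel(data.data(), data.size());      // no staging copies, no cache flushes
}

int main() {
    std::vector<float> data(1 << 20, 1.0f);
    classic_gpgpu_path(data);
    hsa_path(data);
}
```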

HSA is about making the GCN cores into a first level processor. I do know that.

 

It doesn't eliminate the memory bottleneck. 

Let me cut it down even further:

You said it can process things much faster.

 

Let's say it can process 20 elements at once (random number) per cycle.

So it would need at least 30-60+ elements to process.

 

Higher throughput needs more elements (data) to work with; this is simple logic. 
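To put rough numbers on that (using the same made-up 20-wide figure as above; the values are purely illustrative):

```cpp
#include <cstdio>

// Toy utilization math for a unit that processes `width` elements per cycle.
// 20 is the same made-up number as above; the point is the trend, not the value.
int main() {
    const int width = 20;
    for (int elements : {8, 20, 50, 200}) {
        int cycles = (elements + width - 1) / width;          // cycles needed (rounded up)
        double utilization = 100.0 * elements / (cycles * width);
        std::printf("%4d elements -> %2d cycles, %5.1f%% of the lanes busy\n",
                    elements, cycles, utilization);
    }
    //   8 elements:  1 cycle,   40% busy -- most of the unit sits idle.
    // 200 elements: 10 cycles, 100% busy -- the throughput is actually used.
}
```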

 

What you bolded simply means that the GCN cores share the I/O pipelines.

It doesn't eliminate the memory bottleneck - it removes the redundancies and inefficiencies of the current method of using GPU cores for parallel workloads, by eliminating the extra reads/writes to memory and the unnecessary CPU pass-through/processing. 


It's a tremendous improvement, as such tasks can now be completed in less than half the time. While it doesn't remove the memory bottleneck, it does significantly reduce its effect by requiring far fewer reads/writes.

Normally the only first level processor is the CPU. This means that if you want to make any other processor do any work, it needs to go through the CPU. HSA makes the GCN cores part of the first level processor.

 

The GPU can deliver incredibly high throughput, but at the cost of extremely high latency (the work has to go over the PCIe bus and through the driver stack first), so there was a range of workloads that would perform badly either way: the SIMD option gives lower throughput but low latency, and the GPU option gives high throughput but high latency.

 

HSA will bring higher throughput with lower latency, but it will be affected by the main memory bandwidth.
You cannot process air.
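Here is a quick back-of-the-envelope sketch of that ceiling, assuming dual-channel DDR3-1866 at roughly 29.9 GB/s theoretical peak (an assumption for illustration; real sustained bandwidth is lower):

```cpp
#include <cstdio>

// Rough ceiling that main memory bandwidth puts on a streaming workload.
// 29.9 GB/s is the theoretical peak of dual-channel DDR3-1866 (assumed for illustration).
int main() {
    const double bandwidth_gb_s   = 29.9;   // theoretical dual-channel DDR3-1866
    const double bytes_per_elem   = 4.0;    // one 32-bit float, streamed in once
    const double elems_per_second = bandwidth_gb_s * 1e9 / bytes_per_elem;

    std::printf("~%.1f billion elements/s can even reach the compute units\n",
                elems_per_second / 1e9);
    // No matter how wide the GCN array is, a streaming kernel cannot process
    // data faster than DDR3 can deliver it -- you cannot process air.
}
```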

 

For lower SIMD workloads, they will choose the regular SIMD cluster.

For middle SIMD workloads, they will choose HSA.

For higher SIMD workloads, they will choose the GPU.

 

The problem before was simply that there was no "middle", so at that point the performance would be bad either way.
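Here is a toy cost model of that "no middle" point, matching the three tiers above. Every number is invented purely for illustration; the only thing that matters is the shape of time(N) = dispatch latency + N / throughput:

```cpp
#include <cstdio>

// Toy model: time(N) = dispatch_latency + N / throughput, for three paths.
// All numbers are invented for illustration only.
struct Path { const char* name; double latency_us; double elems_per_us; };

int main() {
    const Path paths[] = {
        {"x86 SIMD (FPU) ", 0.01,    8.0},   // near-zero dispatch cost, modest throughput
        {"HSA (GCN cores)", 2.0,   200.0},   // small dispatch cost, high throughput
        {"discrete GPGPU ", 50.0, 2000.0},   // big dispatch/transfer cost, huge throughput
    };

    for (double n : {1e1, 1e4, 1e6}) {       // small, middle, large workloads
        std::printf("N = %.0f elements\n", n);
        for (const Path& p : paths)
            std::printf("  %s : %10.2f us\n", p.name, p.latency_us + n / p.elems_per_us);
    }
    // Small N (10):      the x86 SIMD cluster wins (latency dominates).
    // Middle N (10k):    HSA wins -- the gap that used to have no good option.
    // Large N (1M):      the discrete GPU's raw throughput eventually wins.
}
```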

 

Just like there are only a few workloads where it pays off to have the GPU do the SIMD work, there will only be a few workloads where it pays off to have the GCN cores do the SIMD work.

 

The bottleneck before wasn't the memory bandwidth, but the latency.

There will always be bottlenecks. The bottom line is this allows for certain tasks to be completed more than twice as fast while reducing CPU overhead at the same time. No matter how you cut it, it's a much more efficient use of hardware and a significant advancement over how things work by current methods/architecture.

There have also been hints at possibly using the GCN cores during gaming with a discrete GPU. Not to be used in CrossFire, but to off-load some/certain tasks to the GCN cores, leaving more headroom for the GPU cores to push even higher frame rates. If the GCN cores have direct access to the VRAM, this could potentially increase discrete GPU performance well beyond its own capability. I don't know how feasible this would be, but they did talk about it during one of their earlier seminars on HSA integration.

There are more ways than one to implement and take advantage of HSA. 

No matter how you cut it, it's a much more efficient use of hardware and a significant advancement over how things work by current methods/architecture.

I was simply clarifying where HSA will be used. It is not that much more advanced than the SIMD cluster found on the x86 cores themselves. It will not replace SIMD clusters or the GPGPU; it will fill the hole between them.

 

If the GCN cores have direct access to the VRAM, this could potentially increase discrete GPU performance well beyond its own capability

I don't quite understand this statement.

 

There are more ways than one to implement and take advantage of HSA.


When did I say there is only one way to implement HSA? I simply gave an example of where HSA will be useful and where it won't.

I never said you said those things. You were asking questions about how HSA works and I was simply trying to provide answers. That's it. I never said you claimed there was only one way to implement HSA. I was simply making a statement.

I never said it would replace CPU cores and/or GPU cores. I said it's a much more efficient use of those components (hardware) compared to current methods and architecture, which it is. I would consider reducing the time to complete a task by more than half a significant improvement. 

As for the statement about discrete GPU performance: what I meant was that they are looking into ways by which the GCN cores on the APU could be used to reduce the overall workload on the discrete GPU, allowing the discrete GPU to perform even better than it could on its own.

 

I never asked any questions. I never asked how HSA works.

 

I simply was wondering if HSA could be affected by the slow bandwidth from DDR3, and stated my reasons.

 

I never said it would replace CPU cores and/or GPU cores

Don't worry, I never claimed you did. The SIMD cluster is better known as the FPU.

 

I said it's a much more efficient use of those components (hardware) compared to current methods and architecture, which it is

This is a very broad statement to make. It all depends on how heavy the SIMD workload is. For smaller SIMD workloads, the SIMD cluster on the x86 cores will be more efficient. I previously described which workload sizes HSA fits (the middle) and why.

 

I would consider reducing the time to complete a task by more than half a significant improvement.

Again, a very broad statement. For some tasks it will be, as I also described before:

The problem before was simply that there was no "middle", so at that point the performance would be bad either way. (between the x86 SIMD cluster and GPGPU)

You're right, you never asked any questions. I was merely attempting to address what you were "wondering" about: how the memory bandwidth would affect HSA.

It is a more efficient use of the hardware, and it is a significant improvement - specifically for the types of tasks that will benefit from it. You could say my statements were general or broad, but the thread is about HSA and what it can do, specifically, so I didn't think I needed to spell out every single qualification. Just to be clear: I am talking about HSA and the improvements it delivers over current methods of processing the same types of tasks, not about the types of tasks that will not benefit from it.