I was thinking about the caches on a CPU and how they're used. That got me thinking about HBM on the upcoming APUs (the last rumor I heard was that non-Zen APUs with HBM would be released in the next few months for AM4). My thought was: could HBM feasibly replace cache? Ideally, main memory would be fast enough to not need any caching. So could HBM come anywhere close to handling the job of the cache? I don't think it's too far-fetched to think that it could do the job of main memory without much problem.
While I was researching this topic, I tried to find latency figures for HBM to compare against cache. However, the only thing I could find was that it was better than GDDR. No real measurements or anything, so it's hard to compare from there. But I did come across something interesting. AMD filed for a patent a few years ago for using HBM as something akin to an L4 cache that could be accessed by both the CPU and the GPU. That would make sense as a compromise between using HBM for main memory and using it as GPU-dedicated memory, and it would allow HSA-style applications to work effectively.
While thinking about this, I started thinking about the rumors. Currently, people are expecting HBM to basically supplant main memory in APUs (or at least to be used alongside it), with some APUs coming with more or less of it depending on the tier. 32 GB of HBM2 isn't too far-fetched, but the price is a concern. It made me think: how expensive would a processor like that be, with 32 GB of RAM built in? We are talking hundreds of dollars just for the memory, then hundreds more for the processor. It could end up being expensive, but it would really expand the capabilities of SoCs, especially at 14nm (which AMD supposedly has access to).
I think @wendell would have something worth saying here.
That is an interesting point there... It wouldn't be too far-fetched to see a fully integrated CPU+mobo+RAM combo, since we already have CPU+mobo units in the Atom line.
Modularity is really important to enthusiasts, but for mainstream gaming (think Steambox), an SoC APU with HBM would be a really good option. Basically a PC version of the consoles (but with more powerful hardware thanks to the smaller node, HBM, etc).
I highly doubt that an end user would be able to check the latency of HBM, because what you are trying to measure is the time between when the GPU requests something and when it gets the data back. I don't know of any way to measure that and don't think it's possible. It should be in the spec somewhere, or something that AMD or Hynix has measured in their labs. I would love to be wrong, though, and have someone measure it themselves so I could get some real-world numbers.
In theory, you could use OpenCL and run an atomic add over a buffer of a given block size, timing it to determine how long such an operation takes.
I don't know if that fairly represents the TRUE latency of the memory, but the fact that you could take the same code and run it on NVIDIA GPUs and have an apples-to-apples comparison should at least give you a fair result. I'd imagine the actual time for a single operation is going to be in the nanosecond range, so you'd probably want to try a 256 MB block or something, and then record the time that took, since that'll PROBABLY take milliseconds. That'll again at least give you an interesting, if functionally unhelpful, comparison between HBM and external GDDR5.
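To put rough numbers on that, here's a back-of-the-envelope sketch in Python of why a 256 MB block lands around the millisecond mark, and why that mostly reflects bandwidth rather than latency. The bandwidth and latency figures are illustrative assumptions, not measurements, and `bulk_time_s` is just a made-up helper name.

```python
def bulk_time_s(size_bytes, bandwidth_bytes_per_s, latency_s):
    """One-off access latency plus the time to stream the whole block."""
    return latency_s + size_bytes / bandwidth_bytes_per_s

BLOCK = 256 * 1024**2  # the 256 MB block suggested above

# Assumed, illustrative figures (NOT measured): name -> (bytes/s, seconds)
ASSUMED = {
    "HBM-like, 512 GB/s, 60 ns": (512e9, 60e-9),
    "GDDR5-like, 320 GB/s, 100 ns": (320e9, 100e-9),
}

def report():
    """Return (name, total time in ms, latency share of total) per entry."""
    rows = []
    for name, (bandwidth, latency) in ASSUMED.items():
        total = bulk_time_s(BLOCK, bandwidth, latency)
        rows.append((name, total * 1e3, latency / total))
    return rows

if __name__ == "__main__":
    for name, ms, share in report():
        print(f"{name}: {ms:.3f} ms total, latency is {share:.4%} of it")
```

With inputs like these the whole block takes somewhere between half a millisecond and a millisecond, and the latency term is a tiny fraction of a percent of that. So a big-block timing would really be comparing bandwidth, which is probably why it'd be interesting but not that helpful for the latency question.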
There are profiling functions for OpenCL, so it'd be relatively easy to measure how long a given function takes. I was going to provide some code examples, but I don't know the API well enough to know if they'd even compile.
Things to look at if you want to try and write this would be: atomic_add, clEnqueueNDRangeKernel, clWaitForEvents, and clGetEventProfilingInfo.
With those 4 functions and some medium-high level programming experience, you could get back the information you're looking for. It'd probably make a fun project for someone interested in learning the OpenCL API.
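As a rough sketch of the last step: for a given event, clGetEventProfilingInfo can report four device timestamps in nanoseconds (CL_PROFILING_COMMAND_QUEUED, CL_PROFILING_COMMAND_SUBMIT, CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END), as long as the command queue was created with profiling enabled. The Python below only does the arithmetic on those four numbers. It doesn't call OpenCL at all, and the sample timestamps, the access count, and the helper names are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class EventTimes:
    """The four per-event counters OpenCL profiling reports, in nanoseconds."""
    queued: int
    submit: int
    start: int
    end: int

def breakdown(t: EventTimes) -> dict:
    """Split an event's lifetime into waiting phases and device execution."""
    return {
        "queue_wait_ns": t.submit - t.queued,  # sitting in the host-side queue
        "launch_ns": t.start - t.submit,       # submitted but not yet running
        "exec_ns": t.end - t.start,            # kernel actually executing
    }

def ns_per_access(t: EventTimes, dependent_accesses: int) -> float:
    """Average execution time per access, assuming a hypothetical kernel
    that performed this many back-to-back dependent memory operations."""
    return (t.end - t.start) / dependent_accesses

if __name__ == "__main__":
    # Invented sample values, not from any real device.
    sample = EventTimes(queued=1_000, submit=6_000, start=21_000, end=1_021_000)
    print(breakdown(sample))
    print(ns_per_access(sample, 10_000), "ns per access")
```

The per-access division only means something if the accesses really are serialized (each one waiting on the previous one). If the GPU overlaps them, you're back to measuring throughput instead of latency.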
Or routers / VPN accelerators. Imagine the bottleneck being the interface you use... 10Gbps interfaces as a bottleneck. On a side note, it would also make a cool console, provided they pack tons of shaders in there. lol, an AMD console running SteamOS with 32 GB of HBM for the "new console gamers".
How do you test latency with something slower than the thing being measured? You would need an oscilloscope capable of ranges up to 600MHz with a 0.5V trigger. Something like this might work. To be honest, I do not want to know the price of that sexy piece of tech...
I certainly think that a CPU + interposer + HBM combo is in the pipeline somewhere (especially for the server market). The thing with cache is that it's excruciatingly expensive, but HBM is not on the same die as the processing unit and can therefore be manufactured much more cheaply. At the same time, it's still almost as close to the die as it would be if it were on the same die, so latency would still be much lower than DRAM. It still won't come cheap, but I think there are some use cases in which it would make a lot of sense to include HBM in the CPU package.
Samsung certainly seems to think so if this is to be believed:
Of course, L2 and L3 caches are still quite a bit faster than HBM. Intel's Memory Latency Checker shows that L1-L2 responses are in the 20-35ns range (Haswell). HBM is supposedly designed to have faster access times, but the row cycles are still 40-48ns, like those of DDR3/DDR4. Again with Intel's tool, you can see that DDR3 has a peak response of about 74ns (given small data sets). So perhaps HBM is in the 50-60ns range, which would certainly be better than nothing for an L4 cache.
Honestly, information on HBM's actual operating parameters is still closely guarded by JEDEC, so it's all speculation until implementations are launched and we can observe what the scope of HBM's application is.
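Purely as speculation on top of speculation, here's a quick Python sketch of what those guesses would mean for an L4. It uses a deliberately simple model that is my own assumption (not anything from AMD or JEDEC): every request that misses the on-die caches checks the HBM L4 first, and on an L4 miss goes on to DRAM and pays both latencies. The 55 ns figure is just the middle of the 50-60 ns guess, and 74 ns is the DDR3 number quoted above.

```python
def avg_latency_ns(l4_hit_rate, l4_ns, dram_ns):
    """Average latency with a look-through L4: always pay the L4 lookup,
    plus DRAM on the fraction of requests that miss it."""
    return l4_ns + (1.0 - l4_hit_rate) * dram_ns

def break_even_hit_rate(l4_ns, dram_ns):
    """Hit rate at which the L4 neither helps nor hurts versus DRAM alone
    (solves l4_ns + (1 - h) * dram_ns == dram_ns for h)."""
    return l4_ns / dram_ns

HBM_NS = 55.0   # midpoint of the 50-60 ns guess (speculative)
DRAM_NS = 74.0  # DDR3 figure quoted above

if __name__ == "__main__":
    for hit in (0.5, 0.8, 0.95):
        avg = avg_latency_ns(hit, HBM_NS, DRAM_NS)
        print(f"hit rate {hit:.0%}: {avg:.1f} ns vs {DRAM_NS:.0f} ns without L4")
    print(f"break-even hit rate: {break_even_hit_rate(HBM_NS, DRAM_NS):.1%}")
```

Under this model the L4 only wins on latency once it catches roughly three quarters of the requests that reach it. A design that probes the L4 and DRAM in parallel would look better, and none of this counts the bandwidth advantage, which is probably the bigger selling point.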