New memory standards - what comes after DDR5 and C/R/UDIMMs?

!!!
I was looking everywhere for that paper on Thursday and for the life of me could not find it; I must have been mentally filtering out non-PDF search results when I was looking.

I’m just happy they aren’t paywalling me.

…I don’t understand how these things get published.
This does point to Synopsys as a good source of information that I hadn't thought of before.
Looking around at their material I found this:

There isn't a scale on the graph, but it looks absolute to me. I'm also not 100% convinced that looking only at the PHY's power usage is exactly what we want, but I don't see an alternative given how complicated it would be to peel apart energy usage in specific parts of an architecture.

Well, GNR is on a completely different node than SPR, so it's going to get energy-consumption benefits just by virtue of the compute moving from Intel 7 (the former 10ESF) to Intel 3.
I think the real thing to pick on here is how purposefully memory-starved the benchmark was.

I’ve been using the HBM instances and I’ve been very impressed with their performance.

I don't know what it is, but it seems like all the CPUs with HBM end up having issues being brought to market.
The way I heard the story of the HBM EPYC was that AMD basically killed the project halfway through development, and the only reason it saw the light of day was that Microsoft came in and promised to foot the bill for the CPU-only version of the MI300 series.

Speaking from somewhat earlier personal experience with one of MDPI's apparently better journals, which I rather suspect Micromachines isn't, peer reviews are usually perfunctory: maybe skim the paper and jot down a few remarks, so perhaps 20 minutes of time. This results from MDPI management setting time from submission to online publication as their key performance metric. MDPI's copy-editing staff isn't dumb, so they've followed that mandate by increasingly optimizing for shoving articles onto the website.

Springer and Frontiers seem as bad as MDPI, if not worse, about quality control. But they’ve been less clueless about managing the optics, so haven’t gotten as much of a reputational hit.

The Micron and Hynix datasheets I found don't specifically say, but appear to give whole-stack HBM pJ/bit (not counting the IMC on the GPU/CPU). So I think those figures are probably comparable to calculating pJ/bit by dividing a DDR5 DIMM's SPD-reported power by its bandwidth. I was also able to find papers that look at Hopper power monitoring and the like, including mentions of specific reporting for HBM, but unfortunately nothing I came across actually had results for just the HBM on its own.
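To make the conversion concrete, here's a minimal sketch of the arithmetic I mean. The wattage and bandwidth figures are placeholders I picked for illustration, not datasheet values:

```python
def pj_per_bit(power_watts: float, bandwidth_gb_s: float) -> float:
    """Energy per transferred bit: watts divided by bits/s, scaled to pJ."""
    bits_per_second = bandwidth_gb_s * 1e9 * 8
    return power_watts / bits_per_second * 1e12

# A DDR5-4800 DIMM moving ~38.4 GB/s at ~10 W (placeholder numbers):
print(pj_per_bit(10.0, 38.4))    # ~32.6 pJ/bit

# An HBM3 stack moving ~819 GB/s at ~25 W (placeholder numbers):
print(pj_per_bit(25.0, 819.2))   # ~3.8 pJ/bit
```

Same units on both sides, so the comparison holds as long as both figures exclude (or both include) the controller.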

It's not apparent to me that Synopsys' PHY power reporting controls for process node. That's partially appropriate given generational node shrinks, but it also leaves latitude to cherry-pick comparisons for marketing purposes, since (at least some of) the PHYs they sell support a fairly wide range of nodes. Their graphs seem generally relevant and broadly indicative to me, but I've found myself questioning how much attention should be given to the specifics on occasion.

If I put on a tinfoil hat, the answer's Nvidia. Conspiracy theories aside, HBM only alleviates bandwidth-bound memory workloads if you can keep the right data in it at sufficiently low latency. Doing a good job of that presumably means reworking core designs to effectively utilize HBM alongside DDR, which suggests either the HBM won't end up being all that effective or AMD and Intel would have to pull engineering time off less niche CPUs. I also suspect width scaling is lower than with GPUs, both from the shape of CPU workloads and from having to add or maybe reassign memory channels for HBM3.
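To put a toy model on the residency point (my own simplification, with made-up bandwidth numbers): if some fraction of traffic falls through to DDR, the blended bandwidth is a harmonic mean, so even small miss rates eat a big chunk of the HBM advantage:

```python
def effective_bw(f_hbm: float, bw_hbm: float, bw_ddr: float) -> float:
    """Blended bandwidth when a fraction f_hbm of traffic is served from HBM
    and the rest from DDR. Time per byte adds, so the weighting is harmonic;
    assumes no overlap between the two paths (pessimistic but simple)."""
    return 1.0 / (f_hbm / bw_hbm + (1.0 - f_hbm) / bw_ddr)

# Placeholders: ~1000 GB/s of HBM vs ~300 GB/s of multi-channel DDR5.
for f in (0.5, 0.9, 0.99):
    print(f"{f:.0%} HBM residency -> {effective_bw(f, 1000.0, 300.0):.0f} GB/s")
# 50% -> ~462 GB/s, 90% -> ~811 GB/s: the last few percent of residency matter a lot.
```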

For EPYC it probably makes more sense to focus on V-Cache, as it's more versatile across CPU workloads and doesn't compete with CDNA for HBM allocations. Plus I'd guess there aren't a lot of customers willing to pay enough for an HBM EPYC that AMD could make the same margins on it as on Instinct. That seems to fit with how HBv5 turned out. For Xeon, the priorities look more like getting off Intel 10/Intel 7 onto more competitive nodes, reducing costs, dealing with layoffs, and not having to compete with AMD and Nvidia for HBM allocation. Plus some of what I've read suggests Xeon Max was motivated by increasing HPC performance when clustered with Ponte Vecchio, so it might be end of the line because of the switch to Jaguar.

GPU adoption seems to me to have been simpler, as HBM's a straightforward upgrade from GDDR. Not exactly drop-in, but fairly close.


CPU with HBM (and GPU) being in a socket is already a thing (MI300A).
Then add some form of system memory, DDR/CAMM/etc.
For even more memory: CXL

The win is flexibility and scalability for the manufacturers: the base system is basically an SoC, but it can scale by populating more devices/slots on the mainboard.

This reminds me of the hollowing out of the h-index's utility once it started being gamed to boost authors' standing.

I think it was only Intel that ever went down the path of using HBM as a kind of transparent "L4" cache; all the other designers completely replace GDDR/DDR with it… and the only reason Intel could get away with using it like this is that their horrendous LLC cache-coherency scheme adds so much baseline latency, even to the SRAM LLC, that the extra latency from DDR>HBM caching wasn't very noticeable.
To Intel's credit, they did let you switch memory modes to pure HBM as well.
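A toy model of why that flies, with made-up latencies rather than measured SPR numbers: in cache mode a miss pays the HBM tag lookup and then the DDR trip, so average latency can land above plain DDR even while the bandwidth win stands, which only goes unnoticed when the baseline is already high:

```python
def avg_latency_ns(hit_rate: float, lat_hbm: float, lat_ddr: float,
                   lookup: float) -> float:
    """Memory-side cache model: hits cost one HBM access, misses cost the
    tag lookup plus a DDR access. No overlap assumed (a simplification)."""
    return hit_rate * lat_hbm + (1.0 - hit_rate) * (lookup + lat_ddr)

# Placeholder latencies in ns; real Sapphire Rapids figures differ.
print(avg_latency_ns(0.8, 130.0, 110.0, 30.0))  # 132.0 ns, above flat DDR's 110
```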

In Intel's HBM CPU design the memory controllers get reassigned from DDR to HBM on the fly, or at least that's their apparent behavior according to numactl's weighted-interleave access patterns (bandwidth degrades when you let the kernel alternate between memory types in a round-robin fashion).
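If anyone wants to poke at this themselves, here's the kind of crude probe I'd use. The node numbers are placeholders, since the actual layout (check `numactl -H`) depends on the memory mode:

```python
# bw_probe.py -- crude streaming-bandwidth probe. Run it under different
# placement policies and compare, e.g.:
#   numactl --cpunodebind=0 --membind=0 python bw_probe.py     # DDR node (assumed)
#   numactl --cpunodebind=0 --membind=2 python bw_probe.py     # HBM node (assumed)
#   numactl --cpunodebind=0 --interleave=0,2 python bw_probe.py
import time
import numpy as np

N = 1 << 26          # 512 MiB per float64 array

a = np.ones(N)
b = np.ones(N)
c = np.empty(N)

t0 = time.perf_counter()
np.add(a, b, out=c)  # streams: read a, read b, write c
dt = time.perf_counter() - t0

# 3 arrays * N elements * 8 bytes; ignores write-allocate traffic, so treat
# it as a rough lower bound rather than a precise measurement.
print(f"{3 * N * 8 / dt / 1e9:.1f} GB/s")
```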

The Aurora supercomputer was another driver from what I heard, although it didn't turn out to be as successful as hoped.


I guess for the foreseeable future, as consumers, the question is gonna be whether you already have RAM … because you're not getting any for a while.

To my knowledge as well. AMD 2024 might be of interest here if you haven’t already looked.

Yeah. Or just whether you already have the hardware, for most things outside the LLM bubble's funding reach, since if it doesn't need (G)DDR/HBM it needs NAND or 3.5″ drives. Like, ok, you can update a case or PSU and maybe drop in a new CPU or swap a mobo. But that has pretty sharp limits.

Give people a performance metric to optimize for and that’s what they do. Shocking. :yay:

Like the academic side of this stereotypes down to whether you want to be somebody who spends their time getting their name on other people's papers or whether you want to actually do useful work. Neither offers much of a sustainable career, with the unsurprising result being a loss of research quality, since the work is primarily getting done by inadequately mentored graduate students and then channeled through a scarcely resourced peer-review system.

I guess it sort of works anyway because most scientists want to do good work despite being systematically incentivized not to, making for a big brain system which kind of runs on selective stupidity. Weirdass approach but it seems unlikely to me that it’ll break hard enough to motivate large scale reforms.
