New memory standards - what comes after DDR5 R/C/UDIMM?

I’ve noticed more and more noise about the need for new form factors meant to unlock higher speeds, as existing DDR5 DIMMs get all wobbly with more than one rank at higher speeds, let alone more than one DIMM per channel.

In that respect, DDR5 is a disappointment. Plop two sticks in the same channel and you are back to DDR4 speeds.

Same with the capacity increase. DDR5 is the first generation where doubling the capacity has failed, so we have 24/48 GB sticks.

So now we have the first alternative form factors: LP/CAMM2 (originated by Dell), SOCAMM2 (Nvidia), the M/C/RDIMM variants (basically a RAID-0 interleave of two old RDIMMs on one MRDIMM), etc.

And now Tachyum is coming out with their version of it:

wccftech_com / tachyum-tdimm-ddr5-memory-1-tb-capacity-per-dimm-5x-bandwidth

As I understand it, TDIMM is basically just a pair of MRDIMMs with a bit of ECC trimmed down. So 128 bits per stick, all I/O registered (thus low loads/high speeds), and a corresponding speed and capacity increase.

Whatever one thinks about Tachyum, it’s clear that they are reacting to market needs.

Which makes me think - will the DDR6/AM6 transition bring more than the usual evolutionary changes?

These new memory sticks would bring much more: a radical increase in both capacity and speed, much more potent APU options, etc.

How and when do you expect this to unfold?

That is untrue: DDR5 has 32 Gb dies shipping now, while DDR4 was limited to 16 Gb dies.

I can guarantee you this will never see the light of day, as all the wonder architectures they promised have been months to a couple of years away for the past 7 years. This company is the Theranos of the computer hardware world.

Even if someone else trustworthy took up the project and developed the architecture as Tachyum promises it now, the memory throughput of the system would be abysmal due to the lack of a good solution for the memory concurrency issues in their design. What they have described simply won’t perform; they’d be lucky to even get the performance of sampling gen2 MRDIMMs out of these hypothetical TDIMMs.

The market has already decided how to address this need: HMC lost and HBM won. We’ve already got a roadmap going out ~10 years with this technology, and it’s highly likely to be executable unless advanced packaging development hits a wall.

Yes, we have them now. How long did it take?
It used to be that capacity doubled not long after the first wave of new DRAMs, if not in the first wave.

And those 32 Gb chips, besides being late, have awful latencies, so they aren’t a simple “2x” uplift, all other things being equal.

Or so couch generals love to say. But they have a court case with Cadence, and it doesn’t seem like Cadence is taking it lightly.
For all I know, they might be right.

But even if they are wrong and TDIMM ends up being a fantasy, the underlying principle is not. TDIMM is basically a repackaged pair of MRDIMMs.
And MRDIMM is being pushed big time by real players. Same with the (somewhat orthogonal) LP/CAMM and SOCAMM.
Something is obviously about to happen.

Who says so? Because the AI bubble players went big on HBM?
HBM obviously can’t work for server farms, where people need extensibility, repairability, upgradeability, many different configurations, even transferability (CXL), etc.

You are mixing apples with oranges here IMO.

My memory might be a little fuzzy on this, but didn’t it take DDR4 5-6 years to go from 8 Gb to 16 Gb dies?
DDR5 has made the doubling in 4-5 years depending on what counts as “available”.

The lion’s share of the less-than-linear memory performance scaling problem has been the substandard IMCs the CPU companies have been putting out, one of the companies being more guilty than the other.

That has nothing to do with the latencies of 64 GB sticks.
IIRC at DDR5-6400 you can get 48 GB sticks at 30 clk tCL but only 42 clk tCL on 64 GB sticks. :scream_cat: :grimacing:
Ouch!
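For a sense of scale, here’s a quick sketch converting those cycle counts to absolute latency, assuming both figures are CAS-style timings on 6400 MT/s kits (the transfer rate in MT/s is twice the command clock):

```python
# Convert a DRAM timing given in clock cycles to nanoseconds.
# DDR transfers twice per clock, so the command clock in MHz is
# half the MT/s rate, and one cycle lasts 2000 / MT_s nanoseconds.
def clocks_to_ns(clocks: int, mt_per_s: int) -> float:
    return clocks * 2000 / mt_per_s

# Figures from the post above (6400 MT/s kits):
print(clocks_to_ns(30, 6400))  # 48 GB stick: 9.375 ns
print(clocks_to_ns(42, 6400))  # 64 GB stick: 13.125 ns
```

So the 64 GB stick carries about 40% more access latency than the 48 GB one, all else being equal.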

I would have loved to be a fly on the wall in the settlement proceedings. While the terms were never disclosed, the fact that it wasn’t outright settled in Tachyum’s favor means they were at some level of fault, which to me says they didn’t know how to, or failed to, implement the IP they bought from Cadence correctly.
If I were Cadence, I might have cut them a discount to keep them as a customer and keep getting paid, instead of leaving them with nothing.

It does certainly seem that way, but I don’t think the power consumption growth trajectory of these is tenable even short term; we’re already at the point where we need to double up on PMICs on large MRDIMMs for 50 W of power per DIMM. I don’t know the numbers off the top of my head, but even older HBM is probably up to an order of magnitude more efficient per bit transferred.

I think SOCAMM2 could actually gain some traction because it has Nvidia behind it. LPCAMM seems more like a solution searching for a problem to solve.

At this point it’s consensus among all the established chip makers, judging by their current designs and their roadmaps: HBM is being put in a very large portion of high-end datacenter products. Not just the AI chip makers but HPC as well; last week the HBM-equipped EPYC CPUs hit general availability, as an example.

I agree that post-manufacture memory flexibility would be nice, but I think we’re getting to a point where that flexibility’s downsides in power consumption and performance outweigh its utility.

That particular example might just be because we don’t have good binnings of the 32 Gb dies yet, while the 24 Gb dies are more mature in their bins. That being said, I doubt the 32 Gb dies will ever be as light on the memory controller as the 16 or 24 Gb dies were; IIRC some of the 24 Gb die DIMMs actually outperformed 16 Gb die DIMMs, specifically when comparing Hynix M-die to Hynix M-die of the two different sizes.

Fine, but even without those disappointments, the fact remains that with the new generations we’ll get 15% faster cores and 50% more of them (so 12 and 24 cores with Zen6 and 16 and 32 cores with Zen7, each core massively faster than Zen5).

Both of those are expected on AM5, so with DDR5.

DDR5 as it is can’t really feed all those cores. It isn’t scaling frequency-wise.

LP/CAMM2 is frequency-wise a disappointment. It’s expensive and yet doesn’t reach higher than UDIMM.

This only leaves the more radical approach that MRDIMM uses, even if it gets combined with the better signal integrity of LP/CAMM2.

And/or maybe off-the-shelf stuff switches to LPDDR5?

As you’ve partially noted with 16 and 24 Gb Hynix M, timings are mainly a function of a die’s ability to access its cells and thus subject to internal optimizations and speed-power tradeoffs independent of bus signal integrity. The IO loading’s a function mainly of packaging and thus pretty much constant within a given DIMM organization. Recall, however, that DRAMs’ IO blocks control both eye opening on receive and DQ and CB drive back to the IMC. So, while increasing the number of cells increases die size and alters placement within packages (and maybe also package selection), the 16 → 24 → 32 Gb steps also offer opportunities to bump DRAM clocks by revving their Rx and Tx designs. While changes can be backported by stepping earlier dies, masks are expensive.

Though I don’t know of specific evidence, I’m pretty comfortable assuming joint selection of IO, clock, and timing capabilities as a function of DRAM design for price-performance. Given the B-die/M-die market segmentation, Micron doesn’t particularly need to support 4 GHz clocks where Hynix does. And there’s probably not a lot of demand for bandwidth that’s decoupled from latency.

I’ve found energy use metrics hard to come by and difficult to compare. Studies that do measure pJ/bit or such tend to be looking at different details rather than total memory subsystem power. But orders of magnitude seems quite optimistic. Probably more like half for HBM3 versus GDDR7 (~2.2 and ~4.4 pJ/bit-ish). For DDR5 I haven’t found good published numbers, but the peak SPD-reported powers I’ve gotten for Hynix M are around 1.2 pJ/bit (Hynix is at 650 fJ/bit for LPDDR5).
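To put those per-bit figures in perspective, here’s a back-of-the-envelope conversion from pJ/bit to interface power at a fixed bandwidth. The pJ/bit values are the ballpark numbers above, not datasheet figures, and this ignores idle and refresh power entirely:

```python
# Rough interface power from per-bit transfer energy at a sustained
# bandwidth. Transfer energy only, so treat it as a lower bound.
def interface_watts(pj_per_bit: float, gb_per_s: float) -> float:
    bits_per_s = gb_per_s * 8e9          # GB/s -> bits/s
    return pj_per_bit * 1e-12 * bits_per_s

# Ballpark pJ/bit figures quoted above; rough estimates only.
for name, pj in [("HBM3", 2.2), ("GDDR7", 4.4),
                 ("DDR5 (Hynix M, SPD-derived)", 1.2), ("LPDDR5", 0.65)]:
    print(f"{name}: {interface_watts(pj, 100):.2f} W at 100 GB/s")
```

Even at 100 GB/s the transfer energy alone stays in single-digit watts, which is part of why totals for the whole memory subsystem (PHYs, PMICs, refresh) matter more than these headline per-bit numbers.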

Also, since HBM shrinks to a 3D stack connected on an interposer it’s both bespoke and higher density, so less flexibility and more involved thermals. That says basically L4 cache to me, though Nvidia’s use is technically L3.

V-cache is something of a different solution to somewhat different issues, but it seems to me that interpreting 3D stacked memory tiles on CCDs as an HBM equivalent for a desktop price point isn’t entirely without merit. If the rumored 32 p-cores on AM5 happen, I expect there’ll be a lot of wailing here on L1 about dual channel DDR5. But if the caches are good it’ll work fine for almost all workloads.

FWIW Micron claims 70% power reduction, 64% space reduction, less soldering, and 1.5x clock compared to SODIMM. Doesn’t particularly matter in a server or desktop context but is relevant to the notebook use LPCAMM’s intended for.

Though, seeing as most 2x32 GB SODIMM kits are currently in the US$ 700-800 range here, if LPCAMM2 mainstreams anytime soon the early adopter premium’s probably going to be brutal.


I wouldn’t be surprised if the current DDR5 *CAMM(2) products are not aiming at mass adoption, but that perhaps when DDR6 is specified from the get-go for CAMM-style modules then the improved signal integrity could reduce the design headaches for mainboard vendors enough for it to achieve mainstream adoption in some form.

From what I’ve read nearer term expectations all seem centered on limited use of LPCAMM2 for higher end laptops. Then CAMM2 DDR6 in 2027.

I figure the DDR crunch’s probably killed any experimental CAMM2 DDR5 launches that might have been planned. Stranger things have happened, though.

Or maybe they finally decide to stop faffing around with DDR and use that fancy new fiber-optics-on-a-chip for the PCIe and RAM connections as well?

I suspect that would necessitate a DRAM organization change (e.g. 256-bit wide massive RAM fields) and possibly bring a couple of other novelties, like simple vector computing in memory (working with whole rows/columns at once, pattern search, etc.).

This is a big problem that has been compounding for the past decade+ among processors.

In my search to find energy consumption figures for different memory technologies, I ran across this paper that shows how memory-starved CPUs are by comparing energy-normalized performance of a Sapphire Rapids DDR5 CPU+memory combo to a Sapphire Rapids HBM2 CPU+memory combo (first and second columns):

The same CPU architecture saw power consumption roughly cut in half by switching from DDR5 to HBM2E. Now, I know there are a lot of other variables in this benchmark beyond memory energy usage, like the time a CPU has to sit idle because it’s waiting for data from memory, but it does illustrate the point.

Fine, but HBM isn’t a panacea.
It has traces that are orders of magnitude shorter, and better signal integrity.
We all understand that.
But people like to be able to upgrade, replace or swap their memory when needed.

If your memory stick dies, it’s a big difference whether you have to replace only that stick or the whole friggin SoC or even motherboard with it.

I can see cache dies becoming an obligatory part of new systems.
Maybe even some eDRAM dies as separate banks.
But people often need detachable/replaceable RAM and HBM won’t fix that.

Also, HBM has its own problems. For one, it needs connectivity along the edge of the chip, and edge space is often a scarce resource.

Also, at only 2x the efficiency of DDR5, HBM is kind of a poor choice.
All that extra cost, those design constraints and the non-expandability for what? Only 2x DDR5’s energy efficiency?!?

I suspect this is a tradeoff that many won’t like.

That’s a good point, I think I was conflating the two in my head at the time I wrote that comment.

Yeah, it seems the conditions that memory is accessed under heavily contribute to its energy efficiency, and there aren’t consistent measurements across different technologies/tests.

I found a paywalled paper that tries to compare different memory technologies as apples to apples as possible:

As a comparative note, Samsung’s HBM3E is supposed to run at ~64.8% of the power per bit transferred that HBM2 did, which does put it at an order of magnitude more power efficient than DDR5. I’m mostly comparing this to “high performance” DDR, as opposed to LPDDR, which does seem to come much closer to HBM in efficiency.

This reminds me of the “1/2/4 stack” 3D V-Cache option that started showing up in the Threadripper BIOSes; if AMD can start stacking multiple V-Cache dies, that would relieve a lot of the memory pressure that modern CPUs have.

For some reason I thought there were SODIMMs with LPDRAM on them; apparently there are not, so I guess other pluggables with LPDRAM on them could be warranted.
Maybe it was the low voltage variant of DDR I was thinking of, like how we had DDR4L, but apparently there is no DDR5L.

This flexibility is preferred, but at what point would you give up the flexibility for more performance?
For me that number is probably around 1.5-2 times more performance for most use cases.

I’ve wondered what the failure rates are for HBM, because if it dies, some really expensive things have to be replaced along with it.
I’ve got some HBM CPUs and there are some weird extra memory provisioning options in the BIOS that imply some cells could go bad and the CPU will just use “spare” cells.

That’s 2 times the efficiency of the entire system for a given task, which is pretty substantial; imagine halving the TDP of the system and getting the same performance “just” by swapping out memory (I say just swapping out, but it obviously isn’t that easy).

I’ve looked at both the ones you’ve mentioned, though Zhao et al. 2022 definitely isn’t paywalled. Something to be aware of is MDPI has a notoriously lax peer review process, has been on and off predatory publisher lists, and as a result some university systems will not fund publications with them or consider MDPI papers in their promotion and tenure processes. If you do basic due diligence and check Table 1’s citations, one is a data-free Hynix press release and the other’s a Synopsys blog with an HBM2 PHY-specific pJ/bit bar graph that doesn’t include DDR5 or HBM3.

So probably Table 1 is made up. The DDR4 and 5 bandwidths, pin counts, and capacities have obvious errors. The HBM columns I’d bet started by plagiarizing Wikipedia (note the 307 GB/s error) and I think it’s worth asking how HBM can have a pin count when it doesn’t have pins. I wouldn’t be surprised if an LLM managed to hallucinate more accurate pJ/bit totals.

Besides measured DDR5 pJ/bit, the HBM3 number I mentioned is an average of Micron’s and Hynix’s datasheets. I’ve had a harder time with HBM2, but Micron’s press release claims a 50+% reduction from HBM2E to HBM3, implying 8.6+ pJ/bit for HBM2E per their HBM3 datasheet. Which sounds high, so I suspect the press release has a basic math fail (maybe it’s more like 6.5 pJ/bit). I haven’t had luck trying to work back from HBM2E to HBM2.

Assuming Zhao et al. didn’t also make up their results, what they did is look at putting hot Redis data in HBM. Their methods don’t say how the energy metrics in Figure 6, 7, and 9 were calculated. DRAMsim3 seems to have been involved but it’s not indicated which DDR and HBM versions were compared and, making the dubious assumption Zhao et al. set up the sims with accurate values, #53 doesn’t exactly give me a lot of confidence.

Mmm, Drávai and Reguly 2024 have Granite Rapids roughly halving Sapphire Rapids power consumption. I’m not so sure their results say HBM is good so much as that Sapphire Rapids is a mess. Which we knew already, really, just not as specifically on this particular point.

Phoronix recently benched HBv4 (9V33X v-cache) and HBv5 (9V64H HBM3) but unfortunately doesn’t have power in the results. I’ve failed to find anything specific for Nvidia as well. Just handwavy stuff about gaining energy efficiency. It’s tempting to interpret that as an indication the power reduction isn’t particularly remarkable since, if it was, probably it’d get stated more specifically.

Another hint might be Intel not continuing Xeon Max. With Intel’s situation I’m hesitant to read much into it but, if there was consistently a large power reduction, I’d tend to expect at least a few follow-up parts to have happened in the past couple of years.

I am just waiting for DDR5-8000 CL28, so the latency is the same as DDR4-4000 CL14 or DDR3-2000 CL7.
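That equivalence checks out: absolute CAS latency is the CL cycle count divided by the command clock, and doubling both the transfer rate and the CL leaves it unchanged. A quick check:

```python
# Absolute CAS latency in ns: CL cycles over the command clock,
# where the command clock (MHz) is half the MT/s transfer rate.
def cas_ns(cl: int, mt_per_s: int) -> float:
    return cl * 2000 / mt_per_s

for gen, mt, cl in [("DDR5", 8000, 28), ("DDR4", 4000, 14), ("DDR3", 2000, 7)]:
    print(f"{gen}-{mt} CL{cl}: {cas_ns(cl, mt)} ns")  # 7.0 ns in every case
```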
