Serially connected RAM?

Someone asked a question about XMP crashing apps vs the OS…

That got me thinking: with all the XMP/ECC signal integrity nonsense going on, wouldn’t it make sense to transfer data over differential pairs?

Are there any hardware engineers around who could explain the latency / power tradeoffs?

I’m not one myself. Going differential would probably require going serial to cut down on the wiring needed, but how much latency would that add? And could you design RAM to “run over a wet string”?

Even though a RAM read/write today takes ~10 ns on the DRAM side of things, from the CPU’s perspective it’s more like 40-80-100 ns (there are multiple layers of caching and prefetching and TLBs, so it really varies a lot in practice from the instruction’s perspective).


Naively: a cache line is 64 bytes (512 bits) wide; add another 4 bytes (32 bits) for the address, a 32-bit CRC, and a 32-bit header. That’s 608 bits… let’s round it up to 80 bytes / 640 bits per read/write command.

If we have 2 wires and a 1 GHz signaling rate, is that 640 ns per command? (There’s a quick sketch of this arithmetic after the list below.)

  • What if multiple differential pairs are used (e.g. 8 pairs / 16 wires)? 80 ns?
  • What if we did multi-level signaling on each pair (possible because of differential pairs, or am I wrong?), e.g. are 4 or 8 levels OK? Is more possible? 8 levels carry log2(8) = 3 bits per symbol, which brings it down to ~27 ns per packet (read/write).
  • Is there something like QAM for these kinds of bit rates that can let us cheat here?
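Here’s a minimal sketch of that back-of-the-envelope arithmetic (a toy calculation, assuming the hypothetical 640-bit packet above and that an N-level symbol carries log2(N) bits; all lane counts and rates are just the guesses from this list):

```python
import math

PACKET_BITS = 640  # 512 data + 32 address + 32 CRC + 32 header, rounded up

def serialization_ns(pairs: int, gsym_per_s: float, levels: int = 2) -> float:
    """Pure serialization time for one packet, ignoring propagation and framing."""
    bits_per_symbol = math.log2(levels)       # 2 levels -> 1 bit, 8 levels -> 3 bits
    symbols_per_pair = PACKET_BITS / (pairs * bits_per_symbol)
    return symbols_per_pair / gsym_per_s      # Gsym/s is symbols per nanosecond

print(serialization_ns(pairs=1, gsym_per_s=1.0))            # 640.0 ns
print(serialization_ns(pairs=8, gsym_per_s=1.0))            # 80.0 ns
print(serialization_ns(pairs=8, gsym_per_s=1.0, levels=8))  # ~26.7 ns
```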

And what about the tradeoffs in pin count (a DDR4 DIMM has 288 pins, a SO-DIMM 260)? Packages carrying CPU silicon have tons of layers and pins, same as motherboards, all of which increases cost and complexity.

Can we have 32- or 64-channel memory for our CPUs that have 128 logical CPUs in them, the same way we have dual-channel on desktops and octa-channel on EPYC CPUs today?


How could this work?
Would it even make sense?


What if we said f***-it, we’ll have 4 KiB cache lines and 1-4 GB of L3/L4 cache instead, plus variable read/write packet lengths? Would that make lives easier, or harder?

Since DDR transfers data on the rising and falling edge, going differential would just throw half the data rate out the window.


That has a good bit less to do with it, because you can still use both the rising and falling edges; the thresholds would just be different.

Remember that in a fully differential pair you’d have twice the voltage signal at the receiving end. This helps with noise, since it takes more noise to disturb the signal.

We wouldn’t be throwing it out the window… we’d be learning how to create a fully differential memory architecture.

Pretty sure I’ve seen some stuff on this, but this thread is already #2 on Google.

I think what you were referring to is his wording on “serially connected”, in which case you are correct: you would lose an edge in that setup if it’s non-differential. If I’m understanding you properly.


Maybe access times are in the 10 ns range, but actually storing and reading data off of it is more like 80 ns. RAM is synchronous, so you can’t just dump 16/32/64 GB of data onto it in 10 ns, even if you could write to it using a controller whose only purpose is writing to it. Same goes for reads.

I don’t get what you’re saying. You don’t transfer the memory address to access it. The CPU queues the command to read memory, and once it’s ready it just passes the TLB address to the memory controller, which translates the TLB entry and reads data at the given address (row + column). A memory address never leaves the CPU; it’s just mapped in the TLB at boot, once the CPU knows the amount of memory in the system.

What do you mean?

Not possible unless you increase the PCB layer count to isolate most of the pairs. And doing so would make the PCB not cost-effective, and would significantly change the lengths of each pair between the memory socket and the CPU socket.

As soon as you’re not using a square wave anymore, you need something to modulate and demodulate the signal, which is gonna add a lot more latency and cost for no real benefit.

There are constraints when designing a CPU, and a memory controller capable of such numbers would use a huge part of the die. This complexity would also fall onto the socket and the motherboard, because you’d need to connect all the RAM you have at your disposal in that fashion.
MAYBE it would be possible if there were a way to address parts of a DIMM and have each on a dedicated channel (a dual-rank stick = one channel per rank, for example).

Variable length needs logic, and logic adds cost and latency. The issues with x86 and its inefficiencies, for example, are due to the variable instruction lengths it uses. I could go on with this topic for a while, so I’m gonna stop here to avoid going OT.

Too low a hit rate, needs lots of nested TLBs to reach a decent hit rate, and takes too much space on the die. Not efficient.

Can you be more specific on what’s the goal of this question? I’d like to know more about it.


I think there is a serial memory interface.
It’s called Open Memory Interface (OMI), and IBM is using it on POWER10.


Brainstorming. I recently read about NVLink switches and saw another thread here, and it got me thinking. I know just very basic physics/electronics/computer architecture, but I’m a software developer/engineer professionally and haven’t spent time researching this particular area.


My naive understanding of unbuffered DDR4 DIMMs is: you assert the address bits and the data bits on the pins with every rising/falling edge, indicate whether you’re reading or writing, and you cross your fingers for the right amount of time/clock cycles. And every single one of those pins has to be “energized” just right, at the right voltage, at the right moment in time.

It just sounds unnecessarily hard.


On pretty much any recent x86 CPU (e.g. Pentium 4 and newer), L3 cache lines are 64 bytes long. This makes DRAM reads/writes 64 bytes long when you have to populate or evict a cache line (I’m guessing). So if you have a single typical DDR4 DIMM, you basically do either 8 consecutive reads or 8 consecutive writes for every L3 cache miss.
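A quick sanity check of that burst arithmetic (assuming the standard 64-bit data bus of a DDR4 DIMM):

```python
cache_line_bytes = 64      # typical x86 L3 cache line
bus_bytes_per_beat = 8     # 64-bit DDR4 data bus, one beat per clock edge
beats = cache_line_bytes // bus_bytes_per_beat
print(beats)               # 8, matching DDR4's default burst length of 8
```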


So, the same way we replaced PCI with PCIe, ATA with SATA, and SCSI with SAS, can we do the same with RAM modules? … and what would the latency implications be?


Ooh, exciting… a new rabbit hole.

TL;DR:

  • 2019-era tech for POWER9 and above
  • connect standard DDR4 RAM to an SMC1000, then go from there over OpenCAPI to CPUs and other peripherals
  • ~4 ns of extra latency
  • 1.7 W
  • 8 lanes x 25 Gbps per lane == 25 GB/s of throughput (comparable to DDR4-3200; sanity-checked below)
  • OIF-28G-MR (which I think is the over-the-wire signalling spec)
  • 84 pins per module (but only 8 lanes; why is that not 16-20 pins per module?)
  • the approach gets compared to the older DDR2-era FB-DIMM and to AMD/Intel/… CXL (another rabbit hole… off I go)
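A naive sanity check of the throughput bullet above (raw bit rates only, ignoring line coding and protocol overhead; the DDR4 figure assumes the standard 64-bit data bus):

```python
omi_gb_per_s = 8 * 25 / 8                # 8 lanes x 25 Gbps each -> 25.0 GB/s
ddr4_3200_gb_per_s = 3200e6 * 8 / 1e9    # 3200 MT/s on an 8-byte bus -> 25.6 GB/s
print(omi_gb_per_s, ddr4_3200_gb_per_s)
```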

CXL/OpenCAPI/NVLink all solve similar problems in spirit.
At a high level, they allow devices to negotiate cache coherence / disambiguate ownership of address space, in order to provide fast, low-latency reads/writes afterwards.

It looks like vaporware.

CXL is a bolt-on on top of / beside PCIe 5.0, using the same electrical interface. Theoretically, a device can multiplex traditional PCIe transactions (buffered reads/writes) with the additional CXL-specific protocol, if that’s what it negotiates. PCIe 5.0 runs at 32 GT/s (~4 GB/s per lane). The point of using CXL instead of plain PCIe is to reduce latency (or so everyone claims, but they cite 30-50 ns, which is horrendous).
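The ~4 GB/s per-lane figure checks out if you take the raw 32 GT/s and subtract the 128b/130b line coding PCIe has used since 3.0:

```python
transfers_per_s = 32e9     # PCIe 5.0 per-lane raw transfer rate
encoding = 128 / 130       # 128b/130b line coding overhead
gb_per_s = transfers_per_s * encoding / 8 / 1e9
print(gb_per_s)            # ~3.94 GB/s per lane, per direction
```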

There’s apparently a consortium calling itself Gen-Z, which is hilarious when you look at the members, or perhaps scary if you consider that the common goal of all the members is essentially mind control :stuck_out_tongue: (big data and machine learning).

They claim to have shifted focus from OpenCAPI onto CXL.
There’s mention of CXL 2.0 and PCIe 6.


All this to say: there’s just too much politics there, and all these people are doing is focusing on complicated protocols that let you fill half a 2U server with RAM and the other half with CPUs and GPUs, and stick everything into a network.

It’s like desktop/workstation use cases don’t exist and are being pushed sideways, and we’re stuck with parallel-access DRAM DIMMs for years to come.

Most progress in that segment seems to come from Apple shortening the wires (eDRAM?) with their M1, if you consider the M1 Mac Mini a desktop. Tragic.


OMI / SMC1000 gets kind of close to what I was thinking of: moving part of the memory controller into the memory module.


Interestingly, if I take my 640-bit (64 bytes + overhead) dumb protocol at a 25 Gbps signaling rate per lane, it would take ~25.6 ns on a single lane; at 8 lanes (16 wires one way + 16 wires the other way + power/ground) that’s roughly 3.2 ns, suspiciously close to the 4 ns of overhead that OMI claims.

Apparently 100 Gbps/lane specs exist from the OIF; that would bring the serialization latency of this scheme down to <1 ns, and get you 4x more bandwidth without going “quad channel” on the CPU pins.
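Running the same dumb-protocol numbers at OMI-ish lane rates (pure serialization time for the hypothetical 640-bit packet, binary signaling, no protocol or flight-time overhead):

```python
PACKET_BITS = 640

def ser_ns(lanes: int, gbps_per_lane: float) -> float:
    # Gbps is bits per nanosecond, so the result comes out directly in ns
    return PACKET_BITS / (lanes * gbps_per_lane)

print(ser_ns(1, 25))     # 25.6 ns on a single 25 Gbps lane
print(ser_ns(8, 25))     # 3.2 ns across 8 lanes
print(ser_ns(8, 100))    # 0.8 ns at 100 Gbps per lane
```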


Let’s see if desktops/workstations still exist in 2032 and what kind of ram we end up using.

(or if this Gen-Z consortium of boomer companies ends up making us all just use phones and TVs with every piece of our data in the cloud, and turns us into the Buy n Large consumers from WALL-E).

Cheers!

Think of it like F1 to road cars: they always say that F1 is intended to test the new stuff, and maybe eventually it trickles down to road cars. It doesn’t always work and can take forever to get there, but I think a similar thing happens with big data/servers: they test all the mega-expensive new tech that has the small gains, and if it’s viable enough it maybe eventually makes its way down to the standard parts. Like we can do optical at home now.


Sure, let’s go!

Well, not really. Every operation is scheduled at the CPU level to make the most sense for the state the system is in. The general idea is that the kernel is balancing CPU-bound processes and I/O-bound processes to maintain a good level of multiprocessing and responsiveness. Also, having the RAM synced means that, if nothing catastrophic happens, you’re always gonna get all the data you asked for. Plus, if a peripheral needs data, you don’t even hit the CPU anymore because of DMA.

Okay, that absolutely makes sense. I only knew about entry sizes and associativity when it comes to caches, so reading “line” confused me.

All the interfaces you mentioned are evolutions that do not change the fundamentals. It’s like going from DDR3 to DDR4: same thing, just refined.

So you’re just looking to “create” some kind of RAM that’s faster and more reliable than the current standard, correct?

… some of it sadly never makes it.

Yes.

However it seems I’m a couple of years too late :wink: (…that being the only problem :p)

Kind of, but sometimes you have a CPU instruction that just takes a very, very long time: a simple mov from RAM to a register that’s taking 200 times longer than it usually would because it’s waiting on RAM.

It depends. A flush to RAM from the registers/cache is free, no CPU cycles required. But if you’re loading from RAM and the CPU is busy, sure, you can get longer load times. But keep in mind that what looks like a long CPU-bound process to us is constantly being broken up during its execution to avoid lock-ups.
There is a window in which a CPU can be busy with one operation, and once the window is over the CPU goes to the next in line. And the next is almost surely I/O, so RAM access.

All brain exercise, nothing wrong with that.

P.S. I hope I’m not saying obvious things, I don’t wanna sound condescending. I’m enjoying the conversation!

