The DRAM cache on flash SSDs isn’t what I thought it was: it only caches the FTL

So recently in a Reddit thread there were some fascinating comments by the “Intel Storage Technical Analyst” malventano that completely contradicted how I (and probably a lot of you) presumed a cache on an SSD works. As lame as just reposting Reddit comments is, I’d really hate for this information to just fade into the void.

FTL = Flash Translation Layer
OP = Over-provisioning
PLP = Power Loss Protection

[–]malventano
Slowly playing whack-a-mole with this misconception (I’m looking at you, Linus Tech Tips), but SSD DRAM is there to cache the FTL (this is why there is almost always 1MB DRAM per 1GB NAND - referred to as a 1:1 ratio for FTL). DRAM almost never caches user data - it passes through the controller straight to the input buffers of the NAND die, possibly hitting some SRAM along the way, but nothing that would qualify as a DRAM cache.

[–]other_user
Then why do drives with power loss protection have such a massive performance advantage with sync writes compared with drives that do not? This is exactly the behavior discussed by Serve The Home as well:
https://www.servethehome.com/what-is-the-zfs-zil-slog-and-what-makes-a-good-one/
“To maximize the life of NAND cells, NAND based SSDs typically have a DRAM-based write cache. Data is received, batched, and then written to NAND in an orderly fashion.”

[–]malventano
I don’t know where the misconception started. Most likely that DRAM on HDDs does cache user data. Different drives report completion at different stages of the process. Data center drives have higher OP that enables them to maintain lower steady state write latencies for a given workload compared to lower end drives. PLP caps typically come on those same drives with higher OP, but the existence of those caps is not the thing that’s increasing the performance. Further, those caps don’t have enough energy density to keep an 18-25W SSD going for the several seconds that it would take to empty GB’s (or even MB’s for worst case workload) of user data. Just count up the total microfarads of the caps and figure how fast voltage drops with a 20W load. We’re talking fractions of a second here.
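
To put numbers on that “fractions of a second” point, here’s a quick back-of-the-envelope hold-up-time calculation in Python. The capacitance, voltages, and load below are made-up illustrative figures, not any particular drive’s specs:

```python
# Back-of-the-envelope: how long can PLP caps keep a drive alive?
# Usable energy between the supply voltage and the minimum operating voltage is
# E = 1/2 * C * (V_start^2 - V_min^2); hold-up time is E / P.
# All values below are illustrative assumptions, not a real drive's specs.

def holdup_time_s(cap_farads: float, v_start: float, v_min: float, load_watts: float) -> float:
    energy_j = 0.5 * cap_farads * (v_start**2 - v_min**2)
    return energy_j / load_watts

# e.g. ~1000 µF of total capacitance, a 12 V rail sagging to 9 V, 20 W load
t = holdup_time_s(1000e-6, 12.0, 9.0, 20.0)
print(f"hold-up time ≈ {t * 1e3:.2f} ms")   # ≈ 1.6 ms: milliseconds, not seconds
```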

[–]other_user
Every technical article I’ve ever read disagrees with this. Is there anything you can point to that backs up this assertion? Is it possible that some drives behave the way you describe and others do not?
https://www.vikingtechnology.com/wp-content/uploads/2021/03/AN0025_SSD-Power-Fail-Protection-Whitepaper-RevE.pdf
https://www.kingston.com/en/blog/servers-and-data-centers/ssd-power-loss-protection
https://www.atpinc.com/blog/why-do-ssds-need-power-loss-protection
https://us.transcend-info.com/embedded/technology/power-loss-protection-plp
https://newsroom.intel.com/wp-content/uploads/sites/11/2016/01/Intel_SSD_320_Series_Enhance_Power_Loss_Technology_Brief.pdf
“During an unsafe shutdown, firmware routines in the Intel SSD 320 Series respond to power loss interrupt and make sure both user data and system data in the temporary buffers are transferred to the NAND media.”

[–]malventano
I reviewed the SSD 320, and I now work at Intel. ‘Buffers’ refers to input buffers of the NAND die, not the DRAM. Simple way to put this to bed is that if the DRAM cached user data as the misconception would have you believe, then SSDs with multi-GB of DRAM would do amazing on benchmarks with smaller test files. Crystaldiskmark default is 1GB, and it would be going RAMdisk speeds on data center SSDs. It doesn’t. Even 4KB random reads on a 1MB (not 1GB) span always go NAND speeds, not DRAM speeds.
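
If you want to sanity-check that yourself, here’s a rough sketch of the kind of test he describes: 4 KiB random reads confined to a 1 MiB span of the raw device. The device path, span, and iteration count are assumptions; it’s Linux-only, needs root, and uses O_DIRECT so the OS page cache can’t answer the reads for you.

```python
import mmap
import os
import random
import time

# Sketch: 4 KiB random reads confined to a 1 MiB span of a raw block device.
# If DRAM cached user data, such a tiny span would read back at DRAM speeds;
# in practice the latencies stay at NAND levels. Device path is an assumption.
DEV = "/dev/nvme0n1"
SPAN = 1 << 20            # 1 MiB span
BLOCK = 4096              # 4 KiB reads; O_DIRECT needs aligned offsets/sizes
COUNT = 10_000

fd = os.open(DEV, os.O_RDONLY | os.O_DIRECT)
buf = mmap.mmap(-1, BLOCK)            # page-aligned buffer, required by O_DIRECT
start = time.perf_counter()
for _ in range(COUNT):
    offset = random.randrange(0, SPAN, BLOCK)
    os.preadv(fd, [buf], offset)      # direct read, bypassing the page cache
elapsed = time.perf_counter() - start
os.close(fd)

print(f"avg 4K random read latency: {elapsed / COUNT * 1e6:.1f} µs")
```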

[–]malventano
That 6GB isn’t caching user data. It caches the FTL. Power loss caps don’t hold enough charge to cover writing multiple GB of user data at worst case scenarios (single sector random). That low latency you see on writes is due to the SSD reporting completion once the write has passed to the input buffer of the NAND die (and the caps are only sufficient to ensure what’s in those buffers will commit to NAND).

[–]malventano
Consumer SSDs have lower over-provisioning (total NAND closer to available capacity) and range from that same 1:1 ratio (1 MB DRAM per 1 GB NAND) all the way down to ‘dramless’ (but there’s still a tiny amount in the controller, and some can use the host memory to store some of it (HMB)). Smaller DRAM caches will hold smaller chunks of the FTL in memory, which makes performance fall off of a cliff sooner the ‘wider’ your active area of the SSD is. Think of it as a big database that has to swap to disk when there isn’t enough DRAM to hold the whole thing. Most ‘average user’ usage can get away with very low ratios, but DC SSDs generally always hold the full table in DRAM because the assumption is that the entire span of the NAND will be active.
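
As an aside, that 1 MB-per-1 GB figure falls straight out of the arithmetic if you assume a flat logical-to-physical map with one 4-byte entry per 4 KiB page (a simplification of real FTL designs):

```python
# Why ~1 MB of DRAM per 1 GB of NAND: assume a flat logical-to-physical map
# with one 4-byte entry per 4 KiB page (a simplification of real FTL designs).
ENTRY_BYTES = 4
PAGE_BYTES = 4096

def ftl_table_bytes(capacity_bytes: int) -> int:
    return capacity_bytes // PAGE_BYTES * ENTRY_BYTES

for tb in (1, 2, 4, 8):
    cap = tb * 10**12
    print(f"{tb} TB drive -> ~{ftl_table_bytes(cap) / 10**9:.1f} GB of mapping table")
# 1 TB -> ~1.0 GB, 2 TB -> ~2.0 GB, ... i.e. roughly 1 MB per GB of NAND
```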

A good tip/gotcha when benchmarking SSDs

[–]other_user
I noticed during benchmarking that smaller partitions on an SSD gave proportionately slower write performance, as if the partition was pre-allocated to a subset of the NAND packages.
Would not a full-disk SLOG partition give the same wear endurance benefit, while allowing the SSD controller to spread writes across more NAND packages? Only a small portion would ever be used at any time due to txg tunables of course.

[–]malventano
This really shouldn’t happen (and normally it’s the other way around since shorter spans net you higher effective overprovisioning, meaning higher steady state performance). If you progressively tested smaller partitions then it’s possible that your prior tests fragmented the FTL and slowed the performance on those later tests. TRIM the whole SSD between tests to avoid this.
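
On Linux, trimming the whole device between runs can be done with util-linux’s blkdiscard. Here’s a minimal wrapper, with the obvious caveat that it destroys all data on the target (the device path is a placeholder):

```python
import subprocess
import sys

# Discard (TRIM) an entire block device between benchmark runs so earlier
# tests don't leave the FTL fragmented. THIS DESTROYS ALL DATA on the target.
# Relies on util-linux's blkdiscard; the device path is a placeholder.
DEV = "/dev/nvme0n1"

confirm = input(f"This will discard ALL data on {DEV}. Type the device path to confirm: ")
if confirm != DEV:
    sys.exit("aborted")
subprocess.run(["blkdiscard", DEV], check=True)
print(f"{DEV} fully trimmed; the FTL starts the next run clean")
```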

And some insight into bad block remap behavior

[–]malventano
As a general rule, SSDs and HDDs will not remap a bad block on reads. Only on writes. This is to aid in data recovery, and you’d also never want your storage device to just on its own silently replace a bad block with corrupted data. It’s better to timeout so the user knows the data was in fact unreadable.

[–]other_user
That seems sensible. I just contrast this with a number of HDDs I’ve used that developed remapped sectors w/out ever reporting errors to the OS. I suppose the HDDs must have recognized the errors during the write and the SSD during a read operation.

[–]malventano
HDDs have a bit more wiggle room on if they can chance a successful read after a few head repositions, so if a HDD FW had to retry really hard to read a sector and succeeded after a few attempts, the FW would remap the sector and mark that physical location as bad.
SSDs have similar mechanisms but instead of it being a mechanical process the read is retried with the voltage thresholds tweaked a bit. Sometimes that gets the page back, but SSD FW is less likely to map a block out as bad (or less likely to report it in SMART if it did) as it’s more of a transparent / expected process to have cell drift over time, etc (meaning the block may have looked bad but it’s actually fine if rewritten). SSDs have higher thresholds for what it takes to consider a block as failed vs. HDDs. That all works transparently right up until you hit a page/block that’s flat out unreadable no matter what tricks it tries, and that’s where you run into the timeout/throw error to user scenario.
This behavior goes way back - I had an X25-M develop a failed page and it would timeout on reads just like your M4 until the suspect file was overwritten (this doesn’t immediately overwrite the NAND where the file sat, but the drive would not attempt to read that page again until wear leveling overwrote it later on).

I always appreciate when the people “in the weeds” take the time to correct popular knowledge.

I like Allyn, and he’s even been on a couple of L1 vids.
He was probably the thing on PC Per I actually watched/read, and he knows way more than I do, but I am surprised.

HDD controllers have big caps/BBUs to store enough charge to either keep their RAM cache active or write it out to their disks. SSDs don’t use as much power, and I am surprised the caps don’t have enough juice, but I don’t have PLP drives to find out.

I dunno, some of the drives, including quad-level-cell (QLC) ones, do suspiciously well for a while, then fall off a cliff.

Ones with dynamic “SLC” emulation work less well than the RAM-cache ones.
It may only be the FTL in action, but Allyn’s own toolkit was developed to work around the cache, for a reason.

In the same way, when testing array speed, one discounts the first cache-sized chunk before evaluating.

But, I don’t work at a former SSD vendor, nor am I a storage guy.

I’ll take his words as genuine, and bear them in mind.

Thanks for sharing @Log

I second that.

When I heard about 15W of power and MICROfarads, I was like “duh, why didn’t I think of that before?” Yeah… that level of capacitance will certainly not keep the entire drive going for very long.
But caps are like SRAM: very, very cool, but they take up a lot of space.

Suddenly this all makes some sense in my aging brain. I had the misconception of SSD DRAM caching user data as well.

My Toshiba drives actually have PLP according to the data sheet. Can’t say what tech is used for the 512MB cache; might be MLC. And HDD electronics have much more space to work with, after all. I wouldn’t call them “big caps” though; big caps are 100 farads to me :slight_smile:

Much to my chagrin, I recently learned this as well, even though I’ve been using SSDs since the first relatively affordable SandForce-based drives hit the market, and I consider myself an expert PC builder.

It becomes truly problematic if you go around (like I did) telling people they don’t need SSDs with a “DRAM write cache” for their boot drives when in fact the opposite is true: if the FTL is cached in DRAM, the OS is much more responsive, and the drive will last longer and come with a longer warranty. I’ve completely flipped my standard advice now, typically recommending a smaller, higher-tier boot drive and a larger, lower-tier apps/games drive.

Fortunately, this was more of an issue with SATA SSDs, which are gradually being phased out (if for no other reason than that they usually aren’t much cheaper than NVMe SSDs). HMB can mitigate the downsides of using a DRAM-less NVMe SSD as a boot drive.

I think it’s important to raise awareness of this misconception—especially in communities like ours—and appreciate this post with all the detailed quotes from the true expert! Thank you.

I’m a bit thrown by the assertion that an SSD consumes 20W of power?

I mean, I’m not ruling out that some do, but using the M1 Mac mini I have to hand and a plug-in power meter, the delta between running the Blackmagic Disk Speed Test and not running it is about 4W (so 11W versus 7W total system consumption), and you’ve got to think some of that must be CPU usage etc.

PM9A3 SSDs are quoted at under 8W in M.2 and under 13.5W in other form factors here:

I’m pretty sure that 8W corresponds to the maximum power delivery of the M.2 interface.

I mean, of course the caps don’t have enough power to keep the SSD going, but they don’t need to. They just have to have enough power to ensure that the last committed write and the state of the flash translation layer will be intact the next time the SSD receives power. That doesn’t necessarily mean the data is neatly filed away in its final TLC resting place; it could just as well be in a very fast SLC buffer that can be committed in a tiny fraction of a second.

So long as the SSDs with PLP keep doing what we expect them to do it doesn’t make much difference in any case.

One additional complication is that, in addition to DRAM FTL caching, some drives like the Samsung 840/850/860 EVOs also have an SLC write buffer (TurboWrite), originally up to 12GB and later up to 78GB. I think this is what got most of us confused. So very intensive steady-state activity could cause a performance cliff when they have to drop back to TLC-only writes, even if they seem to have enough DRAM (~1GB per 1TB) to avoid FTL thrashing.
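
A rough way to see that cliff for yourself is to write well past the rated SLC buffer size and log throughput per chunk; once the pSLC region fills, the rate drops to native TLC/QLC speed. This is only a sketch, and the path and sizes are assumptions:

```python
import os
import time

# Sketch: sequential writes in 1 GiB chunks, logging throughput per chunk,
# to watch for the drop when a pSLC/TurboWrite buffer fills up.
# Path and total size are assumptions - point TARGET at a scratch file on the
# drive under test and size TOTAL well past the drive's rated SLC cache.
TARGET = "/mnt/testdrive/slc_cliff.bin"
CHUNK = 1 << 30                    # 1 GiB per measurement
TOTAL = 200 << 30                  # 200 GiB total; adjust for your drive
block = os.urandom(1 << 20)        # 1 MiB of incompressible data

with open(TARGET, "wb") as f:
    written = 0
    while written < TOTAL:
        t0 = time.perf_counter()
        for _ in range(CHUNK // len(block)):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())       # force the chunk out of the page cache
        written += CHUNK
        rate_mb_s = (CHUNK / 1e6) / (time.perf_counter() - t0)
        print(f"{written >> 30:4d} GiB written: {rate_mb_s:7.1f} MB/s")
```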

Yeah, good to bring this up. I was one of the posters in that thread going back and forth about this, because I’ve never heard it described this way in any other article discussing SSDs. I guess at the end of the day it doesn’t really matter all that much what is happening, as long as, when the SSD acks a write, it is written to persistent storage and won’t be lost in case of a power failure.

But what I would like to understand is what is physically different in the data path of an SSD with PLP versus one without. There is clearly a performance difference between a sync write on a PLP drive and a sync write on a non-PLP drive; why is that? Assuming the Reddit posts are correct, what is the main driver here? It seems like in both cases the data has to be written to the NAND before it can be considered persistent, so why is a PLP drive so much faster at doing this than a drive without PLP?

Something isn’t right with what malventano is saying; I’m fairly sure the DRAM write cache of SSDs holds both user data and mapping tables, because I remember when NVMe 1.2 HMB was being touted as the new cool thing, it was said to be used mostly for mapping tables and not user data, hence the inferior performance compared to DRAM-backed SSDs. I’m going to look for a white paper on this.

Okay, what he says tracks. I missed the part where he said “almost never”. A DRAM cache does buffer user data the way any cache does, and that does result in a performance increase; it’s just that a 0.5-4GB total DRAM cache (on modern SSDs) is tiny and doesn’t always contribute much at “common” transfer sizes, while also being shared with the NAND address lookup tables.

Figure 9 in the following paper illustrates the performance degradation when writing outside the DRAM buffer; comparing Fig. 8(d) and Fig. 9(d) also shows that the DRAM cache on that particular SSD is not used for writes, but it can be inferred that it is used for reads insofar as it holds the NAND mapping tables, arguably showing that for small writes the DRAM is used directly for user data on the other three drives:

Kim JH, Jung D, Kim JS, Huh J. A methodology for extracting performance parameters in Solid State Disks (SSDs). Proceedings of the 2009 IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems. 2009: 1–10.

The reason you’re not going to get RAM-disk speeds out of an SSD is that there is so much more overhead, latency, and nondeterminism in the protocol and bus the SSD sits on compared with the bus system memory is on; we’re talking several orders of magnitude. Look at Optane DC Persistent Memory compared to Optane NVMe benchmarks.

This is what has got me bothered about the radical CXL advocates… CXL isn’t going to replace system memory, because it sits on a higher-latency, less deterministic, lower-bandwidth bus than system memory does.

The Micron 2550 is a DRAM-less PCIe Gen4 SSD and it hits 5GB/s no problem. Just proves that DRAM-less NVMe drives are great too: Micron 2550 DRAMless Gen4 Client SSD Review - The Worlds First 232-Layer NAND SSD Is A Game Changer | The SSD Review

I would agree that the whole DRAM cache thing from Linus is only really a SATA SSD concern, and much less relevant now that so few people buy SATA drives.

The problem with using DRAM as a buffer cache is that:

  • The buffer has to be flushed on every sync write, meaning you would hardly buffer anything at all.
  • When the SSD loses power, the data in DRAM is lost. That is no problem if you adhere to the standards and honor synchronous writes; real performance benefits would only come from ignoring synchronous writes and buffering everything the unsafe way.

The problem with that is that filesystems are designed around losing, at most, the writes since the last synchronous write. Modern filesystems have journals and “intent logs” (the ZIL in ZFS) that keep the filesystem consistent by insisting that data is actually persistent after a synchronous write. If the drive starts lying about that, a power loss could result in an ‘unformatted’ drive (“Would you like Windows to format this drive for you?” uhhh WHAT?!!!) and other filesystem damage, with possibly entire directories being moved to lost+found as orphans.

So using DRAM as a buffer cache is not all that spectacular. It would only provide additional speed if the drive lied about synchronous writes and accepted a risk of corruption on power loss. Intel used internal SRAM (not SDRAM, but SRAM) of 256 kilobytes, I believe, to actually buffer writes. The DRAM was there to cache the FTL, which does result in added performance.
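
To make the sync-write distinction concrete, here’s a minimal sketch of a synchronous versus a buffered write on Linux (the file paths are placeholders). A drive that honors sync/flush semantics can’t report completion until the data is durable, which is exactly why a volatile DRAM write-back cache buys little for this pattern:

```python
import os

# Synchronous write: with O_SYNC, write() does not return until the data (and
# required metadata) is reported durable, so any volatile write-back cache has
# to be flushed before completion. Paths below are placeholders.
fd = os.open("/mnt/testdrive/journal.bin", os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o644)
os.write(fd, b"x" * 4096)      # returns only once durability is acknowledged
os.close(fd)

# Buffered write: data may sit in the OS page cache (and the drive's volatile
# buffers) until an explicit barrier such as fsync().
fd = os.open("/mnt/testdrive/scratch.bin", os.O_WRONLY | os.O_CREAT, 0o644)
os.write(fd, b"x" * 4096)
os.fsync(fd)                   # flush the page cache and ask the drive to flush its caches
os.close(fd)
```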

The Micron 2550 uses the NVMe 1.2 spec’s HMB, so it’s using system DRAM for the cache instead of dedicated DRAM on the M.2 board. There is some overhead to HMB fetches across the PCIe bus, but it doesn’t show in most workloads.

In a sustained write or read workload you’ll see the 2550’s performance drop to a third or half of what other good consumer NVMe drives manage, although perhaps that has more to do with how the pSLC cache is being managed.

Consumer storage devices don’t have data-integrity audits at every step, so there isn’t really a strict concept of sync writes to them; requesting a cache flush is merely a suggestion to all the drives I’ve seen.
All SSD controllers use internal registers for a variety of processing and buffering functions, and those are made of SRAM and therefore volatile, meaning that without power loss protection they are perfectly capable of losing and/or corrupting data, with or without DRAM onboard.
Modern file systems have some very sophisticated recovery mechanisms for dealing with these situations, which is why the issue isn’t as prevalent as you’d think it would be.
