The DRAM cache on flash SSDs isn't what I thought it was: it only caches the FTL

So recently, in a Reddit thread, there were some fascinating comments from the “Intel Storage Technical Analyst” malventano that completely contradicted how I (and probably a lot of you) presumed the cache on an SSD works. As lame as just reposting Reddit comments is, I’d really hate for this information to just fade into the void.

FTL = Flash Translation Layer
OP = Over-provisioning
PLP = Power Loss Protection

[–]malventano
Slowly playing whack-a-mole with this misconception (I’m looking at you, Linus Tech Tips), but SSD DRAM is there to cache the FTL (this is why there is almost always 1MB DRAM per 1GB NAND - referred to as a 1:1 ratio for FTL). DRAM almost never caches user data - it passes through the controller straight to the input buffers of the NAND die, possibly hitting some SRAM along the way, but nothing that would qualify as a DRAM cache.
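
(A quick aside from me, not part of the Reddit thread: that 1:1 figure falls straight out of the arithmetic if you assume a typical page-mapped FTL with 4 KiB mapping granularity and 4-byte entries. Those specifics are my assumption, not something malventano stated, but the rough sketch below lands on the same ratio.)

```python
# Back-of-envelope: why roughly 1 MB of DRAM per 1 GB of NAND for the FTL.
# Assumes a page-mapped FTL: one 4-byte physical address per 4 KiB logical page.

NAND_CAPACITY_BYTES = 1 * 1024**3   # 1 GiB of NAND
PAGE_SIZE_BYTES     = 4 * 1024      # 4 KiB mapping granularity (assumed)
ENTRY_SIZE_BYTES    = 4             # one 32-bit physical address per entry (assumed)

entries     = NAND_CAPACITY_BYTES // PAGE_SIZE_BYTES    # 262,144 mapping entries
table_bytes = entries * ENTRY_SIZE_BYTES                # 1 MiB of table

print(f"{entries} entries -> {table_bytes / 1024**2:.0f} MiB of FTL table per GiB of NAND")
```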

[–]other_user
Then why do drives with power loss protection have such a massive performance advantage with sync writes compared with drives that do not? This is exactly the behavior discussed by Serve The Home as well:
https://www.servethehome.com/what-is-the-zfs-zil-slog-and-what-makes-a-good-one/
“To maximize the life of NAND cells, NAND based SSDs typically have a DRAM-based write cache. Data is received, batched, and then written to NAND in an orderly fashion.”

[–]malventano
I don’t know where the misconception started. Most likely from the fact that DRAM on HDDs does cache user data. Different drives report completion at different stages of the process. Data center drives have higher OP, which enables them to maintain lower steady-state write latencies for a given workload compared to lower-end drives. PLP caps typically come on those same drives with higher OP, but the existence of those caps is not the thing that’s increasing the performance. Further, those caps don’t have enough energy density to keep an 18-25W SSD going for the several seconds it would take to empty GBs (or even MBs, for a worst-case workload) of user data. Just count up the total microfarads of the caps and figure how fast the voltage drops with a 20W load. We’re talking fractions of a second here.
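
(Another aside from me: that “fractions of a second” claim is easy to sanity-check. The capacitance and voltage numbers below are illustrative guesses, not specs for any particular drive; only the ~20W load figure comes from his comment.)

```python
# Rough hold-up time for a PLP capacitor bank under load.
# C, V_full and V_min are made-up but plausible values; P_load is from the comment.

C      = 1000e-6   # total capacitance in farads (1000 uF bank, assumed)
V_full = 35.0      # fully charged voltage of the cap bank (assumed)
V_min  = 30.0      # minimum voltage the drive's regulators can run from (assumed)
P_load = 20.0      # controller + NAND power draw in watts

usable_energy = 0.5 * C * (V_full**2 - V_min**2)   # joules available above V_min
hold_up_time  = usable_energy / P_load             # seconds at constant power

print(f"{usable_energy * 1000:.1f} mJ usable -> {hold_up_time * 1000:.1f} ms of hold-up at {P_load:.0f} W")
```

That works out to single-digit milliseconds - enough to commit what is sitting in the NAND input buffers, nowhere near enough to write out gigabytes of DRAM.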

[–]other_user
Every technical article I’ve ever read disagrees with this. Is there anything you can point to that backs up this assertion? Is it possible that some drives behave the way you describe and others do not?
https://www.vikingtechnology.com/wp-content/uploads/2021/03/AN0025_SSD-Power-Fail-Protection-Whitepaper-RevE.pdf
https://www.kingston.com/en/blog/servers-and-data-centers/ssd-power-loss-protection
https://www.atpinc.com/blog/why-do-ssds-need-power-loss-protection
https://us.transcend-info.com/embedded/technology/power-loss-protection-plp
https://newsroom.intel.com/wp-content/uploads/sites/11/2016/01/Intel_SSD_320_Series_Enhance_Power_Loss_Technology_Brief.pdf
“During an unsafe shutdown, firmware routines in the Intel SSD 320 Series respond to power loss interrupt and make sure both user data and system data in the temporary buffers are transferred to the NAND media.”

[–]malventano
I reviewed the SSD 320, and I now work at Intel. ‘Buffers’ refers to the input buffers of the NAND die, not the DRAM. A simple way to put this to bed: if the DRAM cached user data the way the misconception would have you believe, then SSDs with multiple GB of DRAM would do amazingly well on benchmarks with smaller test files. CrystalDiskMark’s default is 1GB, and it would be going RAM-disk speeds on data center SSDs. It doesn’t. Even 4KB random reads on a 1MB (not 1GB) span always go at NAND speeds, not DRAM speeds.
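
(If you want to check this yourself, here’s a minimal sketch of the kind of test he’s describing: 4 KiB random reads over a tiny span with the OS page cache bypassed. The path, span, and read count are placeholders; it assumes Linux, Python 3.7+, and an existing test file of at least 1 MiB on the SSD under test.)

```python
# Measure average 4 KiB random-read latency over a 1 MiB span.
# If user data were cached in the SSD's DRAM, latencies should sit near the
# NVMe protocol floor; in practice they land at NAND read latencies instead.

import mmap, os, random, time

PATH    = "/mnt/testssd/testfile.bin"   # hypothetical test file on the SSD under test
SPAN    = 1 * 1024 * 1024               # 1 MiB span - far smaller than any DRAM cache
BLOCK   = 4096
N_READS = 2000

fd  = os.open(PATH, os.O_RDONLY | os.O_DIRECT)   # bypass the OS page cache
buf = mmap.mmap(-1, BLOCK)                       # page-aligned buffer, required by O_DIRECT

t0 = time.perf_counter()
for _ in range(N_READS):
    offset = random.randrange(SPAN // BLOCK) * BLOCK
    os.preadv(fd, [buf], offset)
t1 = time.perf_counter()
os.close(fd)

print(f"avg 4 KiB random read latency: {(t1 - t0) / N_READS * 1e6:.1f} us")
```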

[–]malventano
That 6GB isn’t caching user data. It caches the FTL. Power loss caps don’t hold enough charge to cover writing multiple GB of user data at worst case scenarios (single sector random). That low latency you see on writes is due to the SSD reporting completion once the write has passed to the input buffer of the NAND die (and the caps are only sufficient to ensure what’s in those buffers will commit to NAND).

[–]malventano
Consumer SSDs have lower over-provisioning (total NAND closer to available capacity) and range from that same 1:1 ratio (1 MB DRAM per 1 GB NAND) all the way down to ‘DRAM-less’ (though there’s still a tiny amount in the controller, and some can use host memory to store part of the table (HMB)). Smaller DRAM caches hold smaller chunks of the FTL in memory, which makes performance fall off a cliff sooner the ‘wider’ the active area of the SSD is. Think of it as a big database that has to swap to disk when there isn’t enough DRAM to hold the whole thing. Most ‘average user’ usage can get away with very low ratios, but DC SSDs generally hold the full table in DRAM because the assumption is that the entire span of the NAND will be active.
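
(One more aside: the “database that has to swap to disk” analogy is easy to model. The toy simulation below treats the cached portion of the FTL as an LRU cache of table segments and shows the hit rate collapsing as the active span outgrows it. All of the geometry numbers are invented for illustration, not taken from any real controller.)

```python
# Toy model: only part of the FTL mapping table fits in DRAM (or HMB).
# Misses force a table fetch from NAND before the user I/O can be serviced.

import random
from collections import OrderedDict

SEGMENT_COVER  = 32 * 1024**2                    # LBA span covered per cached table segment (assumed)
CACHE_SEGMENTS = (64 * 1024**2) // (32 * 1024)   # 64 MB of table DRAM -> 2048 segments (assumed)
ACCESSES       = 200_000                         # random 4 KiB accesses per experiment

def ftl_hit_rate(active_span_bytes):
    cache, hits = OrderedDict(), 0
    for _ in range(ACCESSES):
        seg = random.randrange(active_span_bytes // SEGMENT_COVER)
        if seg in cache:
            hits += 1
            cache.move_to_end(seg)               # refresh LRU position
        else:
            cache[seg] = True
            if len(cache) > CACHE_SEGMENTS:
                cache.popitem(last=False)        # evict least-recently-used segment
    return hits / ACCESSES

for span_gb in (32, 64, 256, 1024, 2048):
    print(f"active span {span_gb:5d} GB -> FTL cache hit rate {ftl_hit_rate(span_gb * 1024**3):6.1%}")
```

With this 64 GB-equivalent cache, hit rates stay near 100% until the active span exceeds what the cached table covers, then fall roughly in inverse proportion to the span - the “cliff” he’s describing.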

A good tip/gotcha when benchmarking SSDs

[–]other_user
I noticed during benchmarking that smaller partitions on an SSD gave proportionately slower write performance, as if the partition were pre-allocated to a subset of the NAND packages.
Would not a full-disk SLOG partition give the same wear endurance benefit, while allowing the SSD controller to spread writes across more NAND packages? Only a small portion would ever be used at any time due to txg tunables of course.

[–]malventano
This really shouldn’t happen (and normally it’s the other way around since shorter spans net you higher effective overprovisioning, meaning higher steady state performance). If you progressively tested smaller partitions then it’s possible that your prior tests fragmented the FTL and slowed the performance on those later tests. TRIM the whole SSD between tests to avoid this.
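
(For the “TRIM the whole SSD between tests” step, here’s a rough sketch of what I’d run on Linux. The device and mount point names are placeholders, and the first command is destructive.)

```python
# Reset FTL state between benchmark runs. blkdiscard discards EVERY block on
# the device and erases all data; fstrim only trims free space on a mounted fs.

import subprocess

# DESTRUCTIVE: discards the entire device (needs root; run only on an unmounted device).
subprocess.run(["blkdiscard", "/dev/nvme1n1"], check=True)

# Gentler alternative when the drive holds a mounted filesystem:
# subprocess.run(["fstrim", "-v", "/mnt/testssd"], check=True)
```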

And some insight into bad block remap behavior

[–]malventano
As a general rule, SSDs and HDDs will not remap a bad block on reads, only on writes. This is to aid in data recovery, and you’d also never want your storage device to silently replace a bad block with corrupted data on its own. It’s better to time out so the user knows the data was in fact unreadable.

[–]other_user
That seems sensible. I just contrast this with a number of HDDs I’ve used that developed remapped sectors w/out ever reporting errors to the OS. I suppose the HDDs must have recognized the errors during the write and the SSD during a read operation.

[–]malventano
HDDs have a bit more wiggle room on whether they can chance a successful read after a few head repositions, so if an HDD’s FW had to retry really hard to read a sector and succeeded after a few attempts, the FW would remap the sector and mark that physical location as bad.
SSDs have similar mechanisms, but instead of it being a mechanical process, the read is retried with the voltage thresholds tweaked a bit. Sometimes that gets the page back, but SSD FW is less likely to map a block out as bad (or less likely to report it in SMART if it did), as cell drift over time is a more transparent / expected process (meaning the block may have looked bad but is actually fine if rewritten). SSDs have higher thresholds than HDDs for what it takes to consider a block failed. That all works transparently right up until you hit a page/block that’s flat-out unreadable no matter what tricks it tries, and that’s where you run into the timeout / throw-an-error-to-the-user scenario.
This behavior goes way back - I had an X25-M develop a failed page and it would time out on reads just like your M4 until the suspect file was overwritten (this doesn’t immediately overwrite the NAND where the file sat, but the drive would not attempt to read that page again until wear leveling overwrote it later on).

I always appreciate it when the people “in the weeds” take the time to correct popular knowledge.


I like Allyn, and he’s even been on a couple of L1 vids.
He was probably the main thing on PC Per I actually watched/read, and he knows way more than I do, but I am surprised.

HDD controllers have big caps/BBUs to store enough charge to either keep their RAM cache alive or write it out to their disks. SSDs don’t use as much power, and I am surprised the caps don’t have enough juice, but I don’t have PLP drives to find out.

I dunno, some of these drives, including 4-level-cell (QLC) ones, do suspiciously well for a while, then fall off a cliff.

Ones with dynamic “SLC” emulation work less well than the RAM-cache ones.
It may only be the FTL in action, but Allyn’s own toolkit was developed to work around the cache for a reason.

In the same way, when testing array speed, one discounts the first cache chunk before evaluating speed.

But I don’t work at a former SSD vendor, nor am I a storage guy.

I’ll take his words as genuine, and bear them in mind.

Thanks for sharing @Log


I second that.

When I heard about 15W of power draw and MICROfarads of capacitance, I was like “duh, why didn’t I think of that before?” Yeah… that level of capacitance will certainly not keep the entire drive going for very long.
But caps are like SRAM: very, very cool, but they take up a lot of space.

Suddenly this all makes some sense in my aging brain. I had the misconception of SSD DRAM caching user data as well.

My Toshiba drives actually have PLP according to the data sheet. Can’t say what tech is used for the 512MB cache; might be MLC. And HDD electronics have much more space to work with, after all. I wouldn’t call them “big caps”, though. Big caps are 100 Farads to me :slight_smile:


It is nice that the abbreviation was mentioned, but for anyone else who does not know about the Flash Translation Layer, have a read!
