While there is tons of advice out there on using SSDs as cache devices for ZFS, there is very little on using SSHDs as raw disks for ZFS. Most discussions on this topic that I find on the net end up discussing generic aspects of cache and usage patterns. It goes without saying that whether this is a good idea will depend heavily on the use case, budget and a lot of other parameters, but for now I’m mainly curious about how it would work. I use the words “nand” and “flash” interchangeably in the text below to refer to the SSD portion of an SSHD. Here follow my thoughts so far on this prospect. As it became a fair amount of text, I have tried to use subheadings. Some knowledge of ZFS concepts is assumed.
SSHD nand cache vs L2ARC
Just to have the basics in place we need to sort out the different cache models. It is clear that SSHD is a very different cache model from using an SSD as L2ARC:
The L2ARC operates on filesystem-level data and sits between the file system and the vdevs. For good reasons, L2ARC is akin to a write-through cache: it is volatile and thus cannot stash data away and hope for it to be written to disk in the future, as that would go against the data integrity model of ZFS. As a result, a write cache on ZFS can only live for a few seconds (and will stay in RAM). Another consequence is that the layout of the array will not matter when the L2ARC hits on a read.
The SSHD nand, on the other hand, is never seen by ZFS and is always part of the physical device. I don’t know exactly how the SSHD operates, but it is likely that data written to the nand flash will not be written to disk until it is swapped out (for example when its space is needed for more recently used data), or at least not until after the write command has finished (i.e. the OS gets the message that the write is complete). So it will be a write-back cache, which is also optimal performance-wise. This is no problem for data integrity, as the cache is always exactly as redundant as the attached spinning disk. Sounds good - yes, but what about array layout?
SSHD and array layout
The layout of the arrays would affect how well the NAND cache in an SSHD will do its job. Most of these aspects hold true for any hard disk’s RAM cache too, as far as I can see. Consider the following layout cases:
RAIDZ(1,2…N): In my understanding this seems like a straightforward case. Every physical device in a RAIDZ vdev will, for each data block, hold either a portion of the block or a piece of parity for the block. When the block is read, all disks will do their part, except perhaps that the second and third parity disks are allowed to rest (given that the parity sums up). In general, most disks will read data, and whatever they read will be put in the flash. Importantly, when the same data block is requested again, the same disk will deliver the same data. Therefore the flash part of the SSHD should work at its best: regardless of whether it is parity or data being read from the disk, it is 100% likely to be found in flash if requested immediately after (again, not considering the possibility that parity disks > 1 might be allowed to rest, in the case of RAIDZ2…N). This reasoning holds equally for reads and writes.
MIRROR: Writes are nothing special, as these are sent to all disks, so recently written data will sit in the flash of all disks, ready to be read again without involving the mechanics. Reads of recently written data should then hit cache. But when it comes to reads of previously “cold” data there are a few possible issues. If I understand correctly, read commands are sent alternately to each disk, requesting different parts of the data in the classic RAID1 fashion. I don’t know whether ZFS has any preference, by design or otherwise, to always request the same block from the same disk in these cases. It may well be (as I speculate) that if block A is read from disk A one time, on the next read block A will be read from disk B. So the first read in this case would be a cache miss as usual, but so would the second read, as it is issued to another disk. For an N-way mirror there will on average be N reads of the same data that need to go to the spinning disk, even though the data was recently cached in the disk that just read it. I guess that this will even out with subsequent reads, and with time hot data will sit in the flash of all involved disks - even though each piece of data technically would only need to sit in one of the available SSHD flash chips. Unless of course ZFS has a way to always request the same data from the same disk when reading from a mirror.
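To get a feeling for the difference, here is a small toy simulation in Python. It is only a sketch under my own assumptions: each disk's flash is unbounded and keeps every block it has ever read, "random" routing picks a mirror member uniformly at random, and "sticky" routing is a hypothetical policy that always sends a given block to the same disk. The function name and parameters are made up for illustration, and nothing here reflects how ZFS or SSHD firmware actually behaves.

```python
import random


def count_disk_reads(n_disks, n_blocks, n_passes, sticky, seed=0):
    """Count reads that miss the per-disk flash and go to the platters.

    Toy model: each disk's flash is unbounded and keeps every block it
    has ever read. Routing policies are assumptions, not ZFS behaviour.
    """
    rng = random.Random(seed)
    cached = [set() for _ in range(n_disks)]  # blocks held in each disk's flash
    misses = 0
    for _ in range(n_passes):
        for block in range(n_blocks):
            if sticky:
                disk = block % n_disks         # same block always goes to the same disk
            else:
                disk = rng.randrange(n_disks)  # block goes to a random mirror member
            if block not in cached[disk]:
                misses += 1                    # this read has to touch the platters
                cached[disk].add(block)
    return misses


random_misses = count_disk_reads(n_disks=2, n_blocks=1000, n_passes=10, sticky=False)
sticky_misses = count_disk_reads(n_disks=2, n_blocks=1000, n_passes=10, sticky=True)
print("random routing:", random_misses, "platter reads")
print("sticky routing:", sticky_misses, "platter reads")
```

With sticky routing every block misses exactly once, while with random routing each block can miss up to once per mirror member before it is cached everywhere.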
The conclusion is that the flash will most likely be subject to the same size overhead as the spinning disks: effectively striped over the number of data disks in the case of RAIDZ, yielding the cache size of N nands for an N+parity RAIDZ, and mirrored in the case of a mirror, yielding the cache size of 1 nand for any number of devices. At face value it is perhaps not so surprising that the cache of a mirror becomes mirrored. But what is interesting is that it could effectively be a striped read cache in the mirror case, if there was a way to have ZFS read every block from the same device (barring checksum failures of course).
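As a back-of-the-envelope illustration of that conclusion, the arithmetic can be written down as a tiny Python helper. This is only my estimate from the reasoning above, not a measurement; the function name, the layout labels and the "mirror-sticky" case (a hypothetical mirror that always reads a given block from the same device) are invented for the example.

```python
def effective_flash_gb(layout, n_disks, flash_gb_per_disk, parity=0):
    """Estimate the usable (non-duplicated) flash cache of one vdev.

    Follows the rough reasoning in the post; all layouts are simplified.
    """
    if layout == "raidz":
        # Flash is effectively striped over the data disks only.
        return (n_disks - parity) * flash_gb_per_disk
    if layout == "mirror":
        # Every disk ends up caching the same hot data.
        return flash_gb_per_disk
    if layout == "mirror-sticky":
        # Hypothetical: each block is always read from the same disk.
        return n_disks * flash_gb_per_disk
    raise ValueError(f"unknown layout: {layout}")


# Example: six SSHDs with 8 GB of flash each in RAIDZ2, versus a 2-way mirror.
print(effective_flash_gb("raidz", 6, 8, parity=2))
print(effective_flash_gb("mirror", 2, 8))
print(effective_flash_gb("mirror-sticky", 2, 8))
```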
I read when searching around that ZFS tries to direct reads from a mirror to the “least busy” device. For a simple situation (unlikely in the datacenter but likely in the home lab) in which only one set of data is being read, for example booting a VM from an image, this could mean that blocks are read alternately from the mirrored disks in a predictable fashion, but it will at least depend on which disk the first block is read from. But another read from the same disk in the meantime can easily reverse the pattern.
In the opposite case of a totally random workload, reads would effectively go to random devices in the pool, and we would expect a ramp-up of cache performance according to the following in the case of a 2-way mirror:
- 1st read ever of a piece of data: total cache miss (obviously)
- 2nd read of the same data: at least 1/2 of the blocks are found in cache (half of reads will go to the wrong disk)
- 3rd read: 1/2 of reads that hit last time will miss this time (but hit next)
Edit: corrected above calculation.
So there will be the same amount of unnecessary misses over several reads, but their distribution over time will vary with some chance. I am writing “found in cache” because the blocks already are in cache by the time of read #2, it’s just that the system looks in the wrong cache. Here I am of course factoring out ARC and all other active caches.
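The ramp-up above can also be written in closed form. Assuming (my assumption, not something I know about ZFS) that each read of a block goes to a uniformly random member of an N-way mirror and that flash never evicts anything, the disk chosen at read k has the block cached unless every one of the previous k-1 reads went to some other disk, so the hit probability is 1 - (1 - 1/N)^(k-1). A small Python sketch with exact fractions, with function names made up for the example:

```python
from fractions import Fraction


def hit_probability(n_disks, read_number):
    """Chance that read number `read_number` of a block hits flash.

    Assumes uniformly random routing and no eviction (toy model).
    """
    other_disk = Fraction(n_disks - 1, n_disks)  # chance one earlier read went elsewhere
    return 1 - other_disk ** (read_number - 1)


def expected_misses(n_disks, n_reads):
    """Expected number of platter reads among the first `n_reads` reads."""
    return sum(1 - hit_probability(n_disks, k) for k in range(1, n_reads + 1))


for k in range(1, 6):
    print("read", k, "hit probability", hit_probability(2, k))
print("expected misses in 20 reads:", float(expected_misses(2, 20)))
```

For a 2-way mirror this gives 0, 1/2 and 3/4 for the first three reads, matching the list above, and the expected number of misses approaches N as the number of reads grows.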
So this is the worst case in terms of performance, and still perhaps not that bad. And accepting that SSHD nands are mirrored when they could in theory be striped in the mirror case is perhaps not bad either. After all, we kind of bought the “half the size” deal when going to mirrors from the start. RAIDZ, as mentioned, does not have this problem.
Other pros and cons
+(-) At the time of writing, L2ARC contents do not survive a reboot on all operating systems. They do on Solaris and Illumos, they do not on Linux (there is some work on it but it seems to have stalled about a year ago), and I am unsure about FreeBSD. Therefore SSHD may be the only easy way to have a reboot-persistent flash-based cache with ZFS and Linux.
+If my assumptions about the inner workings of SSHDs are correct, the cache is write-back, which means that SSHD is a simple way to enable write-back caching on ZFS.
+L2ARC needs extra RAM for bookkeeping, so adding an L2ARC does not really offload system RAM. Using SSHDs instead allows for a smaller ARC portion of RAM without losing performance.
-The SSHD flash portion is quite small compared to even a small separate SSD; usually it is 8 GB. I don’t know of any above-consumer-grade SSHDs.
Makeshift SSHD using an SSD and a HDD?
I don’t know whether it is possible to mimic an SSHD using a regular SSD and an HDD, in a way which allows the resulting device to be part of a ZFS vdev. This would be a very interesting prospect, as we would then not have the size constraint of consumer SSHDs. We could perhaps combine this with NAS HDDs which spin down when idle, and with clever provisioning we could have them sleep for most of the time.
Having thought this through at length above, I find this idea not too bad for some use cases. It would be a quite cheap way to add some flash cache to ZFS which is also write-back, without putting data integrity at risk. It might not be so useful in the datacenter, but for a single-user workstation or a home lab hosting a few virtual machines with high data locality (for example gaming VMs - at least I tend to play the same game on several occasions in a row) it would make sense. The only major quirk is the fact that the SSHD nands will not be optimally used in a mirror, both when it comes to size and hit efficiency. Any input on this is welcome, and maybe someone here has tested this already?