Idea for a highly flexible NAS

I have been using snapraid on my NAS for a few years now. Snapraid is software which adds redundancy to a disk array, but unlike other types of RAID and RAID-like systems it works at the file level. This means you can take existing disks and create parity data for them; nothing on the disks is changed and the filesystems are left exactly as they are. The parity data is stored on separate disks, and snapraid supports up to 6 layers of parity, meaning it can recover from up to 6 disk failures. This makes snapraid highly flexible: you can easily add or remove disks from the array, or add and remove parity layers. It works with disks of different sizes and filesystems, so long as the largest data disk is smaller than or equal in size to the parity disks.
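As a concrete illustration, a minimal snapraid.conf for a two-data-disk, two-parity setup might look something like this (all paths and disk names here are just examples):

```
# One parity file per layer of redundancy; snapraid supports up to six
parity /mnt/parity1/snapraid.parity
2-parity /mnt/parity2/snapraid.2-parity

# Content files hold the checksums and metadata; keep copies on
# multiple disks so they survive a disk failure
content /var/snapraid/snapraid.content
content /mnt/disk1/snapraid.content

# The data disks; sizes and filesystems can differ
data d1 /mnt/disk1/
data d2 /mnt/disk2/
```

Adding a disk to the array is just another `data` line followed by a sync, and removing one is the reverse, which is where the flexibility comes from.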

Another advantage is that because the data disks are independent of snapraid, if you lose more disks than the parity can recover from, you only lose the data that was on the failed disks; the data on the healthy disks is still fine. Similarly, if there are unrecoverable sectors during a recovery, you only lose the files affected by those sectors. As everyone knows, RAID is not a backup, but for home use it can be hard to justify the cost of a proper backup for a large amount of data. Snapraid is a sort of hybrid between RAID and backup in that it provides redundancy but also works from a snapshot, so deleted or modified files can be recovered. And it has no risk (or at least no additional risk) of total data loss if too many disks are lost for a full recovery.

However, snapraid does have some limitations. It was designed for use with media libraries, specifically storage of large files which don't change, or don't change often. Snapraid doesn't calculate parity in real time; instead it calculates parity for a snapshot of the array's state. If files are added after a sync and a disk dies, you will lose any new files on the disk which failed. That's not too much of a problem, but if files are removed or modified after a sync and a disk fails, there will be errors during recovery because the data no longer matches the parity. Most of the files will still be recoverable, but some will not, and the more data has changed, the more files become unrecoverable. This makes snapraid unsuitable for storage which isn't static like a media library.

My idea is to take the advantages of snapraid and combine them with btrfs, adding end-to-end integrity checks to the storage as well as removing the problem of unsynced file changes causing data loss during recovery. The current version (10) of snapraid doesn't fully support btrfs. I'm not sure exactly what isn't supported, but I think it's only an issue if using btrfs for the parity disks, which isn't necessary for this idea. Once version 11 is released I will convert my disks to btrfs and try it out (this can be done in place, which is awesome).

Essentially the idea is to use btrfs as the filesystem for the data disks. This is a single-disk configuration: each disk stands alone and is not part of any kind of RAID. Snapraid has built-in integrity checking for everything it syncs; any time data is read by snapraid it is checked against stored checksums and can be repaired if an error is found. However, when the user reads data from the array no integrity checks are performed, because reads go through the filesystem and not through snapraid. Using btrfs as the filesystem for the disks means that when the user reads from a disk, each block is checked. Because the disks aren't in any kind of RAID the files can't be repaired by btrfs, but the error will be detected, and the user can then use snapraid to repair it.
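Setting up one data disk this way is just a plain single-device btrfs filesystem. A sketch of the commands (device and mount point are examples, don't run them against a disk you care about):

```
# Single-device btrfs: data stored once, metadata duplicated ("dup")
# so filesystem structures survive isolated bad sectors
mkfs.btrfs -d single -m dup /dev/sdb
mount /dev/sdb /mnt/disk1

# A scrub re-reads everything and verifies checksums; on a single
# disk it can detect corruption but not repair it
btrfs scrub start -B /mnt/disk1
btrfs scrub status /mnt/disk1
```

The scrub is what surfaces silent corruption proactively, rather than waiting for a user to happen to read the bad file.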

Depending on how snapraid detects and reports integrity errors, it may be possible to have a script which takes the file flagged by btrfs and automatically runs the snapraid fix command, repairing that file. It won't be able to repair files on the fly the way btrfs and zfs do, but at least the user can be alerted in some way that the file they are accessing is corrupt. Testing is required to work out the best way of doing this, but if it's not possible to fix corrupted files as they're detected, it should at least be possible to notify the user of an error.
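As a rough sketch of what such a script could look like, the following scans the kernel log for btrfs checksum failures, resolves the inode in each message back to a path, and asks snapraid to fix just that file. The mount point is an example, and the exact log format can vary between kernel versions, so treat this as a starting point rather than a finished tool:

```shell
#!/bin/bash
# Sketch: react to btrfs checksum errors by repairing the affected
# file with snapraid. Mount point is a hypothetical example.

MOUNTPOINT=/mnt/disk1

# Extract the inode number from a kernel log line such as:
#   BTRFS warning (device sdb1): csum failed root 5 ino 257 off 4096 ...
extract_inode() {
    echo "$1" | grep -o 'ino [0-9]*' | awk '{print $2}'
}

dmesg | grep 'csum failed' | while read -r line; do
    ino=$(extract_inode "$line")
    [ -z "$ino" ] && continue
    # Resolve the inode back to a file path on the mounted filesystem
    path=$(btrfs inspect-internal inode-resolve "$ino" "$MOUNTPOINT")
    # Repair just that file; snapraid expects the path relative to
    # the disk root as defined in snapraid.conf
    snapraid fix -f "${path#$MOUNTPOINT/}"
done
```

Run from cron (or triggered after a scrub), this would turn btrfs's detection into snapraid's repair without the user doing anything by hand.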

For the problem of unsynced data, my solution is to run snapraid against a btrfs snapshot rather than the live disk. So when running snapraid sync, you first create a snapshot of each disk using btrfs and run the sync against that snapshot. This way any changes made afterwards won't affect a recovery, because all of those changes are stored separately. If a disk fails you will only lose the changes made to that disk since the last sync, and they won't prevent the recovery of other files. You can also create additional snapshots so that you can roll back to previous versions, and this won't affect snapraid either, as it only cares about the snapshot created for its own use. If you have to recover a disk you will lose the other snapshots on that disk, but snapshots on other disks won't be affected.
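A sketch of that snapshot-then-sync cycle is below. It assumes each data disk keeps its files in a `data` subvolume and that snapraid.conf points at the `.snapshots/snapraid` path rather than the live data; the mount points and subvolume names are all hypothetical. It defaults to a dry run that only prints the commands (set `DRY_RUN=0` to actually execute them):

```shell
#!/bin/bash
# Sketch of a snapshot-then-sync cycle. DRY_RUN=1 (the default here)
# only prints the commands instead of running btrfs/snapraid.

DRY_RUN=${DRY_RUN:-1}
DISKS="/mnt/disk1 /mnt/disk2"   # example mount points

run() {
    if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

refresh_snapshot() {
    # Drop the previous sync snapshot and take a fresh read-only one
    run btrfs subvolume delete "$1/.snapshots/snapraid"
    run btrfs subvolume snapshot -r "$1/data" "$1/.snapshots/snapraid"
}

for d in $DISKS; do
    refresh_snapshot "$d"
done

# Parity is calculated against the frozen snapshots, so the live
# subvolumes can keep changing while the sync runs.
run snapraid sync
```

Because the snapshots are read-only, nothing the user does between syncs can make the on-disk data drift away from what the parity describes.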

This configuration is for those who care more about flexibility than performance: being able to add or remove disks and not worry about total data loss, while still wanting redundancy and integrity checks. It still isn't suitable for all workloads, but by using snapshots, snapraid can be used for more than just static storage. Having said that, if you need some high-performance storage for VMs or something else, there's no reason you couldn't use an SSD or a couple of disks in RAID 0 for that. As long as the disks are smaller than or equal in size to the parity disks, anything can be added to the array. This means you can have RAID 0 with redundancy, or an SSD with mechanical redundancy, without sacrificing performance. Like I said, it's flexible.

Additionally, if you want to present your storage as a pool rather than as individual disks, you can use something like AUFS to do that. AUFS stacks directories on top of each other so they appear as a single volume while remaining individual disks (like snapraid, it doesn't touch the original filesystems). SSD caching can also be added to the disks by using bcache or something similar.
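For example, an AUFS pool over two data disks can be mounted with a branch list like this (mount points are hypothetical):

```
# Union of two disks presented at /mnt/pool; "rw" marks writable
# branches, and create=mfs sends new files to the branch with the
# most free space
mount -t aufs -o br=/mnt/disk1=rw:/mnt/disk2=rw -o create=mfs none /mnt/pool
```

The disks stay independently mountable underneath, so losing the pool layer loses nothing.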

Of course there are still limitations. While using snapshots means you no longer have to worry about unsynced data causing problems during recovery, it still isn't real time, so if a disk that has been modified since the last sync fails, those changes are lost. For highly transactional workloads this configuration probably isn't suitable. Because parity is calculated for a snapshot, the sync process can take a while, as all the parity is calculated at that time. If there are large files, like virtual disks, which have changed, the parity for the whole file has to be recalculated, not just the blocks which have changed. This limits how frequently a sync can be run, but daily, or every 12 or 6 hours, will be fine depending on workload.
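Scheduling is then just a cron job. Assuming the snapshot-and-sync steps are wrapped up in a script (the path here is hypothetical), a daily run at 03:00 would look like:

```
# /etc/crontab entry: snapshot each data disk, then snapraid sync
0 3 * * *  root  /usr/local/bin/snapraid-snap-sync
```

Bumping that to every 12 or 6 hours is just a matter of editing the schedule, at the cost of more frequent parity recalculation.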

Anyway, I haven’t tried this yet but I’m thinking of converting when snapraid 11 is released. Once I do I will be able to develop it further and post scripts and stuff like that. Obviously it’s not going to be for everyone but I think this system is a nice middle ground between individual disks and something like ZFS.

Here I am just plugging in an external drive for backup, feel like a chump.

Very cool read though.

I used to use external drives. Lost so much data. Never again :P