Differential (Delta) backups for Linux?

The NAS is a build in progress. I’ve got the parts but ran into an issue with the LSI card, and we’re working on that in another thread.

The reason I was avoiding ZFS is that it is not officially supported on Linux. I’ve used software RAID before, and never really had a problem with it. I’ve never used ZFS, but know how to use MD and LVM (and luks when necessary). I really would rather use tools I know how to use and understand for a production system.

Are there no options for getting delta snapshots with a traditional file system?

The starting configuration for the NAS is 6x 14TB WD Ultrastars, 42TB usable. The NAS box will actually be a KVM host (probably CentOS 8) running VMs for NFS, LDAP, mail, phones, web, etc.

I didn’t plan on using Samba. All clients are Linux. I will probably go pure NFS with OpenLDAP to keep userIDs in sync.

ZFS is not that bad as long as you're willing to use slightly older kernels - for example, stick to LTS kernels, or use a non-rolling-release distro with DKMS.


Other options for cheap snapshots that I'm aware of are btrfs, LVM, xfs_freeze, or bcachefs.

btrfs is in upstream Linux and has subvolumes where ZFS has datasets. Snapshotting works pretty much the same otherwise - you get some files mounted somewhere (in btrfs snapshots are writeable by default, in ZFS they're read-only by default, but it's otherwise the same thing), and it works well. btrfs also has incremental send/receive and checksumming like ZFS. Compared to ZFS it lacks built-in online dedup (you can dedup later, or copy by reference with cp --reflink), built-in encryption (you can use LUKS under btrfs), ZVOLs (in ZFS you can make snapshottable block devices the same way you make datasets and use them for your VMs), and the option to move the ZIL/SLOG to a faster device (you can use dm-writecache or similar underneath btrfs). Like ZFS, you get efficient snapshots with send/receive, checksumming, scrubbing, and per-file recovery (from parity or other copies - if a file is broken, you can delete it and restore it from another copy). You can also ask for 'dup' to keep 2 copies of data or metadata, like copies=2 in ZFS, to ensure you can recover even if you have a single block device underneath.
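
As a rough sketch of how that looks day to day (the subvolume paths and backup host name here are made up, not a recipe):

    # create a read-only snapshot of a subvolume
    btrfs subvolume snapshot -r /data /data/.snapshots/2020-08-01

    # full send the first time, then incremental sends against the previous snapshot
    btrfs send /data/.snapshots/2020-08-01 | ssh backuphost btrfs receive /backup/data
    btrfs send -p /data/.snapshots/2020-08-01 /data/.snapshots/2020-08-02 \
      | ssh backuphost btrfs receive /backup/data

    # cheap copy-by-reference of a single file (no dedup daemon needed)
    cp --reflink=always bigfile.mov bigfile-copy.mov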

LVM snapshots (see the Gentoo wiki writeup) work by creating new logical volumes using existing volumes as a starting point. That means mounting filesystems by their filesystem superblock UUID is a bad idea (a snapshot carries the same UUID as its origin), and you need to be careful how you monitor free space in the volume group and how you grow logical volumes and filesystems inside the volume group. It's more work than just ZFS or btrfs.

Otherwise, these block-level snapshots are less efficient than btrfs or ZFS snapshots, because LVM (it's device mapper underneath that's doing the heavy lifting) doesn't know anything about the filesystem layout and will generally copy more data than required whenever you change anything.
This means you'll use more space and you'll lose some performance relative to btrfs/ZFS snapshots if you have a high churn in your dataset or if you have lots of snapshots.
In terms of making snapshots, you create a new LV with LVM, and you make sure you mount it read-only for your users somewhere.
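
Something like this, as a minimal sketch (the volume group and LV names are hypothetical):

    # snapshot LV "data" in VG "vg0"; the snapshot needs its own copy-on-write space (here 20G)
    lvcreate --snapshot --size 20G --name data_snap_2020-08-01 vg0/data
    mkdir -p /srv/snapshots/2020-08-01
    # mount it read-only (add nouuid as well if the filesystem is xfs)
    mount -o ro /dev/vg0/data_snap_2020-08-01 /srv/snapshots/2020-08-01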

How do you move data to the other machine with LVM snapshots? rsync. The way rsync works is to first exchange inventories/manifests of files, and then copy only what's needed as determined by file name/size/date/attributes. After the copy, you can make an LVM snapshot on the remote machine and mount it read-only, same as on the source machine - the files in the two snapshots are then the same (a sketch of that flow is below). You'll likely be running rsync as root in order to get all the filesystem permissions right for all users.
This approach doesn't work well for VM images or databases, where data changes randomly in the middle of a large file - rsync will always see something different. If you use qcow2 as backing storage, you can snapshot it from qemu separately right before making the LVM snapshot, which makes the amount of thrown-away data smaller when you rsync later.
Databases are their own beast when it comes to incremental backups; there are different table and journal formats, so basically you need to copy the whole thing.
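
The rsync-then-snapshot flow mentioned above could look roughly like this on the backup machine (host names, VG/LV names and paths are made up):

    # pull the current tree from the source machine, preserving ownership, ACLs and xattrs
    rsync -aHAX --delete root@nas:/srv/data/ /srv/backup-staging/data/

    # then freeze that state with an LVM snapshot of the LV holding /srv/backup-staging
    lvcreate --snapshot --size 20G --name backup_2020-08-01 vg0/backup_staging
    mkdir -p /srv/backups/2020-08-01
    mount -o ro /dev/vg0/backup_2020-08-01 /srv/backups/2020-08-01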

Rsync "copying" can be more efficient than transferring a whole file, and it can happen in a patchy/in-place way that doesn't transfer or rewrite the entire file on the receiving end, thanks to rolling checksums. It still requires reading the whole file at the source to compute the checksums, even if it doesn't transfer it over the network.
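
If you want to lean on that in-place behaviour for big files (VM images and the like), the relevant knob is --inplace; the paths here are just examples:

    # update large files in place instead of writing a temp copy and renaming it on the receiver;
    # the rolling-checksum delta transfer is used automatically over the network
    rsync -a --inplace /srv/vm-images/ backuphost:/backups/vm-images/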

I don't know if rsync has a magical flag to help with renames (there are ways to heuristically detect files that were renamed or moved, e.g. do they have the same inode, are they the same size, and so on).


I've been using zfs and btrfs in small home raid setups for almost 10 years, and until a couple of years ago I worked for a hyperscaler, directly with thousands of highly available SQL databases and with the infrastructure underpinning cloud VMs. In hindsight, most outages and problems were caused by our own stupidity and shooting ourselves in the foot. I understand the "stick to what you know" trope, but sometimes you need to invest in learning new things for your own future benefit, and especially for the benefit of your work.

… if you’re doing this for others and you have some sense of responsibility, you’d probably want to use ZFS with Linux, or btrfs on that single giant raid, regardless of not having experience with them before. Spend time getting up to speed with them. You’ll find them a lot nicer and more comfortable to use in the long run.

Also, at the end of the day, 50T of data is not so much that you can't just buy a few hard drives out of pocket, migrate the array from one setup to the next, and save yourself a few weekends of grief and anxiety. It might take 24 hours or longer to siphon the data off the spinning rust (sounds like you need a hot standby for your photographers in addition to backups), but it's not impossible, and it's been done before by many others.


First, thanks for all of the information.

Which Linux Distros were used for these?

I’ve looked into ZFS on Linux and it appears:

  • There is an ongoing conflict between OpenZFS developers and the Linux Kernel developers.
  • This prevents proper kernel integration of ZFS (like other native file systems have) and requires what is essentially a hack to work around the issue.
  • This may require (re)building ZFS every time there’s a kernel update, and this may be a major PITA if the root file system is on ZFS. Or even if it isn’t.
  • Except some distros, like Ubuntu, may have modified the kernel downstream to work with ZFS.

I suppose this may be why you suggested using an LTS release (and never doing a kernel update without expecting a full migration)?

Would I have fewer ZFS issues with CentOS 8 or Ubuntu Server 20.04 as the host machine? (In my experience, Ubuntu focuses more on desktop and sometimes botches server things like OpenLDAP. But Canonical supposedly has native ZFS support. I'm not sure of Red Hat's position on ZFS.)

Do you recommend installing the host/root filesystem on a classic partition like EXT4? And creating ZFS pools for non-OS file storage?

How does ZFS work with KVM? What does the virtual machine see? Would it be possible to get file-level snapshots within the VMs?

For example, if one of the VMs is a web server, and a developer accidentally deletes home.php, can you roll back that file without having to roll back the whole VM?

Because if you have a qcow file formatted as EXT4 for the guest machine, you don’t get any of the benefits of ZFS on your guests… which is where all the data is that we want incremental backups for.

I'll put headlines on this to make it more palatable; it's clear we have some XY problems here.

on ZFS with Linux and packaging

I’ve used ZFS on FreeBSD, Ubuntu, Gentoo, Arch, Debian … I’ve been changing distros and keeping the filesystem. I’m currently on Debian testing (kind of like a rolling release Debian if you can imagine it) and Arch, both with LTS kernel (currently 5.4).

I haven't really tried CentOS recently (old, strange, heavily patched versions of everything), or Ubuntu Server (started feeling too patched up, with everything microforked, and too opinionated for my tastes).

I prefer the rolling release model where I control when to upgrade what and how much effort to put in, rather than having someone else decide that I can't apply updates when I do have time and must go through a major upgrade when I don't, while holding back features on me - no thank you.

One can always download and build ZFS by hand to fit the kernel version they're using. But DKMS is a system of hooks for your package manager that does that for you, i.e. it builds the spl and zfs kernel modules from sources (or partially built sources). OpenZFS devs keep track of next and rc kernel releases, and in my experience they're doing a better job than most driver/hardware maintainers who maintain out-of-tree code. It's a big community of actual devs, so the releases usually track each other closely enough that you'll have OpenZFS sources published the same day as the kernel. However, they're still separate releases, they usually live in separate repos downstream, and sometimes, if you're using prepackaged/precompiled ZFS, your package manager might barf for a day or so until the distro maintainers get their act together.

Both Debian and Arch (which are what I'm using with ZFS at home today) are hugely popular. For Arch there's the archzfs package repo where you get binary ZFS packages (no need for DKMS). I use DKMS on Debian (no special repo; I don't know if the OpenZFS folks are doing binary packages for Debian now). I prefer not compiling my own software if the distro has a release, mostly out of principle – if distros/maintainers can't figure out and deal with obscure configure flags and compiler errors for you, then what's the point of using distro packages.
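
For reference, the install is roughly this (assuming Debian with the contrib section enabled, and Arch with the archzfs repo added; double-check the package names against your release):

    # Debian: DKMS rebuilds the zfs module automatically on kernel upgrades
    sudo apt install linux-headers-amd64 zfs-dkms zfsutils-linux

    # Arch: prebuilt modules from the archzfs repo instead, matched to the LTS kernel
    sudo pacman -S zfs-linux-lts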

on disk layouts

You could store the OS separately from your data, or you could have the bootloader load kernels straight from the pool, but this is what I have:

All my disks are partitioned using GPT to have:

  • a 256M FAT32 EFI partition
  • a separate 2GB /boot partition as ext4 for the kernel(s) and images.
  • empty space until the 10G offset (felt like a good idea after that one time grub2 wouldn't fit in my superblock all those years ago; 10G == 2 cents these days, so worth it).
  • rest of the space as either zfs or btrfs. I'm doing this strange thing where I have a specific dataset / subvolume for "root" in both cases, with just the OS stuff underneath it - no VMs, none of my own data. That allows me to snapshot the OS independently from most of the data on the system, and it allows me to change distros more easily - although that's always been a pain.

The first partitions (EFI and boot) are mdraid mirrored across all drives (yes, I have 6 copies of busybox and kernels on the Debian box, oh well). That means the motherboard can pick any drive on startup. The motherboard starts the bootloader, which loads and starts the kernel and its ZFS-capable initramfs from ext4, which then mounts zfs (or btrfs, same setup) and pivots root to continue the rest of the boot. Grub is supposed to have zfs support - in theory you can drop it into your EFI partition and read the kernel/initramfs from zfs - I just don't want to rely on it being able to boot from a broken pool, and I do want to be able to boot from a broken pool somehow.
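
If it helps to see it concretely, the per-drive layout plus the /boot mirror would look something like this (device names, labels and type codes are illustrative, not a recipe):

    # partition one drive (repeat for each of the six)
    sgdisk -n1:1M:+256M -t1:EF00 -c1:efi  /dev/sda   # EFI system partition
    sgdisk -n2:0:+2G    -t2:FD00 -c2:boot /dev/sda   # /boot, ext4, member of an md mirror
    sgdisk -n3:10G:0    -t3:BF00 -c3:tank /dev/sda   # pool partition, starting at the 10G offset

    # 6-way mirror for /boot; metadata 1.0 keeps the md superblock at the end of the partition
    # (the EFI partitions can be mirrored the same way, so firmware still sees plain FAT)
    mdadm --create /dev/md0 --level=1 --raid-devices=6 --metadata=1.0 /dev/sd[a-f]2
    mkfs.ext4 /dev/md0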

on VM stuff.

KVM doesn't care about block devices; the mechanism for that is called virtio (the one that makes sense using, anyway). Your VM's OS will see a PCI device with a descriptor telling it to use the virtio block driver. That driver will issue a read to the host kernel's virtio stack, which has a driver/kernel module intercept it for performance and redirect the read to either a block device on the host (e.g. a zvol), to some region of some file, or to a user space process (e.g. qcow2).

The host sees the underlying block device/zvol or a qcow2 file, not individual guest files.
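
For example, handing a guest a ZFS-backed virtio disk could look like this (the pool, dataset and libvirt domain names are made up):

    # sparse 50G zvol for the guest; snapshots of it work like snapshots of any dataset
    zfs create -p -s -V 50G tank/vm/webserver
    # attach it to the libvirt domain "webserver" as a virtio disk
    virsh attach-disk webserver /dev/zvol/tank/vm/webserver vdb --targetbus virtio --persistent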

Traditionally, if you wanted more than one machine to use the same files, you'd use some kind of network filesystem between the guest and the host.

If your guest is Linux, there's a virtio-fs filesystem driver that's supposed to work well, but I haven't been able to get it working nicely yet (I'm trying to get it running with user namespaces and with shell script only - without writing Python or C or reaching carefully into syscalls). There's also virtio-9p, which is much older but poorly maintained (e.g. ftruncate doesn't exist in it, except for that mailing list patch from a few years ago; basically it's trash and not maintained all that well. I've booted Alpine Linux on it, but it's worrisome).

You can expose old snapshots of a zvol to your home.php developer somehow (attach an additional read-only block device to their VM and have them mount it). Or, if a developer is storing things on NFS that's served by the host, you can give them access to snapshots in a subdirectory of their NFS share over that same NFS mount, or make another share with just the snapshots.
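
The zvol variant could be sketched like this (again, all the names are hypothetical):

    # snapshot the guest's zvol, clone the snapshot read-only, and attach the clone to the VM
    zfs snapshot tank/vm/webserver@before-deploy
    zfs clone -o readonly=on tank/vm/webserver@before-deploy tank/vm/webserver-rollback
    virsh attach-disk webserver /dev/zvol/tank/vm/webserver-rollback vdc \
      --targetbus virtio --mode readonly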

In general, you don't want to mount guest block devices on the host. It's possible, but it's like taking a hard drive from a machine you've decided not to trust and mounting it on another machine that you do care about - technically it'll work in a pinch, but it gives me pause.

on “oh-noes I deleted home.php :o” problem

This accidentally-deleting-home.php scenario gave me pause as well. I'm not entirely sure how the developer team you're working with is used to doing things, but this "I'll give people VMs, have them admin them as root, write random files to random places and be responsible for admin-ing a Linux system" is a fairly old way of thinking about deploying web apps - it's not really suitable for a group setting where you have multiple people working on a thing together.
Most web developers these days who I know, would already have some kind of version control and would send each other pull requests, and some kind of build system (read: Makefile) that compiles their javascript and CSS and also starts a local docker container on their workstation with a bunch of mounts towards the developers home dir for experimentation and quick iteration on the code.
Once the change is checked into a central git repo, automation (read: a post-commit hook, but sometimes it'd be a person) would build and push new docker images. These could then be deployed to k8s, or docker pull-ed by anyone in the group who's interested. Ideally, there'd be some automated testing and labeling of things in between, and some kind of gradual deployment where instrumentation/metrics are observed while there are multiple versions running at the same time… but sadly this doesn't always happen.

In other words, there'd be no (w)holistic backups of machines (except, obviously, when this is offered as a service to others in the cloud because they think they need it for whatever reason and want to pay for it, or because having the feature ticks a box in some consultant's magic-quadrant analysis of offerings). It's just not all that useful when you have an organization with multiple developers… not sure why.

Databases like MySQL or redis or cockroach, and filesystem-ish things like ceph, samba shares and various s3-like storage services, would be where most of the persistent data lives. These would be managed in their own special ways that are typically independent of the host kernel filesystem, and in fact independent of the hosts themselves. So machines would go up/down, get rebooted or sometimes not come back; containers would get restarted and restored from the most recent backups or builds all the time; and once in a while we'd get a new version of software deployed. Usually GDPR compliance - ensuring data is really deleted and gone, on demand, from all backups and logs - is the bigger problem.

Now, if you have <5 websites / VMs total, and some "part-time and cowboy and very set in their ways" developers maintaining them, and you're rebuilding this infrastructure and their workflows over time, and just need snapshot-ability as a transient mechanism for this oh-shit moment until they can wrap their minds around git and using e.g. Makefiles and scripts to build/deploy code… then great! Being able to snapshot a ZVOL is actually not a bad place to be in. In their VM they can have more ZFS if they want :) or LVM or rsync/rclone/duplicity or whichever.


If:

  1. The goal is to have file level revision backups (ie. what did this file look like last week), and
  2. All of the files are on virtual machines (i.e. an NFS server for "officey" files (spreadsheets, etc.), a web server for web files, etc.), and
  3. ZFS does not play nice with KVM (you can ZFS the host, but can’t snapshot individual files on the guests)

Then the biggest advantage of ZFS, namely snapshotting, is lost and there’s no point in even dealing with ZFS instead of RAID 10+LVM.

In other words, ZFS snapshots only accomplish our goal if they can be used on the guests, and I have found nothing that suggests that ZFS guests are ever a good idea.

That’s not to say that ZFS can’t be part of a complete backup system, but I’m not seeing any advantage of using it as the underlying file system for a VM host machine.


What I’d really like is something like rsync that pulls files incrementally off each VM and stores them to a central file tree on the backup machine under dated folders.

Where, browsing each folder tree shows the full list of files at that point in time.

But also: the files are compressed yet can be browsed without uncompressing them, and the backups look like "full backups" when browsed, but don't use extra disk space for files that haven't changed.

So maybe there’s a solution that combines ZFS with rsync on the backup machine, where rsync (run from a script) pulls everything to a master “sync” tree, then the script tells ZFS to snapshot the tree. Then the snapshot is mounted and served read-only over NFS. And maybe ZFS also dynamically compresses the data too.
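
That combination is doable with stock tools. A rough sketch of the backup-machine side, with made-up pool, dataset and host names:

    #!/bin/sh
    # pull everything into the master "sync" tree, then freeze it as a dated ZFS snapshot
    set -e
    zfs set compression=lz4 tank/backups/sync          # transparent compression, one-time setup
    rsync -aHAX --delete root@vmhost:/srv/ /tank/backups/sync/
    zfs snapshot tank/backups/sync@"$(date +%F)"
    # each snapshot shows up as a browsable, read-only "full" tree under .zfs/snapshot/<date>,
    # and can be exported read-only over NFS along with the dataset
    zfs set snapdir=visible tank/backups/sync
    zfs set sharenfs=ro tank/backups/sync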


Things like raw footage use the most space, but never change (other than adding new files/folders). Databases and VM images will be backed up using special tools to ensure a consistent state, but all versions live in the same folder, so there's no need for version control at the file system level. DB files will be gzipped. It might make a lot of sense to isolate the different types of data and have different strategies for each.

A bit of a noob-ish and maybe not-so-reliable answer, but have you taken a look at Timeshift? If you can use sshfs to mount the remote folder as local, then I think it does everything you ask for. Of course this has some issues, like needing passwordless ssh (it could be a dedicated user though), and in the case of something like ransomware all the backups could be lost, because you would have write permissions.

For large collections of (read: not that many, less than 50 million) immutable files, rsync with --link-dest (which will simply hard-link unchanged files from the previous backup, and only transfer new/changed files) is definitely the way to go.
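
A minimal sketch of dated, hard-linked backups with --link-dest (the host and paths are just examples):

    # each run produces a full-looking dated tree; unchanged files are hard links into the previous one
    today=$(date +%F)
    prev=$(ls -1d /backups/nas/20* 2>/dev/null | tail -n 1)
    rsync -a --delete ${prev:+--link-dest="$prev"} \
      root@nas:/srv/footage/ "/backups/nas/$today/"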

If you understand the use case in more detail, it makes no sense to use filesystem snapshots for long term generational backups.

This approach does create one copy on the guest, and one copy of data for all different versions of the file tree on the host.

You can also play with overlayfs down the line to save space and dedup old files. E.g. you'd only keep recently changed files duplicated in the upper directory on the guest; the lower directory could come from the NFS mount from the host.
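
Roughly like this on the guest, with made-up paths (the upper and work directories have to live on the same local filesystem):

    # the old tree arrives read-only over NFS as the lower layer; recent changes stay local in the upper layer
    mount -t overlay overlay \
      -o lowerdir=/mnt/backup-from-host,upperdir=/srv/www-upper,workdir=/srv/www-work \
      /srv/www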

rsync will work over ssh, but there's also a higher-performance vsock ("vhost sock") transport you can use to bypass the networking stack and ssh encryption/auth, if you want to save some CPU or make it faster down the line, should ssh performance ever become an issue for you. Simple socat with exec:… on the host and RSYNC_CONNECT_PROG on the guest works like a charm for this.

A minor nit: it may make sense to let rsync read from a temporary/ephemeral filesystem snapshot on the guest. Half-written files will remain half written, but the order in which rsync visits different files relative to how your users/apps end up editing them doesn't matter anymore - i.e. with snapshots on the source (even if ephemeral) you get consistency on the order of "the guest crashed, and this is where it was at". You can totally do this with LVM, the following backup will copy the fully written file, and you can use rsync filters to exclude temporary files matching a certain naming pattern if you know what that pattern is.
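
Inside the guest, that could look something like this (the VG/LV names and the exclude pattern are placeholders):

    # ephemeral snapshot, back up from it, then throw it away
    lvcreate --snapshot --size 5G --name data_bak vg0/data
    mkdir -p /mnt/bak && mount -o ro /dev/vg0/data_bak /mnt/bak
    rsync -a --exclude='*.tmp' /mnt/bak/ backuphost:/backups/guest1/
    umount /mnt/bak
    lvremove -f vg0/data_bak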


ZFS on the host gives you checksumming, detection of and recovery from bitrot that’s integrated with your raid (by far not the only way you can get this, it’s just easy to run scrub). It also gives you that for ZVOLs.

RAID10 + LVM won't help you recover if your drives start lying to you (worst case scenario: this happens in filesystem metadata, and a directory that read fine a month ago all of a sudden starts causing kernel panics, or you get I/O errors). ZFS can detect this and correct it from the remaining raid data.

In the world of usenet piracy, there's this well-known par2cmdline utility that generates parity data that can be used to check and recover missing data. You could use this utility on files on ext4 on top of a dumb/hardware raid, choose e.g. 1% parity overhead, and get similar protection for your file data - but without btrfs or zfs you don't get similar repairability for your filesystem metadata. There's also less metadata around, so the likelihood of bitrot in metadata is smaller.
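
If you go that route, usage is simple enough; a hedged example with ~1% redundancy (the directory and file names are placeholders):

    # generate ~1% parity for a directory of raw footage, then later verify or repair from it
    cd /data/footage
    par2 create -r1 footage.par2 *.mov
    par2 verify footage.par2
    par2 repair footage.par2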

A guest VM isn't supposed to care about physical storage flakiness and bitrot. The usual ext4 with data=ordered,barrier,discard should be perfectly fine there (data=ordered and barriers are defaults that make sense for databases; discard, a.k.a. trim support, can help you save space on the host when the backing storage is thin-provisioned/sparse).
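
In the guest's /etc/fstab that's simply something like (the device name is an example):

    # data=ordered and barriers are already ext4 defaults; discard lets the host reclaim trimmed blocks
    /dev/vda1  /  ext4  defaults,discard  0 1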


Rsnapshot and BackupPC are the go-to solutions for disk-based incremental backups, and they work regardless of file system as long as it supports hard links. If you want to go the tape path, there are Bacula and Amanda.

Seems like an interesting option that could be used as part of the overall backup plan, i.e. have workstations run Timeshift to back up settings/config to an NFS share when doing software updates, and periodically. Looks like it can also clone a desktop setup, so it might be useful when changing distro.


Related question for anyone, seeing as I have a box with 6 hard drives, gigabit Ethernet, and no data on it right now: what is a good, simple tool for benchmarking network and drive read/write speed together? I see hard drive tools and network tools, but I'd like to try some things, like setting up a ZFS pool with six drives, served over NFS, and seeing how fast I can copy files to and from it over the network (ideally using RAM on the client so the client's disks don't become a factor).

Is it a good test to copy (dd) /dev/urandom from the client to a file on the server and check the transfer speed? It only shows the average when you’re done though.

That would simulate copying a large file or streaming a video (editing scenario).

What’s a good test for read/write speed on NFS for many small files?

Try https://www.phoronix-test-suite.com
No affiliation, never tested it myself (but others have!)

Interesting. It’s built in PHP. I was thinking of writing something in PHP to manage backups since it’s a language I already know. Looks like it’s in the repos too:
sudo apt install phoronix-test-suite

Does anyone know how to do a network test with Phoronix? I got it installed and tried running test #101, "Flexible I/O Tester", but I'm not sure what it's actually testing, as it didn't ask me which drive to test. It also seems to be giving results in seconds, not Mbps.

I’d usually use dd and fio (Derp Proof FIO)
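
For running these by hand against the NFS mount, something along these lines works (the mount point and sizes are placeholders):

    # sequential write, bypassing the client page cache, with live progress instead of just an average
    dd if=/dev/zero of=/mnt/nfs/testfile bs=1M count=10240 oflag=direct status=progress

    # small-file-ish random writes with fio
    fio --name=nfs-randwrite --directory=/mnt/nfs --rw=randwrite --bs=4k --size=1G \
        --numjobs=4 --iodepth=16 --ioengine=libaio --direct=1 --group_reporting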

There’s also iozone and sysbench, that I know of.

what if you used git for backups…?

No good at binary diffs.

If it’s all text files, it’d do just fine tho.

dd if=/dev/sda | openssl base64 -out /save/file.txt

are you happy now?

You’re an evil, evil man.


rdiff-backup gets to the incremental backup part, but I’m unsure about comparing a month-old backup to current.

What is a reliable way to move lvm disk images over a public network? dd wouldn’t have fault tolerance.

Is this actually good practice? dd a volume, base 64 encoded as text, then use differential backup to “commit” the changes? Do we need a checksum?!

It was a joke!!!