This message isn’t necessarily aimed at Wendel; it’s more me ranting (and some people might find value in it).
I didn’t use PBS, so I can’t comment on that. It’s been a while since I managed a larger infrastructure, but basically I had two tiers of backups: OS + data, and separate DB backups.
We were using BackupPC 2 (because of its better rsync support compared to 3) to back up Linux and Windows servers and laptops (all pull-based). This worked for bare-metal servers too (for obvious reasons). The DB backups were just NFS exports mounted on the DB servers, where we ran RMAN offline backups and pg_dumps, all scheduled from a Rundeck server (so in a sense it was “pull” for sending the commands, but “push” when it came to moving the data, since the servers had write access to the backup location). If I were to redesign it, I’d make the mount automatic in the script and take snapshots on the NAS side of the “cold” data (i.e. after nothing else is moving on the share and the volume isn’t mounted anymore, once the backup completes).
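If I were sketching that redesign today, it’d look roughly like this - hostnames, paths and the dataset name are all made up, and the NAS-side snapshot assumes a ZFS-based NAS reachable over SSH:

```python
#!/usr/bin/env python3
"""Sketch of the redesigned DB backup job: auto-mount the NFS export,
dump the DB, unmount, then ask the NAS for a snapshot of the now-cold share."""
import subprocess
from datetime import date

MOUNTPOINT = "/mnt/db-backups"               # hypothetical mount point
NFS_EXPORT = "nas.example.lan:/backups/db"   # hypothetical NFS export
NAS_HOST   = "nas.example.lan"               # hypothetical (ZFS-based) NAS

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

run(["mount", "-t", "nfs", NFS_EXPORT, MOUNTPOINT])
try:
    # pg_dump in custom format; the Oracle equivalent would be an RMAN script here
    run(["pg_dump", "-Fc", "-f", f"{MOUNTPOINT}/appdb-{date.today()}.dump", "appdb"])
finally:
    run(["umount", MOUNTPOINT])

# Only after nothing is mounted or writing anymore, snapshot the share on the NAS side.
run(["ssh", NAS_HOST, "zfs", "snapshot", f"tank/backups/db@{date.today()}"])
```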
I’m highly opinionated against VM-level backups. At best, I’d grab the Proxmox /etc/pve folder and call it a day, because that should contain the VMs’ configuration files (name, backend disk, RAM and CPU sizing, etc.). I suppose XCP-ng has something similar (unless it’s in a database format - just dump the DB?).
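The /etc/pve part really can be as dumb as this (only the destination path is made up):

```python
#!/usr/bin/env python3
"""Archive Proxmox's /etc/pve (VM/cluster config) to a backup location.
The destination is hypothetical; /etc/pve is where the real VM config files live."""
import subprocess
from datetime import date

dest = f"/mnt/backups/pve-config-{date.today()}.tar.gz"   # hypothetical destination
subprocess.run(["tar", "-czf", dest, "-C", "/etc", "pve"], check=True)
```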
The disadvantage of this simple approach is that you’ll have egregious restore times (RTO). Imagine having to recreate 100 VMs one by one by slapping each rootfs back onto a formatted disk and reinstalling the bootloader. It can be automated, but it has to be planned well in advance. Given our small size, we just assumed that if the whole datacenter burned to the ground, we’d take about a day to restore critical services and anywhere between three days and a week to get everything else up and running in AWS.
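To make the “it can be automated” part concrete, the per-VM restore would be roughly this - treat it as a sketch only: the disk device, the restore source and the BIOS/ext4/GRUB/Debian-style assumptions are all mine, and a real restore also needs fstab/UUID and network fixups:

```python
#!/usr/bin/env python3
"""Per-VM "slap the rootfs back" restore, as a sketch only.
Disk device, restore source and the bootloader assumptions are hypothetical."""
import subprocess

DISK   = "/dev/sdb"                                    # hypothetical target disk
SOURCE = "backuppc@backup.lan:/restore/vm42/rootfs/"   # hypothetical restore source

def run(cmd):
    subprocess.run(cmd, check=True)

run(["parted", "-s", DISK, "mklabel", "msdos", "mkpart", "primary", "ext4", "1MiB", "100%"])
run(["mkfs.ext4", f"{DISK}1"])
run(["mkdir", "-p", "/mnt/restore"])
run(["mount", f"{DISK}1", "/mnt/restore"])
run(["rsync", "-aHAX", "--numeric-ids", SOURCE, "/mnt/restore/"])

# reinstall the bootloader from inside the restored rootfs (Debian-ish assumption)
for d in ("dev", "proc", "sys"):
    run(["mount", "--bind", f"/{d}", f"/mnt/restore/{d}"])
run(["chroot", "/mnt/restore", "grub-install", DISK])
run(["chroot", "/mnt/restore", "update-grub"])
```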
But this backup strategy worked well back then, because what usually happened was that either someone broke Oracle, or they messed with WildFly or Tomcat configs and broke those. Recovering configs with BackupPC was only a 5-minute operation.
In today’s ransomware-infested world, you can’t afford that level of complacency. A single server getting infected with a worm through a vulnerability (like, I don’t know, an SSH vulnerability caused by a certain compression library and distro-specific modifications, wink wink nudge nudge) could mean all your other servers can be infected the same way, and then you get a ransomware hit on every server at once.
You make good points, but it’s not so black and white. I was about to comment on doing push for primary backups and pull for DR replication, but someone else beat me to it with a great example.
Today, if I were to architect a backup design, I would definitely focus on recoverability and restore testing. I don’t know of any backup solution that does that out of the box, so I’d probably need a custom solution (which I’d happily open source).
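What I mean by restore testing is something dumb and automatic like this (the repository and the checked file are made up; a real test would restore into a scratch VM and run actual smoke tests against the service):

```python
#!/usr/bin/env python3
"""Minimal automated restore test: restore the latest restic snapshot into a
scratch directory and check that something I care about actually came back.
Repository and checked file are hypothetical; RESTIC_PASSWORD comes from the env."""
import os
import subprocess
import sys
import tempfile

REPO = "s3:s3.example.com/backups"     # hypothetical restic repository
CHECK_FILE = "etc/gitlab/gitlab.rb"    # hypothetical file that must exist after restore

with tempfile.TemporaryDirectory() as scratch:
    subprocess.run(["restic", "-r", REPO, "restore", "latest", "--target", scratch],
                   check=True)
    restored = os.path.join(scratch, CHECK_FILE)
    if not os.path.isfile(restored) or os.path.getsize(restored) == 0:
        sys.exit("restore test FAILED: expected file is missing or empty")
    print("restore test OK")
```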
I still wouldn’t do hypervisor-level backups (snapshots) unless I had good automation around them, e.g. if I’m going to take a snapshot, I’d have a command-and-control plane that first runs some preliminaries (like taking a config dump and a DB dump), then connects to the hypervisor (or NAS) to take the snapshot, then remotes back into the server to clean up, then copies the snapshot or its contents somewhere else. But even this is very inefficient. If I can just back up the data I need (like a GitLab backup archive and a DB dump), then I shouldn’t need to back up the whole OS for it; it makes no sense.
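Roughly, the flow I have in mind looks like this - everything here (hosts, VM id, Proxmox’s qm CLI as the example hypervisor, ZFS-backed VM storage and its dataset names) is an assumption for the sake of the sketch:

```python
#!/usr/bin/env python3
"""Command-and-control sketch: preliminaries on the guest, snapshot on the
hypervisor, cleanup on the guest, then ship the snapshot's data elsewhere.
Hosts, VM id, storage layout and dataset names are all assumptions."""
import subprocess
from datetime import date

GUEST      = "root@app01.lan"    # hypothetical VM
HYPERVISOR = "root@pve01.lan"    # hypothetical Proxmox node
VMID       = "101"               # hypothetical VM id
SNAP       = "backup_" + date.today().strftime("%Y%m%d")

def ssh(host, *cmd):
    subprocess.run(["ssh", host, *cmd], check=True)

# 1. preliminaries inside the guest: config dump + DB dump into a staging dir
ssh(GUEST, "tar", "-czf", "/var/backups/etc.tar.gz", "/etc")
ssh(GUEST, "pg_dump", "-Fc", "-f", "/var/backups/appdb.dump", "appdb")

# 2. take the snapshot on the hypervisor (qm is Proxmox's CLI; others differ)
ssh(HYPERVISOR, "qm", "snapshot", VMID, SNAP)

# 3. back into the guest to clean up the staging files
ssh(GUEST, "rm", "-f", "/var/backups/etc.tar.gz", "/var/backups/appdb.dump")

# 4. copy the snapshot's contents somewhere else; with ZFS-backed VM storage the
#    snapshot shows up on the zvol and can be sent to another box
ssh(HYPERVISOR, f"zfs send rpool/data/vm-{VMID}-disk-0@{SNAP} "
                f"| ssh backup.lan zfs recv -u tank/vms/vm-{VMID}-disk-0")
```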
For k8s workloads, I’d back up the PVCs and the pod configuration (i.e. how I launched / built it). Similar for other containers, like Incus.
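Bare-bones version of that idea (namespace, pod and mount path are made up; tools like Velero do this properly, this is just the shape of it):

```python
#!/usr/bin/env python3
"""k8s sketch: save the manifests needed to relaunch the workload, then stream the
PVC contents out of the running pod. Namespace, pod and mount path are hypothetical."""
import subprocess

NS, POD, MOUNT = "gitlab", "gitlab-0", "/var/opt/gitlab"   # all hypothetical

# the "how I launched it" part: workload + config + PVC manifests
with open("gitlab-manifests.yaml", "wb") as f:
    subprocess.run(["kubectl", "-n", NS, "get",
                    "statefulset,configmap,secret,pvc", "-o", "yaml"],
                   check=True, stdout=f)

# the data part: tar the PVC mount point straight out of the pod
with open("gitlab-data.tar.gz", "wb") as f:
    subprocess.run(["kubectl", "-n", NS, "exec", POD, "--", "tar", "-czf", "-", MOUNT],
                   check=True, stdout=f)
```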
IDK how I’d handle Windows, but I’d probably do something similar (I used to enable sshd on Windows, so it should be possible to run some PowerShell commands as preliminaries).
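Something like this, presumably (host, service name and paths are invented):

```python
#!/usr/bin/env python3
"""Windows sketch: run PowerShell preliminaries over SSH before the file-level
backup pulls the data. Host, service name and paths are invented."""
import subprocess

WIN = "Administrator@win01.lan"   # hypothetical Windows host with sshd enabled

def ps(command):
    subprocess.run(["ssh", WIN, "powershell", "-Command", command], check=True)

# quiesce the app and stage a config archive, then start it again
ps("Stop-Service -Name 'SomeAppService'")
ps("Compress-Archive -Path 'C:\\App\\config' -DestinationPath 'C:\\Backups\\config.zip' -Force")
ps("Start-Service -Name 'SomeAppService'")
```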
My home setup is simpler and I’m not losing money if my stuff isn’t running (technically I lose money when my stuff is running, since power isn’t free, but obviously my own happiness is worth more to me than the money I spend running these things). So I just run restic manually when I find it necessary (basically only when I update my KeePass DB or add new data that I find important). If I need to recreate any of my VMs or services, I just do it manually and import the data from backups.
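My “manual run” amounts to roughly this (repo and paths are made up):

```python
#!/usr/bin/env python3
"""Roughly what the manual restic run amounts to (repo and paths are made up;
RESTIC_PASSWORD is expected in the environment)."""
import subprocess

REPO  = "/mnt/nas/restic-repo"                        # hypothetical repo (s3:/sftp: work too)
PATHS = ["/home/me/keepass", "/home/me/documents"]    # hypothetical paths

subprocess.run(["restic", "-r", REPO, "backup", *PATHS], check=True)
subprocess.run(["restic", "-r", REPO, "forget", "--keep-last", "10", "--prune"], check=True)
```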
But to hammer this point home: if you’re losing money by not having your services available, you need both high availability (maybe even multi-region availability, if you can afford it) and a very good recovery plan with as low an RTO as possible (ideally something where you press a button and your whole infrastructure gets recovered onto a target hypervisor or cluster - if you know of something like that, let me in on it, I’d like to know).
Aside from some ZFS replication (data volumes only), restic with an NFS or S3 backend, and maybe Bacula, I don’t know of anything else I’d recommend in particular.
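For reference, the ZFS replication option for data volumes is basically just this (dataset and host names are made up, and it assumes the previous snapshot already exists on both sides):

```python
#!/usr/bin/env python3
"""ZFS replication sketch for data volumes: snapshot locally, incrementally send to
another box. Dataset and host names are hypothetical."""
import subprocess
from datetime import date

DATASET = "tank/data"                          # hypothetical dataset
REMOTE  = "backup@nas2.lan"                    # hypothetical replication target
SNAP    = f"{DATASET}@repl-{date.today()}"
PREV    = f"{DATASET}@repl-prev"               # hypothetical last replicated snapshot

subprocess.run(["zfs", "snapshot", SNAP], check=True)

# incremental send since the previous snapshot, received unmounted on the other side
send = subprocess.Popen(["zfs", "send", "-i", PREV, SNAP], stdout=subprocess.PIPE)
subprocess.run(["ssh", REMOTE, "zfs", "recv", "-u", "tank/replica/data"],
               stdin=send.stdout, check=True)
send.stdout.close()
if send.wait() != 0:
    raise SystemExit("zfs send failed")
```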