Proxmox -- Backup Solutions?

I’m doing some research on backup and restore strategies with Proxmox and I have strong opinions.

But I thought I would ask –

Are you using Proxmox or XCP-ng for production workloads? If so, what are you doing for backups?

Did you try Proxmox Backup Server? Are you using it? If not, why not?

If you had a storage appliance that ran Proxmox Backup Server as a VM and exposed the storage appliance’s local storage, would you use that? (This is a bit of a tangent I’m working on for a video, because it enables a cool disaster recovery option.)

5 Likes

Backups? What are those?

Just kidding. I use the native Proxmox backup feature to back up VMs to external storage. It is the simplest approach in my experience, even though it does tend to eat through disk space depending on how you set it up. My personal preference is 1-2 full backups on external storage, with snapshots on the backup device.
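For reference, a one-liner along these lines is all it takes (the VMID, storage name, and retention are placeholders for my actual setup):

```bash
# Snapshot-mode backup of VM 100 to an external storage named "usb-backup",
# compressed with zstd, keeping only the last two full backups around.
vzdump 100 --storage usb-backup --mode snapshot --compress zstd \
    --prune-backups keep-last=2
```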

With that being said, I don’t really care much about backups on a VM level. My stuff is all containerized on the guests, so I just back up the storage for the containers and the relevant configuration files. If something goes wrong or I accidentally delete a VM (done that), I can just rebuild the VM and start over. This process is easy to automate with Ansible, although I haven’t gotten that far yet. Network Chuck has a video about using Ansible to create a VM from a template and then setting it up automatically. (You can use Ansible to get the IP from Proxmox and then run playbooks against the newly created VM in an automated fashion.)
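Even without Ansible, the rebuild step is only a couple of commands against Proxmox’s own CLI (the template and VM IDs here are placeholders); an Ansible playbook would just wrap the same API:

```bash
# Full-clone a fresh VM from template 9000, then boot it.
qm clone 9000 123 --name rebuilt-vm --full
qm start 123
```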

4 Likes

My main cluster is a Proxmox-based cluster with ZFS and Ceph for guests and ZFS+Samba for file storage. I had a look at Proxmox Backup Server when it first came out, but it seemed a bit incomplete at the time. Their block-based approach for backups seems pretty useful, but because I’m pretty heavy on ZFS I never gave it more than a look. Plus, my main backup target is a physical machine I like to turn off, and at the time turning PBS off caused some configuration troubles (I don’t know if that’s still the case).

My current solution is:

  • Some guests live on Ceph; they back up to a ZFS target in the cluster.
  • The ZFS file storage is sent via syncoid (or manually) to a ZFS target on a different server; this then includes the backups of the Ceph-based guests and my other file storage (see the sketch below).
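The syncoid leg is a single command per dataset, roughly like this (pool, dataset, and host names are placeholders):

```bash
# Recursively replicate the file-storage dataset (and its snapshots)
# to the backup machine over SSH.
syncoid --recursive tank/files root@backuphost:backup/files
```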

One kinda important consideration for me is that I know ZFS, so in case of disaster I know I can get to my files without needing a new Proxmox cluster or anything like that.

If you had a storage appliance that ran Proxmox Backup Server as a VM and exposed the storage appliance’s local storage, would you use that?

I considered setting that up myself, but part of what is important to me for the last leg of my backups is that they are minimal fuss, and adding more layers of abstraction doesn’t always fit that bill.

1 Like

I use XCP-ng and back up to a TrueNAS Core/Scale box with an NFS share.

I use PBS in my lab and like it a lot. Not production, though.

Why Samba? Not that it is bad or anything, but it isn’t Unix-focused. Wouldn’t it be better to use NFS or even iSCSI?

I personally am looking into using FreeIPA to do device-level authorization with NFS. The idea is that I can grant NFS access to only the stuff that needs it.
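With FreeIPA handing out the host principals, I imagine the export side would look something like this (the path, hostname, and realm setup are hypothetical):

```bash
# Export the backup share to a single enrolled host only, requiring
# Kerberos authentication, integrity, and encryption (krb5p).
cat >> /etc/exports <<'EOF'
/srv/backups  client1.example.com(rw,sec=krb5p)
EOF
exportfs -ra   # re-export with the new rule
```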

Did you try Proxmox Backup Server? Are you using it? If not, why not?

Yes, and I refuse to use Proxmox Backup because it’s a “push-only” system: you have to give all of your servers ACL access to your backup server!

In CyberSec, we do not allow such blasphemy! When ransomware hits, your backup servers would most likely get hit as well.

Instead, your backup server should be isolated with zero access from any of your routers and servers.

You could either stage your files onto centralized storage and then grant the backup server ACL rights to that one server to “pull” from. If that one server gets compromised, only the most recent backups are compromised.

The other option is to grant ACL rights from your backup server to all of your servers to “reach into and pull” the backups. I prefer this “centralized” approach in my Proxmox setup, as I can log into one centralized backup server and see everything I am pulling, listed in one spot. Usually that means running a bunch of rsync and ZFS clone (pull) jobs on the backup server.
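Roughly what one of those pull jobs looks like on the backup server (hostnames, paths, and pool names are placeholders):

```bash
# Pull the data from a production host into a local dataset, then take
# a snapshot so a compromised source can't rewrite older backups.
rsync -aHAX --delete root@prod1:/srv/data/ /backup/prod1/data/
zfs snapshot backup/prod1@$(date +%F)
```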

Technically, I have another virtualized layer on top of all of this: I actually run three TrueNAS instances on two Proxmox nodes (KISS):

  • TrueNAS-A for ZFS-only management of the data: disks, pools, recovery, and shares. This is considered the “source of truth” for all data, and there are even a few iSCSI exports on isolated networks there.
  • TrueNAS-B for all of my apps, docker, jobs, plex, and other funsies. Purely transient: all data is mounted from TrueNAS-A.
  • TrueNAS-Backup running on the backup server, that has ACL rights to TrueNAS A and B to “pull” data from (as well as the router, switches, etc).

The underlying Proxmox nodes are all dumb, just running the TrueNAS and virtualized router VM setups. I try not to grant access to the Proxmox nodes from the TrueNAS virtualized layers in case something gets compromised at either end.

1 Like

I run Proxmox at home on one of those Erying AliExpress motherboards. In terms of backups, my strategy is this:

All of my containers/VMs are either not a big deal to lose, or take minutes to reconstruct. For storage of data, they all use a pair of 4tb HDDS that they have bindmounts for. The data is split into two piles that I call Burnable and Backed Up. I then have a container whose whole job is to run Restic once a night, backing up all the data in the Backed Up pool to Backblaze B2.
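The nightly job boils down to something like this (bucket name, paths, and credentials are placeholders, and the repository is assumed to already exist via restic init):

```bash
# B2 credentials and repository password for restic.
export B2_ACCOUNT_ID="..." B2_ACCOUNT_KEY="..." RESTIC_PASSWORD="..."
# Back up the "Backed Up" pile to the B2-hosted repository.
restic -r b2:my-bucket:backups backup /mnt/backedup
```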

If I lose the SSD that Proxmox runs on, it’ll be a fun project getting that back up and running with a new SSD, and recreating the containers and VMs will be laborious but not complicated. Even networking is relatively simple, since every service I run that needs to be accessible from outside is exposed via Tailscale sidecar containers, so in theory I can just get the containers running again and they’ll appear at the same Tailscale domains they were on before. If I lose the hard drives, oh well; the burnable data is not stuff I’d be that upset to lose anyway.

I’m not sure if this qualifies as KISS, though!

2 Likes

I use Proxmox to host some Windows 11 desktops and Server 2022. I have been using Acronis for backups, but I am moving to PBS. I have the on-site part working and am setting up the off-site part now. I like the idea of pulling the data from the off-site end. PBS does seem a little like a tea kettle: it takes a lot longer if you watch it.

We use Proxmox Backup in production and restores have been very reliable. Features such as live restore, file-level restores, etc. all require the hypervisor to have access to the Proxmox Backup system. So, from this standpoint, it is a feature, not a limitation - providing very fast business continuity when a VM needs to be recovered quickly (a minute to boot from a point-in-time image). I agree that it opens a hole from a cyber security standpoint, but continue reading…

The replication process is pull-only, so we have secondary Proxmox Backup systems that hide behind a NAT from the production Proxmox Backup systems. This allows us to restore to secondary clusters that are used for disaster recovery - similar to your TrueNAS-Backup configuration - where if a compromise occurs, the DR configuration couldn’t have been poisoned, deleted, etc.

The DR configuration also has a different/larger retention, so deletions in the production Proxmox Backup system via compromise can’t affect/delete data from the DR configuration. Plus, read-only permissions are granted to the DR system so if it was compromised somehow, it couldn’t change data on the production system.

The DR Proxmox Backup is, or can be, replicated off-site depending on the clients’ needs - replicating from the DR images, not production, in the same pull-only approach using a read-only permission.
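For anyone wanting to replicate this, the pull side is set up on the DR box with PBS’s built-in remotes and sync jobs; something along these lines (the names, host, and schedule are placeholders):

```bash
# Register the production PBS as a remote on the DR system...
proxmox-backup-manager remote create prod-pbs \
    --host 10.0.0.5 --auth-id sync@pbs --password 'secret'
# ...then pull its datastore into the local (larger-retention) store.
proxmox-backup-manager sync-job create pull-prod \
    --store dr-store --remote prod-pbs --remote-store prod-store \
    --schedule daily
```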

3 Likes

Because the fileserver part of my cluster has Windows clients too :)

So then it’s a bit simpler to keep it only Samba, without iSCSI or NFS. (Last time I tried iSCSI with ZFS on Proxmox it didn’t work too well either.)

We use Acronis Cyber Protect agents inside our Windows and Linux VMs to back them up to NAS systems and the Acronis Cloud. We would love it if Acronis supported VM backup on Proxmox; I already told them that.

We additionally use Windows Server Backup to the same NAS system, if there is enough space remaining, because it’s free and works quite well.

In my homelab I just use the backup function in Proxmox once a week. My file shares get backed up by an rsync cronjob on my fileserver VM; that’s the only data I back up daily.
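The cronjob is nothing fancy; something like this daily entry (paths and the NAS hostname are placeholders):

```bash
# crontab entry: mirror the shares to the NAS every night at 02:00.
0 2 * * * rsync -a --delete /srv/shares/ backup-nas:/backups/shares/
```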

A colleague of mine uses PBS with a NAS at home and is really pleased with it. He has recovered systems multiple times using it.

This message isn’t necessarily aimed at Wendell; it’s more me ranting (and some people might find value in this).

I haven’t used PBS, so I can’t comment on that. It’s been a while since I managed a larger infrastructure, but basically I had two tiers of backups: OS + data, and separate DB backups.

We were using BackupPC 2 (because of its better rsync support compared to 3) to back up Linux and Windows servers and laptops (all were pull). This worked for bare-metal servers too (for obvious reasons). The DB backups were just NFS exports mounted on the DB servers, where we were doing RMAN offline backups and pg_dumps, all scheduled from a Rundeck server (so in a sense it was “pull” to send the commands, but “push” when it came to pushing the data, as the servers had write access to the backup location). If I were to redesign it, I’d make it an auto-mount in the script and take snapshots on the NAS side of the “cold” data (i.e. after there’s nothing else moving on the share and the volume isn’t mounted anymore, after the backup completes).
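The redesigned DB job would be a short script along these lines (the mount point and database name are placeholders):

```bash
# Mount the NFS backup share only for the duration of the dump, so the
# NAS can snapshot the share while it's "cold" (unmounted, nothing moving).
mount /mnt/dbbackup
pg_dump -Fc mydb > /mnt/dbbackup/mydb-$(date +%F).dump
umount /mnt/dbbackup
```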

I’m highly opinionated against VM-level backups. At best, I’d grab the Proxmox /etc/pve folder and call it a day, because that should contain the VMs’ configuration files (like name, backend disk, RAM and CPU sizing, etc.). I suppose xcp-ng has something similar (unless it’s in a database format - just dump the DB?).
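Grabbing that folder is a one-liner (the destination path is arbitrary):

```bash
# Archive the Proxmox cluster filesystem: VM/CT .conf files, storage
# definitions, firewall rules, etc.
tar czf /root/pve-config-$(date +%F).tar.gz -C / etc/pve
```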

The disadvantage of this simple approach is that you’ll have egregious restore times (RTO). Imagine having to recreate 100 VMs one by one by slapping each rootfs back onto a formatted disk and reinstalling the bootloader. It can be automated, but it has to be planned way in advance. Given our small size, we just assumed that if the whole datacenter burned to the ground, we’d take about a day to restore critical services and anywhere between three days and a week to get everything else up and running in AWS.

But this backup strategy worked well back then, because what used to happen was either someone broke Oracle, or they messed with WildFly or Tomcat configs and broke them. Recovering configs with BackupPC was only a 5-minute operation.

In today’s ransomware-infested world, one ought not to expect that level of complacency to fly. A single server getting infected with a worm through a vulnerability (like, I don’t know, an SSH vulnerability caused by a certain compression library and distro-specific modifications, wink wink nudge nudge) could mean that all your other servers can be infected the same way, then get hit by ransomware all at once.

You make good points, but it’s not so black-and-white. I was about to comment on doing push on primary backups and pull on DR replication, but someone else got ahead of me with a great example.

Today, if I were to architect a backup design, I would definitely focus on recoverability and restore testing. I don’t know of any backup solution that does that, so I’d probably need a custom solution (which I’d happily open source).

I still wouldn’t do hypervisor-level backups (snapshots) unless I had good automation around them, e.g. if I am to take a snapshot, I’d have a command-and-control plane that first runs some preliminaries (like taking a conf dump and a DB dump), then connects to the hypervisor (or NAS) to take the snapshot, then remotes back into the server to clean up, then copies the snapshot or its contents somewhere else (roughly the sequence sketched below). But even this is very inefficient. If I can just back up the data I need (like a GitLab backup archive and a DB dump), then I shouldn’t need to back up the whole OS for this; it makes no sense.
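Purely hypothetical, but run from the backup box, the control-plane sequence for one guest could be as simple as this (host names, the dataset, and the in-guest commands are all placeholders):

```bash
# 1. Preliminaries inside the guest: application dump + DB dump.
ssh guest1 'gitlab-backup create && pg_dump -Fc app > /var/backups/app.dump'
# 2. Point-in-time snapshot at the hypervisor/NAS layer.
ssh hyp1 'zfs snapshot tank/vm-101-disk-0@pre-copy'
# 3. Copy the snapshot off to the machine running this script.
ssh hyp1 'zfs send tank/vm-101-disk-0@pre-copy' | zfs recv backup/vm-101
# 4. Clean up inside the guest.
ssh guest1 'rm /var/backups/app.dump'
```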

For k8s workloads, I’d back up the PVCs and the pod configuration (i.e. how I launched/built it). Similar for other containers, like Incus.
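The configuration half of that is trivially scriptable; e.g. a hypothetical per-namespace export (the namespace and resource list are placeholders), with the PVC contents handled separately by a volume-level backup:

```bash
# Dump the object manifests for one namespace so the workload can be
# re-created; this does not capture the PVC data itself.
kubectl -n myapp get deploy,sts,svc,cm,pvc -o yaml > myapp-manifests.yaml
```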

I don’t know how I’d handle Windows, but I’d probably do something similar (I used to enable sshd on Windows, so it should be possible to run some PowerShell commands as preliminaries).

My home setup is simpler, and I’m not losing money if my stuff isn’t running (technically I lose money if my stuff is running, since power isn’t free, but obviously my own happiness is more valuable to me than the money I spend on running things). So I just run restic manually when I find it necessary (basically only when I update my KeePass DB or add new data that I find important). If I need to recreate any of my VMs or services, I just do it manually and import the data from backups.

But to hammer the point home: if you’re losing money by not having your services available, you need both high availability (maybe even multi-region availability, if you can afford it) and a very good recovery plan with as low an RTO as possible (ideally something where you can press a button to recover your whole infrastructure to a target hypervisor or cluster - if you know of something like that, let me in on it, I’d like to know).

Aside from ZFS replication (data volumes only), restic with an NFS or S3 backend, and maybe Bacula, I don’t know of anything I’d particularly recommend.

1 Like

I’ve really been trying to take a cattle-vs.-pets approach with my systems over the past 8 or so years, even for stateful workloads, so solutions like PBS feel a bit like “yesterday’s BDR model” for my workloads. I’m currently using Proxmox in a couple of hosting environments, but I’m working hard on an Incus spike right now, because there’s some really exciting stuff going on in that neck of the woods. All that said, because even my stateful infrastructure is relatively easily recreated, backups with PBS aren’t super attractive: my RTOs for a full site failure aren’t that demanding, and restore testing outside of the largest data sets can be achieved with the same deployment tooling I use to deploy and update systems in the first place. The backups themselves I manage with Borgmatic, but restore testing is still a bit manual; I haven’t gotten around to building something for that, though at the very least it’s good exercise for a full or partial restore, even on new hardware.

We use the backup feature in Proxmox plus Bareos. The VMs run on a ZFS storage pool of NVMe devices. We have local magnetic SATA disks that the Proxmox backups get written to on a daily basis, and the backups are rotated over different physical disks. Then we run weekly remote backups using Bareos that read the backups from magnetic disk and copy them to S3 storage.

At work we’re going with Rubrik; they seem to be the best. From what I can see, they’re working on adding Proxmox support; I’m not sure if it’s there yet. I don’t think they support XCP-ng.

This is one reason why we are going with Rubrik. When you restore with Rubrik, it presents itself as NFS storage to your VM environment, so you can instantly restore VMs and then later do a storage migration off of Rubrik to your VM environment.

First, I’d note that this is all home lab and not in “production” so to speak, but I have been regularly using PBS.

I have been using the PBS system for over a year now to back up some of the critical VMs on my cluster. My cluster is spread across five hypervisors on the US east coast, one physical hypervisor in Japan, and one cloud VM hypervisor in Japan. All of them are linked together in one cluster, and I have backups set to run from the five nodes in the US to the one physical node in Japan.

PBS is running as a VM on the physical hypervisor in Japan, and it has no special passthrough or anything set up for storage; it is just using normal virtualized storage. The main storage drive is using ZFS, so the overhead of ZFS-on-ZFS isn’t great, but my use case isn’t very demanding.

The backups I am doing are rather light, and they aren’t particularly fast, as the highest speed I can reasonably achieve with the 160-170 ms ping is about 4 MiB/s. That being said, I have bi-hourly backups configured for the redundant pfSense VMs I have, and a daily backup for a Windows DC, and they generally run fast and without any real issues. The backups run incrementally and only actually transfer about 2-4 GB of the 32 GB OS disk, which normally only takes a few minutes. For example, they both just ran now and finished in 4m 3s and 4m 10.5s respectively.

I have had a few failures in the VMs that have required me to recover both the Windows DC and one of the pfSense VMs at one point or another, and apart from not being able to successfully use a live recovery for the DC (160 ms drive reads unsurprisingly don’t work well), both of them worked flawlessly.

I also set up scheduled rules in PBS to do hourly verifications of all new backups and to re-verify every 30 days.
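If anyone wants the CLI equivalent, a verify job along these lines should match those rules (the datastore name is a placeholder):

```bash
# Verify new snapshots every hour, and re-check previously verified
# ones once they are older than 30 days.
proxmox-backup-manager verify-job create verify-new \
    --store store1 --schedule hourly \
    --ignore-verified true --outdated-after 30
```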

Overall, I’m very happy with it.

I’ve deployed Proxmox a bit professionally, and mostly rely on Proxmox snapshot replication, built-in Proxmox backups, Veeam agents, Borg backup, etc. However, now that Veeam has native Proxmox support, we’ll likely use that more in the future. I’ve been looking at PBS, but haven’t had the right situation to deploy it.

I find backup program choice is often limited by other requirements. Backing up to tape? Replicated for DR? Need file level restore? Immutable backup copies?

If you had a storage appliance that ran Proxmox Backup Server as a VM and exposed the storage appliance’s local storage, would you use that? (This is a bit of a tangent I’m working on for a video, because it enables a cool disaster recovery option.)

We recently tested a similar scenario to find out whether Proxmox can be deployed on our network as a replacement for the current VMware + Veeam Backup setup. We administrate several smaller sites with 2-3 servers and NAS storage. They are too small to benefit financially from a SAN infrastructure, hence all ESXi hosts run standalone.

Our test rig runs two Proxmox VE servers and Proxmox Backup in a VM which is replicated between the two. Backup storage is served by a NAS via iSCSI. In short, we concluded that this setup offers pretty much the same features we get from Veeam Backup, but with more cost-effective / less restrictive licensing behind it. It’s also well integrated into the Proxmox GUI.

The big upside of this approach is modularity. You can take several external NAS boxes and put them in different physical locations without needing an extra server that would otherwise sit idle most of the time. We even tested total disaster recovery, i.e. re-installing the whole cluster + backup VM. Re-attaching the previously made backups was as easy as adding a couple of lines to a config file (see below). After that, we were able to restore the entire cluster from backup.
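For the curious, the “couple of lines” is essentially the PBS storage stanza in /etc/pve/storage.cfg; something like this (server, datastore, and fingerprint are placeholders, and the access password is stored separately under /etc/pve/priv/):

```bash
cat >> /etc/pve/storage.cfg <<'EOF'
pbs: pbs-backup
        server 192.168.10.20
        datastore store1
        username backup@pbs
        fingerprint <server cert fingerprint here>
EOF
```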

The only drawback we found so far is the necessity to form a cluster if you want to control all hosts from a unified interface. That adds complexity, as we now have to deal with things like quorum, which is annoying and kind of unnecessary without a SAN. I guess this is much less of a concern in an actual datacenter environment, though.

Edit: Typos

1 Like

You can set up an external QDevice on a mini PC or a Raspberry Pi (see Cluster Manager - Proxmox VE). That way a two-node cluster is quite practical.
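Rough steps, per the Proxmox docs (the IP is a placeholder):

```bash
apt install corosync-qnetd      # on the external QDevice host (Pi/mini PC)
apt install corosync-qdevice    # on each of the two cluster nodes
# Then, from one cluster node, point the cluster at the QDevice:
pvecm qdevice setup 192.168.1.50
```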