PROXMOX - The Good, the Bad and the Ugly

When Broadcom announced their price increases, I started taking a harder look at FOSS alternatives. The first alternative on most people’s lips was Proxmox, so that’s where I began. I’d played with Proxmox in the past, but not seriously. My VMUG subscription ends in August and re-upping didn’t sit right. A month ago, I went all-in and replaced the ESXi installations on both of my HPE Gen10 Microservers (64GB RAM, 4 x 1TB SATA SSD local storage, 1 core, 4 threads) with Proxmox v8.2.

THE GOOD

Installation was a breeze. The installer is intuitive and straightforward. I pulled the ESXi USB drives from my hosts and replaced them with USB3 carriers holding 256GB NVMe drives I picked up on Amazon in a sale. That let me use all four internal SSDs for storage. Installation of Proxmox took about ten minutes with those in place. All of my HPE host hardware was recognized immediately, with no driver issues at all. That included the add-in 2 x 10Gb SFP+ card I had replaced my RAID controller card with.

ZFS setup and getting a pool up on the SSDs was a lot easier than I expected. This was my first in-depth exposure to ZFS (outside of setting up a 45Drives NAS at work with an expert on a screen share). Recovering the ZFS pools was also straightforward. I did this a LOT during reinstalls and never had any issues recovering the VMs. PS: back up your VM config files to a USB key. They’re tiny, but they speed up recoveries big time.
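
For anyone doing the same, the recovery boiled down to something like this each time (the pool name ‘tank’ and the backup mount path are just placeholders):

# List importable pools on this host, then import the one you want by name
zpool import
zpool import -f tank

# VM definitions live in /etc/pve/qemu-server; copying the saved .conf files back
# makes the VMs reappear in the GUI once their disks are on the imported pool
cp /mnt/usb-backup/qemu-server/*.conf /etc/pve/qemu-server/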

The hardware setups on the VMs were close enough to ESXi that I had few problems getting the VM system hardware up and running. The VirtIO drivers ISO solved every VM driver issue I had.

The GUI is straightforward and has a lot of functionality. I have many years of ESXi experience and there are many parallels between it and Proxmox. YMMV depending on your experience level, but I’d say you need an intermediate level of Linux knowledge to run Proxmox. You’ll spend a lot of time in the terminal and need to be comfortable editing config files.

If you want a utility or program, you can ‘apt install’ it like on any other Debian-based system. (bashtop is highly recommended for monitoring your hosts.)

git clone https://github.com/aristocratos/bashtop.git
cd bashtop
./bashtop

THE BAD

ZFS RAM use is heavy. You can see from the screenshot below that of the 91GB in use across the two hosts, only 24GB total (four VMs at 6GB each) is allocated to VMs. The rest is Proxmox and ZFS. That surprised me, as Proxmox says it only needs 2-3% for itself, but it makes sense after reading up on how ZFS operates: by default, up to 50% of available RAM is earmarked for the ZFS ARC. I’m glad I upgraded to 64GB per host before starting this.
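
If you want to see where the RAM is actually going, the ARC counters are exposed under /proc; a quick sketch, nothing Proxmox-specific:

# Current ARC size and configured maximum, in bytes
grep -E '^(size|c_max)' /proc/spl/kstat/zfs/arcstats

# Fuller report, if arc_summary from zfsutils is installed
arc_summary | head -n 40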

Error messages are really cryptic. A migration fails and you get a UPID error code twenty characters long, when it turns out the problem was that the VM had an ISO mounted as a CD. Some clearer, plain-English errors would be nice to have.

Networking was an issue. When testing with iperf3, if a 1Gb connection was plugged in, the test would always run over it, regardless of whether the gateway was on a 10Gb connection. No amount of routing changes fixed this, and it was consistent on both hosts. The only solution I could find was to unplug the 1Gb link completely, and then I got 10Gb test results. That was the weirdest issue I had.
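
For what it’s worth, one way to pin the test to a specific NIC is to bind iperf3 to that interface’s address (the addresses below are made up):

# -B sets the source address, so the client has to use the 10Gb interface
iperf3 -c 10.0.10.20 -B 10.0.10.5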

THE UGLY

Clustering is fragile. ‘Thin-stem wine glass holding up a stack of five encyclopedias in a wind storm’ fragile. You cannot join a cluster if you have VMs on the host, which forced a lot of juggling of VMs and cluster creation to get it to work. What was worse was being left with files you couldn’t edit or delete (even as root) when a cluster join failed. The system simply refuses to let you delete or undo certain cluster files, forcing you to reinstall the entire host. I reinstalled at least a dozen times to get around the various issues I ran into.

(Insert Sideshow Bob, rake to the face GIF, here)

It was excellent hands-on experience though. Towards the end, I could get a host up, running and fully configured in 20 minutes from initial boot of the installer ISO.

To get my cluster to stick, I reinstalled both hosts, set up their networking, recovered the ZFS pools on both, joined them to the cluster, and only then recovered the VMs. That order worked. Once set up, migration seems to work well, but only powered-on ‘hot’ migrations for some reason. If a VM is turned off, it refuses to migrate. Still looking into that.
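
For reference, the cluster part of that sequence is only a couple of commands (the cluster name and IP below are placeholders):

# On the first, still-empty node
pvecm create homelab

# On the second node, also with no VMs on it yet, pointing at the first node's IP
pvecm add 192.168.1.10

# Check quorum and membership before recovering any VMs
pvecm status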

I didn’t try Ceph or HA yet. I have a NAS on order and will carry on playing with those features once I can get dedicated shared storage running over 10Gb.

The Proxmox hosts have built-in backups, which I pointed at an NFS share on a spare Raspberry Pi with a plug-in SSD on the opposite side of my house. Crude, but effective. I did try Proxmox Backup Server and deleted it after a week. I was expecting a Veeam-style replacement and just couldn’t figure out how it integrated. I’ll look at it again later, but it’s not a high priority. The individual host VM backups work for now.
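
The NFS target and a manual backup run can also be done from the shell if you prefer (the storage name, IP and paths are examples):

# Register the Pi's NFS export as a backup target
pvesm add nfs pi-backup --server 192.168.1.50 --export /srv/backups --content backup

# One-off backup of VM 100 to that storage; the scheduled GUI jobs do the same thing
vzdump 100 --storage pi-backup --mode snapshot --compress zstd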

Overall, a positive experience. However, there were a few hurdles to overcome and I can see it isn’t for everyone. I won’t be going back to ESXi any time soon, but Proxmox wasn’t the silver-bullet solution I was looking for. If you want a single host to play with, go for it. Standalone it is brilliant. The clustering adds a lot of weight.

6 Likes

Most hypervisors require VMs to be configured for HA/clustering migration when they are first initialized.
And all hypervisors require VMs to be powered down when initializing or joining the cluster, domain, etc.

Proxmox is not yet a drop-in replacement for ESXi in Windows domain environments, but it soon will be. If you have mission-critical environments, your two options right now are:

  1. Write a blank check to Broadcom for the revocable privilege of using the software on which you built your enterprise.

  2. Switch to Server Datacenter and Hyper-V, then pray Proxmox reaches a mature enough state before you are required to have an active Azure license to boot your on-prem Windows Server installation.

It’s a race to the bottom for hypervisors and we will hopefully see an unbelievable transfer of wealth before the dust settles. Microsoft will continue outsourcing coding until nothing remains and Broadcom will keep squeezing users with price gouging.

For now, Microsoft is somehow the lesser of the 2 evils… and I now have to go wash my hands.

8 Likes

Yep, they were.

Being able to enable live migration on a VM post creation is witchcraft Microsoft has yet to master.

ESXi can sometimes do it, but it depends on your vCenter configuration.

The fact Proxmox can do it at all is dark magic that future sysadmins will take for granted.

I have yet to review any Proxmox code or dig deep, as I am already wading through a sea of small and medium businesses that have to switch from ESXi.

Broadcom’s price hikes came at an inopportune time, as people were still upgrading to meet ESXi 8.0’s hardware requirements. Server 2022’s requirements are much lower, and we can enable trusted guardian with TPM 2.0, SR-IOV, and IOMMU.

We are exploring Proxmox as a potential backup/logging solution, as hardware RAID is truly dead and ZFS is now a minimum system requirement.

3 Likes

A fresh install of Proxmox 8.1 has a ZFS memory fix that limits ARC usage to 10% of system memory. If you upgrade an existing Proxmox install, you will need to create the zfs.conf file manually.

There is a small mention of this on the Proxmox Wiki - ZFS_on_Linux - sysadmin_zfs_limit_memory_usage

For my GTR7 Pro system with 96 GB of RAM using 2 x 2 TB 990 Samsung Pro drives:

/etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=8589934592
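
If I read the wiki right, that value is 8 GiB in bytes and the limit only takes effect once the module picks it up, so after editing the file you need something like this (double-check against the wiki page above):

# Rebuild the initramfs so the limit applies at boot, then reboot
update-initramfs -u -k all

# Or push the new limit in immediately without a reboot
echo 8589934592 > /sys/module/zfs/parameters/zfs_arc_max
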
9 Likes

‘Hot’ powered-on migrations are typical between hosts in the same cluster. ‘Cold’ powered-off migrations work between clusters. The key is to have identical networking on all hosts: same VLAN numbers, same names, same settings, and you’ll have no issues.

I’ve done hot migrations between clusters, but again, the networking has to be exactly the same on both clusters. (The physical switches in between the two sets of hosts also need to support the relevant VLANs.)

Tip: Migrations are also a good time to change thick provisioning to thin to save yourself some space.
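
From the CLI side it’s roughly this (the VM ID, node name and storage name are examples):

# Live 'hot' migration of VM 101 to node pve2; --with-local-disks also copies its local volumes
qm migrate 101 pve2 --online --with-local-disks

# Moving a disk between storages is also where you can switch to thin on storage that supports it
qm disk move 101 scsi0 local-zfs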

4 Likes

I strongly recommend against thin provisioning if the client has the hardware to support thick.

Backups take far longer, and if defrag is ever invoked inside the VM, the disk balloons to max size, which on an over-provisioned system will take down the entire cluster.

anecdotal evidence:
I had a client call stating their server with 24 TB usable storage was full.
Standard use in a medium-sized construction business (accounting and job information). They had around 1 TB of data, but the original sysadmin provisioned a single 6 TB VHD per VM, and a significant Windows update caused them all to expand to max size.

I had to manually resize the VHDX files to shrink them in place, as transferring 30 TB off 5 drives would simply have taken too long, and the downtime was costing $120k per day. After shrinking, I added virtual data drives to each VM and shrank the OS volumes down to 200 GB (one would not go below 1.2 TB without corrupting the install).

Swapped them from 5x HDDs (don’t use RAID 5) to 4x 8TB SSDs.

Keep your provisions appropriate and avoid over provisioning.

Besides that, thick provisioning gives a decent performance improvement.

4 Likes

I’ve used VMware for a really long time, nearly a decade at this point. I don’t think I’ve seen VMware have a problem onboarding a host based on what VMs are running, nor any issues with migration. The key with migration in VMware is to mark an interface for “vMotion” so the data flows where you want it for that purpose. Beyond that, migrations just work in VMware. I don’t know what kind of black magic they’re using to do it, but it’s there.

Microsoft Hyper-V migrations are actually pretty robust, but getting them working is black magic unto itself. Since it’s all based around Windows (specifically Windows Server), permissions, predictably, are a nightmare. You have to have the right services running and the correct system permissions to communicate between the hosts. Once it’s set up, migrations, whether live/hot or cold/powered-off, are pretty seamless. There’s even an option for “shared nothing live migration” where the two hosts don’t share any storage/compute and the migration “just works” (after days of troubleshooting to get it operational).

I haven’t used Proxmox myself yet. I’ve heard a lot about it, but I’m not sure it’s the right fit for me; I’m leaning more towards XCP-ng. I would have liked to use OpenStack, but in true Linux fashion, the entire system is a confusing mess to get going. All the “simple” OpenStack guides are extremely limited in the capabilities you can use (clustering, etc.) without basically having a CS degree and years of OpenStack experience (don’t get me started on the unhelpful error messages).

I recently upgraded my lab from a cluster of Dell C6100s to a Dell FX2s chassis with some FC630 nodes (only one is populated and powered right now). The upgrade was such a significant leap that I finished one node and it had more than enough compute/RAM/capability to replace my entire 3-node cluster. Unfortunately, it’s still running ESXi (my VMUG subscription expires later this year and I will not be renewing). By the time the ESXi 7 VMUG license for my current FC630 node expires, I want to have at least two of my other FX2s FC630 nodes online, installed with something that’s not VMware, and all my VMs migrated to that new cluster. After the VMs are migrated, the third of the three FC630 nodes I’ve bought so far (the one currently running ESXi) will join the cluster.

I’m here because I wanted to know all the pitfalls of Proxmox when operating in a cluster, and the answers I was seeking are here, for better or worse. I’m still leaning towards XCP-ng, but Proxmox is probably something I’ll at least evaluate. My big pinch is that the FC630 only has the boot drive (a 2 x 2.5" SAS array, SATA capable) for storage; all other storage needs to be iSCSI/NFS/Ceph/network-based or FC (including FCoE or similar), i.e. something SAN-like. Right now I run the cluster’s VMs over iSCSI from a 6-drive RAID 6 of 4TB spinning rust on another system, which isn’t great; that’s another problem I have yet to solve. I’d like to move my storage to all-flash, whether that’s SATA/SAS SSDs or NVMe; the jury is still out. I also don’t have the money to make that move anytime soon, so I might be stuck on spinning rust for a while.

Which brings me to storage. Over-provisioning storage is, IMO, generally a bad idea. True story: I had a client (my day job is in IT support) with a Fibre Channel SAN connected to a small cluster of VMware ESXi hosts, and one of their data storage VMs had a disk so large that it could not be consolidated. It was thick provisioned (done before my time) and had been grown to the point that there wasn’t enough free space left to support consolidating the disks. So I set the disk to independent mode in VMware to prevent any snapshots from being taken on a disk that could never be consolidated.

For those who don’t know, when consolidating a disk, especially live (I don’t know if the offline behavior is any better), VMware writes out a new VMDK that merges the base disk and any snapshots into a single file, then switches the VM over to the new base disk and wipes the old data. This seems reasonable until you realize it consumes AT LEAST the disk’s provisioned space to do the consolidation: if you have a 1TB VMDK with snapshots, you need at least 1TB of free disk space to consolidate it.

I had an awkward email exchange with the client many months later when they hired an IT guy who was trying to implement Veeam, which relies heavily on snapshots for backups. I explained that this was a terrible idea, and that he needed to either upgrade the storage so there was enough free space to do it (still a bad idea, since the consolidation would take forever to complete), or back that disk up another way, with full details as to why. My setting the disk to independent mode, and his initiative in asking before changing it back, probably saved him from filling up whatever free space they had left with snapshots for that disk and crashing everything.

I’m not a fan of thin disks, and I prefer to keep at least the size of the largest virtual disk free on the array; anything less is asking for trouble. My advice is that if you need significant disk storage on a VM, use an independent disk, such as an RDM, or a NAS, or something that’s separately backed up.

In any case, over-provisioned thin disks are trouble in the making. It only takes a handful of VMs running defrag or some other full-disk operation, and suddenly your datastore is out of space, nothing can write to disk, and the entire cluster stops operating as a result. Avoid it if you can. The one exception where I think thin-provisioned VMs are fine is VDI, where the disk can be trimmed after the user logs out. Templates and golden images should also be thin, since who wants to waste a bunch of otherwise unused space on a template that will never grow in size? I’ll give a big shoutout to anything that does dedup, since zeroed space in deduplicated storage should never take up extra space, which makes my entire argument meaningless. I just don’t know of any small-to-medium business or home lab environment that implements it, because of the cost of doing so.

I’m still on the fence about what to move to, I’m planning to test XCP-ng, Proxmox and a few others when I outfit my other FC630 nodes. All I know right now is that I will be dropping VMware.

3 Likes

As someone who has used OpenStack in production, let me strongly pull you away from it. OpenStack is designed as a private cloud solution for large datacentres, i.e. AWS in your DC, so it has a minimal UI and an emphasis on the API. It also has many distributions, and almost every install ends up a unique snowflake due to its incredibly broad adaptability. And while it claims very wide hardware support, the truth is many of the modules are only used and maintained by one large customer that relies on them. If you’re unlucky, your distribution of choice may not support that module, or may have many bugs with it.

IMO Proxmox provides the best out-of-the-box experience right now and has the most promising momentum behind it. As long as you can adapt to some sharp corners, it can work very nicely as an ESXi replacement. Like everything, since everyone has different requirements, you’ll find a lot of variation in opinions on it.

One large pain point is SAN shared storage, Proxmox just doesn’t have a good answer for VMFS. The best it has is Clustered LVM, which lacks thin provisioning and snapshots. So it’s a tough sell if you’re still relying on a SAN.

Though as a plus, it’s nice to have good software RAID support with ZFS. Unlike ESXi.

Personally I’ve never used XCP-ng, so I can’t comment there.

3 Likes

My limited understanding of Proxmox and XCP-ng over the last year breaks it down like this:

Proxmox - Home lab, enthusiasts and SMB. Very tweakable; you can modify it as needed to fit your environment. Works across most residential/commercial hardware.

XCP-ng - Enterprise focused. Large scale and rigidly controlled. Not as flexible, and development moves a lot slower than Proxmox’s.

Both options have subscription support which can get pricey at the upper tiers.

1 Like

You mentioned your VMs using 6GB each and your ZFS RAM usage being high. Are your VMs paging a lot, and could that be related to the high ZFS RAM usage?

I’m a literal baby in the Linux world, so forgive me if that’s a silly question, but I have a fair amount of experience with Windows. My home desktop is using 14GB with a few browsers, ConnectWise Manage and Outlook open. All of our VMs in the office are running 32GB for mostly the same setup.

2 Likes

I have been standing up TrueNAS Scale deployments everywhere for the changeovers, as ESXi only likes iSCSI for bare-metal backups.

40+ gig cards are cheap, so a dedicated flash NAS with plenty of networking can be had for little more than the cost of your array plus two drives.

For 16 TB of usable storage with 40 gig to each of your three nodes, you are looking at a pretty cheap setup. Realistically you could build one for around $2k.

Your story about snapshots reminded me of a client complaining about VM performance. I found 9 nested snapshots on one VM, with a similar problem where the volume could not consolidate due to over-provisioning.

I migrated the VM to a NAS, consolidated, resized, and moved it back. What should have been a 2TB VHDX required a 16TB NAS to unfuck.

1 Like

My experience with Proxmox has been mostly positive. In my experience it has never had any critical failure and it is mostly stable. However, ZFS can be a massive pain in the ass and you should be careful using it without understanding it. It also lives outside the Linux kernel thanks to terrible licensing from Oracle. The good news is that LVM can easily do RAID and has equivalent performance with low overhead. The downside of LVM is that you lose the ZFS bells and whistles.

One of the issues I ran into is that it is very easy to customize. This is a blessing in many ways, but it also has the side effect of making it very easy to blow your own foot off down the road.

The almost-deal-breaker for me is the issues with SSH. Proxmox runs many of its cluster operations over SSH, so it needs to work between hosts. However, if your key changes for whatever reason, everything falls apart.
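
For the key breakage, the usual suggestion is something along these lines (the hostname is a placeholder, and I’d verify against the docs first):

# Regenerate and redistribute the cluster certificates and refresh the shared known_hosts
pvecm updatecerts

# If a single host's key changed, clearing the stale entry on the complaining node can be enough
ssh-keygen -R pve2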

One of the things I want to try is using Ansible for automation. Ansible should make automating VM creation a breeze.

2 Likes

Fair observation, but no. I had the same amount of RAM assigned on ESXi (I wanted an apples-to-apples comparison between the two). On the VM side, RAM use is within 3% of what it was on ESXi.

They are a pair of W2019 DCs for my home domain and two W10 VMs for testing various things I can’t do on my work network.

2 Likes

Fully agree. My home lab could crash and burn completely and I would rebuild (and have rebuilt) hosts and VMs from nothing in a few hours. That’s a risk I can live with, but YMMV depending on your own needs.

If a VM or host corrupts, I’ll use it as a training exercise to get it back up and running. If I can’t, I have backups for the stuff that does matter or is too time-consuming to replicate.

1 Like

ZFS RAM use is heavy

This is just for ARC. Unused RAM is wasted RAM, after all.

If you really need to constrain it, there’s a zfs tunable you can set to adjust the max size of the ARC.

Personally I usually just leave it at default and have not had an issue.

4 Likes

What? No.
Only if you are using EVC

???
Read up on Shared Nothing Live Migration.

Go ahead and join a Windows machine to a domain or cluster with a VM running and let me know how that works out for you.

1 Like

There are way more good things about Proxmox, and way more bad things about it, than covered here. I hope it keeps getting better and is a successful project for years to come. Personally, I’m not recommending anything less than 5 nodes in a Proxmox cluster, because of my own bad experience with 2 nodes failing at once: the single surviving node thought it was the one that went down and refused to allow any modifications or the addition of new nodes.

Hyper-converged infrastructure is really straightforward with Proxmox. Using multiple NASes with a Proxmox cluster is also really great (I used qcow2 on NFS and live migrations were a breeze). Even with local storage, live migrations are still easy (you just need to wait longer while the virtual disk transfers). I’d think iSCSI would also work well on Proxmox.

One really dumb thing I encountered was being unable to clone a VM with a ZFS backend because of some arbitrary error that an EFI disk can’t be cloned. I literally did it through the terminal with zfs send/receive, created a new VM with the same hardware, attached the disks, and it worked. Less than 10 commands and it was done. Why Proxmox couldn’t do it, no clue.
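
Roughly what the manual workaround looked like (the VM IDs and dataset names here are illustrative):

# Snapshot the source VM's volume, then replicate it under the new VM's ID
zfs snapshot rpool/data/vm-100-disk-0@clone
zfs send rpool/data/vm-100-disk-0@clone | zfs receive rpool/data/vm-101-disk-0

# Repeat for the EFI disk, then create VM 101 with matching hardware and attach the new volumes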

Because I like minimalism and modularity, I don’t like that Proxmox is more of an appliance and you’re tied to Debian. It’s understandable that the Proxmox team doesn’t want to duplicate their effort to support multiple distros (and rightfully so; they should stick to making a good platform), and I don’t blame them for that. But I wish Proxmox was a bit more portable.

However, I love that Proxmox gives you an unrestricted root shell on the OS and that its base is just pure Debian (if you don’t count the Proxmox Linux kernel). For the longest time, TrueNAS Scale (also based on Debian) didn’t allow you that access (idk if that’s still the case; I don’t care, because I don’t like or recommend TrueNAS anymore, at all).

I’m still begrudgingly recommending Proxmox to people, because I really think there’s no better alternative. OpenNebula could have been a great alternative and is relatively straightforward to deploy, but they decided you need a subscription to update if you’re a business. I’m not a business, but on principle I don’t like that freedom being restricted.

3 Likes