Background
I’m at a small business that’s running off of a Windows Server 2016 tower that the boss put together (he even modified the case to fit two AIO watercoolers in). It’s got loads of random software running on it, it’s running out of storage space, Windows updates aren’t applying cleanly anymore, I don’t know much about Windows administration, and yes, we had a ransomware attack. The boss is paranoid about the cloud, but I convinced him to order two new rackmount machines and I’ll be migrating things over to some sort of Linux server infra. I’m a software dev, so I have some knowledge of administration, but it’s not very deep. I don’t have much experience with all the various software options, so it’s hard to evaluate which are best for our needs. Wanted to ask y’all what you think. (Also, just writing it all out will be helpful anyway.)
New Server Specs
Two identical servers. One is (offsite?) backup/spare/failover.
- Supermicro A+ Server 2014CS-TR - 2U - 12x SATA/SAS
- AMD EPYC 7543P Processor 32-core 2.80GHz 256MB Cache (225W)
- 8 x 32GB PC4-25600 3200MHz DDR4 ECC RDIMM
- 2 x 800GB Micron 7300 MAX Series M.2 PCIe 3.0 x4 NVMe Solid State Drive (Boot drive, VMs and DBs)
- 2 x 280GB Intel 900p Optane SSD U.2 in PCIe-U.2 Risers (ZFS SLOG and Metadata vdev)
- 6 x 14TB SATA 6.0Gb/s 7200RPM - 3.5" - Ultrastar DC HC530 (512e/4Kn)
- Supermicro AIOM OCP 3.0 - 2x 10GbE SFP+ - Intel XL710-BM2 - PCI-E 3.0 x8 - AOC-ATG-i2SM
- Supermicro AOM-TPM-9665V-S - Trusted Platform Module - TPM 2.0 - X10 X9 MBs - Server Provisioned - Vertical
Workload
Typical small business stuff. Nothing particularly difficult.
- Nextcloud instance for ~10 employees
- WordPress site that sees no traffic
- Node.js web application that services a few dozen internal and external users
- Webapp connects to Excel running on the Windows Server host to do some Excel processing (ugh, will migrate to a Windows VM)
- Uses MariaDB database of moderate size (3.5GB SQL dumps growing ~200MB/mo)
- Accesses a few million small Excel and CSV files for processing.
- Streams video files (~11TB of video files, growing ~1TB/mo)
- Nginx reverse proxy
There are a few more services I’d like to run but couldn’t be arsed to install them into our current mess.
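A rough back-of-the-envelope check on how long the six 14TB drives would last at that video growth rate. The two pool layouts (one 6-wide RAIDZ2 vdev, or three 2-way mirrors) and the 80% fill ceiling are assumptions for illustration, not decisions made above, and the numbers ignore TB vs. TiB, ZFS overhead, compression, snapshots, and the much smaller database growth:

```python
def months_until_full(usable_tb, used_tb, growth_tb_per_month, max_fill=0.8):
    """Months until the pool reaches max_fill of its usable capacity."""
    headroom_tb = usable_tb * max_fill - used_tb
    if headroom_tb <= 0:
        return 0.0
    return headroom_tb / growth_tb_per_month


DRIVES = 6
DRIVE_TB = 14

# Hypothetical layouts: usable space before any ZFS overhead.
layouts = {
    "raidz2, 6 wide": (DRIVES - 2) * DRIVE_TB,     # two drives' worth of parity
    "3 x 2-way mirror": (DRIVES // 2) * DRIVE_TB,  # half the raw space
}

for name, usable_tb in layouts.items():
    months = months_until_full(usable_tb, used_tb=11, growth_tb_per_month=1)
    print(f"{name}: {usable_tb} TB usable, ~{months:.0f} months of headroom")
```

Under those assumptions RAIDZ2 gives roughly 34 months before hitting the 80% mark and mirrors roughly 23, so either way the video growth is the thing to plan around.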
Infrastructure Software
There’s a lot of ways to manage some servers with some VMs and containers, and it’s hard to choose between them. Let me emphasize what I’m specifically concerned about.
Goals
- ZFS Filesystem for storage. I’m already familiar with it. Snapshots. Send/receive. Transparent compression. Metadata cache to speed up the webapp finding random files.
- Simple, reliable, (immutable?) host OS/hypervisor. Reduced attack surface. Long term support. Almost never reboot the host machine.
- Infrastructure as code. I want to be able to apply the same config to both servers. I want to have a single source of truth for all configuration for all software on the hosts and inside containers and VMs. I want to avoid coming back to a server that I haven’t touched in months and trying to remember which config file I edited or whatever. All the config should live inside one Git repo with commit history. When I leave, someone else should be able to look at the repo and understand how things work rather than going spelunking in VMs trying to find every little config change that was needed to make it work.
- Good monitoring dashboard and alerts. Anticipate hardware failures. Solve problems quickly.
- Ideally Active/Passive failover to the offsite backup host using e.g. CloudFlare DNS and host monitoring.
- Test environment for testing changes before deployment.
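To make the active/passive idea concrete, here is a minimal sketch of the decision logic only. Everything in it is hypothetical: `check_primary` stands in for whatever health probe is used, `point_dns_at` stands in for a call to the DNS provider's API, and the IPs are documentation addresses. It fails over after a few consecutive failed checks and leaves failing back as a manual step:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FailoverMonitor:
    """Flip DNS to the standby after `threshold` consecutive failed checks."""
    check_primary: Callable[[], bool]    # hypothetical probe: True if primary is healthy
    point_dns_at: Callable[[str], None]  # hypothetical hook: update the DNS record
    primary_ip: str
    standby_ip: str
    threshold: int = 3
    failures: int = 0
    active: str = "primary"

    def tick(self) -> str:
        """Run one health check and fail over if the threshold is reached."""
        healthy = self.check_primary()
        self.failures = 0 if healthy else self.failures + 1
        if self.active == "primary" and self.failures >= self.threshold:
            self.point_dns_at(self.standby_ip)
            self.active = "standby"
        return self.active

    def fail_back(self) -> None:
        """Manual step: point DNS back at the primary once it is trusted again."""
        self.point_dns_at(self.primary_ip)
        self.active = "primary"
        self.failures = 0


# Demo with canned probe results instead of a real health check.
results = iter([True, False, False, False])
dns_updates = []
monitor = FailoverMonitor(lambda: next(results), dns_updates.append,
                          primary_ip="203.0.113.10", standby_ip="198.51.100.20")
print([monitor.tick() for _ in range(4)], dns_updates)
```

The monitor has to run somewhere other than the primary (the standby or a third location), and the DNS record's TTL bounds how quickly clients actually move over.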
Options
I’m looking for something that’s between the old-school manual hand-holding of a few servers and the new big complicated enterprisey ways to manage thousands of servers. Here’s what I’ve considered so far:
NixOS
I’ve been playing around with this and it’s more usable than I expected.
Pros:
- Single configuration language declaratively specifies packages, configuration, etc. for an entire system
- Atomic upgrades and rollbacks for packages and configuration
- Declaratively specify containers and VMs with the same system. Move services between the host and containers/VMs with ease.
- Tools like NixOps for provisioning many machines with shared configuration.
- Availability of up-to-date packages.
Cons:
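- No long term support for the host system. Stable releases come out about every six months and the previous one is only supported for a short overlap, so upgrades are frequent.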
- A niche and somewhat difficult system to leave behind for the next guy.
- No professional security team for packages. Security patches may take longer to release than other distros. (You can “easily” make a custom version of a package which includes the security patch if you stay on top of things but I haven’t tried this yet and I’m scared).
- A lot of change in the ecosystem with things like flakes and changes to containers and packages in flux.
- Poor documentation and complicated command line interfaces.
- Things are fun and easy until you stray off the beaten path or something breaks, and now you need an intimate knowledge of both Nix and the software it’s configuring. Don’t want to be reading the Nix language manual at 2AM trying to figure out why the nixos-containers module isn’t forwarding this port or whatever.
- Nothing is configured out of the box; you build up the machine from scratch.
I still really want to like it. It could be great with some more investment into learning the tools, but then I’m dooming the next guy (or future me) who doesn’t have that knowledge. Will probably be awesome in 10 years.
Proxmox
I’d use a configuration management tool to help configure the hosts and VMs. I know Ansible pretty well already so probably that.
Pros:
- LTS Debian host
- Fairly widely used and well documented
- Good web dashboard for managing multiple hosts
- Easy support for Ceph
- Support for HA clustering
- Good support for ZFS snapshots and replication
- Built-in support for OpenID Connect auth
- Built-in options for monitoring, notification, backups
- Ansible modules for configuring Proxmox things
- Good web remoting into VMs. At least useful for configuring the Windows Excel VM, and for checking on things if I break SSH.
Cons:
- Clustering is not super well documented, requires 3 hosts, inflexible
- Very much in the vein of manually configuring things using web GUI or CLI. You can use Ansible to configure everything but then the web GUI is just a half-useless temptation to screw with things.
- If I use the Proxmox ISO then things come configured in a specific way outside of the configuration management system, and now I have to understand what they did and how to work around it.
- If I manually install it on top of Debian with Ansible I still need to learn how they want everything to be configured, and now I’m not saving much work over just setting up a Debian machine myself how I want to do it.
- The web dashboard doesn’t display all the stats I’d want to monitor so I’d still need to set up a custom one with Grafana et al.
- The replication, live migration, etc. features look nice but they wouldn’t work for the DR scenario where I need to failover to the offsite replica if the master server has already failed and is inaccessible. Would be nice for installing kernel updates with no downtime though.
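For the DR case in that last point, one option that does not depend on cluster features is plain ZFS send/receive pushed to the offsite box over SSH. A minimal sketch, assuming hypothetical dataset, snapshot, and host names and that both snapshots already exist on the source; it only builds and prints the commands, and `run_pipeline` is there to show how the pipe would be wired up:

```python
import shlex
import subprocess


def replication_pipeline(dataset, prev_snap, new_snap, remote_host, remote_dataset):
    """Return (send_cmd, recv_cmd) argv lists for one incremental replication step."""
    send_cmd = ["zfs", "send", "-i", f"{dataset}@{prev_snap}", f"{dataset}@{new_snap}"]
    recv_cmd = ["ssh", remote_host, "zfs", "receive", remote_dataset]
    return send_cmd, recv_cmd


def run_pipeline(send_cmd, recv_cmd):
    """Pipe `zfs send` into `zfs receive`; raise if either side fails."""
    send = subprocess.Popen(send_cmd, stdout=subprocess.PIPE)
    recv = subprocess.Popen(recv_cmd, stdin=send.stdout)
    send.stdout.close()  # so the sender gets SIGPIPE if the receiver exits early
    recv_rc = recv.wait()
    send_rc = send.wait()
    if send_rc != 0 or recv_rc != 0:
        raise RuntimeError(f"replication failed (send={send_rc}, receive={recv_rc})")


# Hypothetical names, printed rather than executed.
send_cmd, recv_cmd = replication_pipeline(
    "tank/video", "daily-01", "daily-02", "offsite.example.com", "backup/video")
print(shlex.join(send_cmd), "|", shlex.join(recv_cmd))
```

Because the offsite host ends up with its own complete copy of each snapshot, it stays usable even when the primary is unreachable.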
If I were being really pragmatic I think I’d just install Proxmox and do things the old fashioned way. Just use Ansible (Terraform? Something else?) to provision the VMs and don’t do anything fancy on the host. I’m just generally wary of these integrated “easy to use” appliances since they never do everything I want, it’s annoying to bend them to my will, and they’re another layer of complexity to learn and figure out rather than building it myself so I know how it works.
XCP-ng
Pretty much the same pros/cons as Proxmox with some extras
Pros:
- Xen hypervisor so we can’t do anything funky on the host.
Cons:
- Xen hypervisor so we can’t do anything funky on the host.
- Less popular and documented than Proxmox.
- Not integrated with ZFS, wants you to use XOSAN?
Haven’t looked into it too deeply but if I were to use this type of tool I think I’d use Proxmox instead.
Kubernetes
Probably with something designed for smaller use-cases like MicroK8s.
Pros:
- Lots of industry momentum and support.
- Made for the whole infrastructure as code, CI/CD containers thing.
- Lots of tools built around this for networking, configuring, secret management, authentication, etc.
Cons:
- I have no experience with this and am intimidated.
- Everyone talks about how over-complicated Kubernetes is unless you’re Google sized, which I am not.
- All the HA stuff is based around the traditional 3+ node voting quorum.
- I don’t really need the auto-scaling auto-migration many-hosts bin-packing microservices stuff.
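The quorum point is worth spelling out, since it is why two servers alone do not give automatic HA. A small worked example, assuming simple majority voting with one vote per node:

```python
def quorum(n_nodes: int) -> int:
    """Smallest number of votes that is a strict majority of n_nodes."""
    return n_nodes // 2 + 1


def tolerated_failures(n_nodes: int) -> int:
    """How many nodes can be lost while the rest still hold a majority."""
    return n_nodes - quorum(n_nodes)


for n in (2, 3, 5):
    print(f"{n} nodes: quorum {quorum(n)}, survives {tolerated_failures(n)} failure(s)")
```

With two nodes the quorum is two, so losing either one stops the cluster from making decisions; three nodes is the smallest size that survives a single failure.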
Linux+?
Just build from scratch and use one of the following tools to configure it. Probably Debian. Also considered Ubuntu (not a big fan of Snaps), Fedora with rpm-ostree (want longer support), and CentOS alternatives (still getting built).
+Ansible
Pros:
- I already know it decently well
- Agentless, masterless design - can use to build up servers from bare installs with no prerequisites (besides SSH and, for most modules, Python on the target)
Cons:
- Templated YAML with Python expressions
- Many complex ways to write the same config file
- Slow
- Not declarative
- Can be hard to make custom things idempotent
- Distributed with pip; the control node doesn’t run natively on Windows
- Combine with something else to create Container, VM images?
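On the idempotency point: the pattern a custom task has to follow is "check first, change only if needed, and report whether anything changed". A minimal sketch of that pattern in plain Python, similar in spirit to what Ansible's `lineinfile` module does; the file name and setting are just example values:

```python
import tempfile
from pathlib import Path


def ensure_line(path: Path, line: str) -> bool:
    """Make sure `line` is present in the file; return True only if the file changed."""
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False  # already in the desired state, so do nothing
    existing.append(line)
    path.write_text("\n".join(existing) + "\n")
    return True


# Running it twice: the first call changes the file, the second is a no-op.
with tempfile.TemporaryDirectory() as tmp:
    config = Path(tmp) / "sshd_config"
    print(ensure_line(config, "PasswordAuthentication no"))
    print(ensure_line(config, "PasswordAuthentication no"))
```

A task that shells out to an arbitrary command has no such check built in, which is where the "hard to make idempotent" pain tends to come from.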
+Nix
Pros:
- Could install Nix on Debian to get a longer supported kernel and a way out if Nix isn’t good for something.
Cons:
- Don’t get to use Nix to configure the whole system
- If you’re not using it to configure the whole system you lose out on a lot of the benefits, and probably need another tool for configuration as well
+HashiCorp (Terraform, Packer?, Vault, Nomad?, Consul?, Boundary?)
Pros:
- Their tools seem pretty well designed overall
- Terraform is agentless, masterless, and declarative
- Should be pretty well integrated to deploy multiple of these tools
Cons:
- Have to learn Terraform
- Terraform is very much focused on cloud deployments. I don’t think it’s useful for configuring the host machines? Could use e.g. the libvirt provider to configure VMs though.
- Enterprise only features, do I care?
+OpenStack
My perception has been that OpenStack is very enterprisey and complicated. Haven’t looked into it deeply, need to do that. The documentation is not very clear.
Question
Whether you read all that or not, I’ll leave you with this question: If you were starting from scratch with a few on-prem servers, what would you use to provision and configure them so that you can understand and control them from one place?