Help me decide how to provision some small on-prem infrastructure

Background

I’m at a small business that’s running off of a Windows Server 2016 tower that the boss put together (he even modified the case to fit two AIO watercoolers in). It’s got loads of random software running on it, it’s running out of storage space, Windows updates aren’t applying cleanly anymore, I don’t know much about Windows administration, and yes we had a ransomware attack. The boss is paranoid about the cloud but I convinced him to order two new rackmount machines and I’ll be migrating things over to some sort of Linux server infra. I’m a software dev so I have some knowledge of administration but not very deep. I don’t have much experience with all the various software options so it’s hard to evaluate which are best for our needs. Wanted to ask y’all what you think. (Also just writing it all out will be helpful anyway).

New Server Specs

Two identical servers. One is (offsite?) backup/spare/failover.

  • Supermicro A+ Server 2014CS-TR - 2U - 12x SATA/SAS
  • AMD EPYC 7543P Processor 32-core 2.80GHz 256MB Cache (225W)
  • 8 x 32GB PC4-25600 3200MHz DDR4 ECC RDIMM
  • 2 x 800GB Micron 7300 MAX Series M.2 PCIe 3.0 x4 NVMe Solid State Drive (Boot drive, VMs and DBs)
  • 2 x 280GB Intel 900p Optane SSD U.2 in PCIe-U.2 Risers (ZFS SLOG and Metadata vdev)
  • 6 x 14TB SATA 6.0Gb/s 7200RPM - 3.5" - Ultrastar DC HC530 (512e/4Kn)
  • Supermicro AIOM OCP 3.0 - 2x 10GbE SFP+ - Intel XL710-BM2 - PCI-E 3.0 x8 - AOC-ATG-i2SM
  • Supermicro AOM-TPM-9665V-S - Trusted Platform Module - TPM 2.0 - X10 X9 MBs - Server Provisioned - Vertical

Workload

Typical small business stuff. Nothing particularly difficult.

  • Nextcloud instance for ~10 employees
  • WordPress site that sees no traffic
  • Node.js web application that services a few dozen internal and external users
    • Webapp connects to Excel running on the Windows Server host to do some Excel processing (ugh, will migrate to a Windows VM)
    • Uses MariaDB database of moderate size (3.5GB SQL dumps growing ~200MB/mo)
    • Accesses a few million small Excel and CSV files for processing.
    • Streams video files (~11TB of video files, growing ~1TB/mo)
  • Nginx reverse proxy

There are a few more services I’d like to run but couldn’t be arsed to install them into our current mess.

Infrastructure Software

There’s a lot of ways to manage some servers with some VMs and containers, and it’s hard to choose between them. Let me emphasize what I’m specifically concerned about.

Goals

  • ZFS filesystem for storage. I’m already familiar with it. Snapshots. Send/receive. Transparent compression. Metadata cache to speed up the webapp finding random files. (Rough pool layout sketch after this list.)
  • Simple, reliable, (immutable?) host OS/hypervisor. Reduced attack surface. Long term support. Almost never reboot the host machine.
  • Infrastructure as code. I want to be able to apply the same config to both servers. I want to have a single source of truth for all configuration for all software on the hosts and inside containers and VMs. I want to avoid coming back to the server that I haven’t touched in months and try to remember which config file I edited or whatever. All the config lives inside one git repo with commit history. When I leave, someone else should be able to look at the repo and understand how things work rather than going spelunking in VMs trying to find every little necessary config change that needs to be done for it to work.
  • Good monitoring dashboard and alerts. Anticipate hardware failures. Solve problems quickly.
  • Ideally Active/Passive failover to the offsite backup host using e.g. CloudFlare DNS and host monitoring.
  • Test environment for testing changes before deployment.
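
For concreteness, here’s roughly the pool layout I have in mind for the six HDDs plus the two Optanes. This is only a sketch: device paths, dataset names, the RAIDZ2 choice, and the 64K threshold are assumptions I’d revisit, and the Optanes would need to be partitioned first so each can contribute a slice to both the special mirror and the SLOG mirror.

```bash
# Sketch only: RAIDZ2 over the 6x14TB drives, with the Optane partitions split
# between a mirrored special (metadata) vdev and a mirrored SLOG.
zpool create -o ashift=12 -O compression=lz4 -O atime=off tank \
  raidz2 /dev/disk/by-id/hdd0 /dev/disk/by-id/hdd1 /dev/disk/by-id/hdd2 \
         /dev/disk/by-id/hdd3 /dev/disk/by-id/hdd4 /dev/disk/by-id/hdd5 \
  special mirror /dev/disk/by-id/optane0-part1 /dev/disk/by-id/optane1-part1 \
  log     mirror /dev/disk/by-id/optane0-part2 /dev/disk/by-id/optane1-part2

# Metadata always lands on the special vdev; special_small_blocks additionally
# sends small file blocks there, which should help the webapp's lookups over
# millions of small Excel/CSV files (threshold is a guess, tune per dataset).
zfs set special_small_blocks=64K tank
zfs create tank/files
zfs create -o recordsize=1M tank/video   # large sequential video files
```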

Options

I’m looking for something that’s between the old-school manual hand-holding of a few servers and the new big complicated enterprisey ways to manage thousands of servers. Here’s what I’ve considered so far:


NixOS

I’ve been playing around with this and it’s more usable than I expected.

Pros:

  • Single configuration language declaratively specifies packages, configuration, etc. for an entire system
  • Atomic upgrades and rollbacks for packages and configuration (quick CLI sketch after this list)
  • Declaratively specify containers and VMs with the same system. Move services between the host and containers/VMs with ease.
  • Tools like NixOps for provisioning many machines with shared configuration.
  • Availability of up-to-date packages.
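
For what it’s worth, the day-to-day upgrade/rollback workflow is just a couple of commands. The host name and flake-style layout below are hypothetical; the non-flake `nixos-rebuild switch` against /etc/nixos works the same way.

```bash
# Rebuild the running system from the configuration in the git repo.
nixos-rebuild switch --flake .#server1

# If the new generation misbehaves, atomically roll back to the previous one.
nixos-rebuild switch --rollback

# Inspect older system generations if needed.
nix-env --list-generations -p /nix/var/nix/profiles/system
```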

Cons:

  • No long term support for the host system. Need to upgrade yearly.
  • A niche and somewhat difficult system to leave behind for the next guy.
  • No professional security team for packages. Security patches may take longer to release than other distros. (You can “easily” make a custom version of a package which includes the security patch if you stay on top of things but I haven’t tried this yet and I’m scared).
  • A lot of churn in the ecosystem, with things like flakes and changes to containers and packaging still in flux.
  • Poor documentation and complicated command line interfaces.
  • Things are fun and easy until you stray off the beaten path or something breaks, and now you need an intimate knowledge of both Nix and the software it’s configuring. Don’t want to be reading the Nix language manual at 2AM trying to figure out why the nixos-containers module isn’t forwarding this port or whatever.
  • Nothing is configured out of the box, build up the machine from scratch.

I still really want to like it. It could be great with some more investment into learning the tools, but then I’m dooming the next guy (or future me) who doesn’t have that knowledge. Will probably be awesome in 10 years.

Proxmox

I’d use a configuration management tool to help configure the hosts and VMs. I know Ansible pretty well already so probably that.

Pros:

  • LTS Debian host
  • Fairly widely used and well documented
  • Good web dashboard for managing multiple hosts
  • Easy support for Ceph
  • Support for HA clustering
  • Good support for ZFS snapshots and replication
  • Built-in support for OpenID connect auth
  • Built-in options for monitoring, notification, backups
  • Ansible provider for configuring proxmox things
  • Good web remoting into VMs. At least useful for configuring the Windows Excel VM, and for checking on things if I break SSH.

Cons:

  • Clustering is not super well documented, requires 3 hosts, inflexible
  • Very much in the vein of manually configuring things using web GUI or CLI. You can use Ansible to configure everything but then the web GUI is just a half-useless temptation to screw with things.
  • If I use the proxmox ISO then things come configured in a specific way outside of the configuration management system and now I have to understand what they did and how to work around it.
  • If I manually install it on top of Debian with Ansible I still need to learn how they want everything to be configured, and now I’m not saving much work over just setting up a Debian machine myself how I want to do it.
  • The web dashboard doesn’t display all the stats I’d want to monitor, so I’d still need to set up a custom one with Grafana et al.
  • The replication, live migration, etc. features look nice but they wouldn’t work for the DR scenario where I need to failover to the offsite replica if the master server has already failed and is inaccessible. Would be nice for installing kernel updates with no downtime though.

If I were being really pragmatic I think I’d just install Proxmox and do things the old fashioned way. Just use Ansible (Terraform? Something else?) to provision the VMs and don’t do anything fancy on the host. I’m just generally wary of these integrated “easy to use” appliances since they never do everything I want, it’s annoying to bend them to my will, and they’re another layer of complexity to learn and figure out rather than building it myself so I know how it works.
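
If I went that route, provisioning e.g. the Windows Excel VM would presumably boil down to a couple of `qm` calls kept in the repo. The VM ID, storage name, sizes, and ISO path below are made up, just to show the shape of it:

```bash
# Hypothetical sketch: create and start the Windows VM for the Excel processing.
qm create 101 --name win-excel --memory 8192 --cores 4 \
  --net0 virtio,bridge=vmbr0 \
  --scsihw virtio-scsi-pci --scsi0 local-zfs:64 \
  --ostype win10 --cdrom local:iso/windows-server.iso
qm start 101
```

Whether that ends up wrapped in Ansible or a plain shell script, at least it would live in git.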

XCP-ng

Pretty much the same pros/cons as Proxmox with some extras

Pros:

  • Xen hypervisor so we can’t do anything funky on the host.

Cons:

  • Xen hypervisor so we can’t do anything funky on the host.
  • Less popular and documented than Proxmox.
  • Not integrated with ZFS, wants you to use XOSAN?

Haven’t looked into it too deeply but if I were to use this type of tool I think I’d use Proxmox instead.

Kubernetes

Probably with something designed for smaller use-cases like MicroK8s.

Pros:

  • Lots of industry momentum and support.
  • Made for the whole infrastructure as code, CI/CD containers thing.
  • Lots of tools built around this for networking, configuring, secret management, authentication, etc.

Cons:

  • I have no experience with this and am intimidated.
  • Everyone talks about how over-complicated Kubernetes is unless you’re Google sized, which I am not.
  • All the HA stuff is based around the traditional 3+ node voting quorum.
  • I don’t really need the auto-scaling auto-migration many-hosts bin-packing microservices stuff.

Linux+?

Just build from scratch and use one of the following tools to configure it. Probably Debian. Also considered Ubuntu (not a big fan of Snaps), Fedora with RPM-ostree (want longer support), CentOS alternatives (still getting built).

+Ansible

Pros:

  • I already know it decently well
  • Agentless, masterless design - can use it to build up servers from bare installs with no prerequisites (besides SSH); basic workflow sketch after these lists

Cons:

  • Templated YAML with Python expressions
  • Many complex ways to write the same config file
  • Slow
  • Not declarative
  • Can be hard to make custom things idempotent
  • Distributed with pip; the control node doesn’t run natively on Windows
  • Combine with something else to create Container, VM images?
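
For reference, the whole agentless loop I’m used to looks like this (inventory and playbook names are hypothetical):

```bash
# Sanity-check SSH connectivity to both hosts; no agent or central server required.
ansible all -i inventories/prod/hosts.ini -m ping

# Preview what would change, then apply. Everything referenced lives in the git repo.
ansible-playbook -i inventories/prod/hosts.ini site.yml --check --diff
ansible-playbook -i inventories/prod/hosts.ini site.yml
```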

+Nix

Pros:

  • Could install Nix on Debian to get a longer supported kernel and a way out if Nix isn’t good for something.

Cons:

  • Don’t get to use Nix to configure the whole system
  • If you’re not using it to configure the whole system you lose out on a lot of the benefits, and probably need another tool for configuration as well

+HashiCorp (Terraform, Packer?, Vault, Nomad?, Consul?, Boundary?)

Pros:

  • Their tools seem pretty well designed overall
  • Terraform is Agentless, masterless, and declarative
  • Should be pretty well integrated to deploy multiple of these tools

Cons:

  • Have to learn Terraform
  • Terraform is very much focused on Cloud deployments. I don’t think it’s useful for configuring the host machines? Could use e.g. the libvirt provider to configure VMs though.
  • Enterprise only features, do I care?

+OpenStack

My perception has been that OpenStack is very enterprisey and complicated. Haven’t looked into it deeply, need to do that. The documentation is not very clear.

Question

Whether you read all that or not I’ll leave you with this question: If you were starting from scratch with a few on-prem servers, what would you use to provision and configure them so that you can understand and control them from one place?

This seems like something that @anon47317651 may be able to help with.

Honestly, we use VMware and KVM/QEMU at the shop that I am at. At home I personally use LXC, but that will not work for your MS Windows systems.

Since you are saying that ZFS will be the file system, I would wholeheartedly recommend going with a BSD flavor so that ZFS is truly a first class citizen.

2 Likes

I don’t really have a problem with the level of integration of ZFS on Linux nowadays. I have run FreeNAS before, though, so I should consider that at least a little. bhyve is pretty good these days, I hear.

1 Like

I haven’t read everything, but I still recommend Proxmox. On the “inflexible” part about requiring 3 servers: you only need your 2 new chunky boys, then take an Intel NUC or an old PC (or even your old Windows Server), install Proxmox on it and only use it as a tie-breaker. You don’t need to run anything on it other than Proxmox; just add it to the cluster.

The cluster part isn’t documented much because it’s basically 3 clicks: Data Center → New Cluster → name it and hit Done on the first one; then on the second and third servers: Data Center → Join Cluster → copy and paste the join info from the first server → enter the root password and hit Done. EZ PZ.
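
If you’d rather script it than click, it’s roughly the following. The IPs are placeholders, and the QDevice variant is an alternative to installing full Proxmox on the tie-breaker (it needs corosync-qnetd running on that box):

```bash
# On the first node: create the cluster.
pvecm create my-cluster

# On the second node: join it using the first node's IP.
pvecm add 192.0.2.10

# Optional: use a small Linux box running corosync-qnetd as an external
# tie-breaker instead of a third full Proxmox install.
pvecm qdevice setup 192.0.2.20

# Verify quorum.
pvecm status
```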

I’ll write more in-depth once I read everything. Sorry for the rushed reply.

2 Likes

Ok, coming back to this. In a situation where you need some Windows VMs, then you can’t get away from virtualization. So you are kind of limited to Proxmox / XCP-ng / OpenStack / OpenNebula / oVirt / Virt-Manager etc. Out of all of these, XCP-ng doesn’t seem to meet your needs. OpenStack and OpenNebula are enterprise-cloud solutions, which also include cost-control stuff built-in. You probably don’t need that. oVirt is apparently getting better, but the easiest path is still Proxmox. Some of its shortcomings in your case can be remediated.

Never used NixOS, can’t comment on it.

As for k8s, if you do something with it, need a small stack, and want to avoid MicroK8s because of Ubuntu, then your only lightweight alternatives are k3s and k0s (the former being the more popular and better documented; quick install sketch below). I’m not sure how you’re supposed to do a cluster in Kubernetes with just 3 hosts, because K8s separates the control plane from the workers. You’d probably end up running the control plane and the worker role on all 3 nodes, which is not recommended… To get around that with 2 servers, and because of the default ~110 pods-per-node limit (which can be raised to roughly 250), you would have to use LXD to split the EPYC servers into multiple nodes. So K8s becomes a mess really fast. Add the fact that you also need Windows and pure K8s goes out the window (pun intended).
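
For scale, here is about how much work a small k3s setup is. The URL and token below are placeholders; the token comes from the first node after install:

```bash
# First node: installs k3s running both the control plane and workloads.
curl -sfL https://get.k3s.io | sh -

# Additional node joining as an agent; the token lives in
# /var/lib/rancher/k3s/server/node-token on the first node.
curl -sfL https://get.k3s.io | K3S_URL=https://192.0.2.10:6443 K3S_TOKEN=<token> sh -

kubectl get nodes
```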

A bonus point for Proxmox is that you can use LXC containers for things like Nextcloud, WordPress and Nginx, so you save some resources; the downside is that you can’t live-migrate them, you need VMs for that. But in case a host fails they will start back up if you have HA enabled, they just get “restarted” rather than migrated. Just don’t use LXC for DBs (they will work, but I’ve read it’s not really recommended; I’ve yet to understand why, since it’s usually avoided because people did stupid stuff like failing to back up or create persistent storage for OCI containers).
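
To give an idea of the effort, spinning up one of those LXC containers (say, for Nextcloud) from a downloaded template is roughly this. The container ID, storage name, sizes, and exact template name are placeholders:

```bash
# Refresh the template index and grab a Debian template (exact name will differ).
pveam update
pveam download local debian-12-standard_12.2-1_amd64.tar.zst

# Create and start an unprivileged container for Nextcloud.
pct create 200 local:vztmpl/debian-12-standard_12.2-1_amd64.tar.zst \
  --hostname nextcloud --memory 2048 --cores 2 \
  --rootfs local-zfs:16 \
  --net0 name=eth0,bridge=vmbr0,ip=dhcp \
  --unprivileged 1
pct start 200
```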

Again, you don’t need 3 full hosts for Proxmox. The current 2 will work just fine, you just need one more to break the tie. The tie-breaker lets each host decide whether it is the one that’s down or the other server is. Whichever host maintains communication with the tie-breaker takes over running the HA-managed VMs.

I just realized you only have 2 servers, 1 on-premises and 1 off-site. Oh, bummer. Well, I’ll tell you from my experience (not proud of it): you can run services without HA and they will work fine, especially on under-utilized hardware. However, you need a lot of bandwidth between sites for HA; 1Gbps barely cuts it.

To answer this, it’s still Proxmox, because on a small scale, it’s doing a great job. Even at a larger scale it is fantastic.

But, coming back to your workloads: you mentioned the video streaming.

To how many people is it streaming? This may be the only limitation, where you may need that CPU if you can’t balance / offset it with GPU acceleration (transcoding).

For up to about 40 users total, I would probably downscale from the EPYC boxes to 3 virtualization servers (one can be there just as a tie-breaker), 1 NAS, 1 local backup server and 1 off-site server for data backup and, in a pinch, virtualization (basically for Disaster Recovery / Business Continuity Planning). I would spec them like this:

  • Virt srv (the 2 ones):
    • 6 cores (6 / 12 threads)
    • 64-96 GB of RAM
    • 2x120 GB hot-swappable SSDs in RAID mirror for the OS
  • Virtualization tie-breaker: basically any old PC, but Intel NUCs are good for their low-power consumption and small form factor. You can use a Dell 9020 USFF or something, doesn’t need to be anything special, just to not fail on you.
  • NAS: depends on your storage requirements, but growing 1 TB/mo means you will probably need an upgrade in a few years (but it’s not recommended to keep a server more than 3-5 years anyway). So:
    • 6 cores (6 or 12 threads)
    • 256 GB of RAM (for ZFS, otherwise you can get away with 32 or 64GB)
    • 2x120 GB hot-swappable SSDs in RAID mirror for the OS
    • 4x14TB 7200RPM HDDs in RAID10 (~28TB usable should last you a bit over a year at your growth rate, then you need to add more storage, so you can just buy 4 more 14TB drives and add them to this server)
    • and 4x512GB SSDs in RAID10 for the VMs OS and any additional IOPS needs (maybe you can run your MariaDB server and store your Excel and CSV files only on the SSD array, you may need 1TB drives instead depending on your storage requirements).
  • local backup server:
    • any 4 core CPU
    • 8-16GB of RAM
    • 2x120 GB hot-swappable SSDs in RAID mirror for the OS
    • 6x14 TB drives in RAID-z2
  • remote server
    • 24 cores
    • 256-512 GB of RAM
    • 2x120 GB hot-swappable SSDs in RAID mirror for the OS
    • 4x16TB 7200RPM in RAID10. Enough to keep all the data from the HDDs and the SSDs from your NAS.

You can downscale to your preference to save some costs. For example, you can get away with 4 cores (8 threads) virtualization servers and NAS and downscale the remote server to 16 cores (32 threads) and if you are willing to take a risk, remove the local backup server entirely (aside from not complying with 3-2-1 best-practice, I see no reason why not, you still have the remote backup if you need it). But again, it depends on how many people are streaming videos from your servers.

Also worth mentioning: you don’t need to mess with a separate SLOG or L2ARC at this size, your drives will offer sufficient speed by themselves, so you can save some money by not investing in too many SSDs. Also, SATA SSDs are completely fine; you should care more about IOPS and consistent performance than the raw throughput of the interface, so you don’t need M.2 NVMe SSDs.

2 Likes

uhhh…seems you missed some ZFS developments… read ZFS - Wikipedia and weep…

No, I did not. There is still some work to do on the Linux kernel side. More importantly, there are some features that will remain out of tree for the foreseeable future due to licensing. Even if ZFS on Linux is the main development branch now, those features have a higher likelihood of making their way into BSD proper before the Linux Kernel.

With that said, you choose ZFS for stability, not new features. ZFS has been proven stable on BSD, and its BSD integration goes back well over a decade. I am not saying don’t use ZoL; I am saying for a production system, don’t risk someone else’s data experimenting with the new hotness.

1 Like

I’m probably going to go against the grain here and suggest 2-3 servers: 1-2 Hyper-V hosts and a backup media agent server.

This guy is running a business, and while ZFS is nice, he will be able to get help with Windows servers from anywhere (e.g. if you are hit by a bus or on holiday), and he’s running Windows anyway; if you’re running Windows VMs, Hyper-V itself is basically free.

Yes you can install it as command line only to minimise attack surface.

What you’re proposing will be a full time system admin role to maintain.

You’re a dev so like to build things but this sort of smaller deployment doesn’t really justify making it more complicated than it has to be imho.

If and when he finally figures out that he should probably migrate a bunch of that shit to the cloud, migrating from Hyper-V or VMware is fairly trivial.

Also: when you start running into performance issues (and you will at some point), having a single vendor to deal with at both the VM host and guest level will be a lot less painful than dealing with different performance monitoring tools and metrics from two or more vendors across host and guest.

2c

2 Likes

I agree mostly with what you said, except the part about it being a full-time sysadmin role to maintain.

At my old workplace, for development and random stuff, we had 3 old PCs (an Optiplex 360 and a 790, alongside a custom build with a FX6300). They all ran XenServer. They were clustered together. I had a wiki, a Windows 10 test VM and an Ubuntu server on them. I was part of the helpdesk team, but the dev team (about 7 people) made use of XenCenter to make new VMs and test environments more than we did. Basically nobody actually maintained them, because after XenServer 7.2 or 7.3, Citrix changed the licensing, so no more updates, but we weren’t updating that frequently or checking for hardware errors anyway. This was disposable infrastructure.

In their case, they would still need something more permanent, plus backups, since they have a production environment. But I do think a dev team would be able to manage Proxmox; they don’t need a NOC or a sysadmin team to monitor it.

Just wanted to throw my two cents in here. I’m using OpenShift on oVirt and it’s pretty snazzy. Creating VMs on oVirt is pretty easy, and the OpenShift installer automagically handles the oVirt infrastructure. If you only have two nodes, you could create a single-node OpenShift cluster twice and set one up to be a backup with RH ACM. Installing ACM should be as easy as clicking on the ACM operator in OperatorHub, but I haven’t used it so I can’t say for sure.

If I were going to start my infrastructure over I would do bare-metal OpenShift, simply because it makes it easier to do things like KubeVirt and OpenShift Container Storage. And there are OpenShift migration toolkits to migrate regular VMs to OpenShift VMs, but again, I haven’t used them so I don’t know how seamless they are.

OpenShift/Kubernetes does have a learning curve, but things like Helm simplify it a lot, and it does seem to be the way of the future, so it’s probably good to learn anyway.

2 Likes

Hyper-V is nice and easy to use. The problem is that he wants to run ZFS. I would trust my life to ZFS before I trust MS Windows and any of its supported file systems.

Even with ZFS, the point about backups stands. Everyone is gangsta until they realize they would have saved time and heartache by having dependable backups.

Yeah, but I think people get too caught up in “OMG ZFS I NEED IT” and put far too much priority on that over other, more important things.

They should be more concerned about having an actual off-line, off-site backup with, say, a tape unit.

ZFS has its own issues to deal with, especially in a virtualisation context - performance-related ones where you really need to be educated to make it work effectively. They’re not insurmountable, but if you’re dealing with a small or nonexistent dedicated IT team, you’re just creating more problems for yourself and making your environment less supportable.

I work in a $1bn+ turnover company with a bunch of dedicated IT staff and we’re not running ZFS for production workloads because of this. We’ve been fine with NTFS or other non-ZFS filesystems for the past 25 years.

I’m not saying ZFS is bad. I’m just saying that having ZFS doesn’t make it the ideal fit for every problem. Maybe use it as a backup destination and for ZFS-to-ZFS replication off-site. But adding it to the mix for VM hosting is just going to limit and complicate your support options when it’s overkill for the application.

2 Likes

Same. The only time I have had to work with ZFS was when we had legit Sun Microsystems servers and they served a specific purpose. Everything else was running either RHEL or Debian with XFS, one MS Windows system, and one macOS server.

2 Likes

Really appreciate all the time y’all have spent responding to this. Going to put my responses in one post.

Re: Hardware

Some good thoughts about this but alas, the servers in the OP have already been ordered. It seemed overkill to order more machines when fewer bigger machines would do the job (and these aren’t particularly big machines). May have been a mistake given how much software is built to be reliable using more machines.

We will still have the old server, of course. It’s a ca. 2016 dual 20-core Xeon tower with 512GB RAM. I was already planning on putting Linux on it and using it for something once it’s no longer needed.

Also wouldn’t be opposed to buying more machines or buying a cloud instance somewhere, especially if it’s just for voting.

Re: Video Streaming

We’re currently just using HTML5 video tag embeds of the raw video files, no live transcoding. Have considered re-encoding them to save some space but that would be a batch job to soak up the large amount of idle CPU we have sitting around already.
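
If we ever do that batch job, it would probably be something as simple as the loop below running during off-hours. The codec, CRF, and paths are guesses I’d validate on a few samples first; sticking with H.264 keeps the plain HTML5 playback working everywhere.

```bash
# Hypothetical re-encode pass to reclaim space; all settings are placeholders.
find /tank/video -name '*.mp4' -print0 | while IFS= read -r -d '' f; do
  out="${f%.mp4}.reenc.mp4"
  [ -e "$out" ] && continue                  # skip files already processed
  ffmpeg -nostdin -n -i "$f" \
    -c:v libx264 -crf 24 -preset slow \
    -c:a copy "$out"
done
```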

Re: High Availability

The failure modes I’m concerned with are:

  1. Power outage at office. These have been happening fairly frequently. I did build a very beefy UPS which should be able to handle 12 hours of average load.
  2. Internet outage at office. Our fiber line here has only been ~99.99% reliable so far, judging by the router logging failover to our old cable connection. Lots of construction, digging, trenching around here so cable cuts are rather likely.
  3. Hardware Component failure. There’s a fair amount of redundancy in the servers so I don’t think it’s very likely to take us down. However if it did, it could take us down for days waiting for replacement parts.
  4. Theft, Sabotage, Flood. Our physical security is not great, but on the bright side there are easier things to steal? Also near a large river, I don’t think it’s likely to flood but who knows.

With those in mind here’s what I’ve been thinking about the possible solutions.

3 Node Quorum

This would allow using the industry standard tools for HA, most of them seem to work this way. Proxmox cluster, etcd (Kubernetes), Gluster/Ceph, MySQL and Postgres have options that work this way. So in some sense it would be the easy path. However, my perception through the grapevine is that those things can be difficult to get working reliably. From what I recall most of those systems are designed to have a reliable quorum inside a single datacenter and use a separate method for replication between datacenters. Not an issue when you have 10k nodes in one colo with reliable network and GPS clocks, but I’m worried about strange behavior in my… less rigorous situation. I’ll have to look more thoroughly at the options, but is this an incorrect perception?

I’m also worried about inconsistency arising from certain failure modes. There’s a lot of data uploaded and edited directly to the server at the office, and also edits and uploads remotely through the webapp. In the event of an internet connection failure at the office there could be modifications made on both sides of the partition. I don’t think any of the above systems handle this gracefully, the partitioned node would have to go fully offline which is maybe acceptable but not ideal.

Active/Passive Failover

Just two machines. Files are sent from the active to the passive using ZFS send/recv. Databases are replicated maybe just using ZFS send/recv (apparently that can work?) or by running a read-only replica on the passive. In normal operation the passive is just an offsite backup. An external service such as Cloudflare directs traffic to the active machine if it’s working, or the passive if it’s not. The services on the passive are all read-only to prevent data consistency issues in the event of a transient issue, but still provide some level of service. In the event of a longer outage I can manually reconfigure the passive machine to become active and accept writes. This would have to be set up beforehand with whatever configuration management tool and tested, of course.
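
The replication half would be roughly the sketch below, run on a timer from the active box. Dataset names, the target host, and snapshot labels are placeholders; in practice I’d probably let something like syncoid or zrepl track which snapshots have already been received rather than assuming yesterday’s snapshot exists.

```bash
# Minimal incremental ZFS replication sketch (placeholders throughout).
prev=$(date -d yesterday +%F)
today=$(date +%F)

# Snapshot everything under tank/files, then send the delta to the passive host.
zfs snapshot -r "tank/files@${today}"
zfs send -R -I "tank/files@${prev}" "tank/files@${today}" \
  | ssh backup-host zfs receive -F tank/files
```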

Having to manually handle failover is not ideal but I think it is quite practical for our situation. Our external customers and employees in the field are mostly read-only, while our office users are doing most of the writing. In the event of an office internet outage a high level of continuity is maintained automatically. There should be time to evaluate the situation and decide on a course of action. (How long will the active be down for? Is there important data recently added to the active, or can we just revert to the last snapshot?)
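
The “external service directs traffic” piece could be as small as a health check running somewhere neutral that repoints an A record through the Cloudflare API when the active site stops answering. The zone/record IDs, hostname, IP, and token below are placeholders, and Cloudflare’s paid load balancing does this natively if I’d rather not script it.

```bash
#!/usr/bin/env bash
# Hypothetical failover check; all IDs, names, and IPs are placeholders.
PASSIVE_IP="203.0.113.10"
ZONE_ID="your-zone-id"
RECORD_ID="your-record-id"

# If the active site answers its health endpoint, do nothing.
if curl -fsS --max-time 10 "https://app.example.com/healthz" >/dev/null; then
  exit 0
fi

# Otherwise repoint the A record at the passive (read-only) host.
curl -fsS -X PUT \
  "https://api.cloudflare.com/client/v4/zones/${ZONE_ID}/dns_records/${RECORD_ID}" \
  -H "Authorization: Bearer ${CF_API_TOKEN}" \
  -H "Content-Type: application/json" \
  --data "{\"type\":\"A\",\"name\":\"app.example.com\",\"content\":\"${PASSIVE_IP}\",\"ttl\":60,\"proxied\":false}"
```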

Re: ZFS

I do actually agree that ZFS is somewhat over-hyped (don’t tell Wendell) and there are many other storage solutions that would work, especially at this size. In a 3-node HA setup, Gluster/Ceph are the obvious choices that would obviate many of ZFS’s benefits.

However, I’ve already spent a lot of time learning in depth about all the quirks related to ZFS administration so that’s not a concern. I’d really prefer to stick to what I know so I don’t have to spend that time again learning something else.

Re: Software & Provisioning

I’m generally coming around to the idea that having a traditionally managed host system is okay. All the important services should be in the containers/VMs, which have good tooling for IaC-type stuff. You can look at the configuration for the containers and understand most of how the services are configured. If you’re doing disaster recovery or migration, you can “just” take the containers and run them somewhere else.

Re: Proxmox

I still need to do an in-depth evaluation of this; I couldn’t get it to run in Hyper-V, so I’m waiting for the hardware to show up. It’s definitely a strong contender. Having a GUI for managing simple monitoring, maintenance, and configuration tasks would be nice, especially for others at the company. Proxmox does seem to have one of the more feature-complete GUIs of the bunch.

I’m just a bit wary of these types of integrated tools for perhaps not 100% pragmatic reasons. I have a pretty good understanding of how the underlying system and services work, and there’s a lot of resources and documentation for managing Debian and StrongSwan and nginx and whatever. Then you layer on top some bespoke vendor middleware which is editing config files and running commands in ways which are never going to be as well documented as the underlying systems, and they add their own idiosyncrasies that I have to learn. On top of that there’s the GUI which is often slow, exposes few options, and is bad at reporting the status of those underlying commands especially if they fail.

Those impressions come from FreeNAS et al., so maybe Proxmox is just better; I’ll have to see. At least it is just Debian underneath, so it seems more open to understanding and modification. I’d still be a bit worried about making changes outside of what the middleware understands and breaking the GUI or breaking updates. I guess the answer is to just never do anything outside of what’s supported in the GUI (and hope you never run into a bug that requires deeper understanding of the system).

Re: Windows/Hyper-V

I did think about doing a minimalist Windows Server install just as a Hypervisor. It would certainly be much more secure, and it has options for a lot of the features I’m looking for. However nobody here is any good at Windows Server administration, and I feel like it would be a big process to learn how to do it properly. A minimal setup would be straightforward but then what’s the value-add over any other Hypervisor? If we start using it as an Active Directory domain controller or whatnot it has more value but also much more sysadmin overhead to do properly. Also the licensing makes my head hurt and if we’re going to pay $5000 for a license it better pull its weight. The Windows VM is a really small thing we can probably eliminate so I’m not concerned about running it in KVM.

Re: SmartOS

Yeah nobody mentioned this but I want to nerd out a bit. SmartOS is so cool. Minimalist system services running in an immutable container. High-performance native containers with a high level of isolation. Also supports virtual machines with the same tools. Good support for Docker, docker-compose, Terraform, Ansible, etc. to manage the entire system and the jobs it runs. Triton and Manta. Extensive monitoring tools. ZFS snapshots for system image, system config, VMs, containers, etc. automatically managed. Revert to a previous config in the boot environment if there’s a failure.

However it’s probably not a good choice due to: limited hardware support, poor documentation, small community, concerns about Joyent’s commitment to opensource support, and many bespoke tools which would have to be learned. Will probably still play around with it a little, just to see.

Conclusion

Definitely leaning towards Proxmox. Don’t do any funny business on the host and it should be reliable and foolproof enough.

Thanks again!

2 Likes

Your perception is correct. You can’t have reliable HA between multiple data centers unless you literally have a few fibers connecting them directly. This is why replication exists.

Replicating the databases with ZFS send/recv is highly not recommended. You can easily just dump and import automatically.
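
Something along these lines on a timer is plenty at that database size. The database, host, and path names are placeholders, and credentials should come from an option file rather than the command line:

```bash
# Nightly logical dump shipped straight to the passive host.
mysqldump --single-transaction --routines --triggers webapp_db \
  | gzip \
  | ssh backup-host "cat > /tank/dbdumps/webapp_db-$(date +%F).sql.gz"

# Restoring on the other side is just:
#   gunzip -c webapp_db-YYYY-MM-DD.sql.gz | mysql webapp_db
```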

True, only because you don’t seem to have a lot of services running. Also, 99.99% is basically almost 100% uptime. Everyone takes that 0.01% margin, which works out to under an hour of downtime a year, in case something goes horribly wrong, but it’s rare for that to happen. Not sure where I read it, but most companies only see an average of about 4h of downtime total per year (accumulated from a few minutes each week or so) and very few have more than a day of downtime.

1 Like

You seem to have a solid understanding and a relatively solid plan. You have my thumbs up overall, but just remember: if you want something the next guy can come in and take over relatively easily, script, automate, and document. Also, keeping the host as simple as possible means you can recover better after a full-blown disaster. You do have a few natural and environmental risks that you need to keep in mind.

I recommend working with your CIO/CISO to come up with a risk assessment and a Business Continuity Plan, to see what risks the company is willing to take and what they are willing to spend, and to help you come up with a final solution. Sometimes going with an option that has paid support is better insurance than saving money up front just because it’s free.

2 Likes