Unless it's absolutely necessary to keep them physical, I'd start virtualizing all the bare-metal. There can be good reasons for physical servers (like large databases), but if there's none, start moving everything into VMs, one by one. You can use Clonezilla, ReaR (Relax-and-Recover), or just dd into gzip over ssh. If there are a lot of machines, you might want to look into automating the imaging/backup process a little.
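A minimal sketch of the dd route, assuming you can ssh into some box with enough storage (the device, host and paths here are placeholders):

```
# Stream a compressed image of the old box's disk to another machine over ssh.
# /dev/sda, backupuser, imagehost and the target path are all made up - adjust to your setup.
dd if=/dev/sda bs=4M conv=sync,noerror status=progress \
  | gzip -c \
  | ssh backupuser@imagehost 'cat > /srv/images/oldbox-sda.img.gz'

# Restoring into a VM's disk is just the reverse:
# ssh backupuser@imagehost 'cat /srv/images/oldbox-sda.img.gz' | gunzip -c | dd of=/dev/vda bs=4M
```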
Following that, I'd start taking snapshots of everything in its "working" state (if you have databases, make sure they're shut down while the snapshot is taken, otherwise you'll end up with an inconsistent DB state and basically be unable to start the thing afterwards).
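For example, on a KVM/libvirt host it could look something like this (the VM name, DB service and host are assumptions, adapt to your hypervisor and whatever actually runs there):

```
# Stop the DB first so the snapshot is consistent, snapshot the VM, start the DB again.
ssh dbhost 'systemctl stop postgresql'      # or mariadb/mysql, whatever is running
virsh snapshot-create-as dbhost-vm pre-upgrade "known-good state before patching"
ssh dbhost 'systemctl start postgresql'
```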
For any bare-metal that's still around, boot into a live environment, mount the disks somewhere and do a normal offline copy of the filesystem. That way, if an upgrade goes wrong, you can just wipe the filesystem, put the old data back in place and reconfigure the bootloader.
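Rough sketch from a live ISO (device names and destination are placeholders):

```
# Offline copy of the root filesystem to another machine while nothing is running on it.
mount /dev/sda2 /mnt                                         # root FS of the old install
rsync -aAXH /mnt/ backupuser@imagehost:/srv/fscopies/oldbox/
umount /mnt
```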
When all that is done, start patching or upgrading. Depending on what the business was running, you might hit massive issues when you upgrade and have to roll back to the snapshot or copy the backed-up data back.
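With the libvirt example from above, the rollback is just reverting to the pre-upgrade snapshot (again, names are hypothetical):

```
# Throw away the broken upgrade and go back to the known-good snapshot.
virsh snapshot-revert dbhost-vm pre-upgrade
virsh start dbhost-vm     # only needed if the snapshot was taken while the VM was shut off
```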
IDK what the backup policy was for the env you inherited, but most places I've seen do have some kind of backups, usually at the application level (backing up the binaries and the data) rather than at the OS level, meaning that if you upgraded and couldn't get the software working again, you basically had to do a fresh install of an older OS.
For automation, people swear by ansible, like ITT, but when you're under pressure to show quick results and don't have time to learn other tools, go with something like pssh or dsh. I've used pssh plenty of times to automate tasks across different OSes. Just make lists like "all-servers," "suse-servers," "centos-servers" and so on (all plain text files you can feed to pssh as the host list). The reason I like pssh is pscp, which lets you push a script to all of them and run it locally on each machine (instead of running one command per ssh connection). The output logs also stay on your machine, so you can see which machines had trouble running a command and fix them.
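Something like this, assuming host-list files with one user@host per line (the commands being pushed are just examples):

```
# Run one command on every SUSE box, with inline output:
pssh -h suse-servers -i 'zypper refresh && zypper -n patch'

# Push a local script everywhere and run it there, keeping per-host logs on your machine:
pscp -h all-servers ./cleanup.sh /tmp/cleanup.sh
pssh -h all-servers -t 0 -o /tmp/pssh-logs 'bash /tmp/cleanup.sh'

# Note: on some distros the binaries are named parallel-ssh / parallel-scp instead.
```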
After that, I'd play the longer game and migrate everything to a single stack for easier management. Start with whichever distro has the fewest servers: say you only have 5 Alpine servers, start with those and move them to something else. Then the 10 openSUSE ones, and so on.
K8s is cool and all, but the learning curve is steep and there's a lot to take into account: persistent volumes, which nodes you want the pods to run on (although that's mostly automated), namespaces and how things talk to each other across them, port-forwarding to reach the pods' services from outside (unless everything you have can sit behind something like Traefik or nginx), etc. I'd say keep all that in mind when deciding what to migrate the infrastructure to.
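For the port-forwarding bit, it's basically this (namespace and service names are made up):

```
# Temporarily expose a service running inside the cluster on your own machine:
kubectl port-forward -n myapp svc/myapp-web 8080:80
# then hit http://localhost:8080 locally
```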
That’s not the kind of mental illness on the forum that Ryan was thinking of, lmao!
That's the reason I haven't left mine yet. When everyone's firing left and right, it's a tough decision to quit. I'd like to focus on some personal matters, but I think I'll just take time off for that.
Back to server management: I like the idea of "doing things right," but that usually takes a lot of time to implement, so I always take the fast-paced approach and automate everything with whatever tools are available, even if you're only going to use that tool once (from experience, the scripts, or parts of them, might come in handy later). I take a pragmatic approach of "do what you can now and take time to study how to implement things properly later." The best example would be trying to implement an enterprise-grade backup solution and then falling victim to ransomware because you didn't even have a simple file-level copy of your data on an NFS share or in an S3 bucket (see the sketch at the end of this post).
That doesn't mean you should keep relying on the tools and stacks that happen to work now; take the time to improve them once the situation is less critical.
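Even something as dumb as this is better than nothing until the real backup solution exists (paths and bucket name are placeholders):

```
# File-level copy of the important data to a dated directory on an NFS mount...
rsync -a /var/lib/important-app/ "/mnt/nfs-backup/important-app/$(date +%F)/"

# ...or to an S3 bucket you control (needs the AWS CLI configured):
aws s3 sync /var/lib/important-app/ "s3://my-backup-bucket/important-app/$(date +%F)/"
```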