Linux Admin best practices

Hello forum,

so I have been working as a sysadmin for 15 years, and right now my focus lies mostly in the Microsoft tech stack.
I just inherited a Linux environment of 104 servers (most of them in VMware, some bare metal, and a small KVM environment) which I am going to take care of going forward. I am a bit shocked, since the former admin did everything “by hand” with a more hobbyist approach, so it does not feel remotely like an enterprise environment, aside from an Oracle cluster which is completely managed by an external company.
So here I am asking for a bit of guidance, since I feel a bit lost. Right now I am busy doing an assessment of what I am working with. It is a wild mix of Ubuntu Server, CentOS, openSUSE, and Alpine. No containerisation is being used; for every application that needed a Linux-based OS, a server was just installed and NEVER maintained unless something broke.
So right now I am looking to define a strategy for consolidating this mess.
Are there any gold-standard tools for management/automation, and what are the pitfalls that come with them? Oh god, it feels like I am asking for free consulting, but I know there are a lot of knowledgeable heads on this forum who might be able to help me avoid some grave mistakes.
I am up for the challenge, since I have some creative freedom in unfucking and rebuilding this mess, but my experience comes mostly from what you pick up working in the industry and from my homelab, which is all Linux based.

Cheers

2 Likes

You’ve got your work cut out for you; I hope they’re paying you well.

You’ve got two paths; it depends on how much bandwidth you have for making improvements to the infrastructure vs. doing break-fix or new features.

First path: keep things working as is, make it pass a security audit.

Second path: Keep the systems patched, start migrating them to containers or a more modern setup.

Start with information collection. Figure out what’s running on each server and make a table of name/IP, admin username, distro, and features/role. While you’re SSHing into the servers, install a new SSH key. Make sure it’s the same one everywhere; this will make future steps much easier.

Given that it’s completely unmaintained, I’d start with Ansible. You can make an inventory of systems, organize them by distro, then start writing some roles that do basic stuff like patching, enforcing security policy, etc., based on distro. Start with just doing your patching. Also, check what version of the distro each one is running; it’s very possible some of these VMs are out of support. In that case, you’ll need to plan release upgrades. You should also move off of CentOS ASAP, since the whole CentOS Stream debacle. I think Rocky has an upgrade path, so you can just change the repos and upgrade to Rocky.
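To make that concrete, a bare-bones starting point could be something like the sketch below. Hostnames and group names are placeholders, it assumes your SSH key and sudo access are already in place, and the Alpine boxes will need Python installed before Ansible can manage them. The `pkg_mgr` fact lets one play cover the whole distro zoo:

```yaml
# inventory.yml - group hosts by distro so you can target them separately later
all:
  children:
    ubuntu:
      hosts:
        web01.example.local:
        app02.example.local:
    centos:
      hosts:
        db01.example.local:
    opensuse:
      hosts:
        mail01.example.local:
    alpine:
      hosts:
        proxy01.example.local:

# patch.yml - run with: ansible-playbook -i inventory.yml patch.yml
- hosts: all
  become: true
  tasks:
    - name: Patch apt-based systems (Ubuntu)
      ansible.builtin.apt:
        update_cache: true
        upgrade: dist
      when: ansible_facts['pkg_mgr'] == 'apt'

    - name: Patch dnf-based systems (CentOS 8+, Rocky)
      ansible.builtin.dnf:
        name: "*"
        state: latest
      when: ansible_facts['pkg_mgr'] == 'dnf'

    - name: Patch yum-based systems (CentOS 7)
      ansible.builtin.yum:
        name: "*"
        state: latest
      when: ansible_facts['pkg_mgr'] == 'yum'

    - name: Patch zypper-based systems (openSUSE)
      community.general.zypper:
        name: "*"
        state: latest
      when: ansible_facts['pkg_mgr'] == 'zypper'

    - name: Patch apk-based systems (Alpine)
      community.general.apk:
        update_cache: true
        upgrade: true
      when: ansible_facts['pkg_mgr'] == 'apk'
```

Once that works, the same structure grows naturally into roles for SSH hardening, NTP, monitoring agents, and so on.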

If you don’t want to deal with release upgrades or migrating, I’d recommend taking those environments over to containers ASAP.

I highly recommend Ansible. It’s an automation and config management framework. The major pitfall is that it’s got a bit of a learning curve, but once you figure it out, it’s super easy. It also has limitations once you get past ~200 hosts for a single execution, but I don’t think your environment is large enough to recommend a different framework.

I’m assuming you’re probably running Windows. First thing I’d do is install WSL and Windows Terminal, which is different from the cmd.exe console that comes with Windows. It’s much easier to manage Linux systems from a Linux environment, IMO.


P.S.

Ohh, I completely missed this. I’m sorry, I tailored this a bit more as if you’re completely new to Linux. :grimacing:

Generally speaking, the tools you use in your homelab are going to be pretty much the same as what you run in the company datacenter. That’s the beauty of Linux.

7 Likes

Good post, and thanks for taking the time to reply.
Yeah, I already wrote a script that tries all the known “general” credentials that I have found to create a user, copy an SSH key which I created, and make it a sudoer without a password. Maybe not best practice for now, but I need to start somewhere, right?

So Ansible seems like a good place to start for me, and I am installing it right now. Is it worth paying for the Red Hat version, or will the community release serve me alright? From what I understand, the main selling points are versioning and support.
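Related: once I am a bit more comfortable with it, I want to redo that credential/SSH-key bootstrap from my hacky script as a proper play. From a first read of the module docs it would look roughly like this (untested sketch on my side; the user name and key file are just placeholders, and `authorized_key` needs the `ansible.posix` collection):

```yaml
- hosts: all
  become: true
  tasks:
    - name: Create a dedicated automation user
      ansible.builtin.user:
        name: svc_ansible          # placeholder name
        shell: /bin/bash
        state: present

    - name: Install my management SSH public key for that user
      ansible.posix.authorized_key:
        user: svc_ansible
        key: "{{ lookup('file', 'files/mgmt_key.pub') }}"   # placeholder key file
        state: present

    - name: Allow passwordless sudo (validated before it is written)
      ansible.builtin.copy:
        dest: /etc/sudoers.d/svc_ansible
        content: "svc_ansible ALL=(ALL) NOPASSWD: ALL\n"
        mode: "0440"
        validate: /usr/sbin/visudo -cf %s
```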

On the container platform: is it really hard getting into Kubernetes? For my homelab I was just using docker compose after some issues with Portainer, and Kubernetes seems like quite the learning curve, so I have avoided it thus far. I mean, in the future they will most likely expect me to get going with it anyway… might as well start now.

And they are paying me quite well, but I have considered leaving since the MS cloud stack is in high demand. Still, I am intrigued by being given the chance to set this up from the ground up, since it might prove quite interesting and I will pick up some new skills that might be valuable later on.

3 Likes

The Red Hat version provides automation tooling and makes sure things stay in compliance with the expected state. The community version is just fine. I’ve used it at every company I’ve worked for since 2016, and it’s never been the paid version, haha.

Support is nice, but between ChatGPT, GitHub, and this forum (paging Mr. @oO.o) you’ll do just fine. And don’t feel bad about asking for help; we are pathological about solving problems.

Kubernetes is a big learning curve, but once you’ve got it down, it’s much easier. When it comes to managing a bare-metal install (as in you’re not paying AWS or Azure for managed autoscaling and whatnot), I would recommend k3s. It’s a Rancher product and is solid. I use it for my homelab and have actually been working on a zero-to-k3s-expert guide; it’s not done yet, but I’ll try to post it here when it’s ready.

Also, if you want to use docker in your company, you’ll need to pay licensing to stay kosher. Kubernetes moved to containerd for that reason.

Stick around and get the experience. Cloud demand isn’t going away any time soon :wink: plus, the economy isn’t too hot right now. Better to stay at a stable place.

7 Likes

Good, I am pathological about asking questions :crazy_face:.

Noted, looking into k3s and containerd (don’t care too much about the budget of this rotten place, but still…). Since our place is boomer central, I am wondering about the challenges I will face when starting with containers with regard to networking. I came across https://metallb.universe.tf/ and it looks pretty neat. Any thoughts on that?

True, and most likely my course of action, but at least here in central/northern Europe the market is pretty good right now. I get really good offers all the time, and the only reason I am not taking them is that I do not want to move, plus some neurosis makes it hard for me to adapt to new people.

I’ll get started on doing the groundwork next week.

2 Likes

:saluting_face:

In practice, the only reason to pay for Ansible is Ansible Tower, afaik, but that is way down the line if you are just learning, and you can always use AWX, which is Ansible Tower’s upstream FOSS version.

Ansible has many quirks but I have written a lot of it so if you have any questions @akarin, feel free to give me a ping.

One hint to start out: if you have any dev experience whatsoever, then in Ansible, facts are variables and variables are functions. That alone should save you some grief…
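A contrived example of what I mean, with made-up names: the var is re-templated every time it is referenced (like calling a function), while the fact is evaluated once and then just holds a value:

```yaml
- hosts: localhost
  gather_facts: false
  vars:
    app_version: "1.2.3"
    # Not a fixed string: re-evaluated on every use, so it always
    # reflects the current value of app_version.
    archive_name: "app-{{ app_version }}.tar.gz"
  tasks:
    - name: Freeze the current value into a fact
      ansible.builtin.set_fact:
        frozen_archive_name: "{{ archive_name }}"

    - name: Both print the same thing here, but only the fact stays static from now on
      ansible.builtin.debug:
        msg: "var: {{ archive_name }} / fact: {{ frozen_archive_name }}"
```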

7 Likes

Oh dang, that’s news to me, awesome!

MetalLB is great for an on-prem system. I use it, and it’s great. I’ve got a reference architecture for deploying a k3s cluster with Ansible on Ubuntu that I can share with you; just need to polish it up a bit, since it’s hacky and kinda tuned for my homelab.
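Until that’s polished, the core of a MetalLB layer-2 setup is really just two small resources, assuming a recent MetalLB with the CRD-based config (0.13+). The pool name and address range below are placeholders for whatever your network team hands you:

```yaml
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: lab-pool                      # placeholder name
  namespace: metallb-system
spec:
  addresses:
    - 192.168.50.100-192.168.50.120   # placeholder range on the node VLAN
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: lab-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - lab-pool
```

Any Service of type LoadBalancer then gets an address out of that pool automatically.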

When you get closer to being ready to actually play around with it, definitely let me know and ask questions; there are a few silly things that come by default with k3s that I find really helpful to disable (like Traefik, which is really annoying on Kubernetes).
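For reference, those disables can live in the k3s config file instead of install-time flags; something like this, where servicelb only needs to go if MetalLB (or similar) takes over LoadBalancer duty:

```yaml
# /etc/rancher/k3s/config.yaml - read by k3s at startup
disable:
  - traefik     # ship your own ingress controller instead
  - servicelb   # only disable this if MetalLB handles LoadBalancer services
```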

6 Likes

I am playing around with it at home and so far it is a pretty nice solution compared to what I have done in the past with macvlans.
So now I have my address pools, and assignment of said addresses is working fine in K8s. In a perfect world I would just use DNS names for the users and devs who want to connect to applications. My network guys are a bit strange, though, and want me to be able to assign specific IP addresses (not just ranges… so static IPs, as if it were a dedicated server) to specific applications (like in the old days)…

Other than that, I am now thinking about the deployment of my “starter cluster”, let’s call it that. We have quite the VMware environment, so I am thinking about deploying the k8s nodes there, BUT just looking at the documentation it seems like I am about to open a whole other can of worms. Or am I mistaken?

1 Like

At a minimum, adhere to ITIL practices; this will help you implement technical controls that meet your operational baseline.
Identify operational metrics such as RTO, RPO, or other SLAs, which are applicable regardless of the Linux operating system or underlying infrastructure, i.e. whether you are running bare-metal, virtualised, or containerised workloads.
Select the processes, tools, and technologies that help you run a reliable and robust operating environment for the services that you need to deliver.


1 Like

You probably want to look into Salt or Ansible for automation.

You want to assess what you’re working with before making any decisions. Depending on the applications, some might demand specific distros and versions of software (libraries and dependencies).

Depending on what applications you are running, you might actually want to consider FreeBSD and run your own package server (which is very easy to set up with Poudriere) to easily and rather effortlessly keep as many boxes as possible in sync, using the official ports tree as a base. OS updates/patches are also very reliable, so that might be one less thing to worry about.

2 Likes

Awesome, I’m glad to hear it!

So my recommendation is this: all applications should go into the cluster via a single nginx ingress controller, operating on a single IP. If that’s not possible, you can set up services to have a load balancer IP and assign a specific IP from your LB pool to a specific service, but I would push back hard on doing it that way. Explain to them how things work in Kubernetes and hopefully they’ll understand.
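If you do end up having to humor them, the fallback looks roughly like this; on a recent MetalLB the annotation is the supported way to pin an address, and the name and IP below are made up (the IP has to sit inside your address pool):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: legacy-app                                        # placeholder application
  annotations:
    metallb.universe.tf/loadBalancerIPs: 192.168.50.105   # pinned address from the pool
spec:
  type: LoadBalancer
  selector:
    app: legacy-app
  ports:
    - name: https
      port: 443
      targetPort: 8443
```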

Are they trying to use IP based access control or something? I’d also recommend building some sort of SSO tooling with SAML where it’s supported. That can tie into AD very well.

2 Likes

Already set up Ansible and I am playing around with it; I will take a look at Salt as well.

I think it is more about having to open up connections on the firewall for so many ports towards the ingress IP of the K8s cluster. But as you already pointed out, I think they are just not aware of how things are done in K8s, so I might show them what I have built in the lab so far.

I am a bit sick, so let’s see how far I’ll get next week. It will take a lot of time to do the initial inventory and figure out how many applications would be worth moving to containers, and after that I will decide if it makes sense to set up on-prem K8s.

Unless it is absolutely necessary, I’d start virtualizing all the bare metal. There can be good reasons for physical servers (like large databases), but if there are none, get going and slap everything into VMs, one by one. You can use Clonezilla, ReaR (Relax-and-Recover), or just dd into gzip over ssh. If there are a lot of machines, you might want to look into automating the process a little, together with the backups.

Following that, I’d start taking snapshots of everything in its “working” state (if you have databases, make sure they are shut down when the snapshot is being taken, otherwise you’d get an inconsistent DB state and basically be unable to start the thing).

For bare metal, where still present, boot into a live environment, mount the disks, and do a normal offline copy of the filesystem to somewhere else - that way, if an upgrade goes wrong, you can just delete what is on the disk, put the old data back, and reconfigure the bootloader.

When all that is done, start patching or upgrading. Depending on what the business was running, you might have massive issues when you upgrade and have to roll back to the snapshot or copy the backed-up data back.

IDK what the backup policy was for the env you inherited, but most places I’ve seen have some kind of backups implemented, usually at the application level (backing up the binaries and the data), not at the OS level, meaning that if you upgraded and couldn’t make the software work, you basically had to reinstall an older OS from scratch.

For automation, people swear by Ansible, like in this thread, but when you’re under pressure to show quick results and don’t have time to learn other tools, go with something like pssh or dsh. I’ve used pssh plenty of times to automate tasks across different OSes. Just make lists like “all-servers”, “suse-servers”, “centos-servers” and so on (all plain text files, which you can use as the host-list input for pssh). The reason I like pssh is pscp, where you can just send scripts to all of the machines and run them locally (instead of running one command per SSH connection). And the logs of the output stay on your machine, so you can see which machines had trouble running some commands and correct them.

Following that, I’d play the longer game and migrate everything to a single stack for easier management. Start with the distros that have the fewest servers. Say you only have 5 Alpine servers: start with those and move them to something else. Then 10 openSUSE, go with those, and so on.

K8s is cool and all, but the learning curve is steep and there’s a lot to take into account (persistent volumes, which nodes you want the pods to run on - although that’s mostly automated - namespaces, cross-pod communication, port-forwarding to reach the pods’ programs from outside, unless everything you have can be fronted by something like Traefik or nginx, etc.). I’d say keep it in mind when deciding what to migrate the infrastructure to.

That’s not the kind of mental illness on the forum that Ryan was thinking of, lmao!

That’s the reason I haven’t left mine yet. When everyone’s firing left and right, it’s a tough decision to quit. I’d like to focus on some personal matters, but I think I’ll just take time off for that.

Back to server management: I like the idea of “doing things right,” but that usually needs a lot of time to implement, so I always take the fast-paced approach and automate everything with the tools available, even if you’re going to use a given tool just once (from experience, the scripts, or parts of them, might come in handy later on). I take the pragmatic approach of “do what you can now and take time to study how to implement things properly later.” The best example would be trying to implement an enterprise-grade backup solution, but then falling victim to ransomware because in the meantime you didn’t even have a file-level copy of your data on an NFS share or S3 bucket.

That doesn’t mean you should keep relying on the tools and stacks that are working now; you should actually take the time to improve them when the situation is less critical.

4 Likes

It is not so bad; everything that can be virtualized already is. The bare-metal stuff exists either because of licensing or performance constraints.

Everything in VMware (so 90%) is backed up with Rubrik through VMware snapshots, so that’s good in case I break stuff while migrating/patching.

Well, for now management can fuck themselves if they try to pressure me into showing fast results; good luck finding somebody else (they had been looking externally for almost a year, lmao). I want to do it right from the get-go, so it will be very time-consuming to get going, but it will pay off in the long run.
As is, things are running “just fine”, so there is little pressure. Shit has not been patched in ages, so 12 more months do not make the biggest difference aside from the critical stuff.

You guys are most likely all US-based, right?

Most of us on the forum, yes, but we have some European and Asian folks (and some Aussies, they actually exist) around. I have lived in both Europe and the US.

1 Like

Remember that if you use an ingress controller, any HTTP-based services will only need 80/443 opened on a single LB IP.
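Per application that then boils down to an Ingress along these lines, all of them resolving to the one controller address (hostname and service name are placeholders, assuming the nginx ingress controller):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app1
spec:
  ingressClassName: nginx            # assuming the nginx ingress controller
  rules:
    - host: app1.apps.example.local  # placeholder hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app1           # ClusterIP service in front of the pods
                port:
                  number: 80
```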

2 Likes

Hello everyone, I am super excited to join this esteemed forum. I am a passionate tech lover with over 8 years of writing experience in the field, and I also have a keen interest in IoT. I look forward to sharing and learning from fellow members about RAID controllers, LTO tape, Linux, etc. I hope we learn and grow together!

3 Likes

Merry christmas and hello again,

so I finally gained access to most systems and started to patch some of the ones that needed it, but I still have not had much time to look at the container environment or Ansible.
So last week I started to mess around with Podman (since, guess what, Docker costs money and money is tight…) and Traefik.

For testing, I set up one Ubuntu server with Podman and created some basic Traefik and webserver containers. There is a DNS entry, *.apps.shittycompany.local, with a wildcard cert, and at first I pointed it at the server, but in Traefik, for the love of god, I could not make SSL work for whatever reason. All the tutorials use a much more complicated approach (I just want everything to use the wildcard cert…), so I decided: why not use our NetScaler (ADC) to handle SSL, since I have at least some knowledge there. So what I did was create a virtual server that accepts SSL, point my DNS entry towards its IP, and put the Podman server behind it as a service using HTTP. Clients now go over the NetScaler on port 443, and the NetScaler goes over port 80 into Traefik. I think this will be the way forward, so we still have some security and do not need to expose the server itself to the clients, only the virtual server on the NetScaler. Or would you suggest figuring out SSL on Traefik?
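For completeness, the Traefik approach I would try next (if terminating TLS there turns out to be the better option) is a default certificate via the file provider. From what I can tell from the docs it would look roughly like this; untested on my side, and the paths are placeholders:

```yaml
# traefik.yml (static config): one TLS entrypoint plus the file provider
entryPoints:
  websecure:
    address: ":443"
providers:
  file:
    filename: /etc/traefik/dynamic.yml

# /etc/traefik/dynamic.yml (dynamic config): serve the wildcard cert by default
# (individual routers still need TLS enabled, e.g. the container label
#  traefik.http.routers.myapp.tls=true)
tls:
  stores:
    default:
      defaultCertificate:
        certFile: /certs/wildcard-apps.crt   # placeholder path to the wildcard cert
        keyFile: /certs/wildcard-apps.key    # placeholder path to the key
```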
So far I am just playing with one host and trying to learn as I go, but after my analysis I think a setup with 3 servers should be sufficient for all our current needs; even though we have around 100 Linux servers in VMware, CPU-wise they are doing almost nothing all day (another reason to switch to containers, maybe). Now when I think about k3s and how to make this environment usable for multiple users, I feel a bit lost :smiley:

And that concludes my update…
I would love to play around more with this, but somehow I got involved in a Siemens Teamcenter upgrade and I am losing my mind…

Only Docker Desktop costs money (for businesses). Using the Docker daemon does not cost money (personal or business).

If you install Rancher Desktop, you get a pretty GUI akin to Docker Desktop and do not have to spend money.

Oh god, I should have checked that myself and not relied on the info from those compliance people. I feel stupid now.