Over the past few years I have built a self-hosted setup consisting of an Unraid server, a few Pis, a smattering of IoT stuff, a few Linux and Windows machines, and a small Ubiquiti network. I ignore it most of the time. My Unraid server sends me notifications for itself, but everything else… out of sight, out of mind.
I’ve never really had any issues, but I know I’m living too dangerously. I’d like to get in front of this and start learning to properly keep tabs on how all my stuff is running. Can anyone point me in the right direction? I have looked into syslog servers a bit, mostly because that was the only specific term I knew to punch into Google. I got the feeling that they are but one tool in a broader toolbox, and I’d like an overview of the toolbox.
The answer, sadly, is that you’ll need to host more stuff, and frequently there are pyramids involved.
At the top of the needs pyramid you have yourself, of course.
Next layer down, you have notification methods, e.g. email, text message, phone call, and a bunch of other noise-making, notification-popping phone apps.
Next layer down, you have a thing that continually evaluates rulesets over your monitoring metrics to determine when stuff is out of whack, e.g. Prometheus.
Next layer down, there’s a thing that lets you do the same kind of querying ad hoc, e.g. Grafana.
Next layer down, you have your monitoring storage: things like syslog servers, InfluxDB, TSDBs, and ELK stacks.
Next layer down, you have your agents: the sources of events and metrics.
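To make the layers a bit more concrete, here is a toy sketch of what the "rule evaluation" layer does with the latest values it finds in storage. This is plain Python for illustration, not the Prometheus API; the `Rule` class, the metric names, and the thresholds are all made up.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    """A hypothetical alert rule: a name, a metric to watch, and a 'bad' test."""
    name: str
    metric: str
    is_bad: Callable[[float], bool]


def evaluate(rules: List[Rule], latest: Dict[str, float]) -> List[str]:
    """Return the names of the rules that fire against the latest metric values.

    A metric that is missing entirely also fires its rule, on the assumption
    that a silent agent is itself worth knowing about.
    """
    firing = []
    for rule in rules:
        value = latest.get(rule.metric)
        if value is None or rule.is_bad(value):
            firing.append(rule.name)
    return firing


# Example rules; the metric names and thresholds are assumptions.
RULES = [
    Rule("disk_almost_full", "disk_used_pct", lambda v: v > 90),
    Rule("pi_running_hot", "cpu_temp_c", lambda v: v > 80),
]
```

Whatever `evaluate` returns is what gets handed up to the notification layer. Real tools add persistence, silencing, and deduplication on top, which is where a lot of the complexity comes from.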
There’s a gazillion companies out there, each with their own “best practices”, all competing for mindshare more than for actual market. None of them are good, free, and simple.
One key insight I believe you should keep in mind is that “monitoring” does not make your stuff more reliable, and it does not make your life simpler. It increases the complexity of your stack and the amount of stuff you have to know about and maintain (i.e. you’re hosting more stuff and doing more with phone apps and things).
It’s only worth the time and effort if the alternative of not doing it is even more expensive. To be happy with whatever you deploy, you’ll need to keep it relatively “lightweight” and “simple”.
For example: let’s say you have Home Assistant running, managing your smattering of IoT stuff. It has an option to log stuff to InfluxDB, and you can set up Grafana to alert you via email when stuff is broken. Good stuff.
How do you know alerts are working?
Typically, you’d have it send you an email a day; if you don’t get it, you know stuff is broken. This requires vigilance.
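If you’d rather not rely on noticing a missing email yourself, the same idea can be flipped into a small staleness check that something else runs. A minimal sketch, assuming you record somewhere when the last daily heartbeat arrived; the function name and the 26-hour grace period are my own choices, not from any particular tool.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def heartbeat_is_stale(
    last_seen: datetime,
    now: Optional[datetime] = None,
    max_age: timedelta = timedelta(hours=26),
) -> bool:
    """Return True if the last heartbeat is older than max_age.

    The default of 26 hours (an assumption) gives a daily heartbeat a
    couple of hours of slack before it counts as missed.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_seen > max_age
```

The catch is that whatever runs this check is now one more thing that can break silently, so it is best kept on a different box than the one sending the heartbeat.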
But once you have bootstrapped your “monitoring” with an email a day, you can do so much more: a bunch more servers and infrastructure that can deploy and heal itself on Kubernetes, requiring your attention only for truly bad breakages.
Another good way to bootstrap monitoring: have a shell script run from cron each day that SSHes into your hosts and emails you their uptime.
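Here is a rough sketch of that daily uptime mailer. I’ve written it in Python rather than shell, which is a swap on my part. It assumes key-based ssh already works for every host and that there is an SMTP relay listening on localhost; the host names and email addresses are placeholders.

```python
import smtplib
import subprocess
from email.message import EmailMessage

# Placeholder host names; replace with your own machines.
HOSTS = ["unraid.lan", "pi1.lan", "pi2.lan"]


def ssh_uptime(host: str, timeout: int = 15) -> str:
    """Run `uptime` on a host over ssh and return its output or an error note."""
    try:
        result = subprocess.run(
            # BatchMode makes ssh fail instead of prompting for a password.
            ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=10", host, "uptime"],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError) as exc:
        return f"UNREACHABLE ({type(exc).__name__})"
    if result.returncode != 0:
        return f"FAILED (exit {result.returncode}): {result.stderr.strip()}"
    return result.stdout.strip()


def build_report(hosts, probe=ssh_uptime) -> EmailMessage:
    """Collect one line per host and wrap the result in an email message."""
    lines = [f"{host}: {probe(host)}" for host in hosts]
    msg = EmailMessage()
    msg["Subject"] = "Daily uptime report"
    msg["From"] = "monitor@example.com"  # placeholder address
    msg["To"] = "me@example.com"         # placeholder address
    msg.set_content("\n".join(lines))
    return msg


def main() -> None:
    """Build the report and hand it to the local SMTP relay (an assumption)."""
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(build_report(HOSTS))
```

Add a call to `main()` at the bottom of the file and run it from a daily cron entry. A host that is down shows up as UNREACHABLE in the email instead of killing the whole report, and if the email doesn’t show up at all, that tells you the monitoring box itself is broken.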