So as you’re all no doubt aware, SolarWinds have had a number of issues.
As a customer, I’m looking at alternatives, because the level of stupid currently being exposed is, to be blunt, horrifying. I feel they had a leading-edge product that has stagnated, the brains moved on/left, and I simply don’t trust the company any more.
I’m looking for something capable of monitoring up to a few thousand nodes (switches/routers/hosts) for up/down/performance monitoring.
I ran Zabbix a long time ago. It’s on my short-list, but I’ve never dealt with their support, and I suspect the product has changed a lot since 2012. I’m currently running Orion NPM and have been for a decade or so.
What are others running? What do you like about the solution? Is commercial support available? Ideally, I’d like to stick it in Azure so that it keeps working even if my HQ site goes down (we’ve got around 30-40 small sites).
Grafana as the dashboard frontend.
Prometheus as the query engine.
Node exporter for default host related metrics.
SNMP exporter for stuff that can’t run the node exporter.
Alertmanager for Prometheus to send email / IRC messages / other webhooks.
Promscale/TimescaleDB/Jupyter/pandas and sns for multi-year storage and data science / offline-ish analysis stuff.
You’d start with a node exporter on one host and a single central instance of Prometheus set up to scrape it.
Next up, grafana.
Next up, a second prometheus.
And so on…
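Once that first Prometheus is scraping something, the up/down part of the original question is mostly a query against the built-in "up" metric. Here is a rough Python sketch of that check using the Prometheus HTTP API instant-query endpoint (/api/v1/query). The host names, the SAMPLE payload and the helper names are made up for illustration; only the endpoint path and the response layout come from the Prometheus API.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url, promql):
    """Build the instant-query URL for a Prometheus server."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def down_targets(response):
    """Return the 'instance' label of every series whose 'up' value is 0."""
    if response.get("status") != "success":
        raise ValueError("Prometheus query failed")
    down = []
    for series in response["data"]["result"]:
        _timestamp, value = series["value"]  # the sample value arrives as a string
        if float(value) == 0:
            down.append(series["metric"].get("instance", "<unknown>"))
    return sorted(down)


def fetch_down_targets(base_url):
    """Query a live server; base_url is whatever your Prometheus listens on."""
    with urlopen(build_query_url(base_url, "up"), timeout=10) as resp:
        return down_targets(json.load(resp))


# Hand-written sample in the shape the API returns (hypothetical hosts).
SAMPLE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "instance": "web-1:9100", "job": "node"},
             "value": [1700000000.0, "1"]},
            {"metric": {"__name__": "up", "instance": "sw-core-1:9116", "job": "snmp"},
             "value": [1700000000.0, "0"]},
        ],
    },
}

print(down_targets(SAMPLE))
```

In practice you’d let an alerting rule plus Alertmanager do this rather than polling from a script, but it’s handy for poking at a fresh install.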
Depending on how many timeseries you have, how dense they are, and what kind of rulesets/synthetic timeseries you’re computing on the fly … a single Prometheus can probably handle 1k nodes with about 10k metrics each very easily. You can always run more of them and split the instances, e.g. network stuff vs host stuff. And there are people out there who run them federated, so you can cross-query multiple Prometheus instances with arbitrary expressions and e.g. form cross-organizational alerting rules. It’s unlikely you’ll need it.
There’s a bunch of independent shops and contractors offering meta-monitoring support and/or observability services for various systems, Prometheus included. Sometimes you can contract them for sysadmin work too.
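To put rough numbers on that sizing, here is a back-of-the-envelope sketch in Python. It uses the disk-space rule of thumb from the Prometheus storage docs (retention time × ingested samples per second × bytes per sample). The 60 s scrape interval and 15 day retention are the Prometheus defaults, and 2 bytes per sample is the upper end of the 1-2 bytes the docs quote; all three are assumptions about your setup, and the function names are just for illustration.

```python
def active_series(nodes, metrics_per_node):
    """Total number of timeseries being scraped."""
    return nodes * metrics_per_node


def samples_per_second(series, scrape_interval_s):
    """Ingest rate if every series is scraped once per interval."""
    return series / scrape_interval_s


def disk_bytes(retention_days, samples_per_s, bytes_per_sample=2.0):
    """Rule of thumb: retention seconds * samples per second * bytes per sample."""
    return retention_days * 86_400 * samples_per_s * bytes_per_sample


series = active_series(1_000, 10_000)   # 1k nodes, ~10k metrics each
rate = samples_per_second(series, 60)   # assumed 60 s scrape interval
size = disk_bytes(15, rate)             # assumed 15 days of retention

print(f"{series:,} series, {rate:,.0f} samples/s, {size / 1e9:,.0f} GB on disk")
```

Under those assumptions that works out to 10 million series and a bit over 400 GB of local storage. Scrape every 15 s instead and both the ingest rate and the disk figure go up fourfold, which is usually the point where splitting instances starts to look attractive.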