Zabbix Alertmanager

Zabbix

Over the years, Zabbix has become the standard monitoring system for hosts and virtual machines. You can run the Zabbix agent to collect network, CPU, memory, and disk usage metrics, start monitoring, and set up simple alerting for your hosts. Some applications even support exporting data to Zabbix. For example, you can integrate JVM monitoring data by running the Zabbix Java gateway or a custom JVM agent, such as Zorka.

Zabbix has a hierarchical model. You begin by creating a host group, then you create your physical or virtual hosts and add them to that group. Each host can contain many items, which store metric data; a typical item might hold the currently available disk space. To get alerts about low disk space, you simply add triggers to your host. Triggers are configured via simple expressions, such as "if available disk space is less than 5%, then notify". Zabbix has a basic expression language with functions like avg (which computes an average over time) and nodata (which fires if no item data was received). The full list of available trigger functions can be found in the Zabbix documentation.
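As a rough illustration, the disk-space rule above might look like this as a trigger expression in the classic Zabbix syntax (the host name and item key are illustrative, and newer Zabbix versions use a revised syntax):

{web-server-01:vfs.fs.size[/,pfree].last()}<5

This fires when the last reported free-space percentage for the root filesystem of web-server-01 drops below 5%.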

Prometheus

Although Zabbix works well for monitoring hosts and the applications running on them, it struggles with cloud environments like Kubernetes. In Kubernetes, applications move between hosts quickly, and resource restrictions are configured at the application level, not the host level. This means host checks can look healthy while applications are constantly crashing due to resource restrictions. By design, Kubernetes applications change their identity on every deployment, which makes them next to impossible to organize into a hierarchical model.

This is why SoundCloud originally built Prometheus, an open-source systems monitoring and alerting toolkit. Since its creation, many companies and organizations have adopted Prometheus, and the project has a very active developer and user community. It is now a standalone open-source project, maintained independently of any company. To emphasize this, and to clarify the project's governance structure, Prometheus joined the Cloud Native Computing Foundation as its second hosted project, after Kubernetes. Active members of the Prometheus community are also working on OpenMetrics, which evolves the Prometheus metrics exposition format into a standard. All of this makes Prometheus a safe choice for any company.

Combining Zabbix's strengths with Prometheus

So, why did we decide to marry those two systems together?

One of our clients is a leading telecommunications provider with an NOC/monitoring team that responds directly to alerts from Zabbix. The company runs a mix of virtual machines and containers on Kubernetes. Moving completely from one system to another is simply not an option, so we decided that integrating the two systems was the best solution: it gives the teams the flexibility to try Prometheus without breaking the current workflow.

During the project, we tried to automate as much as possible.

Previously, configuring proper alerts involved an organizational process like the following:

  • The developer team would expose data via custom integrations like Zorka or custom-made bash scripts, then ask the NOC team to configure alerts.
  • The NOC team would ask the developer team whom to call and what severity the application has, and would then configure the Zabbix trigger.

If developers want to change an alerting threshold or severity, or remove an alert entirely, they have to contact the NOC team. This creates a constant struggle between developer and NOC teams to keep alerts configured correctly. Sometimes the NOC team doesn't alert the development team because of miscommunication when the alerts were configured, and development teams typically forget to inform the monitoring team when decommissioning systems, which leads to inconsistent alerting configuration.

So, one of our project goals was to tackle this organizational challenge with a DevOps approach.

DevOps approach

With our new Zabbix Alertmanager integration, we gave more power to automation. We created a new Git repository that hosts the Prometheus alerts. Each development team gets its own directory with its alerting configuration. The alerts are described in the standard Prometheus alerting rule YAML format. Here is an example of an alert:

# Fires when any Kubernetes Job in the "web" namespace has failed,
# and the condition has held for at least one hour.
alert: JobFailed
expr: kube_job_status_failed{namespace=~"web"} > 0
for: 1h
labels:
  severity: warning
  team: web
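
For context, in a standard Prometheus rule file such an alert sits inside a rule group; a minimal sketch (the group name is illustrative):

groups:
  - name: web-team
    rules:
      - alert: JobFailed
        expr: kube_job_status_failed{namespace=~"web"} > 0
        for: 1h
        labels:
          severity: warning
          team: web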

The proposed new process looks like this:

  • Developer teams configure their own alerts based on Prometheus metrics. They can set severities and change thresholds directly in Git.
  • Automation makes the changes in Zabbix, automatically configuring hosts, items, and triggers (see the configuration sketch after this list).
  • The NOC team monitors Zabbix for alerts and notifies the development teams.
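
Part of that wiring is routing firing alerts from Alertmanager to the integration. A minimal sketch of an Alertmanager configuration that does this, assuming the integration exposes a standard webhook endpoint (the receiver name, hostname, port, and path are illustrative):

route:
  receiver: zabbix
receivers:
  - name: zabbix
    webhook_configs:
      # Illustrative endpoint; point this at wherever the
      # Zabbix Alertmanager integration is listening.
      - url: http://zabbix-alertmanager:9095/alerts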

Additionally, developers gain new functionality:

Developer teams can put a system into maintenance by silencing its alerts, which stops them from triggering in Zabbix.

There is no need to coordinate by email.

Alertmanager provides a nice UI for doing just that.
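
If a team prefers the command line, silences can also be created with Alertmanager's amtool; a rough sketch (the matcher, comment, and Alertmanager URL are illustrative):

amtool silence add alertname="JobFailed" \
  --alertmanager.url=http://alertmanager:9093 \
  --comment="web team maintenance window" \
  --duration="2h"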

Developers can configure alerting time windows directly in the alerting configuration.

For example, let's configure the alert to fire only during working hours (note that PromQL's hour() function returns the hour in UTC):

alert: JobFailed
expr: kube_job_status_failed{namespace=~"web"}  > 0
      and ON() hour() > 9 < 17
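
Following the same pattern, the alert could additionally be restricted to weekdays with PromQL's day_of_week() function, which returns 0-6 with 0 being Sunday; a sketch building on the example above:

alert: JobFailed
expr: kube_job_status_failed{namespace=~"web"} > 0
      and ON() hour() > 9 < 17
      and ON() day_of_week() > 0 < 6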

Hopefully, this integration will improve the monitoring experience for both the developer and NOC teams. We expect it to reduce alert noise and eliminate alert configuration drift.

In conclusion, we are releasing our Zabbix Alertmanager integration under the permissive Apache 2.0 license.

In our next post, we will go through how to set up the Zabbix Alertmanager integration.

Need help with Kubernetes monitoring? Contact us.