Posts about systems Monitoring

zabbix alertmanager

Introduction

Recently we released an update to the Zabbix Alertmanager integration. The new v1.2.0 version brings a bunch of operational improvements, one major addition being the bundled Grafana dashboard. The dashboard shows the main zal send process metrics, such as how many alerts were recently sent to Zabbix successfully.

Zabbix Alertmanager Grafana Dashboard

Most importantly, the dashboard shows how many alerts have failed. Failures are usually due to misconfigured alert routing in Alertmanager or an incorrect Hosts configuration file. If you see these errors, please read the Setting Up Zabbix Alertmanager blog post; in addition, you can enable the --log.level=debug flag to see more detailed output.

Monitoring Zabbix Alertmanager

In order to ensure that the whole monitoring pipeline is working, we highly recommend you set up a Dead Man's Switch / Watchdog alert. The idea behind a Dead Man's Switch is pretty straightforward. You create a simple alerting rule which always fires. Then, in Zabbix, you check whether you received that alert in the last 2-3 minutes. If you haven't seen the alert, you trigger an alert and escalate the issue to the right person. This is a simple way to trigger an alert when the monitoring pipeline is no longer functioning correctly.

This is a black-box monitoring alert: you know the whole system is not working, but you don't know which exact component is misbehaving. You should aim to have a modest number of black-box monitoring alerts and a higher number of white-box alerts.

A Dead Man's Switch is really simple to implement. Firstly, you create a simple Prometheus alert and set its expression to vector(1), which makes sure the alert always fires. Then, in order to make Zabbix monitor it, you add a special nodata rule. The nodata Zabbix expression triggers if no data was received during a defined period of time. With our Zabbix Alertmanager integration, it's enough to add the zabbix_trigger_nodata annotation to the alert, along with the desired time window specified in seconds.

Here is the complete example:

alert: DeadMansSwitch
annotations:
  description: This is a DeadMansSwitch meant to ensure that the entire alerting
    pipeline is functional.
  summary: Alerting DeadMansSwitch
  zabbix_trigger_nodata: "600"
expr: vector(1)
labels:
  severity: critical
  team: infra
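On the Zabbix side, the provisioned trigger relies on a classic nodata expression. As a rough sketch (the host name and item key here are illustrative, not the exact key that zal prov generates), the resulting trigger expression looks something like this:

```
{infrahost:prometheus.DeadMansSwitch.nodata(600)}=1
```

When no value arrives for the item within 600 seconds, the trigger fires and Zabbix can escalate the issue.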

Whitebox monitoring

Apart from regular Prometheus and Zabbix alerts, we highly recommend that you monitor the number of failed alerts. To clarify, the Zabbix Alertmanager integration sends alerts via zal send. This component has a metric named alerts_errors_total, which increases whenever it fails to send an alert. The same metric is shown in the dashboard, so we suggest you add a Prometheus alerting rule to check whether there were errors in the last 5 minutes. This is how it looks in a Prometheus alerting rule configuration:

alert: ZabbixAlertmanagerErrors
expr: sum(rate(alerts_errors_total{name="zal"}[5m])) > 0
annotations:
  description: ZAL sender is having issues sending alerts to Zabbix. Please investigate.
  grafana_url: http://GRAFANA_URL/d/maYkrFemz/zabbix-alertmanager?orgId=1&from=now-30m&to=now
labels:
  severity: critical
  team: infra

If this alert fires, just look at the zal send logs and you should see why it can't send alerts to Zabbix. Now, let's take a look at deployment considerations.

Deployment

It's really important to deploy the zal send component correctly. That is to say, you should have more than one instance running. In addition, you should spread those instances across many nodes in order to minimize the impact of a failing node. This is really easy to accomplish, as zal send is completely stateless. If you are planning to deploy this to Kubernetes, simply use our manifest configuration.

Take a look at this Kubernetes manifest:

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: zal
spec:
  minAvailable: 2
  selector:
    matchLabels:
      k8s-app: zal
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    k8s-app: zal
  name: zal
spec:
  # Run several replicas so the PodDisruptionBudget (minAvailable: 2) can be satisfied.
  replicas: 3
  selector:
    matchLabels:
      k8s-app: zal
  template:
    metadata:
      labels:
        name: zal
        k8s-app: zal
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: k8s-app
                  operator: In
                  values:
                  - zal
              topologyKey: kubernetes.io/hostname
            weight: 100
      containers:
      - name: zal
        image: quay.io/devopyio/zabbix-alertmanager:v1.2.0
        args:
        - send
        - --log.level=info
        - --zabbix-addr=ZABBIX_ADDR:10051
        - --default-host=infra
        - --hosts-path=/etc/zal/sender-config.yml
        ports:
        - containerPort: 9095

Most importantly, we have a podAntiAffinity rule, which spreads instances across nodes. In addition, we utilize a PodDisruptionBudget, which makes sure that at least two replicas are always available. You can see the whole Kubernetes manifest in our repository.

Conclusion

In conclusion, having a reliable monitoring pipeline is critical for any company. With the Dead Man's Switch, black-box, and white-box concepts presented in this blog post, reliability becomes an achievable task. In addition, the alerting rule and deployment examples shown here should give you added confidence in this Zabbix Alertmanager integration.

Need help with reliable Prometheus & Zabbix integration? Contact us.


Zabbix Alertmanager

Recently, we released the Zabbix Alertmanager integration as open source; it can be downloaded from the GitHub page. In this post, we are going to dive deeper into how the Zabbix Alertmanager integration works.

When we first began working on this integration, we looked at a similar project made by Gmauleon. It's a really great project that we took a lot of inspiration from, but we quickly realized that we needed to make some major changes. Our project is written in Go and released as a standalone binary, which we called zal. The integration consists of two separate commands:

  1. zal prov command, which converts Prometheus Alerting rules into Zabbix Triggers.
  2. zal send command, which listens for Alert requests from Alertmanager and sends them to Zabbix.

Alert provisioning

The zal prov command is used to create Zabbix Triggers from Prometheus Alerting rule definitions. It's a simple executable binary. Much like a shell script, it runs to completion and exits with status code 0 upon success. Otherwise, it fails and prints an error message. This opens up many different deployment options: a Cron Job which periodically checks the configuration and creates triggers, a CI job which creates triggers on alert configuration changes, or a regular bash script.
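As a sketch of the Cron Job option (the schedule, install path, and config path are illustrative assumptions, not part of the project), a crontab entry could look like this:

```
# Re-provision Zabbix triggers from the alerting rules every 15 minutes
*/15 * * * * /usr/local/bin/zal prov --config-path=/etc/zal/zal-config.yaml --url=http://ZABBIX_URL/api_jsonrpc.php --prometheus-url=http://PROMETHEUS_URL
```

Since zal prov exits non-zero on failure, you can also wire cron's error output into your own notification channel.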

For our customers, we recommend running alert provisioning as part of a GitLab Continuous Integration job. We store all Prometheus alerting configuration in one git repository. Developers change alerting rules via a Pull Request in Git; a CI job then runs the promtool check rules command, which validates the configuration. Once the Pull Request is merged, the alerting rules are automatically provisioned into Zabbix. Here is an example of .gitlab-ci.yml:

stages:
  - check
  - provision

check-alerts:
  stage: check
  image:
    name: prom/prometheus:v2.8.0
    entrypoint: ["/bin/sh", "-c"]
  script:
    - promtool check rules /.yml

provision-rules:
  stage: provision
  image:
    name: devopyio/zabbix-alertmanager:v1.1.1
    entrypoint: ["/bin/sh", "-c"]
  script:
    - zal prov --log.level=info --config-path=zal-config.yaml
      --url=http://ZABBIX_URL/api_jsonrpc.php
      --prometheus-url=http://PROMETHEUS_URL
  only:
    - master

Getting started

In order to run zal prov, you will need to set up a Zabbix User. This Zabbix User has to have access to the Zabbix API. The user also requires elevated permissions to update Hosts and create Host Items, Triggers & Zabbix Applications. You can read more about the Zabbix API & user permissions in the Zabbix API manual.

When you first try it out, we suggest that you manually create a Host Group along with some empty Hosts. After that, create a Zabbix user, allow this user to access that Host Group, and enable Zabbix API access. Be sure to make note of your configuration, as you will need to provide these values to zal prov via the --user, --password, --url flags or the ZABBIX_USER, ZABBIX_PASSWORD, ZABBIX_URL environment variables. We recommend setting user credentials via environment variables in order to keep them secret.
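As a sketch of the environment-variable approach (the user name and secret file path are illustrative), an invocation might look like this:

```
# Keep credentials out of command-line flags (which are visible in `ps`)
export ZABBIX_USER=zal-provisioner
export ZABBIX_PASSWORD="$(cat /run/secrets/zabbix-password)"
export ZABBIX_URL=http://ZABBIX_URL/api_jsonrpc.php
zal prov --config-path=zal-config.yaml
```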

Configuring Hosts

After empty hosts are created and a user is set up, you need to specify Host configurations. It’s a simple YAML configuration file. Take a look at this example:

- name: infrahost
  hostGroups: INFRA
  tag: prometheus
  deploymentStatus: 0
  itemDefaultApplication: prometheus
  itemDefaultHistory: 5d
  itemDefaultTrends: 5d
  itemDefaultTrapperHosts: 0.0.0.0/0
  triggerTags:
    INFRA: ""
  alertsDir: ./infra

- name: webhost
  hostGroups: WEBTEAM
  tag: prometheus
  deploymentStatus: 0
  itemDefaultApplication: prometheus
  itemDefaultHistory: 5d
  itemDefaultTrends: 5d
  itemDefaultTrapperHosts: 0.0.0.0/0
  triggerTags:
    WEBT: ""
  alertsDir: ./web

In this example, we create two Zabbix hosts. One is named infrahost and placed in the host group INFRA. Additionally, we add a prometheus tag to this host and store only 5 days' worth of history. This configuration will be shown in the Zabbix web UI. For infrahost, we will provision Zabbix triggers from the Prometheus alerts in the ./infra directory. Lastly, we add the INFRA tag on those triggers.

Similarly, we do the same for the host named webhost in the host group WEBTEAM, and provision alerts from the ./web directory. Multiple hosts and multiple alert directories allow us to separate teams as well as their alerts. In this case, we have an Infrastructure team, which will see its alerts in the infrahost host, along with a Web developer team, which will get its alerts in the webhost Zabbix host.

Alerting configuration

In alertsDir, we expect Prometheus Alerting rules to be saved in files ending with the .yml or .yaml extension. There are some special rules for creating Alerts in Zabbix, which provide a more native Zabbix experience for Prometheus Alerts.
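For example, with the Hosts configuration above, the alerting repository layout might look like this (the rule file names are illustrative):

```
zal-config.yaml     # Hosts configuration
infra/              # alertsDir for infrahost
  node-alerts.yml
web/                # alertsDir for webhost
  web-alerts.yaml
```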

Here are the rules:

  1. If the Prometheus URL is configured, we set up a Trigger URL linking to the Prometheus Query (only if the URL is shorter than 255 symbols, as Zabbix doesn't support longer URLs).
  2. The Trigger's Comment field is set from the Alert's summary, message or description annotation.
  3. The Trigger's Severity is configured via the severity label, which can have one of the following values: information, warning, average, high, critical.
  4. If an Alerting rule has the special zabbix_trigger_nodata annotation, we set up a special Zabbix nodata trigger expression. The annotation's value must be a number: the evaluation period in seconds.
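Putting those rules together, a hypothetical alert that exercises all of them (the alert name, expression, and annotation values are illustrative) might look like this:

```
alert: NodeExporterDown
expr: up{job="node-exporter"} == 0
for: 5m
labels:
  severity: high         # becomes the Zabbix Trigger severity
  team: infra
annotations:
  summary: Node exporter is down.   # becomes the Trigger comment
  zabbix_trigger_nodata: "300"      # adds a nodata(300) trigger expression
```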

Pass the configuration file via the zal prov --config-path flag. To make triggers link back to Prometheus, add the --prometheus-url flag. You can get more information by executing the zal --help and zal prov --help commands.

Alert sending

Once Alert provisioning has successfully completed, you can start sending alerts to Zabbix. The zal send command listens for alerts from Alertmanager via a webhook receiver and sends them to Zabbix via the Zabbix Sender Protocol. You can read more about the protocol and how it works in the Zabbix Trapper items section.

In order to run zal send, you will need to set --zabbix-addr to point to the Zabbix server trapper port. By default, the Zabbix server listens on port 10051. You then need to configure --addr, which is the address to listen on for Alertmanager's Webhook requests (the default is 0.0.0.0:9095). Also, you will need to provide --hosts-path, which points to the zal send host configuration file.
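Combining those flags (the Zabbix address is a placeholder, and the config path is illustrative), a zal send invocation looks roughly like this:

```
zal send \
  --log.level=info \
  --addr=0.0.0.0:9095 \
  --zabbix-addr=ZABBIX_ADDR:10051 \
  --default-host=infra \
  --hosts-path=/etc/zal/sender-config.yml
```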

The Hosts configuration file is used to route alerts to the correct Zabbix hosts. Let's say you have two hosts: infrahost for infrastructure alerts, and webhost for web developer alerts. This would give you two mappings:

# Receiver name to zabbix host mapping
infra: infrahost
web: webhost

The first part of this configuration is actually Alertmanager's receiver name. In this example, alerts coming from Alertmanager's infra receiver will go to the infrahost Zabbix host, and the web receiver's alerts will go to the webhost host. This configuration needs to be in line with Alertmanager's configuration. Let's take a look at this Alertmanager configuration example:

global:

route:
  group_by: ['alertname', 'team']
  group_wait: 30s
  group_interval: 2m
  repeat_interval: 3m
  receiver: infra
  routes:
  - receiver: web
    match_re:
      team: web
  - receiver: infra
    match_re:
      team: infra

receivers:
- name: 'infra'
  webhook_configs:
  - url: http://ZAL_SENDER_ADDR/alerts
    send_resolved: true
- name: 'web'
  webhook_configs:
  - url: http://ZAL_SENDER_ADDR/alerts
    send_resolved: true

This configuration routes alerts with the label team: web to the web receiver, and team: infra to the infra receiver. Note that receivers must be configured via webhook_configs, and there must be a separate receiver configuration for each team. In this example, we have one receiver for the Web developer team and one for the Infrastructure team. ZAL_SENDER_ADDR is the address of zal send, which we configured via the --addr flag.
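To see how the routing pieces fit together, it helps to look at what Alertmanager actually sends. Below is a minimal Python sketch of the JSON payload Alertmanager POSTs to the webhook URL (field names follow Alertmanager's documented webhook format; the label and annotation values are illustrative). zal send reads the receiver field and uses the Hosts configuration file to pick the target Zabbix host:

```python
import json

# Minimal sketch of an Alertmanager webhook payload; values are illustrative.
payload = {
    "receiver": "web",     # matched against the Hosts configuration file
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "JobFailed", "severity": "warning", "team": "web"},
            "annotations": {"summary": "A Kubernetes job has failed"},
            "startsAt": "2019-05-01T10:00:00Z",
        }
    ],
}

# With the mapping "web: webhost", this batch would be sent to the webhost Zabbix host.
body = json.dumps(payload)
print(json.loads(body)["receiver"])  # -> web
```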

If an alert cannot be routed more specifically (for example, when it doesn't have a team label), it will end up in the infra receiver. If a receiver is missing from the Hosts configuration file, the alert will go to the host specified in the --default-host flag.

Conclusion

These are the main things you need to know in order to successfully run your Zabbix Alertmanager integration. In the next post, we will take a look at deployment considerations of Zal sender, and see how we can ensure the whole system runs reliably.

Need help integrating Prometheus with Zabbix? Contact us.


Zabbix

Over the years, Zabbix has become the standard monitoring system for hosts and virtual machines. You can run the Zabbix agent to get metrics for Network, CPU, Memory, and Disk usage, and begin monitoring and set up simple alerting for your hosts. Some applications even support exporting data to Zabbix. For example, you can integrate JVM monitoring data by running the Zabbix Java gateway or a custom JVM agent, like Zorka.

Zabbix has a hierarchical model. You begin by creating a group of hosts, then you create your physical or virtual hosts and add them to your group. Each individual host can house many items, which store metric information. One typical item may store the currently available disk space. In order to get alerts about low disk space, you simply add triggers to your host. Triggers are configured via simple expressions, like "if available disk space is less than 5%, then notify". Zabbix has a basic expression language, where you can use avg (which computes the average) or nodata (which triggers an alert if no item data was received). The list of available expressions can be found here.
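As a concrete sketch (the host name is illustrative), the "less than 5% free disk space" rule from above could be written as a classic Zabbix trigger expression over the agent's free-disk-space-percentage item:

```
{webhost:vfs.fs.size[/,pfree].last()}<5
```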

Prometheus

Although Zabbix works well for monitoring hosts and the applications running on those hosts, it becomes harder to use in cloud environments like Kubernetes. In Kubernetes, applications change hosts quickly, and resource restrictions are configured at the application level, not the host level. This means host checks would show good results while applications might be constantly crashing due to resource restrictions. By design, Kubernetes applications change their identity on deployment, which makes them next to impossible to organize into a hierarchical model.

This is why SoundCloud originally built Prometheus, an open-source systems monitoring and alerting toolkit. Since its creation, many companies and organizations have adopted Prometheus, and the project is maintained by a very active developer and user community. It is now a standalone open source project, maintained independently of any company. To emphasize this, and to clarify the project's governance structure, Prometheus joined the Cloud Native Computing Foundation as the second hosted project after Kubernetes. Active members of the Prometheus community are also working on OpenMetrics, which evolves the Prometheus metrics exposition format into a standard. These specifics make Prometheus a safe choice for any company.

Combining Zabbix strength with Prometheus

So, why did we decide to marry those two systems together?

One of our clients is a leading telecommunications provider with an NOC / Monitoring team that directly responds to alerts from Zabbix. This company runs a mix of virtual machines and containers running on Kubernetes. Moving completely from one system to another is simply not an option. Thus, we decided that integrating the systems together was the best solution, as it gives the teams the flexibility to try Prometheus without breaking the current flow.

During the project, we tried to automate as much as possible.

Previously, configuring proper alerts involved an organizational process like the following:

  • The developer team would expose data via custom integrations like Zorka or custom-made bash scripts. They would then ask the NOC team to configure alerts.
  • The NOC team would ask the developer team whom to call and what severity the application's alerts should have. The NOC team would then go and configure the Zabbix trigger.

If developers want to change the alerting threshold or severity, or even completely remove an alert, they have to contact the NOC team. This leaves a constant struggle between developer teams and NOC teams to configure alerts correctly. Sometimes NOC teams don't alert development teams due to miscommunication when configuring alerts, and development teams typically forget to inform monitoring teams when decommissioning systems, which leads to alerting configuration inconsistencies.

So, one of our project goals was to tackle this organizational challenge with a DevOps approach.

DevOps approach

With our new Zabbix Alertmanager integration, we gave more power to automation. We created a new git repository hosting Prometheus alerts. Each development team is given its own directory with an alerting configuration. The alerts are described in Prometheus YAML. Here is an example of an alert:

alert: JobFailed
expr: kube_job_status_failed{namespace=~"web"}  > 0
for: 1h
labels:
  severity: warning
  team: web

The new proposed process looks like this:

  • Developer teams configure their own alerts from Prometheus metrics. They can set severity and change thresholds directly in git.
  • Automation makes changes in Zabbix, automatically configuring hosts, triggers, and items.
  • NOC teams monitor Zabbix for alerts and notify the development teams.

Additionally, developers gain new functionality:

  • Developer teams can take a system into maintenance by silencing alerts, thus stopping alerts from triggering in Zabbix. There is no need to coordinate via emails, and Alertmanager provides a nice UI just for doing that.
  • Developers can configure alerting time windows directly in the alerting configuration.

For example, let’s configure the alert to fire only during working hours:

alert: JobFailed
expr: kube_job_status_failed{namespace=~"web"}  > 0
      and ON() hour() > 9 < 17

Hopefully, this integration will improve the monitoring experience for both developer and NOC teams. We hope to see reduced alert noise and an end to alert configuration drift.

In conclusion, we are releasing our Zabbix Alertmanager integration under the permissive Apache 2 license.

In our next post we will go through how to set up Zabbix Alertmanager Integration.

Need help with Kubernetes monitoring? Contact us.