Zabbix Alertmanager

Introduction

Recently we released a Zabbix Alertmanager integration update. In the new v1.2.0 version, we made a number of operational improvements, one major addition being the bundled Grafana dashboard. The dashboard shows the main zal send process metrics, such as how many alerts were recently sent to Zabbix successfully.

Zabbix Alertmanager Grafana Dashboard

Most importantly, the dashboard shows how many alerts have failed. Failures are usually caused by misconfigured alert routing in Alertmanager or by an incorrect Hosts configuration file. If you see these errors, please read the Setting Up Zabbix Alertmanager blog post; in addition, you can enable the --log.level=debug flag to see more detailed output.

Monitoring Zabbix Alertmanager

In order to ensure that the whole monitoring pipeline is working, we highly recommend setting up a Dead Man's Switch / Watchdog alert. The idea behind a Dead Man's Switch is straightforward: you create a simple alerting rule which always fires, and in Zabbix you check whether that alert was received in the last 2-3 minutes. If it wasn't, you trigger an alert and escalate the issue to the right person. This is a simple way to get notified when the monitoring pipeline is no longer functioning correctly.

This is a black-box monitoring alert: you know the whole system is not working, but you don't know which exact component is misbehaving. You should aim for a modest number of black-box monitoring alerts and a larger number of white-box alerts.

A Dead Man's Switch is really simple to implement. First, you create a simple Prometheus alert and set its expression to vector(1), so the alert always fires. Then, in order to make Zabbix monitor it, you add a special nodata rule: the nodata Zabbix expression triggers if no data was received during the defined period of time. With our Zabbix Alertmanager integration, it's enough to add the zabbix_trigger_nodata annotation to the alert, along with the desired time window specified in seconds.

Here is the complete example:

alert: DeadMansSwitch
annotations:
  description: This is a DeadMansSwitch meant to ensure that the entire alerting pipeline is functional.
  summary: Alerting DeadMansSwitch
  zabbix_trigger_nodata: "600"
expr: vector(1)
labels:
  severity: critical
  team: infra

Whitebox monitoring

Apart from regular Prometheus and Zabbix alerts, we highly recommend monitoring the number of failed alerts. To clarify, the Zabbix Alertmanager integration sends alerts via zal send. This component exposes a metric named alerts_errors_total, which increases whenever it fails to send an alert. The same metric is shown in the dashboard, so we suggest adding a Prometheus alerting rule that checks whether there were any errors in the last 5 minutes. This is how it looks in the Prometheus alerting rule configuration:

alert: ZabbixAlertmanagerErrors
expr: sum(rate(alerts_errors_total{name="zal"}[5m])) > 0
annotations:
  description: ZAL sender is having issues sending alerts to Zabbix. Please investigate.
  grafana_url: http://GRAFANA_URL/d/maYkrFemz/zabbix-alertmanager?orgId=1&from=now-30m&to=now
labels:
  severity: critical
  team: infra

If this alert fires, just look at the zal send logs and you should see why it can't send alerts to Zabbix.
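For example, if you run the sender in Kubernetes as shown in the next section, a quick way to inspect the logs might be the following (the namespace is an assumption; adjust it to wherever you deployed zal):

# Tail the zal sender logs and look for errors reported while sending alerts to Zabbix.
kubectl logs --namespace monitoring deployment/zal --tail=100 -f

Now, let's take a look at deployment considerations.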

Deployment

It's really important to deploy the zal send component correctly. That is to say, you should have more than one instance running, and you should spread those instances across multiple nodes in order to minimize the impact of a single failing node. This is easy to accomplish, as zal send is completely stateless. If you are planning to deploy this to Kubernetes, simply use our manifest configuration.

Take a look at this Kubernetes manifest:

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: zal
spec:
  minAvailable: 2
  selector:
    matchLabels:
      k8s-app: zal
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    k8s-app: zal
  name: zal
spec:
  # Run several stateless replicas so a single node failure does not stop alert delivery.
  replicas: 3
  selector:
    matchLabels:
      k8s-app: zal
  template:
    metadata:
      labels:
        name: zal
        k8s-app: zal
    spec:
      affinity:
        podAntiAffinity:
          # Prefer scheduling replicas on different nodes.
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: k8s-app
                  operator: In
                  values:
                  - zal
              topologyKey: kubernetes.io/hostname
            weight: 100
      containers:
      - name: zal
        image: quay.io/devopyio/zabbix-alertmanager:v1.2.0
        args:
        - send
        - --log.level=info
        - --zabbix-addr=ZABBIX_ADDR:10051
        - --default-host=infra
        - --hosts-path=/etc/zal/sender-config.yml
        ports:
        - containerPort: 9095

Most importantly, we have a podAntiAffinity rule, which spreads instances across nodes. In addition, we utilize a PodDisruptionBudget, which ensures that a minimum number of replicas stays available during voluntary disruptions. You can see the whole Kubernetes manifest in our repository.
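Alertmanager also needs a stable address to send webhook requests to, and the container above exposes a single port, 9095, for that purpose. A minimal Service sketch in front of the replicas, assuming the k8s-app: zal label from the manifest (the Service name and port name are our own choices), could look like this:

apiVersion: v1
kind: Service
metadata:
  name: zal
  labels:
    k8s-app: zal
spec:
  selector:
    k8s-app: zal
  ports:
  - name: http
    port: 9095
    targetPort: 9095

Alertmanager's webhook configuration would then point at this Service instead of individual pods.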

Conclusion

In conclusion, having a reliable monitoring pipeline is critical for any company. With the Dead Man's Switch, black-box and white-box concepts presented in this blog post, reliability becomes an achievable task. In addition, the alerting rule and deployment examples shown here should give you added confidence in this Zabbix Alertmanager integration.

Need help with reliable Prometheus & Zabbix integration? Contact us.


Zabbix Alertmanager

Recently, we released the Zabbix Alertmanager integration as open source; it can be downloaded from the GitHub page. In this post, we are going to dive deeper into how the Zabbix Alertmanager integration works.

When we first began working on this integration, we looked at a similar project made by Gmauleon. It’s a really great project that we took a lot of inspiration from, but we quickly realized that we needed to make some major changes. Our project is written in Go and released as a standalone binary, which we called zal. The integration consists of 2 separate commands:

  1. zal prov command, which converts Prometheus Alerting rules into Zabbix Triggers.
  2. zal send command, which listens for Alert requests from Alertmanager and sends them to Zabbix.

Alert provisioning

The zal prov command is used to create Zabbix Triggers from Prometheus Alerting rule definitions. It's a simple executable binary; almost like a shell script, it runs to completion and exits with status code 0 upon success. Otherwise, it fails and prints an error message. This opens up many different deployment options: you can run alert provisioning as a cron job which periodically checks the configuration and creates triggers, as a CI job which creates triggers whenever the alert configuration changes, or as a plain bash script.
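For instance, a minimal cron-based setup might look like the following sketch (the schedule, user, paths and URLs are placeholders for illustration):

# /etc/cron.d/zal-prov - provision Zabbix triggers from Prometheus alerting rules every 15 minutes.
# Adjust the credentials, URLs and paths to your environment.
ZABBIX_USER=zal-provisioner
ZABBIX_PASSWORD=change-me
*/15 * * * * deploy zal prov --log.level=info --config-path=/etc/zal/zal-config.yaml --url=http://ZABBIX_URL/api_jsonrpc.php --prometheus-url=http://PROMETHEUS_URL >> /var/log/zal-prov.log 2>&1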

For our customers, we recommend running alert provisioning as part of a GitLab Continuous Integration job. To do this, we store all Prometheus alerting configuration in one git repository. Developers change alerting rules via a Pull Request, then a CI job runs the promtool check rules command, which validates the configuration. Once the Pull Request is merged, the alerting rules are automatically provisioned into Zabbix. Here is an example of .gitlab-ci.yml:

stages:
  - check
  - provision

check-alerts:
  stage: check
  image:
    name: prom/prometheus:v2.8.0
    entrypoint: ["/bin/sh", "-c"]
  script:
    - promtool check rules */*.yml

provision-rules:
  stage: provision
  image:
    name: devopyio/zabbix-alertmanager:v1.1.1
    entrypoint: ["/bin/sh", "-c"]
  script:
    - zal prov --log.level=info --config-path=zal-config.yaml --url=http://ZABBIX_URL/api_jsonrpc.php --prometheus-url=http://PROMETHEUS_URL
  only:
    - master

Getting started

In order to run zal prov, you will need to set up a Zabbix user. This user has to have access to the Zabbix API and requires elevated permissions to update Hosts and to create Host Items, Triggers and Zabbix Applications. You can read more about the Zabbix API and user permissions in the Zabbix API manual.

When you first try it out, we suggest that you manually create a Host Group along with some empty Hosts. After that, create a Zabbix user, allow this user to access that Host Group, and enable Zabbix API access. Be sure to make note of your configuration, as you will need to provide these values to zal prov via the --user, --password and --url flags or the ZABBIX_USER, ZABBIX_PASSWORD and ZABBIX_URL environment variables. We recommend passing user credentials via environment variables in order to keep them secret.
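For example, a first manual run could look like this (the user name, password and URLs are placeholders):

# Keep credentials out of the command line by exporting them as environment variables.
export ZABBIX_USER=zal-provisioner
export ZABBIX_PASSWORD='change-me'
export ZABBIX_URL=http://ZABBIX_URL/api_jsonrpc.php
zal prov --log.level=info --config-path=zal-config.yaml --prometheus-url=http://PROMETHEUS_URL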

Configuring Hosts

After empty hosts are created and a user is set up, you need to specify Host configurations. It’s a simple YAML configuration file. Take a look at this example:

- name: infrahost
  hostGroups:
    - INFRA
  tag: prometheus
  deploymentStatus: 0
  itemDefaultApplication: prometheus
  itemDefaultHistory: 5d
  itemDefaultTrends: 5d
  itemDefaultTrapperHosts: 0.0.0.0/0
  triggerTags:
    INFRA: ""
  alertsDir: ./infra

- name: webhost
  hostGroups:
    - WEBTEAM
  tag: prometheus
  deploymentStatus: 0
  itemDefaultApplication: prometheus
  itemDefaultHistory: 5d
  itemDefaultTrends: 5d
  itemDefaultTrapperHosts: 0.0.0.0/0
  triggerTags:
    WEBT: ""
  alertsDir: ./web

In this example, we create two Zabbix hosts. One is named infrahost and placed in the host group INFRA. Additionally, we add a prometheus tag to this host and store only 5 days' worth of history. This configuration will be shown in the Zabbix web UI. For infrahost we will provision Zabbix triggers from the Prometheus alerts in the ./infra directory. Lastly, we add the INFRA tag to those triggers.

Similarly, we do the same for the host named webhost in the host group WEBTEAM, and provision alerts from the ./web directory. Multiple hosts and multiple alert directories allow us to separate teams as well as their alerts. In this case, we have an Infrastructure team, which will see its alerts on the infrahost host, and a Web developer team, which will get its alerts on the webhost Zabbix host.
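To make this concrete, a hypothetical repository layout for the configuration above might look like this (all file names apart from the two directories are our own examples):

.
├── zal-config.yaml   # the hosts configuration shown above
├── infra/            # alertsDir for infrahost
│   ├── kubernetes.yml
│   └── node.yml
└── web/              # alertsDir for webhost
    └── frontend.yml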

Alerting configuration

In alertsDir we expect Prometheus Alerting rules to be saved in files ending with .yml or .yaml extensions. There are some special rules when creating Alerts in Zabbix, which provide a more native Zabbix experience for Prometheus Alerts.

Here are the rules:

  1. If the Prometheus URL is configured, we set up a Trigger URL linking to the Prometheus query (only if the URL is shorter than 255 symbols, as Zabbix doesn’t support longer URLs).
  2. Trigger’s Comment field is set from Alert’s summary, message or description annotation.
  3. Trigger’s Severity is configured via severity label. Severity label can have one of information, warning, average, high, critical values.
  4. If Alerting rule has special zabbix_trigger_nodata annotation, we set up a special Zabbix nodata trigger expression. Annotation’s value must be a number, which is the evaluation period in seconds.
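Putting these rules together, a rule file inside one of the alert directories could look like the following sketch. The alert itself and the file name are hypothetical; the summary annotation becomes the trigger comment, the severity label sets the trigger severity, and the zabbix_trigger_nodata annotation adds a nodata expression with a 600 second window:

# ./infra/example.yml - a hypothetical alerting rule illustrating the conventions above.
groups:
- name: infra-example
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 5m
    labels:
      severity: high
      team: infra
    annotations:
      summary: 'Instance {{ $labels.instance }} is down'
      # Optional: also trigger in Zabbix if no data arrives for this item for 600 seconds.
      zabbix_trigger_nodata: "600"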

Pass the configuration file via the zal prov --config-path flag. In order to make triggers link to Prometheus, add the --prometheus-url flag. You can get more information by executing the zal --help and zal prov --help commands.

Alert sending

Once alert provisioning has successfully completed, you can start sending alerts to Zabbix. The zal send command listens for alerts from Alertmanager via a webhook receiver and sends them to Zabbix via the Zabbix Sender Protocol. You can read more about the protocol and how it works in the Zabbix Trapper items section.

In order to run zal send, you will need to set --zabbix-addr to point to the Zabbix server trapper port; by default the Zabbix server listens on port 10051. You then need to configure --addr, which is the address to listen on for Alertmanager's webhook requests (the default is 0.0.0.0:9095). Also, you will need to provide --hosts-path, which points to the zal send host configuration file.
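For a quick start outside of Kubernetes, a sketch of running the sender could look like this (the Zabbix address and paths are placeholders):

# Listen for Alertmanager webhooks on 0.0.0.0:9095 and forward alerts to the Zabbix trapper port.
zal send \
  --log.level=info \
  --zabbix-addr=ZABBIX_ADDR:10051 \
  --addr=0.0.0.0:9095 \
  --default-host=infra \
  --hosts-path=/etc/zal/sender-config.yml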

The Hosts configuration file is used to route alerts to the correct Zabbix hosts. Let's say you have two hosts: infrahost for infrastructure alerts, and webhost for web developer alerts. This would give you two mappings:

# Receiver name to zabbix host mapping
infra: infrahost
web: webhost

The first part of this configuration is Alertmanager's receiver name. In this example, alerts coming from Alertmanager's infra receiver will go to the infrahost Zabbix host, and the web receiver's alerts will go to the webhost host. This configuration needs to be in line with Alertmanager's configuration. Let's take a look at this Alertmanager configuration example:

global:

route:
  group_by: ['alertname', 'team']
  group_wait: 30s
  group_interval: 2m
  repeat_interval: 3m
  receiver: infra
  routes:
  - receiver: web
    match_re:
      team: web
  - receiver: infra
    match_re:
      team: infra

receivers:
- name: 'infra'
  webhook_configs:
  - url: http://ZAL_SENDER_ADDR/alerts
    send_resolved: true
- name: 'web'
  webhook_configs:
  - url: http://ZAL_SENDER_ADDR/alerts
    send_resolved: true

This configuration routes alerts with the label team: web to the web receiver, and team: infra to the infra receiver. Note that receivers must be configured via webhook_configs, and for each team there must be a separate receiver configuration. In this example, we have one receiver for the Web developer team and one receiver for the Infrastructure team. ZAL_SENDER_ADDR is the address of zal send, which we configured via the --addr flag.

If we fail to correctly route an alert (for example, when the alert doesn't have a team label), it will end up in the infra receiver. If we forget to configure the Hosts configuration file, alerts will default to the value specified in the --default-host flag.

Conclusion

These are the main things you need to know in order to successfully run the Zabbix Alertmanager integration. In the next post, we will take a look at deployment considerations for the zal sender and see how we can ensure the whole system runs reliably.

Need help integrating Prometheus with Zabbix? Contact us.