Introduction
Recently we released a Zabbix Alertmanager Integration update. In the new v1.2.0 version, we made a bunch of operational improvements, one major addition being the bundled Grafana dashboard. The dashboard visualizes the operational metrics exposed by the zal send command.

Most importantly, in the dashboard you can see how many alerts have failed to be delivered to Zabbix. This is usually due to misconfigured alert routing. If you see failures, run zal send with the --log.level=debug flag to see more detailed output.
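If you run the sender in Kubernetes, one way to do this is to temporarily switch the log level in the container args of the Deployment shown in the Deployment section below; this is just an illustrative fragment of that manifest:

args:
  - send
  # switched from info to debug while troubleshooting failed alerts
  - --log.level=debug
  - --zabbix-addr=ZABBIX_ADDR:10051
  - --default-host=infra
  - --hosts-path=/etc/zal/sender-config.yml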
Monitoring Zabbix Alertmanager
In order to ensure that the whole monitoring pipeline is working, we highly recommend setting up a Dead Man’s Switch / Watchdog alert. The idea behind a Dead Man’s Switch is pretty straightforward. You create a simple alerting rule which always fires. Then, in Zabbix, you check whether you have received that alert within the last 2-3 minutes. If you haven’t seen the alert, you trigger a Zabbix alert and escalate the issue to the right person. This is a simple way to get alerted when your monitoring pipeline is no longer functioning correctly.
This is a black-box monitoring alert: you know the whole system is not working, but you don’t know which exact component is misbehaving. You should aim for a modest number of black-box alerts and a larger number of white-box alerts.
A Dead Man’s Switch is really simple to implement. First, you create a simple Prometheus alerting rule with the expression vector(1), which always evaluates to 1, so the alert is always firing. Second, on the Zabbix side, you use a nodata trigger. A nodata Zabbix trigger expression fires if no data was received during the defined period of time. With our Zabbix Alertmanager integration, you just add the zabbix_trigger_nodata annotation to the rule, with the timeout in seconds, and the corresponding nodata trigger is set up for you.
Here is the complete example:
alert: DeadMansSwitch
annotations:
  description: This is a DeadMansSwitch meant to ensure that the entire alerting
    pipeline is functional.
  summary: Alerting DeadMansSwitch
  zabbix_trigger_nodata: "600"
expr: vector(1)
labels:
  severity: critical
  team: infra
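For reference, in a standard Prometheus rules file this rule lives inside a group; a minimal sketch, with an illustrative group name:

groups:
  - name: zabbix-alertmanager  # illustrative group name
    rules:
      - alert: DeadMansSwitch
        expr: vector(1)
        labels:
          severity: critical
          team: infra
        annotations:
          summary: Alerting DeadMansSwitch
          description: This is a DeadMansSwitch meant to ensure that the entire alerting pipeline is functional.
          zabbix_trigger_nodata: "600"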
Whitebox monitoring
Apart from regular Prometheus and Zabbix alerts, we highly recommend that you monitor the number of failed alerts. To clarify, zal send exposes Prometheus metrics, among them the alerts_errors_total counter, which counts alerts that could not be delivered to Zabbix. You can alert on it like this:
alert: ZabbixAlertmanagerErrors
expr: sum(rate(alerts_errors_total{name="zal"}[5m])) > 0
annotations:
  description: ZAL sender is having issues sending alerts to Zabbix. Please investigate.
  grafana_url: http://GRAFANA_URL/d/maYkrFemz/zabbix-alertmanager?orgId=1&from=now-30m&to=now
labels:
  severity: critical
  team: infra
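Depending on how noisy your environment is, you may want the errors to persist for a while before paging anyone; one possible variation, with an illustrative 10m duration, is to add a for clause:

alert: ZabbixAlertmanagerErrors
# same expression as above, but only fires after errors persist for 10 minutes
expr: sum(rate(alerts_errors_total{name="zal"}[5m])) > 0
for: 10m
labels:
  severity: critical
  team: infra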
If this alert fires, just look at the zal send logs and the bundled Grafana dashboard to figure out what is going wrong.
Deployment
It’s really important to deploy zal send correctly. Since zal send sits in the critical path of your alerting pipeline, you should run it in a highly available way: multiple replicas, spread across nodes, protected from voluntary disruptions.
Take a look at this Kubernetes manifest:
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: zal
spec:
  minAvailable: 1
  selector:
    matchLabels:
      k8s-app: zal
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    k8s-app: zal
  name: zal
spec:
  replicas: 2
  selector:
    matchLabels:
      k8s-app: zal
  template:
    metadata:
      labels:
        name: zal
        k8s-app: zal
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: k8s-app
                      operator: In
                      values:
                        - zal
                topologyKey: kubernetes.io/hostname
              weight: 100
      containers:
        - name: zal
          image: quay.io/devopyio/zabbix-alertmanager:v1.2.0
          args:
            - send
            - --log.level=info
            - --zabbix-addr=ZABBIX_ADDR:10051
            - --default-host=infra
            - --hosts-path=/etc/zal/sender-config.yml
          ports:
            - containerPort: 9095
Most importantly, we have a podAntiAffinity rule, which spreads the instances across nodes. In addition, we utilize a PodDisruptionBudget, which makes sure that at least one replica is always available. You can see the whole Kubernetes manifest in our repository.
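The container exposes port 9095 (see the containerPort above); assuming this is where zal send serves its metrics and receives alerts from Alertmanager, you will also want a Service in front of the pods. A minimal sketch, reusing the same labels as the Deployment (the Service name and port name are illustrative):

apiVersion: v1
kind: Service
metadata:
  name: zal
  labels:
    k8s-app: zal
spec:
  selector:
    k8s-app: zal
  ports:
    - name: http
      port: 9095
      targetPort: 9095

Alertmanager’s webhook receiver and the Prometheus scrape configuration for alerts_errors_total can then point at this Service.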
Conclusion
In conclusion, having a reliable monitoring pipeline is critical for any company. With the Dead Man’s Switch, black-box, and white-box monitoring concepts presented in this blog post, reliability becomes an achievable task. In addition, the alerting rule and deployment examples shown here should give you added confidence in your own setup.
Need help with reliable Prometheus & Zabbix integration? Contact us.