Zabbix Alertmanager

Introduction

Recently we released an update to the Zabbix Alertmanager integration. The new v1.2.0 version brings a bunch of operational improvements, one major addition being the bundled Grafana dashboard. The dashboard shows the main zal send process metrics, such as how many alerts were recently and successfully sent to Zabbix.

Zabbix Alertmanager Grafana Dashboard

Most importantly, the dashboard shows how many alerts have failed. This is usually due to misconfigured alert routing in Alertmanager or an incorrect hosts configuration file. If you see these errors, please read the Setting Up Zabbix Alertmanager blog post; in addition, you can enable the --log.level=debug flag to see more detailed output.
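
To get that detailed output, you can run the sender with debug logging enabled. Here is a minimal invocation sketch; it reuses the same flags as the Kubernetes manifest later in this post, and ZABBIX_ADDR and the hosts file path are placeholders you need to adapt to your environment:

zal send \
  --log.level=debug \
  --zabbix-addr=ZABBIX_ADDR:10051 \
  --default-host=infra \
  --hosts-path=/etc/zal/sender-config.yml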

Monitoring Zabbix Alertmanager

In order to ensure that the whole monitoring pipeline is working, we highly recommend you set up a Dead Man's Switch / Watchdog alert. The idea behind a Dead Man's Switch is pretty straightforward: you create a simple alerting rule which always fires. Then, in Zabbix, you check whether you received that alert in the last 2-3 minutes. If you haven't seen the alert, you trigger an alert of your own and escalate the issue to the right person. This is a simple way to get alerted when your monitoring pipeline is no longer functioning correctly.

This is a black-box monitoring alert: you know the whole system is not working, but you don't actually know which exact component is misbehaving. You should aim to have only a small number of black-box monitoring alerts like this, and a higher number of white-box alerts.

A Dead Man's Switch is really simple to implement. First, you create a simple Prometheus alert with the expression set to vector(1), so the alert always fires. Then, to make Zabbix monitor it, you add a special nodata rule. The nodata Zabbix expression triggers if no data was received during the defined period of time. With our Zabbix Alertmanager integration, it's enough to add the zabbix_trigger_nodata annotation to the alert, along with the desired time window, specified in seconds.

Here is the complete example:

alert: DeadMansSwitch
annotations:
  description: This is a DeadMansSwitch meant to ensure that the entire alerting
    pipeline is functional.
  summary: Alerting DeadMansSwitch
  zabbix_trigger_nodata: "600"
expr: vector(1)
labels:
  severity: critical
  team: infra
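
On the Zabbix side, the nodata check conceptually looks like the trigger expression below. This is only an illustration of how the 600-second window maps to nodata(); the host and item key shown here are placeholders, not the exact names the integration generates for you:

{infra:deadmansswitch.nodata(600)}=1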

Whitebox monitoring

Apart from regular Prometheus and Zabbix alerts, we highly recommend that you monitor the number of failed alerts. To clarify, the Zabbix Alertmanager integration sends alerts via zal send. This component has a metric named alerts_errors_total, which increases whenever it fails to send an alert. The same metric is shown in the dashboard, so we suggest you add a Prometheus alerting rule to check whether there were any errors in the last 5 minutes. This is how it looks in the Prometheus alerting rule configuration:

alert: ZabbixAlertmanagerErrors
expr: sum(rate(alerts_errors_total{name="zal"}[5m])) > 0
annotations:
  description: ZAL sender is having issues sending alerts to Zabbix. Please investigate.
  grafana_url: http://GRAFANA_URL/d/maYkrFemz/zabbix-alertmanager?orgId=1&from=now-30m&to=now
labels:
  severity: critical
  team: infra

If this alert fires, look at the zal send logs and you should see the reason why it can't send alerts to Zabbix.
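
If you run zal send on Kubernetes, a quick way to pull those logs is via the pod label selector. This is a minimal sketch; the k8s-app=zal label matches the manifest in the next section, so adjust it if you label your pods differently:

kubectl logs -l k8s-app=zal --tail=100

Now, let's take a look at deployment considerations.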

Deployment

It's really important to deploy the zal send component correctly. That is to say, you should have more than one instance running, and you should spread those instances across multiple nodes in order to minimize the impact of a single failing node. This is easy to accomplish, as zal send is completely stateless. If you are planning to deploy it to Kubernetes, simply use our manifest configuration.

Take a look at this Kubernetes manifest:

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: zal
spec:
  # Keep at least one replica available during voluntary disruptions (e.g. node drains).
  minAvailable: 1
  selector:
    matchLabels:
      k8s-app: zal
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    k8s-app: zal
  name: zal
spec:
  # Run more than one instance, since zal send is stateless.
  replicas: 2
  selector:
    matchLabels:
      k8s-app: zal
  template:
    metadata:
      labels:
        name: zal
        k8s-app: zal
    spec:
      affinity:
        # Prefer scheduling replicas on different nodes.
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: k8s-app
                  operator: In
                  values:
                  - zal
              topologyKey: kubernetes.io/hostname
            weight: 100
      containers:
      - name: zal
        image: quay.io/devopyio/zabbix-alertmanager:v1.2.0
        args:
        - send
        - --log.level=info
        - --zabbix-addr=ZABBIX_ADDR:10051
        - --default-host=infra
        - --hosts-path=/etc/zal/sender-config.yml
        ports:
        - containerPort: 9095

Most importantly, we have a podAntiAffinity rule, which spreads instances across nodes. In addition, we utilize a PodDisruptionBudget, which makes sure that at least one replica is always available. You can see the whole Kubernetes manifest in our repository.
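
To verify that the scheduler actually spread the replicas, you can check pod placement with a label selector (again assuming the k8s-app=zal label from the manifest above); the NODE column should show a different node for each replica:

kubectl get pods -l k8s-app=zal -o wide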

Conclusion

In conclusion, having a reliable monitoring pipeline is critical for any company. With the Dead Man's Switch, black-box, and white-box concepts presented in this blog post, reliability becomes an achievable task. In addition, the alerting rule and deployment examples shown here should give you added confidence in this Zabbix Alertmanager integration.

Need help with reliable Prometheus & Zabbix integration? Contact us.