Alertmanager Integration
Lenses comes with an alerting subsystem that can be tailored to individual needs. You can find more information about it in the user guide.
A complete alerting system usually also needs alert management and notifications. Put simply, each alert has to reach the right team within the right time frame, without overwhelming that team with alerts that do not concern it, duplicate entries, or alerts that are merely a byproduct of a top-level alert. For this reason, Lenses integrates with Alertmanager, which provides alert de-duplication, grouping and routing to various systems (such as email, PagerDuty and Slack), as well as silencing and inhibition.
Lenses Configuration
To enable the Alertmanager integration, you need to configure the included Alertmanager plugin; see AlertManager for details.
Alerts Attributes
Although Alertmanager's inner workings and configuration are beyond the scope of this guide, it is useful to cover a few details briefly, as they make it easier to get the most out of this feature.
Alerts are posted to Alertmanager as JSON objects. Each alert has a set of labels and a set of annotations. The set of labels uniquely identifies the alert, whilst the annotations provide a more detailed description of the event.
Alertmanager can use the set of labels to deduplicate, group, route, silence and inhibit alerts, whilst the annotations (and a field called generatorURL) can be sent, along with the labels, to a recipient to help quickly understand the issue.
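The idea that labels, and not annotations, identify an alert can be illustrated with a short Python sketch. The `label_identity` helper and the sample alerts are hypothetical, and the hashing scheme is only an illustration; it is not Alertmanager's actual fingerprint algorithm.

```python
import hashlib
import json


def label_identity(alert: dict) -> str:
    """Return a stable identifier derived from the alert's labels only."""
    # Sort the keys so that label order does not change the result.
    canonical = json.dumps(alert["labels"], sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


first = {
    "labels": {"category": "Infrastructure", "instance": "broker-1", "severity": "HIGH"},
    "annotations": {"source": "Lenses", "summary": "Broker is almost fully utilized"},
}
# Same labels in a different order, with a different annotation text.
second = {
    "labels": {"severity": "HIGH", "instance": "broker-1", "category": "Infrastructure"},
    "annotations": {"source": "Lenses", "summary": "Utilization is still high"},
}
# One label differs, so this counts as a different alert.
third = {
    "labels": {"category": "Infrastructure", "instance": "broker-1", "severity": "CRITICAL"},
    "annotations": {"source": "Lenses", "summary": "Broker is almost fully utilized"},
}

print(label_identity(first) == label_identity(second))  # True
print(label_identity(first) == label_identity(third))   # False
```

Because the annotations are left out of the identifier, changing the summary text does not create a new alert, whereas changing any label does.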
Lenses alerts offer these main labels [1]:
label name | description | values |
---|---|---|
category | the category of the alert | Infrastructure, Consumers, Kafka Connect, Topics |
instance | the URL or the subsystem that triggered the event | can be the address of a broker, a description like UnderReplication, etc. |
severity | the severity of the event | INFO, MEDIUM, HIGH, CRITICAL [2] |

[1] There are more labels, but they vary by instance. Only these three are present in all alerts and can be considered stable, so the Alertmanager configuration can be built around them.

[2] There is also a LOW level, but it is currently not in use.
Lenses alerts have these annotations:
annotation name | description | values |
---|---|---|
source | the source of the event | default is Lenses unless configured otherwise |
summary | the summary of the event | depends on the alert |
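To make the labels and annotations described above concrete, here is a minimal Python sketch of what such an alert could look like as JSON. The `build_alert` helper, the instance value and the generator URL are hypothetical; the exact payload Lenses sends is not shown in this guide. Alertmanager's HTTP API accepts alerts as a JSON list, which is why the single alert is wrapped in a list.

```python
import json


def build_alert(category, instance, severity, summary,
                source="Lenses", generator_url=""):
    """Build one alert object with Lenses-style labels and annotations."""
    return {
        # Labels identify the alert and drive routing, grouping and inhibition.
        "labels": {"category": category, "instance": instance, "severity": severity},
        # Annotations carry the human-readable details.
        "annotations": {"source": source, "summary": summary},
        "generatorURL": generator_url,
    }


alert = build_alert(
    category="Infrastructure",
    instance="broker-1",                        # hypothetical instance value
    severity="CRITICAL",
    summary="Broker is offline",
    generator_url="http://lenses.example.com",  # hypothetical Lenses URL
)

payload = json.dumps([alert], indent=2)
print(payload)
```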
Alertmanager Example
In the example below, Alertmanager is configured with three receivers: default, urgent and emergency. Three backends are available as well: email, Slack and Pushover. The default receiver sends events only to Slack. The urgent receiver sends events to both email and Slack, whilst the emergency receiver sends to all three backends.
The routing rules send Infrastructure category events of HIGH or CRITICAL severity to the emergency receiver, so the team receives push notifications on their mobile phones and can act immediately. Events of HIGH or CRITICAL severity in all other categories are sent to the urgent receiver, so the proper team member gets an email notification. Finally, all other events (those that did not match a routing rule) are sent to the default receiver, which posts them to a Slack channel, where a team member can look at them at a convenient time.
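The routing behaviour described above can be sketched in a few lines of Python. This is only a simulation of first-match routing under the stated rules, not Alertmanager code; `ROUTES` and `pick_receiver` are hypothetical names. Patterns are matched against the whole label value, and a missing label is treated as an empty string.

```python
import re

# Ordered routing rules: the first rule whose matchers all match wins.
ROUTES = [
    ("emergency", {"severity": "HIGH|CRITICAL", "category": "Infrastructure"}),
    ("urgent", {"severity": "HIGH|CRITICAL"}),
]
DEFAULT_RECEIVER = "default"


def pick_receiver(labels: dict) -> str:
    """Return the receiver name for an alert with the given labels."""
    for receiver, matchers in ROUTES:
        if all(re.fullmatch(pattern, labels.get(name, ""))
               for name, pattern in matchers.items()):
            return receiver
    # No rule matched, so the alert falls through to the default receiver.
    return DEFAULT_RECEIVER


print(pick_receiver({"category": "Infrastructure", "severity": "CRITICAL"}))  # emergency
print(pick_receiver({"category": "Consumers", "severity": "HIGH"}))           # urgent
print(pick_receiver({"category": "Infrastructure", "severity": "MEDIUM"}))    # default
```

The order of the rules matters: an Infrastructure alert of HIGH severity also satisfies the second rule, but it never reaches it because the first rule matches first.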
Two inhibition rules are also set. If a CRITICAL alert is triggered, Alertmanager will not send notifications for any other events until the CRITICAL issue is resolved. This is because the main problem (for example, a broker that can no longer serve requests) is likely to cause further problems, and team members should get a single notification for the root cause rather than be flooded with notifications. The second inhibition rule applies to events of severity HIGH. In that case, events from the same instance but with lower severity are inhibited until the main alert for this instance is resolved.
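The two inhibition rules can be simulated with a small Python sketch. It is an illustration of the logic only, not Alertmanager's implementation; `INHIBIT_RULES`, `matches` and `is_inhibited` are hypothetical names, and the sample alerts carry just the three stable labels. A label that is missing from both alerts is compared as an empty string and therefore counts as equal.

```python
import re

INHIBIT_RULES = [
    # A CRITICAL alert mutes every lower severity with the same 'source' label.
    {"source": {"severity": "CRITICAL"},
     "target": {"severity": "INFO|MEDIUM|HIGH"},
     "equal": ["source"]},
    # A HIGH alert mutes lower severities for the same instance and 'source'.
    {"source": {"severity": "HIGH"},
     "target": {"severity": "INFO|MEDIUM"},
     "equal": ["instance", "source"]},
]


def matches(labels: dict, matchers: dict) -> bool:
    """True if every matcher pattern matches the whole label value."""
    return all(re.fullmatch(pattern, labels.get(name, ""))
               for name, pattern in matchers.items())


def is_inhibited(target: dict, firing: list) -> bool:
    """True if any firing alert inhibits the target under some rule."""
    for rule in INHIBIT_RULES:
        if not matches(target, rule["target"]):
            continue
        for source in firing:
            same = all(source.get(n, "") == target.get(n, "") for n in rule["equal"])
            if same and matches(source, rule["source"]):
                return True
    return False


critical = {"category": "Infrastructure", "instance": "broker-1", "severity": "CRITICAL"}
high = {"category": "Infrastructure", "instance": "broker-2", "severity": "HIGH"}
medium_same = {"category": "Infrastructure", "instance": "broker-2", "severity": "MEDIUM"}
medium_other = {"category": "Infrastructure", "instance": "broker-3", "severity": "MEDIUM"}

print(is_inhibited(medium_same, [high]))   # True: same instance, lower severity
print(is_inhibited(medium_other, [high]))  # False: different instance
print(is_inhibited(high, [critical]))      # True: a CRITICAL alert is firing
print(is_inhibited(critical, [high]))      # False: CRITICAL is never a target
```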
For the Slack notification, a custom text is set, which includes the summary, source and generatorURL.
global:
  slack_api_url: https://hooks.slack.com/services/XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
  smtp_from: alertmanager@example.com
  smtp_smarthost: smtp.example.com:25
  smtp_auth_username: SMTP_USER
  smtp_auth_password: SMTP_PASS
route:
  # If an alert does not match any rule, it goes to the default:
  receiver: 'default'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  group_by: [category,severity]
  routes:
    - receiver: emergency
      match_re:
        severity: HIGH|CRITICAL
        category: Infrastructure
    - receiver: urgent
      # match_re is required here: a plain match compares the literal string.
      match_re:
        severity: HIGH|CRITICAL
inhibit_rules:
  - source_match:
      severity: 'CRITICAL'
    target_match_re:
      severity: 'INFO|MEDIUM|HIGH'
    equal: ['source']
  - source_match:
      severity: 'HIGH'
    target_match_re:
      severity: 'INFO|MEDIUM'
    equal: ['instance','source']
receivers:
  - name: 'default'
    slack_configs:
      - channel: alerts
        send_resolved: true
        text: "{{ range .Alerts }}{{ .Labels.instance }}: {{ .Annotations.summary }}.\nVia: {{ .Annotations.source }}\nGenerator: {{ .GeneratorURL }}\n{{ end }}"
  - name: 'urgent'
    email_configs:
      - to: 'user1@example.com, user2@example.com, user3@example.com'
    slack_configs:
      - channel: alerts
        send_resolved: true
        text: "{{ range .Alerts }}{{ .Labels.instance }}: {{ .Annotations.summary }}.\nVia: {{ .Annotations.source }}\nGenerator: {{ .GeneratorURL }}\n{{ end }}"
  - name: 'emergency'
    pushover_configs:
      - user_key: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
        token: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
        expire: 2m
    email_configs:
      - to: 'user1@example.com, user2@example.com, user3@example.com'
    slack_configs:
      - channel: alerts
        send_resolved: true
        text: "{{ range .Alerts }}{{ .Labels.instance }}: {{ .Annotations.summary }}.\nVia: {{ .Annotations.source }}\nGenerator: {{ .GeneratorURL }}\n{{ end }}"
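To preview what the custom Slack text in the configuration above produces, the template can be mirrored in plain Python. This is a rough equivalent for illustration only; Alertmanager renders the real message with its own template engine, and `render_slack_text` and the sample values are hypothetical.

```python
def render_slack_text(alerts: list) -> str:
    """Mimic the Slack 'text' template: one three-line entry per alert."""
    parts = []
    for alert in alerts:
        parts.append(
            f"{alert['labels']['instance']}: {alert['annotations']['summary']}.\n"
            f"Via: {alert['annotations']['source']}\n"
            f"Generator: {alert['generatorURL']}\n"
        )
    return "".join(parts)


sample = [{
    "labels": {"category": "Infrastructure", "instance": "broker-1", "severity": "HIGH"},
    "annotations": {"source": "Lenses", "summary": "Broker is almost fully utilized"},
    "generatorURL": "http://lenses.example.com",  # hypothetical Lenses URL
}]

print(render_slack_text(sample))
```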
Alerts Reference
Each alert is configured to fire on multiple criteria. Depending on the conditions, several tags are applied to each alert (for example, the topic name), and the description and severity of the alert change accordingly. The alerts Lenses can produce are listed below, along with their category, instance and severity.
Please note that alerts of severity INFO **are not sent to** Alertmanager.
Alert | Description | Category | Instance | Severity |
---|---|---|---|---|
TopicAdded | New topic was added | Topics | topic | INFO |
TopicDelete | Topic was deleted | Topics | topic | INFO |
ConnectorDeleted | Connector was deleted | KafkaConnector | connector name | INFO |
SchemaRegistryStatus | Status of Schema Registry | Infrastructure | service URL | INFO, HIGH |
UnderReplicatedPartitions | Some partitions are under replicated | Infrastructure | partitions | INFO, HIGH |
FailedProduceRequestPerSec | Rate of failed requests is above threshold | Infrastructure | brokerID | INFO, HIGH |
ConsumerLag | A consumer group is falling behind | Consumers | topic | INFO, HIGH |
LeaderImbalance | A broker has a large number of leader replicas | Infrastructure | brokerID | INFO |
FileOpenDescriptorsCapacity | A broker has too many open file descriptors | Infrastructure | brokerID | INFO, HIGH, CRITICAL |
ActiveControllers | High number of active controllers | Infrastructure | brokers | INFO, HIGH |
ConnectStatus | Connect client has gone offline | Infrastructure | worker URL | MEDIUM |
ZookeeperStatus | Zookeeper node is offline | Infrastructure | service name | INFO, CRITICAL |
RequestHandlerAvgIdlePercent | Broker is almost fully utilized | Infrastructure | brokerID | INFO, HIGH, CRITICAL |
MultipleBrokerVersions | Brokers version mismatch | Infrastructure | brokers versions | INFO, HIGH |
PartitionsOffline | Some partitions are offline | Infrastructure | brokers | INFO, HIGH |
FailedFetchRequestPerSec | Rate of failed fetch requests is above threshold | Infrastructure | brokerID | INFO, HIGH, CRITICAL |
BrokerDiskUsage | Disk usage of broker is higher than average | Infrastructure | brokerID | INFO, MEDIUM |
BrokerStatus | Broker is offline | Infrastructure | brokerID | INFO, CRITICAL |
LicenseStatus | License is invalid | Infrastructure | lenses | CRITICAL |
TopicDataDelete | Records from topic were deleted | Infrastructure | topic | INFO |