Alertmanager Integration

Lenses comes with an alerting subsystem that can be tailored to match individual needs. You can find more information about the alerting subsystem at the user guide section.

For an alerting system to be complete, there is usually the requirement for alerts management and notifications. In simple terms, there has to be a way for an alert to reach the proper team within the proper time frame. And all of this without overwhelming the team with alerts which does not involve it, duplicate the entries, or produce alerts which are a byproduct of a top level alert. Hence, Lenses integrates with the Alertmanager software which provides alerts de-duplication, grouping and routing via various systems (such as email, pager-duty, slack) as well as silencing and inhibition.

Lenses Configuration

To configure the Alertmanager integration, just a couple configuration entries are required. The only mandatory option is the Alertmanager endpoint setting. Alertmanager clusters (multiple endpoints) are well supported:

lenses.alert.manager.endpoints="http://alertmanager.1.url:9093,http://alertmanager.2.url:9093"

A useful option is the generator URL, where the Lenses address should be set. This way alerts will include a link to Lenses, which the recipient can use to quickly navigate to the web interface:

lenses.alert.manager.generator.url="http://lenses.url:9991"

For the complete list of reference options, please visit the configuration reference.

../../_images/l-am-ss.png

Lenses® | Alertmanager Service Status

Alerts Attributes

Although Alertmanager inner workings and configuration is beyond the scope of this guide, it is useful to briefly go into some of the details. This way it will be easier to get the most out of this feature.

Alerts are posted to Alertmanager as JSON objects. Each alert has a set of labels and a set of annotations. The set of labels is what uniquely identifies the alert, whilst the annotations serve as further elaborate descriptions of the event.

Alertmanager can use the set of labels in order to deduplicate, group, route, silence and inhibit alerts, whilst the annotations (and a field called generatorURL) can be sent, along with the labels, to a recipient to help quickly understand the issue.

Lenses alerts offer these main labels [1]:

label name description values
category the category of the alert Infrastructure, Consumers, Kafka Connect, Topics
instance the URL or the subsystem that triggered the event can be the address of a broker, a description like UnderReplication, etc
severity the severity of the event INFO, MEDIUM, HIGH, CRITICAL [2]
[1]There are more labels actually but vary by instance. Only these three are present in all alerts and can be considered stable, so that the alertmanager configuration may be built around them.
[2]There is also a LOW level but it is not in use currently.

Lenses alerts have these annotations:

annotation name description values
source the source of the event default is Lenses unless configured otherwise
summary the summary of the event depends on the alert

Alertmanager Example

In the example below, Alertmanager is configured with three receivers: default, urgent and emergency. There are three backends available as well: email, slack and pushover. The default receiver sends events only to slack. The urgent receiver sends events to both email and slack, whilst the emergency receiver sends to all three backends.

The routing rules send Infrastructure category events of HIGH or CRITICAL severity to the emergency receiver so the team can receive push notifications to their mobile phones and act immediately. The rest of the categories of events of HIGH or CRITICAL severity are sent to the urgent receiver, so the proper team member can get email notifications. At last, all other events (events that didn’t match a routing rule) will be sent to the default receiver, which will post them to a slack channel, where a member of the team can look at a time of convenience.

Also two inhibition rules are set. If a CRITICAL alert is triggered, Alertmanager will not sent notifications for any other events until the CRITICAL issue is resolved. This is because the main problem (maybe a broker can no more serve requests) will cause more problems. Team members should not be flooded with notifications but rather get one notification for the root cause. The second inhibition rule applies to events of severity HIGH. In that case events from the same instance but with lower severity will be inhibited until the main alert for this instance is resolved.

For the Slack notification a custom text is set, which includes the summary, source and generatorURL.

global:
  slack_api_url: https://hooks.slack.com/services/XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
  smtp_from: alertmanager@example.com
  smtp_smarthost: smtp.example.com:25
  smtp_auth_username: SMTP_USER
  smtp_auth_password: SMTP_PASS

route:
  # If an alert does not match any rule, it goes to the default:
  receiver: 'default'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  group_by: [category,severity]
  routes:
  - receiver: emergency
    match_re:
      severity: HIGH|CRITICAL
      category: Infrastructure
  - receiver: urgent
    match:
      severity: HIGH|CRITICAL

inhibit_rules:
- source_match:
    severity: 'CRITICAL'
  target_match_re:
    severity: 'INFO|MEDIUM|HIGH'
  equal: ['source']
- source_match:
    severity: 'HIGH'
  target_match_re:
    severity: 'INFO|MEDIUM'
  equal: ['instance','source']

receivers:
- name: 'default'
  slack_configs:
  - channel: alerts
    send_resolved: true
    text: {% raw %}"{{ range .Alerts }}{{ .Labels.instance }}: {{ .Annotations.summary }}.\nVia: {{ .Annotations.source }}\nGenerator: {{ .GeneratorURL }}\n{{ end }}"{% endraw %}

- name: 'urgent'
  email_configs:
  - to: 'user1@example.com, user2@example.com, user3@example.com'
  slack_configs:
  - channel: alerts
    send_resolved: true
    text: {% raw %}"{{ range .Alerts }}{{ .Labels.instance }}: {{ .Annotations.summary }}.\nVia: {{ .Annotations.source }}\nGenerator: {{ .GeneratorURL }}\n{{ end }}"{% endraw %}

- name: 'emergency'
  pushover_configs:
  - user_key: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
    token: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
    expire: 2m
  email_configs:
  - to: 'user1@example.com, user2@example.com, user3@example.com'
  slack_configs:
  - channel: alerts
    send_resolved: true
    text: {% raw %}"{{ range .Alerts }}{{ .Labels.instance }}: {{ .Annotations.summary }}\nVia: {{ .Annotations.source }}\nGenerator: {{ .GeneratorURL }}\n{{ end }}"{% endraw %}

Alerts Reference

Each alert is configured to fire on multiple criteria. Depending on the conditions, a number of tags is applied to each alert (for example topic name) and the description and severity of the alert is changed appropriately. The list of alerts Lenses can produce, along with the category, instance and severity can be found below.

Please note that alerts of severity INFO are not send to Alertmanager.

Alerts Table
Alert Description Category Instance Severity
TopicAdded
New topic was added
Topics topic INFO
TopicDelete
Topic was deleted
Topics topic INFO
ConnectorDeleted
Connector was deleted
KafkaConnector connector name INFO
SchemaRegistryStatus
Status of Schema Registry
Infrastructure service URL INFO, HIGH
UnderReplicatedPartitions
Some partitions are under replicated
Infrastructure partitions INFO, HIGH
FailedProduceRequestPerSec
Rate of failed requests is above threshold
Infrastructure brokerID INFO, HIGH
ConsumerLag
A consumer group is falling behind
Consumers topic INFO, HIGH
LeaderImbalance
A broker has a large number of leader replicas
Infrastructure brokerID INFO
FileOpenDescriptorsCapacity
A broker has too many open file descriptors
Infrastructure brokerID INFO, HIGH, CRITICAL
ActiveControllers
High number of active controllers
Infrastructure brokers HIGH, INFO
ConnectStatus
Connect client has gone offline
Infrastructure worker URL MEDIUM
ZookeeperStatus
Zookeeper node is offline
Infrastructure service name INFO, CRITICAL
RequestHandlerAvgIdlePercent
Broker is almost fully utilized
Infrastructure brokerID INFO, HIGH, CRITICAL
MultipleBrokerVersions
Brokers version mismatch
Infrastructure brokers versions INFO, HIGH
PartitionsOffline
Some partitions are offline
Infrastructure brokers INFO, HIGH
FailedFetchRequestPerSec
Rate of failed fetch requests is above threshold
Infrastructure brokerID INFO, HIGH, CRITICAL
BrokerDiskUsage
Disk usage of broker is higher than average
Infrastructure brokerID INFO, MEDIUM
BrokerStatus
Broker is offline
Infrastructure brokerID INFO, CRITICAL
LicenseStatus
License is invalid
Infrastructure lenses CRITICAL
TopicDataDelete
Records from topic were deleted
Infrastructure topic INFO