Production Monitoring & Alerting
A Kafka cluster that you cannot observe is a Kafka cluster you cannot operate. Brokers fail silently, partitions drift out of sync, and consumer lag balloons long before anyone notices the downstream data is stale. A production monitoring stack turns those invisible failure modes into dashboards you can read and alerts that page the right person at 3 a.m. This page walks through a battle-tested observability setup: JMX exporters feeding Prometheus, Grafana dashboards for cluster/topic/consumer views, dedicated lag monitoring, log aggregation, and a tuned alert set that fires on signal and stays quiet on noise.
The metrics pipeline
Kafka exposes everything you need over JMX. The standard pattern is to run the Prometheus JMX Exporter as a Java agent inside each broker JVM, which scrapes MBeans and republishes them as Prometheus metrics on an HTTP endpoint. Prometheus then scrapes each broker on a fixed interval, and Grafana queries Prometheus for visualization.
Broker JVM (+ JMX Exporter agent :7071) ──┐
Broker JVM (+ JMX Exporter agent :7071) ──┼──► Prometheus ──► Grafana
Broker JVM (+ JMX Exporter agent :7071) ──┘        │
                                                   └──► Alertmanager ──► PagerDuty/Slack
Attach the agent and its rules file via the broker’s JVM options:
export KAFKA_OPTS="-javaagent:/opt/jmx_exporter/jmx_prometheus_javaagent.jar=7071:/opt/jmx_exporter/kafka-broker.yml"
The exporter rules file maps verbose MBean names to clean Prometheus metric names. A minimal but useful subset:
# kafka-broker.yml
lowercaseOutputName: true
rules:
  - pattern: kafka.server<type=ReplicaManager, name=(UnderReplicatedPartitions)><>Value
    name: kafka_server_replicamanager_underreplicatedpartitions
  - pattern: kafka.controller<type=KafkaController, name=(OfflinePartitionsCount)><>Value
    name: kafka_controller_offlinepartitionscount
  # Needed by the KafkaActiveControllerCount alert defined later on this page.
  - pattern: kafka.controller<type=KafkaController, name=(ActiveControllerCount)><>Value
    name: kafka_controller_kafkacontroller_activecontrollercount
  - pattern: kafka.server<type=ReplicaManager, name=(IsrShrinksPerSec)><>Count
    name: kafka_server_isr_shrinks_total
  - pattern: kafka.server<type=BrokerTopicMetrics, name=(BytesInPerSec|BytesOutPerSec)><>Count
    name: kafka_server_brokertopicmetrics_$1_total
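To make the mapping concrete, here is a minimal Python sketch of how a rule turns an MBean attribute string into a metric name. It is a simplified model, not the exporter's implementation: it assumes the exporter matches each pattern unanchored against a string shaped like `domain<key=value, ...><>attribute: value`, substitutes `$1` with the first capture group, and lowercases the result when `lowercaseOutputName` is set. The function name `metric_name` is hypothetical.

```python
import re
from typing import Optional

# (pattern, name template) pairs taken from two of the rules in kafka-broker.yml.
RULES = [
    (r"kafka.server<type=ReplicaManager, name=(UnderReplicatedPartitions)><>Value",
     "kafka_server_replicamanager_underreplicatedpartitions"),
    (r"kafka.server<type=BrokerTopicMetrics, name=(BytesInPerSec|BytesOutPerSec)><>Count",
     "kafka_server_brokertopicmetrics_$1_total"),
]


def metric_name(bean_attribute: str, lowercase: bool = True) -> Optional[str]:
    """Return the metric name from the first matching rule, or None if no rule matches."""
    for pattern, template in RULES:
        match = re.search(pattern, bean_attribute)  # unanchored match (assumption)
        if match is None:
            continue
        # Replace $1, $2, ... in the template with the corresponding capture groups.
        name = re.sub(r"\$(\d+)", lambda m: match.group(int(m.group(1))), template)
        return name.lower() if lowercase else name
    return None


print(metric_name("kafka.server<type=BrokerTopicMetrics, name=BytesInPerSec><>Count: 12345"))
```

An attribute that matches no rule returns `None` here, which mirrors why a metric you alert on needs a matching rule in the file.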
Point Prometheus at the brokers and tell it where to find your alert rules:
# prometheus.yml
global:
  scrape_interval: 15s
rule_files:
  - /etc/prometheus/kafka-alerts.yml
scrape_configs:
  - job_name: kafka-brokers
    static_configs:
      - targets: ["broker-0:7071", "broker-1:7071", "broker-2:7071"]
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]
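Before trusting the pipeline, it helps to confirm that each broker endpoint actually exposes the metrics your alerts reference. The sketch below parses Prometheus text exposition output and reports which required names are missing. The sample payload is made up for illustration, and the helper names (`exposed_metric_names`, `missing_metrics`) are hypothetical.

```python
from typing import Iterable, List, Set

# Metric names referenced by the alert rules on this page.
REQUIRED = [
    "kafka_server_replicamanager_underreplicatedpartitions",
    "kafka_controller_offlinepartitionscount",
    "kafka_controller_kafkacontroller_activecontrollercount",
]

# Illustrative payload only; a real broker exposes many more series.
SAMPLE_EXPOSITION = """\
# HELP kafka_server_replicamanager_underreplicatedpartitions (illustrative help text)
# TYPE kafka_server_replicamanager_underreplicatedpartitions untyped
kafka_server_replicamanager_underreplicatedpartitions 0.0
kafka_controller_offlinepartitionscount 0.0
jvm_memory_bytes_used{area="heap"} 1.234e+08
"""


def exposed_metric_names(exposition_text: str) -> Set[str]:
    """Collect metric names from text exposition output, skipping comments."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # A sample line is either `name{labels} value` or `name value`.
        names.add(line.split("{", 1)[0].split(" ", 1)[0])
    return names


def missing_metrics(exposition_text: str, required: Iterable[str]) -> List[str]:
    """Return the required metric names that the endpoint does not expose."""
    exposed = exposed_metric_names(exposition_text)
    return [name for name in required if name not in exposed]


print(missing_metrics(SAMPLE_EXPOSITION, REQUIRED))
```

In practice you would fetch the text from each target (for example `http://broker-0:7071/metrics`) and feed it to `missing_metrics`; an alert whose metric is absent never fires, so a non-empty result is worth fixing before go-live.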
Grafana dashboards
Build (or import) three layers of dashboards so each persona finds what they need quickly.
| Dashboard | Key panels | Audience |
|---|---|---|
| Cluster overview | Active controller count, online brokers, under-replicated/offline partitions, total bytes in/out | On-call SRE |
| Topic detail | Per-topic throughput, message rate, partition skew, log size | Platform team |
| Consumer group | Consumer lag per group/partition, commit rate, rebalance count | App owners |
A good starting point is the community dashboard Kafka Exporter Overview (Grafana ID 7589) plus the JMX/Strimzi Kafka dashboards. Treat imported dashboards as a baseline and prune panels you do not understand — a dashboard full of metrics nobody reads is just clutter.
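Always pin a single, large “controller count” panel on the cluster overview. A healthy cluster, whether ZooKeeper-based or KRaft, has exactly one active controller. Zero means an outage; more than one means split-brain. In KRaft deployments with dedicated controller nodes, the controller metrics are reported by those nodes, so make sure they are scrape targets too.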
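The rule behind that panel is small enough to state as code. This sketch sums the per-node active-controller gauge and classifies the result; the node names and the `controller_health` helper are hypothetical.

```python
from typing import Dict


def controller_health(active_by_node: Dict[str, int]) -> str:
    """Classify cluster state from each node's active-controller gauge (0 or 1)."""
    total = sum(active_by_node.values())
    if total == 0:
        return "no-controller"
    if total == 1:
        return "healthy"
    return "split-brain"


print(controller_health({"node-0": 0, "node-1": 1, "node-2": 0}))
```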
Consumer lag monitoring
Broker metrics tell you the cluster is healthy; they do not tell you whether consumers are keeping up. Lag is the gap between the latest offset produced and the latest offset a group has committed. The two standard tools are Burrow (LinkedIn’s evaluation-based lag checker) and the Kafka Lag Exporter (which exposes lag directly to Prometheus and even estimates lag in time, not just message count).
Kafka Lag Exporter is the simpler choice for a Prometheus-native stack:
# application.conf for kafka-lag-exporter
kafka-lag-exporter {
  port = 8000
  poll-interval = 30 seconds
  clusters = [
    {
      name = "prod"
      bootstrap-brokers = "broker-0:9092,broker-1:9092,broker-2:9092"
    }
  ]
}
It publishes kafka_consumergroup_group_lag (in records) and kafka_consumergroup_group_lag_seconds (estimated time behind). The time-based metric is what you should alert on: “10,000 messages behind” is meaningless without throughput context, but “5 minutes behind” is universally understandable.
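To see how a record count can become a time estimate, consider this simplified sketch. It is an illustration of the interpolation idea, not the exporter's actual code: it assumes you keep a short history of `(timestamp, log_end_offset)` samples per partition and asks when the log end offset was equal to the group's committed offset. The function name and sample values are hypothetical.

```python
from bisect import bisect_left
from typing import List, Optional, Tuple


def estimate_lag_seconds(
    samples: List[Tuple[float, int]],  # (unix_seconds, log_end_offset), ascending
    committed_offset: int,
    now: float,
) -> Optional[float]:
    """Estimate how long ago the committed offset was the newest record."""
    if not samples:
        return None
    if committed_offset >= samples[-1][1]:
        return 0.0  # the group is caught up with the latest sample
    offsets = [offset for _, offset in samples]
    i = bisect_left(offsets, committed_offset)
    if i == 0:
        # Committed offset is at or before the oldest sample: this is only a lower bound.
        return now - samples[0][0]
    (t0, o0), (t1, o1) = samples[i - 1], samples[i]
    # Linear interpolation between the two samples that bracket the committed offset.
    produced_at = t0 + (committed_offset - o0) / (o1 - o0) * (t1 - t0)
    return now - produced_at


history = [(0.0, 100), (60.0, 200), (120.0, 300)]
print(estimate_lag_seconds(history, committed_offset=250, now=120.0))
```

The same 50-record gap would translate to a very different number of seconds on a topic with a different produce rate, which is why the time-based metric is the better alerting signal.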
For ad-hoc investigation from the CLI:
kafka-consumer-groups.sh --bootstrap-server broker-0:9092 \
--describe --group orders-service
Output:
GROUP           TOPIC   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
orders-service  orders  0          1048231         1048240         9
orders-service  orders  1          1048190         1051002         2812
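The LAG column is simply LOG-END-OFFSET minus CURRENT-OFFSET. If you want to post-process the CLI output in a script, a sketch like the following recomputes lag per partition from the table above. It assumes whitespace-separated columns and keys on header names, so extra columns in real output are ignored; the helper names are hypothetical.

```python
from typing import Dict, List, Tuple

SAMPLE_OUTPUT = """\
GROUP           TOPIC   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
orders-service  orders  0          1048231         1048240         9
orders-service  orders  1          1048190         1051002         2812
"""


def parse_describe(output: str) -> List[Dict[str, str]]:
    """Turn the --describe table into a list of {column: value} rows."""
    lines = [line for line in output.splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def lag_by_partition(rows: List[Dict[str, str]]) -> Dict[Tuple[str, int], int]:
    """Compute lag as log end offset minus committed offset for each partition."""
    lags = {}
    for row in rows:
        if row["CURRENT-OFFSET"] == "-":  # the group has not committed on this partition
            continue
        lag = int(row["LOG-END-OFFSET"]) - int(row["CURRENT-OFFSET"])
        lags[(row["TOPIC"], int(row["PARTITION"]))] = lag
    return lags


lags = lag_by_partition(parse_describe(SAMPLE_OUTPUT))
print(lags, "total:", sum(lags.values()))
```

Here partition 1 carries almost all of the lag, which usually points at a slow or stuck consumer instance rather than a group-wide throughput problem.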
Log aggregation
Metrics show that something broke; logs show why. Ship broker logs (and ideally the controller and state-change logs) to a centralized store — the ELK/OpenSearch stack or Loki paired with the same Grafana you already run. With Loki, a Promtail agent tails the Kafka log directory:
# promtail-kafka.yml
scrape_configs:
  - job_name: kafka
    static_configs:
      - targets: [localhost]
        labels:
          job: kafka
          host: broker-0
          __path__: /var/log/kafka/*.log
Index on broker host and log level so you can pivot instantly from a “broker down” alert to that broker’s error stream. Watch for repeated ISR shrink, LeaderNotAvailable, and segment/disk errors — these usually precede a metrics-visible incident by minutes.
A tuned alert set
Alerts should map to operator action. Page on conditions that require a human; route everything else to a dashboard or a low-priority channel. Define them as Prometheus rules:
# kafka-alerts.yml
groups:
  - name: kafka
    rules:
      - alert: KafkaUnderReplicatedPartitions
        expr: kafka_server_replicamanager_underreplicatedpartitions > 0
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "Under-replicated partitions on {{ $labels.instance }}"
      - alert: KafkaOfflinePartitions
        expr: kafka_controller_offlinepartitionscount > 0
        for: 1m
        labels: { severity: critical }
        annotations:
          summary: "Offline partitions detected: data unavailable"
      - alert: KafkaConsumerLagSeconds
        expr: kafka_consumergroup_group_lag_seconds > 300
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "Group {{ $labels.group }} is >5m behind"
      - alert: KafkaBrokerDiskFilling
        # node_filesystem_* metrics come from node_exporter, which must run on each broker host.
        expr: (node_filesystem_avail_bytes{mountpoint="/var/lib/kafka"} / node_filesystem_size_bytes{mountpoint="/var/lib/kafka"}) < 0.15
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "Less than 15% disk free on {{ $labels.instance }}"
      - alert: KafkaActiveControllerCount
        expr: sum(kafka_controller_kafkacontroller_activecontrollercount) != 1
        for: 1m
        labels: { severity: critical }
        annotations:
          summary: "Active controller count is not exactly 1"
The reference signals every Kafka on-call should know:
| Signal | Healthy value | Why it matters |
|---|---|---|
| Under-replicated partitions | 0 | Replication is falling behind; one more failure risks data loss |
| Offline partitions | 0 | Partition has no leader — producers and consumers are blocked |
| Active controller count | 1 | 0 = no controller; >1 = split-brain |
| ISR shrink rate | ~0 | Frequent shrinks signal flaky brokers or network |
| Consumer lag (seconds) | < SLA | Downstream data is stale |
| Disk free | > 15% | A full log dir takes the broker offline |
Use the for: clause aggressively. A momentary under-replicated count during a rolling restart is normal; sustaining it for five minutes is not. Without for:, you train your team to ignore the pager.
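The effect of for: is easy to see in a small simulation. The sketch below approximates the behaviour: an alert only fires once its condition has held continuously for the configured duration, and any evaluation where the condition is false resets the clock. It is a simplified model of the rule lifecycle, and the sample series are made up.

```python
from typing import List, Tuple


def firing_times(
    samples: List[Tuple[int, float]],  # (evaluation time in seconds, metric value)
    threshold: float,
    for_seconds: int,
) -> List[int]:
    """Return the evaluation times at which the alert would be firing."""
    pending_since = None
    firing = []
    for t, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = t  # condition just became true: alert is pending
            if t - pending_since >= for_seconds:
                firing.append(t)
        else:
            pending_since = None  # condition cleared: reset the pending timer
    return firing


# Two-minute blip (rolling restart) versus a sustained problem, evaluated every 60s.
blip = [(t, 3.0 if t in (120, 180) else 0.0) for t in range(0, 660, 60)]
sustained = [(t, 3.0 if t >= 120 else 0.0) for t in range(0, 660, 60)]
print(firing_times(blip, 0, 300), firing_times(sustained, 0, 300))
```

With a five-minute window the blip never pages anyone, while the sustained condition starts firing five minutes after it began.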
Best Practices
- Alert in time units for lag (group_lag_seconds), not raw message counts; it is throughput-independent and SLA-aligned.
- Run the JMX exporter as an in-JVM agent rather than a separate sidecar; it is lower-latency and one less moving part.
- Keep critical alerts to a short, ruthlessly curated list (offline/under-replicated partitions, controller count, disk). Everything else is a warning or a dashboard.
- Co-locate metrics and logs in the same Grafana so on-call can pivot from “broker down” to “broker logs” in one click.
- Always alert when the active controller count is anything other than exactly one; it is the cheapest, highest-signal cluster-health check you have.
- Test your alerting path end to end (fire a synthetic alert and confirm the page lands) before you rely on it in production.
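- Set Prometheus retention so historical capacity trends survive long enough to inform planning. Prometheus does not downsample on its own, so add recording rules or a long-term storage layer if you need coarser, longer-lived series.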
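For the end-to-end test mentioned above, one option is to post a synthetic alert straight to Alertmanager's v2 HTTP API and confirm that the page arrives. The sketch below is a starting point under stated assumptions: the `alertmanager:9093` address comes from the Prometheus config on this page, while the alert name, the `synthetic` label, and both helper functions are hypothetical and should be adapted to your routing rules.

```python
import json
from datetime import datetime, timedelta, timezone
from urllib import request


def build_synthetic_alert(now: datetime, ttl_minutes: int = 5) -> list:
    """Build the JSON body for POST /api/v2/alerts (a list of alert objects)."""
    return [{
        "labels": {
            "alertname": "SyntheticPagerTest",  # hypothetical name
            "severity": "critical",             # should match the route that pages
            "synthetic": "true",
        },
        "annotations": {"summary": "Synthetic alert: verifying the paging path"},
        "startsAt": now.isoformat(),
        # The alert resolves on its own once endsAt passes.
        "endsAt": (now + timedelta(minutes=ttl_minutes)).isoformat(),
    }]


def send_alerts(alerts: list, base_url: str = "http://alertmanager:9093") -> int:
    """POST the alerts to Alertmanager and return the HTTP status code."""
    req = request.Request(
        f"{base_url}/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=10) as resp:
        return resp.status


payload = build_synthetic_alert(datetime.now(timezone.utc))
print(json.dumps(payload, indent=2))
```

Calling `send_alerts(payload)` from a host that can reach Alertmanager submits the alert; the test passes only when a human confirms the page landed on the expected device.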