Production Monitoring & Alerting

A Kafka cluster that you cannot observe is a Kafka cluster you cannot operate. Brokers fail silently, partitions drift out of sync, and consumer lag balloons long before anyone notices the downstream data is stale. A production monitoring stack turns those invisible failure modes into dashboards you can read and alerts that page the right person at 3 a.m. This page walks through a battle-tested observability setup: JMX exporters feeding Prometheus, Grafana dashboards for cluster/topic/consumer views, dedicated lag monitoring, log aggregation, and a tuned alert set that fires on signal and stays quiet on noise.

The metrics pipeline

Kafka exposes everything you need over JMX. The standard pattern is to run the Prometheus JMX Exporter as a Java agent inside each broker JVM, which scrapes MBeans and republishes them as Prometheus metrics on an HTTP endpoint. Prometheus then scrapes each broker on a fixed interval, and Grafana queries Prometheus for visualization.

Broker JVM (+ JMX Exporter agent :7071) ──┐
Broker JVM (+ JMX Exporter agent :7071) ──┼──► Prometheus ──► Grafana
Broker JVM (+ JMX Exporter agent :7071) ──┘        │
                                                   └──► Alertmanager ──► PagerDuty/Slack

Attach the agent and its rules file through the broker’s JVM options. The agent argument takes the form port:path-to-rules-file:

export KAFKA_OPTS="-javaagent:/opt/jmx_exporter/jmx_prometheus_javaagent.jar=7071:/opt/jmx_exporter/kafka-broker.yml"

The exporter rules file maps verbose MBean names to clean Prometheus metric names. A minimal but useful subset:

# kafka-broker.yml
lowercaseOutputName: true
rules:
  - pattern: kafka.server<type=ReplicaManager, name=(UnderReplicatedPartitions)><>Value
    name: kafka_server_replicamanager_underreplicatedpartitions
  - pattern: kafka.controller<type=KafkaController, name=(OfflinePartitionsCount)><>Value
    name: kafka_controller_offlinepartitionscount
  # Needed by the active-controller alert later on this page.
  - pattern: kafka.controller<type=KafkaController, name=(ActiveControllerCount)><>Value
    name: kafka_controller_kafkacontroller_activecontrollercount
  - pattern: kafka.server<type=ReplicaManager, name=(IsrShrinksPerSec)><>Count
    name: kafka_server_isr_shrinks_total
    type: COUNTER
  - pattern: kafka.server<type=BrokerTopicMetrics, name=(BytesInPerSec|BytesOutPerSec)><>Count
    name: kafka_server_brokertopicmetrics_$1_total
    type: COUNTER
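
To make the mapping concrete, the following Python snippet roughly emulates what a single rule does to a single MBean attribute. It is a simplified sketch, not the exporter's implementation: the real agent is written in Java and also handles labels, help text, and name sanitizing. The helper name apply_rule and the exact string form of the bean are assumptions made for illustration.

```python
import re

def apply_rule(pattern, name_template, bean, lowercase=True):
    """Return the metric name a rule would yield for `bean`, or None if it does not match."""
    # The exporter's patterns are unanchored, so search rather than fullmatch.
    match = re.search(pattern, bean)
    if match is None:
        return None
    # Replace $1, $2, ... in the name template with the captured groups.
    name = re.sub(r"\$(\d+)", lambda m: match.group(int(m.group(1))), name_template)
    # lowercaseOutputName: true lowercases the final metric name.
    return name.lower() if lowercase else name

PATTERN = r"kafka.server<type=BrokerTopicMetrics, name=(BytesInPerSec|BytesOutPerSec)><>Count"
TEMPLATE = "kafka_server_brokertopicmetrics_$1_total"

bean = "kafka.server<type=BrokerTopicMetrics, name=BytesInPerSec><>Count"
print(apply_rule(PATTERN, TEMPLATE, bean))
# prints: kafka_server_brokertopicmetrics_bytesinpersec_total
```

An MBean that no rule matches yields None here, which mirrors why a metric referenced by an alert must have a matching rule in the file.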

Point Prometheus at the brokers and tell it where to find your alert rules:

# prometheus.yml
global:
  scrape_interval: 15s
rule_files:
  - /etc/prometheus/kafka-alerts.yml
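scrape_configs:
  - job_name: kafka-brokers
    static_configs:
      - targets: ["broker-0:7071", "broker-1:7071", "broker-2:7071"]
  # Also scrape Kafka Lag Exporter (configured on port 8000 further down) so the
  # consumer lag alert has data. The hostname here is a placeholder.
  - job_name: kafka-lag-exporter
    static_configs:
      - targets: ["kafka-lag-exporter:8000"]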
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]
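
Before relying on Prometheus, it can help to confirm that each exporter endpoint actually serves the metrics you expect. The Python sketch below parses the Prometheus text exposition format into a dictionary. It assumes samples carry no trailing timestamp, and the SAMPLE text is illustrative rather than captured from a real broker; parse_exposition and fetch_metrics are hypothetical helper names.

```python
import urllib.request

def parse_exposition(text):
    """Map each series (name plus any labels) to its float value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, value = line.rsplit(" ", 1)
        samples[series] = float(value)
    return samples

def fetch_metrics(url, timeout=5.0):
    """Fetch and parse an exporter endpoint, e.g. http://broker-0:7071/metrics."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_exposition(resp.read().decode("utf-8"))

# Illustrative sample only, not real broker output.
SAMPLE = """\
# TYPE kafka_server_replicamanager_underreplicatedpartitions gauge
kafka_server_replicamanager_underreplicatedpartitions 0.0
kafka_controller_offlinepartitionscount 0.0
"""

metrics = parse_exposition(SAMPLE)
print(metrics["kafka_server_replicamanager_underreplicatedpartitions"])
```

Calling fetch_metrics against each target in the scrape config is a quick way to catch a broker whose agent failed to load before the gap shows up on a dashboard.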

Grafana dashboards

Build (or import) three layers of dashboards so each persona finds what they need quickly.

Dashboard	Key panels	Audience
Cluster overview	Active controller count, online brokers, under-replicated/offline partitions, total bytes in/out	On-call SRE
Topic detail	Per-topic throughput, message rate, partition skew, log size	Platform team
Consumer group	Consumer lag per group/partition, commit rate, rebalance count	App owners

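A good starting point is the community dashboard Kafka Exporter Overview (Grafana ID 7589) plus the JMX/Strimzi Kafka dashboards. Note that dashboard 7589 is built around metrics from the separate kafka_exporter project, so its panels stay empty unless you run that exporter or remap the queries. Treat imported dashboards as a baseline and prune panels you do not understand: a dashboard full of metrics nobody reads is just clutter.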

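Always pin a single, large “controller count” panel on the cluster overview. A healthy KRaft cluster has exactly one active controller: a sum of zero means no controller is leading (an outage), and more than one means split-brain. In KRaft mode this metric is reported by the controller nodes, so if you run dedicated controllers, make sure Prometheus scrapes them as well as the brokers.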

Consumer lag monitoring

Broker metrics tell you the cluster is healthy; they do not tell you whether consumers are keeping up. Lag is the gap between the latest offset produced and the latest offset a group has committed. The two standard tools are Burrow (LinkedIn’s evaluation-based lag checker) and the Kafka Lag Exporter (which exposes lag directly to Prometheus and even estimates lag in time, not just message count).

Kafka Lag Exporter is the simpler choice for a Prometheus-native stack:

# application.conf for kafka-lag-exporter
kafka-lag-exporter {
  port = 8000  # newer releases may expect reporters.prometheus.port instead; check your version
  poll-interval = 30 seconds
  clusters = [
    {
      name = "prod"
      bootstrap-brokers = "broker-0:9092,broker-1:9092,broker-2:9092"
    }
  ]
}

It publishes kafka_consumergroup_group_lag (in records) and kafka_consumergroup_group_lag_seconds (estimated time behind). The time-based metric is what you should alert on: “10,000 messages behind” is meaningless without throughput context, but “5 minutes behind” is universally understandable.
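
One way to turn an offset gap into a time gap is to keep a short history of (timestamp, log-end-offset) samples and interpolate when the group's committed offset was produced. The Python sketch below illustrates that idea only. It is a simplified model under that assumption, not Kafka Lag Exporter's code, and estimate_lag_seconds is a hypothetical name.

```python
from bisect import bisect_left

def estimate_lag_seconds(samples, committed_offset, now):
    """Estimate how long ago `committed_offset` was the log-end offset.

    samples: list of (unix_time, log_end_offset), sorted by time,
             with non-decreasing offsets.
    """
    times = [t for t, _ in samples]
    offsets = [o for _, o in samples]
    if committed_offset >= offsets[-1]:
        return 0.0  # caught up with the newest sample
    if committed_offset <= offsets[0]:
        return float(now - times[0])  # older than our history: a lower bound only
    # First sample whose offset is >= the committed offset.
    i = bisect_left(offsets, committed_offset)
    t0, o0 = samples[i - 1]
    t1, o1 = samples[i]
    # Linear interpolation between the two bracketing samples.
    produced_at = t0 + (t1 - t0) * (committed_offset - o0) / (o1 - o0)
    return now - produced_at

history = [(0, 1000), (30, 4000), (60, 7000)]  # 100 records/second
print(estimate_lag_seconds(history, 5500, now=60))  # prints: 15.0
```

In this example the group is 1,500 records behind, which at 100 records per second is 15 seconds. The same record count on a topic ten times slower would be 150 seconds, which is why the time-based figure is the better alerting signal.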

For ad-hoc investigation from the CLI:

kafka-consumer-groups.sh --bootstrap-server broker-0:9092 \
  --describe --group orders-service

Output (trimmed to the offset columns; the real command also prints consumer ID, host, and client ID):

GROUP          TOPIC    PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
orders-service orders   0          1048231         1048240         9
orders-service orders   1          1048190         1051002         2812
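
If you want to script around this output, a small parser is usually enough for ad-hoc checks. The Python sketch below splits the table on whitespace and sums the LAG column. It assumes the trimmed column set shown above and that no field contains spaces; parse_describe and lag_of are hypothetical helper names.

```python
def parse_describe(output):
    """Parse the table printed by kafka-consumer-groups.sh --describe into dicts."""
    lines = [ln for ln in output.splitlines() if ln.strip()]
    header = lines[0].split()
    return [dict(zip(header, ln.split())) for ln in lines[1:]]

def lag_of(row):
    """Return the lag as an int, treating a non-numeric value such as '-' as 0."""
    return int(row["LAG"]) if row["LAG"].isdigit() else 0

SAMPLE = """\
GROUP          TOPIC    PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
orders-service orders   0          1048231         1048240         9
orders-service orders   1          1048190         1051002         2812
"""

rows = parse_describe(SAMPLE)
total_lag = sum(lag_of(r) for r in rows)
worst = max(rows, key=lag_of)
print(total_lag, worst["PARTITION"])  # prints: 2821 1
```

Here partition 1 carries almost all of the lag, which points at a slow or stuck consumer for that partition rather than a group-wide throughput problem.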

Log aggregation

Metrics show that something broke; logs show why. Ship broker logs (and ideally the controller and state-change logs) to a centralized store — the ELK/OpenSearch stack or Loki paired with the same Grafana you already run. With Loki, a Promtail agent tails the Kafka log directory:

# promtail-kafka.yml
scrape_configs:
  - job_name: kafka
    static_configs:
      - targets: [localhost]
        labels:
          job: kafka
          host: broker-0
          __path__: /var/log/kafka/*.log

Label log streams by broker host and log level so you can pivot directly from a “broker down” alert to that broker’s error stream. Watch for repeated ISR shrinks, LeaderNotAvailable errors, and segment or disk errors; these often show up in the logs minutes before an incident becomes visible in metrics.

A tuned alert set

Alerts should map to operator action. Page on conditions that require a human; route everything else to a dashboard or a low-priority channel. Define them as Prometheus rules:

# kafka-alerts.yml
groups:
  - name: kafka
    rules:
      - alert: KafkaUnderReplicatedPartitions
        expr: kafka_server_replicamanager_underreplicatedpartitions > 0
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "Under-replicated partitions on {{ $labels.instance }}"

      - alert: KafkaOfflinePartitions
        expr: kafka_controller_offlinepartitionscount > 0
        for: 1m
        labels: { severity: critical }
        annotations:
          summary: "Offline partitions detected — data unavailable"

      - alert: KafkaConsumerLagSeconds
        expr: kafka_consumergroup_group_lag_seconds > 300
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "Group {{ $labels.group }} is >5m behind"

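      # Requires node_exporter on each broker host; adjust the mountpoint to your log.dirs volume.
      - alert: KafkaBrokerDiskFilling
        expr: (node_filesystem_avail_bytes{mountpoint="/var/lib/kafka"} / node_filesystem_size_bytes{mountpoint="/var/lib/kafka"}) < 0.15
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "Less than 15% disk free on {{ $labels.instance }}"

      - alert: KafkaActiveControllerCount
        expr: sum(kafka_controller_kafkacontroller_activecontrollercount) != 1
        for: 1m
        labels: { severity: critical }
        annotations:
          summary: "Active controller count is {{ $value }}, expected exactly 1"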

The reference signals every Kafka on-call should know:

Signal	Healthy value	Why it matters
Under-replicated partitions	0	Replication is falling behind; one more failure risks data loss
Offline partitions	0	Partition has no leader — producers and consumers are blocked
Active controller count	1	0 = no controller; >1 = split-brain
ISR shrink rate	~0	Frequent shrinks signal flaky brokers or network
Consumer lag (seconds)	< SLA	Downstream data is stale
Disk free	> 15%	A full log dir takes the broker offline

Use the for: clause aggressively. A momentary under-replicated count during a rolling restart is normal; sustaining it for five minutes is not. Without for:, you train your team to ignore the pager.
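
The effect of for: is easy to see in a small simulation. The Python sketch below models the pending-to-firing transition for a "value > threshold" rule evaluated at fixed instants. It is a simplified model of the behavior, not Prometheus code, and first_firing_time is a hypothetical name.

```python
def first_firing_time(samples, threshold, for_seconds):
    """Return the evaluation time at which the alert would fire, or None.

    samples: list of (unix_time, value) at evaluation instants, sorted by time.
    The condition must hold at every evaluation for at least `for_seconds`.
    """
    pending_since = None
    for t, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = t  # condition just became true: start the timer
            if t - pending_since >= for_seconds:
                return t
        else:
            pending_since = None  # condition cleared: reset the timer
    return None

# Under-replicated partition counts sampled every 60 seconds.
blip = [(0, 0), (60, 3), (120, 3), (180, 0), (240, 0)]      # rolling restart
sustained = [(i * 60, 0 if i == 0 else 2) for i in range(8)]  # real problem

print(first_firing_time(blip, 0, 300))       # prints: None
print(first_firing_time(sustained, 0, 300))  # prints: 360
```

The two-minute blip never pages anyone, while the sustained condition fires five minutes after it first appears.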

Best Practices

Alert on lag in time units (kafka_consumergroup_group_lag_seconds), not raw message counts: a time threshold maps directly onto an SLA and stays meaningful as throughput changes.
Run the JMX exporter as an in-JVM agent rather than a separate sidecar; it is lower-latency and one less moving part.
Keep critical alerts to a short, ruthlessly curated list (offline/under-replicated partitions, controller count, disk). Everything else is a warning or a dashboard.
Co-locate metrics and logs in the same Grafana so on-call can pivot from “broker down” to “broker logs” in one click.
Always alert when the active controller count is anything other than one; it is the cheapest, highest-signal cluster-health check you have.
Test your alerting path end to end (fire a synthetic alert and confirm the page lands) before you rely on it in production.
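Set Prometheus retention long enough to cover your planning horizon. Prometheus does not downsample on its own, so use recording rules or a long-term store such as Thanos if historical capacity trends need to survive beyond it.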
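
For the end-to-end alerting test mentioned above, one option is to post a synthetic alert straight to Alertmanager's v2 API and confirm that the page arrives. The Python sketch below builds such a payload. The alert name KafkaSyntheticTest, the label values, and the base URL are assumptions for illustration, and whether severity: critical actually pages depends on your Alertmanager routing.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone

def build_synthetic_alert(now=None, ttl_minutes=5):
    """Build a one-alert payload for POST /api/v2/alerts."""
    now = now or datetime.now(timezone.utc)
    return [{
        "labels": {
            "alertname": "KafkaSyntheticTest",  # hypothetical name
            "severity": "critical",
            "instance": "synthetic",
        },
        "annotations": {"summary": "Synthetic alert: verifying the paging path"},
        "startsAt": now.isoformat(),
        # The alert resolves on its own once endsAt passes.
        "endsAt": (now + timedelta(minutes=ttl_minutes)).isoformat(),
    }]

def send_alert(payload, base_url="http://alertmanager:9093"):
    """POST the payload to Alertmanager and return the HTTP status code."""
    req = urllib.request.Request(
        f"{base_url}/api/v2/alerts",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status

payload = build_synthetic_alert()
print(json.dumps(payload, indent=2))
# send_alert(payload)  # uncomment to post to a reachable Alertmanager
```

Running this on a schedule, for example weekly, also catches silent breakage in the notification path such as an expired PagerDuty key or a renamed Slack channel.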