
Monitoring & Metrics

A Kafka cluster that you cannot see is a Kafka cluster that will surprise you at 3 a.m. Brokers expose hundreds of operational metrics through JMX, but JMX alone is hard to scrape, store, and alert on. The production-standard pattern is to bridge JMX into Prometheus with a Java agent, store the time series in Prometheus, visualize them in Grafana, and route the handful of metrics that actually predict outages into your alerting system. This page walks through that stack and the specific signals that matter.

The monitoring pipeline

The flow is the same on every broker and controller, and on any JVM client you choose to instrument:

Kafka JVM (JMX MBeans)
   └─ jmx_exporter Java agent  →  HTTP :7071/metrics
        └─ Prometheus  (scrape + store + alert rules)
             ├─ Grafana      (dashboards)
             └─ Alertmanager (routing: PagerDuty, Slack, email)

The JMX exporter runs in-process as a -javaagent, so there is no separate sidecar to manage and no JMX RMI ports to secure. It reads MBeans and re-publishes them in Prometheus exposition format.
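To make the exposition format concrete, the Python sketch below parses a few sample lines into name, labels, and value. It is a simplified illustration rather than the exporter's own code: the regexes cover only the common `name{labels} value` shape, and the `orders` topic is a made-up example.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
# One label pair inside the braces, e.g. topic="orders".
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_exposition(text):
    """Parse text-format samples into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical scrape body; "orders" is an invented topic name.
sample = '''\
# TYPE kafka_server_replicamanager_underreplicatedpartitions gauge
kafka_server_replicamanager_underreplicatedpartitions 0.0
kafka_log_size{topic="orders",partition="0"} 1048576.0
'''
for name, labels, value in parse_exposition(sample):
    print(name, labels, value)
```

For real tooling, the parser that ships with the prometheus_client package is likely a sturdier choice than hand-written regexes.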

Wiring the JMX exporter into brokers

Download the jmx_prometheus_javaagent JAR and a config file, then attach it to every broker and controller via KAFKA_OPTS:

export KAFKA_OPTS="-javaagent:/opt/jmx_exporter/jmx_prometheus_javaagent-1.0.1.jar=7071:/opt/jmx_exporter/kafka.yml"
bin/kafka-server-start.sh config/kraft/server.properties

The config file maps verbose JMX bean names to clean Prometheus metric names. A trimmed example:

lowercaseOutputName: true
rules:
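  # Gauges: under-replicated partitions, offline replica count, partition count
  - pattern: 'kafka.server<type=ReplicaManager, name=(\w+)><>Value'
    name: kafka_server_replicamanager_$1
  # Meters such as IsrShrinksPerSec expose rate attributes rather than Value
  - pattern: 'kafka.server<type=ReplicaManager, name=(\w+PerSec)><>OneMinuteRate'
    name: kafka_server_replicamanager_$1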
  # Per-request latency percentiles
  - pattern: 'kafka.network<type=RequestMetrics, name=(\w+), request=(\w+)><>(\d+)thPercentile'
    name: kafka_network_request_$1
    labels:
      request: "$2"
      quantile: "0.$3"
  # Log size per partition (disk usage)
  - pattern: 'kafka.log<type=Log, name=Size, topic=(.+), partition=(.+)><>Value'
    name: kafka_log_size
    labels:
      topic: "$1"
      partition: "$2"
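To see how a rule turns a bean into a metric, the sketch below replays the kafka.log rule in Python. It is a rough approximation under stated assumptions: the exporter uses Java regexes, the input string follows the `domain<properties><>attribute` shape the patterns are written against, and the `orders` topic is hypothetical.

```python
import re

# The kafka.log rule from the config, written as a Python regex.
PATTERN = re.compile(
    r'kafka.log<type=Log, name=Size, topic=(.+), partition=(.+)><>Value'
)


def apply_rule(bean_string):
    """Return (metric_name, labels) if the rule matches, else None."""
    # search() rather than match(): the pattern is treated as unanchored here.
    match = PATTERN.search(bean_string)
    if match is None:
        return None
    labels = {"topic": match.group(1), "partition": match.group(2)}
    return "kafka_log_size", labels


# Hypothetical input: the Size attribute of partition 3 of topic "orders".
bean = "kafka.log<type=Log, name=Size, topic=orders, partition=3><>Value"
print(apply_rule(bean))

# A bean that no rule covers yields None here.
other = "kafka.log<type=LogFlushStats, name=LogFlushRateAndTimeMs><>Count"
print(apply_rule(other))
```

The second call illustrates why a tight rule set keeps output small: once rules are defined, attributes that match none of them are not collected.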

Tip: Keep the rule set tight. A wide-open exporter config can publish tens of thousands of series per broker, which inflates Prometheus cardinality and slows queries. Whitelist the beans you actually graph and alert on.
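One way to find the families that dominate cardinality is to count sample lines per metric name in a scrape. The sketch below does that for a small hypothetical scrape body; the topic names and values are invented, and in practice the text would come from the /metrics endpoint.

```python
from collections import Counter


def series_per_metric(exposition_text):
    """Count time series (sample lines) per metric name."""
    counts = Counter()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comments.
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first "{" (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        counts[name] += 1
    return counts


# Hypothetical scrape body with invented topics.
scrape = '''\
kafka_log_size{topic="orders",partition="0"} 1048576.0
kafka_log_size{topic="orders",partition="1"} 2097152.0
kafka_log_size{topic="payments",partition="0"} 524288.0
kafka_server_replicamanager_partitioncount 312.0
'''
# Largest families first: these are the candidates to drop or aggregate.
for name, count in series_per_metric(scrape).most_common():
    print(f"{count:6d}  {name}")
```

Per-partition metrics such as kafka_log_size grow with topic count times partition count, so they tend to top this list.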

Confirm the endpoint is live before pointing Prometheus at it:

curl -s http://broker-1:7071/metrics | grep kafka_server_replicamanager

Output:

kafka_server_replicamanager_underreplicatedpartitions 0.0
kafka_server_replicamanager_offlinereplicacount 0.0
kafka_server_replicamanager_partitioncount 312.0

Prometheus scrape config

Point Prometheus at the exporter ports. Label each target with its role so you can split broker and controller dashboards:

scrape_configs:
  - job_name: kafka
    scrape_interval: 15s
    static_configs:
      - targets: ['broker-1:7071', 'broker-2:7071', 'broker-3:7071']
        labels:
          role: broker
      - targets: ['controller-1:7071']
        labels:
          role: controller

The metrics that matter

Hundreds of metrics are emitted; a small subset predicts real problems. These are the ones to graph prominently and alert on.

| Metric | What it tells you | Healthy value |
| --- | --- | --- |
| kafka_server_replicamanager_underreplicatedpartitions | Replicas not caught up to the leader | 0 |
| kafka_controller_kafkacontroller_offlinepartitionscount | Partitions with no active leader | 0 |
| kafka_server_replicamanager_isrshrinkspersec | Followers falling out of the ISR | Near 0 |
| kafka_network_request_totaltimems (Produce/Fetch p99) | End-to-end request latency | Stable, low ms |
| kafka_server_brokertopicmetrics_bytesinpersec | Inbound throughput per broker | Even across brokers |
| kafka_log_size | Disk consumed per partition | Below capacity |
| Consumer lag (kafka_consumergroup_lag) | How far consumers trail producers | Bounded, not growing |

Under-replicated and offline partitions are the two non-negotiable alerts: any non-zero value means data is at risk or unavailable. ISR shrink rate is the early-warning signal — it spikes before partitions go under-replicated, often pointing at a slow disk or a network partition.
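The table's "Even across brokers" target for inbound throughput can be turned into a number. The sketch below computes a simple skew ratio (busiest broker divided by the mean); the broker names, byte rates, and the 1.5 cut-off are illustrative assumptions, not recommended values.

```python
from statistics import mean


def throughput_skew(bytes_in_per_broker):
    """Return max/mean of per-broker inbound rates (1.0 means perfectly even)."""
    rates = list(bytes_in_per_broker.values())
    average = mean(rates)
    if average == 0:
        return 1.0  # an idle cluster is treated as balanced
    return max(rates) / average


# Hypothetical bytes-in-per-second readings for three brokers.
rates = {"broker-1": 40e6, "broker-2": 42e6, "broker-3": 98e6}
skew = throughput_skew(rates)
print(f"skew ratio: {skew:.2f}")

# 1.5 is an arbitrary example threshold; tune it to your own workload.
if skew > 1.5:
    print("inbound load is uneven; check partition leadership distribution")
```

A persistently high ratio usually means one broker leads a disproportionate share of busy partitions.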

Alert rules

Encode the critical signals as Prometheus alerting rules, giving each a sensible for: duration so that brief blips do not cause flapping:

groups:
  - name: kafka.rules
    rules:
      - alert: KafkaUnderReplicatedPartitions
        expr: kafka_server_replicamanager_underreplicatedpartitions > 0
        for: 3m
        labels: { severity: critical }
        annotations:
          summary: "Under-replicated partitions on {{ $labels.instance }}"

      - alert: KafkaOfflinePartitions
        expr: kafka_controller_kafkacontroller_offlinepartitionscount > 0
        for: 1m
        labels: { severity: critical }
        annotations:
          summary: "Offline partitions in the cluster"

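      # High and still rising: absolute threshold plus a positive 10m change
      - alert: KafkaConsumerLagGrowing
        expr: |
          sum by (consumergroup, topic) (kafka_consumergroup_lag) > 100000
          and
          sum by (consumergroup, topic) (delta(kafka_consumergroup_lag[10m])) > 0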
        for: 10m
        labels: { severity: warning }
        annotations:
          summary: "Consumer group {{ $labels.consumergroup }} is lagging"

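      # node_filesystem_* metrics come from node_exporter on each broker host
      - alert: KafkaDiskAlmostFull
        expr: (node_filesystem_avail_bytes{mountpoint="/var/lib/kafka"} / node_filesystem_size_bytes{mountpoint="/var/lib/kafka"}) < 0.15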
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "Kafka data disk under 15% free on {{ $labels.instance }}"

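Consumer lag is best collected by a dedicated exporter (such as Kafka Lag Exporter or Burrow) that polls committed offsets against log-end offsets, since lag is a consumer-side concept rather than a broker MBean. Metric and label names differ between exporters: the kafka_consumergroup_lag name with consumergroup and topic labels used here follows kafka_exporter's naming, so adjust the alert expression to match whatever your exporter emits.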
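The calculation a lag exporter performs is simple in principle. The sketch below shows it for one consumer group, assuming the offsets have already been fetched; the function name, the `orders` topic, and the offset values are all hypothetical.

```python
def partition_lag(log_end_offsets, committed_offsets):
    """Lag per partition: log-end offset minus committed offset, floored at 0."""
    lag = {}
    for partition, end_offset in log_end_offsets.items():
        committed = committed_offsets.get(partition)
        if committed is None:
            # No commit recorded yet: skip rather than guess (a simplification).
            continue
        lag[partition] = max(end_offset - committed, 0)
    return lag


# Hypothetical offsets keyed by (topic, partition).
log_end = {("orders", 0): 1500, ("orders", 1): 900}
committed = {("orders", 0): 1200, ("orders", 1): 900}

lag = partition_lag(log_end, committed)
print(lag)
print("total lag:", sum(lag.values()))
```

Summing per-partition lag by group and topic gives the same aggregate the alert rule computes with sum by (consumergroup, topic).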

Grafana dashboards

In Grafana, add Prometheus as a data source and import a community Kafka dashboard (for example Grafana dashboard ID 7589) as a starting point, then trim it to your environment. Organize panels into rows: cluster health (URP, offline, ISR), throughput (bytes/messages in/out per broker), latency (Produce/Fetch p99), and resources (disk, JVM heap, GC pause time). Use the instance and role labels as dashboard variables so a single dashboard serves every broker.

Best practices

  • Run the JMX exporter as an in-process Java agent rather than over a remote JMX connection; it is simpler to secure and has lower overhead.
  • Whitelist beans in the exporter config to control Prometheus cardinality; never publish per-partition series for thousands of partitions unless you need them.
  • Alert only on actionable, outage-predictive signals (URP, offline partitions, disk, lag) and keep informational metrics to dashboards.
  • Collect consumer lag with a dedicated exporter, not broker MBeans, and alert on sustained growth rather than absolute spikes.
  • Monitor the controller separately — offline partition count and active-controller count live on the controller, not the brokers.
  • Add JVM and host metrics (heap, GC pause, disk, network) alongside Kafka metrics; broker problems are frequently resource problems.
  • Use for: durations on alerts so transient blips during leader elections or rolling restarts do not page anyone.
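The effect of a for: duration can be sketched as a small state machine. The simulation below is an illustrative model, not Prometheus code: it assumes one evaluation per minute, a 3-minute hold, and invented under-replicated partition counts.

```python
def alert_states(samples, threshold, for_seconds):
    """Return a list of (timestamp, state); state is inactive, pending, or firing.

    samples is a time-ordered list of (timestamp_seconds, value) evaluations.
    """
    pending_since = None
    states = []
    for timestamp, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = timestamp  # condition just became true
            held = timestamp - pending_since
            state = "firing" if held >= for_seconds else "pending"
        else:
            pending_since = None  # any healthy evaluation resets the timer
            state = "inactive"
        states.append((timestamp, state))
    return states


# Hypothetical URP readings: a short blip, then a sustained problem.
readings = [(0, 0), (60, 4), (120, 4), (180, 0),
            (240, 2), (300, 2), (360, 2), (420, 2)]
for timestamp, state in alert_states(readings, threshold=0, for_seconds=180):
    print(f"t={timestamp:3d}s  {state}")
```

The two-minute blip never leaves the pending state, while the sustained condition fires once it has held for the full three minutes.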
Last updated June 1, 2026