Monitoring Performance

You cannot tune what you cannot see. Every Kafka client and broker exposes a rich set of metrics over JMX, and a production deployment lives or dies on whether you are watching the right ones. The goal of monitoring is not to collect every number Kafka emits — it is to track the handful of signals that tell you when throughput is collapsing, latency is creeping up, or a consumer is falling behind. This page covers the metrics that matter on each side of the cluster and how to scrape them into Prometheus.

Where Kafka metrics come from

Brokers, producers, and consumers all publish metrics through Java Management Extensions (JMX). Each metric has an MBean object name (a domain plus key-value attributes) and one or more numeric attributes. You can browse them interactively with jconsole or jmxterm, but in production you scrape them with an agent. The Kafka clients also expose the same metrics programmatically, which is handy for embedding into application dashboards.
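
To make the "domain plus key-value attributes" structure concrete, here is a minimal sketch using only the JDK's javax.management API. The client ID order-service and the helper name parse are illustrative assumptions; the object name layout is the standard producer one.

```java
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;

public class MBeanNameDemo {
    // Wraps the checked exception so callers can parse names inline.
    static ObjectName parse(String name) {
        try {
            return new ObjectName(name);
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical client.id "order-service".
        ObjectName name = parse(
                "kafka.producer:type=producer-metrics,client-id=order-service");

        System.out.println(name.getDomain());                 // kafka.producer
        System.out.println(name.getKeyProperty("type"));      // producer-metrics
        System.out.println(name.getKeyProperty("client-id")); // order-service

        // A wildcard in the client-id value matches this MBean for any client.
        ObjectName pattern = parse(
                "kafka.producer:type=producer-metrics,client-id=*");
        System.out.println(pattern.apply(name));              // true
    }
}
```

A pattern like the one above is what you would typically hand to an MBean server query when you do not know the client IDs in advance.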

The metrics naturally split into four groups: producer-side, consumer-side, broker-side, and the end-to-end latency you measure yourself. Treat them as a layered system: a broker problem shows up as elevated producer request latency and slower consumer fetches, and eventually as consumer lag downstream.

Producer metrics

The producer’s job is to get records to the broker quickly and reliably. The two headline signals are how fast it is sending and how long each request takes.

MBean attribute	What it tells you
`record-send-rate`	Records sent per second — your producer throughput
`request-latency-avg` / `request-latency-max`	Time from sending a produce request to receiving the ack
`record-error-rate`	Records per second whose send failed (should be zero)
`record-queue-time-avg`	Average time in ms that record batches wait in the send buffer (batching pressure)
`compression-rate-avg`	Average ratio of compressed to uncompressed batch size (lower means better compression)

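A rising request-latency-avg with healthy record-send-rate usually points at the broker (slow disk, replication pressure), not the producer. A rising record-queue-time-avg means batches are waiting longer in the producer's buffer before they are sent. If you did not raise linger.ms deliberately, the sender is not draining batches fast enough: check broker latency first, then consider a larger batch.size, compression, or additional producer instances.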

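The producer MBean object name is kafka.producer:type=producer-metrics,client-id=<id>, where kafka.producer is the domain. Always set a stable, descriptive client.id. Otherwise your metrics are keyed by an auto-generated ID such as producer-1, which says nothing about the application, can change between restarts when an application creates several producers, and breaks dashboards.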
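
As a rough illustration of pinning the client ID, the sketch below builds producer settings with plain java.util.Properties. The broker address, the ID order-service-producer, and the helper names are assumptions for the example, not values from a real deployment.

```java
import java.util.Properties;

public class ProducerClientIdConfig {
    // Builds producer settings with an explicit, stable client.id.
    static Properties producerProps(String clientId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("client.id", clientId);
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        return props;
    }

    // Object name under which this producer's metrics should appear in JMX.
    static String producerMetricsMBean(String clientId) {
        return "kafka.producer:type=producer-metrics,client-id=" + clientId;
    }

    public static void main(String[] args) {
        Properties props = producerProps("order-service-producer");
        System.out.println(props.getProperty("client.id"));
        System.out.println(producerMetricsMBean(props.getProperty("client.id")));
    }
}
```

Passing these properties to a KafkaProducer constructor would then register the metrics under the printed object name, so dashboards can reference it directly.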

Consumer metrics

For consumers, the single most important number is lag: how many records have been produced to a partition that the consumer has not yet processed. Sustained, growing lag means consumption cannot keep up with production.

MBean attribute	What it tells you
`records-lag-max`	Maximum lag across all assigned partitions
`records-consumed-rate`	Records consumed per second
`fetch-latency-avg`	Time to complete a fetch request
`commit-latency-avg`	Time to commit offsets
`rebalance-rate-per-hour`	How often the group rebalances — high values stall consumption

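The client-reported records-lag-max lives under kafka.consumer:type=consumer-fetch-manager-metrics,client-id=<id>, along with records-consumed-rate and fetch-latency-avg; commit-latency-avg and rebalance-rate-per-hour live under kafka.consumer:type=consumer-coordinator-metrics,client-id=<id>. For an authoritative, broker-side view across an entire group, query the committed offsets instead: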

kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group order-processor

Output (trimmed to the offset columns):

GROUP           TOPIC    PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
order-processor orders   0          1048576         1048700         124
order-processor orders   1          1050010         1050010         0
order-processor orders   2          1049998         1052300         2302

The LAG column is the difference between the log end offset and the consumer’s committed offset. A few hundred records is normal churn; thousands that keep climbing is a problem.
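
The same arithmetic can be sketched in a few lines of Java, using the offsets from the sample output above. The class, record, and method names are hypothetical; a real tool would obtain the offsets from the cluster rather than hard-coding them.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class GroupLag {
    // Offsets for one partition: the group's committed offset and the log end offset.
    record PartitionOffsets(long committed, long logEnd) {
        long lag() {
            // Never report negative lag if the two reads raced each other.
            return Math.max(0, logEnd - committed);
        }
    }

    static long totalLag(Map<Integer, PartitionOffsets> offsets) {
        return offsets.values().stream().mapToLong(PartitionOffsets::lag).sum();
    }

    static long maxLag(Map<Integer, PartitionOffsets> offsets) {
        return offsets.values().stream().mapToLong(PartitionOffsets::lag).max().orElse(0);
    }

    // The three "orders" partitions from the sample output.
    static Map<Integer, PartitionOffsets> sampleOffsets() {
        Map<Integer, PartitionOffsets> orders = new LinkedHashMap<>();
        orders.put(0, new PartitionOffsets(1048576L, 1048700L));
        orders.put(1, new PartitionOffsets(1050010L, 1050010L));
        orders.put(2, new PartitionOffsets(1049998L, 1052300L));
        return orders;
    }

    public static void main(String[] args) {
        Map<Integer, PartitionOffsets> orders = sampleOffsets();
        System.out.println("total lag = " + totalLag(orders)); // total lag = 2426
        System.out.println("max lag = " + maxLag(orders));     // max lag = 2302
    }
}
```

Tracking both numbers is useful: the total shows overall backlog, while the maximum exposes a single hot or stuck partition that an average would hide.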

Broker metrics

Broker health determines the ceiling for everything else. Two metrics deserve a dedicated alert.

Under-replicated partitions (kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions): the number of partitions led by this broker whose in-sync replica (ISR) set is smaller than the full replica set. The healthy steady-state value is 0. A brief non-zero reading during a rolling restart or partition reassignment is expected; anything sustained means a replica has fallen behind or a broker is down, putting durability at risk.
Request handler idle ratio (kafka.server:type=KafkaRequestHandlerPool,name=RequestHandlerAvgIdlePercent): the fraction of time the broker’s request handler (I/O) threads sit idle, between 0 and 1. A value approaching 0 means the broker is saturated; you need more num.io.threads or less load. Alert when it drops below roughly 0.3.

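Also watch kafka.network:type=RequestMetrics,name=TotalTimeMs,request=Produce|FetchConsumer for per-request-type latency. Its sibling metrics (RequestQueueTimeMs, LocalTimeMs, RemoteTimeMs, ResponseQueueTimeMs, ResponseSendTimeMs) break that total down into queue, local, and remote time, which is invaluable for finding where a slow request is spending its time.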
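
The two alert conditions from this section can be sketched as a small check. The thresholds come from the text; the class name, method name, and message strings are hypothetical, and in practice this logic would usually live in your alerting system rather than in application code.

```java
import java.util.ArrayList;
import java.util.List;

public class BrokerAlerts {
    // Threshold suggested in the text; tune it for your own cluster.
    static final double MIN_IDLE_RATIO = 0.3;

    // Returns one message per violated condition; an empty list means healthy.
    static List<String> evaluate(int underReplicatedPartitions, double handlerIdleRatio) {
        List<String> alerts = new ArrayList<>();
        if (underReplicatedPartitions > 0) {
            alerts.add("under-replicated partitions: " + underReplicatedPartitions);
        }
        if (handlerIdleRatio < MIN_IDLE_RATIO) {
            alerts.add("request handler idle ratio low: " + handlerIdleRatio);
        }
        return alerts;
    }

    public static void main(String[] args) {
        System.out.println(evaluate(0, 0.75)); // healthy broker, no alerts
        System.out.println(evaluate(3, 0.12)); // both conditions violated
    }
}
```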

End-to-end latency

The metrics above measure individual hops. What your users actually feel is end-to-end latency: the wall-clock time from when a record is produced to when it is processed by a consumer. Measure it explicitly by stamping the producer time into the record and comparing on consume.

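import io.micrometer.core.instrument.Metrics;
import java.math.BigDecimal;
import java.util.concurrent.TimeUnit;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.kafka.annotation.KafkaListener;

@KafkaListener(topics = "orders", groupId = "latency-probe")
public void onMessage(ConsumerRecord<String, OrderEvent> record) {
    // record.timestamp() is the producer's CreateTime unless the topic uses LogAppendTime.
    long endToEndMs = System.currentTimeMillis() - record.timestamp();
    Metrics.timer("kafka.e2e.latency")
           .record(endToEndMs, TimeUnit.MILLISECONDS);
}

public record OrderEvent(String orderId, BigDecimal amount) {}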

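Using record.timestamp() relies on the producer’s CreateTime, so no extra payload is needed. This holds only while the topic keeps the default message.timestamp.type=CreateTime (with LogAppendTime the broker overwrites the timestamp) and while producer and consumer clocks are synchronized, for example via NTP. Feed the result into Micrometer and you get a latency histogram alongside the rest of your application metrics.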
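
Because producer and consumer clocks are rarely perfectly aligned, the subtraction can come out slightly negative. One possible guard, sketched here with hypothetical names, is to isolate the calculation in a helper that clamps at zero. Clamping hides skew, so you may prefer to count negative samples separately instead.

```java
public class EndToEndLatency {
    // Latency in ms between the record timestamp and "now", never negative.
    static long latencyMs(long recordTimestampMs, long nowMs) {
        return Math.max(0L, nowMs - recordTimestampMs);
    }

    public static void main(String[] args) {
        // Normal case: consumed 250 ms after the record was stamped.
        System.out.println(latencyMs(1_000L, 1_250L)); // 250
        // Skewed clocks: the consumer's clock is 10 ms behind the producer's.
        System.out.println(latencyMs(2_000L, 1_990L)); // 0
    }
}
```

Keeping the arithmetic in a pure function also makes it easy to unit test without a running broker.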

Scraping with Prometheus

The standard pattern is the JMX Exporter running as a Java agent on each broker. It translates JMX MBeans into Prometheus text format on an HTTP endpoint.

KAFKA_OPTS="-javaagent:/opt/jmx_exporter/jmx_prometheus_javaagent.jar=7071:/opt/jmx_exporter/kafka.yml"

A minimal exporter config with explicit rules acts as an allow-list and keeps cardinality under control, because attributes that match no rule are not exported:

lowercaseOutputName: true
rules:
  - pattern: "kafka.server<type=ReplicaManager, name=UnderReplicatedPartitions><>Value"
    name: kafka_under_replicated_partitions
  - pattern: "kafka.server<type=KafkaRequestHandlerPool, name=RequestHandlerAvgIdlePercent><>OneMinuteRate"
    name: kafka_request_handler_idle_ratio
  - pattern: "kafka.network<type=RequestMetrics, name=TotalTimeMs, request=(.+)><>Mean"
    name: kafka_request_total_time_ms
    labels:
      request: "$1"
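
To see how the capture group in the last rule becomes the request label, the sketch below applies the same regular expression with java.util.regex. It assumes the exporter matches unanchored Java regexes against strings of the form domain<properties><>attribute; the class and method names are hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExporterRuleDemo {
    // Same pattern text as the TotalTimeMs rule in the config above.
    static final Pattern RULE = Pattern.compile(
            "kafka.network<type=RequestMetrics, name=TotalTimeMs, request=(.+)><>Mean");

    // Returns the value that would fill "$1", or empty if the rule does not match.
    static Optional<String> requestLabel(String beanAttribute) {
        Matcher m = RULE.matcher(beanAttribute);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(requestLabel(
                "kafka.network<type=RequestMetrics, name=TotalTimeMs, request=Produce><>Mean"));
        System.out.println(requestLabel(
                "kafka.server<type=ReplicaManager, name=UnderReplicatedPartitions><>Value"));
    }
}
```

The first call yields the label value Produce, while the second does not match this rule at all. Testing patterns this way before deploying a config can catch rules that silently export nothing.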

Then point Prometheus at the exporter endpoints:

scrape_configs:
  - job_name: kafka-brokers
    static_configs:
      - targets: ["broker-1:7071", "broker-2:7071", "broker-3:7071"]

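For Spring Boot clients, prefer Micrometer. With spring-boot-starter-actuator and micrometer-registry-prometheus on the classpath, Spring Boot binds the metrics of Kafka clients created through its auto-configured factories and serves them at /actuator/prometheus, so you do not need a JMX agent in your app: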

management:
  endpoints:
    web:
      exposure:
        include: prometheus
  metrics:
    tags:
      application: order-service

Best Practices

Alert on under-replicated partitions > 0 and request handler idle ratio < 0.3 — these are your two leading indicators of broker trouble.
Track consumer lag from the broker side (committed offsets), not just the client-reported records-lag-max, so a crashed consumer still shows up.
Always set a stable client.id/group.id so metric series survive restarts and remain comparable across dashboards.
Whitelist exporter rules instead of scraping every MBean — Kafka emits per-topic, per-partition metrics that explode Prometheus cardinality.
Measure end-to-end latency as a first-class metric; hop-by-hop numbers can all look healthy while the user-visible delay grows.
Watch rebalance-rate-per-hour — frequent rebalances quietly destroy throughput and are easy to miss.