Consumer Rebalancing
A rebalance is the process by which Kafka redistributes a topic’s partitions across the live members of a consumer group. It is the mechanism that gives consumer groups their elasticity and fault tolerance: add a consumer and it picks up work, kill one and its partitions are reassigned elsewhere. But the classic protocol is “stop-the-world” — every consumer pauses processing while the group reconfigures — so a group that rebalances too often spends more time coordinating than consuming. Understanding what triggers a rebalance, and how to react to one cleanly, is essential to running a healthy consumer fleet in production.
What triggers a rebalance
A rebalance is started whenever the group’s membership or the topic’s shape changes. The common triggers are:
- A member joins. A new consumer with the same
group.idstarts up and sends aJoinGrouprequest. - A member leaves gracefully. A consumer calls
close(), which sends aLeaveGrouprequest so the coordinator reassigns its partitions immediately. - A member is presumed dead. The consumer stops sending heartbeats and
session.timeout.mselapses, or it fails to callpoll()withinmax.poll.interval.ms(it is “stuck” processing a batch). - Topic metadata changes. A subscribed topic has partitions added, or a new topic matching a subscribed pattern (
subscribe(Pattern)) appears or disappears.
Hitting
max.poll.interval.msis the most common self-inflicted rebalance. If your processing of a singlepoll()batch is slow, the consumer is kicked from the group, triggers a rebalance, then rejoins — often in a loop. Reducemax.poll.recordsor move slow work off the poll thread.
The group coordinator
Every consumer group is managed by a group coordinator — a specific broker chosen by hashing the group.id onto a partition of the internal __consumer_offsets topic. The coordinator tracks membership, receives heartbeats, runs the rebalance protocol, and stores committed offsets.
When a rebalance fires, one member is elected the group leader. The coordinator collects the subscriptions of all members, hands them to the leader, and the leader runs the configured partition.assignment.strategy to compute who gets which partitions. The coordinator then distributes that plan back to every member in the SyncGroup response. The assignment logic itself runs client-side; the coordinator orchestrates the rounds.
The eager (stop-the-world) protocol
The original protocol is called eager rebalancing. Its defining property is that, at the start of every rebalance, all members revoke all of their partitions before the new assignment is computed. The flow is:
trigger -> all consumers revoke ALL partitions (processing stops)
-> every member sends JoinGroup
-> leader computes assignment
-> SyncGroup distributes new ownership
-> consumers resume polling assigned partitions
Because nobody owns any partition during the join/sync window, the whole group is idle — hence “stop-the-world.” For a small group rebalancing rarely, the pause is a non-event. For a large group, or one churning every few seconds, the cumulative downtime is significant and lag spikes follow each rebalance.
Reacting with ConsumerRebalanceListener
Your application gets a chance to run code at the revoke and assign boundaries via a ConsumerRebalanceListener. This is where you commit offsets for partitions you are losing (so the new owner does not reprocess) and reset any per-partition state for partitions you are gaining.
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
public class CommitOnRevokeListener implements ConsumerRebalanceListener {
private final KafkaConsumer<String, String> consumer;
private final Map<TopicPartition, OffsetAndMetadata> pending;
public CommitOnRevokeListener(KafkaConsumer<String, String> consumer,
Map<TopicPartition, OffsetAndMetadata> pending) {
this.consumer = consumer;
this.pending = pending;
}
@Override
public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
// Last chance before losing ownership: commit what we have processed.
consumer.commitSync(new HashMap<>(pending));
}
@Override
public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
// Newly owned partitions: warm caches, seek, or just start consuming.
partitions.forEach(tp ->
System.out.println("Now owning " + tp));
}
}
You attach it at subscribe time:
consumer.subscribe(java.util.List.of("orders"), new CommitOnRevokeListener(consumer, pending));
In Spring for Apache Kafka, the same callbacks are exposed by implementing ConsumerAwareRebalanceListener and registering it on the container factory:
import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.common.TopicPartition;
import org.springframework.kafka.listener.ConsumerAwareRebalanceListener;
import org.springframework.stereotype.Component;
import java.util.Collection;
@Component
public class OrderRebalanceListener implements ConsumerAwareRebalanceListener {
@Override
public void onPartitionsRevokedBeforeCommit(Consumer<?, ?> consumer,
Collection<TopicPartition> partitions) {
// Spring commits after this returns; do cleanup here.
}
@Override
public void onPartitionsAssigned(Consumer<?, ?> consumer,
Collection<TopicPartition> partitions) {
// Optionally seek or initialize per-partition state.
}
}
Why frequent rebalances hurt
Each rebalance pays a fixed coordination cost and, under the eager protocol, idles the entire group. The downstream effects compound:
| Symptom | Cause during rebalance |
|---|---|
| Lag spikes | No partition is consumed during the join/sync window |
| Duplicate processing | Offsets not committed before revoke, so the new owner replays |
| Throughput collapse | A “rebalance storm” loops join -> revoke -> join |
| Cold caches | New owner must rebuild per-partition state from scratch |
The two most effective ways to make rebalances cheaper and rarer are the topics that follow this page. Cooperative rebalancing changes the protocol so members keep the partitions they are retaining and only the moved partitions are paused, eliminating the stop-the-world pause. Static group membership (group.instance.id) lets a consumer restart — say during a rolling deploy — without triggering a rebalance at all, because the coordinator remembers its identity for session.timeout.ms.
Best Practices
- Keep per-batch processing well under
max.poll.interval.ms; tunemax.poll.recordsdown before you tune the timeout up. - Always call
consumer.close()(or let Spring’s container shut down cleanly) so the group gets a promptLeaveGroupinstead of waiting for the session timeout. - Commit offsets in
onPartitionsRevoked/onPartitionsRevokedBeforeCommitto avoid duplicate processing after reassignment. - Adopt the
CooperativeStickyAssignorto avoid stop-the-world pauses on large or churning groups. - Use
group.instance.id(static membership) for consumers that restart frequently during deploys to suppress unnecessary rebalances. - Monitor rebalance frequency and duration via consumer metrics (
rebalance-rate-per-hour,rebalance-latency-avg) and alert on storms. - Size
session.timeout.msandheartbeat.interval.mstogether — roughly a 3:1 ratio — so transient pauses do not eject healthy members.