Consumer Rebalancing

A rebalance is the process by which Kafka redistributes a topic’s partitions across the live members of a consumer group. It is the mechanism that gives consumer groups their elasticity and fault tolerance: add a consumer and it picks up work, kill one and its partitions are reassigned elsewhere. But the classic protocol is “stop-the-world” — every consumer pauses processing while the group reconfigures — so a group that rebalances too often spends more time coordinating than consuming. Understanding what triggers a rebalance, and how to react to one cleanly, is essential to running a healthy consumer fleet in production.

What triggers a rebalance

A rebalance is started whenever the group’s membership or the topic’s shape changes. The common triggers are:

A member joins. A new consumer with the same group.id starts up and sends a JoinGroup request.
A member leaves gracefully. A consumer calls close(), which sends a LeaveGroup request so the coordinator reassigns its partitions immediately.
A member is presumed dead. The consumer stops sending heartbeats and session.timeout.ms elapses, or it fails to call poll() within max.poll.interval.ms (it is “stuck” processing a batch).
Topic metadata changes. A subscribed topic has partitions added, or a topic matching a subscribed pattern (subscribe(Pattern)) is created or deleted.

Hitting max.poll.interval.ms is the most common self-inflicted rebalance. If processing a single poll() batch takes longer than this interval, the consumer drops out of the group, triggers a rebalance, then rejoins on its next poll(), often in a loop. Reduce max.poll.records or move slow work off the poll thread.
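To see why lowering max.poll.records is usually the first lever, it can help to work out how many records fit inside the poll interval. The sketch below is a back-of-the-envelope helper, not part of the Kafka client: the class name PollBudget, its method names, and the headroom factor are illustrative assumptions. The figures in the note that follows use the client defaults of max.poll.interval.ms=300000 and max.poll.records=500.

```java
/** Hypothetical helper (not Kafka API): sizes a poll batch against max.poll.interval.ms. */
final class PollBudget {

    /**
     * @param perRecordMillis   measured worst-case processing time for one record
     * @param maxPollIntervalMs the consumer's max.poll.interval.ms
     * @param headroom          fraction of the interval budgeted for processing, in (0, 1]
     * @return the largest batch size that fits the budget, never less than 1
     */
    static int safeMaxPollRecords(long perRecordMillis, long maxPollIntervalMs, double headroom) {
        if (perRecordMillis <= 0 || maxPollIntervalMs <= 0 || headroom <= 0 || headroom > 1) {
            throw new IllegalArgumentException("times must be positive and headroom in (0, 1]");
        }
        long budgetMillis = (long) (maxPollIntervalMs * headroom);
        return (int) Math.max(1, budgetMillis / perRecordMillis);
    }

    /** True when a full batch at the worst-case per-record time would overrun the interval. */
    static boolean wouldExceedInterval(int maxPollRecords, long perRecordMillis, long maxPollIntervalMs) {
        return (long) maxPollRecords * perRecordMillis > maxPollIntervalMs;
    }
}
```

With the defaults, a worst case of 2 seconds per record means a full batch of 500 needs about 1,000 seconds, well past the 300-second interval. Budgeting half the interval, safeMaxPollRecords(2000, 300000, 0.5) gives 75.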

The group coordinator

Every consumer group is managed by a group coordinator: the broker that leads the partition of the internal __consumer_offsets topic to which the group.id hashes. The coordinator tracks membership, receives heartbeats, runs the rebalance protocol, and stores committed offsets.
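The hashing step can be sketched in a few lines. This follows the rule as it is commonly described for the broker (the absolute value of the group.id's hash code, modulo the partition count of __consumer_offsets, which is set by offsets.topic.num.partitions and defaults to 50). The class and method names are illustrative, and the broker's own implementation remains the authority.

```java
/** Illustrative sketch (not Kafka API): maps a group.id to a __consumer_offsets partition. */
final class CoordinatorLookup {

    /**
     * @param groupId                the consumer group's group.id
     * @param offsetsTopicPartitions partition count of __consumer_offsets (assumed 50 if left at the default)
     * @return the partition whose leader broker would act as this group's coordinator
     */
    static int offsetsPartitionFor(String groupId, int offsetsTopicPartitions) {
        if (offsetsTopicPartitions <= 0) {
            throw new IllegalArgumentException("partition count must be positive");
        }
        int hash = groupId.hashCode();
        // Math.abs(Integer.MIN_VALUE) is still negative, so that one value is mapped to 0.
        int nonNegative = (hash == Integer.MIN_VALUE) ? 0 : Math.abs(hash);
        return nonNegative % offsetsTopicPartitions;
    }
}
```

Because the mapping depends only on the group.id and the partition count, every broker and client arrives at the same partition, and the group's coordinator moves only when leadership of that partition moves.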

When a rebalance fires, the coordinator designates one member as the group leader. The coordinator collects the subscriptions of all members and hands them to the leader in its JoinGroup response; the leader runs the configured partition.assignment.strategy to compute who gets which partitions and sends the plan back in its SyncGroup request. The coordinator then distributes each member's share of that plan in the SyncGroup responses. The assignment logic itself runs client-side; the coordinator orchestrates the rounds.
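As a rough illustration of what the leader computes, here is a simplified single-topic assignment in the spirit of the range strategy: members are sorted, partitions are split into contiguous ranges, and the first few members absorb any remainder. It is a sketch only. The class name and the use of plain strings and integers are assumptions, and the real assignors work with subscriptions across multiple topics.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simplified, single-topic model of a range-style assignment (not the Kafka implementation). */
final class RangeStyleAssignment {

    /** Returns memberId -> owned partition numbers, with members in sorted order. */
    static Map<String, List<Integer>> assign(List<String> memberIds, int partitionCount) {
        List<String> sorted = new ArrayList<>(memberIds);
        Collections.sort(sorted);

        Map<String, List<Integer>> plan = new LinkedHashMap<>();
        if (sorted.isEmpty()) {
            return plan;
        }
        int base = partitionCount / sorted.size();   // every member gets at least this many
        int extra = partitionCount % sorted.size();  // the first 'extra' members get one more
        int next = 0;
        for (int i = 0; i < sorted.size(); i++) {
            int count = base + (i < extra ? 1 : 0);
            List<Integer> owned = new ArrayList<>();
            for (int p = 0; p < count; p++) {
                owned.add(next++);
            }
            plan.put(sorted.get(i), owned);
        }
        return plan;
    }
}
```

For seven partitions and members c-1, c-2, and c-3, this yields c-1 owning 0 to 2, c-2 owning 3 and 4, and c-3 owning 5 and 6.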

The eager (stop-the-world) protocol

The original protocol is called eager rebalancing. Its defining property is that, at the start of every rebalance, all members revoke all of their partitions before the new assignment is computed. The flow is:

trigger -> all consumers revoke ALL partitions (processing stops)
        -> every member sends JoinGroup
        -> leader computes assignment
        -> SyncGroup distributes new ownership
        -> consumers resume polling assigned partitions

Because nobody owns any partition during the join/sync window, the whole group is idle — hence “stop-the-world.” For a small group rebalancing rarely, the pause is a non-event. For a large group, or one churning every few seconds, the cumulative downtime is significant and lag spikes follow each rebalance.

Reacting with ConsumerRebalanceListener

Your application gets a chance to run code at the revoke and assign boundaries via a ConsumerRebalanceListener. This is where you commit offsets for partitions you are losing (so the new owner does not reprocess) and reset any per-partition state for partitions you are gaining.

import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

public class CommitOnRevokeListener implements ConsumerRebalanceListener {

    private final KafkaConsumer<String, String> consumer;
    private final Map<TopicPartition, OffsetAndMetadata> pending;

    public CommitOnRevokeListener(KafkaConsumer<String, String> consumer,
                                  Map<TopicPartition, OffsetAndMetadata> pending) {
        this.consumer = consumer;
        this.pending = pending;
    }

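    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // Last chance before losing ownership: commit processed offsets for the
        // revoked partitions only, and drop them from the (mutable) pending map
        // so stale entries cannot be committed after ownership has moved.
        Map<TopicPartition, OffsetAndMetadata> toCommit = new HashMap<>();
        for (TopicPartition tp : partitions) {
            OffsetAndMetadata offset = pending.remove(tp);
            if (offset != null) {
                toCommit.put(tp, offset);
            }
        }
        if (!toCommit.isEmpty()) {
            consumer.commitSync(toCommit);
        }
    }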

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // Newly owned partitions: warm caches, seek, or just start consuming.
        partitions.forEach(tp ->
            System.out.println("Now owning " + tp));
    }
}

You attach it at subscribe time:

consumer.subscribe(java.util.List.of("orders"), new CommitOnRevokeListener(consumer, pending));
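The listener above assumes a pending map that the poll loop keeps up to date. The bookkeeping rule is that the offset to commit for a partition is the offset of the last processed record plus one, that is, the next record to read. The sketch below isolates that rule using plain string keys such as "orders-0" in place of TopicPartition so that it runs without the Kafka client on the classpath. PendingOffsets and its methods are hypothetical names, not Kafka API.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical bookkeeping helper: tracks the next offset to commit per partition. */
final class PendingOffsets {

    private final Map<String, Long> nextOffsetByPartition = new HashMap<>();

    /** Call after a record is fully processed; stores offset + 1, never moving backwards. */
    void markProcessed(String partition, long recordOffset) {
        nextOffsetByPartition.merge(partition, recordOffset + 1, Long::max);
    }

    /**
     * Removes and returns the entries for the given partitions, for example the
     * ones passed to onPartitionsRevoked. Partitions with nothing processed are skipped.
     */
    Map<String, Long> drain(Collection<String> revoked) {
        Map<String, Long> toCommit = new HashMap<>();
        for (String partition : revoked) {
            Long next = nextOffsetByPartition.remove(partition);
            if (next != null) {
                toCommit.put(partition, next);
            }
        }
        return toCommit;
    }
}
```

In a real consumer the same idea would be expressed with TopicPartition keys and OffsetAndMetadata values, with markProcessed called from the poll loop and drain called from the revoke callback. Both run on the polling thread, so a plain HashMap is sufficient in this sketch.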

In Spring for Apache Kafka, the same callbacks are exposed by implementing ConsumerAwareRebalanceListener and registering it through the container factory's ContainerProperties (setConsumerRebalanceListener):

import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.common.TopicPartition;
import org.springframework.kafka.listener.ConsumerAwareRebalanceListener;
import org.springframework.stereotype.Component;

import java.util.Collection;

@Component
public class OrderRebalanceListener implements ConsumerAwareRebalanceListener {

    @Override
    public void onPartitionsRevokedBeforeCommit(Consumer<?, ?> consumer,
                                                Collection<TopicPartition> partitions) {
        // Spring commits after this returns; do cleanup here.
    }

    @Override
    public void onPartitionsAssigned(Consumer<?, ?> consumer,
                                     Collection<TopicPartition> partitions) {
        // Optionally seek or initialize per-partition state.
    }
}

Why frequent rebalances hurt

Each rebalance pays a fixed coordination cost and, under the eager protocol, idles the entire group. The downstream effects compound:

Symptom	Cause during rebalance
Lag spikes	No partition is consumed during the join/sync window
Duplicate processing	Offsets not committed before revoke, so the new owner replays
Throughput collapse	A “rebalance storm” loops join -> revoke -> join
Cold caches	New owner must rebuild per-partition state from scratch

The two most effective ways to make rebalances cheaper and rarer are the topics that follow this page. Cooperative rebalancing changes the protocol so members keep the partitions they are retaining and only the moved partitions are paused, eliminating the stop-the-world pause. Static group membership (group.instance.id) lets a consumer restart, say during a rolling deploy, without triggering a rebalance at all, provided it rejoins within session.timeout.ms: the coordinator keeps the member's identity and assignment reserved for that long.
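The difference between the two protocols can be made concrete by comparing which partitions a member has to give up when its assignment changes. The sketch below is a simplified model, not client code: under eager rebalancing the member revokes everything it owned, while under cooperative rebalancing it revokes only what is absent from its new assignment. RevocationModel and its methods are illustrative names.

```java
import java.util.Set;
import java.util.TreeSet;

/** Simplified model of what one member revokes under each protocol (not Kafka API). */
final class RevocationModel {

    /** Eager: everything owned is revoked, whatever the new assignment turns out to be. */
    static Set<Integer> eagerRevoked(Set<Integer> owned, Set<Integer> newAssignment) {
        // newAssignment is deliberately unused; it is kept only for symmetry with the method below.
        return new TreeSet<>(owned);
    }

    /** Cooperative: only partitions missing from the new assignment are revoked. */
    static Set<Integer> cooperativeRevoked(Set<Integer> owned, Set<Integer> newAssignment) {
        Set<Integer> revoked = new TreeSet<>(owned);
        revoked.removeAll(newAssignment);
        return revoked;
    }
}
```

For a member that owned partitions 0 to 3 and keeps 0 to 2 after a new consumer joins, the eager model pauses all four partitions while the cooperative model pauses only partition 3.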

Best Practices

Keep per-batch processing well under max.poll.interval.ms; tune max.poll.records down before you tune the timeout up.
Always call consumer.close() (or let Spring’s container shut down cleanly) so the group gets a prompt LeaveGroup instead of waiting for the session timeout.
Commit offsets in onPartitionsRevoked to avoid duplicate processing after reassignment; in Spring, finish in-flight work in onPartitionsRevokedBeforeCommit and let the container perform the commit.
Adopt the CooperativeStickyAssignor to avoid stop-the-world pauses on large or churning groups.
Use group.instance.id (static membership) for consumers that restart frequently during deploys to suppress unnecessary rebalances.
Monitor rebalance frequency and duration via consumer metrics (rebalance-rate-per-hour, rebalance-latency-avg) and alert on storms.
Size session.timeout.ms and heartbeat.interval.ms together, with the session timeout at least three times the heartbeat interval, so transient pauses do not eject healthy members.
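Several of these practices come down to consumer configuration. The following is a sketch of one possible starting point, built with plain java.util.Properties so it has no dependencies. The specific values, the class name, and the group and instance ids passed in are assumptions to adapt, not recommendations from the Kafka project. Connection settings and deserializers are omitted.

```java
import java.util.Properties;

/** Illustrative starting point for rebalance-related consumer settings (values are assumptions). */
final class RebalanceFriendlyConfig {

    static Properties build(String groupId, String instanceId) {
        Properties props = new Properties();
        props.setProperty("group.id", groupId);

        // Static membership: the id must be unique per instance and stable across restarts.
        props.setProperty("group.instance.id", instanceId);

        // Cooperative protocol: only partitions that move are paused during a rebalance.
        props.setProperty("partition.assignment.strategy",
                "org.apache.kafka.clients.consumer.CooperativeStickyAssignor");

        // Session timeout kept at three times the heartbeat interval.
        props.setProperty("session.timeout.ms", "30000");
        props.setProperty("heartbeat.interval.ms", "10000");

        // Smaller batches leave more slack before max.poll.interval.ms is hit.
        props.setProperty("max.poll.interval.ms", "300000");
        props.setProperty("max.poll.records", "100");
        return props;
    }
}
```

The resulting Properties object can be merged with bootstrap.servers and deserializer settings and passed to the KafkaConsumer constructor. A stable value such as a pod ordinal or host name is one common choice for the instance id.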