Cluster Upgrades

Upgrading a Kafka cluster to a new version is a routine but high-stakes operation: done carelessly it causes broker downtime, stuck replication, or incompatible clients; done correctly it is invisible to producers and consumers. The key to a zero-downtime upgrade is the rolling restart — replacing one broker at a time while the rest of the cluster keeps serving traffic — combined with careful handling of the inter.broker.protocol.version (IBP) handshake. This page walks through the full procedure, the protocol-version mechanics that make it safe, client compatibility, and a step-by-step checklist you can follow in production.

Why upgrades need a protocol handshake

Brokers in a cluster talk to each other using an internal binary protocol that evolves between releases. If a freshly upgraded broker immediately started speaking the newest protocol, older brokers still running the previous version could not understand it, and replication between them would break mid-upgrade. Kafka solves this with inter.broker.protocol.version, which pins the protocol the cluster uses regardless of the binary version installed. By keeping the IBP at the old version while you swap binaries, every broker — upgraded or not — speaks the same language. You only bump the IBP once every broker is running the new binary.

A second, related setting is log.message.format.version. Historically it controlled the on-disk record format and had to be coordinated so that older consumers were not handed a format they could not parse. It has been deprecated since Kafka 3.0: once the IBP is 3.0 or higher, brokers always write message format v2 and ignore the configured value. For KRaft clusters the counterpart of the IBP is the metadata version, managed separately via kafka-features.sh. Treat log.message.format.version as legacy unless you are upgrading from a pre-3.0 cluster.

Always change the binary first and the protocol version second, in two separate rolling restarts. Bumping the IBP before every broker is upgraded will cause the not-yet-upgraded brokers to receive a protocol they cannot handle.

The two-phase rolling upgrade

A safe upgrade is two passes over the cluster. In phase one you install the new binaries while explicitly holding the old protocol version. In phase two you raise the protocol version to unlock the new release’s features.

During phase one your server.properties should pin the current (pre-upgrade) versions so the new binary behaves like the old one on the wire:

# Phase 1: new binaries installed, but protocol pinned to the OLD version
inter.broker.protocol.version=3.7
# log.message.format.version is ignored once the IBP is 3.0 or higher;
# pin it only when upgrading from a pre-3.0 cluster.
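Before swapping any binaries it can help to confirm that the pin is really present in the file the broker will load. The following is a minimal sketch, assuming a plain key=value server.properties; get_prop and check_ibp_pinned are hypothetical helper names, not tools shipped with Kafka.

```shell
# Hypothetical helpers (not part of Kafka) for checking the IBP pin.

# get_prop KEY FILE: print the value of the last uncommented KEY=value line.
get_prop() {
    awk -v key="$1" '
        /^[ \t]*#/ { next }                      # skip comment lines
        {
            eq = index($0, "=")
            if (eq == 0) next                    # not a key=value line
            k = substr($0, 1, eq - 1)
            v = substr($0, eq + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", k)       # trim whitespace around the key
            gsub(/^[ \t]+|[ \t]+$/, "", v)       # trim whitespace around the value
            if (k == key) found = v
        }
        END { if (found != "") print found }
    ' "$2"
}

# check_ibp_pinned EXPECTED FILE: fail unless the IBP is pinned to EXPECTED.
check_ibp_pinned() {
    expected=$1
    actual=$(get_prop inter.broker.protocol.version "$2")
    if [ "$actual" = "$expected" ]; then
        echo "OK: inter.broker.protocol.version pinned to $actual"
    else
        echo "REFUSE: expected pin $expected, found '${actual:-unset}'" >&2
        return 1
    fi
}
```

A call such as check_ibp_pinned 3.7 /opt/kafka/config/server.properties returns non-zero when the pin is missing or different, so it can gate the rest of a phase-one script.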

Once every broker runs the new binary and the cluster is healthy, edit the config to the new version and do a second rolling restart:

# Phase 2: all brokers upgraded, so raise the protocol to the NEW version
inter.broker.protocol.version=3.8
# log.message.format.version needs no change here; it is ignored once the IBP is 3.0 or higher.

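This ordering means every broker can always understand the protocol its peers speak: during phase one all brokers use the old protocol, and during phase two every installed binary already supports both the old and the new one. Replication and the controller therefore keep functioning throughout.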
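The phase-two config edit can be scripted so that it only happens when the file is in the expected phase-one state. This is a sketch under stated assumptions: bump_ibp is a hypothetical helper, the property is written without spaces around the equals sign, and a .bak copy next to the file is an acceptable rollback path.

```shell
# bump_ibp OLD NEW FILE: rewrite the IBP pin from OLD to NEW.
# Hypothetical helper; refuses to act unless FILE is currently pinned to OLD.
bump_ibp() {
    old=$1
    new=$2
    file=$3
    # -x matches the whole line, -F treats the pattern as a fixed string.
    if ! grep -qxF "inter.broker.protocol.version=${old}" "$file"; then
        echo "refusing: $file is not pinned to $old" >&2
        return 1
    fi
    cp "$file" "$file.bak"                       # keep a rollback copy
    sed "s/^inter\.broker\.protocol\.version=.*/inter.broker.protocol.version=${new}/" \
        "$file.bak" > "$file"
}
```

The edit alone changes nothing on a running broker; the new value takes effect only when that broker is restarted in the second rolling pass.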

Upgrading a single broker

For each broker, restart it gracefully so it cleanly hands off leadership before stopping. Restart only one broker at a time and wait for full recovery before moving to the next.

# Stop one broker (controlled shutdown migrates leadership off it)
kafka-server-stop.sh

# Install / point to the new Kafka version, then start it back up
kafka-server-start.sh -daemon /opt/kafka/config/server.properties

Because the broker is started with -daemon, look in its server log for a line like:

[KafkaServer id=2] started (kafka.server.KafkaServer)

After the broker rejoins, confirm it is back in the in-sync replica set for all its partitions before touching the next one. The cluster is only safe to continue when there are no under-replicated partitions.

kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions

The command prints nothing when every partition is fully replicated. That empty result is your signal to proceed to the next broker.
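Rather than rerunning the check by hand, the wait can be expressed as a polling loop. The sketch below assumes kafka-topics.sh is on the PATH and that each under-replicated partition is reported on a line containing "Partition:"; urp_count, wait_for_no_urp, KAFKA_TOPICS, and BOOTSTRAP are hypothetical names introduced for illustration.

```shell
# Hypothetical overrides: point these at your tool location and cluster.
: "${KAFKA_TOPICS:=kafka-topics.sh}"
: "${BOOTSTRAP:=localhost:9092}"

# urp_count: print the number of under-replicated partitions.
# Returns 2 if the tool itself fails, so an unreachable cluster is never
# mistaken for a healthy one.
urp_count() {
    out=$("$KAFKA_TOPICS" --bootstrap-server "$BOOTSTRAP" \
        --describe --under-replicated-partitions) || return 2
    # grep -c exits non-zero on a count of 0, hence the "|| true".
    printf '%s\n' "$out" | grep -c 'Partition:' || true
}

# wait_for_no_urp [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
wait_for_no_urp() {
    timeout=${1:-600}
    interval=${2:-10}
    waited=0
    while :; do
        if n=$(urp_count) && [ "$n" -eq 0 ]; then
            echo "no under-replicated partitions"
            return 0
        fi
        if [ "$waited" -ge "$timeout" ]; then
            echo "timed out after ${waited}s waiting for replication" >&2
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
}
```

Calling wait_for_no_urp 900 15 after each restart blocks for up to 15 minutes and fails loudly instead of letting the rollout continue against a degraded cluster.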

Client compatibility

Kafka maintains strong bidirectional compatibility, which is what makes zero-downtime upgrades possible without coordinating client deployments.

Direction	Supported?	Notes
Old client → new broker	Yes	Upgrade brokers first; existing producers/consumers keep working unchanged.
New client → old broker	Yes (clients 0.10.2+, brokers 0.10.0+)	Clients negotiate down via the ApiVersions request; a call that needs an API the broker lacks fails with UnsupportedVersionException.
Mixed broker versions	Temporary only	Expected during the rolling upgrade; do not run a mixed-version cluster long-term.

The practical rule is upgrade brokers before clients. A broker on the new version happily serves clients on older versions, so you can roll the cluster first and migrate application teams onto new client libraries afterward at their own pace.
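To see what a broker actually advertises to clients, you can inspect the ApiVersions response with kafka-broker-api-versions.sh. The sketch below assumes the tool prints each API as a line shaped like "Produce(0): 0 to 9 [usable: 9],"; that layout is an assumption about the tool's output, and api_range, KAFKA_API_VERSIONS, and BOOTSTRAP are hypothetical names.

```shell
# Hypothetical overrides: point these at your tool location and cluster.
: "${KAFKA_API_VERSIONS:=kafka-broker-api-versions.sh}"
: "${BOOTSTRAP:=localhost:9092}"

# api_range NAME: for each broker, print "MIN MAX" for the advertised
# version range of the API called NAME (for example Produce or Fetch).
# APIs that are not reported as a "MIN to MAX" range are skipped.
api_range() {
    "$KAFKA_API_VERSIONS" --bootstrap-server "$BOOTSTRAP" \
        | sed -n "s/^[[:space:]]*$1([0-9]*): \([0-9]*\) to \([0-9]*\) .*/\1 \2/p"
}
```

Running api_range Produce before and after the upgrade shows whether the maximum version moved, which is what newer clients will negotiate up to; older clients simply keep using the lower versions that remain in range.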

Validation between steps

Treat each broker restart as a checkpoint. Between steps, verify the cluster is genuinely healthy rather than just “process running”:

# Confirm the live broker count matches your cluster size
kafka-broker-api-versions.sh --bootstrap-server localhost:9092 | grep -c "id:"

# Check no partitions are offline
kafka-topics.sh --bootstrap-server localhost:9092 --describe --unavailable-partitions

Watch the kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions and kafka.controller:type=KafkaController,name=OfflinePartitionsCount JMX metrics throughout; both should return to zero after each restart. Also keep an eye on consumer group lag so you can catch a stalled consumer group early instead of after the whole cluster is upgraded.
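The two command-line checks above can be folded into a single pass/fail gate. This is a sketch, assuming the Kafka scripts are on the PATH and that the expected broker count is known; cluster_healthy and the three override variables are hypothetical names.

```shell
# Hypothetical overrides: point these at your tool locations and cluster.
: "${KAFKA_TOPICS:=kafka-topics.sh}"
: "${KAFKA_API_VERSIONS:=kafka-broker-api-versions.sh}"
: "${BOOTSTRAP:=localhost:9092}"

# cluster_healthy EXPECTED_BROKERS
# Returns 0 only if the expected number of brokers answer and no partition
# is unavailable; 1 if a check fails; 2 if a tool could not be run.
cluster_healthy() {
    expected=$1
    brokers=$("$KAFKA_API_VERSIONS" --bootstrap-server "$BOOTSTRAP") || return 2
    # One "id:" line per responding broker; "|| true" absorbs grep's exit 1 on 0.
    live=$(printf '%s\n' "$brokers" | grep -c 'id:' || true)
    offline=$("$KAFKA_TOPICS" --bootstrap-server "$BOOTSTRAP" \
        --describe --unavailable-partitions) || return 2
    if [ "$live" -ne "$expected" ]; then
        echo "only $live of $expected brokers are live" >&2
        return 1
    fi
    if [ -n "$offline" ]; then
        echo "unavailable partitions present" >&2
        return 1
    fi
    echo "healthy: $live brokers, no unavailable partitions"
}
```

A gate like this complements the JMX metrics rather than replacing them: it gives a script a clear exit code, while dashboards show trends such as slowly growing consumer lag.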

Step-by-step checklist

[ ] 1. Read the release notes for breaking changes between your versions.
[ ] 2. Back up server.properties and confirm a working disaster-recovery path.
[ ] 3. Pin inter.broker.protocol.version (and, when upgrading from a pre-3.0 cluster, log.message.format.version) to the CURRENT version.
[ ] 4. Rolling restart broker-by-broker onto the new binary; wait for 0 under-replicated partitions each time.
[ ] 5. Verify cluster health: all brokers online, no offline/unavailable partitions, lag stable.
[ ] 6. Raise inter.broker.protocol.version to the NEW version in config.
[ ] 7. Rolling restart broker-by-broker again to apply the new protocol.
[ ] 8. (KRaft) Upgrade the metadata version with kafka-features.sh once stable.
[ ] 9. Upgrade client applications at your own pace.
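Steps 4 and 7 share the same shape: restart one broker, block until the cluster recovers, and stop at the first sign of trouble. The loop below is a rough sketch of that shape, not a production tool. RESTART_CMD, WAIT_CMD, restart_over_ssh, wait_until_replicated, and rolling_restart are all hypothetical names, and the default restart hook assumes passwordless ssh with the Kafka scripts on the remote PATH.

```shell
# Hypothetical hooks; override them to match how you reach and restart brokers.
: "${RESTART_CMD:=restart_over_ssh}"
: "${WAIT_CMD:=wait_until_replicated}"
: "${BOOTSTRAP:=localhost:9092}"

# Default restart hook. The fixed sleep is a crude placeholder for waiting
# until the old broker process has actually exited.
restart_over_ssh() {
    ssh "$1" 'kafka-server-stop.sh; sleep 30; kafka-server-start.sh -daemon /opt/kafka/config/server.properties'
}

# Default wait hook: poll for up to 10 minutes until the
# under-replicated-partitions listing is empty.
wait_until_replicated() {
    tries=0
    while [ "$tries" -lt 60 ]; do
        if out=$(kafka-topics.sh --bootstrap-server "$BOOTSTRAP" \
                --describe --under-replicated-partitions) && [ -z "$out" ]; then
            return 0
        fi
        tries=$((tries + 1))
        sleep 10
    done
    return 1
}

# rolling_restart HOST...: restart hosts strictly one at a time and halt
# as soon as a restart or the recovery wait fails.
rolling_restart() {
    for host in "$@"; do
        echo "restarting $host"
        "$RESTART_CMD" "$host" || { echo "restart failed on $host" >&2; return 1; }
        "$WAIT_CMD" || { echo "cluster did not recover after $host; stopping" >&2; return 1; }
    done
    echo "rolled $# brokers"
}
```

Halting on the first failure is the important design choice: a half-finished rolling restart leaves the cluster in a supported mixed state, whereas pressing on with an unhealthy cluster risks taking a second replica offline.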

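KRaft clusters do not use inter.broker.protocol.version, so steps 3, 6, and 7 do not apply to them. After the binary rolling restart, step 8 finalizes the new features: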

kafka-features.sh --bootstrap-server localhost:9092 \
  upgrade --metadata 3.8
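After the upgrade command returns, it is worth confirming which metadata version the cluster actually finalized. The sketch below uses the describe subcommand of kafka-features.sh and assumes its output contains a line starting with "Feature: metadata.version" that includes a "FinalizedVersionLevel:" field; that layout is an assumption, and finalized_metadata_version, KAFKA_FEATURES, and BOOTSTRAP are hypothetical names.

```shell
# Hypothetical overrides: point these at your tool location and cluster.
: "${KAFKA_FEATURES:=kafka-features.sh}"
: "${BOOTSTRAP:=localhost:9092}"

# finalized_metadata_version: print the value that follows
# "FinalizedVersionLevel:" on the metadata.version line, if present.
finalized_metadata_version() {
    "$KAFKA_FEATURES" --bootstrap-server "$BOOTSTRAP" describe \
        | awk '/^Feature: metadata\.version/ {
                   for (i = 1; i < NF; i++)
                       if ($i == "FinalizedVersionLevel:") print $(i + 1)
               }'
}
```

An empty result means the expected line was not found, which should be treated as "unknown" and investigated rather than read as success.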

Best practices

Always perform the upgrade in two phases: swap binaries with the old protocol pinned, then raise the protocol in a second rolling restart.
Restart exactly one broker at a time and block on zero under-replicated partitions before continuing — never batch restarts.
Use controlled shutdown (kafka-server-stop.sh) so leadership migrates cleanly instead of being abruptly lost.
Upgrade brokers before clients; rely on Kafka’s compatibility guarantees rather than coordinating a flag-day deployment.
Read the version-specific upgrade notes — some releases change defaults or require an explicit metadata/feature upgrade step.
Drive the process from concrete health signals (under-replicated and offline partition metrics, consumer lag), not just “the process started”.
Practice the full procedure in staging first so the production rollout is muscle memory, not improvisation.