Backup & Disaster Recovery

Kafka’s intra-cluster replication protects you from broker and disk failures, but it does nothing when an entire data center, region, or cloud availability zone goes dark. Disaster recovery (DR) is about surviving that larger blast radius: replicating data across independent clusters, backing up the metadata that makes a cluster usable, and rehearsing the failover procedure so that when the pager fires at 3 a.m. you are running a runbook, not improvising. A DR plan you have never tested is a hypothesis, not a plan.

Define your RPO and RTO first

Every DR decision flows from two numbers. Recovery Point Objective (RPO) is how much data you can afford to lose, measured in time. Recovery Time Objective (RTO) is how long you can be down before service resumes. Cross-cluster replication is asynchronous, so a non-zero RPO is unavoidable — the secondary always lags the primary by the replication latency.

Strategy	Typical RPO	Typical RTO	Cost	Notes
Single cluster, multi-AZ	0 (sync replicas)	minutes	Low	No protection from region loss
Active-passive (MM2)	seconds–minutes	minutes–hours	Medium	Standby cluster idle until failover
Active-active (MM2)	seconds	seconds–minutes	High	Both clusters serve traffic; risk of loops
Stretch cluster (rack-aware)	0	seconds	High	Needs low-latency links between sites

Quantify these numbers with the business before choosing a topology. “We need zero data loss and instant failover” usually evaporates once the cost of a synchronous stretch cluster is on the table.

Cross-cluster replication with MirrorMaker 2

MirrorMaker 2 (MM2) is the standard tool for replicating data between clusters. It runs as a set of Kafka Connect connectors — MirrorSourceConnector copies records, MirrorCheckpointConnector translates consumer offsets, and MirrorHeartbeatConnector measures end-to-end lag. MM2 also replicates topic configurations and ACLs, and creates topics on the target automatically.

Topics on the target are prefixed with the source cluster alias by default, so a topic orders from cluster primary lands as primary.orders. This naming makes the data flow obvious and prevents collisions in active-active setups.

A minimal connect-mirror-maker.properties for active-passive replication:

clusters = primary, dr
primary.bootstrap.servers = kafka-primary-1:9092,kafka-primary-2:9092
dr.bootstrap.servers = kafka-dr-1:9092,kafka-dr-2:9092

# Replicate primary -> dr only (active-passive)
primary->dr.enabled = true
dr->primary.enabled = false

# What to mirror
primary->dr.topics = orders|payments|inventory.*
primary->dr.groups = .*

# Keep target config and offsets in sync
sync.topic.configs.enabled = true
sync.group.offsets.enabled = true
emit.checkpoints.enabled = true

# Replication factor for MM2 internal topics
replication.factor = 3
checkpoints.topic.replication.factor = 3
offset-syncs.topic.replication.factor = 3

Run it as a dedicated process or, in production, deploy the connectors onto a Connect cluster:

connect-mirror-maker.sh connect-mirror-maker.properties

Output:

[INFO] Starting MirrorSourceConnector primary->dr
[INFO] Starting MirrorCheckpointConnector primary->dr
[INFO] Starting MirrorHeartbeatConnector primary->dr
[INFO] Found 3 topics matching 'orders|payments|inventory.*'
[INFO] Replicating to dr: primary.orders, primary.payments, primary.inventory.eu

Active-active and avoiding loops

For active-active, enable both directions (primary->dr.enabled and dr->primary.enabled). The topic-prefixing scheme is what prevents infinite replication loops: MM2 will not re-replicate a topic that already carries a remote cluster’s prefix, because each cluster only mirrors topics that lack its own alias. Producers in each region write to the local (unprefixed) topic; consumers that need a global view subscribe to both the local and the <remote>.-prefixed topics.

Offset translation for seamless failover

The hard part of failover is not the data — it is making consumers resume at the right place. Offsets are cluster-local: offset 5000 on the primary points to a different record than offset 5000 on the DR cluster. The MirrorCheckpointConnector solves this by maintaining an offset-syncs mapping and emitting checkpoints, then writing translated committed offsets into the DR cluster’s __consumer_offsets topic (when sync.group.offsets.enabled = true).

On failover, a consumer using the same group.id against the DR cluster picks up near where it left off — typically within the RPO window. If you manage offsets manually, use RemoteClusterUtils to translate:

import org.apache.kafka.connect.mirror.RemoteClusterUtils;
import org.apache.kafka.common.TopicPartition;
import java.util.Map;
import java.util.Properties;

Properties props = new Properties();
props.put("bootstrap.servers", "kafka-dr-1:9092");

// Translate the primary's committed offsets into DR-cluster offsets
Map<TopicPartition, Long> translated =
    RemoteClusterUtils.translateOffsets(props, "primary", "order-service", 30_000L);

translated.forEach((tp, offset) ->
    System.out.printf("Seek %s to offset %d on DR cluster%n", tp, offset));

Backing up topology, configs, and schemas

Data replication is not enough — you must also back up the metadata that defines a usable cluster. Treat all of it as version-controlled, declarative state:

Topic definitions (partitions, replication factor, retention, cleanup policy) exported via the AdminClient or kafka-topics.sh --describe.
Broker and topic-level configs dumped with kafka-configs.sh.
ACLs and quotas for security parity on the standby.
Schemas from the Schema Registry — export the _schemas topic or use the registry’s export API and restore with matching subject/version IDs so producers and consumers stay compatible.

# Export every topic's config as a restorable manifest
kafka-topics.sh --bootstrap-server kafka-primary-1:9092 \
  --describe --topics-with-overrides > topics-backup.txt

# Back up Schema Registry contents (compacted topic)
kafka-console-consumer.sh --bootstrap-server kafka-primary-1:9092 \
  --topic _schemas --from-beginning \
  --property print.key=true > schemas-backup.json

Tested failover runbooks

A runbook turns a crisis into a checklist. Document each step precisely, automate what you can, and schedule a game day at least quarterly to run the whole thing against real (or staging) clusters.

DR FAILOVER RUNBOOK — primary region loss
1. Declare incident; confirm primary is truly down (not a network partition).
2. Stop MM2 replication into DR to prevent split-brain writes.
3. Verify DR lag is within RPO (check heartbeat/checkpoint topics).
4. Repoint producer/consumer clients to dr.bootstrap.servers (DNS or config flag).
5. Confirm consumer groups resume via translated offsets; watch lag drain.
6. Update Schema Registry endpoint to the DR registry.
7. Notify stakeholders; begin monitoring DR as the new primary.

FAILBACK
8. Once primary recovers, reverse MM2 to seed primary from DR.
9. Wait for primary lag ~0, then schedule a planned cutover back.

Failover that flips clients while replication is still running is the classic split-brain trap: both clusters accept conflicting writes and the topics diverge. Always fence off the dead region’s replication before redirecting traffic.

Best Practices

Derive your topology from explicit, business-agreed RPO and RTO targets rather than defaulting to the most expensive option.
Use distinct cluster aliases and let MM2’s prefixing scheme prevent replication loops, especially in active-active deployments.
Always run MirrorCheckpointConnector with offset sync enabled so consumers can resume cleanly after failover.
Replicate metadata too — topic configs, ACLs, quotas, and Schema Registry contents — kept as version-controlled declarative manifests.
Stop replication into the target cluster before redirecting clients to avoid split-brain divergence.
Rehearse the full failover and failback runbook on a schedule; an untested DR plan will fail when you need it most.
Monitor replication lag and heartbeat freshness continuously, and alert when lag exceeds your RPO budget.