Handling Failures & Resilience

Distributed systems fail constantly: brokers crash, disks fill, networks partition, and consumers die mid-poll. Kafka is built to survive these events without losing data or stalling forever, but only if your clients and topics are configured for it. This page walks through the main failure modes you will hit in production and the mechanisms — replication, retries, rebalancing, and dead-letter handling — that let your application recover gracefully instead of paging you at 3 a.m.

Broker and leader failure

Every partition has one leader and a set of follower replicas. By default, producers and consumers talk only to the leader; followers continuously fetch from it, and those that keep up form, together with the leader, the in-sync replica (ISR) set. When a broker hosting a leader dies, the controller promotes one of the in-sync followers to leader. Clients discover the new leader on their next metadata refresh and resume work.

The guarantee that no acknowledged data is lost during failover depends on four settings working together:

Setting	Where	Recommended	Why it matters
`replication.factor`	topic	3	Tolerates the loss of one broker while still meeting `min.insync.replicas=2`
`min.insync.replicas`	topic/broker	2	A write with `acks=all` is rejected unless at least 2 replicas (leader included) are in sync
`acks`	producer	`all`	Producer waits for all ISR members before considering a write done
`unclean.leader.election.enable`	broker/topic	`false`	Forbids electing an out-of-sync replica, which would lose data

With acks=all and min.insync.replicas=2, every acknowledged write exists on at least two replicas, so it survives the loss of any single broker. If only one replica remains in sync, producers receive NotEnoughReplicasException rather than silently writing data that could vanish.
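The arithmetic behind these settings can be sketched in a few lines. This is a simplified model for illustration, not Kafka's implementation: the class and method names are hypothetical, and it assumes all replicas start in sync and that unclean leader election is disabled.

```java
/** Simplified model of the acks=all rules described above (hypothetical helper, not a Kafka API). */
public final class DurabilityModel {

    /** An acks=all write is accepted only while the ISR (leader included) has enough members. */
    public static boolean acceptsAcksAllWrite(int inSyncReplicas, int minInsyncReplicas) {
        return inSyncReplicas >= minInsyncReplicas;
    }

    /** Brokers that can fail while the partition still accepts acks=all writes. */
    public static int failuresWhileWritable(int replicationFactor, int minInsyncReplicas) {
        return replicationFactor - minInsyncReplicas;
    }

    /** Replica losses that an acknowledged record is guaranteed to survive. */
    public static int failuresWithoutDataLoss(int minInsyncReplicas) {
        return minInsyncReplicas - 1;
    }

    public static void main(String[] args) {
        int replicationFactor = 3;
        int minInsyncReplicas = 2;
        System.out.println(acceptsAcksAllWrite(2, minInsyncReplicas));                    // true
        System.out.println(acceptsAcksAllWrite(1, minInsyncReplicas));                    // false
        System.out.println(failuresWhileWritable(replicationFactor, minInsyncReplicas));  // 1
        System.out.println(failuresWithoutDataLoss(minInsyncReplicas));                   // 1
    }
}
```

With the recommended values, one broker can be lost without interrupting writes and without losing any acknowledged record. Raising min.insync.replicas to 3 on a three-replica topic would improve the second number but drop the first to zero.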

Producer retries and timeouts

A leader failover causes transient errors such as NotLeaderOrFollowerException (named NotLeaderForPartitionException in older clients), NetworkException, and request timeouts. The producer treats these as retriable and resends automatically. The key is to pair retries with idempotence so resends never create duplicates.

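import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();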
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

props.put(ProducerConfig.ACKS_CONFIG, "all");
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);   // dedupes retries
props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE); // bounded by delivery.timeout.ms
props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120_000);
props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 5);
props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 30_000);

try (Producer<String, String> producer = new KafkaProducer<>(props)) {
    producer.send(new ProducerRecord<>("orders", "order-42", "{...}"), (md, ex) -> {
        if (ex != null) {
            // Fired on a non-retriable error or once delivery.timeout.ms expires; either way the record was not written.
            log.error("Permanent send failure for orders", ex);
        }
    });
}

delivery.timeout.ms is the true upper bound on how long the producer keeps retrying a record after send() returns; retries alone no longer caps the duration. With idempotence enabled you can safely keep max.in.flight.requests.per.connection at up to 5 and still preserve per-partition ordering.
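To see why the time budget, not the retry count, is what limits resends, here is a rough estimate in plain Java. The class and method names are hypothetical, and the linger.ms and retry.backoff.ms values are assumptions rather than settings from the example above. It relies on the producer's requirement that delivery.timeout.ms be at least linger.ms + request.timeout.ms, and it models the worst case where every attempt runs to the full request timeout.

```java
/** Rough delivery-time budget estimate (hypothetical helper, not part of the Kafka client). */
public final class DeliveryBudget {

    /** The producer rejects a config where delivery.timeout.ms < linger.ms + request.timeout.ms. */
    public static boolean isValid(long deliveryTimeoutMs, long lingerMs, long requestTimeoutMs) {
        return deliveryTimeoutMs >= lingerMs + requestTimeoutMs;
    }

    /** Worst case: each attempt times out after request.timeout.ms, then waits retry.backoff.ms. */
    public static long worstCaseAttempts(long deliveryTimeoutMs, long requestTimeoutMs, long retryBackoffMs) {
        return deliveryTimeoutMs / (requestTimeoutMs + retryBackoffMs);
    }

    public static void main(String[] args) {
        long deliveryTimeoutMs = 120_000; // from the producer config above
        long requestTimeoutMs = 30_000;   // from the producer config above
        long lingerMs = 0;                // assumed value
        long retryBackoffMs = 100;        // assumed value

        System.out.println(isValid(deliveryTimeoutMs, lingerMs, requestTimeoutMs));                 // true
        System.out.println(worstCaseAttempts(deliveryTimeoutMs, requestTimeoutMs, retryBackoffMs)); // 3
    }
}
```

Under these assumptions only about three full attempts fit into the budget, even though retries is set to Integer.MAX_VALUE. Errors that fail fast, such as a leader change, leave room for many more.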

Consumer crash, rebalance, and offset replay

When a consumer instance crashes or stalls, the group notices: the coordinator sees missed heartbeats, or the client itself leaves the group after exceeding max.poll.interval.ms. Either way a rebalance is triggered and that consumer’s partitions are reassigned to surviving members. Those partitions resume from the last committed offset, so any records processed but not yet committed are reprocessed.

This replay is the heart of at-least-once delivery and is why consumer logic should be idempotent. The three failure-relevant timeouts are:

Property	Default	Controls
`session.timeout.ms`	45000	How long the coordinator waits for heartbeats before evicting a member
`heartbeat.interval.ms`	3000	How often the client sends heartbeats
`max.poll.interval.ms`	300000	Max time between `poll()` calls before the member is considered stuck
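Because a rebalance can replay records that were already processed, the handler has to tolerate seeing the same record twice. A minimal sketch of one way to do that follows, assuming each event carries a unique ID. The class name and the in-memory set are illustrative only; a real service would keep the processed IDs in a durable store and update it atomically with the side effect.

```java
import java.util.HashSet;
import java.util.Set;

/** Minimal idempotent handler: applies each event ID at most once (illustrative, in-memory only). */
public final class IdempotentHandler {

    private final Set<String> processedIds = new HashSet<>();
    private int sideEffects = 0;

    /** Returns true if the event was applied, false if it was skipped as a duplicate. */
    public boolean handle(String eventId, String payload) {
        if (!processedIds.add(eventId)) {
            return false; // already processed before the replay
        }
        sideEffects++; // stand-in for the real side effect (DB write, API call, ...)
        return true;
    }

    public int sideEffects() {
        return sideEffects;
    }

    public static void main(String[] args) {
        IdempotentHandler handler = new IdempotentHandler();
        System.out.println(handler.handle("order-42", "{...}")); // true
        System.out.println(handler.handle("order-42", "{...}")); // false (replayed after a rebalance)
        System.out.println(handler.sideEffects());               // 1
    }
}
```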

A common production bug is a slow handler that exceeds max.poll.interval.ms, causing the member to be kicked out, its work reassigned, and the same batch reprocessed elsewhere — a self-inflicted rebalance storm. Fix it by shrinking max.poll.records, speeding up processing, or moving work off the poll thread.
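A quick way to size max.poll.records is to work backwards from the poll interval. The helper below is a hypothetical sketch, not a Kafka API. It assumes you know a worst-case processing time per record and want to use only a fraction of max.poll.interval.ms, leaving headroom for slow outliers.

```java
/** Back-of-the-envelope sizing for max.poll.records (hypothetical helper). */
public final class PollBudget {

    /**
     * Largest batch that can be processed within the given fraction of max.poll.interval.ms.
     *
     * @param headroom fraction of the interval to spend on processing, in (0, 1]
     */
    public static int maxSafePollRecords(long maxPollIntervalMs, long worstCaseMsPerRecord, double headroom) {
        if (maxPollIntervalMs <= 0 || worstCaseMsPerRecord <= 0 || headroom <= 0 || headroom > 1) {
            throw new IllegalArgumentException("intervals must be positive and headroom in (0, 1]");
        }
        long budgetMs = (long) (maxPollIntervalMs * headroom);
        return (int) Math.max(1, budgetMs / worstCaseMsPerRecord);
    }

    public static void main(String[] args) {
        // Default max.poll.interval.ms of 300000, an assumed 2 s worst case per record, 50% headroom.
        System.out.println(maxSafePollRecords(300_000, 2_000, 0.5)); // 75
    }
}
```

In this example a handler that can take two seconds per record should poll at most 75 records at a time. A batch of 500 such records would need up to 1,000 seconds, far beyond the 300-second interval.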

Network partitions

A network partition can isolate a broker from the controller while it still believes it is a leader. Kafka avoids split-brain because acks=all writes require acknowledgement from the ISR: the isolated leader cannot get its followers to confirm, so those writes are never acknowledged (they time out or fail with NotEnoughReplicasException), while the controller elects a new leader from the in-sync replicas it can still reach. Producers see retriable errors and reconnect to the new leader once metadata refreshes. When the isolated broker rejoins, it truncates any unacknowledged records and catches up from the new leader. Keeping unclean.leader.election.enable=false ensures that an out-of-sync replica, which may be missing committed records, is never elected leader; electing it would discard those records.

Poison pills and the dead-letter topic

Not all failures are infrastructure failures. A single malformed record (a “poison pill”) can trap a consumer in an infinite retry loop because the offset can never advance past it. Spring for Apache Kafka handles this at two layers: an error-handling deserializer that converts deserialization failures into recoverable records, and a DefaultErrorHandler paired with a DeadLetterPublishingRecoverer that retries a few times and then routes the bad record to a dead-letter topic (DLT) so the stream keeps flowing.

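import org.apache.kafka.common.TopicPartition;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.listener.DeadLetterPublishingRecoverer;
import org.springframework.kafka.listener.DefaultErrorHandler;
import org.springframework.util.backoff.FixedBackOff;

@Configuration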
public class KafkaErrorConfig {

    @Bean
    public DefaultErrorHandler errorHandler(KafkaTemplate<Object, Object> template) {
        var recoverer = new DeadLetterPublishingRecoverer(template,
            (record, ex) -> new TopicPartition(record.topic() + ".DLT", record.partition()));
        // 3 retries, 1s apart, then publish to <topic>.DLT
        return new DefaultErrorHandler(recoverer, new FixedBackOff(1000L, 3L));
    }
}

spring:
  kafka:
    consumer:
      key-deserializer: org.springframework.kafka.support.serializer.ErrorHandlingDeserializer
      value-deserializer: org.springframework.kafka.support.serializer.ErrorHandlingDeserializer
      properties:
        spring.deserializer.key.delegate.class: org.apache.kafka.common.serialization.StringDeserializer
        spring.deserializer.value.delegate.class: org.springframework.kafka.support.serializer.JsonDeserializer
        spring.json.trusted.packages: "com.devcraftly.events"

A record that still fails after the configured retries lands on orders.DLT with the original payload plus exception headers, letting you alert, inspect, and replay it later instead of blocking the whole partition.

Output:

ERROR o.s.k.l.DeadLetterPublishingRecoverer - Publishing failed record to orders.DLT-0
  exception: com.fasterxml.jackson.databind.exc.InvalidFormatException
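Conceptually, the retry-then-recover flow configured above comes down to a small loop. The sketch below is a plain-Java illustration of that idea, not Spring's implementation: the class name is hypothetical, the dead-letter topic is modelled as an in-memory list, and the back-off delay is omitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Illustrative retry-then-dead-letter loop (not Spring's actual implementation). */
public final class RetryThenDeadLetter {

    private final int maxRetries;
    private final List<String> deadLetters = new ArrayList<>(); // stand-in for <topic>.DLT

    public RetryThenDeadLetter(int maxRetries) {
        this.maxRetries = maxRetries;
    }

    /** Runs the handler, retrying on failure; returns the number of delivery attempts made. */
    public int process(String record, Consumer<String> handler) {
        int attempts = 0;
        while (true) {
            attempts++;
            try {
                handler.accept(record);
                return attempts; // success: the offset can advance
            } catch (RuntimeException ex) {
                if (attempts > maxRetries) {
                    deadLetters.add(record); // retries exhausted: park the record and move on
                    return attempts;
                }
                // A real error handler would wait for the configured back-off here.
            }
        }
    }

    public List<String> deadLetters() {
        return List.copyOf(deadLetters);
    }

    public static void main(String[] args) {
        RetryThenDeadLetter pipeline = new RetryThenDeadLetter(3);
        int attempts = pipeline.process("bad-json", r -> {
            throw new IllegalArgumentException("malformed");
        });
        System.out.println(attempts);               // 4 (1 initial delivery + 3 retries)
        System.out.println(pipeline.deadLetters()); // [bad-json]
    }
}
```

The key property is that the loop always terminates: a record either succeeds or is parked, so later records in the partition are never blocked behind it.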

Best Practices

Run topics with replication.factor=3 and min.insync.replicas=2, and keep unclean.leader.election.enable=false.
Always set acks=all with enable.idempotence=true on producers, and rely on delivery.timeout.ms rather than a fixed retry count.
Make consumer processing idempotent so offset replay after a rebalance never corrupts state.
Keep handler work fast (or async) so you never breach max.poll.interval.ms and trigger needless rebalances.
Guard every consumer with an ErrorHandlingDeserializer and a DLT so a single poison pill cannot stall a partition.
Monitor under-replicated partitions, ISR shrink/expand events, and consumer lag — they are the earliest signals of a brewing failure.