Retries & Timeouts
Sending a record to Kafka is rarely a single, instantaneous act. Brokers fail over, partition leaders move during an election or reassignment, and networks drop packets. These are transient conditions that usually resolve themselves within seconds. The producer’s retry and timeout configuration is what turns these blips into invisible non-events instead of lost messages or application errors. Getting this model right is the difference between a producer that quietly survives a rolling broker restart and one that throws TimeoutException at the first sign of trouble.
The timeout model at a glance
Modern Kafka producers use delivery.timeout.ms as the single, authoritative deadline for a record. From the moment send() returns (the record enters the accumulator) until the producer reports success or failure to your callback, the total elapsed time must not exceed this budget. Everything else — batching delay, individual request attempts, and retries — happens inside that window.
send() called
|
|-- linger.ms / batching ----.
|                            |
|-- request attempt #1 ------|  <= request.timeout.ms each
|-- retry.backoff.ms wait    |
|-- request attempt #2 ------|  <= request.timeout.ms each
|-- ...                      |
'----------------------------'  <= delivery.timeout.ms TOTAL
|
success or TimeoutException -> callback
Because delivery.timeout.ms caps the whole lifecycle, the producer will keep retrying until either the record is acknowledged or the deadline passes — at which point it stops and fails the record, regardless of how many retries remain.
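To get a feel for how the deadline bounds retries, the arithmetic can be sketched in plain Java. This is a simplified model, not the client's actual scheduling: it assumes every attempt runs to its full request.timeout.ms, the batch waits the full linger.ms first, and the pause between attempts is a fixed value with no exponential growth or jitter. The class and method names are hypothetical.

```java
public class TimeoutBudget {
    /**
     * Counts how many attempts can run to their full request timeout before
     * the delivery deadline. Simplifying assumptions: every attempt times out,
     * the batch lingers for the full lingerMs, and backoffMs is constant.
     */
    static int fullAttemptsWithinDeadline(long deliveryTimeoutMs, long lingerMs,
                                          long requestTimeoutMs, long backoffMs) {
        if (requestTimeoutMs <= 0 || backoffMs < 0) {
            throw new IllegalArgumentException("timeouts must be positive");
        }
        long elapsed = lingerMs; // time already spent waiting in the accumulator
        int attempts = 0;
        // An attempt counts only if it can finish before the overall deadline.
        while (elapsed + requestTimeoutMs <= deliveryTimeoutMs) {
            attempts++;
            elapsed += requestTimeoutMs + backoffMs;
        }
        return attempts;
    }

    public static void main(String[] args) {
        // Default-style values: 120 s deadline, 30 s per request, 100 ms backoff.
        System.out.println(fullAttemptsWithinDeadline(120_000, 0, 30_000, 100)); // prints 3
        // A shorter per-request timeout leaves room for many more attempts.
        System.out.println(fullAttemptsWithinDeadline(120_000, 0, 5_000, 100));  // prints 23
    }
}
```

Under these assumptions the default-style values leave room for three full attempts, not four, because the backoff pauses also consume part of the budget.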
request.timeout.ms — the per-attempt deadline
request.timeout.ms controls how long the producer waits for a broker to acknowledge a single in-flight request before giving up on that attempt and (if time and retries allow) resending. It is not the overall deadline. A short request.timeout.ms means the producer detects a slow or unresponsive broker quickly and retries against the new leader; a longer one tolerates slow brokers but reacts more sluggishly to genuine failures. The default is 30 seconds.
retries and retry.backoff.ms
retries is the maximum number of additional attempts after the first failed request for a recoverable error (such as NotLeaderForPartitionException, renamed NotLeaderOrFollowerException in newer clients, or a request timeout). In modern clients it defaults to Integer.MAX_VALUE, which sounds dangerous but is safe precisely because delivery.timeout.ms is the real stopping condition: the producer effectively retries “as many times as fit in the deadline.”
retry.backoff.ms (default 100 ms) is the pause before a retry, which prevents a tight retry loop from hammering a recovering broker. Recent clients also grow that pause exponentially across repeated failures, capped at retry.backoff.max.ms (default 1000 ms).
# Tune the deadline and per-request behaviour; leave retries effectively unbounded.
delivery.timeout.ms=120000
request.timeout.ms=30000
retries=2147483647
retry.backoff.ms=100
retry.backoff.max.ms=1000
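The capped growth of the retry pause can be illustrated with a small calculation. This is a minimal sketch that assumes the pause doubles on each retry and ignores any randomized jitter the real client may add; the class and method names are hypothetical and not part of the Kafka API.

```java
public class BackoffSketch {
    /**
     * Returns the pause before retry number retryIndex (0 = first retry),
     * assuming the pause doubles each time and is capped at maxMs.
     * Jitter is deliberately left out to keep the numbers predictable.
     */
    static long backoffMs(int retryIndex, long initialMs, long maxMs) {
        double raw = initialMs * Math.pow(2, retryIndex);
        return (long) Math.min(raw, (double) maxMs);
    }

    public static void main(String[] args) {
        // With retry.backoff.ms=100 and retry.backoff.max.ms=1000:
        for (int i = 0; i < 6; i++) {
            System.out.println("retry " + i + " waits " + backoffMs(i, 100, 1000) + " ms");
        }
        // Pauses: 100, 200, 400, 800, 1000, 1000
    }
}
```

The cap matters more than the growth rate: once the pause reaches retry.backoff.max.ms it stays there, so a long outage is probed at a steady, bounded rate.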
A hard constraint the producer validates at startup:
delivery.timeout.ms must be >= request.timeout.ms + linger.ms. If it isn’t, the producer throws a ConfigException immediately rather than letting you ship a configuration that can never complete even a single attempt.
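The same inequality is easy to check in your own configuration tests before a producer is ever constructed. The helper below is a hypothetical pre-flight check written for illustration; it mirrors the rule stated above rather than reproducing the client's internal validation code.

```java
public class DeliveryTimeoutCheck {
    /**
     * Returns true when delivery.timeout.ms >= request.timeout.ms + linger.ms.
     * The sum is computed as a long so very large values cannot overflow.
     */
    static boolean isConsistent(int deliveryTimeoutMs, int requestTimeoutMs, int lingerMs) {
        long minimum = (long) requestTimeoutMs + lingerMs;
        return deliveryTimeoutMs >= minimum;
    }

    public static void main(String[] args) {
        System.out.println(isConsistent(120_000, 30_000, 0)); // prints true
        System.out.println(isConsistent(20_000, 30_000, 5));  // prints false
    }
}
```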
The ordering trap: max.in.flight.requests > 1
max.in.flight.requests.per.connection is how many unacknowledged requests the producer allows on a single connection at once (default 5). It interacts dangerously with retries. Suppose two batches, B1 then B2, are in flight to the same partition. If B1 fails and is retried while B2 already succeeded, B1 now lands after B2 — the records are reordered.
This is why the safe answer is not to set max.in.flight.requests.per.connection=1 (which kills throughput) but to enable idempotence. With enable.idempotence=true, the broker uses sequence numbers to reject out-of-order batches and force the producer to resend in the correct order, so you keep both ordering and pipelining with up to 5 in-flight requests.
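The B1/B2 scenario can be replayed with a toy model in plain Java. This is only an illustration of the idea: the ToyPartition class, its single per-partition sequence counter, and the resend logic are hypothetical simplifications and do not reflect the broker's real data structures or protocol.

```java
import java.util.ArrayList;
import java.util.List;

public class ReorderSketch {
    /** Toy partition log that optionally rejects batches arriving out of sequence. */
    static final class ToyPartition {
        final List<String> log = new ArrayList<>();
        final boolean checkSequence;
        int nextExpectedSeq = 0;

        ToyPartition(boolean checkSequence) {
            this.checkSequence = checkSequence;
        }

        /** Appends the batch and returns true, or returns false if it is out of order. */
        boolean append(int seq, String batch) {
            if (checkSequence && seq != nextExpectedSeq) {
                return false; // idempotent mode: refuse a batch that skips ahead
            }
            log.add(batch);
            nextExpectedSeq = seq + 1;
            return true;
        }
    }

    /** Replays: B1's first attempt is lost, B2 arrives, then B1 is retried. */
    static List<String> run(boolean idempotent) {
        ToyPartition partition = new ToyPartition(idempotent);
        boolean b2Accepted = partition.append(1, "B2"); // B2 reaches the broker first
        partition.append(0, "B1");                      // retry of the failed B1
        if (!b2Accepted) {
            partition.append(1, "B2");                  // producer resends the rejected B2
        }
        return partition.log;
    }

    public static void main(String[] args) {
        System.out.println("without idempotence: " + run(false)); // [B2, B1]
        System.out.println("with idempotence:    " + run(true));  // [B1, B2]
    }
}
```

Without the sequence check the retried B1 lands behind B2. With it, B2 is refused until B1 has been written, which costs one extra round trip but preserves order.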
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.errors.TimeoutException;
import java.util.Properties;

public class RetryingProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");

        // Safe retries: idempotence preserves ordering even with pipelining + retries.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120_000);
        props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 30_000);
        props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 5);

        // try-with-resources closes the producer, which flushes pending records first.
        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", "k1", "v1"), (md, ex) -> {
                if (ex instanceof TimeoutException) {
                    // Typically: delivery.timeout.ms elapsed before the record was acknowledged.
                    System.err.println("Gave up on record: " + ex.getMessage());
                } else if (ex != null) {
                    // Any other failure, e.g. a non-retriable broker error.
                    System.err.println("Send failed: " + ex.getMessage());
                } else {
                    System.out.printf("Delivered to %s-%d@%d%n",
                            md.topic(), md.partition(), md.offset());
                }
            });
        }
    }
}
Example output (the partition and offset depend on your cluster):
Delivered to orders-0@42
Configuring the same thing in Spring Boot
spring:
  kafka:
    producer:
      acks: all
      properties:
        enable.idempotence: true
        delivery.timeout.ms: 120000
        request.timeout.ms: 30000
        retry.backoff.ms: 100
        max.in.flight.requests.per.connection: 5
Timeout & retry config reference
| Property | Default | Scope | What it controls |
|---|---|---|---|
| delivery.timeout.ms | 120000 | Whole record lifecycle | Total deadline from send() to success/failure. The real stopping condition. |
| request.timeout.ms | 30000 | Single request | How long to wait for one broker acknowledgement before retrying. |
| retries | 2147483647 | Record | Max retry attempts; capped in practice by delivery.timeout.ms. |
| retry.backoff.ms | 100 | Between attempts | Initial pause before a retry. |
| retry.backoff.max.ms | 1000 | Between attempts | Upper bound for exponential retry backoff. |
| max.in.flight.requests.per.connection | 5 | Connection | Unacked requests allowed at once; >1 can reorder without idempotence. |
| max.block.ms | 60000 | send() / metadata | How long send() blocks waiting for buffer space or metadata. |
Best Practices
- Treat delivery.timeout.ms as the knob you actually tune: set it to the longest you are willing to wait for a record, and let retries stay effectively unbounded.
- Keep enable.idempotence=true (the modern default) so retries are safe and ordering is preserved even with max.in.flight.requests.per.connection=5.
- Never set max.in.flight.requests.per.connection > 1 without idempotence on an ordering-sensitive topic; silent reordering is the result.
- Lower request.timeout.ms (not delivery.timeout.ms) if you want faster failover to a new leader during broker restarts.
- Always handle TimeoutException distinctly in your callback; it means the deadline elapsed, not that the broker rejected the data.
- Size delivery.timeout.ms to comfortably exceed request.timeout.ms + linger.ms, or the producer will refuse to start.