Retries & Timeouts

Sending a record to Kafka is rarely a single, instantaneous act. Brokers fail over, partition leaders move during leader elections, and networks drop packets. These are transient conditions that usually resolve themselves within seconds. The producer’s retry and timeout configuration is what turns these blips into invisible non-events instead of lost messages or application errors. Getting this model right is the difference between a producer that quietly survives a rolling broker restart and one that throws TimeoutException at the first sign of trouble.

The timeout model at a glance

Modern Kafka producers use delivery.timeout.ms as the single, authoritative deadline for a record. From the moment send() returns (the record enters the accumulator) until the producer reports success or failure to your callback, the total elapsed time must not exceed this budget. Everything else (batching delay, individual request attempts, and retries) happens inside that window. Time spent blocked inside send() itself, waiting for metadata or buffer space, is bounded separately by max.block.ms.

send() returns
   |
   |-- linger.ms / batching ----.
   |                            |
   |-- request attempt #1 ------|  <= request.timeout.ms each
   |-- retry.backoff.ms wait    |
   |-- request attempt #2 ------|  <= request.timeout.ms each
   |-- ...                      |
   '----------------------------'  <= delivery.timeout.ms TOTAL
                                |
                          success or TimeoutException -> callback

Because delivery.timeout.ms caps the whole lifecycle, the producer keeps retrying recoverable errors until the record is acknowledged, the retries setting is exhausted, or the deadline passes. Once the deadline passes it stops and fails the record, regardless of how many retries remain.
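To get a feel for how much room the deadline leaves, the following is a rough sketch of the worst case, in which every attempt runs to its full request.timeout.ms and is followed by a fixed backoff. It is plain arithmetic, not producer code: the class and method names are hypothetical, the linger value of 5 ms is an assumed illustration, and the real producer can also fail a record while an attempt is still in flight.

```java
public class DeliveryBudget {
    /**
     * Counts how many request attempts can run to their full request.timeout.ms
     * before the delivery budget is used up. Simplified model: fixed backoff,
     * every attempt times out, partial attempts are not counted.
     */
    public static int worstCaseAttempts(long deliveryTimeoutMs, long lingerMs,
                                        long requestTimeoutMs, long retryBackoffMs) {
        long elapsed = lingerMs;              // batching delay before the first attempt
        int attempts = 0;
        while (elapsed + requestTimeoutMs <= deliveryTimeoutMs) {
            elapsed += requestTimeoutMs;      // the attempt times out
            attempts++;
            elapsed += retryBackoffMs;        // pause before the next attempt
        }
        return attempts;
    }

    public static void main(String[] args) {
        // 120 s budget, 30 s per attempt: three full attempts fit.
        System.out.println(worstCaseAttempts(120_000, 5, 30_000, 100)); // prints 3
        // Same budget, 10 s per attempt: eleven full attempts fit.
        System.out.println(worstCaseAttempts(120_000, 5, 10_000, 100)); // prints 11
    }
}
```

With the default 30-second request timeout, a completely unresponsive broker leaves room for only a handful of full attempts inside the default two-minute budget, which is worth keeping in mind when sizing either value.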

request.timeout.ms — the per-attempt deadline

request.timeout.ms controls how long the producer waits for a broker to acknowledge a single in-flight request before giving up on that attempt and (if time and retries allow) resending. It is not the overall deadline. A short request.timeout.ms means the producer detects a slow or unresponsive broker quickly and retries against the new leader; a longer one tolerates slow brokers but reacts more sluggishly to genuine failures. The default is 30 seconds.

retries and retry.backoff.ms

retries is the maximum number of additional attempts after the first failed request for a recoverable error (such as NotLeaderOrFollowerException, called NotLeaderForPartitionException before Kafka 2.6, or a request timeout). In modern clients it defaults to Integer.MAX_VALUE, which sounds dangerous but is safe precisely because delivery.timeout.ms is the real stopping condition: the producer effectively retries “as many times as fit in the deadline.”

retry.backoff.ms (default 100 ms) is the pause between attempts, preventing a tight retry loop from hammering a recovering broker. Newer clients (Kafka 3.7 and later) also apply exponential backoff, growing the pause on each successive retry up to retry.backoff.max.ms (default 1000 ms).

# Tune the deadline and per-request behaviour; leave retries effectively unbounded.
delivery.timeout.ms=120000
request.timeout.ms=30000
retries=2147483647
retry.backoff.ms=100
retry.backoff.max.ms=1000
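As an illustration of how the pause between attempts grows under exponential backoff, here is a minimal sketch that prints the waits for the settings above. It assumes the pause doubles on each retry and ignores the random jitter that real clients add, so treat the numbers as approximate; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class BackoffSchedule {
    /**
     * Returns the pause before each of the first `retries` retries.
     * Assumption: the pause doubles each time and is capped at maxMs; no jitter.
     */
    public static List<Long> schedule(long initialMs, long maxMs, int retries) {
        List<Long> waits = new ArrayList<>();
        long wait = initialMs;
        for (int i = 0; i < retries; i++) {
            waits.add(Math.min(wait, maxMs));
            wait = Math.min(wait * 2, maxMs); // doubling is an assumed growth factor
        }
        return waits;
    }

    public static void main(String[] args) {
        // retry.backoff.ms=100, retry.backoff.max.ms=1000
        System.out.println(schedule(100, 1000, 6)); // prints [100, 200, 400, 800, 1000, 1000]
    }
}
```

The cap is reached after only a few retries, so during a long outage the producer settles into roughly one attempt per retry.backoff.max.ms plus however long each request takes to fail.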

A hard constraint the producer validates at startup: delivery.timeout.ms must be >= request.timeout.ms + linger.ms. If it isn’t, the producer throws a ConfigException immediately rather than letting you ship a configuration that can never complete even a single attempt.
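The same rule can be checked in your own code before the producer is even constructed, for example when the values come from external configuration. The sketch below is a stand-in for that check, not the client's actual validation logic: the class and method names are hypothetical, and it throws IllegalArgumentException rather than Kafka's ConfigException so that it runs without the client library.

```java
public class TimeoutConfigCheck {
    /**
     * Throws if the delivery budget cannot cover the batching delay
     * plus one full request attempt.
     */
    public static void validate(long deliveryTimeoutMs, long requestTimeoutMs, long lingerMs) {
        long minimum = requestTimeoutMs + lingerMs;
        if (deliveryTimeoutMs < minimum) {
            throw new IllegalArgumentException(
                    "delivery.timeout.ms (" + deliveryTimeoutMs
                    + ") must be >= request.timeout.ms + linger.ms (" + minimum + ")");
        }
    }

    public static void main(String[] args) {
        validate(120_000, 30_000, 5);     // passes: 120000 >= 30005
        try {
            validate(20_000, 30_000, 5);  // fails: 20000 < 30005
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```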

The ordering trap: max.in.flight.requests.per.connection > 1

max.in.flight.requests.per.connection is how many unacknowledged requests the producer allows on a single connection at once (default 5). It interacts dangerously with retries. Suppose two batches, B1 then B2, are in flight to the same partition. If B1 fails and is retried while B2 already succeeded, B1 now lands after B2 — the records are reordered.

This is why the safe answer is not to set max.in.flight.requests.per.connection=1 (which kills throughput) but to enable idempotence. With enable.idempotence=true, the broker uses per-partition sequence numbers to reject out-of-order batches and force the producer to resend in the correct order, so you keep ordering while still pipelining up to 5 in-flight requests. Idempotence requires acks=all and a max.in.flight.requests.per.connection of at most 5.

import org.apache.kafka.clients.producer.*;
import java.util.Properties;

public class SafeRetryProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");

        // Safe retries: idempotence preserves ordering even with pipelining + retries.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120_000);
        props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 30_000);
        props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 5);

        // try-with-resources closes the producer, which flushes pending records first.
        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", "k1", "v1"), (md, ex) -> {
                if (ex instanceof org.apache.kafka.common.errors.TimeoutException) {
                    // delivery.timeout.ms elapsed before the record was acknowledged.
                    System.err.println("Gave up on record: " + ex.getMessage());
                } else if (ex != null) {
                    System.err.println("Send failed: " + ex.getMessage());
                } else {
                    System.out.printf("Delivered to %s-%d@%d%n",
                            md.topic(), md.partition(), md.offset());
                }
            });
        }
    }
}

Example output (the partition and offset depend on your cluster):

Delivered to orders-0@42
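To see why sequence numbers matter, the B1/B2 scenario can be replayed in a toy model that runs without a broker. This is a deliberately simplified sketch: the ToyPartition class is hypothetical, it tracks a single integer sequence, and it leaves out the producer IDs and epochs that the real protocol uses. It only illustrates the idea that a broker which rejects out-of-sequence batches forces them back into order.

```java
import java.util.ArrayList;
import java.util.List;

public class ReorderDemo {
    /** Toy partition log that can optionally enforce in-sequence appends. */
    static final class ToyPartition {
        final List<String> log = new ArrayList<>();
        final boolean checkSequence;
        int nextExpectedSeq = 0;

        ToyPartition(boolean checkSequence) {
            this.checkSequence = checkSequence;
        }

        /** Returns true if the batch was appended, false if rejected as out of order. */
        boolean append(int seq, String batch) {
            if (checkSequence && seq != nextExpectedSeq) {
                return false;
            }
            log.add(batch);
            nextExpectedSeq = seq + 1;
            return true;
        }
    }

    /** B1 (seq 0) is lost on its first attempt while B2 (seq 1) is already in flight. */
    public static List<String> run(boolean idempotent) {
        ToyPartition partition = new ToyPartition(idempotent);
        // The first attempt of B1 never reaches the log (simulated transient failure).
        boolean b2Accepted = partition.append(1, "B2"); // B2 arrives first
        partition.append(0, "B1");                      // retry of B1
        if (!b2Accepted) {
            partition.append(1, "B2");                  // rejected B2 is resent after B1
        }
        return partition.log;
    }

    public static void main(String[] args) {
        System.out.println(run(false)); // prints [B2, B1]
        System.out.println(run(true));  // prints [B1, B2]
    }
}
```

Without the sequence check the retry of B1 lands behind B2. With it, B2 is turned away until B1 has been written, so the log ends up in send order.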

Configuring the same thing in Spring Boot

spring:
  kafka:
    producer:
      acks: all
      properties:
        enable.idempotence: true
        delivery.timeout.ms: 120000
        request.timeout.ms: 30000
        retry.backoff.ms: 100
        max.in.flight.requests.per.connection: 5

Timeout & retry config reference

Property	Default	Scope	What it controls
`delivery.timeout.ms`	`120000`	Whole record lifecycle	Total deadline from `send()` to success/failure. The real stopping condition.
`request.timeout.ms`	`30000`	Single request	How long to wait for one broker acknowledgement before retrying.
`retries`	`2147483647`	Record	Max retry attempts; capped in practice by `delivery.timeout.ms`.
`retry.backoff.ms`	`100`	Between attempts	Initial pause before a retry.
`retry.backoff.max.ms`	`1000`	Between attempts	Upper bound for exponential retry backoff.
`max.in.flight.requests.per.connection`	`5`	Connection	Unacked requests allowed at once; >1 can reorder without idempotence.
`max.block.ms`	`60000`	`send()` / metadata	How long `send()` blocks waiting for buffer space or metadata.

Best Practices

Treat delivery.timeout.ms as the knob you actually tune — set it to the longest you are willing to wait for a record, and let retries stay effectively unbounded.
Keep enable.idempotence=true (the modern default) so retries are safe and ordering is preserved even with max.in.flight.requests.per.connection=5.
Never set max.in.flight.requests.per.connection > 1 without idempotence on an ordering-sensitive topic — silent reordering is the result.
Lower request.timeout.ms (not delivery.timeout.ms) if you want faster failover to a new leader during broker restarts.
Always handle TimeoutException distinctly in your callback; it means the deadline elapsed, not that the broker rejected the data.
Size delivery.timeout.ms to comfortably exceed request.timeout.ms + linger.ms, or the producer will refuse to start.