Retries & Non-Blocking Retry

Transient failures — a momentarily unavailable database, a rate-limited downstream API, a brief network blip — are a normal part of consuming events at scale. Spring for Apache Kafka gives you two fundamentally different retry strategies: blocking retries that pause the partition and re-deliver the same record in place, and non-blocking retries that forward the failed record to separate delay topics so the main partition keeps flowing. Choosing the right one matters in production, because the wrong choice can stall throughput for every key on a partition behind a single slow message.

Blocking retry with DefaultErrorHandler

By default, when a @KafkaListener method throws, Spring’s DefaultErrorHandler retries the record by re-invoking the listener on the same consumer thread, governed by a BackOff. Because the consumer cannot advance past a record until it succeeds or exhausts its attempts, the entire partition is blocked for the duration of the back-off. This is simple and ordering-preserving, but a record that waits ten seconds between attempts holds up every later record on that partition for the whole window.

Configure it by registering a DefaultErrorHandler with a back-off policy and, optionally, a recoverer that publishes to a dead-letter topic once attempts run out.

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.listener.DeadLetterPublishingRecoverer;
import org.springframework.kafka.listener.DefaultErrorHandler;
import org.springframework.util.backoff.FixedBackOff;

@Configuration
public class ErrorHandlingConfig {

    @Bean
    public DefaultErrorHandler errorHandler(KafkaTemplate<Object, Object> template) {
        // Recover by sending the failed record to <topic>.DLT
        var recoverer = new DeadLetterPublishingRecoverer(template);
        // 3 retries, 2 seconds apart (4 total deliveries) — partition is blocked between attempts
        var backOff = new FixedBackOff(2000L, 3L);
        var handler = new DefaultErrorHandler(recoverer, backOff);
        // Don't retry validation errors — they will never succeed
        handler.addNotRetryableExceptions(IllegalArgumentException.class);
        return handler;
    }
}

Wire the handler into your container factory with factory.setCommonErrorHandler(errorHandler). For variable delays use ExponentialBackOffWithMaxRetries instead of FixedBackOff.

Blocking retry preserves per-partition ordering, which non-blocking retry deliberately does not. If strict in-order processing matters more than throughput, blocking retry is the correct tool — just keep the back-off short.

Non-blocking retry with @RetryableTopic

@RetryableTopic flips the model. When the listener throws, the record is immediately published to a dedicated retry topic and the original offset is committed, so the main partition is never blocked. A separate listener consumes the retry topic after the configured delay has elapsed, and if it fails again the record moves to the next retry topic. Once all attempts are exhausted, the record lands in a dead-letter topic (DLT). Spring auto-creates all of these topics for you.

import org.springframework.kafka.annotation.DltHandler;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.annotation.RetryableTopic;
import org.springframework.kafka.retrytopic.TopicSuffixingStrategy;
import org.springframework.retry.annotation.Backoff;
import org.springframework.stereotype.Component;

@Component
public class PaymentListener {

    @RetryableTopic(
        attempts = "4",                                   // 1 main + 3 retries
        backoff = @Backoff(delay = 1000, multiplier = 2.0), // 1s, 2s, 4s
        topicSuffixingStrategy = TopicSuffixingStrategy.SUFFIX_WITH_INDEX_VALUE)
    @KafkaListener(topics = "payments", groupId = "payment-processor")
    public void handle(PaymentEvent event) {
        chargeCard(event); // throws on transient downstream failure
    }

    @DltHandler
    public void onDlt(PaymentEvent event) {
        log.error("Payment exhausted retries, parked in DLT: {}", event.id());
    }
}

Enable the feature once on a @Configuration class (or your main application class) with @EnableKafkaRetryTopic — without it the annotation is ignored.

Auto-created topics

With the SUFFIX_WITH_INDEX_VALUE strategy above, a payments topic produces the following set. Each retry topic is consumed by its own container that honours the delay before processing.

payments          <- main topic
payments-retry-0  <- 1s delay
payments-retry-1  <- 2s delay
payments-retry-2  <- 4s delay
payments-dlt      <- terminal: records that exhausted all attempts

The default suffixing strategy (SUFFIX_WITH_DELAY_VALUE) instead names them payments-retry-1000, payments-retry-2000, and so on — encoding the delay in the topic name.

Choosing between the two

Aspect	Blocking (`DefaultErrorHandler`)	Non-blocking (`@RetryableTopic`)
Main partition during back-off	Blocked — head-of-line stall	Free — keeps consuming
Ordering guarantee	Preserved per partition	Lost (retries processed out of order)
Extra topics	None	Retry topics + DLT auto-created
Best for	Short delays, strict ordering	Long delays, high throughput
Failure isolation	One bad record stalls the partition	One bad record only delays itself

Non-blocking retry trades ordering for throughput. Never use it for streams where a later event depends on an earlier one for the same key (e.g. account.debited before account.credited) — the retry topic can reorder them.

Best Practices

Reserve blocking retry for short back-offs and ordering-sensitive streams; reserve non-blocking retry for long delays or when one slow key must not stall the whole partition.
Always classify exceptions: mark deterministic failures (validation, deserialization, IllegalArgumentException) as non-retryable so they go straight to the DLT instead of burning retry attempts.
Cap total retry time so poisoned records reach the DLT quickly rather than cycling for minutes — exponential back-off with a sensible maxDelay keeps this bounded.
Always implement a @DltHandler (or a DLT consumer) so exhausted records are logged, alerted on, and recoverable rather than silently lost.
Use exponential back-off with a multiplier for downstream services so retries spread out instead of hammering an already-struggling dependency.
Pre-create retry and DLT topics with the right partition count and retention in production rather than relying on broker auto-creation, and ensure consumers can keep up with the extra topics.