Common Pitfalls & Anti-Patterns

Most Kafka incidents are not exotic. They come from a handful of defaults that look harmless in a demo and bite hard under production load: offsets committed before work is done, a topic created with one partition that can never scale, a single key that funnels all traffic to one broker. This page catalogs the mistakes that cause the most pain, in a problem-then-fix format you can map directly onto your own services.

Auto-commit before processing

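Problem. With enable.auto.commit=true, the consumer commits offsets on a timer (auto.commit.interval.ms, default 5 s), regardless of whether your code has finished handling the records. In the Java client the commit runs inside poll() and covers everything the previous poll() returned, so the risk is highest when records are handed to another thread or buffered: an offset can be committed for messages you have only received, not processed. If the process crashes after that commit but before the work completes, those records are never redelivered, which is silent data loss. The opposite failure is also possible: a crash after the work is done but before the next commit replays those records as duplicates.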

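Fix. Turn off auto-commit and commit only after the work is durable. With Spring for Apache Kafka, use AckMode.RECORD (the container commits after the listener returns for each record) or AckMode.MANUAL_IMMEDIATE (the offset is committed when your listener calls acknowledge()).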

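import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.listener.ContainerProperties;

// Class name chosen to avoid clashing with org.apache.kafka.clients.consumer.ConsumerConfig
@Configuration
public class KafkaConsumerConfiguration {

    // Bean name matches the default container factory that @KafkaListener looks up
    @Bean
    ConcurrentKafkaListenerContainerFactory<String, OrderEvent> kafkaListenerContainerFactory(
            ConsumerFactory<String, OrderEvent> consumerFactory) {
        var factory = new ConcurrentKafkaListenerContainerFactory<String, OrderEvent>();
        factory.setConsumerFactory(consumerFactory);
        // Offsets are committed only when the listener calls Acknowledgment.acknowledge()
        factory.getContainerProperties().setAckMode(ContainerProperties.AckMode.MANUAL_IMMEDIATE);
        return factory;
    }
}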

@KafkaListener(topics = "orders", groupId = "billing")
public void onOrder(OrderEvent event, Acknowledgment ack) {
    chargeCustomer(event);   // do the real work first
    ack.acknowledge();       // only then commit the offset
}

Commit after the side effect is durable, never before. Auto-commit trades correctness for one less line of config — a bad trade for anything that touches money or state.
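
To make the ordering concrete, here is a minimal sketch that models a single consumer as a loop over offsets. It is a toy simulation, not Kafka client code: the class CommitOrderingDemo and its run method are hypothetical names, and the "crash" is simply one interrupted iteration followed by a restart from the committed offset.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of one consumer; no Kafka client is involved. */
final class CommitOrderingDemo {

    /**
     * Replays offsets 0..records-1, killing the process once while it handles
     * offset crashAt, then restarting from the last committed offset.
     * Returns the offsets whose processing actually completed.
     */
    static List<Integer> run(int records, int crashAt, boolean commitBeforeProcessing) {
        List<Integer> completed = new ArrayList<>();
        int committed = 0;          // next offset to read after a restart
        boolean crashed = false;
        while (committed < records) {
            int offset = committed;
            if (commitBeforeProcessing) {
                committed = offset + 1;   // offset marked done before the work runs
            }
            if (!crashed && offset == crashAt) {
                crashed = true;           // process dies mid-work...
                continue;                 // ...and resumes from the committed offset
            }
            completed.add(offset);        // the real work finished
            if (!commitBeforeProcessing) {
                committed = offset + 1;   // commit only after the work is done
            }
        }
        return completed;
    }

    public static void main(String[] args) {
        System.out.println("commit first: " + run(5, 2, true));
        System.out.println("commit after: " + run(5, 2, false));
    }
}
```

With the commit placed first, the record that was in flight during the crash is skipped on restart; with the commit placed after the work, the same record is picked up again.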

Too few or too many partitions

Partition count is the unit of parallelism: within a consumer group each partition is read by at most one consumer, so any consumers beyond the partition count sit idle. Create a topic with one partition and that group cannot scale past a single consumer until you add partitions, no matter how many instances you deploy. Go to the other extreme, tens of thousands of partitions per cluster, and you pay in controller metadata, open file handles, and longer leader-election and rebalance times.

Symptom	Likely cause	Fix
Consumers idle, lag growing	Too few partitions	Size partitions ≥ peak consumer instances
Slow rebalances, high metadata load	Too many partitions	Consolidate topics; aim for a few thousand per broker, not tens of thousands
Uneven consumer load	Skewed keys	See hot keys below

A practical sizing rule: partitions = max(ceil(target_throughput / per_partition_throughput), peak_consumer_count), then round up for headroom. You can increase the partition count later (it can never be decreased), but doing so changes the key-to-partition mapping for new records and breaks ordering guarantees for existing keys, so size with growth in mind from the start.
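
The sizing rule can be written down as a few lines of arithmetic. This is a sketch under assumptions: the class PartitionSizing, the method recommendedPartitions, and the headroomFactor parameter are illustrative names, and the throughput figures must come from your own measurements.

```java
/** Sketch of the sizing rule above; all names here are illustrative. */
final class PartitionSizing {

    /**
     * @param targetMbPerSec        peak throughput the topic must sustain
     * @param perPartitionMbPerSec  measured throughput of one partition
     * @param peakConsumers         most consumers the group will ever run
     * @param headroomFactor        growth multiplier, 1.0 means no headroom
     */
    static int recommendedPartitions(double targetMbPerSec, double perPartitionMbPerSec,
                                     int peakConsumers, double headroomFactor) {
        if (perPartitionMbPerSec <= 0 || headroomFactor < 1.0) {
            throw new IllegalArgumentException("throughput must be > 0 and headroom >= 1.0");
        }
        int forThroughput = (int) Math.ceil(targetMbPerSec / perPartitionMbPerSec);
        int base = Math.max(forThroughput, peakConsumers);
        return (int) Math.ceil(base * headroomFactor);   // round up for headroom
    }

    public static void main(String[] args) {
        // 100 MB/s target, 10 MB/s per partition, 6 consumers at peak, 50% headroom
        System.out.println(recommendedPartitions(100.0, 10.0, 6, 1.5));
    }
}
```

Whichever constraint is larger wins: a low-throughput topic with many consumers is sized by consumer count, a high-throughput topic by throughput.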

Hot keys and partition skew

Problem. For records with a non-null key, the default partitioner picks the partition as hash(key) % partitions. If one key dominates (a single tenant, a field that usually carries the same placeholder value, a "global" event), that key's partition becomes a hot spot. One broker and one consumer thread saturate while the rest sit idle, and lag climbs only on the hot partition.

Fix. Choose a high-cardinality key, or salt the key when ordering per-entity is not required.

import java.util.concurrent.ThreadLocalRandom;

// Bad: every event for the busiest tenant lands on one partition
producer.send(new ProducerRecord<>("events", tenantId, payload));

// Better: spread a hot tenant across up to 8 salted keys when per-tenant order isn't needed
String saltedKey = tenantId + "-" + ThreadLocalRandom.current().nextInt(8);
producer.send(new ProducerRecord<>("events", saltedKey, payload));

Monitor per-partition byte rate and consumer lag; a flat distribution is healthy, a spike on one partition is skew.
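
One way to quantify "flat versus spike" is the ratio of the busiest partition to the mean. The sketch below is an assumption-laden illustration: SkewCheck is a hypothetical helper, it uses String.hashCode() as a stand-in for the producer's real hash function, and it salts round-robin rather than randomly so the result is repeatable.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Illustrative skew measurement; String.hashCode() stands in for the real partitioner hash. */
final class SkewCheck {

    static int partitionFor(String key, int partitions) {
        return Math.floorMod(key.hashCode(), partitions);   // floorMod keeps the result non-negative
    }

    static long[] countPerPartition(Iterable<String> keys, int partitions) {
        long[] counts = new long[partitions];
        for (String key : keys) {
            counts[partitionFor(key, partitions)]++;
        }
        return counts;
    }

    /** Busiest partition divided by the mean; 1.0 is perfectly flat. */
    static double skewRatio(long[] counts) {
        long max = 0;
        long total = 0;
        for (long c : counts) {
            max = Math.max(max, c);
            total += c;
        }
        return total == 0 ? 0.0 : max / ((double) total / counts.length);
    }

    public static void main(String[] args) {
        List<String> hot = Collections.nCopies(800, "tenant-42");
        List<String> salted = new ArrayList<>();
        for (int i = 0; i < 800; i++) {
            salted.add("tenant-42-" + (i % 8));   // deterministic salt for the demo
        }
        System.out.println("unsalted skew: " + skewRatio(countPerPartition(hot, 12)));
        System.out.println("salted skew:   " + skewRatio(countPerPartition(salted, 12)));
    }
}
```

A ratio near the partition count means one partition carries almost everything; the closer it gets to 1.0, the flatter the distribution.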

Oversized messages

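Problem. Pushing multi-megabyte payloads through Kafka inflates broker memory, replication traffic, and consumer fetch buffers. It also means raising several limits together: message.max.bytes on the broker (or max.message.bytes per topic), max.request.size on the producer, and usually the consumer and replica fetch sizes such as fetch.max.bytes. Miss the producer or broker limit and the send fails with RecordTooLargeException.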

Fix. Treat Kafka as an event log, not a file store. Keep records small (ideally < 1 MB) and use the claim-check pattern: write the blob to object storage and put only a reference on the topic.

public record DocumentUploaded(String documentId, String s3Uri, long sizeBytes) {}
// Producer stores the PDF in S3, then publishes the lightweight reference.
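
A minimal sketch of both sides of the claim check follows. It assumes a hypothetical InMemoryBlobStore standing in for object storage and a made-up bucket name (example-bucket); a real service would call its storage SDK in put and get and publish the returned event to the topic.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Claim-check sketch; the blob store is an in-memory stand-in, not an S3 client. */
final class ClaimCheckSketch {

    record DocumentUploaded(String documentId, String s3Uri, long sizeBytes) {}

    /** Hypothetical stand-in for an object store. */
    static final class InMemoryBlobStore {
        private final Map<String, byte[]> blobs = new ConcurrentHashMap<>();

        String put(String key, byte[] bytes) {
            String uri = "s3://example-bucket/" + key;   // assumed bucket name
            blobs.put(uri, bytes.clone());
            return uri;
        }

        byte[] get(String uri) {
            byte[] bytes = blobs.get(uri);
            if (bytes == null) {
                throw new IllegalArgumentException("No blob at " + uri);
            }
            return bytes.clone();
        }
    }

    /** Producer side: store the payload first, then build the small event to publish. */
    static DocumentUploaded checkIn(InMemoryBlobStore store, String documentId, byte[] pdf) {
        String uri = store.put("documents/" + documentId + ".pdf", pdf);
        return new DocumentUploaded(documentId, uri, pdf.length);
    }

    /** Consumer side: follow the reference to fetch the payload. */
    static byte[] redeem(InMemoryBlobStore store, DocumentUploaded event) {
        return store.get(event.s3Uri());
    }

    public static void main(String[] args) {
        InMemoryBlobStore store = new InMemoryBlobStore();
        DocumentUploaded event = checkIn(store, "doc-1", new byte[3 * 1024 * 1024]);
        System.out.println(event + " -> " + redeem(store, event).length + " bytes");
    }
}
```

Writing the blob before publishing the event matters: a consumer that sees the reference must always be able to resolve it.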

Ignoring consumer lag

Lag is the single best leading indicator of a sick consumer. Ignored, it grows until the oldest unread offsets fall off the retention window and you lose data outright.

kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group billing

Output:

GROUP    TOPIC   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
billing  orders  0          84210           84212           2
billing  orders  1          61003           93887           32884

Alert on the lag trend, not a single absolute value. The 32884 on partition 1 above is only a snapshot; if it keeps rising across successive samples, the consumer that owns that partition can't keep up. Export the metric to Prometheus and page on sustained growth.
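
As a rough illustration of "trend, not absolute value", the sketch below computes lag from the two offset columns and flags a partition only when lag has risen across several consecutive samples. LagTrend, sustainedGrowth, and the window size are hypothetical; a real setup would express the same rule in its alerting system.

```java
/** Illustrative lag math and trend check; names and thresholds are assumptions. */
final class LagTrend {

    /** Lag is LOG-END-OFFSET minus CURRENT-OFFSET, never negative. */
    static long lag(long currentOffset, long logEndOffset) {
        return Math.max(0L, logEndOffset - currentOffset);
    }

    /** True if each of the last `window` samples is strictly higher than the one before it. */
    static boolean sustainedGrowth(long[] lagSamples, int window) {
        if (window < 1 || lagSamples.length < window + 1) {
            return false;   // not enough history to call it a trend
        }
        for (int i = lagSamples.length - window; i < lagSamples.length; i++) {
            if (lagSamples[i] <= lagSamples[i - 1]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("partition 1 lag: " + lag(61003, 93887));
        System.out.println("rising:   " + sustainedGrowth(new long[] {100, 250, 900, 4000}, 3));
        System.out.println("hovering: " + sustainedGrowth(new long[] {2, 0, 3, 1}, 3));
    }
}
```

A partition whose lag hovers around a small number never trips the check, while one that climbs sample after sample does, even before the absolute value looks alarming.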

No schema governance

Without a contract, the first producer that removes a field or changes a type can break downstream consumers at deploy time. Register schemas (Avro, Protobuf, or JSON Schema) and enforce a compatibility mode (BACKWARD is the Confluent Schema Registry default) so the registry rejects breaking changes before they ship.

spring:
  kafka:
    producer:
      properties:
        schema.registry.url: http://schema-registry:8081
        auto.register.schemas: false   # register via CI, not silently at runtime
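
To show what a BACKWARD check is protecting against, here is a deliberately simplified model: a schema is a map of field name to type, and a new schema is accepted only if a reader using it can still decode data written with the old one. BackwardCompatibilitySketch is hypothetical and ignores real registry rules such as type promotion, aliases, and unions.

```java
import java.util.Map;

/** Simplified BACKWARD-compatibility model; not the registry's actual algorithm. */
final class BackwardCompatibilitySketch {

    /** One schema field; hasDefault says whether a reader can fill it in when absent. */
    record Field(String type, boolean hasDefault) {}

    /** True if a reader using newSchema can decode data written with oldSchema. */
    static boolean isBackwardCompatible(Map<String, Field> oldSchema,
                                        Map<String, Field> newSchema) {
        for (Map.Entry<String, Field> entry : newSchema.entrySet()) {
            Field before = oldSchema.get(entry.getKey());
            Field after = entry.getValue();
            if (before == null) {
                if (!after.hasDefault()) {
                    return false;   // new field without a default: old data cannot supply it
                }
            } else if (!before.type().equals(after.type())) {
                return false;       // type changed under the same name
            }
        }
        return true;   // fields dropped from newSchema are simply ignored by the reader
    }

    public static void main(String[] args) {
        Map<String, Field> v1 = Map.of("id", new Field("string", false),
                                       "amount", new Field("long", false));
        Map<String, Field> v2 = Map.of("id", new Field("string", false),
                                       "amount", new Field("long", false),
                                       "currency", new Field("string", true));
        System.out.println("v1 -> v2 compatible: " + isBackwardCompatible(v1, v2));
    }
}
```

Running a check like this in CI, against the schemas already registered, is what lets auto.register.schemas stay off at runtime.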

Blocking the poll loop

Problem. A consumer must call poll() regularly. If the gap between calls exceeds max.poll.interval.ms (default 5 min), the client leaves the group and the coordinator triggers a rebalance. Slow work inside the listener (a synchronous HTTP call, a long DB transaction) is the usual cause: the partitions are reassigned, uncommitted records are processed again by another consumer, and repeated evictions turn into a rebalance storm.

Fix. Keep per-record work fast, lower max.poll.records, and offload truly slow work. Spring’s container manages the poll loop for you; if you fan out to an executor, pause the partition so you don’t fetch faster than you can process.

max.poll.records=50
max.poll.interval.ms=300000
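
The two settings above are linked by simple arithmetic: a full batch processed sequentially at worst-case latency has to finish inside max.poll.interval.ms. The sketch below checks that budget; PollBudget, its method names, and the safety factor are illustrative assumptions, and the per-record latency has to come from your own measurements.

```java
/** Back-of-envelope poll budget check; names and the safety factor are assumptions. */
final class PollBudget {

    /** Time to work through a full batch if every record hits worst-case latency. */
    static long worstCaseBatchMs(int maxPollRecords, long worstCasePerRecordMs) {
        return Math.multiplyExact((long) maxPollRecords, worstCasePerRecordMs);
    }

    static boolean fitsBudget(int maxPollRecords, long worstCasePerRecordMs,
                              long maxPollIntervalMs) {
        return worstCaseBatchMs(maxPollRecords, worstCasePerRecordMs) < maxPollIntervalMs;
    }

    /** Largest max.poll.records that uses at most safetyFactor of the interval. */
    static int maxSafeRecords(long maxPollIntervalMs, long worstCasePerRecordMs,
                              double safetyFactor) {
        if (worstCasePerRecordMs <= 0 || safetyFactor <= 0 || safetyFactor > 1) {
            throw new IllegalArgumentException("latency must be > 0 and 0 < safetyFactor <= 1");
        }
        long budgetMs = (long) (maxPollIntervalMs * safetyFactor);
        return (int) Math.max(1L, budgetMs / worstCasePerRecordMs);
    }

    public static void main(String[] args) {
        // 50 records at a worst case of 2 s each, against a 300000 ms interval
        System.out.println("fits: " + fitsBudget(50, 2_000, 300_000));
        System.out.println("max safe records at 50% budget: "
                + maxSafeRecords(300_000, 2_000, 0.5));
    }
}
```

If the budget does not fit even with a small batch, the work is too slow for the poll loop and belongs on an executor with the partition paused.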

Treating Kafka like a database

Kafka is an append-only log, not a queryable store. Anti-patterns include doing point lookups by scanning a topic, relying on infinite retention as your system of record without compaction, or expecting transactional reads across topics. If you need lookups, project the stream into a real store (a DB, or a Kafka Streams KTable/state store) and query that.
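
The idea of projecting a stream into a queryable store can be sketched without any Kafka dependency: fold the changelog into a map, keeping the latest value per key and treating a null value as a delete. MaterializedViewSketch and the Change record are hypothetical names; in practice the map would be a database table or a Kafka Streams state store.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Folds a changelog into a key-value view; an in-memory stand-in for a real store. */
final class MaterializedViewSketch {

    /** One changelog record; a null value is treated as a delete (tombstone). */
    record Change(String key, String value) {}

    static Map<String, String> project(List<Change> changelog) {
        Map<String, String> view = new LinkedHashMap<>();
        for (Change change : changelog) {
            if (change.value() == null) {
                view.remove(change.key());               // tombstone removes the key
            } else {
                view.put(change.key(), change.value());  // later records overwrite earlier ones
            }
        }
        return view;
    }

    public static void main(String[] args) {
        List<Change> log = new ArrayList<>();
        log.add(new Change("order-1", "CREATED"));
        log.add(new Change("order-2", "CREATED"));
        log.add(new Change("order-1", "PAID"));
        log.add(new Change("order-2", null));
        Map<String, String> view = project(log);
        System.out.println(view.get("order-1"));   // point lookup against the view, not the topic
    }
}
```

The lookup cost is paid once while consuming, so reads hit the view directly instead of rescanning the topic.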

Best Practices

Disable auto-commit and acknowledge only after work is durable.
Size partitions for peak consumer count plus headroom; remember repartitioning breaks key ordering.
Use high-cardinality keys and watch per-partition skew, not just totals.
Keep messages small; use the claim-check pattern for large payloads.
Alert on consumer lag trend and export it as a first-class metric.
Enforce schema compatibility in CI so breaking changes never reach the broker.
Keep listener work fast and offload slow operations off the poll loop.