Common Pitfalls & Anti-Patterns
Most Kafka incidents are not exotic. They come from a handful of defaults that look harmless in a demo and bite hard under production load: offsets committed before work is done, a topic created with one partition that can never scale, a single key that funnels all traffic to one broker. This page catalogs the mistakes that cause the most pain, in a problem-then-fix format you can map directly onto your own services.
Auto-commit before processing
Problem. With enable.auto.commit=true, the consumer commits offsets on a timer (auto.commit.interval.ms, default 5s) regardless of whether your code finished handling the records. If the process crashes after a commit but before the work completes, those records are never redelivered: silent data loss. The inverse is also possible: a crash after the work completes but before the next commit means the records are redelivered and processed twice.
Fix. Turn off auto-commit and commit only after the work is durable. With Spring for Apache Kafka, prefer AckMode.RECORD (the container commits after the listener returns for each record) or MANUAL_IMMEDIATE (the offset is committed when your listener calls acknowledge()).
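import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.listener.ContainerProperties;

@Configuration
public class ConsumerConfig {
    // Named kafkaListenerContainerFactory so @KafkaListener methods use it by default;
    // under any other name, set containerFactory on the listener annotation.
    @Bean
    ConcurrentKafkaListenerContainerFactory<String, OrderEvent> kafkaListenerContainerFactory(
            ConsumerFactory<String, OrderEvent> consumerFactory) {
        var factory = new ConcurrentKafkaListenerContainerFactory<String, OrderEvent>();
        factory.setConsumerFactory(consumerFactory);
        // Offsets are committed only when the listener calls Acknowledgment.acknowledge().
        factory.getContainerProperties().setAckMode(ContainerProperties.AckMode.MANUAL_IMMEDIATE);
        return factory;
    }
}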
@KafkaListener(topics = "orders", groupId = "billing")
public void onOrder(OrderEvent event, Acknowledgment ack) {
    chargeCustomer(event); // do the real work first
    ack.acknowledge();     // only then commit the offset
}
Commit after the side effect is durable, never before. Auto-commit trades correctness for one less line of config — a bad trade for anything that touches money or state.
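To make the failure mode concrete, here is a minimal sketch that models one consumer reading one partition in memory. It is a toy simulation, not Kafka client code: the class `CommitOrderSimulation`, its `run` method, and the `crashAt` parameter are hypothetical names introduced only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of one consumer on one partition; not Kafka client code. */
final class CommitOrderSimulation {

    /** Which records were fully processed, and where the committed offset ended up. */
    record Outcome(List<String> processed, int committedOffset) {}

    /**
     * Consumes records starting at startOffset. If crashAt >= 0, the process
     * "crashes" while handling that offset, so that record's work never finishes.
     */
    static Outcome run(List<String> log, int startOffset, int crashAt, boolean commitBeforeWork) {
        List<String> processed = new ArrayList<>();
        int committed = startOffset;
        for (int offset = startOffset; offset < log.size(); offset++) {
            if (commitBeforeWork) {
                committed = offset + 1;                   // commit lands before the work is done
            }
            if (offset == crashAt) {
                return new Outcome(processed, committed); // crash in the middle of this record
            }
            processed.add(log.get(offset));               // stands in for the durable side effect
            if (!commitBeforeWork) {
                committed = offset + 1;                   // commit only after the work is done
            }
        }
        return new Outcome(processed, committed);
    }
}
```

With a log of `a, b, c` and a crash while handling `b`, the commit-first variant restarts at offset 2 and never processes `b`; the commit-after variant restarts at offset 1 and processes `b` and `c`.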
Too few or too many partitions
Partition count is the unit of parallelism: within a consumer group, each partition is read by at most one consumer, so consumers beyond the partition count sit idle. Create a topic with one partition and that group cannot scale past a single consumer thread, no matter how many instances you deploy. Go to the other extreme (tens of thousands of partitions per broker) and you pay in controller metadata, open file handles, and longer leader-election and rebalance times.
| Symptom | Likely cause | Fix |
|---|---|---|
| Consumers idle, lag growing | Too few partitions | Size partitions ≥ peak consumer instances |
| Slow rebalances, high metadata load | Too many partitions | Consolidate topics; aim for a few thousand per broker, not tens of thousands |
| Uneven consumer load | Skewed keys | See hot keys below |
A practical sizing rule: partitions = max(target_throughput / per_partition_throughput, peak_consumer_count), then round up for headroom. You can increase partitions later, but doing so changes key-to-partition mapping and breaks ordering guarantees for existing keys — so size with growth in mind from the start.
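The sizing rule can be sketched as a small helper. This is an illustration under stated assumptions: the class `PartitionSizing`, the method `partitionsFor`, and the `headroomFraction` parameter are hypothetical, and the throughput figures must come from your own measurements.

```java
/** Illustrative helper for the sizing rule; names and parameters are assumptions. */
final class PartitionSizing {

    /**
     * partitions = max(target / perPartition, peakConsumers), rounded up,
     * then multiplied by (1 + headroomFraction) and rounded up again.
     */
    static int partitionsFor(double targetMbPerSec, double perPartitionMbPerSec,
                             int peakConsumerCount, double headroomFraction) {
        if (targetMbPerSec < 0 || perPartitionMbPerSec <= 0
                || peakConsumerCount < 1 || headroomFraction < 0) {
            throw new IllegalArgumentException("inputs must be positive");
        }
        int forThroughput = (int) Math.ceil(targetMbPerSec / perPartitionMbPerSec);
        int base = Math.max(forThroughput, peakConsumerCount);
        return (int) Math.ceil(base * (1.0 + headroomFraction));
    }
}
```

For example, a target of 120 MB/s at a measured 10 MB/s per partition with 8 peak consumers and 25% headroom gives `partitionsFor(120, 10, 8, 0.25) == 15`: throughput needs 12 partitions, which exceeds the consumer count, and headroom rounds that up to 15.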
Hot keys and partition skew
Problem. For keyed records, the default partitioner picks the partition as hash(key) % partitions. If one key dominates (a single tenant, a field that is usually empty or set to a placeholder value, a “global” event), that key’s partition becomes a hot spot. One broker and one consumer thread saturate while the rest sit idle, and lag climbs only on the hot partition.
Fix. Choose a high-cardinality key, or salt the key when ordering per-entity is not required.
// Bad: every event for the busiest tenant lands on one partition
producer.send(new ProducerRecord<>("events", tenantId, payload));
// Better: spread a hot tenant across 8 salted keys when per-tenant order isn't needed
// (requires import java.util.concurrent.ThreadLocalRandom)
String saltedKey = tenantId + "-" + ThreadLocalRandom.current().nextInt(8);
producer.send(new ProducerRecord<>("events", saltedKey, payload));
Monitor per-partition byte rate and consumer lag; a flat distribution is healthy, a spike on one partition is skew.
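One simple way to turn "flat versus spike" into a number is the ratio of the busiest partition to the mean. The sketch below assumes you already collect a per-partition byte count (or lag) from your metrics system; the class `PartitionSkew` and the threshold parameter are hypothetical, not part of any Kafka API.

```java
import java.util.Arrays;

/** Illustrative skew check over per-partition measurements; names are assumptions. */
final class PartitionSkew {

    /** Ratio of the busiest partition to the mean; 1.0 means a perfectly flat distribution. */
    static double skewRatio(long[] bytesPerPartition) {
        if (bytesPerPartition.length == 0) {
            throw new IllegalArgumentException("need at least one partition");
        }
        long max = Arrays.stream(bytesPerPartition).max().getAsLong();
        double mean = Arrays.stream(bytesPerPartition).average().getAsDouble();
        return mean == 0 ? 1.0 : max / mean;
    }

    /** True when the busiest partition carries more than threshold times the mean. */
    static boolean isSkewed(long[] bytesPerPartition, double threshold) {
        return skewRatio(bytesPerPartition) > threshold;
    }
}
```

A distribution of `{700, 100, 100, 100}` has a mean of 250 and a ratio of 2.8, so it would trip a threshold of 2.0. The right threshold depends on your workload and is something to tune rather than a fixed rule.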
Oversized messages
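Problem. Pushing multi-megabyte payloads through Kafka inflates broker memory, replication traffic, and consumer fetch buffers. It also requires raising several size limits together: message.max.bytes on the broker (or max.message.bytes on the topic), max.request.size on the producer, and fetch sizes such as fetch.max.bytes on the consumer. If the producer or broker limit is too low, sends fail with RecordTooLargeException.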
Fix. Treat Kafka as an event log, not a file store. Keep records small (ideally < 1 MB) and use the claim-check pattern: write the blob to object storage and put only a reference on the topic.
public record DocumentUploaded(String documentId, String s3Uri, long sizeBytes) {}
// Producer stores the PDF in S3, then publishes the lightweight reference.
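The claim-check flow can be sketched end to end without external services. Everything here is an assumption for illustration: `BlobStore` is a hypothetical interface standing in for an object-store client, `InMemoryBlobStore` replaces S3 so the example runs locally, and the `mem://` URI scheme is made up.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative claim-check flow; BlobStore and the mem:// scheme are assumptions. */
final class ClaimCheckSketch {

    /** Hypothetical storage abstraction; in practice this would wrap an object-store client. */
    interface BlobStore {
        String put(String key, byte[] content); // returns the URI of the stored blob
        byte[] get(String uri);
    }

    /** In-memory stand-in so the sketch runs without S3. */
    static final class InMemoryBlobStore implements BlobStore {
        private final Map<String, byte[]> blobs = new HashMap<>();

        @Override
        public String put(String key, byte[] content) {
            String uri = "mem://documents/" + key;
            blobs.put(uri, content.clone());
            return uri;
        }

        @Override
        public byte[] get(String uri) {
            return blobs.get(uri);
        }
    }

    /** The small event that travels through Kafka instead of the payload. */
    record DocumentUploaded(String documentId, String s3Uri, long sizeBytes) {}

    /** Producer side: store the blob first, then build the lightweight reference. */
    static DocumentUploaded checkIn(BlobStore store, String documentId, byte[] content) {
        String uri = store.put(documentId, content);
        return new DocumentUploaded(documentId, uri, content.length);
    }

    /** Consumer side: follow the reference to fetch the payload. */
    static byte[] redeem(BlobStore store, DocumentUploaded event) {
        return store.get(event.s3Uri());
    }
}
```

The order matters: the blob is written before the event is published, so a consumer never receives a reference to something that does not exist yet.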
Ignoring consumer lag
Lag is the single best leading indicator of a sick consumer. Ignored, it grows until the oldest unread offsets fall off the retention window and you lose data outright.
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
--describe --group billing
Output:
GROUP    TOPIC   PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
billing  orders  0          84210           84212           2
billing  orders  1          61003           93887           32884
Alert on the lag trend, not a single absolute value: if LAG on partition 1 above keeps rising across successive runs, that consumer can’t keep up. Export the metric to Prometheus and page on sustained growth.
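A trend check needs several samples rather than one snapshot. The following is a minimal sketch, assuming you sample lag per partition at a fixed interval; the class `LagTrend` and its strict "every sample higher than the last" rule are illustrative choices, and a production alert would more likely use a rate over a time window.

```java
/** Illustrative lag helpers; the trend rule here is deliberately simple. */
final class LagTrend {

    /** Lag as reported by the tool above: LOG-END-OFFSET minus CURRENT-OFFSET. */
    static long lag(long currentOffset, long logEndOffset) {
        return Math.max(0, logEndOffset - currentOffset);
    }

    /** True when each sample is strictly larger than the one before it. */
    static boolean isSteadilyRising(long[] lagSamples) {
        if (lagSamples.length < 2) {
            return false; // a single snapshot says nothing about the trend
        }
        for (int i = 1; i < lagSamples.length; i++) {
            if (lagSamples[i] <= lagSamples[i - 1]) {
                return false;
            }
        }
        return true;
    }
}
```

With this rule, samples of `{1000, 5000, 32884}` count as rising, while `{40000, 35000, 32884}` do not, even though the absolute lag is higher: that consumer is catching up.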
No schema governance
Without a contract, the first producer that adds a field or changes a type breaks every downstream consumer at deploy time. Register schemas (Avro, Protobuf, or JSON Schema) and enforce a compatibility mode (BACKWARD is the common default) so the registry rejects breaking changes before they ship.
spring:
  kafka:
    producer:
      properties:
        schema.registry.url: http://schema-registry:8081
        auto.register.schemas: false # register via CI, not silently at runtime
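To show the kind of rule a BACKWARD check enforces, here is a heavily simplified sketch. It is not how a schema registry is implemented: the class `BackwardCompatibilityCheck` and its `Field` record are hypothetical, schemas are reduced to flat field maps, and any type change is rejected even though real formats such as Avro permit some type promotions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Toy BACKWARD check over flat field maps; real registries apply richer rules. */
final class BackwardCompatibilityCheck {

    /** A field reduced to its type name and whether it declares a default value. */
    record Field(String type, boolean hasDefault) {}

    /**
     * Returns the reasons the new schema could not read data written with the old one.
     * Removed fields are allowed; added fields need a default; types must match.
     */
    static List<String> violations(Map<String, Field> oldSchema, Map<String, Field> newSchema) {
        List<String> problems = new ArrayList<>();
        for (Map.Entry<String, Field> entry : newSchema.entrySet()) {
            String name = entry.getKey();
            Field current = entry.getValue();
            Field previous = oldSchema.get(name);
            if (previous == null) {
                if (!current.hasDefault()) {
                    problems.add("added field without default: " + name);
                }
            } else if (!previous.type().equals(current.type())) {
                problems.add("type changed: " + name);
            }
        }
        return problems;
    }
}
```

Run in CI against the currently registered schema, a non-empty result would fail the build before the change ships.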
Blocking the poll loop
Problem. A consumer must call poll() regularly. If the gap between calls exceeds max.poll.interval.ms (default 5 min), the consumer is treated as failed, leaves the group, and its partitions are reassigned. Doing slow work (a synchronous HTTP call, a long DB transaction) inside the listener stalls the loop past that limit, which causes duplicate processing and, if it keeps happening, a rebalance storm.
Fix. Keep per-record work fast, lower max.poll.records, and offload truly slow work. Spring’s container manages the poll loop for you; if you fan out to an executor, pause the partition so you don’t fetch faster than you can process.
max.poll.records=50
max.poll.interval.ms=300000
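A quick way to sanity-check these two settings is to compare the worst-case time for one batch against the poll interval. The helper below is a back-of-the-envelope sketch: the class `PollBudget`, its method names, and the `safetyFraction` parameter are hypothetical, and the per-record time is something you would measure yourself.

```java
/** Illustrative arithmetic for max.poll.records versus max.poll.interval.ms. */
final class PollBudget {

    /** Worst-case time to process one full batch, in milliseconds. */
    static long worstCaseBatchMillis(int maxPollRecords, long worstCasePerRecordMillis) {
        return Math.multiplyExact((long) maxPollRecords, worstCasePerRecordMillis);
    }

    /** Largest max.poll.records that keeps a batch within a fraction of the poll interval. */
    static int safeMaxPollRecords(long maxPollIntervalMillis, long worstCasePerRecordMillis,
                                  double safetyFraction) {
        if (maxPollIntervalMillis <= 0 || worstCasePerRecordMillis <= 0
                || safetyFraction <= 0 || safetyFraction > 1) {
            throw new IllegalArgumentException("inputs out of range");
        }
        long budgetMillis = (long) (maxPollIntervalMillis * safetyFraction);
        long records = Math.max(1, budgetMillis / worstCasePerRecordMillis);
        return (int) Math.min(Integer.MAX_VALUE, records);
    }
}
```

If a record can take up to 2 seconds, a batch of 500 records needs up to 1,000,000 ms, well past a 300,000 ms interval, while a batch of 50 needs at most 100,000 ms. Spending at most half the interval on work gives `safeMaxPollRecords(300000, 2000, 0.5) == 75`.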
Treating Kafka like a database
Kafka is an append-only log, not a queryable store. Anti-patterns include doing point lookups by scanning a topic, relying on infinite retention as your system of record without compaction, or expecting transactional reads across topics. If you need lookups, project the stream into a real store (a DB, or a Kafka Streams KTable/state store) and query that.
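Projecting a stream into a queryable store amounts to folding events into a keyed table. The sketch below shows the idea with a plain in-memory map; `LatestValueProjection` and its `Event` record are hypothetical names, and a real service would use a database or a Kafka Streams state store instead of a `HashMap`.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Illustrative projection: keep only the latest value per key for point lookups. */
final class LatestValueProjection {

    /** A keyed event; a null value acts as a delete marker (tombstone). */
    record Event(String key, String value) {}

    private final Map<String, String> table = new HashMap<>();

    /** Applies events in log order; later events overwrite earlier ones for the same key. */
    void apply(Event event) {
        if (event.value() == null) {
            table.remove(event.key());
        } else {
            table.put(event.key(), event.value());
        }
    }

    /** Point lookup against the projected table, not against the topic. */
    Optional<String> lookup(String key) {
        return Optional.ofNullable(table.get(key));
    }
}
```

After applying `(order-1, CREATED)`, `(order-1, PAID)`, and `(order-2, null)`, a lookup of `order-1` returns `PAID` and a lookup of `order-2` returns empty, with no topic scan involved.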
Best Practices
- Disable auto-commit and acknowledge only after work is durable.
- Size partitions for peak consumer count plus headroom; remember repartitioning breaks key ordering.
- Use high-cardinality keys and watch per-partition skew, not just totals.
- Keep messages small; use the claim-check pattern for large payloads.
- Alert on consumer lag trend and export it as a first-class metric.
- Enforce schema compatibility in CI so breaking changes never reach the broker.
- Keep listener work fast and move slow operations off the poll loop.