Log Segments & Storage
Kafka’s reputation for throughput comes down to one deceptively simple idea: every partition is an append-only log written sequentially to disk and served straight from the operating system’s page cache. Understanding how that log is physically laid out in segment files and index files is essential for capacity planning, debugging disk-full incidents, and reasoning about retention. This page walks through the on-disk anatomy of a partition and explains why this design lets Kafka push gigabytes per second on commodity hardware.
A partition is a directory of files
A topic is split into partitions, and each partition is the unit of storage. On every broker that hosts a replica, the partition maps to a single directory under one of the configured log.dirs paths. The directory is named <topic>-<partition>, and inside it Kafka does not keep one giant file — it keeps a sequence of bounded segments. Each segment is a set of files that share a base offset (the offset of the first message they contain, zero-padded to 20 digits).
/var/lib/kafka/data/
└── orders-0/                              # topic "orders", partition 0
    ├── 00000000000000000000.log           # records, base offset 0 (closed)
    ├── 00000000000000000000.index         # offset -> file position
    ├── 00000000000000000000.timeindex     # timestamp -> offset
    ├── 00000000000000174336.log           # active segment, base offset 174336
    ├── 00000000000000174336.index
    ├── 00000000000000174336.timeindex
    ├── leader-epoch-checkpoint
    └── partition.metadata
Each segment has three primary files. The .log file holds the actual record batches. The .index file is a sparse map from relative offset to the byte position of that record inside the .log. The .timeindex file maps timestamps to offsets so time-based lookups (and time-based retention) stay cheap.
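The naming rule described above (base offset, zero-padded to 20 digits, plus a suffix) is easy to reproduce when scripting around a log directory. The following is a minimal sketch; `SegmentNames` and its methods are hypothetical helpers written for this page, not part of Kafka's API.

```java
import java.util.Locale;

// Hypothetical helper, for illustration only.
final class SegmentNames {

    // Build a segment file name: base offset zero-padded to 20 digits + suffix.
    static String fileName(long baseOffset, String suffix) {
        return String.format(Locale.ROOT, "%020d%s", baseOffset, suffix);
    }

    // Recover the base offset from a name such as "00000000000000174336.log".
    static long baseOffset(String fileName) {
        int dot = fileName.indexOf('.');
        return Long.parseLong(fileName.substring(0, dot));
    }
}
```

For example, `SegmentNames.fileName(174336L, ".log")` returns `00000000000000174336.log`, matching the active segment in the listing above.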
Active vs. closed segments
At any moment exactly one segment per partition is the active segment: the one with the highest base offset. All new producer writes append to its .log file. Closed (older) segments are effectively immutable: Kafka never appends to them or modifies them in place once they roll (compaction writes replacement segments rather than editing old ones), which is what makes them safe to cache and serve with zero-copy.
A segment rolls — the active one is closed and a new one is created — when any of these limits is hit:
| Setting | Default | Triggers a roll when |
|---|---|---|
| log.segment.bytes (segment.bytes) | 1 GiB | the active .log reaches this size |
| log.roll.ms (segment.ms) | 7 days | the active segment has been open this long |
| log.index.size.max.bytes | 10 MiB | the offset index fills up |
The broker-level keys (log.*) set the cluster default; the unprefixed topic-level keys override per topic. Set them with the admin CLI:
kafka-topics.sh --bootstrap-server localhost:9092 \
--create --topic orders --partitions 6 --replication-factor 3 \
--config segment.bytes=536870912 \
--config segment.ms=86400000
Smaller segments roll more often, which lets retention and compaction reclaim disk sooner, but they also multiply open file handles and index files. Tune segment.bytes down only when you need fine-grained retention, not as a default.
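The roll conditions in the table can be read as a single "any limit hit" check. The sketch below models only that; `RollPolicy` is a hypothetical class, and the broker's real check has more detail than this simplified version.

```java
// Hypothetical model of the three roll limits; not Kafka's implementation.
final class RollPolicy {
    final long segmentBytes;   // segment.bytes
    final long segmentMs;      // segment.ms
    final long indexMaxBytes;  // log.index.size.max.bytes

    RollPolicy(long segmentBytes, long segmentMs, long indexMaxBytes) {
        this.segmentBytes = segmentBytes;
        this.segmentMs = segmentMs;
        this.indexMaxBytes = indexMaxBytes;
    }

    // Defaults from the table: 1 GiB, 7 days, 10 MiB.
    static RollPolicy defaults() {
        return new RollPolicy(1L << 30, 7L * 24 * 60 * 60 * 1000, 10L << 20);
    }

    // A roll is due as soon as any one limit is reached.
    boolean shouldRoll(long logBytes, long openForMs, long indexBytes) {
        return logBytes >= segmentBytes
            || openForMs >= segmentMs
            || indexBytes >= indexMaxBytes;
    }
}
```

With the topic settings from the CLI example (512 MiB, one day), a segment holding only a few kilobytes still rolls once it has been open for 86400000 ms, because the time limit alone is enough.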
How the index makes reads fast
Indexes are sparse: Kafka adds an entry only every log.index.interval.bytes (4 KiB by default), not for every record. To read offset N, the broker binary-searches the offset index for the closest entry at or below N, jumps to that byte position in the .log, then scans forward. A sparse index keeps the index tiny enough to memory-map entirely, so lookups touch RAM, not disk.
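The lookup described above amounts to a "greatest entry at or below N" binary search. Here is a minimal sketch of that idea; `SparseIndex` is a hypothetical class, and it uses absolute offsets in plain arrays for readability rather than the relative offsets and memory-mapped file the broker uses.

```java
// Hypothetical in-memory model of a sparse offset index.
final class SparseIndex {
    private final long[] offsets;    // indexed offsets, sorted ascending
    private final long[] positions;  // byte position in the .log for each entry

    SparseIndex(long[] offsets, long[] positions) {
        if (offsets.length != positions.length) {
            throw new IllegalArgumentException("offsets and positions must align");
        }
        this.offsets = offsets;
        this.positions = positions;
    }

    // Byte position of the closest entry at or below target, or 0 if there is
    // none. The caller would scan forward in the .log from this position.
    long lookup(long target) {
        int lo = 0, hi = offsets.length - 1, best = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (offsets[mid] <= target) {
                best = mid;      // candidate; look for a later one
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return best < 0 ? 0L : positions[best];
    }
}
```

Given entries for offsets 0 and 50 at positions 0 and 4187, a lookup for offset 75 returns 4187, and the reader scans forward from there until it reaches the batch containing 75.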
You can inspect a segment’s contents and its index with the dump tool:
kafka-dump-log.sh --files /var/lib/kafka/data/orders-0/00000000000000000000.log \
--print-data-log
Output:
Dumping 00000000000000000000.log
Starting offset: 0
baseOffset: 0 lastOffset: 49 count: 50 baseSequence: 0 ... position: 0 ...
baseOffset: 50 lastOffset: 113 count: 64 ... position: 4187 ...
| offset: 0 CreateTime: 1717200000123 keySize: 6 valueSize: 24 ...
The position field is exactly what the .index stores, letting a fetch seek directly into the file.
Why sequential writes and page cache win
Producers append; consumers read in order; segments are never updated in place. That turns nearly all disk activity into sequential I/O, where even spinning disks and especially SSDs reach their peak bandwidth — random I/O can be orders of magnitude slower.
Kafka also does not maintain its own in-process cache. It writes to the OS page cache and lets the kernel flush to disk asynchronously, while reads for recent data are served from that same page cache without copying into the JVM heap. Combined with zero-copy (sendfile), the kernel can move bytes from page cache straight to the network socket. A typical Spring Boot consumer below never sees this machinery — it just reads, fast:
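import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

@Component
public class OrderConsumer {

    @KafkaListener(topics = "orders", groupId = "billing")
    public void handle(ConsumerRecord<String, OrderEvent> record) {
        // Served from the broker's page cache when caught up;
        // from disk segments when replaying history.
        process(record.value());
    }

    // Payload type; assumes a matching value deserializer is configured.
    public record OrderEvent(String orderId, long amountCents) {}

    private void process(OrderEvent event) {
        // business logic
    }
}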
Because Kafka relies on the page cache, give the broker host generous OS memory and a modest JVM heap (often 6 GB is plenty). Over-sizing the heap starves the page cache and hurts throughput.
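On the JVM, the zero-copy path mentioned above is exposed through `FileChannel.transferTo`, which can hand a file range to a target channel without staging it in a heap buffer. The sketch below shows the shape of such a call; `ZeroCopySend` is a hypothetical helper, not broker code, and whether the kernel actually uses `sendfile` depends on the platform and on the target being a socket channel.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, for illustration only.
final class ZeroCopySend {

    // Transfer `count` bytes starting at `position` from a segment file to
    // `target`. Returns the number of bytes actually transferred.
    static long send(Path segment, long position, long count,
                     WritableByteChannel target) throws IOException {
        try (FileChannel in = FileChannel.open(segment, StandardOpenOption.READ)) {
            long sent = 0;
            while (sent < count) {
                // transferTo may move fewer bytes than requested, so loop.
                long n = in.transferTo(position + sent, count - sent, target);
                if (n <= 0) {
                    break; // end of file, or target not accepting bytes
                }
                sent += n;
            }
            return sent;
        }
    }
}
```

The `position` argument plays the same role as the byte position returned by an index lookup: the read starts at a known place in the file instead of scanning from the beginning.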
Best practices
- Place log.dirs on dedicated, fast disks (NVMe/SSD) and use multiple directories across separate drives to spread I/O; Kafka balances partitions across them.
- Keep the JVM heap small (around 6 GB) so the bulk of RAM stays available for the OS page cache that serves reads.
- Tune segment.bytes and segment.ms together: align them with how quickly you want retention and compaction to reclaim space.
- Avoid forcing synchronous flushes via log.flush.interval.messages / log.flush.interval.ms; rely on replication for durability and let the OS flush in the background.
- Monitor open file descriptors and disk usage per log.dir: too many tiny segments exhaust file handles long before disks fill.
- Never hand-edit or delete segment files while a broker is running; use retention, compaction, or the admin tools instead.
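To make the file-handle warning concrete, a back-of-the-envelope estimate helps. The sketch below counts segment files only, assuming three primary files per segment as described earlier; `SegmentEstimate` and the retained-data figures are hypothetical, and the result is a rough upper bound rather than an exact count of open descriptors.

```java
// Hypothetical estimator, for illustration only.
final class SegmentEstimate {

    // Ceiling division: segments needed to hold the retained bytes.
    static long segments(long retainedBytes, long segmentBytes) {
        return (retainedBytes + segmentBytes - 1) / segmentBytes;
    }

    // Three primary files per segment: .log, .index, .timeindex.
    static long files(long partitions, long retainedBytesPerPartition, long segmentBytes) {
        return partitions * segments(retainedBytesPerPartition, segmentBytes) * 3;
    }
}
```

Under these assumptions, 6 partitions each retaining 50 GiB come to 900 files at the default 1 GiB segment size, but 57,600 files if segment.bytes is lowered to 16 MiB.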