
Page Cache & Zero-Copy

Kafka routinely pushes millions of messages per second through commodity hardware, yet it runs on the JVM and persists everything to disk. The trick is that Kafka does almost no clever caching of its own. Instead it leans on four well-understood operating-system and hardware behaviours: the OS page cache, sequential disk access, zero-copy data transfer, and aggressive batching with compression. Understanding these is essential for capacity planning, because they dictate how you size memory, choose disks, and tune the JVM heap in production.

Reliance on the OS page cache

Most data systems build an in-process cache (a buffer pool) on top of the disk. Kafka deliberately does not. When a producer writes a record, it is appended to a log segment file and the bytes land in the operating system’s page cache — the kernel’s in-memory copy of recently touched file pages. Reads served to consumers usually come straight from that same page cache, never touching the physical disk.

This design has several consequences:

  • The JVM heap stays small. Kafka does not hold message payloads on the heap, so garbage-collection pressure is low even at high throughput.
  • Free RAM becomes the cache automatically. The kernel manages eviction, and a broker that has been running for a while will show most of its RAM as page cache, not application memory.
  • A broker restart does not cold-start the cache. Because the cache lives in the kernel, it survives a JVM restart (though not a machine reboot).

Tip: Give the JVM a modest heap (commonly 5-6 GB is plenty) and leave the rest of physical memory to the OS for page cache. Over-sizing the heap starves the page cache and can actually reduce throughput.
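The memory split in the tip can be turned into a rough estimate of how much recent traffic stays in RAM. The sketch below is only back-of-envelope arithmetic: the class name, the 64 GiB machine, the 2 GiB reserved for other processes, and the 100 MiB/s write rate are all illustrative assumptions, not recommendations.

```java
/** Back-of-envelope page-cache sizing; every figure here is an illustrative assumption. */
final class PageCacheBudget {

    static final long GIB = 1024L * 1024 * 1024;

    /** RAM left for the kernel page cache after the JVM heap and other processes. */
    static long pageCacheBytes(long totalRamBytes, long heapBytes, long otherBytes) {
        return Math.max(0, totalRamBytes - heapBytes - otherBytes);
    }

    /** Seconds of recently written data the cache can hold at a given write rate. */
    static double secondsOfWritesCached(long pageCacheBytes, long writeBytesPerSecond) {
        return (double) pageCacheBytes / writeBytesPerSecond;
    }

    public static void main(String[] args) {
        long cache = pageCacheBytes(64 * GIB, 6 * GIB, 2 * GIB);  // 56 GiB left for the kernel
        long writeRate = 100L * 1024 * 1024;                      // assumed 100 MiB/s written to this broker
        System.out.printf("page cache: %d GiB, holds about %.0f s of writes%n",
                cache / GIB, secondsOfWritesCached(cache, writeRate));
    }
}
```

Under these assumptions, a consumer that lags by less than the printed number of seconds would usually be served from memory. The write rate should include replica traffic landing on the broker, not just producer traffic.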

Sequential disk I/O

Kafka’s log is an append-only structure. Producers only ever add records to the end of the active segment, and consumers read forward from an offset. This turns nearly all disk activity into sequential reads and writes, which are dramatically faster than random access — often by two or three orders of magnitude on spinning disks, and still meaningfully faster on SSDs because of read-ahead and large contiguous transfers.

Because writes are sequential appends, Kafka can hand them to the page cache and let the kernel flush them to disk in large, efficient batches via its normal writeback machinery. Kafka does not call fsync on every write by default; durability comes primarily from replication across brokers rather than from forcing each write to platter.
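To make the append-then-flush-later pattern concrete, here is a minimal sketch of an append-only segment file built on java.nio. It is a toy: the class name and the length-prefixed record layout are assumptions for illustration and do not match Kafka's real on-disk format.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Toy append-only segment; the record layout is hypothetical, not Kafka's format. */
final class AppendOnlySegment implements AutoCloseable {

    private final FileChannel channel;

    AppendOnlySegment(Path file) throws IOException {
        // APPEND makes every write land at the current end of the file.
        this.channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND);
    }

    /** Appends one length-prefixed record and returns the byte position it starts at. */
    long append(byte[] payload) throws IOException {
        long position = channel.size();
        ByteBuffer buf = ByteBuffer.allocate(Integer.BYTES + payload.length);
        buf.putInt(payload.length).put(payload).flip();
        while (buf.hasRemaining()) {
            channel.write(buf);  // normally returns once the bytes are in the page cache
        }
        return position;
    }

    /** Forces dirty pages to the device; this is the expensive call skipped on each append. */
    void flush() throws IOException {
        channel.force(true);
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("segment", ".log");
        try (AppendOnlySegment segment = new AppendOnlySegment(file)) {
            long first = segment.append("order-1".getBytes(StandardCharsets.UTF_8));
            long second = segment.append("order-2".getBytes(StandardCharsets.UTF_8));
            System.out.println(first + " " + second);  // byte positions of the two records
            segment.flush();                           // a single flush covers both appends
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

The point of the design is that append never waits for the device. Whether and when to call flush is a separate decision, which is what lets the kernel coalesce many small appends into a few large sequential writes.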

Behaviour               | Random I/O               | Kafka sequential I/O
Access pattern          | Seeks all over the disk  | Append to tail / read forward
Throughput on HDD       | A few MB/s               | Hundreds of MB/s
Page-cache friendliness | Poor (scattered pages)   | Excellent (contiguous pages)
Durability model        | Often per-write fsync    | Replication + periodic flush

You can tune flush behaviour if you have stricter requirements, but doing so trades throughput for synchronous durability:

# Flush after N messages or after T milliseconds, whichever comes first.
# By default both thresholds are effectively "never", leaving flushing to the OS.
log.flush.interval.messages=10000
log.flush.interval.ms=1000
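How the two thresholds interact can be easier to see in code. The class below is a hypothetical toy model of a "flush after N messages or T milliseconds" rule, written for illustration only; it is not the broker's implementation, and the real broker checks the time threshold from a background scheduler rather than on every append.

```java
/** Toy model of "flush after N messages or T milliseconds"; hypothetical, not broker code. */
final class FlushPolicy {

    private final long intervalMessages;
    private final long intervalMs;
    private long unflushedMessages;
    private long lastFlushMs;

    FlushPolicy(long intervalMessages, long intervalMs, long nowMs) {
        this.intervalMessages = intervalMessages;
        this.intervalMs = intervalMs;
        this.lastFlushMs = nowMs;
    }

    /** Counts one appended message and reports whether a flush is now due. */
    boolean onAppend(long nowMs) {
        unflushedMessages++;
        return flushDue(nowMs);
    }

    /** True when either threshold is reached and there is something to flush. */
    boolean flushDue(long nowMs) {
        if (unflushedMessages == 0) {
            return false;
        }
        return unflushedMessages >= intervalMessages || nowMs - lastFlushMs >= intervalMs;
    }

    /** Call after the (simulated) fsync completes. */
    void markFlushed(long nowMs) {
        unflushedMessages = 0;
        lastFlushMs = nowMs;
    }

    public static void main(String[] args) {
        FlushPolicy policy = new FlushPolicy(10_000, 1_000, 0);  // thresholds from the config above
        int flushes = 0;
        // Simulate 25,000 appends arriving one per millisecond.
        for (long now = 1; now <= 25_000; now++) {
            if (policy.onAppend(now)) {
                policy.markFlushed(now);
                flushes++;
            }
        }
        System.out.println("flushes: " + flushes);  // prints "flushes: 25"
    }
}
```

At one message per millisecond the time threshold always fires first, so the message-count limit never triggers. Every one of those flushes is a synchronous wait on the device, which is where the throughput cost comes from.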

Zero-copy transfer with sendfile

The biggest single win for consumer reads is zero-copy. In a naive server, sending a file to a network socket copies data four times and crosses the user/kernel boundary repeatedly: disk to kernel page cache, page cache to a user-space buffer, user buffer back into a kernel socket buffer, and finally socket buffer to the NIC.

Kafka avoids the two user-space copies by using the sendfile system call (exposed in Java as FileChannel.transferTo). The kernel transfers bytes directly from the page cache to the socket without ever materialising them in the JVM, and on modern NICs with scatter-gather DMA the data can go straight from the page cache to the network card.

Traditional copy path:                    Zero-copy (sendfile) path:

disk -> page cache                         disk -> page cache
     -> application buffer (JVM)                -> socket / NIC  (DMA)
     -> socket buffer
     -> NIC
4 copies, 2 context-switch round trips     2 copies, no JVM involvement
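The zero-copy path can be sketched with FileChannel.transferTo, the Java call named above. This is a minimal illustration rather than Kafka's fetch code: the class and method names are hypothetical, the target is assumed to be a blocking channel, and the demo uses a loopback socket to stand in for a consumer connection. Whether the JVM actually uses sendfile depends on the operating system and the target channel type.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Sketch of serving a slice of a log file via transferTo; names are hypothetical. */
final class ZeroCopySend {

    /** Sends up to count bytes of the file, starting at position, and returns the bytes sent. */
    static long send(Path file, long position, long count, WritableByteChannel target)
            throws IOException {
        try (FileChannel source = FileChannel.open(file, StandardOpenOption.READ)) {
            long end = Math.min(position + count, source.size());
            long sent = 0;
            // transferTo may move fewer bytes than requested, so loop until the slice is done.
            while (position + sent < end) {
                long n = source.transferTo(position + sent, end - position - sent, target);
                if (n <= 0) {
                    break;  // assumes a blocking target; a non-blocking one would need retry logic
                }
                sent += n;
            }
            return sent;
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("segment", ".log");
        Files.write(file, "record-1|record-2|record-3".getBytes(StandardCharsets.UTF_8));
        try (ServerSocketChannel server = ServerSocketChannel.open()
                     .bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
             SocketChannel consumer = SocketChannel.open(server.getLocalAddress());
             SocketChannel broker = server.accept()) {
            long sent = send(file, 0, Files.size(file), broker);  // no byte[] of payload in this JVM
            broker.shutdownOutput();
            ByteBuffer received = ByteBuffer.allocate((int) sent);
            while (received.hasRemaining() && consumer.read(received) != -1) {
                // keep reading until the whole slice has arrived
            }
            System.out.println(new String(received.array(), 0, received.position(),
                    StandardCharsets.UTF_8));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

Note that the sending side never allocates a buffer for the payload. The only ByteBuffer in the example belongs to the receiving end, which in a real deployment would be a separate consumer process.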

There is an important caveat: zero-copy only works when the broker can hand the on-disk bytes to the socket unchanged. If the broker has to encrypt or convert the data (for example when TLS is terminated at the broker, or when an older consumer forces a message-format down-conversion), it cannot use sendfile and the data must pass through user space.

Warning: Enabling broker-to-consumer TLS disables sendfile zero-copy, since the bytes must be encrypted in user space. Budget extra CPU for encrypted clusters, and keep all clients on the current message format to avoid down-conversion.

Batching and compression

Kafka amortises per-message overhead by grouping records into batches on the producer before they are sent, stored, and replicated. A batch is compressed once and travels through the entire pipeline (producer, broker disk, replica, and consumer) in its compressed form, provided the broker and topic compression.type is left at its default of producer so the broker keeps whatever codec the producer chose. This keeps compression CPU off the broker and preserves zero-copy, because the broker never decompresses to serve a fetch.
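A quick experiment shows why compressing a whole batch beats compressing each record on its own. The sketch below uses gzip from the Java standard library purely because it needs no extra dependency; the class name and the sample JSON records are made up for illustration, and exact sizes will vary with the data and codec.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPOutputStream;

/** Compares per-record and per-batch gzip sizes; sample data is made up. */
final class BatchCompressionDemo {

    /** Returns the gzip-compressed size of the given bytes. */
    static int gzipSize(byte[] data) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(data);
        }
        return out.size();
    }

    /** Total size when every record is compressed separately. */
    static int perRecordSize(List<byte[]> records) throws IOException {
        int total = 0;
        for (byte[] record : records) {
            total += gzipSize(record);
        }
        return total;
    }

    /** Size when all records are concatenated and compressed once, as a batch. */
    static int batchSize(List<byte[]> records) throws IOException {
        ByteArrayOutputStream joined = new ByteArrayOutputStream();
        for (byte[] record : records) {
            joined.write(record);
        }
        return gzipSize(joined.toByteArray());
    }

    public static void main(String[] args) throws IOException {
        List<byte[]> records = new ArrayList<>();
        for (int i = 0; i < 500; i++) {
            String json = "{\"orderId\":" + i + ",\"status\":\"CREATED\",\"currency\":\"EUR\"}";
            records.add(json.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("compressed one by one: " + perRecordSize(records) + " bytes");
        System.out.println("compressed as a batch: " + batchSize(records) + " bytes");
    }
}
```

Records in the same batch tend to share field names and values, so the compressor finds far more redundancy across a batch than inside any single small record, and the fixed per-stream overhead is paid once instead of once per record.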

Tune batching on the producer to trade a little latency for much higher throughput:

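import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class BatchingProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // Batch up to 64 KB, and wait up to 10 ms to let a batch fill before sending.
        props.put("batch.size", 64 * 1024);
        props.put("linger.ms", 10);
        props.put("compression.type", "lz4");

        // send() is asynchronous; closing the producer flushes any batch still buffered.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", "k1", "v1"));
        }
    }
}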

The equivalent in a Spring Boot 3.x application is set through application.yaml:

spring:
  kafka:
    producer:
      batch-size: 65536
      properties:
        linger.ms: 10
      compression-type: lz4

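You can check how much memory the kernel has given to the page cache on a busy broker with free; combined with the broker's disk-read rate (for example from iostat), this shows how little fetch traffic actually hits disk: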

free -h

Output:

               total        used        free      shared  buff/cache   available
Mem:            62Gi       6.1Gi       1.2Gi        12Mi        55Gi        55Gi

The buff/cache column, here 55 GiB, is mostly page cache holding hot log segments, which is what lets consumer fetches return without disk reads.
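For automated monitoring, the same figure can be read from /proc/meminfo, which is where free gets its numbers on Linux. The sketch below is one possible way to do that; the class and method names are hypothetical, it is Linux-only, and it reads just the Cached field, whereas free also folds Buffers and reclaimable slab memory into buff/cache.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

/** Reads page-cache usage from /proc/meminfo (Linux only); names are hypothetical. */
final class PageCacheUsage {

    /** Returns the value in kB of one /proc/meminfo field such as "Cached", or -1 if absent. */
    static long fieldKb(List<String> meminfoLines, String field) {
        for (String line : meminfoLines) {
            if (line.startsWith(field + ":")) {
                // Lines look like "Cached:         57671680 kB".
                String[] parts = line.substring(field.length() + 1).trim().split("\\s+");
                return Long.parseLong(parts[0]);
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get("/proc/meminfo"));
        long totalKb = fieldKb(lines, "MemTotal");
        long cachedKb = fieldKb(lines, "Cached");
        System.out.printf("page cache: %d MiB of %d MiB (%.0f%%)%n",
                cachedKb / 1024, totalKb / 1024, 100.0 * cachedKb / totalKb);
    }
}
```

A value that stays high while disk reads stay low suggests the working set fits in memory. On its own, a large cache figure does not prove that fetches are being served from it.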

Best Practices

  • Keep the JVM heap small (around 5-6 GB) and leave the vast majority of RAM free for the OS page cache.
  • Use fast, sequential-friendly storage; giving each of several disks its own log directory spreads I/O and increases aggregate sequential throughput.
  • Rely on replication for durability rather than aggressive per-write fsync, unless a specific workload demands synchronous flushing.
  • Keep producers and consumers on the current message format to preserve zero-copy and avoid broker-side down-conversion.
  • Account for the CPU cost of TLS, since broker-side encryption disables sendfile zero-copy.
  • Enable producer batching and a lightweight codec like lz4 or zstd so compression happens once and survives end to end.
  • Monitor buff/cache and broker disk-read rates; rising physical reads signal that the working set no longer fits in the page cache.
Last updated June 1, 2026