Broker & OS Tuning

Kafka’s throughput is determined as much by the operating system underneath the broker as by Kafka’s own settings. The broker is mostly a thin layer over the kernel page cache and the disk, so the biggest wins come from giving the OS room to cache log segments, sizing thread pools correctly, and getting out of the way of the storage subsystem. This page covers the broker and OS knobs that matter in production, in roughly the order they bite you.

Network and I/O thread pools

Every request passes through two thread pools. The network threads (num.network.threads) read requests from sockets and write responses back to them; they hand each request to the I/O threads (num.io.threads), which do the actual work against the log (fetches, appends, flushes). When either pool saturates, request queue time climbs, which shows up as latency even when CPU and disk look idle.

# server.properties
num.network.threads=8
num.io.threads=16
# Bounded queue between network and I/O threads
queued.max.requests=500

A good starting point is num.network.threads ≈ number of cores and num.io.threads ≈ number of cores or the number of data directories, whichever is larger. Watch the request handler idle ratio and bump the pool that is starved.
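The starting-point rule above can be written down as a small helper. This is only a sketch of that rule of thumb; the function name and the example core and directory counts are hypothetical, and the result is a first guess to refine from the idle-ratio metrics, not a recommendation from Kafka itself.

```python
def starting_thread_pools(cores: int, data_dirs: int) -> dict[str, int]:
    """Rule of thumb from the text: network ~= cores, io ~= max(cores, data dirs)."""
    if cores < 1 or data_dirs < 1:
        raise ValueError("cores and data_dirs must be positive")
    return {
        "num.network.threads": cores,
        "num.io.threads": max(cores, data_dirs),
    }


# Hypothetical broker: 8 cores, 12 data directories.
for key, value in starting_thread_pools(cores=8, data_dirs=12).items():
    print(f"{key}={value}")
```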

Monitor RequestHandlerAvgIdlePercent and NetworkProcessorAvgIdlePercent. Values approaching 0 mean the corresponding pool is the bottleneck; values near 1.0 mean you have headroom and can leave the setting alone.

Don’t fsync — rely on replication

The single most common broker misconfiguration is forcing Kafka to flush every write to disk. Kafka’s durability model is replication, not fsync: a message acknowledged with acks=all is safe because it lives in the page cache of multiple brokers, not because it is on one broker’s platter. Let the OS flush pages lazily and you get the page cache working as a giant write buffer.

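# Leave flushing to the OS. These are the defaults; do NOT lower them.
log.flush.interval.messages=9223372036854775807
# log.flush.interval.ms has no default value (null). Leave it unset:
# the literal string "null" is not a valid number and will not parse.
#log.flush.interval.ms=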

Pair this with a replication factor of 3, min.insync.replicas=2, and acks=all on producers so the cluster survives a single broker loss without data loss, even though no single broker fsyncs on the hot path.
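To make the failure arithmetic behind those numbers explicit, here is a minimal sketch. It assumes producers use acks=all and that unclean leader election stays disabled; the function and key names are illustrative, not part of any Kafka API.

```python
def durability_margins(replication_factor: int, min_insync_replicas: int) -> dict[str, int]:
    """Failure margins for acks=all writes, assuming clean leader election only."""
    if not 1 <= min_insync_replicas <= replication_factor:
        raise ValueError("need 1 <= min.insync.replicas <= replication factor")
    return {
        # An acks=all write is acknowledged only once at least
        # min.insync.replicas replicas hold it.
        "min_copies_of_acked_write": min_insync_replicas,
        # Brokers that can be lost at once while one copy still survives.
        "failures_without_data_loss": min_insync_replicas - 1,
        # Replicas that can be down while acks=all producers can still write.
        "failures_without_write_outage": replication_factor - min_insync_replicas,
    }


print(durability_margins(replication_factor=3, min_insync_replicas=2))
```

With RF=3 and min.insync.replicas=2, both margins come out to one broker, which is why this pairing is the usual baseline.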

Socket buffers

Larger socket buffers let the broker keep more in-flight data, which matters on high-bandwidth or high-latency links (cross-AZ replication, distant consumers).

socket.send.buffer.bytes=1048576
socket.receive.buffer.bytes=1048576
socket.request.max.bytes=104857600

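For long fat networks, set socket.send.buffer.bytes and socket.receive.buffer.bytes to -1 so the broker uses the OS defaults and the kernel can auto-tune each connection, then raise the kernel limits to match: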

sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
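How large a buffer a link needs follows from its bandwidth-delay product: the number of bytes that can be in flight before the first acknowledgement returns. The sketch below computes it for two hypothetical links; the link speeds, round-trip times, and labels are assumptions chosen for illustration.

```python
def bdp_bytes(bandwidth_bits_per_sec: int, rtt_ms: int) -> int:
    """Bandwidth-delay product in bytes: bits/s * seconds / 8."""
    return bandwidth_bits_per_sec * rtt_ms // (8 * 1000)


KERNEL_MAX = 16_777_216  # the 16 MiB ceiling set via rmem_max / wmem_max

# Hypothetical links: (label, bandwidth in bits/s, round-trip time in ms).
links = [("nearby", 10_000_000_000, 1), ("distant", 10_000_000_000, 10)]
for label, bps, rtt in links:
    need = bdp_bytes(bps, rtt)
    verdict = "fits under" if need <= KERNEL_MAX else "exceeds"
    print(f"{label}: {need} bytes in flight, {verdict} the kernel maximum")
```

If the product exceeds the configured maximum, a single connection cannot fill the link no matter how the broker is tuned.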

File descriptors and ulimits

Every log segment of every partition replica has its own set of files (log, offset index, time index), and each client or replica connection is a descriptor. A busy broker can easily need hundreds of thousands. The default soft limit of 1024 will crash a broker with “Too many open files” under load.

# /etc/security/limits.d/kafka.conf
kafka  soft  nofile  300000
kafka  hard  nofile  300000

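Verify that the running broker actually picked it up. Files under limits.d apply only to PAM login sessions; a broker started by systemd takes its limit from LimitNOFILE= in the unit file instead: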

grep "open files" /proc/$(pgrep -f kafka.Kafka | head -n 1)/limits

Output:

Max open files            300000               300000               files
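A rough way to sanity-check the limit is to estimate descriptor demand from partition and connection counts. The sketch below deliberately counts three files per segment as an upper bound; all input numbers are hypothetical, and actual usage on a given broker may be lower, so treat the result as a ceiling rather than a measurement.

```python
def estimate_open_files(partition_replicas: int, segments_per_replica: int,
                        connections: int, files_per_segment: int = 3) -> int:
    """Upper-bound estimate: every segment file plus every connection."""
    return partition_replicas * segments_per_replica * files_per_segment + connections


NOFILE_LIMIT = 300_000  # the value configured in limits.d above

# Hypothetical broker: 2000 replicas, 40 segments each, 10000 connections.
needed = estimate_open_files(2000, 40, 10_000)
print(f"estimated descriptors: {needed}, headroom: {NOFILE_LIMIT - needed}")
```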

Disk and filesystem layout

Use XFS on local NVMe or SSD. XFS is the filesystem the Kafka documentation recommends: it handles large sequential writes and many open files well with its default settings, whereas ext4 works but usually needs more tuning to match it. Mount with noatime so reads don’t generate metadata writes. Never run Kafka on network-attached storage such as NFS.

# /etc/fstab
/dev/nvme1n1  /var/lib/kafka/disk1  xfs  defaults,noatime  0 0
/dev/nvme2n1  /var/lib/kafka/disk2  xfs  defaults,noatime  0 0

Spread the load across multiple physical disks with multiple log directories rather than RAID-0. Kafka places each new partition in the directory that currently holds the fewest partitions (it balances by count, not by size), and a single disk failure takes out only the partitions on that disk.

log.dirs=/var/lib/kafka/disk1,/var/lib/kafka/disk2

Also tune dirty-page writeback so the kernel starts flushing in the background early and rarely has to stall producers with synchronous flushes, and keep swappiness low so broker memory stays resident:

sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=60
sysctl -w vm.swappiness=1
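The two dirty ratios are percentages, so it helps to see what they mean in bytes. The sketch below approximates the kernel's calculation by applying the percentages to a given amount of memory; the kernel actually uses available (free plus reclaimable) memory rather than total RAM, so the 64 GiB input is an assumption for illustration.

```python
def dirty_thresholds(available_bytes: int, background_ratio: int,
                     dirty_ratio: int) -> tuple[int, int]:
    """Return (background flush starts, writers are blocked) in bytes."""
    background = available_bytes * background_ratio // 100
    blocking = available_bytes * dirty_ratio // 100
    return background, blocking


GIB = 1024 ** 3
background, blocking = dirty_thresholds(64 * GIB, background_ratio=5, dirty_ratio=60)
print(f"background writeback starts at ~{background / GIB:.1f} GiB of dirty pages")
print(f"writers are throttled at ~{blocking / GIB:.1f} GiB of dirty pages")
```

The wide gap between the two thresholds is the point: background flushing starts early, and producers are only stalled if it falls very far behind.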

JVM heap and garbage collection

The broker needs a surprisingly small heap. Almost all of Kafka’s “memory” is the OS page cache, which lives outside the JVM. A large heap steals RAM from the page cache and lengthens GC pauses for no benefit. Set a modest fixed heap and leave the rest of the machine’s RAM to the kernel.

export KAFKA_HEAP_OPTS="-Xms6g -Xmx6g"
export KAFKA_JVM_PERFORMANCE_OPTS="-XX:+UseG1GC -XX:MaxGCPauseMillis=20 \
  -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent"

On a 64 GB machine, a 6 GB heap leaves roughly 55 GB for the page cache once the OS and JVM off-heap overhead are accounted for. G1GC with a 20 ms pause target helps keep stop-the-world events short enough to avoid replica fetch timeouts and spurious leader elections.
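The heap versus page cache budget is simple subtraction, sketched below. The 3 GB reserved for the OS, JVM off-heap memory, and other processes is an assumed figure chosen to match the rough 55 GB estimate; measure it on your own hosts.

```python
def page_cache_budget_gb(total_ram_gb: int, heap_gb: int, other_gb: int = 3) -> int:
    """RAM left for the kernel page cache after the heap and other overhead."""
    budget = total_ram_gb - heap_gb - other_gb
    if budget < 0:
        raise ValueError("heap and overhead exceed total RAM")
    return budget


# 64 GB host with the 6 GB heap from KAFKA_HEAP_OPTS above.
print(page_cache_budget_gb(total_ram_gb=64, heap_gb=6), "GB left for page cache")
```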

Setting	Typical value	Why
`-Xms` / `-Xmx`	5–8 GB (equal)	Small, fixed heap; rest goes to page cache
GC	G1GC	Predictable pauses on multi-GB heaps
`MaxGCPauseMillis`	20 ms	Avoid pauses that look like a dead broker
`vm.swappiness`	1	Keep broker memory from being swapped to disk

Always set -Xms equal to -Xmx. Letting the heap grow at runtime adds GC work and pauses while it resizes, and on a broker a long enough pause can drop the node out of the ISR.

Quick tuning checklist

[ ] num.network.threads ~= cores, num.io.threads ~= max(cores, data directories)
[ ] log.flush.* left at defaults (rely on replication, not fsync)
[ ] RF=3, min.insync.replicas=2, acks=all on producers
[ ] nofile ulimit raised to 100k+ and verified in /proc
[ ] XFS, noatime, multiple log.dirs on separate physical disks
[ ] vm.swappiness=1, dirty ratios tuned for background flush
[ ] JVM heap fixed at 5-8 GB, G1GC, rest of RAM left for page cache
[ ] RequestHandlerAvgIdlePercent and NetworkProcessorAvgIdlePercent monitored

Best Practices

Treat replication as your durability mechanism and never force per-message fsync; the page cache plus acks=all, a replication factor of 3, and min.insync.replicas=2 is both faster and safe.
Keep the JVM heap small and equal (-Xms == -Xmx) so the kernel page cache, not the heap, holds your hot log segments.
Raise the open-file limit before going to production and confirm it from /proc/<pid>/limits, not just limits.conf.
Use XFS with noatime on local NVMe/SSD and spread partitions across multiple log.dirs instead of RAID-0.
Size num.io.threads and num.network.threads from observed idle ratios rather than guessing, and re-check after major traffic growth.
Pin vm.swappiness=1 so the kernel swaps broker memory only as a last resort; swapping would turn microsecond cache hits into millisecond disk reads.