Broker & OS Tuning
Kafka’s throughput is determined as much by the operating system underneath the broker as by Kafka’s own settings. The broker is mostly a thin layer over the kernel page cache and the disk, so the biggest wins come from giving the OS room to cache log segments, sizing thread pools correctly, and getting out of the way of the storage subsystem. This page covers the broker and OS knobs that matter in production, in roughly the order they bite you.
Network and I/O thread pools
A broker has two thread pools that handle every request. The num.network.threads pool reads requests off client sockets and writes responses back; it hands each request to the num.io.threads pool (the request handlers), which does the actual work against the log (reads and appends). Saturating either pool causes request queue time to climb, which shows up as latency even when CPU and disk look idle.
# server.properties
num.network.threads=8
num.io.threads=16
# Bounded queue between network and I/O threads
queued.max.requests=500
A good starting point is num.network.threads ≈ number of cores and num.io.threads ≈ number of cores or the number of data directories, whichever is larger. Watch the request handler idle ratio and bump the pool that is starved.
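The starting-point rule above can be written down as a small helper. This is only a sketch: suggest_threads is a hypothetical function name, not a Kafka tool, and it assumes you pass in the core count and the number of data directories yourself.

```shell
#!/bin/sh
# Sketch: starting thread-pool sizes from the rule of thumb in the text.
# suggest_threads is a hypothetical helper, not part of Kafka.
suggest_threads() {
    cores=$1
    data_dirs=$2
    # num.io.threads: the larger of core count and data-directory count
    if [ "$data_dirs" -gt "$cores" ]; then
        io_threads=$data_dirs
    else
        io_threads=$cores
    fi
    echo "num.network.threads=$cores"
    echo "num.io.threads=$io_threads"
}

# Example: 8 cores, 12 data directories
suggest_threads 8 12
# prints:
# num.network.threads=8
# num.io.threads=12
```

On a real host the first argument could come from `nproc`. Treat the result as a first guess to be corrected by the idle-ratio metrics, not as a final setting.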
Monitor RequestHandlerAvgIdlePercent and NetworkProcessorAvgIdlePercent. Values approaching 0 mean the corresponding pool is the bottleneck; values near 1.0 mean you have headroom and can leave the setting alone.
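One way to turn an idle ratio into an action is a simple threshold check. The sketch below assumes you have already read the metric value (for example from your JMX exporter) and pass it in as a number between 0 and 1. The pool_status name and the 0.3 and 0.7 cut-offs are illustrative assumptions, not values defined by Kafka.

```shell
#!/bin/sh
# Sketch: classify an idle ratio (0.0 to 1.0) for either thread pool.
# pool_status is a hypothetical helper; the 0.3 / 0.7 cut-offs are assumptions.
pool_status() {
    LC_ALL=C awk -v idle="$1" 'BEGIN {
        v = idle + 0                      # force numeric comparison
        if (v < 0.3)      print "starved"
        else if (v < 0.7) print "watch"
        else              print "headroom"
    }'
}

pool_status 0.12   # prints: starved
pool_status 0.85   # prints: headroom
```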
Don’t fsync — rely on replication
The single most common broker misconfiguration is forcing Kafka to flush every write to disk. Kafka’s durability model is replication, not fsync: a message acknowledged with acks=all is safe because it lives in the page cache of multiple brokers, not because it is on one broker’s platter. Let the OS flush pages lazily and you get the page cache working as a giant write buffer.
# Leave flushing to the OS. These are the defaults; do NOT lower them.
log.flush.interval.messages=9223372036854775807
# log.flush.interval.ms is unset by default; leave it out rather than writing "null"
Pair this with a replication factor of 3 and min.insync.replicas=2 so the cluster survives a single broker loss without data loss, even though no single broker fsyncs on the hot path.
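The arithmetic behind that pairing is worth spelling out. In the sketch below, replication_margin is a hypothetical helper. It assumes producers use acks=all and that unclean leader election is disabled, so every acknowledged write exists on at least min.insync.replicas brokers.

```shell
#!/bin/sh
# Sketch: failure margins for a replication factor and min.insync.replicas.
# replication_margin is a hypothetical helper, not a Kafka tool.
replication_margin() {
    rf=$1
    min_isr=$2
    # Brokers that can be down while acks=all writes are still accepted
    echo "writes_survive=$(( rf - min_isr ))"
    # Replica losses an acknowledged write can survive (it has >= min_isr copies)
    echo "acked_data_survives=$(( min_isr - 1 ))"
}

replication_margin 3 2
# prints:
# writes_survive=1
# acked_data_survives=1
```

With RF=3 and min.insync.replicas=3 the first number drops to 0, so any single broker outage blocks acks=all producers. With min.insync.replicas=1 the second drops to 0, so an acknowledged write may exist on only one broker.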
Socket buffers
Larger socket buffers let the broker keep more in-flight data, which matters on high-bandwidth or high-latency links (cross-AZ replication, distant consumers).
socket.send.buffer.bytes=1048576
socket.receive.buffer.bytes=1048576
socket.request.max.bytes=104857600
For long fat networks, set socket.send.buffer.bytes and socket.receive.buffer.bytes to -1 so the broker uses the OS defaults and the kernel can auto-tune them, and raise the kernel limits to match:
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 65536 16777216"
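To judge whether a 16 MiB ceiling is enough for a given link, compare it with the bandwidth-delay product, which is the amount of data that must be in flight to keep the pipe full. The sketch below is a rough calculation: bdp_bytes is a hypothetical helper, and the link speed and round-trip time are example inputs, not measurements.

```shell
#!/bin/sh
# Sketch: bandwidth-delay product in bytes.
# bdp_bytes is a hypothetical helper; arguments are link speed in Mbit/s
# and round-trip time in milliseconds.
bdp_bytes() {
    mbit_per_s=$1
    rtt_ms=$2
    # (Mbit/s * 1e6 bits) * (ms / 1000) / 8 bits per byte
    echo $(( mbit_per_s * 1000 * rtt_ms / 8 ))
}

bdp_bytes 10000 10   # 10 Gbit/s at 10 ms RTT, prints: 12500000
bdp_bytes 1000 2     # 1 Gbit/s at 2 ms RTT, prints: 250000
```

A 10 Gbit/s link with a 10 ms round trip needs about 12.5 MB in flight, which fits under the 16777216-byte maximum set above. A link with a larger product would need a higher ceiling.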
File descriptors and ulimits
Each partition replica is several open files (log segment, index, time index), and each connection is a descriptor. A busy broker easily needs hundreds of thousands. The default soft limit of 1024 will crash a broker with “Too many open files” under load.
# /etc/security/limits.d/kafka.conf
kafka soft nofile 300000
kafka hard nofile 300000
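Verify the running broker actually picked it up. Files under limits.d are applied by PAM at login, so a broker started by systemd needs LimitNOFILE in its unit file instead:
grep "open files" "/proc/$(pgrep -f kafka.Kafka | head -n 1)/limits"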
Output:
Max open files 300000 300000 files
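A back-of-the-envelope estimate helps check whether a limit like 300000 is enough. The sketch below assumes three files per segment (log, index, time index, as described above) and takes the number of segments per replica as an input you estimate yourself. estimate_fds is a hypothetical helper, and the real count also includes other files the broker holds open.

```shell
#!/bin/sh
# Sketch: rough open-file estimate for one broker.
# estimate_fds is a hypothetical helper; 3 files per segment is an
# approximation (log + index + time index).
estimate_fds() {
    replicas=$1      # partition replicas hosted on this broker
    segments=$2      # assumed average segments per replica
    connections=$3   # client and inter-broker connections
    echo $(( replicas * segments * 3 + connections ))
}

estimate_fds 4000 20 5000   # prints: 245000
```

Here 4000 replicas with 20 segments each plus 5000 connections already approach the 300000 limit, so a broker of that size would warrant a higher ceiling or shorter retention.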
Disk and filesystem layout
Use XFS on local NVMe or SSD. XFS handles Kafka’s large sequential writes and many open files better than ext4, and its delayed allocation suits append-heavy workloads. Mount with noatime so reads don’t generate metadata writes, and never run Kafka on network-attached storage like NFS.
# /etc/fstab
/dev/nvme1n1 /var/lib/kafka/disk1 xfs defaults,noatime 0 0
/dev/nvme2n1 /var/lib/kafka/disk2 xfs defaults,noatime 0 0
Spread the load across multiple physical disks with multiple log directories rather than RAID-0; Kafka balances partitions across them and a single disk failure only takes out the partitions on that disk.
log.dirs=/var/lib/kafka/disk1,/var/lib/kafka/disk2
Also tune dirty-page writeback so the kernel starts flushing early in the background instead of stalling producers with synchronous flushes, and keep swappiness low so broker memory stays resident:
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=60
sysctl -w vm.swappiness=1
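To see what those two ratios mean in absolute terms, multiply them by the memory they apply to. The sketch below uses total RAM as a stand-in; the kernel actually applies the ratios to available (free plus reclaimable) memory, so the real thresholds are somewhat lower. dirty_threshold_mb is a hypothetical helper.

```shell
#!/bin/sh
# Sketch: approximate dirty-page thresholds in MB.
# dirty_threshold_mb is a hypothetical helper; using total RAM gives an
# upper bound, since the kernel applies the ratio to available memory.
dirty_threshold_mb() {
    mem_mb=$1
    ratio=$2
    echo $(( mem_mb * ratio / 100 ))
}

dirty_threshold_mb 65536 5    # background writeback starts near: 3276
dirty_threshold_mb 65536 60   # writers are throttled near: 39321
```

On a 64 GB machine this leaves a wide gap between the point where background flushing begins (about 3 GB of dirty pages) and the point where producers would be stalled (about 38 GB).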
JVM heap and garbage collection
The broker needs a surprisingly small heap. Almost all of Kafka’s “memory” is the OS page cache, which lives outside the JVM. A large heap steals RAM from the page cache and lengthens GC pauses for no benefit. Set a modest fixed heap and leave the rest of the machine’s RAM to the kernel.
export KAFKA_HEAP_OPTS="-Xms6g -Xmx6g"
export KAFKA_JVM_PERFORMANCE_OPTS="-XX:+UseG1GC -XX:MaxGCPauseMillis=20 \
-XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent"
On a 64 GB machine, a 6 GB heap leaves roughly 55 GB for the page cache once a few GB are set aside for the OS and JVM off-heap memory. G1GC with a 20 ms pause target aims to keep stop-the-world events short enough to avoid triggering replica fetch timeouts and spurious leader elections.
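The memory budget can be checked with simple subtraction. In the sketch below, page_cache_gb is a hypothetical helper and the 3 GB reserve for the OS and JVM off-heap use is an assumed figure; measure your own hosts before relying on it.

```shell
#!/bin/sh
# Sketch: RAM left for the page cache after heap and a fixed reserve.
# page_cache_gb is a hypothetical helper; the reserve is an assumption.
page_cache_gb() {
    ram_gb=$1
    heap_gb=$2
    reserve_gb=$3   # OS, JVM off-heap, other daemons
    echo $(( ram_gb - heap_gb - reserve_gb ))
}

page_cache_gb 64 6 3    # prints: 55
page_cache_gb 64 24 3   # prints: 37
```

The second call shows the cost of an oversized heap: raising it from 6 GB to 24 GB removes 18 GB of cache that would otherwise hold hot log segments.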
| Setting | Typical value | Why |
|---|---|---|
| -Xms / -Xmx | 5–8 GB (equal) | Small, fixed heap; rest goes to page cache |
| GC | G1GC | Predictable pauses on multi-GB heaps |
| MaxGCPauseMillis | 20 ms | Avoid pauses that look like a dead broker |
| vm.swappiness | 1 | Never swap broker memory to disk |
Always set -Xms equal to -Xmx. Letting the heap grow at runtime causes large GC pauses during expansion, and on a broker those pauses can drop the node out of the ISR.
Quick tuning checklist
[ ] num.network.threads ~= cores, num.io.threads >= max(cores, data directories)
[ ] log.flush.* left at defaults (rely on replication, not fsync)
[ ] RF=3, min.insync.replicas=2, acks=all on producers
[ ] nofile ulimit raised to 100k+ and verified in /proc
[ ] XFS, noatime, multiple log.dirs on separate physical disks
[ ] vm.swappiness=1, dirty ratios tuned for background flush
[ ] JVM heap fixed at 5-8 GB, G1GC, rest of RAM left for page cache
[ ] RequestHandlerAvgIdlePercent and NetworkProcessorAvgIdlePercent monitored
Best Practices
- Treat replication as your durability mechanism and never force per-message fsync; the page cache plus acks=all is both faster and safe.
- Keep the JVM heap small and equal (-Xms == -Xmx) so the kernel page cache, not the heap, holds your hot log segments.
- Raise the open-file limit before going to production and confirm it from /proc/<pid>/limits, not just limits.conf.
- Use XFS with noatime on local NVMe/SSD and spread partitions across multiple log.dirs instead of RAID-0.
- Size num.io.threads and num.network.threads from observed idle ratios rather than guessing, and re-check after major traffic growth.
- Pin vm.swappiness=1 so broker memory is never swapped, which would turn microsecond cache hits into millisecond disk reads.