Tiered Storage

For years the retention story in Kafka was bounded by one hard limit: every byte you wanted to keep had to live on a broker’s local disk. Keeping a month of data meant provisioning a month’s worth of expensive, replicated SSDs on every broker, even though most of those bytes were rarely read. Tiered storage, introduced as KIP-405 and generally available since Kafka 3.9, breaks that coupling. It keeps a small hot window of recent segments on local disk and continuously offloads older cold segments to cheap, elastic object storage such as Amazon S3 or Google Cloud Storage — giving you near-infinite retention without growing your broker fleet.

How tiered storage works

A Kafka partition is an append-only log split into immutable segments. With tiered storage enabled, a segment that has been fully written and closed becomes eligible to be copied to a remote tier. A background component on each broker uploads the closed segment (plus its index files) to the configured object store, then — once the upload is confirmed — the local copy can be deleted according to local retention. Active segments, and a configurable trailing window of recent closed ones, always stay local so that producers and the consumers tailing the log keep hitting fast local disk.

Reads are transparent to clients. When a consumer fetches an offset that still lives locally, the broker serves it from disk as usual. When the requested offset has aged into the remote tier, the broker fetches the relevant segment range from object storage and streams it back. The consumer’s code does not change at all; it simply experiences higher latency for the cold reads.

Two pluggable interfaces sit behind this: a RemoteStorageManager that knows how to read and write segments to a specific backend, and a RemoteLogMetadataManager that tracks which segments live remotely and at what offsets. Kafka ships a default topic-based metadata manager, and vendors or cloud providers supply storage managers for S3, GCS, Azure Blob, HDFS, and more.

Partition log (offsets increasing ->)

  [ remote segments (cold) ........ ][ local segments (hot) ][ active ]
        S3 / GCS object storage            broker SSD          writes

Local vs remote retention

The crucial mental shift is that retention now has two independent knobs. Local retention controls how long data stays on the broker disk before it is eligible for deletion (assuming it has already been tiered). Overall retention (retention.ms / retention.bytes) still governs the total lifetime of the data across both tiers.

Setting	Scope	Meaning
`retention.ms` / `retention.bytes`	Whole log (local + remote)	Total time/size data is kept before final deletion.
`local.retention.ms`	Local tier only	How long a segment stays on disk after being tiered (default: `retention.ms`).
`local.retention.bytes`	Local tier only	Local disk budget per partition before tiered segments are pruned.
`remote.storage.enable`	Per topic	Switches tiering on for the topic.

Set local.retention.ms to a small window (hours, not days) but keep retention.ms large. That is the whole point: tiny local footprint, huge effective retention. Leaving local.retention.ms at its default equals retention.ms means nothing is ever pruned locally and you gain no disk savings.

Enabling it on the broker

Tiered storage is opt-in at the cluster level, then enabled per topic. In server.properties (KRaft mode), turn on the feature and register a storage backend:

# Cluster-wide: enable remote storage and pick implementations
remote.log.storage.system.enable=true

# Metadata manager (Kafka's built-in topic-based default)
remote.log.metadata.manager.class.name=org.apache.kafka.server.log.remote.metadata.storage.TopicBasedRemoteLogMetadataManager

# Storage manager — supplied by your provider/plugin (example class name)
remote.log.storage.manager.class.name=com.example.kafka.s3.S3RemoteStorageManager
remote.log.storage.manager.class.path=/opt/kafka/plugins/s3/*

# Backend-specific config is namespaced under rsm.config.*
rsm.config.s3.bucket=my-kafka-tiered-bucket
rsm.config.s3.region=us-east-1

After a rolling restart, enable tiering on a topic and set its split retention:

kafka-configs.sh --bootstrap-server localhost:9092 \
  --alter --entity-type topics --entity-name orders \
  --add-config 'remote.storage.enable=true,local.retention.ms=3600000,retention.ms=2592000000'

This keeps roughly one hour of orders data on local disk while retaining thirty days in object storage. Confirm the topic picked up the settings:

kafka-configs.sh --bootstrap-server localhost:9092 \
  --describe --entity-type topics --entity-name orders

Output:

Dynamic configs for topic orders are:
  remote.storage.enable=true sensitive=false synonyms={DYNAMIC_TOPIC_CONFIG:remote.storage.enable=true}
  local.retention.ms=3600000 sensitive=false synonyms={DYNAMIC_TOPIC_CONFIG:local.retention.ms=3600000}
  retention.ms=2592000000 sensitive=false synonyms={DYNAMIC_TOPIC_CONFIG:retention.ms=2592000000}

Trade-offs and gotchas

Tiered storage is not free of cost; it trades one resource profile for another. The headline win is dramatically smaller, cheaper local disks and the ability to retain data for months or years for archival, replay, or compliance. The headline cost is read latency for cold data: a consumer that seeks far back into history triggers object-store fetches measured in tens to hundreds of milliseconds rather than sub-millisecond local reads.

A few operational realities to plan for:

Object storage is not infinitely fast. A consumer replaying the entire history of a topic will saturate on remote fetch throughput, and your cloud bill will reflect the GET requests and egress.
Compacted topics are not supported by tiered storage — only delete-retention topics can be tiered. Plan your topic design accordingly.
Replication still happens locally. Tiering does not reduce replica count or inter-broker replication traffic for the hot window; it only shrinks long-tail disk usage.
Recovery and rebalancing get faster. Because each broker holds far less local data, a failed broker re-syncs only the hot window, and partition reassignment moves far fewer bytes.

Tiered storage changes failure modes: if the object store is unavailable, cold reads fail while hot reads keep working. Treat the remote tier as a hard production dependency with its own monitoring and lifecycle policies.

Monitoring the remote tier

Brokers expose JMX metrics specific to tiering. Watch the copy and fetch paths to catch a backlog before it becomes consumer lag:

kafka.server:type=BrokerTopicMetrics,name=RemoteCopyBytesPerSec
kafka.server:type=BrokerTopicMetrics,name=RemoteFetchBytesPerSec
kafka.log.remote:type=RemoteLogManager,name=RemoteLogManagerTasksAvgIdlePercent
kafka.server:type=BrokerTopicMetrics,name=RemoteCopyLagBytes

A growing RemoteCopyLagBytes means segments are being produced faster than they are uploaded — your local disk will fill if the upload backlog never drains, so alert on it.

Best Practices

Keep local.retention.ms small (hours) and retention.ms large; that gap is where all the disk savings live.
Roll out tiered storage on a non-critical topic first and validate cold-read latency against your SLOs before enabling it broadly.
Treat the object store as a first-class production dependency: monitor availability, set bucket lifecycle/versioning policies, and lock down access with least-privilege IAM.
Alert on RemoteCopyLagBytes and remote task idle percentage so an upload backlog never silently fills local disk.
Remember that compacted topics cannot be tiered — design log-compaction topics separately from high-retention event logs.
Right-size your hot window to cover normal consumer catch-up and rebalance scenarios so routine reads never touch object storage.
Test broker recovery and partition reassignment after enabling tiering — both should be markedly faster with less local data to move.