Project: Real-Time Analytics Pipeline

Real-time analytics turns a firehose of raw events into actionable metrics within seconds, which is exactly what powers live dashboards, anomaly detection, and usage-based billing in production. In this project you will build a clickstream pipeline: a producer emits page-view events, a Kafka Streams application aggregates them into per-page counts over one-minute windows, materializes the result into a local state store, and exposes those counts through an interactive-query REST endpoint. The whole pipeline is stateful, fault-tolerant, and horizontally scalable without any external database.

Architecture overview

Events flow from a producer into a clicks topic. A Kafka Streams app keys records by page, groups them, and applies a tumbling time window. The aggregation is materialized into a queryable state store backed by a changelog topic, so state survives restarts and rebalances. A REST controller reads directly from that store using interactive queries.

[producer] --> clicks topic --> [Kafka Streams: groupBy + windowedBy + count]
                                        |
                                        v
                          materialized state store (RocksDB)
                                        |
                                        v
                              REST /api/stats (interactive query)

Project setup

Use Spring Boot 3.x with Spring for Apache Kafka and Java 17+. The relevant dependencies:

<dependency>
  <groupId>org.springframework.kafka</groupId>
  <artifactId>spring-kafka</artifactId>
</dependency>
<dependency>
  <groupId>org.apache.kafka</groupId>
  <artifactId>kafka-streams</artifactId>
</dependency>
<dependency>
  <groupId>org.springframework.boot</groupId>
  <artifactId>spring-boot-starter-web</artifactId>
</dependency>

Configure the Streams application with a stable application.id (it drives consumer group, changelog, and store naming) and a default key/value SerDe:

spring:
  kafka:
    bootstrap-servers: localhost:9092
    streams:
      application-id: clickstream-analytics
      properties:
        default.key.serde: org.apache.kafka.common.serialization.Serdes$StringSerde
        commit.interval.ms: 1000
        num.standby.replicas: 1

The event model

A click event carries the page path and an event timestamp. Records use a Java record for the DTO:

public record ClickEvent(String page, long timestamp) {}

Producing click events

A scheduled producer simulates traffic across a handful of pages. The key is the page so all events for a page land in the same partition, preserving per-key ordering:

@Component
public class ClickProducer {

    private static final String[] PAGES = {"/home", "/pricing", "/docs", "/blog"};
    private final KafkaTemplate<String, ClickEvent> template;
    private final Random random = new Random();

    public ClickProducer(KafkaTemplate<String, ClickEvent> template) {
        this.template = template;
    }

    @Scheduled(fixedRate = 200)
    public void emit() {
        String page = PAGES[random.nextInt(PAGES.length)];
        template.send("clicks", page, new ClickEvent(page, System.currentTimeMillis()));
    }
}

The Streams topology

The topology groups by key, applies a one-minute tumbling window with a grace period for late data, and counts into a named, materialized store. Naming the store (page-counts) is what makes it reachable via interactive queries.

@Configuration
@EnableKafkaStreams
public class AnalyticsTopology {

    public static final String STORE = "page-counts";

    @Bean
    public KStream<String, ClickEvent> buildPipeline(StreamsBuilder builder) {
        JsonSerde<ClickEvent> clickSerde = new JsonSerde<>(ClickEvent.class);

        KStream<String, ClickEvent> stream =
                builder.stream("clicks", Consumed.with(Serdes.String(), clickSerde));

        stream.groupByKey(Grouped.with(Serdes.String(), clickSerde))
              .windowedBy(TimeWindows.ofSizeAndGrace(
                      Duration.ofMinutes(1), Duration.ofSeconds(10)))
              .count(Materialized.<String, Long, WindowStore<Bytes, byte[]>>as(STORE)
                      .withKeySerde(Serdes.String())
                      .withValueSerde(Serdes.Long()));

        return stream;
    }
}

Tip: a tumbling window only emits its final count after the window closes, but the materialized store is continuously updated. Interactive queries read these intermediate values, so your REST endpoint shows live, in-flight counts rather than waiting for window close.

Serving results via interactive queries

StreamsBuilderFactoryBean exposes the running KafkaStreams instance. From it you obtain a read-only ReadOnlyWindowStore and fetch the counts for a given window range:

@RestController
@RequestMapping("/api/stats")
public class StatsController {

    private final StreamsBuilderFactoryBean factory;

    public StatsController(StreamsBuilderFactoryBean factory) {
        this.factory = factory;
    }

    @GetMapping("/{page}")
    public Map<Long, Long> counts(@PathVariable String page) {
        KafkaStreams streams = factory.getKafkaStreams();
        ReadOnlyWindowStore<String, Long> store = streams.store(
                StoreQueryParameters.fromNameAndType(
                        AnalyticsTopology.STORE, QueryableStoreTypes.windowStore()));

        Instant to = Instant.now();
        Instant from = to.minus(Duration.ofMinutes(10));
        Map<Long, Long> result = new TreeMap<>();
        try (WindowStoreIterator<Long> it = store.fetch(page, from, to)) {
            it.forEachRemaining(kv -> result.put(kv.key, kv.value));
        }
        return result;
    }
}

Run instructions

Start a KRaft-mode broker, create the input topic, then run the app:

kafka-topics.sh --create --topic clicks --partitions 3 \
  --bootstrap-server localhost:9092
mvn spring-boot:run

Once traffic is flowing, query a page’s per-minute counts:

curl -s http://localhost:8080/api/stats/%2Fpricing

Output:

{
  "1717243200000": 47,
  "1717243260000": 51,
  "1717243320000": 38
}

Each key is a window-start epoch millisecond and each value is the click count for /pricing during that minute.

Scaling and fault tolerance

Because state is partitioned by key, running multiple instances of the app splits both processing and store ownership across them. The count changelog topic restores state after a crash, and num.standby.replicas: 1 keeps warm copies to shorten failover. When a query targets a key owned by another instance, use KafkaStreams.queryMetadataForKey to discover the host and forward the request.

Best Practices

Always name your materialized stores explicitly; auto-generated names are unstable and break interactive queries.
Add a grace period to windows so legitimately late events are still counted instead of silently dropped.
Use event-time semantics with a TimestampExtractor when records carry their own timestamps, rather than relying on broker ingestion time.
Keep application.id immutable in production; changing it orphans all changelog and state, forcing a full rebuild.
Configure standby replicas to cut rebalance recovery time for stateful, query-heavy workloads.
Forward interactive queries across instances using queryMetadataForKey so any node can serve any key.
Monitor consumer lag and store restoration time; growing lag means your aggregation can’t keep up with ingest.