Saga Pattern
A business process like “place an order” often spans multiple services (order, payment, inventory), each owning its own database. You cannot wrap that in a single ACID transaction, and two-phase commit (2PC) scales poorly because it holds locks across services and blocks whenever one participant is slow or down. The saga pattern instead breaks the workflow into a sequence of local transactions. Each step publishes an event that triggers the next, and a failure is handled by running compensating transactions that semantically reverse the steps already committed.
Why not two-phase commit
2PC gives you atomicity by holding locks until every participant votes to commit. In a distributed, event-driven system that means cross-service locks, head-of-line blocking, and a coordinator that becomes a single point of failure. Sagas trade strict atomicity for eventual consistency: each local transaction commits immediately, and correctness is restored through compensation rather than rollback. There is a window where the system is partially complete — your APIs and UI must tolerate “pending” states.
Warning: A saga is not a rollback. Once a local transaction commits, it is durable. “Undoing” it means issuing a new compensating transaction (refund a payment, release reserved stock) — a semantic reversal, not a database rollback.
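To make the distinction concrete, here is a minimal sketch of a compensating transaction. It assumes a hypothetical append-only `PaymentLedger` (the class, the `LedgerEntry` record, and the entry type names are illustrative, not part of any library): a refund is recorded as a new entry that offsets the charge, and the original charge stays in the history.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

// Hypothetical append-only ledger entry; entries are never edited or deleted.
record LedgerEntry(String orderId, String type, BigDecimal amount) {}

class PaymentLedger {
    private final List<LedgerEntry> entries = new ArrayList<>();

    // Forward step: a local transaction that is durable once recorded.
    void charge(String orderId, BigDecimal amount) {
        entries.add(new LedgerEntry(orderId, "CHARGE", amount));
    }

    // Compensation: a new entry that offsets the outstanding balance.
    // Calling it again is a no-op because nothing is left to refund.
    void refund(String orderId) {
        BigDecimal outstanding = balance(orderId);
        if (outstanding.signum() > 0) {
            entries.add(new LedgerEntry(orderId, "REFUND", outstanding.negate()));
        }
    }

    // Net amount currently held for the order.
    BigDecimal balance(String orderId) {
        return entries.stream()
                .filter(e -> e.orderId().equals(orderId))
                .map(LedgerEntry::amount)
                .reduce(BigDecimal.ZERO, BigDecimal::add);
    }

    List<LedgerEntry> history() {
        return List.copyOf(entries);
    }
}
```

After a charge followed by a refund, the balance nets to zero but the history still holds both entries, which is what separates compensation from a rollback.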
Choreography vs orchestration
There are two ways to drive a saga.
Choreography is fully decentralized: each service listens for the previous step’s event, does its local work, and emits its own event. The workflow emerges from the chain of reactions, with no central brain. Orchestration introduces a coordinator (the saga orchestrator) that sends commands to each service and tracks the overall state of the transaction in one place.
| Aspect | Choreography | Orchestration |
|---|---|---|
| Control | Distributed across services | Centralized coordinator |
| Coupling | Lowest — services know only events | Higher — orchestrator knows the flow |
| Visibility | Implicit, harder to trace | Explicit, queryable state machine |
| Adding a step | Add a consumer and update the downstream services that react to it | Edit the orchestrator |
| Failure handling | Each service emits failure events | Orchestrator drives compensation |
| Best for | Short flows (3-4 steps) | Long, complex flows with many branches |
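For contrast with the choreographed flow, the orchestration column can be sketched as a small in-process coordinator. This is only an illustration under stated assumptions: `SagaStep`, `SimpleStep`, and `SagaOrchestrator` are hypothetical names, steps are plain method calls rather than commands sent over a broker, and state is kept in memory instead of being persisted.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical step abstraction: a forward action plus its compensation.
interface SagaStep {
    String name();
    boolean execute();   // returns false when the local transaction fails
    void compensate();   // semantic reversal of an already committed step
}

// Trivial step used for illustration; succeeds or fails as configured.
record SimpleStep(String name, boolean succeeds) implements SagaStep {
    public boolean execute() { return succeeds; }
    public void compensate() { /* e.g. release stock, refund payment */ }
}

class SagaOrchestrator {
    private final List<String> log = new ArrayList<>();

    // Runs steps in order. On the first failure, compensates the steps that
    // already completed, most recent first, and reports failure.
    boolean run(List<SagaStep> steps) {
        Deque<SagaStep> completed = new ArrayDeque<>();
        for (SagaStep step : steps) {
            if (step.execute()) {
                log.add("done:" + step.name());
                completed.push(step);
            } else {
                log.add("failed:" + step.name());
                while (!completed.isEmpty()) {
                    SagaStep undo = completed.pop();
                    undo.compensate();
                    log.add("compensated:" + undo.name());
                }
                return false;
            }
        }
        return true;
    }

    // The orchestrator's record of what happened, queryable in one place.
    List<String> log() {
        return List.copyOf(log);
    }
}
```

The flow and its compensation order live in one class, which is the visibility advantage the table describes; the cost is that the orchestrator must know every step.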
This page focuses on choreography over Kafka topics, which is the most natural fit for an event-driven system.
An order saga over Kafka
Consider placing an order: reserve inventory, charge payment, then confirm. If payment fails, the reserved inventory must be released. Each service publishes its outcomes to topics that it owns, one topic per event type (for example, inventory.reserved and inventory.released).
┌───────────┐  OrderCreated   ┌─────────────┐ InventoryReserved ┌────────────┐
│   Order   ├────────────────►│  Inventory  ├──────────────────►│  Payment   │
│  Service  │                 │   Service   │                   │  Service   │
└─────▲─────┘                 └──────▲──────┘                   └──────┬─────┘
      │                              │                                 │
      │ OrderConfirmed               │ InventoryReleased               │ PaymentCaptured
      │ / OrderCancelled             │ (compensation)                  │ / PaymentFailed
      └──────────────────────────────┴─────────────────────────────────┘
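On the happy path, events flow left to right and the order is confirmed. If payment fails, the inventory service consumes PaymentFailed, releases the reservation, and emits InventoryReleased; the order service then cancels the order.
Model every event as an immutable, past-tense record. Every topic is keyed by a shared sagaId (here, the order id), so within each topic all events for one saga land on the same partition and stay ordered. Kafka gives no ordering guarantee across topics, so consumers must not rely on the relative arrival order of events from different topics.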
import java.math.BigDecimal;

// Event payloads (one file per record). Every record carries the order id, which doubles as the saga id.
public record OrderCreated(String orderId, String sku, int quantity, BigDecimal amount) {}
public record InventoryReserved(String orderId, String sku, int quantity) {}
public record PaymentCaptured(String orderId, BigDecimal amount) {}
public record PaymentFailed(String orderId, String reason) {}
public record InventoryReleased(String orderId, String sku, int quantity) {}
The inventory service reserves stock and emits the next event — and also listens for the failure event to compensate:
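import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.support.KafkaHeaders;
import org.springframework.messaging.handler.annotation.Header;
import org.springframework.stereotype.Component;

@Component
public class InventorySaga {

    private final InventoryRepository inventory;
    private final KafkaTemplate<String, Object> kafka;

    public InventorySaga(InventoryRepository inventory, KafkaTemplate<String, Object> kafka) {
        this.inventory = inventory;
        this.kafka = kafka;
    }

    // Forward step: reserve stock under the order id so that it can later be
    // released by order, then announce the reservation keyed by the order id.
    @KafkaListener(topics = "order.created", groupId = "inventory-service")
    public void onOrderCreated(OrderCreated event) {
        inventory.reserve(event.orderId(), event.sku(), event.quantity());
        kafka.send("inventory.reserved", event.orderId(),
                new InventoryReserved(event.orderId(), event.sku(), event.quantity()));
    }

    // Compensating transaction: triggered when payment fails downstream
    @KafkaListener(topics = "payment.failed", groupId = "inventory-service")
    public void onPaymentFailed(PaymentFailed event, @Header(KafkaHeaders.RECEIVED_KEY) String orderId) {
        inventory.releaseByOrder(orderId);
        // sku and quantity are left empty here; consumers identify the release by order id.
        kafka.send("inventory.released", orderId, new InventoryReleased(orderId, null, 0));
    }
}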
The payment service decides the saga’s outcome:
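import java.math.BigDecimal;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Component;

@Component
public class PaymentSaga {

    private final PaymentGateway gateway;
    private final KafkaTemplate<String, Object> kafka;

    public PaymentSaga(PaymentGateway gateway, KafkaTemplate<String, Object> kafka) {
        this.gateway = gateway;
        this.kafka = kafka;
    }

    @KafkaListener(topics = "inventory.reserved", groupId = "payment-service")
    public void onInventoryReserved(InventoryReserved event) {
        String orderId = event.orderId();
        try {
            gateway.charge(orderId);
            // InventoryReserved carries no amount, so ZERO is only a placeholder;
            // a real service would publish the amount it actually captured.
            kafka.send("payment.captured", orderId, new PaymentCaptured(orderId, BigDecimal.ZERO));
        } catch (PaymentException ex) {
            // A declined charge becomes an event, which triggers compensation upstream.
            kafka.send("payment.failed", orderId, new PaymentFailed(orderId, ex.getMessage()));
        }
    }
}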
Finally, the order service closes the loop: it confirms the order on PaymentCaptured and cancels it on InventoryReleased.
Output (declined payment):
order.created orderId=ord-42 sku=SKU-9 qty=2 amount=59.90
inventory.reserved orderId=ord-42 sku=SKU-9 qty=2
payment.failed orderId=ord-42 reason=card_declined
inventory.released orderId=ord-42 <-- compensation ran
order.cancelled orderId=ord-42
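The order service's part of this trace could look roughly like the following. It is an in-memory sketch, not the service's actual code: `OrderStatus` and `OrderSagaHandlers` are hypothetical names, and plain method calls stand in for the Kafka listeners on payment.captured and inventory.released.

```java
import java.util.HashMap;
import java.util.Map;

enum OrderStatus { PENDING, CONFIRMED, CANCELLED }

// Hypothetical stand-in for the order service's saga handlers.
class OrderSagaHandlers {
    private final Map<String, OrderStatus> orders = new HashMap<>();

    // Called when the order is first created and order.created is published.
    void onOrderPlaced(String orderId) {
        orders.putIfAbsent(orderId, OrderStatus.PENDING);
    }

    // Happy path: payment.captured confirms a pending order.
    void onPaymentCaptured(String orderId) {
        transition(orderId, OrderStatus.CONFIRMED);
    }

    // Failure path: inventory.released means compensation has run, so cancel.
    void onInventoryReleased(String orderId) {
        transition(orderId, OrderStatus.CANCELLED);
    }

    // Only PENDING orders may move. Orders in a terminal state ignore late
    // or duplicate events, and unknown order ids are ignored entirely.
    private void transition(String orderId, OrderStatus target) {
        orders.computeIfPresent(orderId,
                (id, current) -> current == OrderStatus.PENDING ? target : current);
    }

    OrderStatus statusOf(String orderId) {
        return orders.get(orderId);
    }
}
```

Until one of the two outcome events arrives, the order stays PENDING, which is the intermediate state that callers of the order API need to tolerate.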
Handling failures and idempotency
With offsets committed after processing (the usual consumer setup), Kafka delivers at-least-once, so every saga step can be redelivered and may run twice. Each handler must be idempotent: guard reservations and charges by orderId so that a redelivery is a no-op. Track per-saga state (STARTED, INVENTORY_RESERVED, COMPLETED, COMPENSATING, CANCELLED) in a local table and use it to reject out-of-order or duplicate events. Route poison messages to a dead-letter topic so that one bad event never blocks the partition.
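One way to sketch the state-table guard is shown below. The transition map is an assumption about which moves are legal for this order saga, the class names (`SagaStateTable`, `IdempotentInventoryHandler`) are hypothetical, and an in-memory map stands in for the local table. In a real service the state update and the side effect would need to commit in the same local database transaction.

```java
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

enum SagaState { STARTED, INVENTORY_RESERVED, COMPLETED, COMPENSATING, CANCELLED }

class SagaStateTable {
    // Assumed legal transitions; terminal states allow none.
    private static final Map<SagaState, Set<SagaState>> ALLOWED = Map.of(
            SagaState.STARTED, EnumSet.of(SagaState.INVENTORY_RESERVED, SagaState.CANCELLED),
            SagaState.INVENTORY_RESERVED, EnumSet.of(SagaState.COMPLETED, SagaState.COMPENSATING),
            SagaState.COMPENSATING, EnumSet.of(SagaState.CANCELLED),
            SagaState.COMPLETED, EnumSet.noneOf(SagaState.class),
            SagaState.CANCELLED, EnumSet.noneOf(SagaState.class));

    private final Map<String, SagaState> states = new HashMap<>();

    // Registers a saga once; repeated calls leave the existing state alone.
    void start(String orderId) {
        states.putIfAbsent(orderId, SagaState.STARTED);
    }

    // Returns true only if the event moved the saga forward. Duplicate and
    // out-of-order events return false so the caller skips its side effects.
    boolean tryAdvance(String orderId, SagaState next) {
        SagaState current = states.get(orderId);
        if (current == null || !ALLOWED.get(current).contains(next)) {
            return false;
        }
        states.put(orderId, next);
        return true;
    }

    SagaState stateOf(String orderId) {
        return states.get(orderId);
    }
}

// Handler that performs its side effect at most once per order.
class IdempotentInventoryHandler {
    private final SagaStateTable table;
    private int reservations = 0;

    IdempotentInventoryHandler(SagaStateTable table) {
        this.table = table;
    }

    void onOrderCreated(String orderId) {
        table.start(orderId);
        if (table.tryAdvance(orderId, SagaState.INVENTORY_RESERVED)) {
            reservations++; // stand-in for the real stock reservation
        }
    }

    int reservations() {
        return reservations;
    }
}
```

Delivering the same OrderCreated event twice leaves exactly one reservation, because the second delivery finds the saga already in INVENTORY_RESERVED and the transition is rejected.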
Best Practices
- Key every saga topic by the saga id (the order id) so that, within each topic, all steps for one transaction stay ordered on a single partition.
- Make each step and each compensation idempotent, because at-least-once delivery means any event can be redelivered.
- Write the local transaction and its outgoing event atomically using the outbox pattern, not a dual write.
- Persist saga state explicitly so you can resume, audit, and reason about in-flight transactions.
- Design compensating actions for every step that has externally visible effects (refunds, stock release, email retractions).
- Prefer choreography for short flows; switch to an orchestrator once the flow has many branches or needs central visibility.
- Set timeouts: if a step’s event never arrives, trigger compensation rather than leaving the saga stuck forever.
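The timeout practice in the list above can be sketched as a periodic sweep over in-flight sagas. The class and method names are hypothetical, the tracking map is in memory for brevity, and the caller is assumed to supply the current time and to start compensation for whatever the sweep returns.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SagaTimeoutMonitor {
    private final Map<String, Instant> lastProgress = new HashMap<>();
    private final Duration timeout;

    SagaTimeoutMonitor(Duration timeout) {
        this.timeout = timeout;
    }

    // Call whenever a saga event is handled, to reset that saga's clock.
    void recordProgress(String sagaId, Instant at) {
        lastProgress.put(sagaId, at);
    }

    // Call when a saga reaches a terminal state (confirmed or cancelled).
    void finish(String sagaId) {
        lastProgress.remove(sagaId);
    }

    // Returns the sagas that made no progress within the timeout and stops
    // tracking them, so each stuck saga is reported once. Sorted for
    // deterministic output.
    List<String> sweep(Instant now) {
        List<String> expired = new ArrayList<>();
        lastProgress.forEach((sagaId, at) -> {
            if (Duration.between(at, now).compareTo(timeout) > 0) {
                expired.add(sagaId);
            }
        });
        expired.forEach(lastProgress::remove);
        Collections.sort(expired);
        return expired;
    }
}
```

A scheduler would call `sweep(Instant.now())` at a fixed interval and publish a failure or cancellation event for each returned saga id, so that the normal compensation path handles the stuck saga.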