Circuit Breaker & Resilience Patterns
In a distributed system, failure is not an exception — it is a steady-state condition you must design for. When one downstream service slows down or dies, naive callers pile up open connections, exhaust their thread or event-loop budget, and drag the whole mesh down with them. Resilience patterns — circuit breakers, retries with backoff, timeouts, bulkheads, and fallbacks — turn a single failing dependency into a contained, graceful degradation instead of a cascading outage.
Why cascading failures happen
Imagine an orders service that calls a payments service synchronously. If payments starts taking 30 seconds to respond, every request to orders now hangs for 30 seconds too. Incoming traffic keeps arriving, sockets stay open, and memory climbs: at, say, 100 requests per second, a 30-second hang means roughly 3,000 requests in flight at once. Soon orders is unresponsive, even though its own code is perfectly healthy. The failure has propagated upward.
The fix is to fail fast and fail in isolation. The patterns below each attack a different facet of that goal.
| Pattern | Problem it solves |
|---|---|
| Timeout | Caps how long you wait for any single call |
| Retry + backoff | Recovers from transient blips without stampeding |
| Circuit breaker | Stops calling a dependency that is clearly down |
| Bulkhead | Isolates resource pools so one failure can’t drain all |
| Fallback | Returns a degraded-but-useful response instead of an error |
Always set a timeout
The single most important defense is a hard timeout on every network call. Node’s built-in fetch (available globally since Node 18) accepts an AbortSignal, and AbortSignal.timeout() makes this a one-liner.
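async function getQuote(symbol) {
  // encodeURIComponent keeps unusual symbols from breaking the URL path
  const res = await fetch(`http://pricing/quote/${encodeURIComponent(symbol)}`, {
    signal: AbortSignal.timeout(2000), // abort after 2s
  });
  // fetch only rejects on network errors, so surface HTTP errors explicitly
  if (!res.ok) throw new Error(`pricing returned ${res.status}`);
  return res.json();
}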
When the deadline passes, the request is aborted and the promise rejects with a DOMException whose name is TimeoutError, so the caller stops waiting on a hung peer and the underlying connection can be released.
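Callers often need to tell a timeout apart from other failures, for example to count it separately in metrics or to decide whether a retry makes sense. The following is a minimal sketch, assuming the standard behavior that AbortSignal.timeout() aborts with an error named TimeoutError. The helper names isTimeout and callWithDeadline are hypothetical and not part of any library.

```javascript
// Hypothetical helper: true when the error came from AbortSignal.timeout().
function isTimeout(err) {
  return err != null && err.name === "TimeoutError";
}

// Hypothetical wrapper: runs fn(signal) under a deadline and tags the outcome
// instead of throwing, so callers can branch on `timedOut`.
async function callWithDeadline(fn, ms) {
  const signal = AbortSignal.timeout(ms);
  try {
    return { ok: true, value: await fn(signal) };
  } catch (err) {
    return { ok: false, timedOut: isTimeout(err), error: err };
  }
}
```

With fetch, this could be used as callWithDeadline((signal) => fetch(url, { signal }), 2000), where url stands in for whichever endpoint is being called.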
Retries with exponential backoff
Many failures are transient — a brief network hiccup, a pod being rescheduled. Retrying helps, but retrying immediately and in lockstep across many callers creates a thundering herd. Use exponential backoff plus jitter so retries spread out over time.
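async function withRetry(fn, { retries = 3, baseMs = 100 } = {}) {
  let attempt = 0;
  for (;;) {
    try {
      return await fn();
    } catch (err) {
      attempt += 1;
      // Give up once the retry budget is spent and surface the last error.
      if (attempt > retries) throw err;
      // Exponential backoff (100ms, 200ms, 400ms, ...) plus random jitter.
      const backoff = baseMs * 2 ** (attempt - 1);
      const jitter = Math.random() * baseMs;
      // Log before sleeping so the message lines up with the failure.
      console.warn(`retry ${attempt}/${retries} after ${err.message}`);
      await new Promise((r) => setTimeout(r, backoff + jitter));
    }
  }
}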
const quote = await withRetry(() => getQuote("ACME"));
Output:
retry 1/3 after pricing returned 503
retry 2/3 after pricing returned 503
Only retry idempotent operations. Replaying a non-idempotent POST /charge can double-bill a customer. Pair retries with an idempotency key when the operation has side effects.
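To illustrate the idempotency-key idea, here is a minimal sketch in which the key is chosen once, before the first attempt, so every retry carries the same value. The function chargeWithRetry, the injected send function, and the Idempotency-Key header name are all assumptions for illustration; the real header and semantics depend on the payment API in use.

```javascript
// Compact retry helper with exponential backoff and jitter.
async function retryWithBackoff(fn, { retries = 3, baseMs = 100 } = {}) {
  for (let attempt = 0; ; attempt += 1) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= retries) throw err; // retry budget spent
      const delay = baseMs * 2 ** attempt + Math.random() * baseMs;
      await new Promise((r) => setTimeout(r, delay));
    }
  }
}

// Hypothetical charge call. `send(payload, headers)` performs the HTTP request;
// the "Idempotency-Key" header name is an assumption about the payment API.
function chargeWithRetry(send, payload, idempotencyKey, retryOptions) {
  // The key is fixed before the first attempt, so every retry carries the
  // same value and the server can recognise a replay of the same charge.
  const headers = { "Idempotency-Key": idempotencyKey };
  return retryWithBackoff(() => send(payload, headers), retryOptions);
}
```

The key should identify the logical operation rather than the attempt. Generating it once per order (for example with crypto.randomUUID()) and storing it alongside the order is one common approach.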
Circuit breakers with opossum
A circuit breaker watches the error rate of a dependency. After too many failures it opens, short-circuiting further calls so they fail instantly instead of piling up. After a cooldown it moves to half-open and lets a trial request through; success closes the circuit, failure re-opens it. The mature, battle-tested library for Node is opossum.
npm install opossum
import CircuitBreaker from "opossum";

const options = {
  timeout: 2000, // call counts as a failure past 2s
  errorThresholdPercentage: 50, // open once 50% of calls fail
  resetTimeout: 10000, // try again 10s after opening
  volumeThreshold: 5, // need 5 calls before stats apply
};
const breaker = new CircuitBreaker(getQuote, options);

// Degraded response when the circuit is open or the call fails.
// opossum passes the fire() arguments to the fallback, so reuse the symbol.
breaker.fallback((symbol) => ({ symbol, price: null, stale: true }));

breaker.on("open", () => console.warn("circuit OPEN: pricing is down"));
breaker.on("halfOpen", () => console.info("circuit HALF-OPEN: probing"));
breaker.on("close", () => console.info("circuit CLOSED: pricing recovered"));

const result = await breaker.fire("ACME");
Output (logged over the course of an outage and recovery, not by a single call):
circuit OPEN: pricing is down
circuit HALF-OPEN: probing
circuit CLOSED: pricing recovered
The fire() call returns the fallback value the instant the circuit is open, so your latency stays flat even while the dependency is unavailable. opossum also exposes breaker.stats for wiring into Prometheus or your metrics pipeline.
Key opossum options
| Option | Meaning |
|---|---|
| timeout | Max ms for the action before it’s a failure |
| errorThresholdPercentage | Failure rate that trips the breaker open |
| resetTimeout | Ms to stay open before going half-open |
| volumeThreshold | Minimum requests before the breaker can trip |
| rollingCountTimeout | Width of the rolling stats window |
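To make the closed, open, and half-open transitions concrete, here is a minimal hand-rolled sketch of the state machine. It is not opossum's implementation: the class name TinyBreaker and its options are hypothetical, and as a simplification it trips on consecutive failures rather than on an error percentage over a rolling window.

```javascript
// Minimal circuit breaker sketch (NOT opossum's implementation).
// Simplification: trips on consecutive failures, not a rolling error rate.
class TinyBreaker {
  constructor(action, { failureThreshold = 3, resetTimeout = 10000, now = Date.now } = {}) {
    this.action = action;
    this.failureThreshold = failureThreshold;
    this.resetTimeout = resetTimeout;
    this.now = now; // injectable clock, handy for tests
    this.state = "closed";
    this.failures = 0;
    this.openedAt = 0;
  }

  async fire(...args) {
    // A trial call is already in flight: keep failing fast until it settles.
    if (this.state === "halfOpen") throw new Error("circuit open");
    if (this.state === "open") {
      // Still cooling down: fail instantly without touching the dependency.
      if (this.now() - this.openedAt < this.resetTimeout) {
        throw new Error("circuit open");
      }
      this.state = "halfOpen"; // cooldown elapsed: this call is the trial
    }
    try {
      const result = await this.action(...args);
      this.state = "closed"; // success closes the circuit
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures += 1;
      // A failed trial, or too many failures in a row, (re)opens the circuit.
      if (this.state === "halfOpen" || this.failures >= this.failureThreshold) {
        this.state = "open";
        this.openedAt = this.now();
      }
      throw err;
    }
  }
}
```

A production breaker adds what this sketch leaves out, such as rolling statistics, events, fallbacks, and per-call timeouts, which is the reason to reach for a library.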
Bulkheads: isolate resource pools
A bulkhead limits how much of a shared resource any one dependency can consume, so a slow service can’t starve the rest. In practice this means capping concurrency per dependency. A small semaphore does the job.
function bulkhead(maxConcurrent) {
  let active = 0;
  const queue = [];
  const next = () => {
    if (active >= maxConcurrent || queue.length === 0) return;
    active += 1;
    const { fn, resolve, reject } = queue.shift();
    Promise.resolve()
      .then(fn)
      .then(resolve, reject)
      .finally(() => {
        active -= 1;
        next();
      });
  };
  return (fn) =>
    new Promise((resolve, reject) => {
      queue.push({ fn, resolve, reject });
      next();
    });
}
const limit = bulkhead(10); // at most 10 in-flight pricing calls
const quote = await limit(() => getQuote("ACME"));
If pricing becomes slow, at most 10 requests wait on it; the eleventh queues without consuming a connection, and every other dependency keeps its own independent budget.
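The semaphore above lets the wait queue grow without limit, which can itself become a memory problem under sustained overload. One possible variant, sketched here under the assumption that shedding load is acceptable, rejects new work once both the slots and a bounded queue are full. The name boundedBulkhead and the maxQueued parameter are hypothetical.

```javascript
// Bulkhead with a bounded wait queue (hypothetical variant of the semaphore).
function boundedBulkhead(maxConcurrent, maxQueued) {
  let active = 0;
  const queue = [];

  const next = () => {
    if (active >= maxConcurrent || queue.length === 0) return;
    active += 1;
    const { fn, resolve, reject } = queue.shift();
    Promise.resolve()
      .then(fn)
      .then(resolve, reject)
      .finally(() => {
        active -= 1;
        next();
      });
  };

  return (fn) =>
    new Promise((resolve, reject) => {
      // Shed load: every slot is busy and the wait queue is already full.
      if (active >= maxConcurrent && queue.length >= maxQueued) {
        reject(new Error("bulkhead full"));
        return;
      }
      queue.push({ fn, resolve, reject });
      next();
    });
}
```

Rejecting immediately keeps with the fail-fast goal: a caller that cannot be served soon gets an error it can turn into a fallback, rather than waiting in a queue that may never drain.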
Composing the patterns
These patterns stack. A robust call typically wraps the raw request in a timeout, runs it through a bulkhead, retries transient errors, and sits behind a circuit breaker with a fallback. Because opossum’s timeout and fallback cover two layers already, you usually only add backoff retries inside the breaker’s action and a bulkhead around the whole thing. With retries inside the action, the breaker’s timeout spans every attempt plus the backoff delays, so size it to cover the whole retry budget rather than a single call.
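The layering described above can be sketched as a small wiring function. This is an illustration under stated assumptions, not a definitive implementation: composeResilient is a hypothetical name, and the retry helper, bulkhead, and breaker factory are injected so the sketch does not depend on a particular library.

```javascript
// Hypothetical helper that wires the layers in the order described above.
// `withRetry(fn)` retries fn, `limit(fn)` is a bulkhead, and
// `createBreaker(action)` returns an object with a fire(...args) method.
function composeResilient(call, { withRetry, limit, createBreaker }) {
  // Innermost: the raw call (which carries its own timeout), retried with backoff.
  const action = (...args) => withRetry(() => call(...args));
  // Middle: the breaker sees one failure only after the retries are exhausted.
  const breaker = createBreaker(action);
  // Outermost: the bulkhead caps how many calls enter the breaker at once.
  return (...args) => limit(() => breaker.fire(...args));
}
```

With the pieces from this article, createBreaker could be (action) => new CircuitBreaker(action, options), and the returned function would be called as await quoteFor("ACME").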
Best practices
- Put a hard timeout on every outbound call — an unbounded wait is the root of most cascading failures.
- Only retry idempotent operations, and always add jitter to backoff to avoid synchronized retry storms.
- Tune the circuit breaker per dependency; a payment gateway and a recommendation service have very different failure tolerances.
- Always supply a fallback so an open circuit degrades gracefully (cached data, a default, or a clear partial response) instead of 500-ing.
- Use bulkheads to isolate concurrency per dependency so one slow service cannot exhaust the sockets, memory, and connection pools that every other call shares.
- Emit breaker state changes and stats as metrics — an open circuit is an early-warning signal, not just an internal detail.
- Test failure paths explicitly with fault injection; resilience code that is never exercised rots silently.