Fault Tolerance & Resilience

In the cloud, hardware fails, networks drop packets, and instances die. You cannot stop these things from happening, so instead you design systems that keep working anyway. A fault-tolerant system continues to serve users even while a component is broken, ideally with no visible impact. This page walks through the building blocks: redundancy, health checks, smart retries, timeouts, circuit breakers, and graceful degradation.

Fault tolerant vs highly available

People use these terms as if they mean the same thing, but they do not. High availability (HA) means the system recovers quickly after a failure, with a small amount of downtime. Fault tolerance is stronger: the system continues through the failure with no interruption at all. Fault tolerance costs more because you usually run extra capacity that sits ready at all times.

Property	Highly available	Fault tolerant
User impact during failure	Brief blip / short downtime	None (transparent)
Recovery model	Detect, then fail over	Already running in parallel
Typical cost	Lower	Higher (idle redundant capacity)
Example	Auto Scaling replaces a dead instance in ~2 min	Two instances behind a load balancer, one dies, traffic flows to the other instantly

When to use which: Demand fault tolerance only for the parts that truly cannot blink, such as a payment-authorization path. For an internal reporting dashboard, plain HA is cheaper and good enough.

Redundancy: have more than one of everything

Redundancy means running multiple copies of a component so that the failure of one does not take down the whole system. On AWS the cleanest way to do this is to spread copies across Availability Zones (AZs: isolated locations within an AWS Region, each made up of one or more physically separate data centers).

A typical setup: an Application Load Balancer (ALB — a service that spreads incoming traffic across several targets) in front of an Auto Scaling group whose instances live in two or three AZs. If one AZ goes dark, the ALB stops sending traffic there and the survivors carry the load.
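A rough way to see why spreading copies across AZs helps: if each copy is up a fraction a of the time and failures are independent, all n copies are down together only (1 - a)^n of the time. The sketch below works that out. The 99.5% per-copy figure is a made-up illustration, not an AWS number, and real failures are never perfectly independent, so treat the result as an upper bound.

```python
def combined_availability(per_copy: float, copies: int) -> float:
    """Availability of `copies` redundant replicas, assuming independent failures."""
    if not 0.0 <= per_copy <= 1.0:
        raise ValueError("per_copy must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # The system is down only when every copy is down at the same time.
    return 1.0 - (1.0 - per_copy) ** copies


if __name__ == "__main__":
    # 0.995 is a hypothetical per-copy availability chosen for illustration.
    for n in (1, 2, 3):
        print(f"{n} copies: {combined_availability(0.995, n):.6%}")
```

Under these assumptions, going from one copy to two cuts the expected unavailability from 0.5% to 0.0025%, which is why the second AZ buys far more than the third.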

Console steps to make an Auto Scaling group span multiple AZs:

Open the EC2 console, choose Auto Scaling Groups, then Create Auto Scaling group.
Pick your launch template, then Next.
Under Network, select your VPC and choose subnets in at least two different AZs.
Set Desired, Minimum, and Maximum capacity (e.g. min 2 so you always keep a spare).
Attach the group to your ALB’s target group, then Create.

Equivalent AWS CLI:

aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name web-asg \
  --launch-template "LaunchTemplateId=lt-0a1b2c3d4e5f,Version=1" \
  --min-size 2 --max-size 6 --desired-capacity 2 \
  --vpc-zone-identifier "subnet-0a1b2c3d,subnet-0b2c3d4e" \
  --target-group-arns "arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/web-tg/0a1b2c3d4e5f"

Verify (the create command prints nothing on success):

aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names web-asg \
  --query "AutoScalingGroups[0].Instances[].{Id:InstanceId,AZ:AvailabilityZone,State:LifecycleState}"

Output:

[
  { "Id": "i-0a1b2c3d4e5f", "AZ": "us-east-1a", "State": "InService" },
  { "Id": "i-0b2c3d4e5f6a", "AZ": "us-east-1b", "State": "InService" }
]
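The same check can be scripted. The sketch below takes the JSON list printed by the describe command and counts in-service instances per AZ. The helper names az_spread and is_multi_az are hypothetical, written for this illustration; they are not part of any AWS tool.

```python
import json
from collections import Counter


def az_spread(instances: list) -> Counter:
    """Count InService instances per Availability Zone."""
    return Counter(i["AZ"] for i in instances if i["State"] == "InService")


def is_multi_az(instances: list, min_zones: int = 2) -> bool:
    """True when in-service instances cover at least `min_zones` AZs."""
    return len(az_spread(instances)) >= min_zones


if __name__ == "__main__":
    # Sample copied from the describe-auto-scaling-groups output shown earlier.
    sample = json.loads("""
    [
      { "Id": "i-0a1b2c3d4e5f", "AZ": "us-east-1a", "State": "InService" },
      { "Id": "i-0b2c3d4e5f6a", "AZ": "us-east-1b", "State": "InService" }
    ]
    """)
    print(dict(az_spread(sample)), is_multi_az(sample))
```

A check like this could run after a deployment to catch a group that has quietly collapsed into a single AZ.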

Cost note: Keeping a minimum of 2 instances means you always pay for the spare. A t3.medium costs about $0.0416/hour On-Demand (Linux, us-east-1), so one idle standby is roughly $30/month (0.0416 x 730 hours). That is the price of redundancy.

Health checks and automatic replacement

A health check is a probe that asks a component “are you OK?” on a schedule. If the answer is no, the platform pulls it out of rotation and replaces it. This is what turns redundancy into self-healing.

There are two layers worth configuring:

ELB health checks decide whether the load balancer should send traffic to a target. Point these at a real /health endpoint, not just the TCP port, so you catch a hung app that is still listening.
Auto Scaling health checks decide whether to terminate and replace an instance. Set the Auto Scaling group’s health check type to ELB so a target that fails the ALB check gets replaced, not just removed from rotation.
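What a real /health endpoint might look like, as a minimal sketch using only the Python standard library. The /health path comes from the text above; check_dependencies, health_response, and port 8080 are placeholders for whatever your application actually needs. An ALB treats HTTP 200 as healthy by default, so the handler answers 503 when any check fails.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def check_dependencies() -> dict:
    """Placeholder: replace with cheap, real checks (database ping, cache ping)."""
    return {"app": True}


def health_response(checks: dict) -> tuple:
    """Map check results to an (HTTP status, JSON body) pair."""
    healthy = all(checks.values())
    body = json.dumps({"status": "ok" if healthy else "unhealthy", "checks": checks})
    return (200 if healthy else 503), body.encode("utf-8")


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        status, body = health_response(check_dependencies())
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Run the health server until interrupted (port is an arbitrary choice)."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Calling serve() starts the listener. Keep the checks fast and shallow: a health endpoint that itself hangs on a slow dependency can take every instance out of rotation at once.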

aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name web-asg \
  --health-check-type ELB \
  --health-check-grace-period 90

The --health-check-grace-period (90 seconds above) gives a fresh instance time to boot and warm up before health checks count against it. Set it too low and healthy new instances get killed in a loop.

Retries with exponential backoff and jitter

When a call fails because of a brief blip (a momentary network drop, a throttling response), retrying often succeeds. But how you retry matters enormously.

Naive retries are dangerous. If 10,000 clients all fail at the same instant and all retry after exactly 1 second, they hammer the recovering service simultaneously and knock it over again. This is the thundering-herd problem, also called a retry storm — your retries become a self-inflicted denial-of-service attack.

The fix has two parts:

Exponential backoff: wait longer after each failure (e.g. 1s, 2s, 4s, 8s) instead of retrying immediately.
Jitter: randomize each wait so clients spread out instead of retrying in lockstep. The snippet below uses "full jitter": it sleeps for a random time between zero and the current backoff value.

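import random
import time


class TransientError(Exception):
    """Placeholder for whatever exception your client raises on retryable failures."""


def call_with_retry(operation, max_attempts=5, base=1.0, cap=20.0):
    """Call operation(), retrying TransientError with capped backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # exponential backoff capped at `cap`, with full jitter
            backoff = min(cap, base * (2 ** attempt))
            time.sleep(random.uniform(0, backoff))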
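
To see what those parameters produce, the sketch below lists the backoff ceiling for each retry (the upper bound of the random sleep), using the same formula and the same defaults, base=1.0 and cap=20.0. The helper names backoff_ceilings and full_jitter_delays are hypothetical, written only for this illustration.

```python
import random
from typing import Optional


def backoff_ceilings(attempts: int, base: float = 1.0, cap: float = 20.0) -> list:
    """Upper bound of the sleep before each retry: min(cap, base * 2**attempt)."""
    return [min(cap, base * (2 ** attempt)) for attempt in range(attempts)]


def full_jitter_delays(attempts: int, base: float = 1.0, cap: float = 20.0,
                       rng: Optional[random.Random] = None) -> list:
    """One random draw per retry, uniform between 0 and that retry's ceiling."""
    rng = rng or random.Random()
    return [rng.uniform(0, ceiling) for ceiling in backoff_ceilings(attempts, base, cap)]


if __name__ == "__main__":
    print(backoff_ceilings(7))    # [1.0, 2.0, 4.0, 8.0, 16.0, 20.0, 20.0]
    print(full_jitter_delays(7))  # different on every run
```

With the default max_attempts=5 there are at most four sleeps, with ceilings of 1, 2, 4, and 8 seconds, so the 20-second cap only comes into play if you allow more attempts.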

Gotcha: Only retry transient errors (HTTP 429, 500, 503, timeouts). Never retry a 400 Bad Request or a 403 Forbidden: the request itself is malformed or not permitted, so sending it again unchanged just wastes time. The AWS SDKs already implement backoff with jitter for you; the snippet above shows the pattern for your own service-to-service calls.
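
One way to encode that rule is a small predicate that the retry loop consults before sleeping. This is a sketch: is_retryable and RETRYABLE_STATUS are hypothetical names, and the set holds only the status codes named above.

```python
from typing import Optional

# Status codes the text calls transient. Many teams also add 502 and 504;
# that extension is an assumption, not something this page prescribes.
RETRYABLE_STATUS = frozenset({429, 500, 503})


def is_retryable(status: Optional[int] = None, timed_out: bool = False) -> bool:
    """Decide whether a failed call is worth retrying."""
    if timed_out:
        return True  # no response at all: treated as transient
    return status in RETRYABLE_STATUS
```

Retrying after a timeout is only safe when the operation is idempotent, because the first attempt may have succeeded on the server even though the reply never arrived.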

Timeouts and circuit breakers

A timeout is the maximum time you will wait for a response before giving up. Without one, a slow downstream dependency causes your threads to pile up waiting, and eventually your own service runs out of resources and falls over. Always set a timeout on every network call (often 1-5 seconds for internal calls).
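
A minimal sketch of an explicit timeout using Python's standard urllib. The fetch helper, the DependencyTimeout exception, and the 2-second default are assumptions made for this example. Note that urlopen applies the timeout to each blocking socket operation, not to the request as a whole, so it bounds stalls rather than acting as a strict end-to-end deadline.

```python
import socket
import urllib.error
import urllib.request


class DependencyTimeout(Exception):
    """The downstream call did not answer within the allowed time."""


def fetch(url: str, timeout_s: float = 2.0) -> bytes:
    """GET `url`, giving up if any socket operation blocks longer than `timeout_s`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.read()
    except socket.timeout as exc:
        # Raised directly when the server accepts the connection but is slow to answer.
        raise DependencyTimeout(url) from exc
    except urllib.error.URLError as exc:
        # Timeouts while connecting or sending arrive wrapped in URLError.
        if isinstance(exc.reason, socket.timeout):
            raise DependencyTimeout(url) from exc
        raise
```

Turning the low-level exception into one application-level error gives retry logic and circuit breakers a single failure type to react to.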

A circuit breaker goes further. After it sees too many failures from a dependency, it “opens” and fails fast for a while instead of retrying a service it knows is down. After a cooldown it lets one trial request through (“half-open”); if that succeeds, it closes again.

State	Behavior
Closed	Calls flow normally; failures are counted.
Open	Calls fail immediately without hitting the dependency.
Half-open	One trial call allowed; success closes, failure re-opens.
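
The three states can be sketched as a small class. This is an illustration, not how Resilience4j or opossum are implemented: it counts consecutive failures, is not thread-safe, and the names CircuitBreaker, CircuitOpenError, failure_threshold, and cooldown_s are all chosen for this example. The clock is injectable so the cooldown can be tested without sleeping.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without being tried."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0                        # consecutive failures seen
        self._opened_at: Optional[float] = None   # None means closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.cooldown_s:
            return "half-open"  # cooldown elapsed: next call is the trial
        return "open"

    def call(self, operation: Callable[[], object]):
        state = self.state
        if state == "open":
            raise CircuitOpenError("dependency marked down; failing fast")
        try:
            result = operation()
        except Exception:
            self._failures += 1
            # A failed trial re-opens at once; otherwise open at the threshold.
            if state == "half-open" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success resets the count and closes the breaker.
        self._failures = 0
        self._opened_at = None
        return result
```

Callers wrap each dependency call as breaker.call(lambda: client.get(...)) and treat CircuitOpenError as the cue to use a fallback instead of waiting.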

Use a circuit breaker when a dependency can be slow or flaky and you would rather degrade than wait. Libraries like Resilience4j (Java) or opossum (Node) implement this; AWS App Mesh and many other service meshes provide similar protection at the network layer.

Graceful degradation

When something does break, a resilient system keeps offering a reduced service rather than showing an error page. If the recommendations service is down, show a generic best-sellers list. If live inventory is unavailable, serve a slightly stale cached count. The user still gets a working page.

Practical AWS patterns for this:

Serve a cached response from Amazon ElastiCache or a CDN when the live source fails.
Queue writes with Amazon SQS so the front end stays responsive even if the backend is briefly down.
Return sensible defaults from feature flags so non-critical features switch off cleanly.
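
The fallback chain behind these patterns (live source, then last known good value, then a safe default) can be sketched in a few lines. Everything here is illustrative: the function name, the plain dict standing in for ElastiCache, and the placeholder best-sellers list are assumptions, not part of any AWS API.

```python
from typing import Callable

# Placeholder default shown when nothing better is available.
DEFAULT_BEST_SELLERS = ["best-seller-1", "best-seller-2", "best-seller-3"]


def recommendations_with_fallback(user_id: str,
                                  fetch_live: Callable[[str], list],
                                  cache: dict,
                                  default: list = DEFAULT_BEST_SELLERS) -> tuple:
    """Return (items, source), where source is 'live', 'cache', or 'default'."""
    try:
        items = fetch_live(user_id)
    except Exception:
        # In real code, catch only the errors your client raises (timeouts, 5xx).
        if user_id in cache:
            return cache[user_id], "cache"  # possibly stale, but still useful
        return default, "default"
    cache[user_id] = items  # remember the last good answer for next time
    return items, "live"
```

Returning the source alongside the data lets the page label stale content and lets you count in your metrics how often users are seeing degraded results.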

Best Practices

Spread every tier across at least two Availability Zones; a single-AZ deployment is not fault tolerant.
Point health checks at a real application endpoint, and set the Auto Scaling health check type to ELB so bad instances are replaced.
Always use exponential backoff with jitter — backoff alone still lets clients synchronize into a retry storm.
Set an explicit timeout on every network call, and only retry errors that are actually transient.
Add circuit breakers around flaky dependencies so a slow service fails fast instead of dragging you down.
Design fallbacks (cached data, queues, defaults) so users see degraded service rather than errors.
Reserve true fault tolerance for paths that cannot blink; use plain high availability elsewhere to control cost.