
Scalability Patterns

Scalability is the ability of a system to handle more work (more users, more requests, more data) by adding resources. On AWS this is the difference between an app that falls over during a traffic spike and one that quietly grows to meet demand. The good news is that the cloud makes adding resources cheap and fast, but only if your architecture is designed for it. This page walks through the two ways to scale, why stateless application tiers are the secret to scaling well, and the four patterns you will reach for most often.

Horizontal vs vertical scaling

There are two fundamental ways to give a system more capacity.

Vertical scaling (scaling up) means making a single machine bigger: more CPU (the processor that does the work), more RAM (short-term memory), faster disk. On AWS this means changing the instance type, for example moving an EC2 instance (a virtual server) from a t3.medium to an m6i.2xlarge. The instance has to be stopped while its type is changed.

Horizontal scaling (scaling out) means adding more machines and spreading the work across them. Instead of one big server you run ten small ones behind a load balancer (a component that distributes incoming traffic across them).

| Aspect | Vertical scaling (up) | Horizontal scaling (out) |
| --- | --- | --- |
| How | Bigger instance type | More instances |
| Ceiling | Limited: largest instance is finite | Effectively unlimited |
| Downtime | Usually needs a restart | None: add/remove live |
| Resilience | Single point of failure | Survives losing a node |
| Best for | Databases, legacy apps | Web/app tiers, APIs |

Gotcha: Vertical scaling always hits a ceiling — eventually there is no bigger instance to buy — and a single large server is a single point of failure: if it dies, your whole app is down. Design for horizontal scale from the start.

When to use which: Reach for vertical scaling when a workload genuinely cannot be split across machines (many traditional relational databases, or licensed software that runs as one process). Use horizontal scaling for everything that can run as multiple copies — especially web and API tiers.

Why stateless tiers scale better

A server is stateless when it stores no per-user data locally between requests. Any instance can handle any request because nothing important lives only on that one box. A stateful server keeps things like the user’s login session in its own memory, so that user must keep coming back to the same instance.

Statelessness is what makes horizontal scaling actually work. If session data lives on instance A and the load balancer sends the next request to instance B, a stateful app logs the user out. If you scale in (remove an instance), you lose whatever state it held.

The fix is to push state out of the app tier into a shared store:

  • Sessions / cache: Amazon ElastiCache (a managed in-memory cache, Redis or Memcached).
  • Persistent state: Amazon DynamoDB (a managed NoSQL key-value database) or a relational database.
  • Uploaded files: Amazon S3 (object storage), never the local disk.
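
The difference between the two designs can be sketched in a few lines of Python. This is only an illustration: `SharedSessionStore` is a hypothetical in-process stand-in for ElastiCache or DynamoDB (a real app would call a network client), and the class and method names are invented for this example.

```python
class SharedSessionStore:
    """Hypothetical stand-in for a shared store such as ElastiCache or DynamoDB."""

    def __init__(self):
        self._data = {}

    def put(self, session_id, session):
        self._data[session_id] = dict(session)

    def get(self, session_id):
        return self._data.get(session_id)


class StatefulServer:
    """Keeps sessions in its own memory, so only this instance knows them."""

    def __init__(self, name):
        self.name = name
        self._sessions = {}

    def login(self, session_id, user):
        self._sessions[session_id] = {"user": user}

    def whoami(self, session_id):
        session = self._sessions.get(session_id)
        return session["user"] if session else None


class StatelessServer:
    """Holds no per-user data; every lookup goes to the shared store."""

    def __init__(self, name, store):
        self.name = name
        self._store = store

    def login(self, session_id, user):
        self._store.put(session_id, {"user": user})

    def whoami(self, session_id):
        session = self._store.get(session_id)
        return session["user"] if session else None


# Stateful: the login lands on instance A, the next request lands on B.
a, b = StatefulServer("A"), StatefulServer("B")
a.login("s1", "alice")
stateful_result = b.whoami("s1")  # B has never seen this session

# Stateless: both instances read and write the same shared store.
store = SharedSessionStore()
c, d = StatelessServer("C", store), StatelessServer("D", store)
c.login("s1", "alice")
stateless_result = d.whoami("s1")

print(stateful_result, stateless_result)
```

In the stateful case the second instance returns `None` (the user appears logged out); in the stateless case any instance can answer, which is what lets the load balancer and the scaling policy treat instances as interchangeable.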

When NOT to worry: If you run a single instance and never scale, statelessness matters less — but you have also accepted a single point of failure, so it is rarely the right long-term choice.

Pattern 1: Auto Scaling Group + load balancer

This is the workhorse pattern for web and API tiers. An Auto Scaling Group (ASG) keeps a target number of identical EC2 instances running and adds or removes them based on load. An Application Load Balancer (ALB) spreads traffic across whatever instances are currently healthy.
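
To make "spreads traffic across whatever instances are currently healthy" concrete, here is a minimal round-robin sketch. It is not how the ALB is implemented; `RoundRobinBalancer`, its methods, and the instance IDs are hypothetical names chosen for illustration.

```python
from itertools import count


class RoundRobinBalancer:
    """Toy balancer: rotates requests across the instances marked healthy."""

    def __init__(self, instances):
        self._instances = list(instances)
        self._healthy = set(self._instances)
        self._counter = count()

    def mark_unhealthy(self, instance):
        self._healthy.discard(instance)

    def mark_healthy(self, instance):
        if instance in self._instances:
            self._healthy.add(instance)

    def route(self):
        # Only healthy instances are candidates for the next request.
        targets = [i for i in self._instances if i in self._healthy]
        if not targets:
            raise RuntimeError("no healthy instances")
        return targets[next(self._counter) % len(targets)]


lb = RoundRobinBalancer(["i-a", "i-b", "i-c"])
before = [lb.route() for _ in range(3)]

lb.mark_unhealthy("i-b")  # e.g. a failed health check
after = [lb.route() for _ in range(4)]

print(before, after)
```

Once `i-b` is marked unhealthy it stops receiving traffic and the remaining instances absorb it, which is the behavior that lets an ASG replace a failed instance without an outage.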

Console steps:

  1. Open the EC2 console and choose Launch Templates > Create launch template. Specify the AMI (ami-0abcdef1234567890), instance type, and security group (sg-0a1b2c3d).
  2. Go to Auto Scaling Groups > Create Auto Scaling group, pick the launch template, and select your subnets (subnet-0a1b2c3d in two or more Availability Zones).
  3. Attach a new or existing Application Load Balancer and target group.
  4. Set group size: desired 2, minimum 2, maximum 10.
  5. Add a target tracking scaling policy: keep average CPU at 50%.

CLI equivalent:

aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name web-asg \
  --launch-template LaunchTemplateId=lt-0a1b2c3d,Version='$Latest' \
  --min-size 2 --max-size 10 --desired-capacity 2 \
  --vpc-zone-identifier "subnet-0a1b2c3d,subnet-0b2c3d4e" \
  --target-group-arns arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/web-tg/abc123

aws autoscaling put-scaling-policy \
  --auto-scaling-group-name web-asg \
  --policy-name cpu50 --policy-type TargetTrackingScaling \
  --target-tracking-configuration \
  '{"PredefinedMetricSpecification":{"PredefinedMetricType":"ASGAverageCPUUtilization"},"TargetValue":50.0}'

Output:

{
    "PolicyARN": "arn:aws:autoscaling:us-east-1:111122223333:scalingPolicy:...:autoScalingGroupName/web-asg:policyName/cpu50",
    "Alarms": [
        {"AlarmName": "TargetTracking-web-asg-AlarmHigh-...", "AlarmARN": "arn:aws:cloudwatch:..."},
        {"AlarmName": "TargetTracking-web-asg-AlarmLow-...", "AlarmARN": "arn:aws:cloudwatch:..."}
    ]
}
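
For intuition about what a target tracking policy is aiming for, the sketch below sizes the group in proportion to how far the metric is from its target. This is a deliberate simplification and an assumption on my part, not the service's actual algorithm: the real policy acts through the CloudWatch alarms shown above and applies its own timing rules. The function name and parameters are hypothetical.

```python
import math


def estimate_desired_capacity(current_instances, avg_metric, target, min_size, max_size):
    """Proportional estimate: scale the fleet so the average metric lands near target."""
    if target <= 0:
        raise ValueError("target must be positive")
    raw = math.ceil(current_instances * avg_metric / target)
    # Never go below the group's minimum or above its maximum.
    return max(min_size, min(max_size, raw))


# 2 instances at 90% CPU with a 50% target: roughly 4 instances are needed.
scale_out = estimate_desired_capacity(2, 90, 50, min_size=2, max_size=10)

# 4 instances at 25% CPU: 2 would be enough.
scale_in = estimate_desired_capacity(4, 25, 50, min_size=2, max_size=10)

# 8 instances at 100% CPU would want 16, but the maximum caps it at 10.
capped = estimate_desired_capacity(8, 100, 50, min_size=2, max_size=10)

print(scale_out, scale_in, capped)
```

The cap in the last case is why the group maximum matters: it bounds both capacity and cost when load exceeds what you planned for.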

Cost note: You pay only for running instances. With min 2 and a spike to 10 for one hour a day on t3.medium (about $0.042/hour), the extra 8 instances cost roughly $10/month — far cheaper than running 10 instances around the clock.
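
The arithmetic behind that estimate, written out as a small sketch. The hourly rate is the approximate figure quoted above; the 30-day month is an assumption, and the calculation covers instance hours only.

```python
HOURLY_RATE = 0.042    # approximate t3.medium price per hour, from the text above
DAYS_PER_MONTH = 30    # assumption for a round monthly figure


def monthly_cost(instances, hours_per_day):
    """Instance-hours per month multiplied by the hourly rate."""
    return instances * hours_per_day * DAYS_PER_MONTH * HOURLY_RATE


baseline = monthly_cost(2, 24)      # the 2 always-on instances
spike_extra = monthly_cost(8, 1)    # 8 extra instances for 1 hour a day
always_ten = monthly_cost(10, 24)   # 10 instances around the clock

print(f"baseline:    ${baseline:.2f}")
print(f"spike extra: ${spike_extra:.2f}")
print(f"always ten:  ${always_ten:.2f}")
```

Under these assumptions the autoscaled setup comes to about $70.56 a month ($60.48 baseline plus $10.08 for the spike) against about $302.40 for ten instances running permanently.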

Pattern 2: Queue-based load leveling

When producers create work faster than consumers can process it, put a queue between them. Amazon SQS (Simple Queue Service, a managed message queue) absorbs bursts: producers drop messages in instantly, and a fleet of workers pulls them off at a steady pace. The queue smooths spikes so your backend never gets overwhelmed.

aws sqs send-message \
  --queue-url https://sqs.us-east-1.amazonaws.com/111122223333/orders \
  --message-body '{"orderId":"A-1001","total":49.99}'

Output:

{
    "MD5OfMessageBody": "a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6",
    "MessageId": "c1f0d8a2-3b4c-4d5e-9f01-23456789abcd"
}
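
The leveling effect itself can be seen in a small simulation. This is a sketch with an in-memory `deque` standing in for SQS; the tick-based model, the `simulate` function, and the numbers are all illustrative assumptions.

```python
from collections import deque


def simulate(arrivals_per_tick, worker_rate):
    """Producers enqueue in bursts; workers drain at most worker_rate per tick."""
    queue = deque()
    backlog_history = []
    processed_history = []
    next_id = 0
    for arrivals in arrivals_per_tick:
        # Producers drop their messages in immediately.
        for _ in range(arrivals):
            queue.append(next_id)
            next_id += 1
        # Workers pull at a steady pace, regardless of the burst size.
        processed = 0
        while queue and processed < worker_rate:
            queue.popleft()
            processed += 1
        backlog_history.append(len(queue))
        processed_history.append(processed)
    return backlog_history, processed_history


arrivals = [0, 10, 0, 0, 0, 2, 0]   # one burst of 10, then a small one
backlog, processed = simulate(arrivals, worker_rate=3)

print("backlog:  ", backlog)
print("processed:", processed)
```

The burst of 10 never reaches the workers as a burst: they handle at most 3 messages per tick, the backlog peaks at 7 and drains over the following ticks, and every message is eventually processed.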

You can even scale the worker ASG on the queue depth (the ApproximateNumberOfMessagesVisible CloudWatch metric), so more messages automatically launch more workers.

When to use: any spiky, asynchronous workload, such as order processing, image resizing, or email sending.

When NOT to: strictly synchronous requests where the user is waiting for an immediate answer.

Pattern 3: Caching

A cache stores the answer to expensive work so you do not redo it. Caching reduces load on your database and speeds up responses.

  • Amazon CloudFront (a CDN — content delivery network) caches static files and API responses at edge locations near users.
  • Amazon ElastiCache caches database query results and session data in memory.

A common pattern is cache-aside: the app checks the cache first, and only queries the database on a miss, then stores the result for next time.
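
A minimal cache-aside sketch, assuming an in-process dictionary in place of ElastiCache and a function in place of the database query. The `CacheAside` class, the `slow_query` function, and the time-to-live (TTL) expiry are illustrative additions, not something the pattern or AWS prescribes.

```python
import time


class CacheAside:
    """Check the cache first; on a miss, load from the database and store the result."""

    def __init__(self, load_from_db, ttl_seconds=60.0, clock=time.monotonic):
        self._load = load_from_db
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)
        self.hits = 0
        self.misses = 0

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        # Miss (or expired entry): query the database, then cache for next time.
        self.misses += 1
        value = self._load(key)
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key):
        """Call after a write so the next read fetches fresh data."""
        self._entries.pop(key, None)


db_calls = []


def slow_query(user_id):
    """Hypothetical expensive database lookup; records each call it receives."""
    db_calls.append(user_id)
    return {"id": user_id, "name": f"user-{user_id}"}


cache = CacheAside(slow_query, ttl_seconds=60.0)
first = cache.get(42)    # miss: goes to the database
second = cache.get(42)   # hit: served from the cache

print(len(db_calls), cache.hits, cache.misses)
```

Two reads cost one database call. The TTL and `invalidate` are there because a cache-aside entry does not update itself when the underlying row changes.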

Pattern 4: Database read replicas and sharding

The database is usually the real bottleneck. App tiers scale out easily, but a single database does not. Plan your database scaling before the database becomes the limit.

Two techniques:

  • Read replicas: Amazon RDS (Relational Database Service) can create read-only copies of your database. Send reads that can tolerate slightly stale data to the replicas (replication is asynchronous, so a replica can lag behind the primary) and all writes to the primary. Great when reads vastly outnumber writes.
  • Sharding: Split data across multiple databases by a key (for example, user ID ranges). This scales writes too, but adds application complexity. DynamoDB does this automatically under the hood.

aws rds create-db-instance-read-replica \
  --db-instance-identifier app-db-replica-1 \
  --source-db-instance-identifier app-db-primary

Output:

{
    "DBInstance": {
        "DBInstanceIdentifier": "app-db-replica-1",
        "DBInstanceStatus": "creating",
        "ReadReplicaSourceDBInstanceIdentifier": "app-db-primary",
        "Engine": "postgres"
    }
}
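
Once a replica exists, the application has to decide which endpoint each query goes to. The sketch below shows one way that routing might look. It is an assumption-laden illustration: `ReadWriteRouter` is a hypothetical helper, the endpoint strings are placeholders (`app-db-replica-2` does not appear above), and detecting reads by a leading `SELECT` is far cruder than what a production data layer would do.

```python
from itertools import cycle


class ReadWriteRouter:
    """Sends writes to the primary and rotates reads across the replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = cycle(replicas) if replicas else None

    def endpoint_for(self, sql, needs_fresh_read=False):
        is_read = sql.lstrip().lower().startswith("select")
        # RDS read replicas replicate asynchronously, so a read that must see
        # the latest write goes to the primary instead of a replica.
        if is_read and self._replicas is not None and not needs_fresh_read:
            return next(self._replicas)
        return self.primary


router = ReadWriteRouter(
    primary="app-db-primary",
    replicas=["app-db-replica-1", "app-db-replica-2"],
)

write_target = router.endpoint_for("INSERT INTO orders (id) VALUES (1)")
read_targets = [router.endpoint_for("SELECT * FROM orders") for _ in range(3)]
fresh_target = router.endpoint_for("SELECT * FROM orders WHERE id = 1", needs_fresh_read=True)

print(write_target, read_targets, fresh_target)
```

The `needs_fresh_read` flag covers the common case of reading back something the user has just written.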

When to use: Add caching and read replicas as soon as read traffic grows; reach for sharding only when a single primary can no longer handle writes, because it is the hardest pattern to undo.
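
To show where sharding's application complexity comes from, here is a sketch of range-based routing on user ID, the example key mentioned above. The boundaries, shard names, and `shard_for_user` function are hypothetical.

```python
from bisect import bisect_right

# Exclusive upper bound of each shard's user ID range (hypothetical values).
SHARD_UPPER_BOUNDS = [1_000_000, 2_000_000, 3_000_000]
SHARD_NAMES = ["users-shard-0", "users-shard-1", "users-shard-2"]


def shard_for_user(user_id):
    """Return the database that owns this user ID."""
    if user_id < 0 or user_id >= SHARD_UPPER_BOUNDS[-1]:
        raise ValueError(f"user_id {user_id} is outside every shard range")
    # bisect_right counts how many upper bounds are <= user_id,
    # which is the index of the shard whose range contains it.
    return SHARD_NAMES[bisect_right(SHARD_UPPER_BOUNDS, user_id)]


print(shard_for_user(42))
print(shard_for_user(1_000_000))
print(shard_for_user(2_999_999))
```

Every query now needs the shard key to find its database, queries that span users have to fan out across shards, and changing the boundaries later means moving data. That is why the advice above is to reach for this pattern last.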

Best practices

  • Default to horizontal scaling; treat vertical scaling as a stopgap because it has a ceiling and is a single point of failure.
  • Keep every app tier stateless — push sessions and state to ElastiCache or DynamoDB and uploads to S3.
  • Use target-tracking scaling policies (for example CPU at 50%) rather than guessing fixed instance counts.
  • Spread instances across at least two Availability Zones so losing one zone does not take you down.
  • Put SQS between bursty producers and slower consumers to level load.
  • Offload reads to a cache and read replicas early — the database is the most common scaling wall.
  • Set sensible ASG maximums and alarms so a runaway scale-out does not surprise you on the bill.

Last updated June 15, 2026