AWS Best Practices Checklist
Most AWS problems are not caused by missing some exotic feature. They come from the basics quietly slipping: a permission that grew too broad, a resource nobody tagged, a bucket that became public, or a budget alarm that was never set. This page is a single consolidated checklist across the six areas that matter most: identity, networking, data, reliability, cost, and operations. Treat it as something you re-run every quarter, not a one-time setup, because drift always creeps back.
The single highest-value habit is periodically re-auditing the basics. Untagged resources, orphaned spend, and over-broad permissions all return over time. A 30-minute review every quarter prevents almost every expensive surprise.
Identity (IAM)
IAM (Identity and Access Management) controls who can do what in your account. Get this wrong and nothing else matters.
When to use what: use IAM roles (temporary, auto-rotating credentials) for anything running on AWS. Use long-lived access keys (a permanent programmatic username/password) only for legacy tools that genuinely cannot assume a role. Never use the root user (the email you signed up with) for daily work.
Checklist:
- Enable MFA (Multi-Factor Authentication, a second proof of identity) on the root user and every human user.
- Delete any root access keys entirely.
- Apply least privilege: grant only the permissions each identity needs, never AdministratorAccess by default.
- Prefer roles over keys; attach an instance profile to EC2 (a virtual server) rather than baking in keys.
Find users without MFA from the CLI:
aws iam generate-credential-report
aws iam get-credential-report --query 'Content' --output text | base64 -d | awk -F, 'NR==1 || $8=="false"'
Output:
user,arn,user_creation_time,password_enabled,...,mfa_active,...
deploy-bot,arn:aws:iam::123456789012:user/deploy-bot,2025-02-11T09:00:00+00:00,true,...,false,...
Any row with mfa_active=false for a human is a finding to fix.
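The awk filter above depends on mfa_active being the eighth column. As a slightly more defensive sketch, the helper below looks columns up by header name and uses password_enabled as a rough proxy for "human with console access". The function name users_without_mfa, the sample rows, and that proxy are all assumptions for illustration, not part of any AWS tooling.

```shell
#!/bin/sh
# Hypothetical helper: print users that have a console password but no MFA.
# Reads a decoded credential report (CSV with a header row) on stdin.
users_without_mfa() {
  awk -F, '
    NR == 1 {
      # Locate columns by header name instead of hard-coding positions.
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    $(col["password_enabled"]) == "true" && $(col["mfa_active"]) == "false" {
      print $(col["user"])
    }
  '
}

# Sample data standing in for a real report (all values are made up).
sample_report='user,arn,user_creation_time,password_enabled,password_last_used,password_last_changed,password_next_rotation,mfa_active
alice,arn:aws:iam::123456789012:user/alice,2025-01-05T10:00:00+00:00,true,2025-03-01T08:00:00+00:00,2025-01-05T10:00:00+00:00,N/A,true
bob,arn:aws:iam::123456789012:user/bob,2025-01-06T10:00:00+00:00,true,no_information,2025-01-06T10:00:00+00:00,N/A,false
ci-runner,arn:aws:iam::123456789012:user/ci-runner,2025-02-11T09:00:00+00:00,false,N/A,N/A,N/A,false'

printf '%s\n' "$sample_report" | users_without_mfa   # prints: bob
```

Against a real account, the decoded report from the get-credential-report command above could be piped into users_without_mfa instead of the sample data. Service users without a password are skipped on purpose, so review those separately.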
Networking
A network should expose as little as possible. The pattern: put servers in private subnets (no direct internet route) and only put load balancers and bastions in public subnets.
When to use this: any workload that does not need to receive inbound traffic from the internet directly, which is almost everything except your load balancer.
Checklist:
- Keep application and database tiers in private subnets.
- Make security groups (virtual firewalls around resources) least-exposure: reference other security groups, not 0.0.0.0/0, for app-to-app traffic.
- Open SSH/RDP only via Session Manager or a bastion host, never to the whole internet.
Audit security groups open to the world on the CLI:
aws ec2 describe-security-groups \
--filters Name=ip-permission.cidr,Values=0.0.0.0/0 \
--query 'SecurityGroups[].{ID:GroupId,Name:GroupName}'
Output:
[
{ "ID": "sg-0a1b2c3d", "Name": "legacy-web-sg" }
]
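Review every group returned: an open 0.0.0.0/0 rule on port 22 (SSH) or 3389 (RDP) is almost always a mistake. This filter matches IPv4 only; repeat the audit with the ip-permission.ipv6-cidr filter and ::/0 to catch IPv6 rules.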
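To make the "port 22 or 3389" check concrete, here is a minimal sketch that flags world-open rules whose port range covers either admin port. It assumes you have already flattened the rules into lines of the form "group-id from-port to-port" with numeric ports; that input format and the function name flag_admin_ports are hypothetical, and the group IDs are sample values.

```shell
#!/bin/sh
# Hypothetical helper: flag rules whose port range includes SSH (22) or RDP (3389).
# Input lines (assumed format): <group-id> <from-port> <to-port>
flag_admin_ports() {
  awk '
    {
      if ($2 <= 22   && 22   <= $3) print $1, "ssh"
      if ($2 <= 3389 && 3389 <= $3) print $1, "rdp"
    }
  '
}

# Sample rules: one SSH-only, one HTTPS-only, one wide-open range.
printf '%s\n' \
  'sg-0a1b2c3d 22 22' \
  'sg-0e5f6a7b 443 443' \
  'sg-0c9d8e7f 0 65535' | flag_admin_ports
# prints:
# sg-0a1b2c3d ssh
# sg-0c9d8e7f ssh
# sg-0c9d8e7f rdp
```

A range check matters because a rule such as 0-65535 exposes SSH and RDP without ever naming port 22 or 3389.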
Data
Data should be encrypted by default and never accidentally public.
| Control | Recommended setting | Why it matters |
|---|---|---|
| S3 Block Public Access | Account-wide ON | Prevents accidental public buckets |
| Encryption at rest (KMS) | On for S3, EBS, RDS | Protects data if storage is exposed |
| Encryption in transit (TLS) | Enforce HTTPS | Protects data on the wire |
| Backups | Automated, tested restores | A backup you never restored is a guess |
Turn on account-wide S3 Block Public Access:
aws s3control put-public-access-block \
--account-id 123456789012 \
--public-access-block-configuration \
BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true
Output: (no output on success)
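Because the command prints nothing on success, it can help to read the setting back and confirm that all four flags are on. The sketch below checks a line of four boolean values; the function name all_blocked is hypothetical, and it assumes the values arrive as whitespace-separated text (for example from aws s3control get-public-access-block with a --query selecting the four fields and --output text, an invocation you should verify yourself).

```shell
#!/bin/sh
# Hypothetical helper: succeed only if stdin holds exactly four "true" values
# (any letter case), one per Block Public Access flag.
all_blocked() {
  awk '
    { for (i = 1; i <= NF; i++) { n++; if (tolower($i) != "true") bad = 1 } }
    END { exit ((n == 4 && !bad) ? 0 : 1) }
  '
}

# Sample input standing in for the read-back values.
if printf 'True\tTrue\tTrue\tTrue\n' | all_blocked; then
  echo "fully blocked"
else
  echo "gap found"
fi
# prints: fully blocked
```

Requiring exactly four values means an empty or partial response counts as a failure rather than a silent pass.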
A backup is only real once you have restored it. Schedule a restore test (an EBS snapshot or RDS restore) at least twice a year.
Reliability
Reliability means surviving the failure of a single server or even a whole data center.
When to use this: any workload where downtime costs money or trust. Single-AZ is fine only for throwaway dev environments.
Checklist:
- Run across at least two Availability Zones (AZs, isolated data centers in a region).
- Use Multi-AZ for RDS (a managed database) so a standby takes over automatically.
- Put compute behind an Auto Scaling Group (ASG, which replaces unhealthy instances and adjusts capacity).
- Configure health checks so the load balancer stops sending traffic to bad instances.
Check each group's minimum size and AZ spread from the CLI:
aws autoscaling describe-auto-scaling-groups \
--query 'AutoScalingGroups[].{Name:AutoScalingGroupName,Min:MinSize,AZs:AvailabilityZones}'
Output:
[
{ "Name": "web-asg", "Min": 2, "AZs": ["us-east-1a", "us-east-1b"] }
]
A Min of at least 2 across two AZs means a single-AZ outage will not take you down.
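The rule above (a minimum of 2 instances, spread over at least two AZs) can be sketched as a small check. It assumes the group data has been flattened into lines of "name min-size az [az...]"; that format and the function name check_asg_spread are hypothetical, and the group names are samples.

```shell
#!/bin/sh
# Hypothetical helper: label each group "ok" or "at-risk".
# Input lines (assumed format): <name> <min-size> <az> [<az> ...]
check_asg_spread() {
  awk '
    {
      split("", seen); azs = 0          # reset the per-line set of AZs
      for (i = 3; i <= NF; i++)
        if (!($i in seen)) { seen[$i] = 1; azs++ }
      status = ($2 >= 2 && azs >= 2) ? "ok" : "at-risk"
      print $1, status
    }
  '
}

printf '%s\n' \
  'web-asg 2 us-east-1a us-east-1b' \
  'batch-asg 1 us-east-1a us-east-1b' \
  'api-asg 2 us-east-1a' | check_asg_spread
# prints:
# web-asg ok
# batch-asg at-risk
# api-asg at-risk
```

Both conditions are needed: two AZs with a minimum of 1 can still leave you with a single instance, and a minimum of 2 in one AZ fails together.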
Cost
Cloud spend grows quietly. The biggest leaks are orphaned resources (unattached volumes, idle load balancers) and oversized instances.
Checklist:
- Set an AWS Budget with email/SNS alerts so a runaway bill is caught in hours, not at month end.
- Right-size with Compute Optimizer; an idle m5.4xlarge can cost ~$560/month for nothing.
- Tag every resource (for example Owner, Environment, CostCenter) so spend is attributable.
- Delete orphans: unattached EBS volumes, old snapshots, unused Elastic IPs (an Elastic IP is a permanent public IP address, and an unattached one bills ~$3.60/month).
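A tagging policy is easier to enforce when it is checked mechanically. The sketch below reports which required tag keys each resource lacks. It assumes an inventory flattened into lines of "resource-id Key1,Key2,..."; that format, the function name missing_tags, and the resource IDs are hypothetical, and the default keys are the examples from the checklist.

```shell
#!/bin/sh
# Hypothetical helper: list required tag keys missing from each resource.
# $1: comma-separated required keys (defaults to the checklist examples).
# Input lines (assumed format): <resource-id> [<Key1,Key2,...>]
missing_tags() {
  awk -v required="${1:-Owner,Environment,CostCenter}" '
    BEGIN { n = split(required, req, ",") }
    {
      split("", have)                   # reset the per-line set of present keys
      m = split($2, keys, ",")
      for (i = 1; i <= m; i++) have[keys[i]] = 1
      missing = ""
      for (i = 1; i <= n; i++)
        if (!(req[i] in have)) {
          sep = (missing == "") ? "" : ","
          missing = missing sep req[i]
        }
      if (missing != "") print $1, missing
    }
  '
}

printf '%s\n' \
  'vol-0a1b Owner,Environment' \
  'i-0c2d Owner,Environment,CostCenter' \
  'sg-0e3f' | missing_tags
# prints:
# vol-0a1b CostCenter
# sg-0e3f Owner,Environment,CostCenter
```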
Find unattached EBS volumes still costing money:
aws ec2 describe-volumes \
--filters Name=status,Values=available \
--query 'Volumes[].{ID:VolumeId,SizeGB:Size}'
Output:
[
{ "ID": "vol-0a1b2c3d4e5f", "SizeGB": 100 }
]
A 100 GB gp3 volume in available (unattached) state still costs roughly $8/month; delete it if nothing needs it.
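The $8 figure is simply size times a per-GB monthly rate, so the total for all orphaned volumes is a one-line sum. The sketch below assumes input lines of "volume-id size-in-GB" and a default rate of $0.08 per GB-month, which is implied by the figure above but varies by region and volume type; the function name orphan_volume_cost and the second volume are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: estimate monthly cost of unattached volumes.
# $1: price per GB-month in USD (assumed default 0.08; check your region's pricing).
# Input lines (assumed format): <volume-id> <size-gb>
orphan_volume_cost() {
  rate="${1:-0.08}"
  awk -v rate="$rate" '{ total += $2 * rate } END { printf "%.2f\n", total }'
}

printf '%s\n' \
  'vol-0a1b2c3d4e5f 100' \
  'vol-0f9e8d7c6b5a 50' | orphan_volume_cost
# prints: 12.00
```

Input in this shape could plausibly come from the describe-volumes command above with a --query of 'Volumes[].[VolumeId,Size]' and --output text.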
Operations
Operations is about making changes safe, repeatable, and observable.
Checklist:
- Manage infrastructure as code (IaC) with CloudFormation or Terraform, so changes are reviewed and reproducible.
- Enable CloudTrail (records API activity: management events by default, data events if you opt in) in all regions for an audit trail.
- Set up CloudWatch alarms and dashboards for the metrics that page you.
- Detect drift, that is, cases where the live infrastructure no longer matches your code.
A minimal IaC example that tags consistently:
Resources:
  AppBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: devcraftly-app-data
      PublicAccessBlockConfiguration:
        BlockPublicAcls: true
        BlockPublicPolicy: true
        IgnorePublicAcls: true
        RestrictPublicBuckets: true
      Tags:
        - { Key: Owner, Value: platform-team }
        - { Key: Environment, Value: prod }
Detect drift on a deployed CloudFormation stack:
aws cloudformation detect-stack-drift --stack-name app-prod
Output:
{
"StackDriftDetectionId": "a1b2c3d4-0000-1111-2222-333344445555"
}
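Detection runs asynchronously, so this call only returns an ID. Check the result by passing that ID to aws cloudformation describe-stack-drift-detection-status: a StackDriftStatus of DRIFTED means someone changed resources outside your code.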
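Once a stack reports drift, the next question is which resources changed. As a minimal sketch, the filter below keeps only resources whose status indicates a real difference. It assumes input lines of "logical-id drift-status", for example from aws cloudformation describe-stack-resource-drifts with a --query of 'StackResourceDrifts[].[LogicalResourceId,StackResourceDriftStatus]' and --output text; the function name drifted_resources and the sample resource names other than AppBucket are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: show only resources that actually drifted.
# Input lines (assumed format): <logical-id> <drift-status>
# IN_SYNC and NOT_CHECKED rows are dropped; MODIFIED and DELETED remain.
drifted_resources() {
  awk '$2 != "IN_SYNC" && $2 != "NOT_CHECKED" { print $1, $2 }'
}

printf '%s\n' \
  'AppBucket MODIFIED' \
  'AppRole IN_SYNC' \
  'AppQueue DELETED' | drifted_resources
# prints:
# AppBucket MODIFIED
# AppQueue DELETED
```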
Best Practices
- Re-audit the basics every quarter: drift, untagged resources, and permission creep always return.
- Enforce MFA everywhere and prefer short-lived roles over permanent access keys.
- Keep S3 Block Public Access on account-wide and encrypt data at rest and in transit by default.
- Run across at least two AZs with Auto Scaling and health checks for anything that matters.
- Set budgets and a tagging policy on day one so cost is visible and attributable.
- Manage everything as code, turn on CloudTrail, and alarm on the metrics that wake you up.
- Schedule restore tests: an untested backup is not a backup.