AWS Step Functions
Most real serverless applications are not a single function. They are a sequence of steps: validate an order, charge a card, reserve inventory, send a confirmation email. AWS Step Functions is a service that runs these multi-step workflows for you as a visual state machine (a diagram where each box is one step and arrows show what happens next). Instead of writing fragile code where one Lambda function calls the next, you describe the flow once, and Step Functions handles the sequencing, retries, error handling, and parallel branches for you.
What problem does it solve?
Imagine you build an order pipeline by having Lambda A invoke Lambda B, which invokes Lambda C. This is called chaining and it is a trap. If Lambda B fails halfway, you have no record of where the workflow stopped. There is no built-in retry. You cannot see, at a glance, which orders are stuck. You end up writing your own state-tracking code in a database, plus retry loops, plus timeout handling, in every function.
Gotcha: Do not orchestrate by chaining Lambdas that invoke each other directly. It is brittle, gives you zero visibility, and forces you to reinvent retries and error handling by hand. That is exactly the job Step Functions was built for.
Step Functions replaces all of that. You define the steps, and the service gives you a live visual graph of every running and finished execution, automatic retries with backoff, catch-and-handle for errors, branching (if/else), and parallel execution, all without writing plumbing code.
When to use this (and when not to)
| Situation | Use Step Functions? |
|---|---|
| A workflow with 3+ steps that must run in order | Yes |
| Steps that need retries, error branches, or human approval | Yes |
| You need to see/audit each execution’s progress | Yes |
| Long-running jobs (minutes to a year) | Yes (Standard) |
| A single Lambda doing one self-contained task | No — just use Lambda |
| Pure event fan-out with no ordering | No — use SNS/EventBridge |
| Simple queue-and-process | No — use SQS + Lambda |
How a state machine works
A state machine is written in ASL (Amazon States Language, a JSON format). Each state is one step. The most common types are:
- Task — do work (invoke a Lambda, call DynamoDB, run an ECS task, etc.).
- Choice — branch based on data (if/else).
- Parallel — run multiple branches at the same time.
- Map — loop over a list of items, processing each (optionally in parallel).
- Wait — pause for a set time or until a timestamp.
- Pass / Succeed / Fail — shape data or end the workflow.
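To make the "each state is one step" idea concrete, here is a minimal sketch of a walker that follows StartAt, Next, and End through an ASL-like definition. It is a teaching toy, not how the service is implemented: it supports only Task, Choice (with a single NumericGreaterThan rule on a top-level field), Succeed, and Fail, and the `handlers` mapping, the `IsLargeOrder` and `ManualReview` states, and the short resource names are hypothetical stand-ins for real Lambda ARNs.

```python
def run_state_machine(definition, data, handlers):
    """Walk an ASL-like definition; return (final_data, visited_state_names)."""
    name = definition["StartAt"]
    visited = []
    while True:
        state = definition["States"][name]
        visited.append(name)
        kind = state["Type"]
        if kind == "Task":
            # A real Task invokes Lambda or another service; here we call a local function.
            data = handlers[state["Resource"]](data)
        elif kind == "Choice":
            # Simplified: only NumericGreaterThan on a top-level "$.field" variable.
            for rule in state["Choices"]:
                field = rule["Variable"][2:]  # strip the leading "$."
                if data[field] > rule["NumericGreaterThan"]:
                    name = rule["Next"]
                    break
            else:
                name = state["Default"]
            continue
        elif kind == "Succeed":
            return data, visited
        elif kind == "Fail":
            raise RuntimeError(state.get("Error", "Failed"))
        else:
            raise ValueError(f"unsupported state type: {kind}")
        if state.get("End"):
            return data, visited
        name = state["Next"]


definition = {
    "StartAt": "ValidateOrder",
    "States": {
        "ValidateOrder": {"Type": "Task", "Resource": "validate", "Next": "IsLargeOrder"},
        "IsLargeOrder": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.amount", "NumericGreaterThan": 1000, "Next": "ManualReview"}
            ],
            "Default": "ChargeCard",
        },
        "ManualReview": {"Type": "Task", "Resource": "review", "End": True},
        "ChargeCard": {"Type": "Task", "Resource": "charge", "End": True},
    },
}

# Hypothetical local functions standing in for Lambda functions.
handlers = {
    "validate": lambda d: {**d, "valid": True},
    "review": lambda d: {**d, "status": "needs-review"},
    "charge": lambda d: {**d, "status": "charged"},
}

result, path = run_state_machine(
    definition, {"orderId": "o-0a1b2c3d", "amount": 49.99}, handlers
)
print(path)              # ['ValidateOrder', 'IsLargeOrder', 'ChargeCard']
print(result["status"])  # charged
```

The definition is plain data and the walker is generic, which is the point of the service: the flow lives in one declarative document rather than being scattered across the functions themselves.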
Here is a small workflow that validates an order, then charges it, with an automatic retry and an error catch:
{
  "Comment": "Order processing workflow",
  "StartAt": "ValidateOrder",
  "States": {
    "ValidateOrder": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:111122223333:function:validate-order",
      "Next": "ChargeCard"
    },
    "ChargeCard": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:111122223333:function:charge-card",
      "Retry": [
        {
          "ErrorEquals": ["States.TaskFailed"],
          "IntervalSeconds": 2,
          "MaxAttempts": 3,
          "BackoffRate": 2.0
        }
      ],
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "Next": "RefundAndNotify"
        }
      ],
      "End": true
    },
    "RefundAndNotify": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:111122223333:function:refund",
      "End": true
    }
  }
}
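Notice what you get for free: Retry re-runs the charge up to 3 more times after the first failure (MaxAttempts counts retries, so 4 attempts in total), waiting 2s, 4s, then 8s between attempts (exponential backoff), and Catch routes any error that the retries do not resolve to a refund step. You wrote no retry loop and no try/catch plumbing.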
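The backoff schedule follows directly from the three Retry fields: each wait is IntervalSeconds multiplied by BackoffRate once per previous retry. A small sketch of that arithmetic, with a hypothetical helper name, and ignoring the optional MaxDelaySeconds and JitterStrategy fields:

```python
def retry_waits(interval_seconds, max_attempts, backoff_rate):
    """Seconds waited before each retry: interval * backoff_rate ** retry_index."""
    return [interval_seconds * backoff_rate ** i for i in range(max_attempts)]


waits = retry_waits(interval_seconds=2, max_attempts=3, backoff_rate=2.0)
print(waits)       # [2.0, 4.0, 8.0]
print(sum(waits))  # 14.0 seconds of waiting in the worst case, excluding task run time
```

Summing the list is a quick way to sanity-check that the worst-case retry delay still fits inside any timeout you have set on the state or the execution.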
Standard vs Express workflows
Step Functions offers two workflow types, and choosing wrong can cost you money or fail your use case.
| | Standard | Express |
|---|---|---|
| Max duration | Up to 1 year | Up to 5 minutes |
| Billing model | Per state transition | Per request + duration/memory |
| Execution history | Full, visible in console for 90 days | Logged to CloudWatch (no built-in visual history) |
| Execution semantics | Exactly-once | At-least-once (asynchronous) or at-most-once (synchronous) |
| Best for | Long, low-to-medium volume, auditable workflows | High-volume, short-lived flows (e.g. streaming, IoT ingest) |
Cost gotcha: Standard bills roughly $25 per million state transitions. A workflow with 6 states run 10 million times a month = 60 million transitions ≈ $1,500/month. Express bills per request plus duration and memory instead of per transition, so the same volume of short runs (e.g. 100 ms at 64 MB each, a tiny fraction of a cent per run) is dramatically cheaper. Pick Express for high-volume, short flows; pick Standard when you need long durations and the full audit trail.
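The comparison can be worked through as a rough estimate. The Standard rate is the figure quoted above; the two Express rates are assumptions for illustration only (they vary by region and over time, so check the current pricing page), and the sketch ignores free-tier allowances and billing round-up rules.

```python
STANDARD_USD_PER_MILLION_TRANSITIONS = 25.0  # figure quoted in the text
EXPRESS_USD_PER_MILLION_REQUESTS = 1.00      # assumed rate, verify before relying on it
EXPRESS_USD_PER_GB_SECOND = 0.00001667       # assumed rate, verify before relying on it


def standard_monthly_cost(states_per_run, runs_per_month):
    """Standard: pay per state transition."""
    transitions = states_per_run * runs_per_month
    return transitions / 1_000_000 * STANDARD_USD_PER_MILLION_TRANSITIONS


def express_monthly_cost(runs_per_month, duration_seconds, memory_mb):
    """Express: pay per request plus GB-seconds of duration."""
    request_cost = runs_per_month / 1_000_000 * EXPRESS_USD_PER_MILLION_REQUESTS
    gb_seconds = runs_per_month * duration_seconds * (memory_mb / 1024)
    return request_cost + gb_seconds * EXPRESS_USD_PER_GB_SECOND


print(standard_monthly_cost(6, 10_000_000))                 # 1500.0
print(round(express_monthly_cost(10_000_000, 0.1, 64), 2))  # 11.04
```

Under these assumed rates the gap is roughly two orders of magnitude, and it widens as the number of states per run grows, because Express cost does not depend on the transition count.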
Creating a state machine
Console steps
- Open the Step Functions console and choose State machines → Create state machine.
- Choose Create from blank, then pick Standard or Express.
- Use Workflow Studio (the drag-and-drop designer) to add Task states and connect them, or paste your ASL JSON in the Code tab.
- Click Config, give it a name (e.g. order-processing), and let it Create a new role (this auto-grants permission to invoke your Lambdas).
- Click Create.
AWS CLI
aws stepfunctions create-state-machine \
  --name order-processing \
  --definition file://order-workflow.json \
  --role-arn arn:aws:iam::111122223333:role/StepFunctionsExecRole \
  --type STANDARD
Output:
{
  "stateMachineArn": "arn:aws:states:us-east-1:111122223333:stateMachine:order-processing",
  "creationDate": "2026-06-15T10:42:00.123000-04:00"
}
Start an execution. To check its result afterwards, pass the returned executionArn to aws stepfunctions describe-execution --execution-arn.
aws stepfunctions start-execution \
  --state-machine-arn arn:aws:states:us-east-1:111122223333:stateMachine:order-processing \
  --input '{"orderId":"o-0a1b2c3d","amount":49.99}'
Output:
{
  "executionArn": "arn:aws:states:us-east-1:111122223333:execution:order-processing:run-7f3a",
  "startDate": "2026-06-15T10:43:11.500000-04:00"
}
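Because start-execution returns immediately, checking the result of a Standard workflow usually means polling until the execution reaches a terminal status. A minimal sketch of such a loop follows. The `wait_for_execution` helper and its parameters are hypothetical; it takes the describe call as an argument so it runs offline here with a stand-in, and the commented lines show how it might be wired to boto3's `describe_execution`, assuming boto3 is installed and credentials are configured.

```python
import time

TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "TIMED_OUT", "ABORTED"}


def wait_for_execution(describe, execution_arn, poll_seconds=2.0,
                       max_polls=30, sleep=time.sleep):
    """Poll until the execution reaches a terminal status; return its description."""
    for _ in range(max_polls):
        description = describe(executionArn=execution_arn)
        if description["status"] in TERMINAL_STATUSES:
            return description
        sleep(poll_seconds)
    raise TimeoutError(f"{execution_arn} still running after {max_polls} polls")


# Against the real service, usage would look roughly like:
#   import boto3
#   sfn = boto3.client("stepfunctions")
#   final = wait_for_execution(sfn.describe_execution, execution_arn)

# Stand-in for the real API: reports RUNNING twice, then SUCCEEDED.
responses = iter([
    {"status": "RUNNING"},
    {"status": "RUNNING"},
    {"status": "SUCCEEDED", "output": '{"charged": true}'},
])


def fake_describe(executionArn):
    return next(responses)


final = wait_for_execution(
    fake_describe,
    "arn:aws:states:us-east-1:111122223333:execution:order-processing:run-7f3a",
    sleep=lambda seconds: None,  # skip real waiting in this offline demo
)
print(final["status"])  # SUCCEEDED
```

Capping the number of polls keeps a stuck execution from hanging the caller indefinitely; for long-running workflows an event-driven notification is usually a better fit than polling.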
Infrastructure as Code
Defining the state machine in CloudFormation keeps it version-controlled and repeatable:
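Resources:
  OrderWorkflow:
    Type: AWS::StepFunctions::StateMachine
    Properties:
      StateMachineName: order-processing
      StateMachineType: STANDARD
      RoleArn: arn:aws:iam::111122223333:role/StepFunctionsExecRole
      # DefinitionUri exists only on the SAM resource AWS::Serverless::StateMachine.
      # Plain CloudFormation reads the definition from S3 (or inline via DefinitionString).
      DefinitionS3Location:
        Bucket: my-artifact-bucket  # placeholder bucket name
        Key: order-workflow.json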
Best Practices
- Use Step Functions for any workflow of 3 or more ordered steps instead of chaining Lambdas that call each other.
- Add Retry blocks to Task states for transient failures (throttling, timeouts) and Catch blocks to route unrecoverable errors to a cleanup or alert step.
- Choose Express for high-volume, short (under 5-minute) flows to save money; choose Standard for long-running or audit-critical workflows.
- Keep individual Lambda steps small and single-purpose so each shows up clearly in the visual execution graph.
- Use the Map state for batch processing instead of looping inside one giant Lambda — it parallelizes and stays visible.
- Pass only the data each step needs using ASL’s input/output filters (InputPath, ResultPath) to keep payloads small (state data has a 256 KB limit).
- Enable CloudWatch logging for Express workflows, since they have no built-in execution history in the console.
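The input/output filters mentioned above can be illustrated with a simplified model: InputPath selects the part of the state input the task sees, and ResultPath decides where the task's result is placed in the original input. The sketch below handles only plain dotted paths such as `$.payment` (real ASL paths are richer), and the helper names and sample fields are hypothetical.

```python
import copy


def get_path(data, path):
    """Resolve a simple '$.a.b' path; '$' returns the whole document."""
    for key in [k for k in path.split(".")[1:] if k]:
        data = data[key]
    return data


def set_path(data, path, value):
    """Return a copy of data with value stored at a simple '$.a.b' path."""
    keys = [k for k in path.split(".")[1:] if k]
    if not keys:
        return value  # ResultPath "$" replaces the whole input with the result
    out = copy.deepcopy(data)
    node = out
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value
    return out


def run_task(state_input, task, input_path="$", result_path="$"):
    task_input = get_path(state_input, input_path)     # InputPath: what the task sees
    result = task(task_input)
    return set_path(state_input, result_path, result)  # ResultPath: where the result lands


state_input = {
    "orderId": "o-0a1b2c3d",
    "payment": {"amount": 49.99, "currency": "USD"},
}
charge = lambda payment: {"charged": payment["amount"]}  # sees only the payment block

output = run_task(state_input, charge, input_path="$.payment", result_path="$.receipt")
print(output["receipt"])  # {'charged': 49.99}
print(output["orderId"])  # o-0a1b2c3d
```

The task receives only the payment block, while the rest of the order survives untouched alongside the new receipt, which is how narrow filters keep each step's payload small.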