AWS X-Ray (Distributed Tracing)

When a single user request bounces through an API Gateway, a Lambda function, a database, and two more microservices, a plain log line that says “request was slow” tells you almost nothing. AWS X-Ray is a distributed tracing service: it follows one request end-to-end across all those services and shows you exactly where the time went and where the error happened. Instead of guessing, you see a visual map of your app and the latency of every hop. This is the difference between “the checkout is slow” and “the checkout is slow because the payment service waits 800 ms on a downstream call.”

What X-Ray actually records

X-Ray builds its picture from a few simple pieces. Knowing the vocabulary makes the console much easier to read.

Trace — the complete journey of one request through your system, identified by a trace ID (a unique string like 1-67e1a2b3-4c5d6e7f8a9b0c1d2e3f4a5b).
Segment — the work done by one service for that request (for example, “the order Lambda ran for 120 ms”).
Subsegment — a slice of work inside a segment, like a single call to DynamoDB or an outbound HTTP request. Subsegments are what let you see “the DB call took 90 of those 120 ms.”
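Service map — an auto-generated diagram of your services as circles (nodes) with arrows (edges) showing how requests flow. Each node is color-coded by outcome: green for successful calls, red for server faults (5xx), yellow for client errors (4xx), and purple for throttling (429). Average latency is shown on each node, so slow and failing services stand out visually.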
Annotations and metadata — extra key-value data you attach to a segment (e.g. customer_tier = premium) so you can filter traces later.
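
To make the trace ID format concrete, here is a minimal Python sketch that splits an ID into its parts. It assumes the layout AWS documents for X-Ray trace IDs: a version number, then 8 hex digits holding the request's start time in epoch seconds, then a 24-hex-digit unique part. The names TraceId and parse_trace_id are illustrative and not part of any SDK.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class TraceId(NamedTuple):
    version: int
    started_at: datetime  # when the original request began (UTC)
    unique_part: str      # 96-bit identifier written as 24 hex digits


def parse_trace_id(trace_id: str) -> TraceId:
    """Split a trace ID of the assumed form 1-<8 hex>-<24 hex>."""
    # Unpacking raises ValueError if there are not exactly three parts.
    version, epoch_hex, unique_part = trace_id.split("-")
    if len(epoch_hex) != 8 or len(unique_part) != 24:
        raise ValueError(f"unexpected trace ID layout: {trace_id!r}")
    int(unique_part, 16)  # raises ValueError if not hexadecimal
    started_at = datetime.fromtimestamp(int(epoch_hex, 16), tz=timezone.utc)
    return TraceId(int(version), started_at, unique_part)


if __name__ == "__main__":
    print(parse_trace_id("1-67e1a2b3-4c5d6e7f8a9b0c1d2e3f4a5b"))
```

Because the middle field is a timestamp, a trace ID copied from a log line already tells you roughly when the request started, which helps when choosing a time window to search.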

Gotcha: X-Ray only knows about services you have instrumented. If one service in the chain has neither the X-Ray SDK nor tracing enabled, it shows up as a gap or an “unknown” node in the map, and you lose visibility into that hop. A trace is only as complete as its least-instrumented hop, so instrument every service on the critical path.

When to use X-Ray (and when not to)

Use X-Ray when you have more than one service handling a request (microservices, or serverless apps built from Lambda + API Gateway + DynamoDB), and you need to answer “which component is slow or failing?” It shines for latency hunting and for understanding dependencies you forgot you had.

Reach for something else when you only have a single monolithic app on one server; in that case CloudWatch Logs and metrics are simpler and cheaper. X-Ray is also not a log aggregator, so use CloudWatch Logs for full-text logs. Nor does it do real-time alerting; pair it with CloudWatch alarms for that.

How tracing gets into your app

There are two moving parts: an SDK (or the newer OpenTelemetry libraries) in your code that creates the trace data, and the X-Ray daemon — a small process that collects that data and ships it to AWS. On Lambda you do not run the daemon yourself; AWS runs it for you. On Amazon ECS (Elastic Container Service, AWS’s container runner) or EC2 you run the daemon as a sidecar container or a background process.

Lambda

Open the Lambda console and select your function.
Go to Configuration → Monitoring and operations tools.
Click Edit, toggle Active tracing on, and Save.

The CLI equivalent:

aws lambda update-function-configuration \
  --function-name order-processor \
  --tracing-config Mode=Active

Output:

{
    "FunctionName": "order-processor",
    "TracingConfig": {
        "Mode": "Active"
    },
    "LastModified": "2026-06-15T10:22:03.000+0000"
}

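That captures the Lambda invocation automatically. To see inside the function (your DynamoDB and HTTP calls as subsegments), patch the libraries your code uses with the X-Ray SDK. The aws-xray-sdk package is not bundled with the Lambda Python runtime, so ship it in your deployment package or a layer: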

from aws_xray_sdk.core import patch_all, xray_recorder

patch_all()  # auto-instruments supported libraries such as boto3 and requests

def handler(event, context):
    # Custom subsegment: appears as its own bar in the trace timeline.
    with xray_recorder.in_subsegment('validate_order') as sub:
        # Annotations are indexed, so this trace can be found by order_id later.
        sub.put_annotation('order_id', event['orderId'])
        # validation logic here
        return {"status": "ok"}

ECS / EC2

Run the daemon as a sidecar. A minimal task definition snippet:

{
  "name": "xray-daemon",
  "image": "amazon/aws-xray-daemon",
  "cpu": 32,
  "memoryReservation": 256,
  "portMappings": [{ "containerPort": 2000, "protocol": "udp" }]
}

Your app container sends segment data to the daemon over UDP port 2000; the daemon batches and forwards it to the X-Ray API. Make sure the task’s IAM role includes the AWSXRayDaemonWriteAccess managed policy, or the daemon cannot upload traces.
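
To show what travels over that UDP port, here is a minimal Python sketch that builds a segment document by hand and frames it the way the daemon expects. The SDK normally does this for you. It assumes the documented daemon protocol as I understand it: a one-line JSON header followed by the segment document. The helper names (new_id, build_segment, build_subsegment, to_datagram, send) are hypothetical.

```python
import json
import os
import socket

DAEMON_ADDRESS = ("127.0.0.1", 2000)  # assumed default daemon UDP endpoint
HEADER = '{"format": "json", "version": 1}\n'  # assumed daemon header line


def new_id(num_bytes: int) -> str:
    """Random identifier as lowercase hex (2 hex digits per byte)."""
    return os.urandom(num_bytes).hex()


def build_subsegment(name: str, start: float, end: float) -> dict:
    return {"id": new_id(8), "name": name, "start_time": start, "end_time": end}


def build_segment(name, start, end, annotations=None, subsegments=None) -> dict:
    """One service's work for one request, as a segment document."""
    return {
        "name": name,
        "id": new_id(8),  # 16 hex digits
        "trace_id": f"1-{int(start):08x}-{new_id(12)}",
        "start_time": start,  # epoch seconds, fractional allowed
        "end_time": end,
        "annotations": annotations or {},
        "subsegments": subsegments or [],
    }


def to_datagram(segment: dict) -> bytes:
    """Header line plus JSON body, ready to send as one UDP packet."""
    return (HEADER + json.dumps(segment)).encode("utf-8")


def send(segment: dict, address=DAEMON_ADDRESS) -> None:
    # UDP is fire-and-forget: no error is raised if no daemon is listening.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(to_datagram(segment), address)


if __name__ == "__main__":
    import time

    start = time.time()
    db_call = build_subsegment("DynamoDB", start + 0.010, start + 0.100)
    segment = build_segment(
        "order-processor", start, start + 0.120,
        annotations={"customer_tier": "premium"},
        subsegments=[db_call],
    )
    print(to_datagram(segment).decode("utf-8"))
    # send(segment)  # uncomment when a daemon is running locally
```

The fire-and-forget design keeps tracing from slowing down or breaking the application, but it also explains why missing permissions or a missing daemon produce no visible error in the app itself.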

Sampling

By default X-Ray does not trace every request. It uses sampling rules: the default rule records the first request each second plus 5% of the rest. This keeps cost and overhead low, but it has a real consequence: that one-per-second guarantee is shared by everything the rule matches, so a rare error on a quiet endpoint of an otherwise busy application may never be sampled and will simply not appear in your traces.

A sampling rule controls this. Key fields:

Field	Meaning	Example
`ReservoirSize`	Guaranteed traces per second before percentage applies	`1`
`FixedRate`	Fraction of remaining requests to sample (0.0–1.0)	`0.05` (5%)
`Priority`	Lower number wins when rules overlap	`1000`
`ServiceName` / `URLPath`	Which requests the rule matches	`payment-api`, `/checkout`
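
The interaction between ReservoirSize and FixedRate is easier to see in code. The following is a simplified Python sketch of the decision for a single rule on a single process; the real service coordinates quotas across instances, which this ignores. SamplingSketch and expected_traces_per_second are illustrative names, not SDK classes.

```python
import random


class SamplingSketch:
    """Reservoir first, then a fixed rate, for one rule (illustrative only)."""

    def __init__(self, reservoir_size: int, fixed_rate: float, rng=None):
        self.reservoir_size = reservoir_size
        self.fixed_rate = fixed_rate
        self._rng = rng or random.Random()
        self._current_second = None
        self._used_this_second = 0

    def should_sample(self, now: float) -> bool:
        second = int(now)
        if second != self._current_second:
            # A new wall-clock second refills the reservoir.
            self._current_second = second
            self._used_this_second = 0
        if self._used_this_second < self.reservoir_size:
            self._used_this_second += 1
            return True  # guaranteed trace from the reservoir
        # Reservoir exhausted: sample the remainder at the fixed rate.
        return self._rng.random() < self.fixed_rate


def expected_traces_per_second(requests_per_second, reservoir_size, fixed_rate):
    """Average traces recorded per second under steady traffic."""
    guaranteed = min(requests_per_second, reservoir_size)
    return guaranteed + (requests_per_second - guaranteed) * fixed_rate


if __name__ == "__main__":
    sampler = SamplingSketch(reservoir_size=1, fixed_rate=0.05)
    decisions = [sampler.should_sample(100.0 + i / 1000) for i in range(1000)]
    print(sum(decisions), "of 1000 requests in one second were sampled")
    print(expected_traces_per_second(1000, 1, 0.05), "expected on average")
```

With 1,000 requests in a second, only about 51 are recorded on average, so a failure that occurs once in a few thousand requests has a good chance of never being traced.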

For a critical but low-traffic path like /checkout, raise the rate to 100% so you never miss an error:

aws xray create-sampling-rule --sampling-rule '{
  "RuleName": "checkout-full",
  "Priority": 100,
  "FixedRate": 1.0,
  "ReservoirSize": 5,
  "ServiceName": "*",
  "ServiceType": "*",
  "Host": "*",
  "HTTPMethod": "*",
  "URLPath": "/checkout*",
  "ResourceARN": "*",
  "Version": 1
}'

Output:

{
    "SamplingRuleRecord": {
        "SamplingRule": {
            "RuleName": "checkout-full",
            "Priority": 100,
            "FixedRate": 1.0,
            "ReservoirSize": 5,
            "URLPath": "/checkout*"
        },
        "CreatedAt": "2026-06-15T10:31:00+00:00"
    }
}
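
To illustrate why the rule above takes precedence over the default for checkout traffic, here is a small Python sketch of rule selection: among the rules whose patterns match a request, the one with the lowest Priority number is used. The Rule class and pick_rule function are hypothetical, the default rule's priority of 10000 is an assumption, and fnmatchcase is used as a stand-in for X-Ray's wildcard matching (it also accepts bracket patterns, which X-Ray does not).

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Iterable, Optional


@dataclass(frozen=True)
class Rule:
    name: str
    priority: int        # lower number wins
    fixed_rate: float
    service_name: str = "*"
    url_path: str = "*"


def pick_rule(rules: Iterable[Rule], service_name: str, url_path: str) -> Optional[Rule]:
    """Return the matching rule with the lowest priority number, if any."""
    matches = [
        rule for rule in rules
        if fnmatchcase(service_name, rule.service_name)
        and fnmatchcase(url_path, rule.url_path)
    ]
    return min(matches, key=lambda rule: rule.priority) if matches else None


if __name__ == "__main__":
    rules = [
        Rule("Default", priority=10000, fixed_rate=0.05),  # assumed default priority
        Rule("checkout-full", priority=100, fixed_rate=1.0, url_path="/checkout*"),
    ]
    for path in ("/checkout/confirm", "/health"):
        chosen = pick_rule(rules, "payment-api", path)
        print(path, "->", chosen.name, chosen.fixed_rate)
```

Both rules match /checkout/confirm, but checkout-full has the smaller priority number, so that path is traced at 100% while everything else falls through to the default.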

Cost tip: X-Ray charges per trace recorded (about $5.00 per 1 million traces recorded in 2026) plus a smaller fee per trace retrieved/scanned. Setting FixedRate to 1.0 everywhere on a high-traffic service can get expensive fast — sample at 100% only on the narrow, low-volume paths you truly care about.
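
A rough back-of-the-envelope calculation shows how quickly the difference adds up. This Python sketch uses the approximate recording price quoted above as an assumption, ignores any free tier and the retrieval/scan fee, and assumes steady traffic over a 30-day month. The function name is illustrative.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60
PRICE_PER_MILLION_RECORDED = 5.00  # USD; assumption taken from the text, verify current pricing


def monthly_recording_cost(
    requests_per_second: float,
    reservoir_size: float,
    fixed_rate: float,
    price_per_million: float = PRICE_PER_MILLION_RECORDED,
) -> float:
    """Estimated monthly cost in USD of traces recorded under one sampling rule."""
    guaranteed = min(requests_per_second, reservoir_size)
    sampled_per_second = guaranteed + (requests_per_second - guaranteed) * fixed_rate
    traces_per_month = sampled_per_second * SECONDS_PER_MONTH
    return traces_per_month / 1_000_000 * price_per_million


if __name__ == "__main__":
    default = monthly_recording_cost(1000, reservoir_size=1, fixed_rate=0.05)
    everything = monthly_recording_cost(1000, reservoir_size=1, fixed_rate=1.0)
    print(f"default sampling: ${default:,.2f} per month")
    print(f"100% sampling:    ${everything:,.2f} per month")
```

Under these assumptions, a service handling 1,000 requests per second costs roughly $660 a month to trace with default sampling and roughly $12,960 a month at 100%, about a twentyfold difference.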

Reading the service map

Open the CloudWatch console (X-Ray now lives under CloudWatch).
In the left nav choose Application Signals → Traces (or X-Ray traces).
Click Service map. Look for nodes that are red (server faults) or yellow (client errors), and check the average latency shown on each node for slow services.
Click a node, then View traces, and open an individual trace to see its segment/subsegment timeline.

To pull traces from the CLI for a time window (times are Unix epoch seconds; this window is 2026-06-15 10:00 to 11:00 UTC):

aws xray get-trace-summaries \
  --start-time 1781517600 \
  --end-time 1781521200 \
  --filter-expression 'service("payment-api") AND error'
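
The --start-time and --end-time values are Unix epoch seconds, which are awkward to work out by hand. A small Python sketch for computing them from a UTC timestamp follows; epoch_window and cli_time_args are hypothetical helper names.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Tuple


def epoch_window(start: datetime, hours: float = 1.0) -> Tuple[int, int]:
    """Return (start, end) as epoch seconds for a window beginning at start."""
    if start.tzinfo is None:
        # Naive datetimes would be read as local time and shift the window.
        raise ValueError("start must be timezone-aware")
    end = start + timedelta(hours=hours)
    return int(start.timestamp()), int(end.timestamp())


def cli_time_args(start: datetime, hours: float = 1.0) -> List[str]:
    """Format the window as arguments for aws xray get-trace-summaries."""
    begin, end = epoch_window(start, hours)
    return ["--start-time", str(begin), "--end-time", str(end)]


if __name__ == "__main__":
    one_hour_ago = datetime.now(timezone.utc) - timedelta(hours=1)
    print(" ".join(cli_time_args(one_hour_ago)))
```

Requiring a timezone-aware value avoids the common mistake of querying a window that is off by the machine's local UTC offset.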

Best practices

Instrument every service on the critical request path — one un-instrumented hop becomes a blind gap in the map.
Add annotations (indexed, filterable) for things you will search by, like order_id or tenant; use metadata for bulk debug detail you do not filter on.
Keep the default low sampling for chatty, high-volume endpoints, but create high-priority 100% rules for rare, business-critical paths.
Always attach AWSXRayDaemonWriteAccess (or equivalent permissions) to the task/instance role, or traces silently never arrive.
Consider AWS Distro for OpenTelemetry (ADOT) for new projects so your instrumentation is portable beyond X-Ray.
Pair X-Ray with CloudWatch alarms: X-Ray tells you why it is slow, alarms tell you when to look.