Amazon Redshift (Data Warehouse)

Amazon Redshift is a fully managed data warehouse — a database built for answering big analytical questions over huge piles of data, like “what were total sales by region for every month in the last three years?”. It is columnar (it stores each column together on disk) and MPP (Massively Parallel Processing — it splits one query across many machines that work at the same time). This makes Redshift fast at scanning billions of rows, but it is the wrong tool for the fast, single-row reads and writes that a normal app database does all day.

OLAP vs OLTP — why Redshift is different

The first thing to understand is that there are two very different jobs a database can do.

OLTP (Online Transaction Processing) is what powers an app: “insert this order”, “fetch user 4821”, “update this balance”. Lots of tiny operations, many at once, each touching a few rows. RDS, Aurora, and DynamoDB are built for this.
OLAP (Online Analytical Processing) is reporting and analytics: scanning millions or billions of rows to compute sums, averages, and trends. This is Redshift’s job.

Redshift is columnar because analytics usually reads a few columns out of very wide tables. If you only need region and sale_amount, a columnar engine reads just those two columns off disk and ignores the other 40. A row-based database (like MySQL) would have to read every full row.

Gotcha: Redshift is for analytics, NOT high-concurrency transactional workloads. If you point a busy web app at Redshift and hammer it with single-row inserts and lookups, it will perform terribly. Single-row INSERT statements are especially slow — Redshift wants data loaded in big batches. Use RDS, Aurora, or DynamoDB for your app, and Redshift for the reports.

When to use Redshift (and when not to)

Workload	Use this
App backend, orders, user logins, carts	RDS / Aurora (SQL) or DynamoDB (NoSQL)
Dashboards, BI reports, joins over billions of rows	Redshift
Key-value lookups at huge scale	DynamoDB
Cache in front of a database	ElastiCache
Ad-hoc queries directly on files in S3, no servers	Redshift Spectrum or Athena

Redshift Serverless vs provisioned clusters

There are two ways to run Redshift, and for most new projects Serverless is the easier starting point.

	Provisioned cluster	Redshift Serverless
What you manage	You pick node type and node count	Nothing — AWS scales it for you
Billing	Per node, per hour, always on	Per second of compute used (RPUs)
Best for	Steady, predictable heavy load	Spiky or unpredictable analytics
Idle cost	You pay even when idle	Scales down; you pay storage only

A node is one compute machine in the cluster. An RPU (Redshift Processing Unit) is the unit of compute Serverless bills you for.

Cost note: A provisioned cluster runs 24/7, so a small ra3.xlplus cluster costs roughly a few hundred dollars a month even if nobody queries it. Serverless charges only for compute you actually use (with a base capacity you set in RPUs), so it is usually cheaper for bursty workloads. Storage in RA3/Serverless is billed separately at managed-storage rates.

Create a Serverless workgroup — Console

Open the Amazon Redshift console and choose Redshift Serverless.
Choose Create workgroup. A workgroup is the compute; a namespace is the storage and database.
Name the workgroup (for example analytics-wg), set Base capacity in RPUs (8 is the smallest), and pick your VPC and subnets.
Create or select a namespace (for example analytics-ns), set the admin username and password.
Choose Create, wait a few minutes, then connect with the Query editor v2.

Create a Serverless workgroup — CLI

aws redshift-serverless create-namespace \
  --namespace-name analytics-ns \
  --admin-username adminuser \
  --admin-user-password 'ChangeMe123!'

aws redshift-serverless create-workgroup \
  --workgroup-name analytics-wg \
  --namespace-name analytics-ns \
  --base-capacity 8 \
  --subnet-ids subnet-0a1b2c3d subnet-0e4f5a6b \
  --security-group-ids sg-0a1b2c3d

Output:

{
    "workgroup": {
        "workgroupName": "analytics-wg",
        "namespaceName": "analytics-ns",
        "baseCapacity": 8,
        "status": "CREATING",
        "endpoint": {
            "address": "analytics-wg.123456789012.us-east-1.redshift-serverless.amazonaws.com",
            "port": 5439
        }
    }
}

Distribution keys and sort keys — model them first

This is the part beginners skip and then wonder why their queries are slow. Because Redshift spreads data across many nodes, how the rows are spread matters enormously for join performance. You control this with two settings on each table.

Distribution key (DISTKEY): the column Redshift uses to decide which node a row lives on. When you join two big tables on the same key and both use that key as their DISTKEY, the matching rows are already on the same node — no data has to be shuffled over the network. This is the single biggest speed lever for big joins.
Sort key (SORTKEY): the column the data is physically ordered by on disk. If you mostly filter by date, sort by date — Redshift can then skip whole blocks that fall outside your WHERE range.

There are three distribution styles:

Style	What it does	When to use
`KEY`	Rows spread by a chosen column’s value	Big tables you join on that column
`ALL`	Full copy of the table on every node	Small dimension tables (lookups)
`EVEN`	Rows spread round-robin	When no clear join key exists

CREATE TABLE sales (
    sale_id      BIGINT,
    customer_id  BIGINT,
    sale_date    DATE,
    region       VARCHAR(20),
    amount       DECIMAL(12,2)
)
DISTKEY (customer_id)   -- co-locate with the customers table
SORTKEY (sale_date);    -- fast date-range filters

CREATE TABLE customers (
    customer_id  BIGINT,
    name         VARCHAR(100)
)
DISTKEY (customer_id);  -- same key, so joins stay on-node

Tip: Decide DISTKEY and SORTKEY based on your biggest, most frequent join and filter — not as an afterthought. Changing them later means recreating and reloading the table.

Loading data the right way

Do not load with single-row inserts. Use the COPY command to pull data in parallel from S3 (Simple Storage Service — AWS object storage).

COPY sales
FROM 's3://my-data-bucket/sales/2026/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftLoadRole'
FORMAT AS PARQUET;

This reads many files at once, one per slice, which is the fast path.

Redshift Spectrum — query S3 without loading

Spectrum lets you run SQL directly against files sitting in S3, without first loading them into Redshift. You define an external table that points at the S3 path, then join it to your normal Redshift tables in one query.

CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'logs_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole';

SELECT s.region, COUNT(*)
FROM spectrum_schema.raw_clicks c
JOIN sales s ON c.customer_id = s.customer_id
GROUP BY s.region;

Use Spectrum when you have huge cold data in S3 (raw logs, history) that you query occasionally and do not want to pay to store inside the warehouse.

Best Practices

Keep Redshift for analytics and reporting; never route transactional app traffic to it.
Choose DISTKEY and SORTKEY around your biggest joins and date filters before you load data.
Use DISTSTYLE ALL for small lookup/dimension tables so joins never shuffle them.
Load data in bulk with COPY from S3 in many files; avoid single-row INSERTs.
Prefer Serverless for spiky workloads so you do not pay for idle compute.
Keep cold, rarely queried data in S3 and reach it with Spectrum instead of storing it in the warehouse.
Run VACUUM and ANALYZE (or rely on auto-maintenance) so the planner has fresh stats and reclaims space after big loads.