Amazon Redshift (Data Warehouse)
Amazon Redshift is a fully managed data warehouse — a database built for answering big analytical questions over huge piles of data, like “what were total sales by region for every month in the last three years?”. It is columnar (it stores each column together on disk) and MPP (Massively Parallel Processing — it splits one query across many machines that work at the same time). This makes Redshift fast at scanning billions of rows, but it is the wrong tool for the fast, single-row reads and writes that a normal app database does all day.
OLAP vs OLTP — why Redshift is different
The first thing to understand is that there are two very different jobs a database can do.
- OLTP (Online Transaction Processing) is what powers an app: “insert this order”, “fetch user 4821”, “update this balance”. Lots of tiny operations, many at once, each touching a few rows. RDS, Aurora, and DynamoDB are built for this.
- OLAP (Online Analytical Processing) is reporting and analytics: scanning millions or billions of rows to compute sums, averages, and trends. This is Redshift’s job.
Redshift is columnar because analytics usually reads a few columns out of very wide tables. If you only need region and sale_amount, a columnar engine reads just those two columns off disk and ignores the other 40. A row-based database (like MySQL) would have to read every full row.
Gotcha: Redshift is for analytics, NOT high-concurrency transactional workloads. If you point a busy web app at Redshift and hammer it with single-row inserts and lookups, it will perform terribly. Single-row
INSERTstatements are especially slow — Redshift wants data loaded in big batches. Use RDS, Aurora, or DynamoDB for your app, and Redshift for the reports.
When to use Redshift (and when not to)
| Workload | Use this |
|---|---|
| App backend, orders, user logins, carts | RDS / Aurora (SQL) or DynamoDB (NoSQL) |
| Dashboards, BI reports, joins over billions of rows | Redshift |
| Key-value lookups at huge scale | DynamoDB |
| Cache in front of a database | ElastiCache |
| Ad-hoc queries directly on files in S3, no servers | Redshift Spectrum or Athena |
Redshift Serverless vs provisioned clusters
There are two ways to run Redshift, and for most new projects Serverless is the easier starting point.
| Provisioned cluster | Redshift Serverless | |
|---|---|---|
| What you manage | You pick node type and node count | Nothing — AWS scales it for you |
| Billing | Per node, per hour, always on | Per second of compute used (RPUs) |
| Best for | Steady, predictable heavy load | Spiky or unpredictable analytics |
| Idle cost | You pay even when idle | Scales down; you pay storage only |
A node is one compute machine in the cluster. An RPU (Redshift Processing Unit) is the unit of compute Serverless bills you for.
Cost note: A provisioned cluster runs 24/7, so a small
ra3.xlpluscluster costs roughly a few hundred dollars a month even if nobody queries it. Serverless charges only for compute you actually use (with a base capacity you set in RPUs), so it is usually cheaper for bursty workloads. Storage in RA3/Serverless is billed separately at managed-storage rates.
Create a Serverless workgroup — Console
- Open the Amazon Redshift console and choose Redshift Serverless.
- Choose Create workgroup. A workgroup is the compute; a namespace is the storage and database.
- Name the workgroup (for example
analytics-wg), set Base capacity in RPUs (8 is the smallest), and pick your VPC and subnets. - Create or select a namespace (for example
analytics-ns), set the admin username and password. - Choose Create, wait a few minutes, then connect with the Query editor v2.
Create a Serverless workgroup — CLI
aws redshift-serverless create-namespace \
--namespace-name analytics-ns \
--admin-username adminuser \
--admin-user-password 'ChangeMe123!'
aws redshift-serverless create-workgroup \
--workgroup-name analytics-wg \
--namespace-name analytics-ns \
--base-capacity 8 \
--subnet-ids subnet-0a1b2c3d subnet-0e4f5a6b \
--security-group-ids sg-0a1b2c3d
Output:
{
"workgroup": {
"workgroupName": "analytics-wg",
"namespaceName": "analytics-ns",
"baseCapacity": 8,
"status": "CREATING",
"endpoint": {
"address": "analytics-wg.123456789012.us-east-1.redshift-serverless.amazonaws.com",
"port": 5439
}
}
}
Distribution keys and sort keys — model them first
This is the part beginners skip and then wonder why their queries are slow. Because Redshift spreads data across many nodes, how the rows are spread matters enormously for join performance. You control this with two settings on each table.
- Distribution key (DISTKEY): the column Redshift uses to decide which node a row lives on. When you join two big tables on the same key and both use that key as their DISTKEY, the matching rows are already on the same node — no data has to be shuffled over the network. This is the single biggest speed lever for big joins.
- Sort key (SORTKEY): the column the data is physically ordered by on disk. If you mostly filter by date, sort by date — Redshift can then skip whole blocks that fall outside your
WHERErange.
There are three distribution styles:
| Style | What it does | When to use |
|---|---|---|
KEY | Rows spread by a chosen column’s value | Big tables you join on that column |
ALL | Full copy of the table on every node | Small dimension tables (lookups) |
EVEN | Rows spread round-robin | When no clear join key exists |
CREATE TABLE sales (
sale_id BIGINT,
customer_id BIGINT,
sale_date DATE,
region VARCHAR(20),
amount DECIMAL(12,2)
)
DISTKEY (customer_id) -- co-locate with the customers table
SORTKEY (sale_date); -- fast date-range filters
CREATE TABLE customers (
customer_id BIGINT,
name VARCHAR(100)
)
DISTKEY (customer_id); -- same key, so joins stay on-node
Tip: Decide DISTKEY and SORTKEY based on your biggest, most frequent join and filter — not as an afterthought. Changing them later means recreating and reloading the table.
Loading data the right way
Do not load with single-row inserts. Use the COPY command to pull data in parallel from S3 (Simple Storage Service — AWS object storage).
COPY sales
FROM 's3://my-data-bucket/sales/2026/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftLoadRole'
FORMAT AS PARQUET;
This reads many files at once, one per slice, which is the fast path.
Redshift Spectrum — query S3 without loading
Spectrum lets you run SQL directly against files sitting in S3, without first loading them into Redshift. You define an external table that points at the S3 path, then join it to your normal Redshift tables in one query.
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'logs_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole';
SELECT s.region, COUNT(*)
FROM spectrum_schema.raw_clicks c
JOIN sales s ON c.customer_id = s.customer_id
GROUP BY s.region;
Use Spectrum when you have huge cold data in S3 (raw logs, history) that you query occasionally and do not want to pay to store inside the warehouse.
Best Practices
- Keep Redshift for analytics and reporting; never route transactional app traffic to it.
- Choose DISTKEY and SORTKEY around your biggest joins and date filters before you load data.
- Use
DISTSTYLE ALLfor small lookup/dimension tables so joins never shuffle them. - Load data in bulk with
COPYfrom S3 in many files; avoid single-rowINSERTs. - Prefer Serverless for spiky workloads so you do not pay for idle compute.
- Keep cold, rarely queried data in S3 and reach it with Spectrum instead of storing it in the warehouse.
- Run
VACUUMandANALYZE(or rely on auto-maintenance) so the planner has fresh stats and reclaims space after big loads.