Zero-Downtime Deployments
When you push a new version of your app, the naive approach is to stop the old process and start the new one. For the second or two in between, every request that arrives gets an error. Zero-downtime deployment means swapping in the new version without ever dropping a single request — users never notice anything happened. This page explains the three techniques that make it possible: graceful reloads, health checks, and the symlink-swap release pattern.
Why a plain restart drops requests
A normal restart has a gap. The moment you run systemctl restart myapp, the old process is sent a stop signal, and unless the app handles that signal gracefully it exits at once, even if it was halfway through answering someone. Then the new process needs a few seconds to boot, open its port, and warm up. Any request arriving during that window has nothing to talk to, so the user sees a 502 Bad Gateway or a hung connection.
Zero-downtime deployment removes the gap by following one rule: never stop the old version until the new version is fully ready and proven healthy. Everything below is a way to honour that rule.
Graceful reloads
A graceful reload tells a running process to load new configuration or new code without killing in-flight requests. Instead of dying immediately, the process stops accepting new connections, finishes the ones it already has (this is called draining — letting existing requests drain out before shutdown), and only then exits. A fresh worker takes over new traffic at the same time.
Nginx reload
Nginx (a reverse proxy — a server that sits in front of your app and forwards requests to it) supports this natively. After editing a config file, never run restart; run reload:
sudo nginx -t
sudo systemctl reload nginx
Output:
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
nginx -t checks the config for typos first. reload then starts new worker processes with the new config while the old workers keep serving their current requests until those finish. No connection is dropped.
Gotcha: Always run nginx -t before reloading. A reload with a broken config is rejected and Nginx keeps the old config running, but a restart with a broken config leaves your site down. Reload is the safe verb.
PM2 reload
For Node.js apps managed by PM2 (a process manager that keeps your app alive and restarts it if it crashes), use reload instead of restart:
pm2 reload myapp
pm2 restart kills and respawns the process, causing a brief gap. pm2 reload does a rolling restart: it starts a new worker, waits for it to be ready, shifts traffic over, then retires the old one. For this to be truly seamless your app should run in cluster mode (multiple workers) so there is always at least one worker serving traffic.
| Command | Behaviour | Downtime? |
|---|---|---|
| pm2 restart | Kill then start | Yes, a short gap |
| pm2 reload | Rolling, one worker at a time | No (in cluster mode) |
| systemctl restart | Kill then start | Yes |
| systemctl reload | Runs the unit's ExecReload command; works only if the app handles it (often via SIGHUP) | Depends on the app |
Health checks
A health check is a tiny endpoint your app exposes — usually /healthz or /health — that returns 200 OK only when the app is fully booted and able to serve traffic. The deployment script polls this endpoint and refuses to send real users to the new version until it answers correctly.
A minimal check loop in a deploy script looks like this:
#!/usr/bin/env bash
# Wait up to 30 seconds for the new instance on port 3001 to become healthy
for i in $(seq 1 30); do
  if curl -fs http://127.0.0.1:3001/healthz > /dev/null; then
    echo "New instance is healthy"
    exit 0
  fi
  echo "Waiting for app to start... ($i)"
  sleep 1
done
echo "App never became healthy — aborting deploy"
exit 1
Output:
Waiting for app to start... (1)
Waiting for app to start... (2)
New instance is healthy
When to use this: always, in any automated deploy. The health check is what turns “I hope it started” into “I know it started.” Without it, your script might switch traffic to a process that crashed on boot.
Two instances behind Nginx
The simplest real zero-downtime setup runs two copies of your app on different ports (say 3000 and 3001) and puts Nginx in front of them. You deploy by starting the new version on the spare port, health-checking it, then telling Nginx to send traffic there.
Define both as an upstream group in /etc/nginx/sites-available/myapp:
upstream myapp {
    server 127.0.0.1:3000;         # blue
    server 127.0.0.1:3001 backup;  # green (idle until promoted)
}
server {
    listen 80;
    server_name example.com;
    location / {
        proxy_pass http://myapp;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
To deploy the new version: start it on 3001, wait for /healthz, then flip which server is primary in the config and run sudo systemctl reload nginx. Because Nginx reloads gracefully, traffic moves over with no dropped requests. This is the foundation of the blue-green strategy covered on the deployment-strategies page.
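The "flip which server is primary" step can be scripted. The sketch below is one possible approach, not an Nginx feature: flip_upstream is a hypothetical helper name, and it assumes the upstream block contains exactly the two server 127.0.0.1:PORT lines shown above, one of them flagged backup.

```bash
#!/usr/bin/env bash
# flip_upstream FILE: hypothetical helper that swaps the "backup" flag between
# the two upstream servers, so the idle instance becomes the primary.
# Assumes server lines of the form "server 127.0.0.1:PORT[ backup];".
flip_upstream() {
  local conf="$1" tmp status=0
  tmp="$(mktemp)" || return 1
  awk '
    # Current backup: drop the flag, promoting it to primary.
    /^[ \t]*server[ \t]+127\.0\.0\.1:[0-9]+[ \t]+backup;/ {
      sub(/[ \t]+backup;/, ";"); print; next
    }
    # Current primary: add the flag, demoting it to backup.
    /^[ \t]*server[ \t]+127\.0\.0\.1:[0-9]+;/ {
      sub(/;/, " backup;"); print; next
    }
    { print }
  ' "$conf" > "$tmp" && cat "$tmp" > "$conf" || status=1
  rm -f "$tmp"
  return "$status"
}
```

After flipping, validate and apply the change the same way as any other config edit: run flip_upstream on /etc/nginx/sites-available/myapp as root, then sudo nginx -t followed by sudo systemctl reload nginx.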
The symlink-swap release pattern
This pattern makes the code switch itself atomic. Instead of overwriting your app folder in place (which leaves it half-updated for a moment), you put each release in its own timestamped directory and point a current symlink (a pointer that looks like a folder but redirects to another path) at the live one.
RELEASE="/var/www/myapp/releases/$(date +%Y%m%d%H%M%S)"
sudo mkdir -p "$RELEASE"
sudo git clone https://github.com/your-org/your-app.git "$RELEASE"
cd "$RELEASE" && sudo npm ci && sudo npm run build
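# Atomic swap: create the link under a temporary name, then rename it over
# "current". A rename is atomic; some ln implementations handle -f as
# unlink-then-create, which leaves a brief gap. (mv -T needs GNU coreutils.)
sudo ln -sfn "$RELEASE" /var/www/myapp/current.tmp
sudo mv -T /var/www/myapp/current.tmp /var/www/myapp/current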
pm2 reload myapp
Your app and Nginx always reference /var/www/myapp/current. Switching the symlink is a single filesystem operation — there is no in-between state. Rolling back is just pointing current at the previous release directory and reloading. Keep the last few releases so a rollback is instant.
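Rollback and release cleanup can be sketched as two small shell functions. Both names (rollback, prune_releases) are hypothetical, and the sketch assumes the releases/ plus current layout described above, timestamped directory names that sort chronologically, and GNU coreutils for mv -T. The link is swapped by renaming a temporary symlink over current, since a rename is a single atomic step.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for the releases/ + current layout. Assumes GNU mv (-T).

# rollback ROOT: repoint ROOT/current at the newest release older than the live one.
rollback() {
  local root="$1" live dir previous="" found=""
  live="$(readlink "$root/current")" || return 1
  live="$(basename "$live")"
  for dir in "$root"/releases/*/; do     # globs expand in sorted, oldest-first order
    dir="${dir%/}"
    [ -d "$dir" ] || continue
    if [ "${dir##*/}" = "$live" ]; then found=1; break; fi
    previous="$dir"
  done
  if [ -z "$found" ] || [ -z "$previous" ]; then
    echo "No earlier release to roll back to" >&2
    return 1
  fi
  # Build the link under a temporary name, then rename it over "current".
  ln -sfn "$previous" "$root/current.tmp" && mv -T "$root/current.tmp" "$root/current"
}

# prune_releases ROOT [KEEP]: delete all but the newest KEEP releases (default 5),
# skipping whichever release "current" points at.
prune_releases() {
  local root="$1" keep="${2:-5}" live dir total=0 excess
  live="$(basename "$(readlink "$root/current" 2>/dev/null)")"
  for dir in "$root"/releases/*/; do
    if [ -d "$dir" ]; then total=$((total + 1)); fi
  done
  excess=$((total - keep))
  for dir in "$root"/releases/*/; do
    [ "$excess" -gt 0 ] || break
    dir="${dir%/}"
    [ -d "$dir" ] || continue
    excess=$((excess - 1))
    [ "${dir##*/}" = "$live" ] || rm -rf -- "$dir"
  done
  return 0
}
```

A rollback would then be rollback /var/www/myapp followed by pm2 reload myapp, and prune_releases /var/www/myapp 5 could run at the end of each successful deploy.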
Best practices
- Always use reload, never restart, for Nginx, and run nginx -t first.
- Expose a real /healthz endpoint and gate every deploy on it before shifting traffic.
- Run at least two app instances (cluster mode or two ports) so one can serve while the other updates.
- Handle SIGTERM/SIGINT in your app: stop accepting new requests, finish open ones, then exit (graceful shutdown).
- Use the symlink-swap layout so the code switch is atomic and rollback is one command.
- Keep the previous 3-5 releases on disk so you can roll back instantly without rebuilding.
- Set a sensible drain timeout (often 10-30s) so slow requests finish but stuck ones do not block forever.
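Seen from the deploy script's side, the SIGTERM and drain-timeout advice might look like the sketch below. stop_gracefully is a hypothetical helper, not something PM2 or systemd provides (both have their own stop timeouts); it assumes you know the PID of the old instance and that the app drains its open requests when it receives SIGTERM.

```bash
#!/usr/bin/env bash
# stop_gracefully PID [TIMEOUT]: hypothetical helper.
# Sends SIGTERM so the app can drain, waits up to TIMEOUT seconds (default 30),
# then falls back to SIGKILL. Returns 0 on a clean exit, 1 if it had to force-kill.
stop_gracefully() {
  local pid="$1" timeout="${2:-30}" waited=0
  kill -TERM "$pid" 2>/dev/null || return 0   # process is already gone
  while kill -0 "$pid" 2>/dev/null; do        # kill -0 only checks that the PID exists
    if [ "$waited" -ge "$timeout" ]; then
      kill -KILL "$pid" 2>/dev/null           # drain window expired: stop waiting
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}
```

The timeout is the trade-off named in the last bullet: long enough for slow requests to finish, short enough that a stuck request cannot block the deploy indefinitely.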