Zero Downtime Deployments: How to Ship Without Taking Your App Offline
A complete guide to zero-downtime deployment strategies — rolling deploys, blue/green, canary releases, database migrations, and the gotchas that cause 2am incidents.
RaidFrame Team
September 28, 2025 · 5 min read
"We'll do the deploy at 2am when traffic is low."
If you're scheduling maintenance windows for deployments, your deployment process is broken. Every modern application should deploy during business hours, multiple times per day, with zero user impact.
Here's how.
Rolling deployments
The most common zero-downtime strategy. New instances roll in while old instances roll out.
Time 0: [v1] [v1] [v1] [v1] ← all running v1
Time 1: [v2] [v1] [v1] [v1] ← first v2 instance ready
Time 2: [v2] [v2] [v1] [v1] ← second v2 ready, first v1 draining
Time 3: [v2] [v2] [v2] [v1] ← third v2 ready
Time 4: [v2] [v2] [v2] [v2] ← complete, all v1 terminated

During the transition, both v1 and v2 serve traffic. This means your application must handle:
- Backward-compatible APIs — v2 must accept requests that v1 could handle
- Shared session state — sessions can't be stored in-memory (use Redis)
- Graceful shutdown — v1 instances must finish in-flight requests before terminating
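The rolling sequence above can be sketched as a small simulation — a hypothetical `rollingDeploy` helper (illustrative only, not a platform API) that replaces one instance at a time and snapshots the fleet after each step:

```javascript
// Simulate a rolling deployment: replace one instance at a time
// and record the fleet state after every step. Mid-rollout, the
// fleet contains BOTH versions — which is why backward-compatible
// APIs and shared session state are requirements, not nice-to-haves.
function rollingDeploy(instances, newVersion) {
  const states = [];
  for (let i = 0; i < instances.length; i++) {
    instances[i] = newVersion;   // i-th instance replaced
    states.push([...instances]); // snapshot after this step
  }
  return states;
}

const steps = rollingDeploy(["v1", "v1", "v1", "v1"], "v2");
// steps[1] is the Time 2 row from the diagram: ["v2", "v2", "v1", "v1"]
```

Note that every intermediate snapshot except the last mixes v1 and v2 — there is no moment where you can assume only one version is live.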
Graceful shutdown
When the platform sends SIGTERM to your app, don't exit immediately. Finish what you're doing first.
const express = require("express");
const { Pool } = require("pg");

const app = express();
const pool = new Pool();

let isShuttingDown = false;

// Health check returns unhealthy during shutdown so the load
// balancer stops routing new traffic to this instance
app.get("/health", (req, res) => {
  if (isShuttingDown) return res.status(503).send("shutting down");
  res.status(200).send("ok");
});

const server = app.listen(3000);

process.on("SIGTERM", () => {
  isShuttingDown = true;
  // Stop accepting new connections; in-flight requests finish
  server.close(() => {
    // Close database connections, then exit cleanly
    pool.end(() => process.exit(0));
  });
  // Force exit after 30 seconds if requests haven't drained.
  // unref() so this timer alone can't keep the process alive.
  setTimeout(() => process.exit(1), 30000).unref();
});

Connection draining
The load balancer stops sending NEW requests to the old instance but lets EXISTING requests finish. Configure a drain timeout (30-60 seconds is typical).
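As a sketch (an in-memory model, not a real load balancer), draining boils down to an instance that rejects new requests while letting in-flight ones finish:

```javascript
// Minimal model of connection draining: a draining instance accepts
// no new requests but keeps serving the ones already in flight, and
// is only safe to terminate once its in-flight count reaches zero.
class Instance {
  constructor(name) {
    this.name = name;
    this.draining = false;
    this.inFlight = 0;
  }
  accept() {
    if (this.draining) return false; // LB routes around draining instances
    this.inFlight++;
    return true;
  }
  finishOne() {
    if (this.inFlight > 0) this.inFlight--;
  }
  canTerminate() {
    // Safe to kill only after every in-flight request has finished
    return this.draining && this.inFlight === 0;
  }
}

const oldInstance = new Instance("v1-a");
oldInstance.accept();          // request arrives before the drain starts
oldInstance.draining = true;   // deploy begins: stop sending new traffic
oldInstance.accept();          // → false, new request goes elsewhere
oldInstance.finishOne();       // the in-flight request completes
oldInstance.canTerminate();    // → true, instance can now be shut down
```

The drain timeout from the paragraph above is the cap on how long the load balancer waits for `inFlight` to hit zero before killing the instance anyway.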
Blue/green deployments
Run two identical environments. "Blue" is live, "green" is staging the new version.
Before: Traffic → [Blue: v1] [Green: idle]
Deploy: Traffic → [Blue: v1] [Green: v2 deploying]
Switch: Traffic → [Green: v2] [Blue: v1 standby]

Advantages:
- Instant switchover (DNS or load balancer change)
- Instant rollback (switch back to blue)
- No mixed-version traffic
Disadvantages:
- 2x infrastructure cost during deployment
- Database schema changes are tricky (both versions share the database)
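A hypothetical router sketch (illustrative only, not how any particular platform implements the switch) shows why both cutover and rollback are instant — each is a single pointer flip:

```javascript
// Blue/green switchover modeled as a pointer flip: traffic always
// goes to whichever environment "active" currently names.
const envs = {
  blue:  { version: "v1", healthy: true },
  green: { version: null, healthy: false },
};
let active = "blue";

function deployToIdle(version) {
  const idle = active === "blue" ? "green" : "blue";
  envs[idle].version = version; // deploy while no traffic hits it
  envs[idle].healthy = true;    // assume health checks pass post-deploy
}

function switchover() {
  const idle = active === "blue" ? "green" : "blue";
  if (!envs[idle].healthy) throw new Error("idle env unhealthy");
  active = idle; // instant cutover; calling again flips back (rollback)
}

deployToIdle("v2");
switchover();
// envs[active].version is now "v2"; blue still holds v1 for rollback
```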
Canary deployments
Deploy to a small percentage of traffic, monitor, then gradually increase.
Step 1: 5% → v2, 95% → v1 (monitor for 10 min)
Step 2: 25% → v2, 75% → v1 (monitor for 10 min)
Step 3: 50% → v2, 50% → v1 (monitor for 10 min)
Step 4: 100% → v2 (complete)

If error rates or latency spike at any step, automatically roll back.
Best for: High-traffic services where a bad deploy affects millions of users. The canary catches issues before they reach everyone.
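The percentage split is usually keyed on a stable request attribute (user ID, session) so each user consistently sees one version. A sketch with a toy hash — real systems do this in the load balancer or service mesh:

```javascript
// Deterministic canary routing: hash a stable key into a 0-99
// bucket and compare against the current canary percentage. The
// same user always lands in the same bucket, so as the percentage
// grows 5 → 25 → 50 → 100, users move to v2 and never flap back.
function bucket(key) {
  let h = 0;
  for (const ch of key) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  return h % 100;
}

function route(userId, canaryPercent) {
  return bucket(userId) < canaryPercent ? "v2" : "v1";
}
```

Randomly assigning each request instead would also hit the target percentage, but a single user would bounce between versions — bad for sessions and for debugging.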
Database migrations (the hard part)
The number one cause of deployment downtime is database migrations. You can't just ALTER TABLE in production and hope for the best.
The expand-contract pattern
Step 1: Expand — add new columns/tables without removing old ones
-- Deploy 1: Add new column, nullable
ALTER TABLE users ADD COLUMN display_name varchar(255);

Step 2: Migrate — backfill data, update application to use new column
-- Background job: copy data
UPDATE users SET display_name = name WHERE display_name IS NULL;

Step 3: Contract — remove old column after all code uses the new one
-- Deploy 3: Drop old column (weeks later, after verification)
ALTER TABLE users DROP COLUMN name;

Each step is its own deployment. The database is always compatible with both the old and new application code.
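During the migrate step, application code must tolerate rows the backfill job hasn't reached yet. A sketch of the read path, assuming the column names from the SQL above:

```javascript
// Migrate-phase read path: prefer the new column, fall back to the
// legacy one for rows the backfill hasn't touched yet. Writes go to
// the new column, so newly created rows never need backfilling.
function displayNameOf(row) {
  return row.display_name ?? row.name;
}

const migrated   = { name: "Ada Lovelace", display_name: "Ada" };
const unmigrated = { name: "Alan Turing", display_name: null };

displayNameOf(migrated);    // → "Ada"
displayNameOf(unmigrated);  // → "Alan Turing"
```

Once the backfill completes and every read goes through the new column, the fallback (and then the old column itself) can be removed in the contract deploy.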
Dangerous migrations
Never do these in a single deploy:
- Rename a column (old code can't find it)
- Change a column type (existing data may not convert)
- Add a NOT NULL column without a default (existing rows fail)
- Drop a column that old code still reads
Always do these instead:
- Add a new column, migrate data, drop old column (3 deploys)
- Add a new table, dual-write, backfill, switch reads, drop old table
- Use feature flags to control which code path runs
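The dual-write step of the table-replacement recipe can be sketched with in-memory stand-ins for the two tables (a real implementation would wrap both writes in one transaction):

```javascript
// Dual-write phase: every write goes to BOTH the old and new table,
// so the new table stays current while a one-off backfill job copies
// the rows that pre-date the dual-write.
const oldTable = new Map();
const newTable = new Map();

function saveUser(id, user) {
  oldTable.set(id, { name: user.name });        // legacy shape
  newTable.set(id, { displayName: user.name }); // new shape
}

function backfill() {
  // Copy only rows the dual-write never saw
  for (const [id, row] of oldTable) {
    if (!newTable.has(id)) newTable.set(id, { displayName: row.name });
  }
}

oldTable.set(1, { name: "legacy-only row" }); // pre-dates dual-write
saveUser(2, { name: "Grace" });               // written to both tables
backfill();                                   // now the tables agree
```

After the backfill, reads can switch to the new table; once nothing reads the old one, it can be dropped in a later deploy.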
Lock-free migrations
Large ALTER TABLE operations take heavyweight locks. On a table with millions of rows, a blocking index build or table rewrite can stall writes for minutes.
Use pg_repack or CREATE INDEX CONCURRENTLY to avoid blocking writes:
-- Bad: blocks writes to users for the duration of the build
CREATE INDEX idx_users_email ON users(email);

-- Good: builds without blocking writes (slower, can't run in a transaction)
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);

Feature flags
Feature flags let you deploy code without activating it. Ship the feature to production behind a flag, then enable it when ready.
if (featureFlags.isEnabled("new-checkout-flow", user)) {
return renderNewCheckout();
} else {
return renderOldCheckout();
}

This decouples deployment from release. You can deploy 10 times a day and release features on a completely separate schedule.
Monitoring during deployment
Watch these metrics during every deploy:
- Error rate — should not increase. If it does, rollback.
- Response time (p95) — should not increase significantly.
- CPU/memory — new version shouldn't use dramatically more resources.
- Business metrics — conversion rate, signup rate, revenue. If they drop, investigate.
Set up automatic rollback triggers:
- Error rate > 5% → rollback
- p95 latency > 2x baseline → rollback
- Health check failures > 0 → stop rollout
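The triggers above are easy to encode as data. A sketch of a rule evaluator using the thresholds from the list (the metric names here are illustrative, not a real monitoring API):

```javascript
// Evaluate deploy metrics against the rollback rules and return
// the action to take: "rollback", "stop", or "continue".
function evaluateDeploy(metrics, baseline) {
  if (metrics.errorRate > 0.05) return "rollback";           // error rate > 5%
  if (metrics.p95Ms > 2 * baseline.p95Ms) return "rollback"; // p95 > 2x baseline
  if (metrics.healthCheckFailures > 0) return "stop";        // halt the rollout
  return "continue";
}

const baseline = { p95Ms: 120 };
evaluateDeploy({ errorRate: 0.01, p95Ms: 130, healthCheckFailures: 0 }, baseline);
// → "continue"
evaluateDeploy({ errorRate: 0.09, p95Ms: 130, healthCheckFailures: 0 }, baseline);
// → "rollback"
```

Keeping the rules in code (or config) means the rollback decision runs on every deploy automatically, instead of depending on whoever happens to be watching the dashboard.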
Zero-downtime deploys on RaidFrame
Every deploy on RaidFrame is zero-downtime by default:
rf deploy

The platform handles:
- Rolling deployment with health checks
- Connection draining (30s configurable)
- Automatic rollback on health check failure
- Preview environments for every PR
- One-command rollback to any previous version
No maintenance windows. No 2am deploys. Ship whenever you want.
Ship faster with RaidFrame
Auto-scaling compute, managed databases, global CDN, and zero-config CI/CD. Free tier included.