Database Migrations Are the Riskiest Code You Ship

Application code has a safety net. If a deploy goes bad, you roll back to the previous version, and within seconds the system is exactly as it was. The bad code never happened. That safety net is so reliable that most engineers have stopped thinking of deploys as risky at all.

Database migrations don't have that net.

A migration changes state. By the time you've noticed it was wrong, it has already run — the column is already dropped, the rows are already rewritten, the constraint is already rejecting writes. "Roll back" doesn't undo it. There is no previous version of your data to return to, only whatever the migration left behind.

This is the single most important fact about migrations, and most teams' processes don't reflect it. Migrations get the same review as a copy change and far less than a refactor. They should get more caution than anything else in the pipeline.

Why "down migrations" are a comforting fiction

Most migration frameworks let you write a down alongside every up, and this creates a powerful illusion of symmetry — as if a migration were as reversible as a deploy.

It usually isn't.

If your up runs DROP COLUMN email_verified, your down can run ADD COLUMN email_verified — but it cannot bring back the values. The data is gone. The down recreates the shape of the old schema and none of its content. You're left with a column full of defaults where real data used to be.
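
A minimal sketch of that loss, using Python's built-in sqlite3 module and an in-memory database. The users table, its three rows, and the up and down helper names are hypothetical, and the drop is written as a table rebuild so the sketch also runs on SQLite versions without DROP COLUMN. Your framework and database will differ in the details, but the outcome should be the same.

```python
import sqlite3


def up(conn: sqlite3.Connection) -> None:
    # Equivalent to DROP COLUMN email_verified, written as a table rebuild
    # so the sketch runs on SQLite versions that lack DROP COLUMN.
    conn.executescript(
        """
        CREATE TABLE users_new (id INTEGER PRIMARY KEY, email TEXT NOT NULL);
        INSERT INTO users_new (id, email) SELECT id, email FROM users;
        DROP TABLE users;
        ALTER TABLE users_new RENAME TO users;
        """
    )


def down(conn: sqlite3.Connection) -> None:
    # Restores the column's shape only; every row gets the default value.
    conn.execute(
        "ALTER TABLE users ADD COLUMN email_verified INTEGER NOT NULL DEFAULT 0"
    )
    conn.commit()


def verified_flags(conn: sqlite3.Connection) -> list:
    rows = conn.execute("SELECT email_verified FROM users ORDER BY id")
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users ("
    "id INTEGER PRIMARY KEY, email TEXT NOT NULL, "
    "email_verified INTEGER NOT NULL DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO users (email, email_verified) VALUES (?, ?)",
    [("a@example.com", 1), ("b@example.com", 1), ("c@example.com", 0)],
)
conn.commit()

before = verified_flags(conn)  # real data: two verified users, one not
up(conn)
down(conn)
after = verified_flags(conn)   # same column name, but only defaults remain
```

After the round trip the schema looks identical to where it started, which is exactly why the symmetry is misleading: nothing in the schema tells you the two verified users are now marked unverified.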

Even when a down is theoretically clean, it's rarely safe. By the time you want to reverse a migration, the new application code has been running against the new schema, writing data that depends on it. Reverse the schema and you've now orphaned or corrupted everything written since the deploy. The down migration was tested against an empty schema, never against "the new schema with three hours of real production writes on top."

Treat down migrations as what they are: a convenience for resetting your local dev database. They are not a production recovery plan. The production recovery plan for a bad migration is your backups and your point-in-time recovery — and you should know, before you run anything, exactly how long restoring from those would take.

The locking problem nobody sees in review

The second way migrations bite is performance, and it's invisible in code review because the SQL looks trivial.

ALTER TABLE users ADD COLUMN ... is one line. On a small table it's instant. On a large table, depending on your database and the exact operation, it can take a lock that blocks every read or write to that table for the entire duration of the change — which might be seconds, or might be many minutes on a table with tens of millions of rows.

For that whole window, every query touching the table queues behind the lock. Connections pile up. The connection pool exhausts. The application starts returning errors not because the migration failed but because it succeeded slowly while holding a lock. A reviewer reading the diff sees one harmless-looking line and has no way to know it will freeze the busiest table in the system.

The specifics vary by database and version — which operations take which locks, what can be done concurrently, what rewrites the whole table — and you need to know them for your database. The general rule holds everywhere: on a large table, assume every schema change is dangerous until you've checked exactly what lock it takes and for how long.
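
To make the failure mode concrete, here is a small sketch using Python's built-in sqlite3 module. It is a stand-in, not a model of your database: SQLite locks the whole database file rather than a single table, and the explicit BEGIN EXCLUSIVE only plays the role of a slow schema change holding its lock. The file name, table, and 0.2-second timeout are assumptions chosen for illustration.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")

# Connection that plays the role of the migration.
# isolation_level=None lets us issue BEGIN and COMMIT ourselves.
migration = sqlite3.connect(path, isolation_level=None)
migration.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
migration.execute("INSERT INTO users (name) VALUES ('ada')")

# Take an exclusive lock and hold it, as a slow schema change might.
migration.execute("BEGIN EXCLUSIVE")

# Connection that plays the role of application traffic.
# timeout is how long it waits for the lock before giving up, in seconds.
app = sqlite3.connect(path, timeout=0.2)
error_message = ""
try:
    app.execute("SELECT count(*) FROM users").fetchone()
    blocked = False
except sqlite3.OperationalError as exc:
    # The query itself is fine; it failed only because of the held lock.
    blocked = True
    error_message = str(exc)

# Once the "migration" finishes, the same query goes through.
migration.execute("COMMIT")
row_count = app.execute("SELECT count(*) FROM users").fetchone()[0]
```

Nothing about the blocked SELECT is wrong, and nothing about the "migration" failed. The application sees errors purely because of timing, which is the part a diff cannot show.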

The pattern that makes migrations safe: expand and contract

The way out is to stop coupling schema changes to code changes in a single deploy. Decouple them with the expand/contract pattern, also called parallel change.

Say you want to rename users.username to users.handle. The unsafe way is one migration that renames the column plus one deploy that switches the code. For a moment, old code expects username and the new schema only has handle — or vice versa — and that moment is an outage.

The safe way is a sequence of small steps, each reversible on its own until the final drop:

Expand. Add the new handle column. Add nothing else. The old code doesn't know it exists; nothing breaks. This migration is genuinely reversible — dropping a column nobody reads is safe.

Backfill. Populate handle from username for existing rows, in batches, so you never lock the whole table at once. A backfill that processes 1,000 rows at a time and pauses between batches takes longer in wall-clock time, but each batch holds its locks only briefly, so production traffic is never blocked for long.

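Dual-write. Deploy code that writes both columns and still reads the old one. Now every new row is correct under both schemas. Rows written or changed between the backfill and this deploy may still have a missing or stale handle, so finish with one more backfill pass over rows where the two columns differ. From then on, the system works whether you're looking at username or handle.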

Migrate reads. Deploy code that reads handle instead of username. The old column is still there, still being written, so this deploy is instantly reversible — if reads break, roll the code back and username is untouched.

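Contract. Once the new path has been stable in production long enough to trust, deploy code that stops writing username, then drop the column in a separate migration. The drop is the one step you can't take back, which is why it comes last.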
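
The expand and backfill steps above can be sketched as follows, again with Python's built-in sqlite3 module. The function names, the 2,500-row table, and the default batch size are assumptions for illustration. A real backfill would run against your production database with a pause tuned to its load.

```python
import sqlite3
import time


def expand(conn: sqlite3.Connection) -> None:
    # Add the new column, nullable, and nothing else.
    conn.execute("ALTER TABLE users ADD COLUMN handle TEXT")
    conn.commit()


def backfill_handle(conn: sqlite3.Connection,
                    batch_size: int = 1000,
                    pause_seconds: float = 0.0) -> int:
    """Copy username into handle in small committed batches.

    Returns the number of rows updated.
    """
    total = 0
    while True:
        cur = conn.execute(
            """
            UPDATE users SET handle = username
            WHERE id IN (
                SELECT id FROM users
                WHERE handle IS NULL AND username IS NOT NULL
                LIMIT ?
            )
            """,
            (batch_size,),
        )
        conn.commit()  # end the transaction so locks are released per batch
        if cur.rowcount == 0:
            return total
        total += cur.rowcount
        time.sleep(pause_seconds)  # leave room for production traffic


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT)")
conn.executemany(
    "INSERT INTO users (username) VALUES (?)",
    ((f"user{i}",) for i in range(2500)),
)
conn.commit()

expand(conn)
copied = backfill_handle(conn, batch_size=1000)
remaining = conn.execute(
    "SELECT count(*) FROM users WHERE handle IS NULL"
).fetchone()[0]
```

The username IS NOT NULL filter matters: without it, a row with no username would match on every pass and the loop would never terminate. The function is also safe to re-run, since it only touches rows that still lack a handle.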

Every step is independently deployable, every step before the final drop is independently reversible, and there is never a window where old and new code disagree about the schema. It's more steps and more calendar time. That is the cost of not having a rollback button, and it is cheap compared to the alternative.
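
On the application side, the dual-write and read-migration deploys might look something like this sketch. UserStore and its read_from_handle flag are hypothetical names, not part of any framework. The point is only that writes touch both columns while the read column is a switch you can flip back.

```python
import sqlite3


class UserStore:
    """Hypothetical data-access layer for the username -> handle change."""

    def __init__(self, conn: sqlite3.Connection, read_from_handle: bool = False):
        self.conn = conn
        # Flipping this flag is the "migrate reads" deploy; flipping it
        # back is the rollback. Neither touches the data.
        self.read_from_handle = read_from_handle

    def set_name(self, user_id: int, name: str) -> None:
        # Dual-write: update both columns in a single statement so they
        # cannot drift apart.
        self.conn.execute(
            "UPDATE users SET username = ?, handle = ? WHERE id = ?",
            (name, name, user_id),
        )
        self.conn.commit()

    def get_name(self, user_id: int):
        # The column name comes from a fixed pair, never from user input.
        column = "handle" if self.read_from_handle else "username"
        row = self.conn.execute(
            f"SELECT {column} FROM users WHERE id = ?", (user_id,)
        ).fetchone()
        return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT, handle TEXT)"
)
conn.execute("INSERT INTO users (id, username, handle) VALUES (1, 'ada', 'ada')")
conn.commit()

old_reads = UserStore(conn, read_from_handle=False)  # before the read switch
new_reads = UserStore(conn, read_from_handle=True)   # after the read switch

old_reads.set_name(1, "lovelace")
```

Because both stores write both columns, old_reads and new_reads return the same name for user 1, so either version of the code can be running at any moment.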

The process changes that matter

Beyond the pattern, a few practices separate teams that fear migrations from teams that ship them calmly.

Separate schema changes from data changes. A migration that alters structure and a migration that rewrites millions of rows have completely different risk and timing profiles. Don't bundle them. The data migration usually belongs in batched application code, not a single blocking statement.

Test against production-scale data. A migration that's instant on your 5,000-row dev database tells you nothing about its behavior on 50 million rows. Run it against a recent production-sized copy and measure — how long, what lock. If you haven't measured, you don't know.
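
A rough sketch of the measuring habit, with heavy caveats: it uses synthetic in-memory SQLite tables in place of a restored production copy, the helper names and row counts are made up, and it records duration only. Which lock the statement takes is a separate question you would answer from your database's documentation and its own lock views.

```python
import sqlite3
import time


def build_copy(row_count: int) -> sqlite3.Connection:
    # Stand-in for "a recent production-sized copy". In practice you would
    # connect to a restored snapshot instead of generating rows.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT)")
    conn.executemany(
        "INSERT INTO users (username) VALUES (?)",
        ((f"user{i}",) for i in range(row_count)),
    )
    conn.commit()
    return conn


def time_statement(conn: sqlite3.Connection, sql: str) -> float:
    """Run one statement and return how long it took, in seconds."""
    start = time.perf_counter()
    conn.execute(sql)
    conn.commit()
    return time.perf_counter() - start


statement = "CREATE INDEX idx_users_username ON users (username)"

# Same statement, two table sizes: dev-sized and something larger.
timings = {
    rows: time_statement(build_copy(rows), statement)
    for rows in (5_000, 200_000)
}

for rows, seconds in timings.items():
    print(f"{rows:>9,} rows: {seconds:.4f}s")
```

The absolute numbers here mean nothing for your system. What carries over is the shape of the exercise: same statement, realistic row count, a number written down before the migration is approved.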

Make migrations reviewable as the high-risk code they are. A migration touching a large table should get a named reviewer who checks the lock behavior, not a rubber stamp. The review question is not "is the SQL correct" — it's "what happens to production traffic while this runs."

Confirm your recovery path before you run anything. Know that backups are current and know — concretely, in minutes — how long a restore takes. The worst time to discover your point-in-time recovery is misconfigured is the moment you need it.

The goal isn't to make migrations scary. It's the opposite: migrations feel scary because most teams run them in a way that genuinely is dangerous. Decouple schema from code, change one thing at a time, measure before you run, and a migration becomes what it should be: a routine, boring, reversible step. Boring is the highest praise a database migration can earn.
