Flyway / Migrations — Theory

The fundamental rule

Schema changes go out before the code that depends on them; rollbacks reverse the order (revert the code first, then the schema).

For zero-downtime deploys, every migration must be backwards-compatible with the previously deployed code, AND the new code must work with the old schema. This forces multi-step migrations for any breaking change.

Forward-compatible vs backward-compatible

Backward compat: old code works with new schema (the migration has run, the code update comes next, and a rolling deploy still has old pods running).
Forward compat: new code works with old schema (the new code is already running but the migration has not been applied yet).

A bare ALTER that renames a column breaks both during the rolling transition window; one that drops a column breaks any old code that still reads or writes it.
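The compatibility window can be modelled as a simple set check. This is a minimal sketch, not a real tool: the column names (fullname, display_name) and the works helper are hypothetical, and "code works" is reduced to "every column the code touches exists".

```python
from __future__ import annotations

# Columns each code version reads or writes (hypothetical example).
OLD_CODE = frozenset({"id", "fullname"})
NEW_CODE = frozenset({"id", "display_name"})

# Schema states for a rename of fullname -> display_name.
ORIGINAL = frozenset({"id", "fullname"})
EXPANDED = frozenset({"id", "fullname", "display_name"})  # add-only step
RENAMED = frozenset({"id", "display_name"})               # bare rename


def works(code_columns: frozenset[str], schema_columns: frozenset[str]) -> bool:
    """Code works if every column it touches exists in the schema."""
    return code_columns <= schema_columns


# The additive (expand) step keeps both code versions working.
print(works(OLD_CODE, EXPANDED), works(NEW_CODE, EXPANDED))  # True True

# A bare rename breaks old code on the new schema...
print(works(OLD_CODE, RENAMED))   # False
# ...and new code cannot run against the old schema either.
print(works(NEW_CODE, ORIGINAL))  # False
```

The model ignores constraints (for example a new NOT NULL column without a default also breaks old inserts), so treat it as an illustration of the window, not a full compatibility check.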

Versioned vs declarative

Versioned (Flyway, Liquibase, Alembic): imperative scripts, ordered, append-only.
Declarative (Atlas, Skeema, sqldef): define desired schema, tool computes diff.

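Versioned is simpler to review and reason about. Declarative reduces hand-written DDL but is harder to stage, because the computed diff tends to land as one change rather than as separate expand and contract steps.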
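To make "tool computes diff" concrete, here is a toy diff over schemas represented as nested dicts. It is a sketch under loud assumptions: the diff_schema function and the dict layout are invented for illustration, column type changes are ignored, and no real declarative tool works exactly this way.

```python
from typing import Dict, List

Schema = Dict[str, Dict[str, str]]  # table name -> {column name: column type}


def diff_schema(current: Schema, desired: Schema) -> List[str]:
    """Return DDL statements that turn `current` into `desired`.

    Only handles added/removed tables and columns; type changes are ignored.
    """
    statements: List[str] = []
    for table, columns in desired.items():
        if table not in current:
            column_defs = ", ".join(f"{name} {ctype}" for name, ctype in columns.items())
            statements.append(f"CREATE TABLE {table} ({column_defs})")
            continue
        for name, ctype in columns.items():
            if name not in current[table]:
                statements.append(f"ALTER TABLE {table} ADD COLUMN {name} {ctype}")
        for name in current[table]:
            if name not in columns:
                statements.append(f"ALTER TABLE {table} DROP COLUMN {name}")
    for table in current:
        if table not in desired:
            statements.append(f"DROP TABLE {table}")
    return statements


current = {"users": {"id": "INTEGER", "fullname": "TEXT"}}
desired = {"users": {"id": "INTEGER", "display_name": "TEXT"}}

for statement in diff_schema(current, desired):
    print(statement)
# ALTER TABLE users ADD COLUMN display_name TEXT
# ALTER TABLE users DROP COLUMN fullname
```

Note what the diff does with a rename: it sees an add plus a drop and emits both at once, with no backfill in between. Staging that safely means editing the desired schema in several passes.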

Idempotency vs ordering

Flyway uses checksums to detect modified-after-applied scripts. Never edit applied migrations; new migration files only.

If you need to fix a wrong migration:

Add a new migration that corrects it.
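For dev/staging, use flyway repair (realigns recorded checksums and clears failed entries in the schema history table) or rebuild the database from scratch (flyway clean, then flyway migrate).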
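The checksum check described above can be sketched in a few lines. This is an illustration only: the checksum and validate functions, the file names, and the dict standing in for the history table are all hypothetical, and Flyway's real checksum computation and schema history layout differ in detail.

```python
import zlib
from typing import Dict, List


def checksum(script: str) -> int:
    """CRC32 of the script with line endings normalized (illustrative only)."""
    normalized = "\n".join(script.splitlines())
    return zlib.crc32(normalized.encode("utf-8"))


def validate(history: Dict[str, int], scripts: Dict[str, str]) -> List[str]:
    """Compare recorded checksums against the scripts currently on disk."""
    problems: List[str] = []
    for name, recorded in history.items():
        if name not in scripts:
            problems.append(f"{name}: applied but missing on disk")
        elif checksum(scripts[name]) != recorded:
            problems.append(f"{name}: checksum mismatch (edited after apply)")
    return problems


scripts = {
    "V1__create_users.sql": "CREATE TABLE users (id INT);",
    "V2__add_email.sql": "ALTER TABLE users ADD COLUMN email TEXT;",
}
# Pretend both scripts were applied: record their checksums.
history = {name: checksum(sql) for name, sql in scripts.items()}

# Someone edits an already-applied migration.
scripts["V1__create_users.sql"] = "CREATE TABLE users (id BIGINT);"

problems = validate(history, scripts)
print(problems)
# ['V1__create_users.sql: checksum mismatch (edited after apply)']
```

The fix in the sketch's terms is to restore V1 and add a V3 that alters the column, which leaves every recorded checksum valid.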

Concurrency

If multiple app instances boot together and try to migrate, they race. Solutions:

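Run the migration once per deploy as a dedicated step (Kubernetes Job, init container, or Argo CD PreSync hook).
Rely on the lock Flyway takes itself (on the schema history table, or an advisory lock on PostgreSQL) so that concurrent runs serialize.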
Some teams gate migration on a single CI step.
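The lock-based option can be sketched with SQLite standing in for the real database. This is not how Flyway is implemented: BEGIN IMMEDIATE is used here as a stand-in for an advisory or table lock, and the MIGRATIONS list, the schema_history table, and the migrate and boot_instances helpers are hypothetical.

```python
import os
import sqlite3
import tempfile
import threading
from typing import List

MIGRATIONS = [
    ("1", "CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)"),
    ("2", "ALTER TABLE users ADD COLUMN email TEXT"),
]


def migrate(db_path: str) -> int:
    """Apply pending migrations while holding the write lock; return how many ran."""
    conn = sqlite3.connect(db_path, timeout=30, isolation_level=None)
    try:
        # Take the write lock up front; other instances wait here until COMMIT.
        conn.execute("BEGIN IMMEDIATE")
        conn.execute("CREATE TABLE IF NOT EXISTS schema_history (version TEXT PRIMARY KEY)")
        applied = {row[0] for row in conn.execute("SELECT version FROM schema_history")}
        ran = 0
        for version, ddl in MIGRATIONS:
            if version in applied:
                continue  # another instance already applied it
            conn.execute(ddl)
            conn.execute("INSERT INTO schema_history (version) VALUES (?)", (version,))
            ran += 1
        conn.execute("COMMIT")
        return ran
    except Exception:
        if conn.in_transaction:
            conn.execute("ROLLBACK")
        raise
    finally:
        conn.close()


def boot_instances(count: int) -> List[int]:
    """Simulate `count` app instances booting at once against one database."""
    db_path = os.path.join(tempfile.mkdtemp(), "app.db")
    results: List[int] = []
    results_lock = threading.Lock()

    def boot() -> None:
        ran = migrate(db_path)
        with results_lock:
            results.append(ran)

    threads = [threading.Thread(target=boot) for _ in range(count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sorted(results)


print(boot_instances(4))  # [0, 0, 0, 2]: one instance migrates, the rest find nothing to do
```

Checking the history table only after the lock is held is the important part; checking first and locking afterwards would reintroduce the race.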

Reversibility

Most migrations are not safely reversible in prod:

DROP COLUMN loses data.
DROP TABLE loses data.
A migration that fails partway is not automatically undone on databases without transactional DDL (such as MySQL); the right action is to fix forward.

Backups + point-in-time recovery are the real rollback for data-destructive ops.

Common interview Qs

Rename a column without downtime? Add new col → dual-write → backfill → switch reads → drop old. 3-4 deploys.
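NOT NULL on 100M-row table? Add nullable, backfill in batches, then add NOT NULL constraint (on PostgreSQL 12+, validate a CHECK (col IS NOT NULL) NOT VALID constraint first so SET NOT NULL can skip the full-table scan).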
Migration runs 6 hours and locks the table — what now? Cancel; redesign as online change (chunks, gh-ost, pt-osc, pgroll).
Deploy succeeded, migration failed mid-way. Now what? Investigate via flyway info; manual fix or flyway repair; never silently ignore.
Test migrations? Apply to fresh DB; apply to a snapshot of prod; CI integration tests.
Why not ORM auto-migrate (e.g. TypeORM synchronize:true)? Unsafe in prod, no review, no audit, no rollback story.
MongoDB migrations? Tools: migrate-mongo, mongock; document version field; lazy migration on read.
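The rename and NOT NULL answers above share the same core: add a nullable column, dual-write, then backfill in small batches. Below is a minimal sketch using an in-memory SQLite database; the users table, the column names, and the create_user and backfill_in_batches helpers are hypothetical, and a production backfill would also throttle and key batches on an indexed column.

```python
import sqlite3


def create_user(conn: sqlite3.Connection, name: str) -> None:
    """Dual-write: new code fills both the old and the new column."""
    conn.execute(
        "INSERT INTO users (fullname, display_name) VALUES (?, ?)", (name, name)
    )


def backfill_in_batches(conn: sqlite3.Connection, batch_size: int = 1000) -> int:
    """Copy fullname into display_name in short transactions; return rows updated."""
    total = 0
    while True:
        cursor = conn.execute(
            """
            UPDATE users SET display_name = fullname
            WHERE id IN (
                SELECT id FROM users
                WHERE display_name IS NULL AND fullname IS NOT NULL
                LIMIT ?
            )
            """,
            (batch_size,),
        )
        conn.commit()  # one short transaction per batch keeps locks brief
        if cursor.rowcount == 0:
            return total
        total += cursor.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
conn.executemany(
    "INSERT INTO users (fullname) VALUES (?)", [(f"user {i}",) for i in range(2500)]
)

# Deploy 1 (expand): add the new column as nullable.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

# Deploy 2: code dual-writes, so new rows never need backfilling.
create_user(conn, "new user")
conn.commit()

# Backfill existing rows in batches of 1000.
updated = backfill_in_batches(conn, batch_size=1000)
remaining = conn.execute(
    "SELECT COUNT(*) FROM users WHERE display_name IS NULL"
).fetchone()[0]
print(updated, remaining)  # 2500 0

# Later deploys: switch reads to display_name, then drop fullname
# once no running code references it.
```

Once remaining is 0 and dual-writes are in place, the NOT NULL constraint or the drop of the old column can go out as its own migration.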

Anti-patterns

DDL + DML in same migration on huge tables.
Re-using version numbers.
“I’ll fix it on staging” by editing an applied script: the checksum mismatch then carries to prod.
One-shot script touching 50 tables.
No backup before destructive migration.