Databases
Schema integrity: deferral, versioning, and production failure modes
A team runs a batch migration — 3 million rows relinked to new parents. A foreign key violation fires at row 2,847,221. The transaction aborts. All three million updates roll back. The migration runs again at 3 AM. The same violation fires. No one thought to check for orphan rows before starting.
State-transition enforcement
SQL CHECK constraints are evaluated on every write and must be deterministic. They can reference other columns of the same row but not other rows. This means pure-column CHECK can enforce “status must be one of these values” but cannot enforce “completed orders cannot transition back to pending.”
For state-transition invariants, the right tools are:
Row-level triggers. A BEFORE UPDATE trigger that raises an exception if the transition is disallowed:
CREATE OR REPLACE FUNCTION enforce_order_transitions()
RETURNS TRIGGER AS $$
BEGIN
IF OLD.status = 'completed' AND NEW.status != 'completed' THEN
RAISE EXCEPTION 'order % cannot leave completed state', OLD.id;
END IF;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER trg_order_status
BEFORE UPDATE ON orders
FOR EACH ROW EXECUTE FUNCTION enforce_order_transitions();Generated columns with CHECK. A GENERATED ALWAYS AS (status IN ('completed','cancelled')) STORED column named is_final, plus a trigger or application check that refuses updates where OLD.is_final = true. The generated column documents the finality rule in the schema.
Row-level security. Postgres RLS policies can encode access rules that depend on row state — for example, USING (status != 'archived') on UPDATE prevents updating archived rows without requiring application-level checks.
The discipline: business rules that can be expressed declaratively (in CHECK or RLS) live in the schema. Rules that require procedural logic (state transitions, cross-row checks) live in triggers. Application-level-only enforcement is the weakest — every consumer must know the rules.
Constraint deferral
By default, Postgres checks constraints after each individual statement. For some complex multi-table operations, intermediate states would violate a constraint even though the final state is valid.
-- A circular FK: A references B, B references A
-- Inserting A requires B to exist; inserting B requires A to exist
-- Neither can be inserted first with IMMEDIATE constraints
ALTER TABLE a ALTER CONSTRAINT fk_a_to_b DEFERRABLE INITIALLY DEFERRED;
ALTER TABLE b ALTER CONSTRAINT fk_b_to_a DEFERRABLE INITIALLY DEFERRED;
BEGIN;
INSERT INTO a (id, b_id) VALUES (1, 1); -- b_id=1 does not exist yet; deferred: OK
INSERT INTO b (id, a_id) VALUES (1, 1); -- a_id=1 exists now; deferred check: consistent
COMMIT; -- Both FK checks run here; both passDEFERRABLE INITIALLY DEFERRED means the check is deferred until COMMIT. DEFERRABLE INITIALLY IMMEDIATE declares the constraint deferrable but immediate by default; you can switch it per-transaction with SET CONSTRAINTS ... DEFERRED.
Use deferral sparingly. Deferred constraints are checked at COMMIT. A failing deferred constraint aborts the entire transaction with no in-flight indication — unlike an immediate constraint, which fails the specific statement. This makes debugging harder. Production hygiene: use DEFERRABLE only for FKs with genuine circular or multi-step dependencies, and only in transactions where the temporary violation is well-understood.
| Mode | When checked | Failure point | Use case |
|---|---|---|---|
| IMMEDIATE (default) | After each statement | At the violating statement | All standard constraints |
| DEFERRABLE INITIALLY IMMEDIATE | After each statement (unless overridden) | At the violating statement | Conditionally deferrable; rare |
| DEFERRABLE INITIALLY DEFERRED | At COMMIT | Transaction aborts at COMMIT | Circular FK, graph relinks |
Schema versioning: expand-then-contract
A relational schema is a contract between the database and every consumer. Adding a nullable column is non-breaking. Adding a NOT NULL column without a default breaks every existing INSERT. Dropping a column breaks every reader. Renaming breaks both.
Senior teams manage schema as code — migration files in version control (Flyway, Liquibase, golang-migrate, Prisma Migrate, sqlx) — and follow the expand-then-contract pattern for breaking changes:
- Expand: add the new shape alongside the old. New column nullable, no change to existing code.
- Deploy code that writes both. Application writes to both old and new column. Old column still readable.
- Backfill the new column for existing rows (in batches to avoid long locks).
- Deploy code that reads from new. Application reads from new column; falls back to old during transition.
- Validate parity. A check query compares old and new values; zero mismatches before proceeding.
- Contract: remove the old shape in a subsequent migration.
The pattern applies to column renames, type changes, and table splits. It adds one deployment cycle per breaking change; the alternative is downtime or silent data loss.
Why this works
Why not just run a migration in a maintenance window? For small tables, a maintenance window is fine. For a 100M-row table, ALTER TABLE can hold an exclusive lock for hours while it rewrites the table. Modern Postgres supports ALTER TABLE ... ADD COLUMN ... DEFAULT ... without a rewrite (as of Postgres 11), but column-type changes still require a rewrite. Online schema change tools (pg_repack, pgroll, schema-change-as-migration patterns) let you make breaking changes with zero downtime at the cost of operational complexity. Expand-then-contract is the code-level version of the same discipline.
Five production failure modes
These failures recur regardless of how carefully the schema was designed.
1. Implicit type coercion killing an index. A query WHERE user_id = '42' (string literal instead of integer) triggers an implicit cast on every row. Postgres cannot use the integer B-tree index because the comparison is between the indexed integer and a text value. EXPLAIN ANALYZE shows a sequential scan. Fix: use typed query builders or ORMs that bind the correct parameter type.
2. NULL semantics in joins causing silent row drops. A LEFT JOIN where the joined column contains NULL silently drops those rows in downstream inner JOINs. NULL != NULL (NULL is not equal to anything, including itself). Senior pattern: use IS DISTINCT FROM for null-safe comparison and explicit COALESCE for null-tolerant logic in WHERE clauses.
3. Cascade cascade cascade — unbounded transactions. A DELETE on a heavily-cascaded root row creates a multi-million-row transaction that holds locks for minutes. All writes to cascaded tables are blocked. Fix: soft-delete the root row (deleted_at = now()), then batch-clean the descendants with explicit DELETE loops limited to 10,000 rows at a time.
4. Schema drift via ad-hoc ALTER. A developer runs ALTER TABLE directly in production outside the migration system. The migration tool no longer matches reality; future migrations fail mysteriously or produce inconsistent state. Fix: every schema change goes through the migration system, period. CI enforces the rule.
5. JSONB columns that grew into relational shape. A “metadata” JSONB column accumulated 30 typed fields over two years. Each new feature adds IF metadata->>'feature_x' = .... The column now has schema-on-read with no enforcement. The right move is to extract columns — but the migration is expensive and keeps getting postponed. Fix: the “promote-to-column” rule: the moment a field appears in a WHERE clause, schedule a column-extraction migration.
A team needs to delete a workspace and all its data (projects, tasks, comments). There are 5 million comments cascaded from tasks. What is wrong with `DELETE FROM workspaces WHERE id = X` with CASCADE FKs?
A column needs to be renamed from `user_name` to `display_name` on a 50M-row table with zero downtime. Pick the approach.
A query `WHERE user_id = '42'` (string literal) on a column declared as BIGINT runs a sequential scan on a 30M-row table despite a B-tree index on user_id. Why?
- 01Describe the expand-then-contract pattern and explain why it is required for renaming a column on a large shared-schema table.
- 02What is the specific risk of DEFERRABLE INITIALLY DEFERRED constraints compared to IMMEDIATE constraints?
- 03Name three of the five production failure modes and give the fix for each.
State-transition enforcement belongs in triggers or RLS when the rule requires procedural logic; pure CHECK handles per-row invariants. DEFERRABLE INITIALLY DEFERRED delays FK checks until COMMIT — useful for circular dependencies, dangerous because violations abort the whole transaction. Schema versioning uses migration files in version control; breaking changes use expand-then-contract to avoid downtime. The five production failure modes (implicit coercion killing indexes, NULL semantics in joins, unbounded cascade transactions, schema drift via ad-hoc ALTER, JSONB sprawl) each have specific fixes that are cheaper than discovering them in production. Schema decisions are sticky — the cost of the wrong design is paid at migration time.
appears again in164
- Why GraphQL gets N+1junior
- DataLoader mechanics: tick-boundary batchingmiddle
- Batch function contracts: ordering, shapes, errorsmiddle
- Federation and lookahead: batching beyond DataLoadermiddle
- Query complexity defences: depth, cost, persisted queriesmiddle
- Senior GraphQL API: scheduling contract, tenant isolation, observabilitysenior
- Why idempotency: making retries safejunior
- Server-side state machine: four states of an idempotency keymiddle
- Outbox and inbox: effectively-once across the dual-write boundarymiddle
- Concurrency and cache architecture for idempotency at scalesenior
- Observability, production failures, and global-scale designsenior
- The event loop: one thread, three queuesjunior
- Tasks, microtasks, and scheduler.yield()middle
- Microtask starvation, Long Tasks, and LoAFsenior
- Node.js event loop: phases, nextTick, and loop lagsenior
- React, Vue, and INP observability in productionsenior
- The render pipeline: six stages from bytes to pixelsjunior
- Stage costs and the renderer process modelmiddle
- Invalidation, dirty bits, and containmiddle
- Compositor layers: promotion, overlap, and GPU memorymiddle
- DevTools flame strip and the frame lifecyclemiddle
- Layout thrash: forced synchronous layoutsenior
- BeginMainFrame, compositor-driven animations, and GPU memorysenior
- Production observability: LoAF, INP, and the full attack surfacesenior
- What V8 is and why performance varies 100×junior
- V8''''s four-tier JIT pipeline and profile-guided tieringmiddle
- Hidden classes, transition trees, and memory layoutmiddle
- Inline caches, IC states, and deoptimizationmiddle
- Orinoco GC: parallel scavenger, concurrent marking, and write barriersmiddle
- TurboFan''''s speculative engine and the deopt-loop trapsenior
- V8 in production: isolates, pointer compression, and real failuressenior
- What workers are and why they existjunior
- Web worker mechanics: dedicated, shared, and OffscreenCanvasmiddle
- Structured clone and transferablesmiddle
- Service worker lifecycle and cache strategiesmiddle
- SharedArrayBuffer, Atomics, and cross-origin isolationsenior
- Service worker edge cases: version skew, durability, and navigation trapssenior
- Worker pools, Comlink, and production observabilitysenior
- What the reconciler does: render vs commitjunior
- The fiber object and the double-buffer treemiddle
- Render phase purity and commit phase sub-stepsmiddle
- Reconciliation: diffing heuristics and the key trapmiddle
- Priority lanes, time-slicing, and useTransitionmiddle
- Bailout, memoisation, and tearingsenior
- React Profiler, the Compiler, and production observabilitysenior
- Rendering strategies: SSG, SSR, ISR, streaming, and hydrationjunior
- SSG, SSR, ISR, streaming, and RSC — how each worksmiddle
- Hydration cost: selective, progressive, islands, resumabilitymiddle
- Hydration mismatch: causes, detection, and the determinism rulesenior
- RSC, per-route strategy, and production observabilitysenior
- Core Web Vitals: what LCP, INP, and CLS measurejunior
- CLS: why layout shifts happen and how to stop themmiddle
- Metric tradeoffs, RUM attribution, and the CI+field loopsenior
- The full picture: URL to LCP to INP as a relay racejunior
- Eight layers traced: from the service worker to the second navigationmiddle
- Five canonical breaks: where production reliably diessenior
- The three-track method: reading traces and building a monitored systemsenior
- What is a cache stampede and why it makes things worsejunior
- Lock and single-flight: bounding concurrent rebuildsmiddle
- XFetch: coordination-free probabilistic early expirationmiddle
- Stale-while-revalidate and CDN request coalescingmiddle
- Detecting stampedes and designing TTL for productionmiddle
- Metastable failure, fencing tokens, and production postmortemssenior
- Raft roles, terms, and why majority quorums prevent split brainjunior
- How Raft replicates a log entry and decides it is safe to commitmiddle
- Raft leader election: timeouts, voting rules, and the four safety propertiesmiddle
- Raft in the real world: partitions, slow disks, and client routingmiddle
- Raft extensions: pre-vote, learners, snapshots, and linearizable readssenior
- Raft in production: membership changes, Multi-Raft, and observabilitysenior
- Where data fetching happens — and why it decides LCPjunior
- Fetch waterfalls — diagnosis and the Promise.all curemiddle
- React Server Components and Suspense streamingmiddle
- Client-side cache: TanStack Query, SWR, and stale-while-revalidatemiddle
- LCP, prefetch, and race conditions in interactive fetchingmiddle
- Senior internals: RSC payload, caching layers, and production failure modessenior
- The IP envelopejunior
- Reading the IP headermiddle
- The three-way handshakejunior
- Sequence numbers and connection statemiddle
- DNS: what it does and why it existsjunior
- The resolver walk: referrals, record types, and gluemiddle
- TTL, caching, and DNS propagationmiddle
- What TLS does and why it existsjunior
- The 1-RTT handshake: key shares and ECDHEmiddle
- Session resumption and 0-RTTmiddle
- Key schedule, SNI, ALPN, and extensionssenior
- 0-RTT defenses, ECH, hybrid PQ, and production TLSsenior
- WebSocket: the HTTP upgrade handshakejunior
- WebSocket frame format: opcodes, masking, fragmentationmiddle
- WebSocket backpressure: when clients can''''t keep upmiddle
- Reconnection: jittered backoff, thundering herd, message resumptionsenior
- WebSocket at scale: HTTP/2 multiplexing, permessage-deflate, C10Msenior
- WebSocket in production: proxies, security, and distributed architecturesenior
- What reverse proxies dojunior
- Health checks, connection draining, and slow startmiddle
- Session affinity, consistent hashing, and the right fixmiddle
- Retry storms, circuit breakers, and load sheddingsenior
- Resilient LB architecture: anycast, zone-aware routing, and observabilitysenior
- Why QUIC and not TCP+TLSjunior
- Connection IDs and network migrationmiddle
- 0-RTT resumption and packet encryptionsenior
- DDoS: what it is and why it worksjunior
- Amplification attacks and state exhaustionmiddle
- Rate limiting: algorithms and architecturemiddle
- WAFs, firewalls, mTLS, and HSTSmiddle
- DNS cache poisoning and BGP hijackingsenior
- Defense-in-depth architecture and attack economicssenior
- The twelve layers: one URL, seven actorsjunior
- DNS, TCP, TLS in sequence: where the milliseconds gomiddle
- Proxy intercepts and security gates: rate limiters, WAF, mTLSmiddle
- Alternate paths: QUIC 0-RTT, WebSocket upgrade, connection migrationmiddle
- Observability: distributed traces, USE/RED, and samplingsenior
- Resilience: cascading retries, circuit breakers, and error budgetssenior
- What the three signals are: logs, metrics, and tracesjunior
- Why structured logs exist: the diary vs the spreadsheetjunior
- The production log schema: fields every line must carrymiddle
- PII redaction and log injectionsenior
- OTel Logs Data Model and audit logs as a subsystemsenior
- What is OpenTelemetry: API, SDK, Collector, OTLPjunior
- OTel signals, Semantic Conventions, and the OTLP wire formatmiddle
- The OTel Collector: receivers, processors, exporters, and deployment patternsmiddle
- Vendor neutrality, eBPF instrumentation, the Operator, and browser/serverless OTelsenior
- Operating the OTel Collector: reliability, version skew, failure modes, and governancesenior
- SLI, SLO, and the error budget: reliability by the numbersjunior
- Error budget policy, latency SLOs, and composite journeysmiddle
- Production SLO failures, self-observability, security, and the big picturesenior
- What is trace propagation and why broken propagation is worse than nonejunior
- traceparent and tracestate: the W3C header format in fullmiddle
- Baggage and async boundaries: carrying context across queues and callbacksmiddle
- Async context per language, service mesh, B3 migration, and securitysenior
- Production propagation failures, span links, and platform designsenior
- The debugging funnel: SLO → RED → trace → profilejunior
- OTel architecture: one SDK, four signals, one wire formatmiddle
- The incident loop: from pager to postmortem to preventionmiddle
- Scale, security, and the ROI of observable systemssenior
- Cache lines, struct layout, and false sharingmiddle
- SIMD, SoA vs AoS, and memory bandwidthmiddle
- Cache-oblivious algorithms, PGO, and production failuressenior
- GC in production: observability, security, edge cases, and fleet governancesenior
- Batching: amortize fixed cost per operationjunior
- The batching window: size and wait timemiddle
- Batching in Kafka and Postgresmiddle
- io_uring and observability of batchingmiddle
- From Nagle to io_uring: evolution of batchingmiddle
- Backpressure, failure isolation, and batch security in productionsenior
- CI enforcement and RUM: making budgets stickmiddle
- V8 JIT pipeline, HTTP priorities, and bundle securitysenior
- The performance loop: discipline, not a projectjunior
- Classify and fix: matching bottleneck families to remediesmiddle
- Observability stack and CI gates: catching regressions before they shipmiddle
- Incident to enforcement: SLO burn to verified fix in 35 minutesmiddle
- Culture, economics, and org-scale performancesenior
- At-most-once, at-least-once, exactly-once: the three delivery contractsjunior
- The three failure legs — where duplicates and losses actually happenmiddle
- Consumer-side dedup: the cheapest path to exactly-once processingmiddle
- Kafka exactly-once semantics: idempotent producer and transactionsmiddle
- SQS visibility timeout, DLQ, and the outbox patternmiddle
- Exactly-once in production: impossibility proof, hybrid patterns, and real incidentssenior
- What OAuth is and why passwords are not the answerjunior
- Authorization code flow with PKCEmiddle
- ID token validation and JWKS cache managementmiddle
- Refresh token rotation and scope-based least privilegemiddle
- Sender-constrained tokens: DPoP and mTLSsenior
- OAuth in production: audience attacks, observability, and real failuressenior