Databases DB · 01 · 03

Normal forms, denormalization, and why schemas stick

1NF through BCNF in plain terms, when to deliberately denormalize, the ORM traps that quietly break schema discipline, and naming conventions that pay back across years.

DB Middle ◷ 16 min

Level

FoundationsJuniorMiddleSenior

A junior engineer stores the customer’s city on every order row — “it’s convenient for reports.” Two years later, the customer moves cities. Now half the historical orders show the old city and half show the new one. The same fact lives in two places and they disagree. Normalization is the discipline that prevents this.

Normal forms in plain terms

Normalization is a series of progressively stricter rules for splitting tables so no fact is stored twice. By the end of this section you will be able to spot a violation on sight and know exactly which table the offending column belongs in.

1NF — atomic cells. Every cell must hold a single value. No comma-separated lists inside a column, no repeating groups, no arrays disguised as column names (tag1, tag2, tag3). If you can parse a cell with split(','), the table violates 1NF.

2NF — no partial key dependency. Every non-key column must depend on the whole primary key, not part of it. Only matters for composite primary keys. Example violation: a table order_items(order_id, product_id, product_name) — product_name depends on product_id alone, not on the composite key. Fix: move product_name to a products table.

3NF — no transitive dependency. Every non-key column must depend only on the primary key, not on another non-key column. Example violation: orders(order_id, customer_id, customer_city) — customer_city depends on customer_id, not on order_id. Fix: move customer_city to a customers (or addresses) table.

BCNF — every determinant is a candidate key. Stricter than 3NF; matters when a table has multiple overlapping candidate keys. Rare in production; common in academic examples. Violation example: two candidate keys share an attribute and one is not a superkey. In practice, target 3NF; go to BCNF only if you can state the specific overlapping-candidate-key scenario.

Normal form	Rule	Violation shape	Fix
1NF	Atomic cells, no repeating groups	Comma-separated list in one column	Separate table or typed array
2NF	Full key dependency	Non-key col depends on part of composite PK	Move to the owning table
3NF	No transitive dependency	Non-key col depends on another non-key col	Extract to its own table
BCNF	Every determinant is a candidate key	Multiple overlapping candidate keys	Decompose; rare in practice

Together, 1NF through BCNF form a progression: each form is strictly stronger than the one before. In practice, 3NF is the target — beyond that the decomposition cost outweighs the correctness gain. The point is not following rules — it is eliminating the failure mode where two rows can disagree about the same fact. If you can update Anna’s email in three tables, two will be wrong within a year.

Denormalization: when and why

Normalization optimises for write correctness. It can be expensive on the read side because answering a question requires joining many tables. Denormalization deliberately duplicates data to avoid joins on hot read paths.

Stored aggregates. An order_total_cents column on orders that caches SUM(order_items.unit_price_cents * quantity). Refreshed by a trigger or application code on each line-item write. Reporting hits a single column instead of aggregating millions of rows.

Wide read models. A user_profile table that combines name, email, current address, latest order count — fed by triggers or application code from the normalised source tables. Dashboard queries hit one row.

Materialised paths. Storing the full ancestry path of a tree node as a text column (/root/section/subsection/) to avoid recursive joins when reading hierarchies. Classic pattern for comment trees or org charts.

The discipline: denormalize on the read side, keep the authoritative facts normalised, and accept that the duplicates must be reconciled by something — a trigger, a scheduled job, application code. The failure mode is denormalize-everywhere with no reconciliation plan. Every column eventually disagrees with every other.

The whole normalize-vs-denormalize choice in one frame: normalization moves the cost to read time and keeps facts un-duplicated; a stored aggregate moves the cost to write time (~1-5 µs/write) and trades a reconciliation obligation for an O(1) dashboard read once the live SUM crosses ~10M rows.

▸Why this works

Why 3NF and not 4NF, 5NF, or DKNF? Normal forms above 3NF address multi-valued dependencies and join dependencies that appear in highly academic schemas. Real production tables almost never exhibit those shapes. 3NF is the point of diminishing returns: beyond it, you split tables that do not need splitting, add joins that do not add correctness, and slow reads for schemas that had no integrity problem to begin with.

ORM traps that quietly break schema discipline

You might be writing perfect 3NF schemas by hand — but if your team uses an ORM, the generated DDL can silently undo that discipline. These three traps appear in nearly every ORM-managed codebase at some point.

ORMs abstract away SQL but can introduce schema-level mistakes.

N+1 queries. An ORM loads a list of 100 orders, then for each order fetches the customer separately. Result: 101 queries where 1 join would do. In many ORMs this is the default unless you declare an eager-load (include, with, select_related). The schema implication: every FK relation must be annotated with the correct default loading strategy; leaving it to implicit lazy-load accumulates invisible N+1s.

Implicit schema generation. Many ORMs can generate the schema from the model definition. The generated schema often lacks: CHECK constraints, composite indexes, index on the FK column, named constraints. The result is a syntactically valid schema that violates the discipline of this lesson — no constraints, no format enforcement, generic index strategy. Treat ORM-generated schemas as a starting point, not the finished artifact; add constraints and indexes manually or via migration files.

Missing updated_at refresh. ORMs frequently auto-manage created_at but not updated_at. A missing updated_at makes incremental replication and auditing impossible. Enforce via a trigger (BEFORE UPDATE ... SET updated_at = now()) or ORM lifecycle hook; never trust the application to remember.

Naming conventions that pay back

These look pedantic; they pay back across every dashboard, every migration, every cross-team query.

Plural table names (users, orders, order_items) — singular is also defensible; pick one and enforce it everywhere.
snake_case — Postgres folds unquoted identifiers to lower case; snake_case is the production default.
id as the primary key name on every table; foreign keys named <referenced_table_singular>_id (e.g., user_id, order_id).
created_at, updated_at, deleted_at for temporal columns. deleted_at IS NOT NULL is the soft-delete pattern.
Boolean columns prefixed is_, has_, can_ or named as a clear property (active, not flag).
CHECK constraints naming — name every constraint: CONSTRAINT chk_orders_status CHECK (status IN (...)). Unnamed constraints produce cryptic error messages on violation.

Mature platforms CI-lint these conventions so they are enforced across teams, not remembered by individuals.

Order the steps

Order these schema evolution actions from cheapest to most expensive:

1 Add a CHECK constraint on an existing column
2 Add a UNIQUE index (CREATE INDEX CONCURRENTLY)
3 Add a FOREIGN KEY (requires validation pass over existing rows)
4 Change a column type (TEXT → INTEGER) — table rewrite
5 Split a JSONB column into typed columns — backfill migration
6 Split one table into two with FK — dual-write window, parity validation
7 Change a primary key — cascades through every FK and index; almost never done

Quiz

A table stores `orders(order_id, customer_id, customer_city)`. Which normal form does it violate and why?

Pick the best fit

A reporting dashboard needs the total revenue per workspace, recalculated every 5 minutes. Pick the implementation.

Quiz

An ORM-generated schema has no named CHECK constraints and no index on FK columns. What are the two concrete failure modes this causes in production?

customer_city depended on customer_id, not on order_id — a transitive dependency. Moving it to customers stores the fact once, so two order rows can never disagree about a customer's city.

Recall before you leave

01
Explain in your own words why 3NF is the production-default normal form, and when teams legitimately go above (BCNF) or below (deliberate denormalization).
02
Name three concrete ORM traps that quietly violate schema discipline and the fix for each.
03
What is the materialised-path pattern and when does it beat recursive CTEs for tree queries?

Recap

Normal forms eliminate redundant facts: 1NF (atomic cells), 2NF (full-key dependency), 3NF (no transitive dependency), BCNF (every determinant is a candidate key). Target 3NF for production schemas. Denormalize deliberately — stored aggregates, wide read models, materialised paths — with explicit reconciliation; never denormalize without a plan for keeping the copies consistent. ORM-generated schemas skip constraints and indexes; add them. Naming conventions (snake_case, plural tables, id PK, created_at/updated_at/deleted_at, named CHECK constraints) are enforced by CI. Schema decisions are sticky; the cost of the wrong normal form is paid at migration time, not design time. Now when you see a column that stores the same fact as another table, ask which table is the authority — and put the fact there exactly once.

Practice

Start at the top. Tasks go easiest → hardest: recall a fact, apply it to a case, then a senior-level stretch. Open one, attempt it, then reveal.

recallapplystretch0 of 7 done

Connected lessons

builds on

Constraints, keys, and Postgres data typesmiddle

unlocks

JSONB, arrays, and when a side table winsmiddle

deepens into

JSONB, arrays, and when a side table winsmiddle

appears again in166

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.