Data Engineering DATA · 04 · 01

Materialized views: trading staleness and storage for read latency

A materialized view stores the result of an expensive query instead of recomputing it each read. The real decision is refresh: full recompute locks readers, incremental needs an extension, and a stale refresh silently serves wrong aggregates.

DATA Junior ◷ 16 min

Level

FoundationsJuniorMiddleSenior

The revenue dashboard times out. A 14-second aggregate over a 400M-row fact table runs on every page load, and the board meeting is in an hour. Someone wraps it in a materialized view: load drops to 40ms, everyone cheers. Three weeks later finance escalates — the dashboard has been showing yesterday’s numbers all morning. The nightly REFRESH MATERIALIZED VIEW cron silently failed on a lock timeout six days ago, and nothing alerted. The view was never stale because it was correct; it was stale because nobody owned the refresh.

Plain view vs materialized view

In ten minutes you will know exactly why a refresh failing silently is more dangerous than a query running slowly — and how to make it visible before finance notices.

A plain view is a saved query — a name for a SELECT. Every read re-executes the full statement against live base tables, so it is always fresh and stores nothing. A materialized view executes the query once, writes the result rows to disk as a real relation, and serves reads from that stored copy. Reads become a cheap table scan or index lookup instead of a 14-second join-and-aggregate. That is the entire value proposition: you pay storage and compute up front, once so that many reads are cheap.

The tradeoff is correctness. A plain view can never be stale. A materialized view is stale the instant a base row changes after the last refresh, and stays wrong until you refresh again. So the question stops being “view or materialized view” and becomes the only question that matters: how, and how often, do you refresh — and what staleness window can the consumer tolerate?

Property	Plain view	Materialized view
Read cost	Full query every read	Scan/lookup on stored rows
Freshness	Always current	As of last refresh — can be stale
Storage	None	Full result set on disk
Write impact	None	Refresh cost (full) or per-write delta (incremental)

REFRESH scans the base tables and rewrites the stored result; reads hit the materialized view, not the live query. Writes that arrive after the last refresh land in the base tables but are not yet in the view — that lag is the staleness gap.

Full refresh: simple, correct, and a lock you will feel

When you first reach for a materialized view, the refresh command looks trivial — one line in a cron job. What surprises engineers is what that line holds locked while it runs.

The default in Postgres is REFRESH MATERIALIZED VIEW mv. It re-runs the defining query from scratch, recomputing every row. It is dead simple and always converges to the right answer — and it takes an ACCESS EXCLUSIVE lock for the whole duration. While that lock is held, every read of the view blocks. Refresh a heavy aggregate that takes 90 seconds and you have just made the dashboard unavailable for 90 seconds, exactly when someone is loading it.

The escape hatch is REFRESH MATERIALIZED VIEW CONCURRENTLY. It builds the new result in the background and swaps it in with row-level diffs, so readers keep seeing the old version until the swap completes — no read-blocking lock. But it has hard requirements and costs a senior must know: it needs at least one UNIQUE index on the view (column names only, no expressions, no WHERE), it is generally slower than a plain refresh because it computes the delta against the existing rows, and only one refresh can run against a given view at a time. It also still waits for in-flight transactions to drain before it starts, so a long-running reporting query can stall the refresh indefinitely.

▸Why this works

CONCURRENTLY “doesn’t block reads” is true but easy to over-trust. It blocks itself — a second refresh queues behind the first — and it waits on old transactions to finish before starting. Under a steady stream of long analytical queries, the refresh can sit waiting so long that you accumulate a backlog of refreshes that never catch up: a refresh that, in effect, never finishes.

Incremental refresh: pay per write, not per read

What if you could make the view cheaper to keep fresh without paying the full query cost every time? That is the question incremental maintenance answers — and why it sounds perfect until you look at what it does to your write path.

Full refresh recomputes everything even if one row changed. Incremental view maintenance (IVM) instead computes only the delta — the rows affected by recent writes — and applies it to the stored view. The cost moves from “expensive periodic recompute” to “small cost on every write to a base table.”

Vanilla Postgres has no native incremental refresh — this surprises people every time. The extension pg_ivm fills the gap: create_immv() populates a real table and installs AFTER triggers on every base table in the query. From then on, each INSERT/UPDATE/DELETE fires a trigger that computes and applies the delta inside the same transaction. The win is near-zero staleness on the views you choose. The cost is exactly inverted from full refresh: every base-table write is now slightly slower, the maintenance holds locks that throttle write throughput, and there are SQL restrictions on what the view can contain. The rule of thumb from the project itself: IVM pays off when the base table is read-heavy and modified infrequently; under a high write rate the per-write overhead dominates and you are better off with a scheduled full refresh.

Pick the best fit

A Postgres dashboard MV aggregates a fact table that ingests ~5,000 inserts/sec all day; analysts read it constantly and tolerate ~5 minutes of staleness. Pick the refresh strategy.

Streaming materialized views: maintained, not refreshed

Outside Postgres, a different model removes the refresh decision entirely: the view is maintained continuously as data arrives. ClickHouse’s incremental MV behaves like an AFTER INSERT trigger on the left-most source table — each inserted block of rows is transformed and written to the target table at insert time, reading only the new rows from RAM, never the disk. The catch a senior must internalize: it only ever sees the block being inserted. It does not process data that existed before the view was created (you need POPULATE for that), and aggregation only groups rows within the same insert block — split a group across two blocks and they are not merged unless your target engine (e.g. AggregatingMergeTree) reconciles them at merge time.

Purpose-built streaming databases push this further. Materialize compiles SQL into a differential dataflow (from Microsoft Research’s Naiad) that operates on changes as (data, time, diff) tuples and does work proportional to the size of the change, not the size of the dataset — maintaining complex joins and aggregations at millisecond latency with strict serializability. The conceptual shift: there is no “refresh” job to schedule, monitor, or have fail silently. The view is always consistent with its inputs by construction. You trade the operational burden of refresh for a different system to run and a different consistency model to reason about — eventually-consistent reads in some setups can surprise users who expect read-your-writes.

Quiz

Why does plain REFRESH MATERIALIZED VIEW (without CONCURRENTLY) make a dashboard unavailable during refresh?

Quiz

When is pg_ivm (incremental maintenance) the wrong choice?

Order the steps

Order the decisions a senior makes when reaching for a materialized view:

1 Confirm the read is genuinely expensive and repeated — otherwise a plain view or index is enough
2 Decide the staleness window the consumer can actually tolerate (seconds? minutes? a day?)
3 Pick refresh: tolerant of minutes → scheduled REFRESH CONCURRENTLY; needs fresh-now + low writes → incremental (pg_ivm)
4 If CONCURRENTLY, add the required UNIQUE index on the view
5 Add monitoring + alerting on the refresh job so a silent failure cannot serve stale data unnoticed

Recall before you leave

01
A teammate proposes a materialized view to speed up a slow report and plans to REFRESH it nightly. What three risks do you raise before approving?
02
Explain the difference between full refresh, incremental maintenance (pg_ivm), and a streaming materialized view, and what each one optimizes for.

Recap

A materialized view stores the result of an expensive query so that many reads become cheap scans instead of repeated recomputation — you pay storage and compute once, up front, to amortize read latency. Unlike a plain view, which re-runs on every read and is therefore always fresh, a materialized view is stale the moment a base row changes, so the only decision that matters is refresh. Full refresh is simple and correct but takes an ACCESS EXCLUSIVE lock that blocks reads; REFRESH … CONCURRENTLY avoids the read-block at the cost of a required unique index, being slower, and waiting on old transactions. Postgres has no native incremental refresh, so pg_ivm uses triggers to apply per-write deltas — great for read-heavy, low-write tables, ruinous under high write rates. Streaming materialized views (ClickHouse on-insert, Materialize’s differential dataflow) sidestep the refresh decision by maintaining the view continuously, trading the refresh job for a new system and consistency model. The failure mode that recurs across all of them: an unmonitored refresh that fails silently serves stale data while everything looks green, so own the refresh, bound the staleness window, and alert on it. Now when you see a dashboard that “just started showing wrong numbers,” your first question is not “what changed in the data?” — it is “when did the refresh last succeed?”

Practice

Start at the top. Tasks go easiest → hardest: recall a fact, apply it to a case, then a senior-level stretch. Open one, attempt it, then reveal.

recallapplystretch0 of 5 done

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.

Apply this

Put this lesson to work on a real build.

Idempotent ETL PipelinePipelines don't fail gracefully — they fail at 3 a.m., halfway through a load, and someone re-runs them. This project teaches the one property that separates a hobby script from production data engineering: a run you can repeat any number of times and still land exactly one copy of each row. You'll build batch ingestion, an idempotent load, a watermark for incremental pulls, and the data-quality gates that stop bad data before it poisons everything downstream.