Data Engineering
Materialized views: trading staleness and storage for read latency
The revenue dashboard times out. A 14-second aggregate over a 400M-row fact table runs on every page load, and the board meeting is in an hour. Someone wraps it in a materialized view: load drops to 40ms, everyone cheers. Three weeks later finance escalates — the dashboard has been showing yesterday’s numbers all morning. The nightly REFRESH MATERIALIZED VIEW cron silently failed on a lock timeout six days ago, and nothing alerted. The view was never stale because it was correct; it was stale because nobody owned the refresh.
Plain view vs materialized view
A plain view is a saved query — a name for a SELECT. Every read re-executes the full statement against live base tables, so it is always fresh and stores nothing. A materialized view executes the query once, writes the result rows to disk as a real relation, and serves reads from that stored copy. Reads become a cheap table scan or index lookup instead of a 14-second join-and-aggregate. That is the entire value proposition: you pay storage and compute up front, once so that many reads are cheap.
The tradeoff is correctness. A plain view can never be stale. A materialized view is stale the instant a base row changes after the last refresh, and stays wrong until you refresh again. So the question stops being “view or materialized view” and becomes the only question that matters: how, and how often, do you refresh — and what staleness window can the consumer tolerate?
| Property | Plain view | Materialized view |
|---|---|---|
| Read cost | Full query every read | Scan/lookup on stored rows |
| Freshness | Always current | As of last refresh — can be stale |
| Storage | None | Full result set on disk |
| Write impact | None | Refresh cost (full) or per-write delta (incremental) |
Full refresh: simple, correct, and a lock you will feel
The default in Postgres is REFRESH MATERIALIZED VIEW mv. It re-runs the defining query from scratch, recomputing every row. It is dead simple and always converges to the right answer — and it takes an ACCESS EXCLUSIVE lock for the whole duration. While that lock is held, every read of the view blocks. Refresh a heavy aggregate that takes 90 seconds and you have just made the dashboard unavailable for 90 seconds, exactly when someone is loading it.
The escape hatch is REFRESH MATERIALIZED VIEW CONCURRENTLY. It builds the new result in the background and swaps it in with row-level diffs, so readers keep seeing the old version until the swap completes — no read-blocking lock. But it has hard requirements and costs a senior must know: it needs at least one UNIQUE index on the view (column names only, no expressions, no WHERE), it is generally slower than a plain refresh because it computes the delta against the existing rows, and only one refresh can run against a given view at a time. It also still waits for in-flight transactions to drain before it starts, so a long-running reporting query can stall the refresh indefinitely.
Why this works
CONCURRENTLY “doesn’t block reads” is true but easy to over-trust. It blocks itself — a second refresh queues behind the first — and it waits on old transactions to finish before starting. Under a steady stream of long analytical queries, the refresh can sit waiting so long that you accumulate a backlog of refreshes that never catch up: a refresh that, in effect, never finishes.
Incremental refresh: pay per write, not per read
Full refresh recomputes everything even if one row changed. Incremental view maintenance (IVM) instead computes only the delta — the rows affected by recent writes — and applies it to the stored view. The cost moves from “expensive periodic recompute” to “small cost on every write to a base table.”
Vanilla Postgres has no native incremental refresh — this surprises people every time. The extension pg_ivm fills the gap: create_immv() populates a real table and installs AFTER triggers on every base table in the query. From then on, each INSERT/UPDATE/DELETE fires a trigger that computes and applies the delta inside the same transaction. The win is near-zero staleness on the views you choose. The cost is exactly inverted from full refresh: every base-table write is now slightly slower, the maintenance holds locks that throttle write throughput, and there are SQL restrictions on what the view can contain. The rule of thumb from the project itself: IVM pays off when the base table is read-heavy and modified infrequently; under a high write rate the per-write overhead dominates and you are better off with a scheduled full refresh.
A Postgres dashboard MV aggregates a fact table that ingests ~5,000 inserts/sec all day; analysts read it constantly and tolerate ~5 minutes of staleness. Pick the refresh strategy.
Streaming materialized views: maintained, not refreshed
Outside Postgres, a different model removes the refresh decision entirely: the view is maintained continuously as data arrives. ClickHouse’s incremental MV behaves like an AFTER INSERT trigger on the left-most source table — each inserted block of rows is transformed and written to the target table at insert time, reading only the new rows from RAM, never the disk. The catch a senior must internalize: it only ever sees the block being inserted. It does not process data that existed before the view was created (you need POPULATE for that), and aggregation only groups rows within the same insert block — split a group across two blocks and they are not merged unless your target engine (e.g. AggregatingMergeTree) reconciles them at merge time.
Purpose-built streaming databases push this further. Materialize compiles SQL into a differential dataflow (from Microsoft Research’s Naiad) that operates on changes as (data, time, diff) tuples and does work proportional to the size of the change, not the size of the dataset — maintaining complex joins and aggregations at millisecond latency with strict serializability. The conceptual shift: there is no “refresh” job to schedule, monitor, or have fail silently. The view is always consistent with its inputs by construction. You trade the operational burden of refresh for a different system to run and a different consistency model to reason about — eventually-consistent reads in some setups can surprise users who expect read-your-writes.
Why does plain REFRESH MATERIALIZED VIEW (without CONCURRENTLY) make a dashboard unavailable during refresh?
When is pg_ivm (incremental maintenance) the wrong choice?
Order the decisions a senior makes when reaching for a materialized view:
- 1 Confirm the read is genuinely expensive and repeated — otherwise a plain view or index is enough
- 2 Decide the staleness window the consumer can actually tolerate (seconds? minutes? a day?)
- 3 Pick refresh: tolerant of minutes → scheduled REFRESH CONCURRENTLY; needs fresh-now + low writes → incremental (pg_ivm)
- 4 If CONCURRENTLY, add the required UNIQUE index on the view
- 5 Add monitoring + alerting on the refresh job so a silent failure cannot serve stale data unnoticed
- 01A teammate proposes a materialized view to speed up a slow report and plans to REFRESH it nightly. What three risks do you raise before approving?
- 02Explain the difference between full refresh, incremental maintenance (pg_ivm), and a streaming materialized view, and what each one optimizes for.
A materialized view stores the result of an expensive query so that many reads become cheap scans instead of repeated recomputation — you pay storage and compute once, up front, to amortize read latency. Unlike a plain view, which re-runs on every read and is therefore always fresh, a materialized view is stale the moment a base row changes, so the only decision that matters is refresh. Full refresh is simple and correct but takes an ACCESS EXCLUSIVE lock that blocks reads; REFRESH … CONCURRENTLY avoids the read-block at the cost of a required unique index, being slower, and waiting on old transactions. Postgres has no native incremental refresh, so pg_ivm uses triggers to apply per-write deltas — great for read-heavy, low-write tables, ruinous under high write rates. Streaming materialized views (ClickHouse on-insert, Materialize’s differential dataflow) sidestep the refresh decision by maintaining the view continuously, trading the refresh job for a new system and consistency model. The failure mode that recurs across all of them: an unmonitored refresh that fails silently serves stale data while everything looks green, so own the refresh, bound the staleness window, and alert on it.