Data Engineering DATA · 03 · 08

Parquet: free-recall review

Free-recall prompts across the Parquet unit — columnar layout, footer pushdown, encoding vs compression, row-group sizing, schema evolution, and table formats.

DATA Senior ◷ 14 min

Retrieval beats re-reading. For each prompt, say or write a full answer from memory before you open the model answer — the effort of recall is what makes the layout decisions stick.

Goal

Reconstruct the unit’s core mechanisms — columnar layout, footer-driven pushdown, the encoding/compression split, row-group sizing, schema evolution, and what table formats add — without looking back at the lesson.

Recall before you leave

01. Explain end to end why a filtered, projected query on Parquet reads far less than the same query on CSV.
02. Describe the physical nesting inside a Parquet file, from the file down to the encoded values.
03. How do encoding and compression differ in Parquet, and why keep them mentally separate?
04. What is the small-files problem, why does it cripple query planning, and how do table formats help?
05. How do you choose a row-group size, and what goes wrong at each extreme?
06. Why is schema evolution a trap with raw Parquet, and how do table formats make it safe?
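
Once you have attempted prompt 03 from memory, a toy model can help you check the encoding/compression split. The sketch below is not Parquet's actual on-disk format: `dictionary_encode` and `run_length_encode` are hypothetical helpers that imitate the idea of a type-aware encoding layer, and `zlib` from the Python standard library stands in for whatever byte codec a real writer would apply to the page afterwards.

```python
import zlib

def dictionary_encode(values):
    """Map each distinct value to a small integer id (toy dictionary encoding)."""
    dictionary = {}
    ids = []
    for v in values:
        ids.append(dictionary.setdefault(v, len(dictionary)))
    return list(dictionary), ids

def run_length_encode(ids):
    """Collapse consecutive repeats into [value, run_length] pairs (toy RLE)."""
    runs = []
    for i in ids:
        if runs and runs[-1][0] == i:
            runs[-1][1] += 1
        else:
            runs.append([i, 1])
    return runs

# A low-cardinality, well-clustered column: the best case for encoding.
column = ["DE"] * 4000 + ["FR"] * 3000 + ["US"] * 3000

# Step 0: the plain values, as a CSV-like byte string.
raw = "\n".join(column).encode()

# Step 1: encoding. Structural and type-aware; it knows these are repeated values.
dictionary, ids = dictionary_encode(column)
runs = run_length_encode(ids)
encoded = repr((dictionary, runs)).encode()  # stand-in for a page's encoded bytes

# Step 2: compression. A generic byte codec that knows nothing about columns.
compressed_raw = zlib.compress(raw)
compressed_encoded = zlib.compress(encoded)

print("raw bytes:              ", len(raw))
print("encoded bytes:          ", len(encoded))
print("compressed raw bytes:   ", len(compressed_raw))
print("compressed encoded bytes:", len(compressed_encoded))
```

Try shuffling `column` before encoding: the dictionary stays the same size, but the run-length step stops helping. That is one way to see that the two layers win, and fail, for different reasons.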

Recap

If you could reconstruct each answer from memory, you hold the unit’s spine. Parquet is columnar and self-describing, so column pruning reads only the columns a query projects, and predicate pushdown uses footer statistics to skip row groups that cannot match, though skipping only pays off when the data is clustered by the filter columns. The file nests from file to row group to column chunk to page, and each page is encoded (a structural, type-aware layer) and then compressed (a byte codec): two separate wins with separate failure modes. Row-group size is a real knob with bad extremes both ways. The small-files problem is fixed by compaction. Because raw Parquet has no transactions or stable schema identity, table formats wrap it with a manifest that provides ACID commits, safe schema evolution, time travel, and file-level pruning. When you next open a slow lake table in production, you will reach for footer stats, clustering, and the access path first, not hardware.
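
The clustering caveat in the recap can be seen in a small simulation. This is a minimal sketch in plain Python, not a real Parquet reader: `row_groups`, `footer_stats`, and `groups_to_read` are hypothetical helpers, and the row-group size of 10,000 rows is an arbitrary value chosen for illustration. The model keeps only a min/max pair per row group, which is the kind of statistic a reader consults to decide what it can skip.

```python
import random

def row_groups(values, size):
    """Split a column into fixed-size row groups (hypothetical helper)."""
    return [values[i:i + size] for i in range(0, len(values), size)]

def footer_stats(groups):
    """Model per-row-group min/max statistics for one column."""
    return [(min(g), max(g)) for g in groups]

def groups_to_read(stats, lo, hi):
    """Count row groups whose [min, max] range overlaps the filter lo <= x <= hi."""
    return sum(1 for mn, mx in stats if mx >= lo and mn <= hi)

random.seed(0)
clustered = list(range(100_000))   # sorted by the filter column
shuffled = clustered[:]
random.shuffle(shuffled)           # same values, no clustering

SIZE = 10_000                      # assumed row-group size, in rows
LO, HI = 42_000, 42_999            # the query's filter range

clustered_hits = groups_to_read(footer_stats(row_groups(clustered, SIZE)), LO, HI)
shuffled_hits = groups_to_read(footer_stats(row_groups(shuffled, SIZE)), LO, HI)

print("row groups read, clustered:", clustered_hits)
print("row groups read, shuffled: ", shuffled_hits)
```

With sorted data the filter range falls inside a single row group. With shuffled data every row group's min/max range spans nearly the whole domain, so the statistics cannot rule anything out and the reader has to touch all of them.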
