Data Engineering
Parquet: free-recall review
Retrieval beats re-reading. For each prompt, say or write a full answer from memory before you open the model answer — the effort of recall is what makes the layout decisions stick.
Reconstruct the unit’s core mechanisms — columnar layout, footer-driven pushdown, the encoding/compression split, row-group sizing, schema evolution, and what table formats add — without looking back at the lesson.
- 01. Explain end to end why a filtered, projected query on Parquet reads far less than the same query on CSV.
- 02. Describe the physical nesting inside a Parquet file, from the file down to the encoded values.
- 03. How do encoding and compression differ in Parquet, and why keep them mentally separate?
- 04. What is the small-files problem, why does it cripple query planning, and how do table formats help?
- 05. How do you choose a row-group size, and what goes wrong at each extreme?
- 06. Why is schema evolution a trap with raw Parquet, and how do table formats make it safe?
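Once you have attempted the encoding-versus-compression prompt from memory, a toy sketch like the one below may help you check your answer. It is not Parquet's actual byte layout: the helper names (dictionary_encode, run_length_encode, decode) and the country-code column are hypothetical, and zlib merely stands in for whichever page codec a writer is configured with. It only illustrates that encoding is a type-aware restructuring of values, while compression is a generic pass over the resulting bytes.

```python
import zlib


def dictionary_encode(values):
    """Structural step: replace each value with an index into a small dictionary."""
    dictionary = sorted(set(values))
    index = {value: i for i, value in enumerate(dictionary)}
    return dictionary, [index[value] for value in values]


def run_length_encode(indices):
    """Structural step: collapse consecutive repeats into [index, run_length] pairs."""
    runs = []
    for i in indices:
        if runs and runs[-1][0] == i:
            runs[-1][1] += 1
        else:
            runs.append([i, 1])
    return runs


def decode(dictionary, runs):
    """Invert both structural steps to recover the original values."""
    values = []
    for i, length in runs:
        values.extend([dictionary[i]] * length)
    return values


# Hypothetical low-cardinality column, already sorted so the runs are long.
column = ["DE"] * 5000 + ["FR"] * 3000 + ["US"] * 2000

raw_bytes = "\n".join(column).encode()

# Layer 1: encoding (knows about values and their repetition).
dictionary, indices = dictionary_encode(column)
runs = run_length_encode(indices)
encoded_bytes = repr((dictionary, runs)).encode()

# Layer 2: compression (sees only bytes; zlib is an assumption for illustration).
compressed_bytes = zlib.compress(encoded_bytes)

print("raw bytes:       ", len(raw_bytes))
print("encoded bytes:   ", len(encoded_bytes))
print("compressed bytes:", len(compressed_bytes))
```

Shuffling the column before encoding shortens the runs and weakens the encoding layer without changing the codec, which is one way to see that the two layers succeed and fail independently.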
If you could reconstruct each answer from memory, you hold the unit’s spine. Parquet is columnar and self-describing, so column pruning and predicate pushdown read only what a query needs, but pushdown only skips data when rows are clustered by the filter columns, because otherwise the min/max statistics of nearly every row group overlap the filter. A file holds row groups, each row group holds one column chunk per column, and each column chunk is split into pages. Each page is encoded (a structural, type-aware layer) and then compressed (a byte codec): two separate wins with separate failure modes. Row-group size is a real knob with bad extremes both ways, and the small-files problem is fixed by compaction. Because raw Parquet has no transactions or stable schema identity, table formats wrap it with a manifest for ACID, safe schema evolution, time travel, and file-level pruning.
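The clustering caveat can be made concrete with a small simulation. This is a minimal sketch, not a Parquet reader: make_row_groups and groups_to_read are hypothetical helpers, the integer column stands in for any filter column, and each dictionary mimics the per-row-group min/max statistics a footer would carry. The same rows are laid out twice, once sorted by the filter column and once in arbitrary order, and the same range filter is applied to both.

```python
import random


def make_row_groups(values, rows_per_group):
    """Split values into row groups and record the min/max a footer would store."""
    groups = []
    for start in range(0, len(values), rows_per_group):
        chunk = values[start:start + rows_per_group]
        groups.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return groups


def groups_to_read(groups, lo, hi):
    """Keep only row groups whose [min, max] range overlaps the filter [lo, hi]."""
    return [g for g in groups if g["max"] >= lo and g["min"] <= hi]


random.seed(0)
values = list(range(100_000))   # hypothetical filter column
shuffled = values[:]
random.shuffle(shuffled)

clustered = make_row_groups(values, 10_000)     # rows sorted by the filter column
scattered = make_row_groups(shuffled, 10_000)   # same rows, arbitrary order

# Filter: 42_000 <= value <= 42_999
read_clustered = len(groups_to_read(clustered, 42_000, 42_999))
read_scattered = len(groups_to_read(scattered, 42_000, 42_999))

print(read_clustered, "of", len(clustered), "row groups read when clustered")
print(read_scattered, "of", len(scattered), "row groups read when scattered")
```

With clustered data the filter touches one row group out of ten. With scattered data every row group spans almost the whole value range, so the statistics rule nothing out and all ten are read.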