
Parquet: multiple-choice review

Multiple-choice synthesis across the Parquet unit — columnar layout, footer stats and pushdown, encoding vs compression, row-group sizing, and the small-files trap.

Senior · 13 min

Six questions that cut across the whole unit. Each one is a decision you make when designing a lake table — not a definition to recite, but a tradeoff to weigh against a real query and a real cost.

Goal

Confirm you can connect columnar layout, footer statistics, encoding versus compression, row-group sizing, and the production traps — the synthesis the lesson built toward.

Quiz

A 40-column events table is queried with SELECT user_id, country FROM events WHERE day = '2024-02-01'. Two independent Parquet mechanisms make this far cheaper than CSV. What are they?
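To reason about the cost, it can help to model the bytes a reader touches. The sketch below is a toy model, not a real Parquet reader: the RowGroup class, the 1 MB chunk size, the ten-day layout, and the filler column names c0..c36 are all assumptions made for illustration. It counts bytes for a full scan versus a read that fetches only the needed column chunks in row groups whose day range can contain the target.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    day_min: str       # min of the day column, as stored in footer stats
    day_max: str       # max of the day column, as stored in footer stats
    chunk_bytes: dict  # column name -> size of that column chunk in bytes


def bytes_scanned(row_groups, needed_columns, target_day):
    """Bytes touched when reading only needed columns in matching row groups."""
    total = 0
    for rg in row_groups:
        # Skip the row group if its min/max range cannot contain the target day.
        if not (rg.day_min <= target_day <= rg.day_max):
            continue
        # Read only the column chunks the query needs.
        total += sum(rg.chunk_bytes[c] for c in needed_columns)
    return total


# Hypothetical layout: 40 columns, 10 row groups, one day per row group.
columns = ["day", "user_id", "country"] + [f"c{i}" for i in range(37)]
row_groups = [
    RowGroup(f"2024-02-{d:02d}", f"2024-02-{d:02d}", {c: 1_000_000 for c in columns})
    for d in range(1, 11)
]

# A row-oriented format has to read every byte of every row.
full_scan = sum(sum(rg.chunk_bytes.values()) for rg in row_groups)

# The filter column is read too, so the reader can evaluate the predicate.
pruned = bytes_scanned(row_groups, ["day", "user_id", "country"], "2024-02-01")

print(full_scan, pruned)
```

Under these assumptions the two savings multiply: 3 of 40 columns, and 1 of 10 row groups.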

Quiz

Your table has per-row-group day stats, yet WHERE day = '2024-02-01' still scans nearly every row group. What is the most likely cause?
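A small simulation makes the symptom concrete. This is a sketch under assumed numbers (29 days, 1,000 rows per day, 1,000 rows per row group) and does not read real Parquet footers. It computes min/max stats per row group for the same rows written in two orders and counts how many row groups a reader could not rule out.

```python
def row_group_stats(days, rows_per_group):
    """Return (min, max) of the day column for each consecutive row group."""
    stats = []
    for start in range(0, len(days), rows_per_group):
        chunk = days[start:start + rows_per_group]
        stats.append((min(chunk), max(chunk)))
    return stats


def groups_to_scan(stats, target):
    """Count row groups whose min/max range could contain the target value."""
    return sum(1 for lo, hi in stats if lo <= target <= hi)


month = [f"2024-02-{d:02d}" for d in range(1, 30)]  # 29 days in February 2024

# Same rows, two write orders: days interleaved versus sorted by day.
interleaved = [month[i % len(month)] for i in range(29_000)]
clustered = sorted(interleaved)

scan_interleaved = groups_to_scan(row_group_stats(interleaved, 1_000), "2024-02-01")
scan_clustered = groups_to_scan(row_group_stats(clustered, 1_000), "2024-02-01")

print(scan_interleaved, scan_clustered)
```

When every row group spans the whole month, every min/max range contains the target day and the stats rule nothing out. Sorting by day before writing narrows each range to a single day.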

Quiz

A teammate says 'Parquet compresses the data, so encoding and compression are the same thing.' What is the precise correction?
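The two stages can be shown separately with the standard library. This is a minimal sketch: dictionary_encode is a hypothetical helper, the country values are made up, and zlib stands in for whatever page codec a writer is configured with. Encoding changes how the column's values are represented; compression is a generic byte codec run over the already-encoded bytes.

```python
import zlib


def dictionary_encode(values):
    """Replace each value with a small integer code plus a lookup table."""
    dictionary = sorted(set(values))
    index = {v: i for i, v in enumerate(dictionary)}
    # One byte per row; only valid here because there are fewer than 256 distinct values.
    codes = bytes(index[v] for v in values)
    return dictionary, codes


countries = ["BR", "DE", "IN", "US"]
column = [countries[(i // 7) % 4] for i in range(10_000)]

# Stage 0: the raw values as newline-separated text.
raw = "\n".join(column).encode()

# Stage 1 (encoding): a type-aware representation change.
dictionary, codes = dictionary_encode(column)
encoded = "\n".join(dictionary).encode() + codes

# Stage 2 (compression): a general-purpose codec over the encoded bytes.
compressed = zlib.compress(encoded)

print(len(raw), len(encoded), len(compressed))
```

Each stage shrinks the data on its own, and each is reversed on its own when reading: decompress first, then decode.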

Quiz

You enable dictionary encoding on a column of random UUIDs. What is the likely outcome, and why?
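A back-of-the-envelope size model shows why cardinality matters. The functions below are illustrative assumptions rather than any writer's actual accounting: the model charges one dictionary entry per distinct value plus a bit-packed code per row, and ignores page headers and other overhead. The uuid5 calls only serve to produce distinct, repeatable 16-byte values.

```python
import uuid


def plain_size(values, width):
    """Bytes for plain encoding of fixed-width values."""
    return len(values) * width


def dictionary_size(values, width):
    """Bytes for one dictionary entry per distinct value plus one code per row."""
    distinct = len(set(values))
    bits_per_code = max(1, (distinct - 1).bit_length())
    codes_bytes = (len(values) * bits_per_code + 7) // 8
    return distinct * width + codes_bytes


n = 10_000

# High cardinality: every row is a distinct 16-byte UUID.
uuids = [uuid.uuid5(uuid.NAMESPACE_DNS, str(i)).bytes for i in range(n)]

# Low cardinality: four 2-byte country codes repeated across all rows.
countries = [("BR", "DE", "IN", "US")[i % 4] for i in range(n)]

uuid_plain = plain_size(uuids, 16)
uuid_dict = dictionary_size(uuids, 16)
country_plain = plain_size(countries, 2)
country_dict = dictionary_size(countries, 2)

print(uuid_plain, uuid_dict, country_plain, country_dict)
```

With all values distinct, the dictionary is as large as the data itself and the per-row codes are pure overhead. In practice many writers abandon the dictionary and fall back to plain encoding once it exceeds a size limit, so the usual outcome is wasted effort rather than a much larger file.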

Quiz

A Kafka consumer flushes a tiny Parquet file to S3 every 10 seconds. Dashboards over the table have become painfully slow. What is the highest-leverage fix?
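The scale of the problem is easy to estimate. The sketch below assumes 200 KB per flushed file and a 256 MB compaction target; neither number comes from the scenario, and plan_compaction is a hypothetical helper, not a table-format API. It groups one day of small files into batches that a compaction job would each rewrite as a single larger file.

```python
def plan_compaction(file_sizes, target_bytes):
    """Greedily group consecutive small files into batches of at most target_bytes."""
    batches, current, current_bytes = [], [], 0
    for size in file_sizes:
        # Close the current batch when adding this file would exceed the target.
        if current and current_bytes + size > target_bytes:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:
        batches.append(current)
    return batches


# One file every 10 seconds for 24 hours.
files_per_day = (60 // 10) * 60 * 24
file_sizes = [200_000] * files_per_day  # assumed 200 KB per flush

target = 256 * 1024 * 1024  # assumed 256 MB output files
plan = plan_compaction(file_sizes, target)

print(files_per_day, len(plan))
```

Each small file costs the query engine a listing entry, an open, and a footer read before any data is scanned, so collapsing thousands of files into a handful removes most of that fixed overhead. Buffering longer in the consumer attacks the same cause at the source.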

Quiz

Raw Parquet files in a directory give you columnar storage and pushdown. What do Iceberg, Delta Lake, and Hudi add on top, and why does it matter?

Recap

The through-line of the unit is one design loop. Columnar layout enables column pruning. Min/max statistics let the reader skip whole row groups, and pages where a page index is written (predicate pushdown), but only when the data is clustered by the filter columns. Encoding and compression are separate, stacked wins, and dictionary encoding backfires on high-cardinality columns. The recurring production traps are the small-files problem (fix it with compaction, not a better codec) and row groups sized by row count rather than by bytes. And because raw Parquet has no transactions, table formats (Iceberg, Delta Lake, Hudi) wrap it with a metadata layer that provides ACID commits, time travel, safe schema evolution, and file-level pruning from metadata. Now when you see a slow lake table, you have a checklist: is the data clustered by the filter column, are the row groups sized by bytes, does the predicate actually reach the reader, and does the codec match the access pattern?
