awesome-everything RU
↑ Back to the climb

Data Engineering

Parquet: multiple-choice review

Crux Multiple-choice synthesis across the Parquet unit — columnar layout, footer stats and pushdown, encoding vs compression, row-group sizing, and the small-files trap.
Your altitude — climbing toward senior
ZeroJuniorMiddleSenior
You are at senior altitude — in orbit
◷ 13 min

Six questions that cut across the whole unit. Each one is a decision you make when designing a lake table — not a definition to recite, but a tradeoff to weigh against a real query and a real cost.

Goal

Confirm you can connect columnar layout, footer statistics, encoding versus compression, row-group sizing, and the production traps — the synthesis the lesson built toward.

Quiz

A 40-column events table is queried with SELECT user_id, country WHERE day = '2024-02-01'. Two independent Parquet mechanisms make this far cheaper than CSV. What are they?

Quiz

Your table has per-row-group day stats, yet WHERE day = '2024-02-01' still scans nearly every row group. What is the most likely cause?

Quiz

A teammate says 'Parquet compresses the data, so encoding and compression are the same thing.' What is the precise correction?

Quiz

You enable dictionary encoding on a column of random UUIDs. What is the likely outcome, and why?

Quiz

A Kafka consumer flushes a tiny Parquet file to S3 every 10 seconds. Dashboards over the table have become painfully slow. What is the highest-leverage fix?

Quiz

Raw Parquet files in a directory give you columnar storage and pushdown. What do Iceberg, Delta Lake, and Hudi add on top, and why does it matter?

Recap

The through-line of the unit is one design loop: columnar layout enables column pruning, the footer’s min/max stats enable row-group and page skipping (predicate pushdown) — but only when data is clustered by the filter columns. Encoding and compression are separate stacked wins, and dictionary encoding backfires on high-cardinality columns. The recurring production traps are the small-files problem (fix with compaction, not a better codec) and row-group sizing. And because raw Parquet has no transactions, table formats — Iceberg, Delta Lake, Hudi — wrap it with a manifest layer for ACID, time travel, safe schema evolution, and manifest-level pruning.

Continue the climb ↑Parquet: free-recall review
shortcuts expand
search
K
prev piece
k
next piece
j
cycle tier
t
this menu
?
sources2
expand
  1. 01
  2. 02

Trademarks belong to their respective owners. Editorial reference only.