awesome-everything

Data Engineering

Parquet: build a query-efficient lake table

Crux: a hands-on project. Convert a CSV dataset to Parquet, lay it out for pushdown and pruning, then prove with before/after numbers that filtered queries read a fraction of the bytes.
◷ 240 min

Reading about pushdown and the small-files problem is not the same as making a query 50x cheaper. Take a real dataset, lay it out as Parquet the way a senior would, and prove — with bytes-scanned numbers, not hand-waving — that the layout is what does the work.

Goal

Turn the unit’s mental model into a reproducible loop: convert to Parquet, cluster and size for the queries you actually run, push filters and projections into the reader, and verify the win with before/after bytes-scanned and timing on identical queries.

Project
Objective

Take a wide, multi-million-row CSV dataset and turn it into a query-efficient Parquet table whose filtered, projected queries read a small fraction of the bytes a CSV full-scan would — proving each layout decision with measured bytes-scanned, not estimates.
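To see why a projected query can read so much less than a CSV full scan, here is a minimal sketch using only the Python standard library. It is a toy model, not Parquet: the column names and the synthetic data are invented for illustration, and the "columnar" size ignores encoding and compression. It only shows that a row-oriented file must be scanned whole, while a columnar layout lets the reader touch just the requested columns.

```python
import csv
import io
import random


def make_rows(n, seed=0):
    """Generate n synthetic rows; every column name here is made up."""
    rng = random.Random(seed)
    return [
        {
            "event_id": str(i),
            "country": rng.choice(["DE", "FR", "US", "JP"]),
            "amount": f"{rng.uniform(0, 500):.2f}",
            "comment": "x" * rng.randint(20, 80),  # wide free-text column
        }
        for i in range(n)
    ]


def csv_bytes(rows):
    """Bytes a row-oriented reader must scan: the whole serialized file."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return len(buf.getvalue().encode("utf-8"))


def columnar_bytes(rows, columns):
    """Bytes touched when only `columns` are read.

    Toy model: each column is stored as its values joined by newlines,
    uncompressed, so unrequested columns cost nothing.
    """
    return sum(
        len("\n".join(r[c] for r in rows).encode("utf-8")) for c in columns
    )


rows = make_rows(10_000)
full = csv_bytes(rows)
projected = columnar_bytes(rows, ["country", "amount"])
print(f"CSV full scan:       {full:>9,} bytes")
print(f"columnar, 2 columns: {projected:>9,} bytes ({projected / full:.1%})")
```

In the real project the same comparison comes from the engine's bytes-scanned statistic rather than from a model, but the shape of the result should be similar: the wide column you did not ask for dominates the CSV cost and disappears from the projected read.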

Requirements
Acceptance criteria
  • A before/after table across CSV, naive Parquet, and tuned Parquet: bytes scanned, on-disk size, and query wall time for the identical filtered, projected query.
  • Evidence that the tuned layout actually skips row groups — engine query stats or the count of row groups read versus total — and that the unsorted version skips few or none.
  • A codec comparison (snappy vs zstd) with measured on-disk size and read time, and a one-line recommendation for hot vs cold data.
  • A small-files demonstration: planning/listing time on the tiny-file layout versus the compacted layout, showing the compaction win.
  • A one-paragraph write-up naming, for each win, which mechanism produced it — column pruning, row-group skipping, encoding, or compaction — so the numbers map to causes.
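The row-group skipping criterion above can be reasoned about with a small simulation before touching a real file. The sketch below is a toy model in plain Python, not a Parquet reader: it assumes fixed-size groups of a single synthetic integer column and keeps a min/max pair per group, which mimics the per-row-group column statistics a Parquet footer carries. The function names are hypothetical.

```python
import random


def build_row_groups(values, rows_per_group):
    """Split values into fixed-size groups and record min/max per group."""
    groups = []
    for start in range(0, len(values), rows_per_group):
        chunk = values[start:start + rows_per_group]
        groups.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return groups


def scan(groups, lo, hi):
    """Return (matching_rows, groups_read) for the filter lo <= v <= hi.

    A group is skipped when its [min, max] range cannot overlap [lo, hi].
    """
    matched, groups_read = 0, 0
    for g in groups:
        if g["max"] < lo or g["min"] > hi:
            continue  # pruned from statistics alone, no rows are read
        groups_read += 1
        matched += sum(1 for v in g["rows"] if lo <= v <= hi)
    return matched, groups_read


rng = random.Random(42)
values = [rng.randrange(1_000_000) for _ in range(100_000)]  # synthetic filter column

unsorted_groups = build_row_groups(values, 10_000)
sorted_groups = build_row_groups(sorted(values), 10_000)

for name, groups in [("unsorted", unsorted_groups), ("sorted", sorted_groups)]:
    matched, read = scan(groups, 250_000, 260_000)
    print(f"{name:>8}: matched={matched} row_groups_read={read}/{len(groups)}")
```

Both layouts return the same matching rows. In the unsorted layout every group's min/max range spans nearly the whole domain, so nothing can be skipped; after sorting, each group covers a narrow range and only the groups overlapping the filter are read. That is the evidence the acceptance criteria ask for, here in miniature.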
Senior stretch
  • Add page-level statistics and a Bloom filter on a high-cardinality equality column, and show the extra skipping (or that it didn't help and why).
  • Put the tuned Parquet under a table format (Iceberg or Delta Lake) and demonstrate one capability raw files can't give: an atomic schema evolution (add/rename a column) or time travel to a prior snapshot.
  • Add a CI-style check that fails if a query reads more than a threshold fraction of total bytes, so a regression in clustering or projection is caught automatically.
  • Repeat the filtered query in a second engine (e.g. DuckDB and Spark) and show that the same Parquet layout drives skipping consistently across engines.
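The CI-style check in the stretch list can be as small as one function. This is a minimal sketch under stated assumptions: `check_scan_fraction` is a hypothetical helper, the 10% default threshold is arbitrary, and the caller is assumed to supply `bytes_scanned` from whatever query statistics the engine exposes and `total_bytes` from the table's on-disk size.

```python
def check_scan_fraction(bytes_scanned, total_bytes, max_fraction=0.10):
    """Fail if a query read more than max_fraction of the table's bytes.

    Returns the observed fraction so the caller can log it.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    fraction = bytes_scanned / total_bytes
    if fraction > max_fraction:
        raise AssertionError(
            f"query scanned {fraction:.1%} of the table, "
            f"limit is {max_fraction:.1%}: check clustering and projection"
        )
    return fraction


# Placeholder numbers for illustration only, not measurements.
fraction = check_scan_fraction(bytes_scanned=40_000_000, total_bytes=1_000_000_000)
print(f"ok: query read {fraction:.1%} of the table")
```

Run it against the same filtered, projected query on every change to the writer job; a lost sort order or a `SELECT *` creeping into the query shows up as a failing fraction rather than as a slow dashboard weeks later.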
Recap

This is the loop you run whenever a lake table is slow: convert to Parquet, then make the layout do the work — cluster by the filter columns so min/max ranges are skippable, size row groups by bytes, push the predicate and the column list into the reader, pick a codec by hot-versus-cold, and never dictionary-encode a near-unique column. Then prove it with bytes scanned and row groups skipped on identical queries, and fix the small-files problem with compaction. Doing it once on a real dataset turns the format’s mechanics into instinct.
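The advice against dictionary-encoding a near-unique column follows from simple size arithmetic. The sketch below is a simplified model, not Parquet's exact on-disk format: it assumes plain encoding costs a 4-byte length prefix plus the value bytes, and dictionary encoding costs each distinct value once plus a fixed-width index per row. Real writers use a run-length/bit-packed hybrid for the indices and typically fall back to plain encoding once the dictionary grows too large, so treat the numbers as directional only. The sample columns are synthetic.

```python
import math
import random


def plain_size(values):
    """Toy plain encoding: 4-byte length prefix plus UTF-8 bytes per value."""
    return sum(4 + len(v.encode("utf-8")) for v in values)


def dictionary_size(values):
    """Toy dictionary encoding: distinct values stored once, plus one
    fixed-width index per row of ceil(log2(distinct)) bits (at least 1)."""
    distinct = set(values)
    bits_per_index = max(1, math.ceil(math.log2(len(distinct))))
    index_bytes = math.ceil(len(values) * bits_per_index / 8)
    return plain_size(distinct) + index_bytes


rng = random.Random(7)
n = 100_000
low_card = [rng.choice(["DE", "FR", "US", "JP"]) for _ in range(n)]  # 4 distinct values
near_unique = [f"id-{i:08d}" for i in range(n)]                      # all distinct

for name, col in [("low-cardinality", low_card), ("near-unique", near_unique)]:
    p, d = plain_size(col), dictionary_size(col)
    print(f"{name:>16}: plain={p:,} bytes  dictionary={d:,} bytes  ratio={d / p:.2f}")
```

For the low-cardinality column the dictionary is tiny and the per-row index is a couple of bits, so the encoded size collapses. For the near-unique column the dictionary is as large as the plain data and the indices are pure overhead, which is why the recap says to leave such columns plain.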


Trademarks belong to their respective owners. Editorial reference only.