Parquet: build a query-efficient lake table

Hands-on project — convert a CSV dataset to Parquet, lay it out for pushdown and pruning, then prove with before/after numbers that filtered queries read a fraction of the bytes.

DATA · Senior · 240 min

Reading about pushdown and the small-files problem is not the same as making a query 50x cheaper. Take a real dataset, lay it out as Parquet the way a senior engineer would, and prove with bytes-scanned numbers, not hand-waving, that the layout is what does the work.

Goal

Turn the unit’s mental model into a reproducible loop: convert to Parquet, cluster and size for the queries you actually run, push filters and projections into the reader, and verify the win with before/after bytes-scanned and timing on identical queries.
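Sizing for the queries you run usually starts from a byte target, while some writers (pyarrow's `row_group_size`, for example) take a row count. A minimal sketch of that conversion follows. The function name, the sampling approach, and the 128 MiB default are assumptions for illustration, not values this lesson prescribes.

```python
def rows_per_row_group(sample_bytes, sample_rows, target_bytes=128 * 1024 * 1024):
    """Estimate how many rows fit in one row group of roughly target_bytes.

    sample_bytes / sample_rows come from a sample you measured yourself,
    e.g. the uncompressed in-memory size of the first N rows.
    The 128 MiB default is an assumed starting point; tune it for your engine.
    """
    if sample_bytes <= 0 or sample_rows <= 0:
        raise ValueError("sample must contain at least one row and one byte")
    avg_row_bytes = sample_bytes / sample_rows
    # Never return 0 rows, even if a single row exceeds the target.
    return max(1, int(target_bytes // avg_row_bytes))


# Hypothetical sample: 10 rows occupying 1,000 bytes, with a 10,000-byte target.
print(rows_per_row_group(1_000, 10, target_bytes=10_000))
```

Wide rows shrink the row count and narrow rows grow it, so re-measure the sample whenever the schema changes.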

Project

Objective

Take a wide, multi-million-row CSV dataset and turn it into a query-efficient Parquet table whose filtered, projected queries read a small fraction of the bytes a CSV full-scan would — proving each layout decision with measured bytes-scanned, not estimates.

Requirements

Acceptance criteria

A before/after table across CSV, naive Parquet, and tuned Parquet: bytes scanned, on-disk size, and query wall time for the identical filtered, projected query.
Evidence that the tuned layout actually skips row groups — engine query stats or the count of row groups read versus total — and that the unsorted version skips few or none.
A codec comparison (snappy vs zstd) with measured on-disk size and read time, and a one-line recommendation for hot vs cold data.
A small-files demonstration: planning/listing time on the tiny-file layout versus the compacted layout, showing the compaction win.
A one-paragraph write-up naming, for each win, which mechanism produced it — column pruning, row-group skipping, encoding, or compaction — so the numbers map to causes.
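To build intuition for the row-group skipping criterion before you measure it on real files, here is a toy model in plain Python. It is not a Parquet reader: it only mimics per-row-group min/max statistics on synthetic integers, and every name in it is hypothetical. It counts how many groups a range filter must read when the same values are stored unsorted versus sorted.

```python
import random


def row_group_stats(values, rows_per_group):
    """Split values into consecutive row groups and return (min, max) per group."""
    return [
        (min(chunk), max(chunk))
        for chunk in (
            values[i:i + rows_per_group]
            for i in range(0, len(values), rows_per_group)
        )
    ]


def groups_read(stats, lo, hi):
    """Count row groups whose [min, max] range overlaps the filter lo <= x <= hi."""
    return sum(1 for gmin, gmax in stats if gmax >= lo and gmin <= hi)


# Synthetic data: 100,000 random integers, 10,000 rows per row group.
rng = random.Random(42)
values = [rng.randrange(1_000_000) for _ in range(100_000)]

unsorted_stats = row_group_stats(values, 10_000)
sorted_stats = row_group_stats(sorted(values), 10_000)

# Same filter against both layouts.
lo, hi = 250_000, 260_000
unsorted_read = groups_read(unsorted_stats, lo, hi)
sorted_read = groups_read(sorted_stats, lo, hi)

print(f"unsorted: read {unsorted_read} of {len(unsorted_stats)} row groups")
print(f"sorted:   read {sorted_read} of {len(sorted_stats)} row groups")
```

In the unsorted layout every group spans nearly the whole value range, so min/max statistics rule almost nothing out. Sorting narrows each group's range, and that is what makes groups skippable. Your real evidence should still come from engine query stats or file metadata, not from this model.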

Senior stretch

Add page-level statistics (the Parquet page index) and a Bloom filter on a high-cardinality equality column, and show the extra skipping, or show that it didn't help and explain why.
Put the tuned Parquet under a table format (Iceberg or Delta Lake) and demonstrate one capability raw files can't give: an atomic schema evolution (add/rename a column) or time travel to a prior snapshot.
Add a CI-style check that fails if a query reads more than a threshold fraction of total bytes, so a regression in clustering or projection is caught automatically.
Repeat the filtered query in a second engine (e.g. DuckDB and Spark) and show that the same Parquet layout drives skipping consistently across engines.
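The CI-style check in this list can be as small as one guard function. The sketch below assumes you already extract `bytes_read` and `total_bytes` from your engine's query stats; how you obtain them is engine-specific and not shown. The exception name and the 10% default threshold are hypothetical.

```python
class ScanBudgetExceeded(Exception):
    """Raised when a query reads more of the table than the budget allows."""


def check_scan_fraction(bytes_read, total_bytes, max_fraction=0.10):
    """Return the fraction of the table read, or raise if it exceeds the budget.

    bytes_read and total_bytes are assumed to come from engine query stats.
    max_fraction=0.10 is an illustrative default, not a recommended value.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    fraction = bytes_read / total_bytes
    if fraction > max_fraction:
        raise ScanBudgetExceeded(
            f"query read {fraction:.1%} of the table; budget is {max_fraction:.1%}"
        )
    return fraction


# Hypothetical numbers: a well-pruned query reading 5 of 100 units passes.
print(check_scan_fraction(5, 100))
```

Run it in the test suite against a fixed query and a fixed dataset, so a failure points at a layout or projection regression rather than at data growth.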

Recap

This is the loop you run whenever a lake table is slow: convert to Parquet, then make the layout do the work — cluster by the filter columns so min/max ranges are skippable, size row groups by bytes, push the predicate and the column list into the reader, pick a codec by hot-versus-cold, and never dictionary-encode a near-unique column. Then prove it with bytes scanned and row groups skipped on identical queries, and fix the small-files problem with compaction. Doing it once on a real dataset turns the format’s mechanics into instinct.
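The compaction step starts with a plan: which small files get merged into which output file. Here is a minimal sketch of a greedy planner, assuming you have a mapping of file name to size in bytes from a directory listing. The function name and file names are hypothetical, and the actual rewrite of the Parquet files is left to your engine.

```python
def plan_compaction(file_sizes, target_bytes):
    """Group small files into batches of at most roughly target_bytes each.

    file_sizes maps file name -> size in bytes. Files are taken in name order
    so that neighbours stay together, which helps only if your file names
    follow the clustering order (an assumption of this sketch).
    """
    batches, current, current_size = [], [], 0
    for name, size in sorted(file_sizes.items()):
        # Close the current batch if adding this file would overshoot the target.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Hypothetical listing: ten 10-byte files compacted toward 40-byte outputs.
files = {f"part-{i:03d}.parquet": 10 for i in range(10)}
for batch in plan_compaction(files, 40):
    print(len(batch), batch[0], "->", batch[-1])
```

A file larger than the target ends up in a batch of its own, which is the signal that it needs splitting rather than merging.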
