Data Engineering
Parquet: build a query-efficient lake table
Reading about pushdown and the small-files problem is not the same as making a query 50x cheaper. Take a real dataset, lay it out as Parquet the way a senior engineer would, and prove, with bytes-scanned numbers rather than hand-waving, that the layout is what does the work.
Turn the unit’s mental model into a reproducible loop: convert to Parquet, cluster and size for the queries you actually run, push filters and projections into the reader, and verify the win with before/after bytes-scanned and timing on identical queries.
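Before touching a real engine, it can help to see why clustering matters for the loop above. The following is a minimal pure-Python simulation of min/max row-group skipping, not a Parquet reader: the `days` column, the fixed `ROWS_PER_GROUP`, and both helper functions are hypothetical names chosen for illustration, and real row groups are sized by bytes rather than by a fixed row count.

```python
import random


def row_group_stats(values, rows_per_group):
    """Split values into consecutive row groups and return (min, max) per group."""
    stats = []
    for start in range(0, len(values), rows_per_group):
        chunk = values[start:start + rows_per_group]
        stats.append((min(chunk), max(chunk)))
    return stats


def groups_to_read(stats, lo, hi):
    """Count row groups whose [min, max] range overlaps the predicate lo <= x <= hi."""
    return sum(1 for gmin, gmax in stats if gmax >= lo and gmin <= hi)


random.seed(0)
# Hypothetical filter column: 100,000 day numbers in [0, 365).
days = [random.randrange(365) for _ in range(100_000)]
ROWS_PER_GROUP = 10_000  # illustrative only; real row groups are sized by bytes

unsorted_stats = row_group_stats(days, ROWS_PER_GROUP)
sorted_stats = row_group_stats(sorted(days), ROWS_PER_GROUP)

# Predicate: one week of data, 100 <= day <= 106.
unsorted_read = groups_to_read(unsorted_stats, 100, 106)
sorted_read = groups_to_read(sorted_stats, 100, 106)

print(f"unsorted: read {unsorted_read} of {len(unsorted_stats)} row groups")
print(f"sorted:   read {sorted_read} of {len(sorted_stats)} row groups")
```

In the unsorted layout every group's min/max range spans nearly the whole year, so the statistics cannot rule anything out. After sorting, each group covers a narrow range of days and most groups fall outside the predicate. The project asks you to show the same contrast with real engine statistics.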
Take a wide, multi-million-row CSV dataset and turn it into a query-efficient Parquet table whose filtered, projected queries read a small fraction of the bytes a CSV full-scan would — proving each layout decision with measured bytes-scanned, not estimates.
- A before/after table across CSV, naive Parquet, and tuned Parquet: bytes scanned, on-disk size, and query wall time for the identical filtered, projected query.
- Evidence that the tuned layout actually skips row groups — engine query stats or the count of row groups read versus total — and that the unsorted version skips few or none.
- A codec comparison (snappy vs zstd) with measured on-disk size and read time, and a one-line recommendation for hot vs cold data.
- A small-files demonstration: planning/listing time on the tiny-file layout versus the compacted layout, showing the compaction win.
- A one-paragraph write-up naming, for each win, which mechanism produced it — column pruning, row-group skipping, encoding, or compaction — so the numbers map to causes.
- Add page-level statistics and a Bloom filter on a high-cardinality equality column, and show the extra skipping (or that it didn't help and why).
- Put the tuned Parquet under a table format (Iceberg or Delta Lake) and demonstrate one capability raw files can't give: an atomic schema evolution (add/rename a column) or time travel to a prior snapshot.
- Add a CI-style check that fails if a query reads more than a threshold fraction of total bytes, so a regression in clustering or projection is caught automatically.
- Repeat the filtered query in a second engine (e.g. run it in both DuckDB and Spark) and show that the same Parquet layout drives skipping consistently across engines.
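The CI-style check from the list above could be sketched as a small guard function. This is one possible shape, not a prescribed one: `check_scan_budget`, its parameter names, and the 5% default budget are assumptions made for illustration, and how you obtain `bytes_read` depends on the engine's query statistics.

```python
def scan_fraction(bytes_read, total_bytes):
    """Return the fraction of the table's bytes that a query read."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return bytes_read / total_bytes


def check_scan_budget(bytes_read, total_bytes, max_fraction=0.05):
    """Raise AssertionError if the query read more than max_fraction of the table.

    The 5% default is an illustrative assumption; pick a budget from your own
    measured baseline.
    """
    fraction = scan_fraction(bytes_read, total_bytes)
    if fraction > max_fraction:
        raise AssertionError(
            f"query read {fraction:.1%} of table bytes; budget is {max_fraction:.1%}"
        )
    return fraction


# Hypothetical numbers: 40 MB read out of a 1,000 MB table stays within budget.
check_scan_budget(bytes_read=40_000_000, total_bytes=1_000_000_000)
```

Run the guard in a test after executing the reference query. A regression in clustering or a dropped projection raises the measured fraction and fails the build.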
This is the loop you run whenever a lake table is slow. Convert to Parquet, then make the layout do the work: cluster by the filter columns so min/max ranges are skippable, size row groups by bytes, push the predicate and the column list into the reader, pick a codec by hot-versus-cold, and never dictionary-encode a near-unique column. Then prove it with bytes scanned and row groups skipped on identical queries, and fix the small-files problem with compaction. Doing it once on a real dataset turns the format’s mechanics into instinct.
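The compaction step can be planned before any data is rewritten. The sketch below greedily groups a file listing into outputs of roughly a target size. `plan_compaction`, the file names, and the 128 MiB target are all hypothetical choices for illustration; the right target depends on your engine and storage.

```python
def plan_compaction(file_sizes, target_bytes):
    """Greedily group (name, size) pairs, in listing order, into outputs of at most
    target_bytes each (a single file larger than the target gets its own group)."""
    groups, current, current_size = [], [], 0
    for name, size in file_sizes:
        # Close the current group if adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        groups.append(current)
    return groups


MiB = 1024 * 1024
# Hypothetical listing: 1,000 files of 1 MiB each.
small_files = [(f"part-{i:05d}.parquet", 1 * MiB) for i in range(1000)]
plan = plan_compaction(small_files, target_bytes=128 * MiB)

print(f"{len(small_files)} files -> {len(plan)} compacted files")
```

Each group in `plan` is a set of inputs to read and rewrite as one file. The planning and listing time you measure before and after that rewrite is the small-files number the project asks for.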