Crux Read real PyArrow/DuckDB read paths and a writer config, predict the I/O and GC of bytes scanned, and pick the highest-leverage fix for pushdown, sizing, and projection.
Your altitude — climbing toward senior
ZeroJuniorMiddleSenior
You are at senior altitude — in orbit
◷ 14 min
Parquet problems are diagnosed in the read path and the writer config, not in the docs. Read each snippet, predict how many bytes the engine actually touches, and choose the fix a senior engineer makes first.
Goal
Practise the loop you run on every slow lake query: see where the filter and the projection are applied, decide whether the writer laid the data out to be skippable, and reach for the layout fix before the hardware one.
Snippet 1 — pushdown lost in the engine
import pyarrow.parquet as pq# events/ is a year of daily Parquet files, sorted by `day`table = pq.read_table("s3://lake/events/") # reads everythingdf = table.to_pandas()df = df[df["day"] == "2024-02-01"] # filter after the readresult = df[["user_id", "country"]] # project after the read
Quiz
Completed
This reads ~1.9 GB and is slow even though the data is sorted by day. Why, and what is the single highest-leverage fix?
Heads-up The storage class doesn't matter when you're reading 1.9 GB you didn't need. The defect is reading everything then filtering in memory; pushing the filter and column list into read_table is what cuts the bytes.
Heads-up The conversion is downstream of the problem — the engine already read all columns and all row groups. Changing the in-memory representation doesn't reduce what was read off S3.
Heads-up The data is already sorted by day, and sorting governs row-group skipping, not column pruning. The actual bug is that no predicate or column projection was pushed into the reader at all.
Snippet 2 — the writer config
import pyarrow.parquet as pq# one batch write of ~50 GB, queried almost always by `country`pq.write_table( table, "s3://lake/events_2024.parquet", row_group_size=2048, # rows per row group compression="snappy", write_statistics=True,)
Quiz
Completed
For a 50 GB batch write this layout will make filtered queries slow despite write_statistics=True. What is wrong, and what is the fix?
Heads-up Statistics aren't the size problem and gzip is slower to read. The defect is row-group granularity — 2048 rows per group means metadata overhead dominates. Codec choice is orthogonal to row-group sizing.
Heads-up Statistics are exactly what enables row-group skipping — turning them off removes pushdown entirely and makes filtered queries scan everything. The problem is the tiny row-group size, not the presence of stats.
Heads-up Row-group size is a primary read-performance knob: too small and metadata/footer reads dominate; too large and you lose skipping granularity and need huge write memory. 2048 rows is far too small.
Snippet 3 — column projection vs SELECT *
-- DuckDB over a wide 120-column Parquet table on S3SELECT *FROM read_parquet('s3://lake/wide_events/*.parquet')WHERE country = 'DE';
Quiz
Completed
An analyst only needs user_id and amount but runs SELECT * on a 120-column table. What is the cost, and the fix?
Heads-up Columns cost nothing only when you don't select them. SELECT * selects all of them, so every column chunk in every matching row group is read — exactly the case column pruning was meant to avoid.
Heads-up The predicate limits which row groups are read (row dimension); the projection limits which columns are read (column dimension). They're independent — SELECT * still reads all 120 columns of every surviving row group.
Heads-up LIMIT caps row output after scanning; it doesn't reduce the set of columns read per row group. Only naming the columns enables column pruning.
Snippet 4 — dictionary fallback
# a column of high-cardinality session UUIDs, ~10M near-unique valuespq.write_table( table, # contains a `session_id` UUID column "s3://lake/sessions.parquet", use_dictionary=True, # on by default compression="zstd",)
Quiz
Completed
The session_id column ends up barely smaller than plain bytes and write CPU is high. What happened?
Heads-up zstd still compresses the bytes somewhat; the wasted work is the dictionary build that overflowed and was discarded. The encoding-layer fallback, not the codec, is the story here.
Heads-up No truncation or corruption occurs — Parquet cleanly falls back to plain encoding when the dictionary overflows. The symptom is wasted CPU and no size win, not bad data.
Heads-up A bigger row group gives the dictionary more values to absorb, making the cardinality problem worse, not better. The fix is to not dictionary-encode a near-unique column at all.
Recap
Every slow Parquet query is read in the access path and the writer config: push the predicate and the column list into the reader or neither row-group skipping nor column pruning fires; size row groups by bytes toward ~128 MB-1 GB and cluster by the filter column so its ranges are skippable; SELECT * silently reads every column chunk, so always project; and dictionary encoding on a high-cardinality column overflows and falls back to plain, costing CPU for no size win. Diagnose from where the filter and projection land, fix the layout, then re-measure the bytes scanned.