Data Engineering DATA · 03 · 09

Parquet: code and config reading

Read real PyArrow/DuckDB read paths and a writer config, predict the I/O and GC of bytes scanned, and pick the highest-leverage fix for pushdown, sizing, and projection.

DATA Senior ◷ 14 min

Level

FoundationsJuniorMiddleSenior

Parquet problems are diagnosed in the read path and the writer config, not in the docs. Read each snippet, predict how many bytes the engine actually touches, and choose the fix a senior engineer makes first.

Goal

Practise the loop you run on every slow lake query: see where the filter and the projection are applied, decide whether the writer laid the data out to be skippable, and reach for the layout fix before the hardware one.

Snippet 1 — pushdown lost in the engine

import pyarrow.parquet as pq

# events/ is a year of daily Parquet files, sorted by `day`
table = pq.read_table("s3://lake/events/")      # reads everything
df = table.to_pandas()
df = df[df["day"] == "2024-02-01"]               # filter after the read
result = df[["user_id", "country"]]              # project after the read

Quiz

This reads ~1.9 GB and is slow even though the data is sorted by day. Why, and what is the single highest-leverage fix?

Snippet 2 — the writer config

import pyarrow.parquet as pq

# one batch write of ~50 GB, queried almost always by `country`
pq.write_table(
    table,
    "s3://lake/events_2024.parquet",
    row_group_size=2048,        # rows per row group
    compression="snappy",
    write_statistics=True,
)

Quiz

For a 50 GB batch write this layout will make filtered queries slow despite write_statistics=True. What is wrong, and what is the fix?

Snippet 3 — column projection vs SELECT *

-- DuckDB over a wide 120-column Parquet table on S3
SELECT *
FROM read_parquet('s3://lake/wide_events/*.parquet')
WHERE country = 'DE';

Quiz

An analyst only needs user_id and amount but runs SELECT * on a 120-column table. What is the cost, and the fix?

Snippet 4 — dictionary fallback

# a column of high-cardinality session UUIDs, ~10M near-unique values
pq.write_table(
    table,                       # contains a `session_id` UUID column
    "s3://lake/sessions.parquet",
    use_dictionary=True,         # on by default
    compression="zstd",
)

Quiz

The session_id column ends up barely smaller than plain bytes and write CPU is high. What happened?

Recap

Every slow Parquet query is read in the access path and the writer config: push the predicate and the column list into the reader or neither row-group skipping nor column pruning fires; size row groups by bytes toward ~128 MB-1 GB and cluster by the filter column so its ranges are skippable; SELECT * silently reads every column chunk, so always project; and dictionary encoding on a high-cardinality column overflows and falls back to plain, costing CPU for no size win. Diagnose from where the filter and projection land, fix the layout, then re-measure the bytes scanned. Now when you see a slow Parquet query in a code review, your first two questions are: where does the predicate land — in the reader or after materialisation — and did the writer cluster and size the row groups for this access pattern?

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.