Data Engineering DATA · 03 · 01

Parquet: why analytics stores columns, not rows

Parquet is columnar, self-describing, and carries per-row-group min/max stats in its footer — so a filter can skip whole row groups and read only the columns it needs. CSV must parse every byte. The senior traps are small files, wrong row-group size, and dictionary blowup.

DATA Junior ◷ 17 min

Level

FoundationsJuniorMiddleSenior

A nightly Spark job writes the events table to S3 as gzip CSV. A simple SELECT user_id, country FROM events WHERE day = '2024-02-01' reads 1.9 GB, decompresses every byte, parses every column, throws away 38 of 40 columns, and takes 90 seconds. The analyst reruns it forty times tuning a dashboard. Same data as Parquet with snappy: 220 MB on disk, the engine reads two column chunks and one day’s row groups, and the query returns in under two seconds. Nothing about the data changed — only the file format.

In ten minutes you will know exactly why that 90-second query happens and what three layout decisions make it go away.

Columns on disk, not rows

Why does the file format matter so much? Because the engine can only skip what it knows isn’t needed — and CSV gives it nothing to go on. Here is what happens instead in Parquet.

CSV and JSON store data row by row: every field of row 1, then every field of row 2. To read one column you still walk every byte of every row, parse the delimiters, and discard 95% of what you parsed. There are no statistics, no types — "2024" could be a string or a number, the reader can’t know without parsing it.

Parquet flips the layout. Inside each file, data is split into row groups (a horizontal slice of rows, default target ~128 MB), and inside a row group each column is stored as a contiguous column chunk, itself split into pages (the unit of encoding and compression, ~1 MB by default). Because all values of one column sit together, two things follow that are impossible in CSV. First, column pruning: a query that needs user_id and country reads only those two column chunks and never touches the other 38. Second, encoding gets brutally efficient — a million rows of country is mostly the same handful of strings, and a column of similar values compresses far better than a row of mixed types.

Every Parquet file ends with a footer: the schema, plus per-row-group, per-column metadata — and crucially min/max/null-count statistics for each column in each row group. This is the single most important structure for query performance.

When you run WHERE day = '2024-02-01', the engine reads the footer first, looks at each row group’s day min and max, and skips any row group whose range can’t contain the value without reading a single data byte. This is predicate pushdown plus row-group skipping. On a year of daily data sorted by day, that filter touches roughly 1 of 365 row groups. Page-level stats (and optional page indexes) push the same idea one level deeper, skipping pages inside a surviving row group. CSV can do none of this — no stats means the only way to know if a row matches is to read and parse it. When you write a filter and wonder why it’s still slow, the footer stats are the first place to look: are the row-group ranges actually narrow enough to skip anything?

▸Why this works

Skipping only works if the data is laid out so the filter column’s min/max ranges are narrow and non-overlapping. Sort or partition by the columns you filter on most. If day values are randomly scattered across row groups, every row group’s [min, max] spans the whole year, no row group can be skipped, and you’ve paid for statistics that buy nothing. Stats are necessary but not sufficient — clustering is what makes them pay.

Predicate pushdown: the footer's per-row-group day min/max lets the engine skip row groups 1 and 3 (ranges exclude 2024-02-01) and read only row group 2 — and within it, only the day and country column chunks if those are the projected columns.

Encoding and compression: two separate wins

Parquet shrinks data in two stacked layers, and seniors keep them mentally separate. Encoding is type-aware and lossless-by-construction: dictionary encoding replaces repeated values with small integer codes (ideal for low-cardinality strings like country, status), run-length encoding (RLE) collapses runs of identical values, bit-packing uses only as many bits as the value range needs, and delta encoding stores differences for sorted integers and timestamps. Compression (snappy, zstd, gzip) then runs a general-purpose codec over the already-encoded bytes.

The numbers are why the format won. A CSV that is 1.9 GB lands around 1.2 GB as raw uncompressed Parquet (encoding alone), and with compression typically 9–15% of the original CSV size. snappy is the common default — fast, cheap CPU, ~8x ratio. zstd gives roughly 15–20% smaller files than snappy for under a 1% read penalty, which is why it’s the better choice for cold or archival data; gzip compresses tighter still but is markedly slower to read.

Property	CSV / JSON	Parquet
Layout	Row-oriented	Columnar (row group → column chunk → page)
Read only some columns	No — full parse	Yes — column pruning
Skip rows by filter	No stats → read everything	Footer min/max → skip row groups + pages
Schema / types	None — guessed at read	Embedded, typed, self-describing
Size vs CSV	1x (baseline)	~0.09–0.15x compressed

Where it bites in production

The format looks fast in every demo. The question is: which of these failure modes will you hit first in your own pipeline?

Parquet is not free of footguns, and every one of them shows up at scale, not in the demo.

The first is the small-files problem. Each Parquet file carries its own footer and costs one I/O to open and read. A streaming job that writes a 4 KB file every few seconds produces millions of tiny files, and now the slowest part of every query is listing them — recursively enumerating millions of objects in S3 is notoriously slow and expensive, and the query planner must read every footer before it can plan. The fix is compaction: periodically rewrite the swarm into files in the 128 MB – 1 GB range.

The second is row-group size. Too large and you lose granularity for skipping and need huge memory to write a single group; too small and metadata overhead balloons and you read more footers than data. The third is dictionary blowup: dictionary encoding is fantastic for low-cardinality columns, but on a high-cardinality column (a UUID, a free-text field) the dictionary grows past its page-size budget (~1 MB), Parquet falls back to plain encoding, and you’ve paid to build a dictionary that bought nothing — sometimes a larger file. The fourth is schema evolution: adding, renaming, or reordering columns across thousands of files leaves readers reconciling mismatched schemas, and a renamed-then-reused column name can silently mix two meanings.

Together these four failure modes share a pattern: each one looks fine at prototype scale and invisible until you hit production volume. When you see a query that’s mysteriously slow despite good stats, check this list in order — small files and wrong row-group size account for the majority of cases.

Table formats: Parquet plus a brain

Raw Parquet is just files in a directory; it has no notion of a transaction, a snapshot, or “the current set of files.” This is exactly the gap Iceberg, Delta Lake, and Hudi fill. They keep your data in Parquet but add a metadata/manifest layer on top that tracks which files belong to the table right now, giving you ACID commits, time travel, safe schema evolution (column renames become metadata-only operations, not data rewrites), and manifest-level pruning — the engine consults the manifest’s per-file stats to eliminate whole files before it even opens a footer. When people say “data lakehouse,” this is the layer they mean: Parquet for storage, a table format for correctness and planning.

Pick the best fit

A Kafka consumer writes click events to S3 as Parquet, flushing a tiny file every 10 seconds. Dashboards over this table have gotten painfully slow. Pick the fix.

Quiz

A query is WHERE day = '2024-02-01' on a year of data. What lets Parquet read far less than CSV would?

Quiz

You enable dictionary encoding on a column of random UUIDs. What's the likely outcome?

Order the steps

Order the physical containment, outermost to innermost:

1 File (ends with a footer holding schema + per-row-group stats)
2 Row group (a horizontal slice of rows, ~128 MB target)
3 Column chunk (all values of one column within that row group, contiguous)
4 Page (the unit of encoding + compression, ~1 MB)
5 Encoded + compressed values (dictionary / RLE / delta, then snappy/zstd)

File (events.parquet) footer: schema + per-row-group min/max/null stats

Row group ~128 MB horizontal slice of rows

Column chunk all values of one column — column pruning lives here

Page ~1 MB — unit of encoding + compression

Encoded + compressed values dictionary / RLE / delta → snappy / zstd

Column pruning eliminates whole column chunks before a byte is read; row-group skipping uses the footer's min/max to avoid entire horizontal slices. Both savings are only possible because the footer exists at the file level, not scattered inside the data.

Recall before you leave

01
Explain to a teammate, end to end, why a filtered query on Parquet reads so much less than the same query on CSV.
02
What is the small-files problem, why does it cripple query planning, and how do table formats help?

Recap

Parquet stores data column by column instead of row by row, which makes two things possible that CSV and JSON cannot do: column pruning, where a query reads only the column chunks it needs, and row-group skipping, where the engine uses min/max/null-count statistics in the file footer to skip whole row groups (and pages) that can’t satisfy the filter — predicate pushdown. Inside, the layout nests file → row group (~128 MB) → column chunk → page (~1 MB), and each page is first encoded (dictionary, RLE, bit-packing, delta) and then compressed (snappy for hot data, zstd for ~15–20% smaller cold files), landing typically at 9–15% of the original CSV size. The senior failure modes are the small-files problem (millions of tiny files make listing and footer-reading dominate query planning — fix with compaction), wrong row-group size, dictionary blowup on high-cardinality columns (fallback to plain encoding, sometimes a larger file), and messy schema evolution across many files. Skipping only pays when data is clustered by the filter columns, so sort and partition deliberately. And because raw Parquet has no transactions or snapshots, table formats — Iceberg, Delta Lake, Hudi — wrap it with a manifest layer for ACID, time travel, safe schema evolution, and manifest-level pruning, which is what turns a pile of Parquet files into a lakehouse table. Now when you see a slow lake query, your first question should be: is the predicate pushed into the reader, and is the data clustered so the footer stats can actually skip row groups?

Practice

Start at the top. Tasks go easiest → hardest: recall a fact, apply it to a case, then a senior-level stretch. Open one, attempt it, then reveal.

recallapplystretch0 of 5 done

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.