awesome-everything RU
↑ Back to the climb

Data Engineering

Parquet: why analytics stores columns, not rows

Crux Parquet is columnar, self-describing, and carries per-row-group min/max stats in its footer — so a filter can skip whole row groups and read only the columns it needs. CSV must parse every byte. The senior traps are small files, wrong row-group size, and dictionary blowup.
Your altitude — climbing toward senior
ZeroJuniorMiddleSenior
You are at junior altitude — the surface
◷ 17 min

A nightly Spark job writes the events table to S3 as gzip CSV. A simple SELECT user_id, country FROM events WHERE day = '2024-02-01' reads 1.9 GB, decompresses every byte, parses every column, throws away 38 of 40 columns, and takes 90 seconds. The analyst reruns it forty times tuning a dashboard. Same data as Parquet with snappy: 220 MB on disk, the engine reads two column chunks and one day’s row groups, and the query returns in under two seconds. Nothing about the data changed — only the file format.

Columns on disk, not rows

CSV and JSON store data row by row: every field of row 1, then every field of row 2. To read one column you still walk every byte of every row, parse the delimiters, and discard 95% of what you parsed. There are no statistics, no types — "2024" could be a string or a number, the reader can’t know without parsing it.

Parquet flips the layout. Inside each file, data is split into row groups (a horizontal slice of rows, default target ~128 MB), and inside a row group each column is stored as a contiguous column chunk, itself split into pages (the unit of encoding and compression, ~1 MB by default). Because all values of one column sit together, two things follow that are impossible in CSV. First, column pruning: a query that needs user_id and country reads only those two column chunks and never touches the other 38. Second, encoding gets brutally efficient — a million rows of country is mostly the same handful of strings, and a column of similar values compresses far better than a row of mixed types.

Every Parquet file ends with a footer: the schema, plus per-row-group, per-column metadata — and crucially min/max/null-count statistics for each column in each row group. This is the single most important structure for query performance.

When you run WHERE day = '2024-02-01', the engine reads the footer first, looks at each row group’s day min and max, and skips any row group whose range can’t contain the value without reading a single data byte. This is predicate pushdown plus row-group skipping. On a year of daily data sorted by day, that filter touches roughly 1 of 365 row groups. Page-level stats (and optional page indexes) push the same idea one level deeper, skipping pages inside a surviving row group. CSV can do none of this — no stats means the only way to know if a row matches is to read and parse it.

Why this works

Skipping only works if the data is laid out so the filter column’s min/max ranges are narrow and non-overlapping. Sort or partition by the columns you filter on most. If day values are randomly scattered across row groups, every row group’s [min, max] spans the whole year, no row group can be skipped, and you’ve paid for statistics that buy nothing. Stats are necessary but not sufficient — clustering is what makes them pay.

Encoding and compression: two separate wins

Parquet shrinks data in two stacked layers, and seniors keep them mentally separate. Encoding is type-aware and lossless-by-construction: dictionary encoding replaces repeated values with small integer codes (ideal for low-cardinality strings like country, status), run-length encoding (RLE) collapses runs of identical values, bit-packing uses only as many bits as the value range needs, and delta encoding stores differences for sorted integers and timestamps. Compression (snappy, zstd, gzip) then runs a general-purpose codec over the already-encoded bytes.

The numbers are why the format won. A CSV that is 1.9 GB lands around 1.2 GB as raw uncompressed Parquet (encoding alone), and with compression typically 9–15% of the original CSV size. snappy is the common default — fast, cheap CPU, ~8x ratio. zstd gives roughly 15–20% smaller files than snappy for under a 1% read penalty, which is why it’s the better choice for cold or archival data; gzip compresses tighter still but is markedly slower to read.

PropertyCSV / JSONParquet
LayoutRow-orientedColumnar (row group → column chunk → page)
Read only some columnsNo — full parseYes — column pruning
Skip rows by filterNo stats → read everythingFooter min/max → skip row groups + pages
Schema / typesNone — guessed at readEmbedded, typed, self-describing
Size vs CSV1x (baseline)~0.09–0.15x compressed

Where it bites in production

Parquet is not free of footguns, and every one of them shows up at scale, not in the demo.

The first is the small-files problem. Each Parquet file carries its own footer and costs one I/O to open and read. A streaming job that writes a 4 KB file every few seconds produces millions of tiny files, and now the slowest part of every query is listing them — recursively enumerating millions of objects in S3 is notoriously slow and expensive, and the query planner must read every footer before it can plan. The fix is compaction: periodically rewrite the swarm into files in the 128 MB – 1 GB range.

The second is row-group size. Too large and you lose granularity for skipping and need huge memory to write a single group; too small and metadata overhead balloons and you read more footers than data. The third is dictionary blowup: dictionary encoding is fantastic for low-cardinality columns, but on a high-cardinality column (a UUID, a free-text field) the dictionary grows past its page-size budget (~1 MB), Parquet falls back to plain encoding, and you’ve paid to build a dictionary that bought nothing — sometimes a larger file. The fourth is schema evolution: adding, renaming, or reordering columns across thousands of files leaves readers reconciling mismatched schemas, and a renamed-then-reused column name can silently mix two meanings.

Table formats: Parquet plus a brain

Raw Parquet is just files in a directory; it has no notion of a transaction, a snapshot, or “the current set of files.” This is exactly the gap Iceberg, Delta Lake, and Hudi fill. They keep your data in Parquet but add a metadata/manifest layer on top that tracks which files belong to the table right now, giving you ACID commits, time travel, safe schema evolution (column renames become metadata-only operations, not data rewrites), and manifest-level pruning — the engine consults the manifest’s per-file stats to eliminate whole files before it even opens a footer. When people say “data lakehouse,” this is the layer they mean: Parquet for storage, a table format for correctness and planning.

Pick the best fit

A Kafka consumer writes click events to S3 as Parquet, flushing a tiny file every 10 seconds. Dashboards over this table have gotten painfully slow. Pick the fix.

Quiz

A query is WHERE day = '2024-02-01' on a year of data. What lets Parquet read far less than CSV would?

Quiz

You enable dictionary encoding on a column of random UUIDs. What's the likely outcome?

Order the steps

Order the physical containment, outermost to innermost:

  1. 1 File (ends with a footer holding schema + per-row-group stats)
  2. 2 Row group (a horizontal slice of rows, ~128 MB target)
  3. 3 Column chunk (all values of one column within that row group, contiguous)
  4. 4 Page (the unit of encoding + compression, ~1 MB)
  5. 5 Encoded + compressed values (dictionary / RLE / delta, then snappy/zstd)
Recall before you leave
  1. 01
    Explain to a teammate, end to end, why a filtered query on Parquet reads so much less than the same query on CSV.
  2. 02
    What is the small-files problem, why does it cripple query planning, and how do table formats help?
Recap

Parquet stores data column by column instead of row by row, which makes two things possible that CSV and JSON cannot do: column pruning, where a query reads only the column chunks it needs, and row-group skipping, where the engine uses min/max/null-count statistics in the file footer to skip whole row groups (and pages) that can’t satisfy the filter — predicate pushdown. Inside, the layout nests file → row group (~128 MB) → column chunk → page (~1 MB), and each page is first encoded (dictionary, RLE, bit-packing, delta) and then compressed (snappy for hot data, zstd for ~15–20% smaller cold files), landing typically at 9–15% of the original CSV size. The senior failure modes are the small-files problem (millions of tiny files make listing and footer-reading dominate query planning — fix with compaction), wrong row-group size, dictionary blowup on high-cardinality columns (fallback to plain encoding, sometimes a larger file), and messy schema evolution across many files. Skipping only pays when data is clustered by the filter columns, so sort and partition deliberately. And because raw Parquet has no transactions or snapshots, table formats — Iceberg, Delta Lake, Hudi — wrap it with a manifest layer for ACID, time travel, safe schema evolution, and manifest-level pruning, which is what turns a pile of Parquet files into a lakehouse table.

Continue the climb ↑Parquet: multiple-choice review
shortcuts expand
search
K
prev piece
k
next piece
j
cycle tier
t
this menu
?
sources4
expand
  1. 01
  2. 02
  3. 03
  4. 04

Trademarks belong to their respective owners. Editorial reference only.