awesome-everything RU
↑ Back to the climb

Databases

Act 7 in depth: sharding, co-location, and the seven-tier tradeoff cascade

Crux Sharding is the last resort — chosen when single-Postgres limits are measured and proven. The shard key is irreversible, co-location determines which joins stay local, and the tradeoff cascade shows why skipping any earlier act makes sharding harder.
Your altitude — climbing toward senior
ZeroJuniorMiddleSenior
You are at middle altitude — in the sky
◷ 14 min

Citus is deployed. Shard key is tenant_id. Most tenants are fast. Tenant Acme — 40% of all queries — saturates its shard while the other five shards sit idle. The cluster is effectively single-shard for Acme. The hot-shard failure was not caused by sharding; it was designed in on the day the shard key was chosen.

Act 7 — Three independent sharding decisions

Sharding at year-3 involves three decisions made independently:

Decision 1: Shard key. The shard key determines data distribution and query locality. For B2B multi-tenant SaaS, tenant_id is the natural choice: all queries for one tenant stay on one shard. For B2C, user_id. For time-series, a time bucket. The key must have high cardinality (so load spreads), uniform access patterns (so no single shard is hot), and be cheap to compute (so routing overhead is minimal).

Decision 2: Distribution method.

  • Hash: rows are hashed to shards — even spread, range queries fan out across all shards.
  • Range: rows in a key range go to a specific shard — range queries stay local, but a hot key range creates a hot shard.
  • List: manual assignment — flexible, high operational cost.
  • Directory: a lookup table maps key values to shards — most flexible, highest operational cost.

Decision 3: Co-location. Tables that join together must be placed on the same shard key so cross-shard joins stay local. In Citus: users, orgs, events, audit_log — all sharded by tenant_id — are co-located. A join between users and events for tenant 42 hits a single shard and returns at local latency. Without co-location, the same join fans out to all shards, collects results at the coordinator, and merges.

The federation tax

Once sharded, queries that span multiple shards must be federated: send the query to each shard, collect results, and merge on the application side or at the Citus coordinator.

A naive outer join across two shards becomes two inner queries (one per shard) and a client-side join. A GROUP BY that involves shards becomes a GROUP BY per shard, then a second GROUP BY on the aggregates. Each cross-shard query has latency = max(latency per shard) + merge time; tail latency dominates.

Citus handles federation with distributed query planning, but only for co-located tables (those with the same shard key). Reference tables (replicated to every worker node) are free — a small dimension table like countries or currencies can be replicated once and joined locally everywhere.

The hot-shard fix

When Acme accounts for 40% of all queries on a tenant_id-sharded cluster:

  • Option A — logical split: Route Acme’s queries to a dedicated set of logical shards. The tenant-aware router multiplexes Acme across shard IDs that map to a dedicated physical worker. Other tenants share the remaining workers.
  • Option B — physical cluster: Move Acme to its own Postgres cluster. Queries for Acme route there; the shared cluster handles all other tenants.

Both options require online resharding (citus_rebalance_table_shards), which uses logical replication for sub-second per-shard write pauses. Plan the window — catch-up replication adds load.

The seven-tier tradeoff cascade

Each act unlocks the next, but only if the previous is sound:

  • Skip Act 1 (schema) → Acts 2–7 fight inefficient joins and mutable natural keys.
  • Skip Act 2 (indexing) → Act 3 planning decisions are made blind; statistics are useless if the planner has no indexes to choose between.
  • Skip Act 3 (stats) → Act 4 vacuum is the only lever; tables bloat while you guess at cardinality.
  • Skip Act 4 (bloat) → Act 5 pool threads see table scans that are slower because dead tuples inflate heap pages.
  • Skip Act 5 (pool sizing) → Act 6 migrations starve under connection storms; the connection handshake backlog prevents the migrate from acquiring locks in a timely window.
  • Skip Act 6 (lock safety) → Act 7 sharding is impossible without downtime; every shard rebalance requires an exclusive lock if migrations were not designed for online operation.
  • Skip Act 7 (when needed) → one tenant dominates the cluster and the others wait in queue.

The order is a constraint imposed by physics and Postgres internals, not a suggestion.

Act 7 sharding reference numbers
Single-node ceiling, write-heavy OLTP (2026 hardware)
10–50K writes/s sustained
Full scan of 1B-row table on SSD
5–10 min
Online shard-move write pause (Citus 11.1+)
under 1 second per shard
Schema-based sharding: practical tenant ceiling per cluster
1000–3000 tenants
Shard key change project duration
months (dual-write + backfill + cutover)
Why this works

Schema-based multi-tenancy (“one schema per tenant”) works at low tenant count: clean isolation, easy backup, simple data export. But Postgres catalog tables (pg_class, pg_attribute, pg_constraint) grow linearly with schema count × tables per schema. At 10000 schemas × 50 tables, pg_class has half a million rows; planner walk-time on queries that touch multiple schemas climbs to seconds. The practical ceiling is 1000–3000 tenants per cluster with schema-based sharding. Past that, row-level multi-tenancy (tenant_id on every table) with Citus shard distribution scales further.

Quiz

A Citus cluster shards by tenant_id. A query joins users and events for a single tenant. Will this query pay the federation tax?

Quiz

The team skipped Act 6 (safe migration patterns). They now need to add a column to a sharded table of 1B rows. What is the consequence?

Quiz

Why is changing the shard key after launch described as 'among the most expensive operations in databases'?

Recall before you leave
  1. 01
    Name the three independent decisions in sharding and explain why the shard key is the most consequential.
  2. 02
    What is the federation tax and when does co-location eliminate it?
  3. 03
    Trace the tradeoff cascade: pick Acts 1 and 6, describe how skipping them makes Act 7 more expensive.
Recap

Act 7 involves three decisions: shard key (largely irreversible — determines data distribution and join locality), distribution method (hash for even spread, range for range-query locality, list/directory for manual control), and co-location (tables that join together must share a shard key to avoid the federation tax). The hot-shard failure mode — one tenant saturating a shard — is fixed by logical tenant split or physical cluster isolation, both via online resharding. The tradeoff cascade is the proof that acts compound: skip Act 1 and the shard key choice is poisoned by the wrong schema; skip Act 6 and every shard operation is a downtime event. The 2026 pragmatic default for a Postgres shop scaling past one node is declarative partitioning first, Citus when single-node limits are measured and proven, and Aurora DSQL or Spanner only when global writes are a real product requirement.

Connected lessons
appears again in263
Continue the climb ↑Observability, anti-patterns, and production triage
shortcuts expand
search
K
prev piece
k
next piece
j
cycle tier
t
this menu
?
sources4
expand
  1. 01
  2. 02
  3. 03
  4. 04

Trademarks belong to their respective owners. Editorial reference only.