Databases DB · 07 · 10

Sharding: provoke and tame a hot shard

Hands-on project — stand up a Citus cluster, provoke a hot shard on a power-law tenant, isolate it online, and prove the fix with before/after skew and latency numbers.

DB Senior ◷ 240 min

Level

FoundationsJuniorMiddleSenior

Reading about hot shards is not the same as watching one shard climb to 94% CPU while the rest idle, then cutting the offending tenant onto its own shard with a sub-second pause. Build a small Citus cluster, drive it into the canonical sharding failure, and apply the unit’s playbook until the skew flattens — with evidence at every step.

Goal

Turn the unit’s mental model into a reproducible operational loop: design a co-located tenant-sharded schema, load a power-law tenant distribution, detect the hot shard from skew and per-tenant metrics, isolate the outlier online, and verify the fix with before/after numbers.

Project

0 of 7

Objective

Stand up a multi-worker Citus cluster with a co-located tenant_id-sharded schema, deliberately create a hot shard with a power-law tenant load, then detect and isolate the outlier tenant online — proving the skew and latency improvement with measurements, not estimates.

Requirements

Acceptance criteria

A before/after table: max/median shard-size skew ratio, hot-worker CPU %, whale P99, and small-tenant P99 — all measured under the same load, not estimated.
EXPLAIN output (or query plans) proving the co-located join is single-worker and the non-co-located join fans out to all workers.
Evidence the isolation was online: the measured write-pause is sub-second and the cluster served traffic throughout (no maintenance window).
A one-paragraph write-up naming why isolate_tenant_to_new_shard fixed the skew where citus_rebalance_start() would not have, and the leading indicators (skew ratio, per-tenant share) you would alert on to pre-empt the next incident.

Senior stretch

Add a tiered isolation policy as a script/cron: query per-tenant share from pg_stat_statements and auto-isolate any tenant crossing ~5% of cluster query time; demonstrate it firing on a synthetic growth pattern before CPU saturates.
Exercise online resharding: add a fourth worker and run citus_rebalance_start(); kill the coordinator mid-move and show the rebalancer resumes idempotently from pg_stat_subscription with no data loss.
Force a cross-shard transaction (update a co-located table plus a differently-keyed table in one BEGIN/COMMIT), observe the 2PC path, then crash the coordinator mid-prepare and demonstrate detecting and resolving the stuck entry in pg_prepared_xacts via COMMIT/ROLLBACK PREPARED.
Build the schema a second way with Citus 12 schema-based sharding (one schema per tenant) and compare onboarding/offboarding ergonomics and planning overhead against the row-based version.

Recap

This is the loop you will run in every real sharding incident: design for co-location so the common case stays single-node, instrument shard skew and per-tenant load as leading indicators, isolate the outlier tenant online rather than reaching for a rebalance that keeps it bundled, and verify with before/after numbers under identical load. Doing it once on a toy Citus cluster makes the production version — where the whale tenant is a paying enterprise account and the on-call clock is running — muscle memory.

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.