Databases DB · 07 · 05

The hot-shard failure mode: detection, isolation, and durable policy

Hash sharding produces hot shards on skewed key distributions. Citus''''s isolate_tenant_to_new_shard() is the production mitigation; skew monitoring and a tenant-tier policy are the durable fix.

DB Middle ◷ 14 min

Level

FoundationsJuniorMiddleSenior

A Citus cluster has 32 shards across 4 workers. Skew alert fires: shard 102008 is at 94% CPU, the other 31 shards average 18%. Customer support is fielding latency complaints from one enterprise account. The entire cluster is running at the capacity of one shard.

Why hash sharding does not prevent hot shards

Hash sharding (shard = hash(tenant_id) mod N) distributes evenly when keys are uniformly distributed. Real-world tenants are power-law distributed: one or two customers generate 60–80% of traffic; the long tail generates the rest.

Hash routing assigns each tenant to exactly one shard. A tenant generating 60% of traffic means their shard runs at 60% of cluster capacity while every other shard sits near-idle. The cluster is effectively a single-shard cluster for that customer’s workload.

The problem is structural, not a bug: hash sharding was the right choice (uniform distribution across the tail), but the design is incomplete without an outlier-isolation policy.

Detection: what to monitor

Signal	How to measure	Alert threshold
Shard size skew	SELECT shardid, table_size FROM citus_shards ORDER BY table_size DESC	max/median > 1.5×
Per-shard CPU	Postgres exporter per worker + Prometheus	any worker > 70% sustained 5 min
Per-tenant query load	pg_stat_statements on coordinator, aggregated by tenant_id label	single tenant > 5% of cluster total_exec_time
Customer P99 latency	APM per tenant endpoint	P99 > 3× cluster median

The leading indicators are shard size skew and per-tenant query rate growth — both visible weeks before the CPU alert fires. Teams that monitor weekly trends pre-empt hot-shard incidents; teams that only monitor CPU react to them. When you set up a new Citus cluster, these four signals should be wired into your alerting on day one — before the first enterprise customer signs.

Immediate mitigation: Citus tenant isolation

When one tenant is the culprit, move them to a dedicated shard:

-- Move tenant 9821 to its own shard (online, no downtime)
SELECT isolate_tenant_to_new_shard('orders', 9821, 'CASCADE');

What Citus does:

Creates a new shard.
Sets up logical replication from the old shard to the new one, copying only tenant 9821’s rows.
Once caught up, briefly pauses writes to tenant 9821 (sub-second), switches the metadata pointer.
Resumes. The old shard no longer contains tenant 9821’s data.

Total time: minutes to hours depending on tenant data volume (~10–100 MB/s throughput). The write pause is sub-second. The original shard’s CPU immediately drops to cluster median after cut-over.

For app-level sharding without Citus: the equivalent is a directory-remap — update the shard map for that tenant to point at a new Postgres, dual-write during the transition, cut over reads. This requires pre-built tooling; teams without it spend days on the emergency.

Durable policy: automation, not reaction

A single hot-shard incident is acceptable. Two is a process failure. The durable fix is a tiered policy:

Alert on skew (max/median > 1.5×) and per-tenant load (> 5% of cluster). These are leading indicators.
Automate isolation for tenants crossing the 5% threshold: a background job invokes isolate_tenant_to_new_shard. No human required for routine cases.
Escalate tenants above 20% to a dedicated worker; above 50% to a dedicated cluster + customer conversation.
Pre-isolate new enterprise accounts above a size threshold during onboarding — do not wait for a traffic spike.
Drill the runbook quarterly in staging so on-call engineers can isolate a tenant in under 15 minutes.

Together these five steps mean: no single engineer needs to make a judgment call under production pressure — the policy decides for them. Without step 2 (automation), you will be woken up at 3am repeatedly; without step 4 (pre-isolation), your enterprise onboarding becomes a hot-shard time bomb.

The durable fix is a load-threshold ladder, not a 3am judgment call: 5% auto-isolates, 20% earns a dedicated worker, 50% earns a dedicated cluster. The policy decides for the on-call engineer.

▸Why this works

Why does Citus use logical replication for tenant isolation instead of a physical copy? Logical replication copies row-level changes (INSERT/UPDATE/DELETE) from source to destination selectively — it can filter to copy only rows where tenant_id = 9821. Physical replication copies every page of the source shard, including rows from other tenants. For per-tenant isolation, logical replication is the only practical option: it moves exactly the right data without duplicating the other tenants’ rows, and it stays in sync until the cut-over moment with no read-only window.

Quiz

One shard in a Citus cluster is at 94% CPU while all others average 18%. pg_stat_statements shows one tenant generating 62% of all query time on that shard. What is the correct immediate action?

Order the steps

Order the hot-shard response steps from detection to durable fix:

1 Alert fires: shard CPU > 70% sustained or max/median skew > 1.5×
2 Identify the dominant tenant on the hot shard via pg_stat_statements aggregated by tenant_id
3 Invoke isolate_tenant_to_new_shard for that tenant; monitor replication lag until cut-over
4 Confirm the original shard CPU drops to cluster median after cut-over
5 Post-mortem: was this tenant growing for weeks? Update monitoring thresholds
6 Update the isolation policy to auto-isolate tenants above 5% cluster load before the next incident

Hash routing sends a power-law tenant to one shard: it saturates at 94% CPU while the rest idle near 18%. The cluster effectively runs at single-shard capacity until that tenant is isolated to a dedicated shard.

Recall before you leave

01
Why does hash sharding not prevent hot shards in a multi-tenant B2B SaaS?
02
Walk through what Citus's isolate_tenant_to_new_shard does and what its downtime profile is.
03
What are the three monitoring signals that catch hot-shard skew before it causes customer-visible latency?

Recap

The hot-shard failure mode is a structural consequence of power-law tenant distribution on hash sharding: one tenant’s shard saturates while others sit idle. Detection requires proactive monitoring of shard size skew (max/median ratio) and per-tenant query load — both visible weeks before the CPU alert fires. The immediate mitigation is Citus’s isolate_tenant_to_new_shard, which moves the outlier tenant’s data to a dedicated shard via logical replication with a sub-second write pause. The durable fix is a tiered automation policy: alert on skew thresholds, automatically isolate tenants above 5% cluster load, escalate very large tenants to dedicated workers or clusters. Hot-shard incidents that surprise a team are an ops process failure; teams with the policy treat them as routine scheduled isolations. Now when you see one worker’s CPU towering over the rest in your monitoring dashboard, you know the diagnosis (power-law skew), the command (isolate_tenant_to_new_shard), and the durable fix (policy automation).

Practice

Start at the top. Tasks go easiest → hardest: recall a fact, apply it to a case, then a senior-level stretch. Open one, attempt it, then reveal.

recallapplystretch0 of 5 done

Connected lessons

builds on

Co-location and Citus: the invariant that makes sharding usablemiddle

unlocks

Schema-based sharding and multi-tenancy alternativessenior

deepens into

appears again in287

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.