awesome-everything RU
↑ Back to the climb

Databases

The hot-shard failure mode: detection, isolation, and durable policy

Crux Hash sharding produces hot shards on skewed key distributions. Citus''''s isolate_tenant_to_new_shard() is the production mitigation; skew monitoring and a tenant-tier policy are the durable fix.
Your altitude — climbing toward senior
ZeroJuniorMiddleSenior
You are at middle altitude — in the sky
◷ 14 min

A Citus cluster has 32 shards across 4 workers. Skew alert fires: shard 102008 is at 94% CPU, the other 31 shards average 18%. Customer support is fielding latency complaints from one enterprise account. The entire cluster is running at the capacity of one shard.

Why hash sharding does not prevent hot shards

Hash sharding (shard = hash(tenant_id) mod N) distributes evenly when keys are uniformly distributed. Real-world tenants are power-law distributed: one or two customers generate 60–80% of traffic; the long tail generates the rest.

Hash routing assigns each tenant to exactly one shard. A tenant generating 60% of traffic means their shard runs at 60% of cluster capacity while every other shard sits near-idle. The cluster is effectively a single-shard cluster for that customer’s workload.

The problem is structural, not a bug: hash sharding was the right choice (uniform distribution across the tail), but the design is incomplete without an outlier-isolation policy.

Detection: what to monitor

SignalHow to measureAlert threshold
Shard size skewSELECT shardid, table_size FROM citus_shards ORDER BY table_size DESCmax/median > 1.5×
Per-shard CPUPostgres exporter per worker + Prometheusany worker > 70% sustained 5 min
Per-tenant query loadpg_stat_statements on coordinator, aggregated by tenant_id labelsingle tenant > 5% of cluster total_exec_time
Customer P99 latencyAPM per tenant endpointP99 > 3× cluster median

The leading indicators are shard size skew and per-tenant query rate growth — both visible weeks before the CPU alert fires. Teams that monitor weekly trends pre-empt hot-shard incidents; teams that only monitor CPU react to them.

Immediate mitigation: Citus tenant isolation

When one tenant is the culprit, move them to a dedicated shard:

-- Move tenant 9821 to its own shard (online, no downtime)
SELECT isolate_tenant_to_new_shard('orders', 9821, 'CASCADE');

What Citus does:

  1. Creates a new shard.
  2. Sets up logical replication from the old shard to the new one, copying only tenant 9821’s rows.
  3. Once caught up, briefly pauses writes to tenant 9821 (sub-second), switches the metadata pointer.
  4. Resumes. The old shard no longer contains tenant 9821’s data.

Total time: minutes to hours depending on tenant data volume (~10–100 MB/s throughput). The write pause is sub-second. The original shard’s CPU immediately drops to cluster median after cut-over.

For app-level sharding without Citus: the equivalent is a directory-remap — update the shard map for that tenant to point at a new Postgres, dual-write during the transition, cut over reads. This requires pre-built tooling; teams without it spend days on the emergency.

Durable policy: automation, not reaction

A single hot-shard incident is acceptable. Two is a process failure. The durable fix is a tiered policy:

  1. Alert on skew (max/median > 1.5×) and per-tenant load (> 5% of cluster). These are leading indicators.
  2. Automate isolation for tenants crossing the 5% threshold: a background job invokes isolate_tenant_to_new_shard. No human required for routine cases.
  3. Escalate tenants above 20% to a dedicated worker; above 50% to a dedicated cluster + customer conversation.
  4. Pre-isolate new enterprise accounts above a size threshold during onboarding — do not wait for a traffic spike.
  5. Drill the runbook quarterly in staging so on-call engineers can isolate a tenant in under 15 minutes.
Why this works

Why does Citus use logical replication for tenant isolation instead of a physical copy? Logical replication copies row-level changes (INSERT/UPDATE/DELETE) from source to destination selectively — it can filter to copy only rows where tenant_id = 9821. Physical replication copies every page of the source shard, including rows from other tenants. For per-tenant isolation, logical replication is the only practical option: it moves exactly the right data without duplicating the other tenants’ rows, and it stays in sync until the cut-over moment with no read-only window.

Quiz

One shard in a Citus cluster is at 94% CPU while all others average 18%. pg_stat_statements shows one tenant generating 62% of all query time on that shard. What is the correct immediate action?

Order the steps

Order the hot-shard response steps from detection to durable fix:

  1. 1 Alert fires: shard CPU > 70% sustained or max/median skew > 1.5×
  2. 2 Identify the dominant tenant on the hot shard via pg_stat_statements aggregated by tenant_id
  3. 3 Invoke isolate_tenant_to_new_shard for that tenant; monitor replication lag until cut-over
  4. 4 Confirm the original shard CPU drops to cluster median after cut-over
  5. 5 Post-mortem: was this tenant growing for weeks? Update monitoring thresholds
  6. 6 Update the isolation policy to auto-isolate tenants above 5% cluster load before the next incident
Recall before you leave
  1. 01
    Why does hash sharding not prevent hot shards in a multi-tenant B2B SaaS?
  2. 02
    Walk through what Citus's isolate_tenant_to_new_shard does and what its downtime profile is.
  3. 03
    What are the three monitoring signals that catch hot-shard skew before it causes customer-visible latency?
Recap

The hot-shard failure mode is a structural consequence of power-law tenant distribution on hash sharding: one tenant’s shard saturates while others sit idle. Detection requires proactive monitoring of shard size skew (max/median ratio) and per-tenant query load — both visible weeks before the CPU alert fires. The immediate mitigation is Citus’s isolate_tenant_to_new_shard, which moves the outlier tenant’s data to a dedicated shard via logical replication with a sub-second write pause. The durable fix is a tiered automation policy: alert on skew thresholds, automatically isolate tenants above 5% cluster load, escalate very large tenants to dedicated workers or clusters. Hot-shard incidents that surprise a team are an ops process failure; teams with the policy treat them as routine scheduled isolations.

Connected lessons
appears again in258
Continue the climb ↑Schema-based sharding and multi-tenancy alternatives
shortcuts expand
search
K
prev piece
k
next piece
j
cycle tier
t
this menu
?
sources3
expand
  1. 01
  2. 02
  3. 03

Trademarks belong to their respective owners. Editorial reference only.