Databases DB · 07 · 07

Online resharding, 2PC, and the operational cost of sharding

Citus 11.1+ online rebalancing uses logical replication for sub-second shard-move pauses. Two-phase commit is necessary for cross-shard transactions but carries stuck-transaction risk. Every operational task multiplies by shard count.

DB Senior ◷ 14 min

Level

FoundationsJuniorMiddleSenior

A Citus cluster needs to add a worker node to handle capacity growth. The old approach required a maintenance window: halt writes, copy data, restart. Citus 11.1 changed this with an online rebalancer that moves shards while the cluster stays fully operational, using the same logical replication mechanism that Postgres uses for upgrades.

Online resharding: how it works

Citus 11.1 (September 2022) introduced non-disruptive online resharding. The mechanism:

1. Start rebalancer: SELECT citus_rebalance_start();
2. Planner picks: which shards move from worker A to worker B for balance
3. For each shard being moved:
   a. Create the shard on the target worker (empty)
   b. Start logical replication slot on source for that shard's rows
   c. Logical replication copies existing rows to target, then streams changes
   d. Once lag is near zero: pause writes to shard (~sub-second)
   e. Atomically update shard-map metadata: shard now lives on target
   f. Drop the source shard
   g. Resume writes — now routed to target
4. Next shard begins (pipeline of moves in parallel, configurable concurrency)

Performance profile:

Per-shard write pause: under 1 second
Total rebalance time: scales with data volume (~10–100 MB/s network throughput)
Application impact: brief connection stall during the per-shard pause; retried transparently by the coordinator

The rebalancer is idempotent and resumable — if it crashes mid-move, the next citus_rebalance_start() picks up from where it left off using pg_stat_subscription state.

Senior teams rehearse the rebalancer in staging before running in production. The first execution should not be an emergency.

Distributed transactions and 2PC

A transaction touching rows on multiple shards requires two-phase commit (2PC):

-- Single-shard: safe, normal Postgres transaction
BEGIN;
UPDATE orders SET status = 'shipped' WHERE tenant_id = 42 AND order_id = 99;
UPDATE inventory SET qty = qty - 1 WHERE tenant_id = 42 AND sku = 'ABC';
COMMIT;
-- Both tables are co-located on the same shard → single-node transaction

-- Cross-shard: requires 2PC
BEGIN;
UPDATE orders  SET status = 'refunded' WHERE tenant_id = 42 AND order_id = 99;
UPDATE ledger  SET balance = balance + 100 WHERE account_id = 999;  -- NOT tenant-scoped
COMMIT;
-- Citus uses PREPARE TRANSACTION + COMMIT PREPARED across both shards

2PC cost:

Latency: minimum 2 round-trips (prepare phase + commit phase); often more with clock skew handling.
Throughput: each shard holds its connection open during the prepare phase — contention under high concurrency.
In-doubt transactions: if the coordinator crashes between PREPARE TRANSACTION and COMMIT PREPARED, the prepared transactions sit in pg_prepared_xacts holding locks until manually resolved. This is the failure mode to monitor and have a runbook for.

Co-located, single-shard transactions are one cheap COMMIT; cross-shard 2PC doubles the round-trips and adds in-doubt risk — which is why co-location is the senior default.

Best practices:

Design transactions to be single-shard: co-located tables make this natural for tenant-scoped operations.
Prefer eventual consistency (outbox pattern, saga) where the business logic permits rather than distributed 2PC.
Monitor pg_prepared_xacts on every worker; alert on any row older than 5 minutes; have a recovery runbook (COMMIT PREPARED / ROLLBACK PREPARED).

Together these three practices mean: if you have built the system correctly, you should rarely need 2PC at all. When you do see a cross-shard transaction in the codebase, ask whether the table can be co-located — that is almost always the better fix.

The N× operational multiplier

Operation	Single Postgres	Sharded (N shards)
Schema migration	Run once	Citus auto-propagates DDL from coordinator to workers; verify schema consistency post-migration across all workers
Backups	One pg_basebackup	N coordinated backups; all must target the same logical point-in-time for consistent restore
Minor version upgrade	One rolling update	N rolling updates, coordinated; Citus version must be compatible across coordinator and all workers
VACUUM / bloat	One autovacuum config	N separate autovacuum processes; each shard accumulates dead tuples independently
Monitoring	One dashboard	Per-shard metrics + cluster-wide aggregates; shard skew visible in heatmap

Senior teams automate everything via per-shard orchestration — Ansible with a loop over worker hosts, Terraform for_each, Kubernetes operators — before going live. Without automation, ops staff scales linearly with shard count, which is the operational tax of sharding made concrete.

Resharding the key: the expensive case

Sometimes the original shard key turns out wrong (sharded by user_id but 90% of queries need org_id). Re-sharding by a different key requires rewriting every row to its new location:

Dual-write + backfill: app writes to both old and new sharding schemes; backfill historical data; cut over reads; drop old. Months of work.
New parallel cluster: stand up a new cluster with the correct key, replicate via CDC, cut over. Most flexible; most operational work.
Accept the cost: in practice, teams endure the wrong shard key rather than pay the migration cost — a strong argument for choosing carefully from the start.

The senior takeaway: shard-key change is among the most expensive operations in databases. The design decision made at scale-out time is lived with for years.

▸Why this works

Why does online resharding use logical replication rather than physical replication? Physical replication (pg_basebackup, streaming) copies entire data pages — you would get every tenant’s data on the source shard, not just the one being moved. Logical replication copies row-level changes (INSERT/UPDATE/DELETE), filtered by the rows you care about. For shard moves, logical replication can copy exactly the rows in the shard being moved and stream only their changes until cut-over — far more efficient and flexible than physical copy.

Quiz

A Citus rebalancer is running and the coordinator crashes mid-shard-move. What happens to the cluster?

Order the steps

Order the steps of a Citus online shard-move (from source worker to target worker):

1 Create empty shard on target worker
2 Start logical replication slot on source; copy existing rows to target
3 Stream ongoing changes from source to target until lag is near zero
4 Pause writes to this shard (sub-second)
5 Atomically update shard-map metadata: shard now points to target worker
6 Resume writes — now routed to target; drop source shard

The bulk of the data moves during streaming replication while the cluster stays online; only the final cut-over pauses writes, and only for under a second.

Recall before you leave

01
Walk through the per-shard write pause during Citus online rebalancing: how long is it and why is it sub-second?
02
What is the in-doubt transaction risk in Citus 2PC and how do you monitor and recover from it?
03
Why is sharding described as a 'one-way door' and what operational consequences follow from that?

Recap

Citus 11.1+ online rebalancing uses logical replication to move shards between workers with a sub-second per-shard write pause — rehearse it in staging before running in production. Cross-shard transactions require two-phase commit, which adds ~2× latency, reduces throughput under concurrency, and creates in-doubt-transaction risk on coordinator crash; monitor pg_prepared_xacts and maintain a recovery runbook. The N× operational multiplier means migrations, backups, upgrades, and VACUUM must all be automated across every shard before going live — without automation, ops staff scales linearly with shard count. Changing the shard key after launch is a months-long project; the design choice made at scale-out time is lived with for years. Sharding is the right answer when single-Postgres limits are measured and proven — entered deliberately, with automation, runbooks, and rehearsed operational procedures. Now when you see a pull request adding a new sharding layer, the checklist is clear: is the shard key right, is co-location maintained, are pg_prepared_xacts monitored, and has the rebalancer been drilled in staging?

Practice

Start at the top. Tasks go easiest → hardest: recall a fact, apply it to a case, then a senior-level stretch. Open one, attempt it, then reveal.

recallapplystretch0 of 7 done

Connected lessons

builds on

Schema-based sharding and multi-tenancy alternativessenior

unlocks

Act 7 in depth: sharding, co-location, and the seven-tier tradeoff cascademiddle

appears again in287

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.