awesome-everything RU
↑ Back to the climb

Databases

Co-location and Citus: the invariant that makes sharding usable

Crux Co-location — same shard key means same worker — keeps tenant-scoped joins single-node. Citus''''s table types (distributed, reference, local) and query planner make the difference between single-shard and fan-out queries explicit.
Your altitude — climbing toward senior
ZeroJuniorMiddleSenior
You are at middle altitude — in the sky
◷ 14 min

A team migrates to Citus and is thrilled that queries for individual tenants are as fast as before. Then they add a new table without the tenant_id column, and a new join query takes 500ms instead of 5ms. The schema review catches why: one table was not co-located.

Citus architecture

Citus is a Postgres extension that turns a cluster of Postgres instances into one logical sharded database:

  • Coordinator: one node that holds the cluster metadata (shard map, table distribution info), parses queries, plans the distributed execution, and routes results back to the client.
  • Workers: N nodes that hold the actual data (shards). Workers execute Postgres queries locally and return results to the coordinator.
  • Client: connects to the coordinator as if it were a normal Postgres. No driver changes required.
Client → Coordinator (holds metadata, plans, routes)
              ↓               ↓               ↓
          Worker 0        Worker 1        Worker 2
        (shards 0-10)  (shards 11-21)  (shards 22-31)

Three table types

TypeLives onUse forJoin cost
DistributedWorkers, split by shard keyAll tenant-scoped tables (orders, users, projects, …)Single-node if co-located; fan-out if not
ReferenceFull copy on every workerSmall mostly-read lookup tables (plans, feature_flags, countries)Always local; write via 2PC (slow)
LocalCoordinator onlyAdmin/control plane tables (tenants list, workers_meta)Cannot join with worker tables in a single query
-- Distribute tenant-scoped tables by tenant_id
SELECT create_distributed_table('users',       'tenant_id');
SELECT create_distributed_table('projects',    'tenant_id', colocate_with => 'users');
SELECT create_distributed_table('tasks',       'tenant_id', colocate_with => 'users');
SELECT create_distributed_table('comments',    'tenant_id', colocate_with => 'users');
SELECT create_distributed_table('attachments', 'tenant_id', colocate_with => 'users');

-- Replicate small lookup tables to every worker
SELECT create_reference_table('plans');
SELECT create_reference_table('feature_flags');
SELECT create_reference_table('countries');

-- tenants table stays local on coordinator (control plane)

Co-location: the central invariant

Co-location means that tables distributed by the same key have their corresponding shards on the same physical worker. If orders and payments are both distributed by tenant_id, then all of tenant 42’s orders and all of tenant 42’s payments live on Worker 1.

A query joining orders and payments filtered by tenant_id = 42:

  • With co-location: the planner pushes the join to Worker 1, which executes it like a normal Postgres join — one machine, full index use, single-digit milliseconds.
  • Without co-location (e.g., payments distributed by payment_id instead): the coordinator must gather partial results from all workers and merge them. P99 latency = slowest worker. Every worker participates. 10–100× slower.

Co-location is not a performance optimization — it is the design contract that makes a sharded system behave like a single Postgres for the 99% case.

Cross-shard queries and their mitigation

Queries without a shard key filter fan out to all shards. Examples:

  • “List all users with email ending in @enterprise.com” (no tenant_id)
  • “Count total orders today across all tenants” (cross-tenant analytics)
  • “Find a user by their email for login” (email, not tenant_id)

For OLTP:

  1. Redesign the API: almost all OLTP queries should carry tenant_id. Login lookups by email need a separate index or a small lookup service that resolves email → tenant_id first.
  2. CDC to OLAP: cross-tenant analytics should run on a separate analytical store (ClickHouse, BigQuery) fed by Change Data Capture. Never run global aggregates on the OLTP cluster.
  3. Rate-limit fan-out endpoints: for the rare legitimate cross-shard operation, rate-limit and document its cost.

The senior metric: cross-shard queries should be < 1–2% of OLTP traffic. Above that, the schema has drifted from co-location and needs review.

Why this works

Why does Citus default to 32 shards per table even on a 4-worker cluster? Shards are the unit of rebalancing — more shards means finer-grained rebalancing and a smoother distribution when you add workers. With 32 shards on 4 workers, each worker gets 8 shards; adding a 5th worker lets the rebalancer move some shards without cutting any shard in half. Teams often raise this to 64 or 128 when planning for larger clusters. The default of 32 is a starting point, not a mandate.

Quiz

What is the benefit of marking a table as a Citus reference table instead of a distributed table?

Quiz

A new engineer adds table 'audit_log' distributed by 'id' (not tenant_id) to a tenant-sharded Citus cluster. What breaks?

Recall before you leave
  1. 01
    Explain what co-location means in Citus and what happens to query performance when it is violated.
  2. 02
    What are the three Citus table types and which use case does each serve?
  3. 03
    How should a cross-tenant query like 'count all orders today' be handled in a tenant-sharded Citus cluster?
Recap

Citus adds a coordinator (metadata, planning, routing) and N workers (shard storage and execution) to a Postgres cluster, making it look like a single database. Tables are classified as distributed (sharded by key, on workers), reference (full copy on every worker for local joins), or local (coordinator only for control-plane data). Co-location — every tenant-scoped table distributed by the same key so matching shards land on the same worker — is the invariant that keeps tenant-scoped joins single-node and fast. Breaking co-location (distributing a table by a different key) turns every join involving that table into a cross-shard fan-out: N× the work, P99 = slowest worker. Cross-shard analytics should be offloaded to a dedicated OLAP store via CDC, never run on the OLTP cluster.

Connected lessons
appears again in258
Continue the climb ↑The hot-shard failure mode: detection, isolation, and durable policy
shortcuts expand
search
K
prev piece
k
next piece
j
cycle tier
t
this menu
?
sources3
expand
  1. 01
  2. 02
  3. 03

Trademarks belong to their respective owners. Editorial reference only.