Co-location and Citus: the invariant that makes sharding usable

Co-location (same shard key means same worker) keeps tenant-scoped joins single-node. Citus's table types (distributed, reference, local) and query planner make the difference between single-shard and fan-out queries explicit.

A team migrates to Citus and is thrilled that queries for individual tenants are as fast as before. Then they add a new table without the tenant_id column, and a new join query takes 500ms instead of 5ms. The schema review catches why: one table was not co-located.

Citus architecture

Why does Citus matter when you could just handle routing in application code? Because getting the coordinator, worker, and table-type model right is what separates a sharded system that stays fast from one that silently fans out every query.

Citus is a Postgres extension that turns a cluster of Postgres instances into one logical sharded database:

Coordinator: one node that holds the cluster metadata (shard map, table distribution info), parses queries, plans the distributed execution, and routes results back to the client.
Workers: N nodes that hold the actual data (shards). Workers execute Postgres queries locally and return results to the coordinator.
Client: connects to the coordinator as if it were a normal Postgres. No driver changes required.

Client → Coordinator (holds metadata, plans, routes)
              ↓               ↓               ↓
          Worker 0        Worker 1        Worker 2
        (shards 0-10)  (shards 11-21)  (shards 22-31)
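
One way to see this layout on a live cluster is to ask the coordinator for its metadata. A minimal sketch, assuming Citus 10 or later (where the `citus_shards` view and `citus_get_active_worker_nodes()` are available) and a session connected to the coordinator:

```sql
-- List the workers the coordinator currently knows about.
SELECT node_name, node_port
FROM citus_get_active_worker_nodes();

-- Count shard placements per worker across all Citus-managed tables.
SELECT nodename, nodeport, count(*) AS placement_count
FROM citus_shards
GROUP BY nodename, nodeport
ORDER BY nodename, nodeport;
```

A heavily skewed placement count is usually the first hint that the cluster needs rebalancing.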

Three table types

Type	Lives on	Use for	Join cost
Distributed	Workers, split by shard key	All tenant-scoped tables (orders, users, projects, …)	Single-node if co-located; fan-out if not
Reference	Full copy on every worker	Small mostly-read lookup tables (plans, feature_flags, countries)	Always local; write via 2PC (slow)
Local	Coordinator only	Admin/control plane tables (tenants list, workers_meta)	Never local to a worker; joins with distributed tables are unsupported on older Citus versions and move data between nodes on newer ones

Together these three types cover the full design surface: distributed for your core data, reference for the small read-heavy lookups that need to join everywhere, and local for the control-plane records that orchestrate the cluster itself. A table in the wrong category silently blows up join cost.
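
Before a table can be distributed by tenant_id it has to carry that column, and Citus requires primary-key and unique constraints on a distributed table to include the distribution column. A sketch of what that could look like; the column names and types here are illustrative assumptions, not a schema taken from this lesson:

```sql
-- Hypothetical tenant-scoped tables: tenant_id leads every primary key
-- so the constraint can be enforced inside a single shard.
CREATE TABLE users (
    tenant_id bigint NOT NULL,
    id        bigint NOT NULL,
    email     text   NOT NULL,
    PRIMARY KEY (tenant_id, id)
);

CREATE TABLE projects (
    tenant_id bigint NOT NULL,
    id        bigint NOT NULL,
    owner_id  bigint NOT NULL,
    name      text   NOT NULL,
    PRIMARY KEY (tenant_id, id)
);
```

A new table that lacks tenant_id cannot join this group at all, which is the failure mode from the opening scenario.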

-- Distribute tenant-scoped tables by tenant_id
SELECT create_distributed_table('users',       'tenant_id');
SELECT create_distributed_table('projects',    'tenant_id', colocate_with => 'users');
SELECT create_distributed_table('tasks',       'tenant_id', colocate_with => 'users');
SELECT create_distributed_table('comments',    'tenant_id', colocate_with => 'users');
SELECT create_distributed_table('attachments', 'tenant_id', colocate_with => 'users');

-- Replicate small lookup tables to every worker
SELECT create_reference_table('plans');
SELECT create_reference_table('feature_flags');
SELECT create_reference_table('countries');

-- tenants table stays local on coordinator (control plane)
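
After running statements like the ones above, it is worth confirming that the tables ended up where intended. A minimal check, assuming Citus 10 or later, where the `citus_tables` view exists:

```sql
-- Distributed tables that are co-located share one colocation_id.
-- Reference tables appear with citus_table_type = 'reference'.
SELECT table_name,
       citus_table_type,
       distribution_column,
       colocation_id,
       shard_count
FROM citus_tables
ORDER BY citus_table_type, colocation_id, table_name;
```

If the five tenant-scoped tables report different colocation_id values, joins between them will not stay on one worker. Plain local tables typically do not appear in this view unless they have been added to Citus metadata.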

Co-location: the central invariant

Co-location means that tables distributed by the same key have their corresponding shards on the same physical worker. If orders and payments are both distributed by tenant_id, then all of tenant 42’s orders and all of tenant 42’s payments live on Worker 1.
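
The placement claim can be checked directly. The sketch below assumes Citus 10 or later, that `orders` and `payments` are both distributed by an integer tenant_id, and uses tenant 42 purely as an example value:

```sql
-- Resolve the shard that holds tenant 42 in each table,
-- then look up which worker hosts that shard.
SELECT s.table_name, s.shardid, s.nodename, s.nodeport
FROM citus_shards AS s
WHERE s.shardid IN (
    get_shard_id_for_distribution_column('orders',   42),
    get_shard_id_for_distribution_column('payments', 42)
);
```

With co-location in place, both rows should name the same nodename and nodeport. Different nodes mean the tables are not in the same co-location group.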

A query joining orders and payments filtered by tenant_id = 42:

With co-location: the planner pushes the join to Worker 1, which executes it like a normal Postgres join — one machine, full index use, single-digit milliseconds.
Without co-location (e.g., payments distributed by payment_id instead): the coordinator must gather partial results from all workers and merge them. P99 latency = slowest worker. Every worker participates. 10–100× slower.
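
To see which of the two cases a query falls into, run it through EXPLAIN on the coordinator. A sketch, assuming hypothetical columns `orders.id`, `payments.order_id`, and `payments.amount`:

```sql
EXPLAIN
SELECT o.id, p.amount
FROM orders AS o
JOIN payments AS p
  ON p.tenant_id = o.tenant_id   -- join on the distribution column
 AND p.order_id  = o.id
WHERE o.tenant_id = 42;          -- filter on the distribution column
```

When the join is co-located and filtered by tenant, the Citus plan node typically reports a single task aimed at one worker. If the tables are not co-located, expect a multi-task plan, a repartition step, or an error, depending on the Citus version and settings.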

The same join can differ by up to two orders of magnitude. With co-location it stays on one worker at single-digit milliseconds; without it the coordinator fans out to every worker and the query can be 10–100× slower.

Co-location is not a performance optimization — it is the design contract that makes a sharded system behave like a single Postgres for the 99% case.

Cross-shard queries and their mitigation

Queries without a shard key filter fan out to all shards. Examples:

“List all users with email ending in @enterprise.com” (no tenant_id)
“Count total orders today across all tenants” (cross-tenant analytics)
“Find a user by their email for login” (email, not tenant_id)

Mitigations for an OLTP cluster:

Redesign the API: almost all OLTP queries should carry tenant_id. Login lookups by email need a separate index or a small lookup service that resolves email → tenant_id first.
CDC to OLAP: cross-tenant analytics should run on a separate analytical store (ClickHouse, BigQuery) fed by Change Data Capture. Never run global aggregates on the OLTP cluster.
Rate-limit fan-out endpoints: for the rare legitimate cross-shard operation, rate-limit and document its cost.
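
One possible shape for the email → tenant_id resolution is a small lookup table distributed by email, so the login lookup is itself a single-shard query. This is an assumption about how such a lookup could be built, not the lesson's prescribed design; the table name `user_email_lookup` and all columns are hypothetical:

```sql
-- Hypothetical lookup table, sharded by email rather than tenant_id.
CREATE TABLE user_email_lookup (
    email     text   PRIMARY KEY,
    tenant_id bigint NOT NULL,
    user_id   bigint NOT NULL
);
SELECT create_distributed_table('user_email_lookup', 'email',
                                colocate_with => 'none');

-- Step 1: single-shard lookup, routed by email.
SELECT tenant_id, user_id
FROM user_email_lookup
WHERE email = 'ada@example.com';

-- Step 2: tenant-scoped query, routed by the tenant_id found in step 1
-- (42 and 1001 stand in for the values returned above).
SELECT id, email
FROM users
WHERE tenant_id = 42 AND id = 1001;
```

The trade-off is that the application has to keep the lookup table in sync whenever a user is created or changes their email.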

The senior metric: cross-shard queries should be < 1–2% of OLTP traffic. Above that, the schema has drifted from co-location and needs review. When you see that ratio climbing past 2%, treat it as a schema bug — audit which tables lack a co-located distribution key and fix the model before the fan-out becomes the dominant cost.
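
The audit itself can start from the coordinator's metadata. A sketch, assuming Citus 10 or later for `citus_tables` and `alter_distributed_table`, and using a hypothetical `audit_log` table that was distributed by `id` as the offender:

```sql
-- Group distributed tables by co-location group. Tenant-scoped tables
-- that sit outside the main group are the candidates to review.
SELECT colocation_id,
       distribution_column,
       count(*) AS table_count,
       array_agg(table_name::text ORDER BY table_name::text) AS tables
FROM citus_tables
WHERE citus_table_type = 'distributed'
GROUP BY colocation_id, distribution_column
ORDER BY table_count DESC;

-- Hypothetical fix: audit_log must already have a populated tenant_id column.
SELECT alter_distributed_table('audit_log',
                               distribution_column => 'tenant_id',
                               colocate_with       => 'users');
```

Changing the distribution column rewrites the table's data, so on a large table this is best treated as a planned migration rather than a quick fix.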

Why this works

Why does Citus default to 32 shards per table even on a 4-worker cluster? Shards are the unit of rebalancing — more shards means finer-grained rebalancing and a smoother distribution when you add workers. With 32 shards on 4 workers, each worker gets 8 shards; adding a 5th worker lets the rebalancer move some shards without cutting any shard in half. Teams often raise this to 64 or 128 when planning for larger clusters. The default of 32 is a starting point, not a mandate.
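
A sketch of how the shard count is raised and how a new worker is brought in. The `citus.shard_count` setting, `citus_add_node`, and `rebalance_table_shards` are existing Citus features (Citus 10 or later is assumed); the hostname and the value 64 are placeholders:

```sql
-- Applies to distributed tables created afterwards that start a new
-- co-location group; tables created with colocate_with inherit the
-- shard count of the table they are co-located with.
SET citus.shard_count = 64;
SELECT create_distributed_table('users', 'tenant_id');

-- Later: add a fifth worker and let the rebalancer move whole shards to it.
SELECT citus_add_node('worker-5.example.internal', 5432);
SELECT rebalance_table_shards();

-- Check how the shards of one table are spread after the move.
SELECT nodename, count(*) AS shard_count
FROM citus_shards
WHERE table_name = 'users'::regclass
GROUP BY nodename
ORDER BY nodename;
```

With 32 shards and 5 workers an even spread is 6 or 7 shards per worker. A higher shard count makes that spread finer as the cluster grows.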

Quiz

What is the benefit of marking a table as a Citus reference table instead of a distributed table?

Quiz

A new engineer adds table 'audit_log' distributed by 'id' (not tenant_id) to a tenant-sharded Citus cluster. What breaks?

Same shard key means matching shards land on the same worker, so a tenant-scoped join of orders and payments stays single-node. Distribute one table by a different key and the join fans out to every worker — 10-100x slower.

Recall before you leave

01
Explain what co-location means in Citus and what happens to query performance when it is violated.
02
What are the three Citus table types and which use case does each serve?
03
How should a cross-tenant query like 'count all orders today' be handled in a tenant-sharded Citus cluster?

Recap

Citus adds a coordinator (metadata, planning, routing) and N workers (shard storage and execution) to a Postgres cluster, making it look like a single database. Tables are classified as distributed (sharded by key, on workers), reference (full copy on every worker for local joins), or local (coordinator only for control-plane data). Co-location — every tenant-scoped table distributed by the same key so matching shards land on the same worker — is the invariant that keeps tenant-scoped joins single-node and fast. Breaking co-location (distributing a table by a different key) turns every join involving that table into a cross-shard fan-out: N× the work, P99 = slowest worker. Cross-shard analytics should be offloaded to a dedicated OLAP store via CDC, never run on the OLTP cluster. Now when you review a new Citus table definition, the first question is always: what is its distribution key, and is it co-located with the tables it will join?
