Distributed Systems DIST · 03 · 01

Quorums: the R + W > N invariant and how it quietly breaks

A quorum is a tunable promise: if read replicas and write replicas overlap (R + W > N), a read sees the latest write. Sloppy quorums and W=1 trade that promise away — and that is where prod loses writes.

DIST Junior ◷ 16 min

Level

FoundationsJuniorMiddleSenior

A payments team runs Cassandra at RF=3 and, chasing write latency, sets writes to consistency ONE (W=1) and reads to ONE (R=1). For months it is fast and fine. Then one node hiccups during a deploy. A user updates their payout account; the single ack came from a node that died seconds later before replicating. The next read hits a different replica and serves the old account number. The write was acknowledged, durable to nobody, and silently gone. R=1 + W=1 means R + W = 2, and 2 is not greater than 3 — there was never a guarantee the read and the write touched the same node.

In the next ten minutes you will see exactly why that arithmetic matters, where it silently breaks in production, and which knob to reach for first when a stale-read ticket lands.

The overlap invariant is the whole game

Replicate one key to N nodes. A write succeeds when W of them acknowledge it; a read succeeds when R of them respond and you take the newest version. The single fact that makes this useful: if R + W > N, the set of nodes that acked the write and the set that answered the read must share at least one node by the pigeonhole principle — and that shared node carries the latest value. That is the entire guarantee. There is no magic, no consensus log, no leader: just two subsets of N forced to intersect.

Flip it around and the failure is just as mechanical. If R + W <= N, the two sets can be disjoint. A read is then allowed to pick R nodes that all missed the last write, and it will hand back stale data while believing it succeeded. Nothing logged an error, because nothing was wrong by the rules you configured — you just configured away the guarantee.

The canonical strong-consistency setting at N=3 is R=2, W=2: 2 + 2 = 4 > 3, so every read overlaps every write by one node. This is exactly what Cassandra’s QUORUM / LOCAL_QUORUM (floor(N/2) + 1 = 2 at RF=3) gives you, and what DynamoDB calls a strongly consistent read.

Config (N=3)	R + W	Overlap?	What you actually get
`W=2, R=2` (QUORUM)	4 > 3	Yes	Read sees latest write; survives 1 node down
`W=3, R=1` (write-ALL)	4 > 3	Yes	Fast reads, but ANY node down blocks writes
`W=1, R=3` (read-ALL)	4 > 3	Yes	Fast writes, but ANY node down blocks reads
`W=1, R=1` (ONE/ONE)	2 <= 3	No	Fastest + most available; stale reads are legal

Three configs clear the N=3 bar with R + W = 4, so reads and writes are forced to overlap. ONE/ONE lands at 2 — below the line — which is why its stale reads are legal, not bugs.

Tunable consistency: you pick the corner of the triangle

When you design a service backed by a Dynamo-style store, the first question is not “which database?” — it is “which corner of the consistency/availability/latency triangle do I need for each operation?” R and W give you that choice per call.

R and W are knobs, often per-query, not a global mode. That is the point of “tunable consistency” — the same N=3 cluster can serve a W=1 fire-and-forget metrics write and a W=2,R=2 strongly-consistent balance read in the same process. You are choosing, per operation, where you land on the consistency/availability/latency triangle.

Two extremes show why nobody runs them in production. W=N (write to ALL) gives the strongest write but kills write availability: with W=3 at RF=3, a single node being down — a routine deploy, a GC pause, a slow disk — fails every write, because all three must ack. Symmetrically R=N blocks all reads on one slow node. Quorum (W=2,R=2) is the sweet spot precisely because it tolerates one node down on both paths while still guaranteeing overlap. That single-node-failure tolerance is why “quorum” became the default mental model rather than ALL.

▸Why this works

Overlap guarantees you read the latest write, but not that reads are monotonic or that concurrent writes are ordered. Two clients writing the same key at W=2 can both succeed against different majorities at the same logical time; the database stores both as a conflict (sibling versions / last-write-wins by timestamp). Quorum is about visibility of a completed write, not about serialising racing writes — that is what LWT/Paxos or a consensus store is for.

Sloppy quorum + hinted handoff: trading the invariant for uptime

Strict quorum has a hard edge: if fewer than W of the designated replicas (the preference list) are reachable, the write fails. During a network partition or a multi-node outage that means downtime. Dynamo’s answer, inherited by Cassandra and DynamoDB, is the sloppy quorum: when a home replica is unreachable, the write goes to the next healthy node in the ring instead, which stores it as a hint — a parcel tagged “this really belongs to node X.” When X comes back, the holder replays the hint to it (hinted handoff) and deletes its copy.

This keeps writes flowing through failures, and that availability is the entire reason it exists. But read the fine print: the write was accepted by nodes that are not in the normal read set. A reader doing a strict R against the home replicas can completely miss a hinted write — so during the failure window, sloppy quorum does not satisfy R + W > N against the canonical N. The overlap guarantee is suspended exactly when you most want it. Worse, if the node holding the hint dies before handoff completes and you were at low durability, that write is simply lost, recoverable only by a slower anti-entropy repair if at all.

The production failure modes, and the repair that hides them

Three ways this bites in prod, all subtle because nothing errors:

W=1 lost write. The lone acking node fails before replicating. The write was “successful” and is now on zero live replicas. This is the Hook.
Sloppy-quorum stale read. During a partition, the write landed on hint-holders; a strict read on the home replicas serves the old value. Reads look healthy on both sides of the partition.
R + W <= N drift. R=1,W=2 (or any sub-overlap mix) lets a read pick the one replica that lagged. Intermittent, unreproducible “ghost stale data” tickets.

Together, these three modes share a pattern: each is correct-by-configuration, which is why no error fires. When you see a stale-data ticket with no exceptions in the logs, the first question to ask is whether R + W > N actually holds for that keyspace.

Two anti-entropy mechanisms paper over the gaps so the system is eventually consistent. Read repair: when a quorum read notices replicas disagree, the coordinator pushes the newest version to the stale ones inline — cheap, but only fixes keys you actually read. Anti-entropy repair (nodetool repair): replicas build Merkle trees, hash-tree diffs, and stream only mismatched ranges to reconcile data nobody has read. Skip scheduled repair within the tombstone GC window (gc_grace_seconds, default 10 days) and deleted data can resurrect — a separate, nasty class of bug.

Latency: a quorum read is only as fast as its slowest required replica

R is not just a correctness knob, it is a tail-latency knob. A QUORUM read at RF=3 waits for 2 of 3 replicas, so its latency is the second-fastest of three — meaning one slow replica (GC pause, hot partition, throttled disk) drags the read’s p99 even though the cluster as a whole is fine. Raising R to ALL makes you wait on the slowest of all replicas, multiplying tail risk. This is why DynamoDB strongly-consistent reads cost ~2x the throughput of eventually-consistent ones and run measurably slower, and why systems use request hedging — fire a duplicate request after a short delay and take whichever returns first — to clip the p99 that quorum’s “wait for the slowest needed replica” otherwise creates.

Pick the best fit

N=3 cluster. A balance/ledger read must never show a stale value, but you also need writes to survive a single node being down for a deploy. Pick R and W.

Quiz

At N=3, which (R, W) pair guarantees a read always sees the most recent successful write?

Quiz

Why can a sloppy quorum serve a stale read even when you configured W=2, R=2?

Order the steps

Order what happens when a strict quorum write can't reach a home replica and falls back to sloppy quorum + hinted handoff:

1 A designated replica for the key is unreachable (node down or partitioned)
2 The coordinator writes to the next healthy node on the ring instead, to still reach W acks
3 That node stores the value as a hint tagged 'belongs to node X'
4 Node X recovers; the hint holder replays the write to X (hinted handoff)
5 If a read raced the recovery, read repair / anti-entropy reconciles the stragglers

R + W > N (2 + 2 > 3) forces the write set and read set to share at least one node. That shared node (B) holds the latest write, so the read is guaranteed to see it.

Recall before you leave

01
A teammate wants to set writes to ONE (W=1) at RF=3 'just for the latency win.' Explain precisely what guarantee they give up and the concrete failure that follows.
02
Sloppy quorum keeps writes available during a partition — so why isn't it a free win, and what hides the problem in normal operation?

Recap

A quorum turns replication into a tunable promise built on one piece of arithmetic: if read and write replica counts satisfy R + W > N, the two sets are forced to intersect and the read is guaranteed to see the latest committed write. At N=3 the workhorse is W=2, R=2 (QUORUM / strongly-consistent read), which keeps the overlap while tolerating a single node down on both paths — the reason quorum beats ALL, since W=N or R=N hands all your availability to your unluckiest node. Tunable consistency means you set R and W per operation to slide along the consistency/availability/latency triangle, remembering that a quorum read waits on the slowest replica it needs, so R is also a p99 knob (hence request hedging and DynamoDB’s pricier strongly-consistent reads). Sloppy quorum with hinted handoff keeps writes flowing during failures by parking them on substitute nodes, but that suspends the overlap guarantee precisely during a partition — the seam where prod silently serves stale reads or, with W=1, loses an acknowledged write outright. Read repair and Merkle-tree anti-entropy converge the data afterward, but the senior lesson is to choose R and W to keep R + W > N for anything that must not be stale, and to know exactly which guarantee you forfeit the moment you tune below it. Now when you see a stale-read ticket with no errors in the logs, you will check the consistency levels and do the arithmetic before touching anything else.

Practice

Start at the top. Tasks go easiest → hardest: recall a fact, apply it to a case, then a senior-level stretch. Open one, attempt it, then reveal.

recallapplystretch0 of 5 done

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.

Apply this

Put this lesson to work on a real build.

Collaborative cursorsShow every connected user's live cursor and selection in a shared document, conflict-free, over WebSocket.