Distributed Systems
CAP in practice: multiple-choice review
Six questions that cut across the whole unit. Each mirrors a decision you make in a real on-call incident — not an acronym to recite, but a tradeoff to defend while the network is silent.
Confirm you can connect the formal CAP definitions, the binary CP/AP partition choice, PACELC’s healthy-state latency tradeoff, and the production failure modes — the synthesis the lesson built toward.
A vendor markets its database as 'always available, fully consistent, partition tolerant — CAP-complete.' Why is this claim incoherent under the Gilbert-Lynch proof?
A network partition splits a 5-node etcd cluster into a 2-node side and a 3-node side. The 2-node side rejects all reads and writes with errors; the 3-node side keeps serving. How do you classify this behaviour?
Two engineers both pick 'CP'. One says reads in another region now add 80 ms; the other insists CP changes nothing when the network is healthy. Who is right, and why?
Classify Google Spanner and Amazon DynamoDB (default settings) in PACELC terms, and say what that classification predicts about each system's behaviour.
A CP leader has a perfectly healthy network connection but suffers a 10-second stop-the-world GC pause; its peers stop receiving heartbeats, time out, and trigger a leader re-election. What is the root cause class?
You run an AP store (Cassandra/DynamoDB style) and resolve concurrent writes with Last-Write-Wins on wall-clock timestamps. What is the hidden failure mode?
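The Last-Write-Wins scenario in the final question can be made concrete with a small sketch. It is a toy model, not the conflict-resolution code of Cassandra or DynamoDB: the `Write` type, the `lww_merge` function, and the 200 ms skew are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    value: str
    wall_clock_ms: int  # timestamp taken from the writing node's local clock
    node: str


def lww_merge(a: Write, b: Write) -> Write:
    """Keep the write with the larger wall-clock timestamp.

    Ties are broken by node id so the merge is deterministic.
    """
    return max(a, b, key=lambda w: (w.wall_clock_ms, w.node))


# In real time, node A writes first and node B writes 50 ms later.
# Assumption for this sketch: node A's clock runs 200 ms fast.
SKEW_A_MS = 200
true_time_a = 1_000
true_time_b = 1_050

write_a = Write("cart=[book]", true_time_a + SKEW_A_MS, "A")  # stamped 1200
write_b = Write("cart=[book, pen]", true_time_b, "B")         # stamped 1050

winner = lww_merge(write_a, write_b)
print(winner.value)  # A's older write wins; B's later update is discarded
```

Neither client sees an error here. The later write is acknowledged and then dropped during the merge, which is why the failure stays hidden until someone notices missing data.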
The through-line: P is not a choice, so a partition forces the binary CP-vs-AP decision: block or return errors to keep linearizability, or stay responsive and risk serving stale or divergent data. PACELC extends this to the healthy state, where strong consistency still costs latency. The production traps (vendor ‘CAP-complete’ claims, logical partitions from GC pauses, and LWW silently dropping writes under clock skew) all resolve back to the same discipline: use the formal definitions, classify your system honestly on PC/EC vs PA/EL, and tune to the workload’s real correctness needs.
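The "block or error" side of the CP-vs-AP decision can be sketched as a majority check, using the 2-3 split of a 5-node cluster from the review. This is a simplified model under stated assumptions, not etcd's actual API: `has_quorum` and `handle_request` are hypothetical names, and the sketch treats every request as requiring a quorum.

```python
def has_quorum(reachable: int, cluster_size: int) -> bool:
    """True if this side of the partition can reach a strict majority.

    `reachable` counts the local node plus the peers it can still contact.
    """
    return reachable >= cluster_size // 2 + 1


def handle_request(reachable: int, cluster_size: int) -> str:
    """CP behaviour: serve only with a quorum, otherwise fail fast."""
    if has_quorum(reachable, cluster_size):
        return "serve"
    return "reject"


CLUSTER_SIZE = 5
for side in (2, 3):
    print(f"{side}-node side: {handle_request(side, CLUSTER_SIZE)}")
```

An AP design would return "serve" on both sides and accept that the two sides may diverge until the partition heals.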