
Distributed Systems

Time and ordering: why wall-clock timestamps lie

Crux: Across a fleet, wall clocks drift and jump backward on NTP sync, so ordering writes by client timestamp silently drops data. Logical clocks order events without trusting the wall; Spanner’s TrueTime buys ordering with hardware and a deliberate wait.

A user updates their shipping address, sees the success toast, refreshes — and the old address is back. No error, no log line, no exception. The Cassandra cluster used last-write-wins keyed on the client’s timestamp, and the node that handled the update had a clock running 1.2 seconds behind. So the “new” write arrived stamped earlier than the row it was supposed to replace. The database compared two numbers, kept the larger one, and discarded the user’s change as stale. It was telling the truth about the timestamps and lying about reality.

The wall clock is not a clock you can order by

Every machine has a quartz crystal that drifts — typically tens of parts per million, so a free-running clock wanders by milliseconds per minute and seconds per day. NTP corrects this by periodically slewing or stepping the clock toward a reference. Stepping is the dangerous part: when the local clock is far enough off, ntpd jumps it, and that jump can go backward. For a few hundred milliseconds, “now” is earlier than a “now” you already read. Across a fleet of dozens of nodes, even healthy NTP leaves you with skew on the order of milliseconds; a failed NTP daemon or a VM paused by its hypervisor pushes that to seconds or minutes.
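The drift figures above follow from simple arithmetic. A minimal sketch, assuming drift rates of 10 and 50 ppm as illustrative values for "tens of parts per million" (the function name `drift_seconds` is made up for this example):

```python
def drift_seconds(ppm: float, elapsed_seconds: float) -> float:
    """Worst-case offset a free-running clock accumulates at a given drift rate."""
    return ppm * 1e-6 * elapsed_seconds


for ppm in (10, 50):  # illustrative drift rates, not measurements
    per_minute_ms = drift_seconds(ppm, 60) * 1000
    per_day_s = drift_seconds(ppm, 86_400)
    print(f"{ppm} ppm: {per_minute_ms:.1f} ms/min, {per_day_s:.2f} s/day")
# 10 ppm: 0.6 ms/min, 0.86 s/day
# 50 ppm: 3.0 ms/min, 4.32 s/day
```

Even the low end is enough to misorder two writes that land within a millisecond of each other on different machines.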

This is why a timestamp generated on machine A and a timestamp generated on machine B are not comparable as an ordering. They share units and nothing else. The moment your correctness depends on tsA < tsB meaning “A happened before B,” you have built on sand.

Last-write-wins is data loss waiting for a clock to slip

The address bug is the canonical production failure, and it is built into any system that resolves conflicts by comparing wall-clock timestamps: Cassandra, Riak, DynamoDB-style stores in LWW mode. The rule is “keep the write with the highest timestamp.” When clocks are skewed, the highest timestamp is not the latest write — it is the write from the node whose clock runs fast. The genuinely newer write loses, and crucially the client gets a success response. There is no error to catch, no metric to alert on. The data is simply gone, and you find out when a customer complains.
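The failure can be reproduced in a few lines. This is a toy model, not Cassandra's actual write path: `LwwRegister` is a hypothetical single-key store, and the 1.2-second skew is taken from the address example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cell:
    value: str
    ts: float  # writer-supplied wall-clock timestamp, in seconds


class LwwRegister:
    """Hypothetical single-key store that resolves conflicts by timestamp."""

    def __init__(self) -> None:
        self.cell: Optional[Cell] = None

    def write(self, value: str, ts: float) -> str:
        if self.cell is None or ts > self.cell.ts:
            self.cell = Cell(value, ts)
        # Success is reported whether or not the write was kept.
        return "OK"


TRUE_TIME = 1000.0  # arbitrary reference instant
store = LwwRegister()

# Original write, stamped by a node with an accurate clock.
store.write("old address", ts=TRUE_TIME)

# Update made 1.0s later in real time, stamped by a node running 1.2s behind.
status = store.write("new address", ts=(TRUE_TIME + 1.0) - 1.2)

print(status, "->", store.cell.value)  # OK -> old address
```

The update is stamped 999.8, loses the comparison against 1000.0, and the caller still sees `OK`.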

It gets worse with deletes. A tombstone written with a far-future timestamp — a buggy client, a node whose clock jumped forward — will suppress every real write below it until compaction garbage-collects the tombstone, which can be days. One slipped clock can erase a key for a week. Jepsen’s analysis put it bluntly: wall-clock timestamps are fundamentally unsafe ordering constructs.
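The tombstone case can be sketched the same way. The class `LwwKey` and the one-week offset are assumptions for illustration; the sketch ignores compaction and garbage collection entirely.

```python
from typing import Optional, Tuple

TOMBSTONE = None  # a deleted key is modelled as a None value


class LwwKey:
    """Hypothetical LWW cell holding (timestamp, value-or-tombstone)."""

    def __init__(self) -> None:
        self.state: Tuple[float, Optional[str]] = (float("-inf"), TOMBSTONE)

    def apply(self, ts: float, value: Optional[str]) -> None:
        if ts > self.state[0]:
            self.state = (ts, value)

    def read(self) -> Optional[str]:
        return self.state[1]


now = 1_700_000_000.0
week = 7 * 86_400

key = LwwKey()
key.apply(now, "v1")
key.apply(now + week, TOMBSTONE)  # delete stamped a week in the future
key.apply(now + 60, "v2")         # real write, silently shadowed
key.apply(now + 3600, "v3")       # real write, silently shadowed

print(key.read())  # None
```

Every write stamped below the tombstone is rejected by the same comparison that is supposed to keep the newest value.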

Why “just run NTP” is not enough

“Just run NTP” does not save you. NTP keeps skew small on average, but LWW is not an average — it is a worst case. You lose data on the one node, the one minute, where the clock slipped. And the failure is silent: the write returns 200 OK, the row is dropped, and nothing in your logs distinguishes it from a write that simply never happened.

Logical clocks: order events without trusting the wall

The fix is to stop measuring physical time and start measuring causality. A Lamport timestamp is a single counter per process: increment on every local event, and on receiving a message set your counter to max(local, received) + 1. This guarantees that if event A happens-before B (A causally precedes B), then L(A) < L(B). Break ties between equal counters by process ID and you get a total order that is consistent with causality, in O(1) space per event.
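The two rules fit in a few lines. A minimal sketch; the class and method names are illustrative and do not come from any particular library:

```python
class LamportClock:
    """One counter per process, advanced by the two Lamport rules."""

    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        """Local event (including a send): increment the counter."""
        self.time += 1
        return self.time

    def receive(self, received: int) -> int:
        """Message receipt: jump past both the local and the received counter."""
        self.time = max(self.time, received) + 1
        return self.time


a, b = LamportClock(), LamportClock()

a.tick()               # a = 1
a.tick()               # a = 2
msg = a.tick()         # a = 3; the message carries 3
b.tick()               # b = 1
recv = b.receive(msg)  # b = max(1, 3) + 1 = 4

print(msg, recv)  # 3 4
```

The send (3) is ordered before the receive (4) regardless of what either machine's wall clock said. Sorting events by the pair (counter, process ID) is the usual way to break ties between equal counters.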

What Lamport clocks cannot do is tell you the converse: L(A) < L(B) does not mean A caused B. Two events on nodes that never communicated can be genuinely concurrent, yet Lamport hands them an arbitrary order anyway — it cannot detect concurrency, only impose an order. For a conflict-resolution system, that blindness is the whole problem: you need to know two writes were concurrent so you can merge them, not silently pick one.

Vector clocks recover that. Each node carries a vector — one counter per node in the system — and ships the whole vector with each message. Now you can compare two events: if every component of A is ≤ B’s and at least one is strictly less, A happened-before B; if neither dominates the other, they are concurrent, and the system can surface both versions (siblings) for the application to merge. The cost is the catch: O(n) space per stamped value, where n is the number of nodes, plus the metadata travels with every write. On a 200-node cluster that is 200 counters attached to data, and the vector grows as the cluster grows.
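The comparison rule translates directly into code. A minimal sketch, assuming clocks are dicts from node ID to counter with missing entries treated as 0; the function names and the node IDs "A" and "B" are made up for the example:

```python
from typing import Dict

VectorClock = Dict[str, int]  # node id -> counter; missing nodes count as 0


def happened_before(a: VectorClock, b: VectorClock) -> bool:
    """True if every component of a is <= b's and at least one is strictly less."""
    nodes = a.keys() | b.keys()
    return (all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
            and any(a.get(n, 0) < b.get(n, 0) for n in nodes))


def compare(a: VectorClock, b: VectorClock) -> str:
    if happened_before(a, b):
        return "before"
    if happened_before(b, a):
        return "after"
    if all(a.get(n, 0) == b.get(n, 0) for n in a.keys() | b.keys()):
        return "equal"
    return "concurrent"  # neither dominates: keep both as siblings


def merge(a: VectorClock, b: VectorClock) -> VectorClock:
    """Clock to stamp on the value produced by merging two siblings."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in sorted(a.keys() | b.keys())}


base = {"A": 1}
write_a = {"A": 2}          # node A updates the value it read at base
write_b = {"A": 1, "B": 1}  # node B updates the same base independently

print(compare(base, write_a))     # before
print(compare(write_a, write_b))  # concurrent
print(merge(write_a, write_b))    # {'A': 2, 'B': 1}
```

The "concurrent" result is the signal a Lamport timestamp cannot produce: the store knows it holds two siblings and must merge them rather than pick one.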

| Approach | Detects concurrency? | Cost per stamp | Production gotcha |
| --- | --- | --- | --- |
| Wall-clock timestamp (LWW) | No | O(1), one number | Silent data loss on clock skew; no error returned |
| Lamport timestamp | No, only a total order | O(1), one counter | Orders concurrent writes arbitrarily; can’t merge conflicts |
| Vector clock | Yes, flags concurrent events | O(n), one counter per node | Metadata grows with the cluster; needs pruning |
| TrueTime (Spanner) | N/A, gives real ordering | GPS + atomic clocks per datacenter | Commit-wait adds latency; needs special hardware |

TrueTime: pay hardware to make the wall clock honest

Google’s Spanner takes the opposite bet: instead of giving up on physical time, make it trustworthy and quantify your ignorance. TrueTime equips every datacenter with time masters backed by GPS receivers and atomic clocks, and each machine polls them roughly every 30 seconds. TT.now() returns not a single instant but an interval [earliest, latest] that is guaranteed to contain the true time. The uncertainty ε is half the width of that interval: typically around 1ms (under 1ms at the 99th percentile in Google’s reported numbers), occasionally widening to single-digit milliseconds when a time master is slow to respond.

The clever part is commit-wait. Spanner assigns a commit timestamp s, then deliberately waits until TT.now().earliest > s, that is, until it is certain that s lies in the past everywhere. The wait is about 2ε, the full width of the interval, so a few milliseconds at most, and it overlaps with Paxos replication, so it often costs little wall time. The payoff is external consistency (linearizability across the whole database): if transaction T1 commits before T2 starts, T1’s timestamp is guaranteed smaller than T2’s, globally and with no skew caveat. Spanner converts clock uncertainty from a silent correctness bug into an explicit, bounded latency cost.
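Commit-wait can be imitated with an ordinary clock to see its shape. This is a toy simulation, not Spanner's API: `SimulatedTrueTime` simply pads the local clock by a fixed, assumed ε of 2 ms, whereas real TrueTime derives its bound from GPS and atomic references.

```python
import time
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TTInterval:
    earliest: float
    latest: float


class SimulatedTrueTime:
    """Toy stand-in for TrueTime: local clock plus/minus a fixed epsilon."""

    def __init__(self, epsilon_s: float) -> None:
        self.epsilon_s = epsilon_s

    def now(self) -> TTInterval:
        t = time.time()
        return TTInterval(t - self.epsilon_s, t + self.epsilon_s)


def commit_with_wait(tt: SimulatedTrueTime) -> Tuple[float, float]:
    """Pick a commit timestamp, then block until it is certainly in the past."""
    started = time.monotonic()
    s = tt.now().latest            # no earlier than any possible "now"
    while tt.now().earliest <= s:  # commit-wait
        time.sleep(tt.epsilon_s / 10)
    return s, time.monotonic() - started


tt = SimulatedTrueTime(epsilon_s=0.002)  # assumed 2 ms uncertainty
s, waited = commit_with_wait(tt)
print(f"commit timestamp {s:.6f}, waited {waited * 1000:.1f} ms")
```

In this sketch the wait comes out near 2ε, because s is taken from the top of one interval and the loop runs until the bottom of a later interval has passed it. Once the function returns, any transaction that starts afterwards will be assigned a timestamp greater than s.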

Pick the best fit

A multi-region key-value store needs to resolve concurrent writes to the same key without silently dropping a user's update. What ordering mechanism fits?

Quiz

A Cassandra cluster uses last-write-wins on client timestamps. One node's clock is 1.5s behind. A user's update routed through that node disappears after a refresh. Why did the client see no error?

Quiz

What does Spanner's commit-wait actually do, and what does it buy?

Order the steps

Order the chain of events that turns clock skew into silent data loss under last-write-wins:

  1. One node's clock drifts (or NTP steps it) behind the rest of the fleet
  2. A genuinely newer write is routed through that node and stamped with a past timestamp
  3. Conflict resolution compares timestamps and keeps the existing (higher-timestamp) row
  4. The newer write is discarded, but the client still receives a success response
  5. Weeks later a customer reports the change vanished; no log or metric flagged it

Recall before you leave

  1. A teammate says 'we run NTP on every node, so last-write-wins by timestamp is safe.' Explain why that reasoning is wrong and what actually happens.
  2. When would you reach for vector clocks over Lamport timestamps, and what does it cost you?

Recap

Across a fleet, wall-clock time is not an ordering primitive: quartz drift plus NTP steps (which can jump a clock backward) leave you with skew of milliseconds normally and seconds-to-minutes when something breaks. Any system that resolves conflicts by comparing client timestamps — last-write-wins in Cassandra, Riak, DynamoDB LWW mode — turns that skew into silent data loss, because the highest timestamp belongs to the fastest clock, not the latest write, and the client still gets a success. Logical clocks fix ordering by measuring causality instead of physical time: Lamport timestamps give a cheap O(1) total order but cannot detect concurrency, so they impose an arbitrary order on concurrent writes; vector clocks carry one counter per node, detect concurrency, and let the store keep conflicting siblings — at O(n) metadata that grows with the cluster. Google Spanner takes the third path: TrueTime quantifies clock uncertainty as a bounded interval ε (about 1ms via GPS and atomic clocks) and commit-wait deliberately sleeps out that window to deliver external consistency, converting a silent correctness bug into an explicit, bounded latency cost. The senior instinct is simple: never trust the wall to order distributed writes.

