Distributed Systems DIST · 05 · 01

Time and ordering: why wall-clock timestamps lie

Across a fleet, wall clocks drift and jump backward on NTP sync, so ordering writes by client timestamp silently drops data. Logical clocks order events without trusting the wall; Spanner's TrueTime buys ordering with hardware and a deliberate wait.

DIST Junior ◷ 17 min

A user updates their shipping address, sees the success toast, refreshes — and the old address is back. No error, no log line, no exception. The Cassandra cluster used last-write-wins keyed on the client’s timestamp, and the node that handled the update had a clock running 1.2 seconds behind. So the “new” write arrived stamped earlier than the row it was supposed to replace. The database compared two numbers, kept the larger one, and discarded the user’s change as stale. It was telling the truth about the timestamps and lying about reality. By the end of this lesson you will know exactly why that happens, which clock primitives prevent it, and how to choose between them when the next conflict-resolution decision lands on your plate.

The wall clock is not a clock you can order by

When you look at two timestamps from different servers, they feel comparable — they’re both in milliseconds since epoch. But that feeling is wrong, and understanding why matters every time you design a distributed store. Every machine has a quartz crystal that drifts — typically tens of parts per million, so a free-running clock wanders by milliseconds per minute and seconds per day. NTP corrects this by periodically slewing or stepping the clock toward a reference. Stepping is the dangerous part: when the local clock is far enough off, ntpd jumps it, and that jump can go backward. For a few hundred milliseconds, “now” is earlier than a “now” you already read. Across a fleet of dozens of nodes, even healthy NTP leaves you with skew on the order of milliseconds; a failed NTP daemon or a VM paused by its hypervisor pushes that to seconds or minutes.
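The drift figures above are easy to check with a little arithmetic. The sketch below assumes a constant drift rate and uses a hypothetical helper, `drift_seconds`, that is not part of any library; real crystals vary with temperature and age, so treat the numbers as orders of magnitude.

```python
# Hypothetical helper: offset accumulated by a free-running clock,
# assuming a constant drift rate expressed in parts per million (ppm).
def drift_seconds(ppm: float, elapsed_seconds: float) -> float:
    return ppm * 1e-6 * elapsed_seconds


for ppm in (20, 50):
    per_minute_ms = drift_seconds(ppm, 60) * 1000
    per_day_s = drift_seconds(ppm, 86_400)
    print(f"{ppm} ppm: {per_minute_ms:.1f} ms/minute, {per_day_s:.2f} s/day")

# 20 ppm: 1.2 ms/minute, 1.73 s/day
# 50 ppm: 3.0 ms/minute, 4.32 s/day
```

Two machines drifting in opposite directions diverge at the sum of their rates, which is why skew between nodes can be larger than either node's own error.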

This is why a timestamp generated on machine A and a timestamp generated on machine B are not comparable as an ordering. They share units and nothing else. The moment your correctness depends on tsA < tsB meaning “A happened before B,” you have built on sand.

Last-write-wins is data loss waiting for a clock to slip

The address bug is the canonical production failure, and it is built into any system that resolves conflicts by comparing wall-clock timestamps: Cassandra, Riak, DynamoDB-style stores in LWW mode. The rule is “keep the write with the highest timestamp.” When clocks are skewed, the highest timestamp is not the latest write — it is the write from the node whose clock runs fast. The genuinely newer write loses, and crucially the client gets a success response. There is no error to catch, no metric to alert on. The data is simply gone, and you find out when a customer complains.
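To make the failure concrete, here is a minimal sketch of last-write-wins with one slow clock. The `Write` type, `lww_merge` function, and the timestamps are illustrative assumptions, not Cassandra's actual code; in particular, the tie-breaking rule here (keep the existing row) is a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    value: str
    timestamp_ms: int  # wall-clock stamp from whichever machine stamped the write


def lww_merge(existing: Write, incoming: Write) -> Write:
    """Last-write-wins: keep the higher timestamp (this sketch keeps existing on a tie)."""
    return incoming if incoming.timestamp_ms > existing.timestamp_ms else existing


SKEW_MS = 1_200  # the slow clock runs 1.2 s behind true time

# True time 10_000 ms: the old address is written with an accurate clock.
old = Write("old address", timestamp_ms=10_000)
# True time 10_500 ms: the update is stamped by the slow clock, so it lands in the past.
new = Write("new address", timestamp_ms=10_500 - SKEW_MS)

stored = lww_merge(old, new)
print(stored.value)  # old address
```

Nothing in `lww_merge` can raise or report a problem: from the store's point of view it compared two integers and kept the larger one, which is exactly why the client still sees a success.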

It gets worse with deletes. A tombstone written with a far-future timestamp — a buggy client, a node whose clock jumped forward — will suppress every real write below it until compaction garbage-collects the tombstone, which can be days. One slipped clock can erase a key for a week. Jepsen’s analysis put it bluntly: wall-clock timestamps are fundamentally unsafe ordering constructs.
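The tombstone case follows from the same comparison. In this sketch a cell is a hypothetical `(timestamp_ms, value)` pair, with `None` standing in for a tombstone; real stores track more metadata, so this only illustrates the ordering rule.

```python
from typing import Optional, Tuple

# (timestamp_ms, value); a value of None marks a tombstone in this sketch.
Cell = Tuple[int, Optional[str]]


def lww_apply(current: Cell, incoming: Cell) -> Cell:
    """Keep whichever cell carries the higher timestamp."""
    return incoming if incoming[0] > current[0] else current


cell: Cell = (1_000, "12 Elm St")

# A delete stamped far in the future by a buggy client or a jumped clock.
cell = lww_apply(cell, (9_999_999_999_999, None))

# Every later, correctly stamped write loses to the tombstone.
for ts, value in [(2_000, "34 Oak Ave"), (3_000, "56 Pine Rd")]:
    cell = lww_apply(cell, (ts, value))

print(cell)  # (9999999999999, None)
```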

Why this works

“Just run NTP” does not save you. NTP keeps skew small on average, but LWW is not an average — it is a worst case. You lose data on the one node, the one minute, where the clock slipped. And the failure is silent: the write returns 200 OK, the row is dropped, and nothing in your logs distinguishes it from a write that simply never happened.

Logical clocks: order events without trusting the wall

If you can’t trust physical time, what can you trust? Causality: whether one event could have influenced another, not when each happened by a wall clock. A Lamport timestamp is a single counter per process: increment it on every local event (sends included), attach it to every outgoing message, and on receiving a message set your counter to max(local, received) + 1. This guarantees that if event A happens-before B (A causally precedes B), then L(A) < L(B). Break ties between equal counters by process ID and you get a total order consistent with causality, in O(1) space per event.

What Lamport clocks cannot do is tell you the converse: L(A) < L(B) does not mean A caused B. Two events on nodes that never communicated can be genuinely concurrent, yet Lamport hands them an arbitrary order anyway — it cannot detect concurrency, only impose an order. For a conflict-resolution system, that blindness is the whole problem: you need to know two writes were concurrent so you can merge them, not silently pick one.
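The Lamport rule and its blind spot fit in a few lines. This is a minimal sketch; the class and method names (`LamportClock`, `tick`, `receive`) are illustrative, and message passing is simulated by handing the stamp from one object to another.

```python
class LamportClock:
    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        """Local event (a send counts as one): increment and return the stamp."""
        self.time += 1
        return self.time

    def receive(self, received: int) -> int:
        """On receipt, jump past both our own counter and the sender's stamp."""
        self.time = max(self.time, received) + 1
        return self.time


a, b, c = LamportClock(), LamportClock(), LamportClock()

sent = a.tick()              # A sends a message stamped 1
received = b.receive(sent)   # B receives it: max(0, 1) + 1 = 2
unrelated = c.tick()         # C never talks to A or B; its event is stamped 1

print(sent, received, unrelated)  # 1 2 1

# Tie-breaking by node ID yields a total order, but the order says nothing
# about whether C's event is causally related to the others.
events = sorted([(received, "b"), (sent, "a"), (unrelated, "c")])
print(events)  # [(1, 'a'), (1, 'c'), (2, 'b')]
```

The send is ordered before the receive, as causality requires. C's event is also ordered before B's, even though neither could have influenced the other: the stamps alone cannot tell those two situations apart.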

Vector clocks recover that. Each node carries a vector — one counter per node in the system — and ships the whole vector with each message. Now you can compare two events: if every component of A is ≤ B’s and at least one is strictly less, A happened-before B; if neither dominates the other, they are concurrent, and the system can surface both versions (siblings) for the application to merge. The cost is the catch: O(n) space per stamped value, where n is the number of nodes, plus the metadata travels with every write. On a 200-node cluster that is 200 counters attached to data, and the vector grows as the cluster grows.
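The comparison rule can be sketched directly. Here a vector clock is assumed to be a plain dict from node ID to counter, with missing entries treated as zero; `compare` and `merge` are illustrative names rather than any particular database's API.

```python
from typing import Dict

VectorClock = Dict[str, int]  # node ID -> counter; a missing node counts as 0


def compare(a: VectorClock, b: VectorClock) -> str:
    """Return 'before', 'after', 'equal', or 'concurrent' for a relative to b."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


def merge(a: VectorClock, b: VectorClock) -> VectorClock:
    """Component-wise maximum, used when a node learns of another's history."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}


v1 = {"n1": 1}
v2 = {"n1": 1, "n2": 1}   # n2 saw v1, then wrote
v3 = {"n1": 2}            # n1 wrote again without seeing n2's write

print(compare(v1, v2))  # before
print(compare(v3, v2))  # concurrent
print(merge(v3, v2) == {"n1": 2, "n2": 1})  # True
```

A "concurrent" result is the signal a store needs to keep both versions as siblings instead of silently choosing one.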

Approach	Detects concurrency?	Cost per stamp	Production gotcha
Wall-clock timestamp (LWW)	No	O(1), one number	Silent data loss on clock skew; no error returned
Lamport timestamp	No — only total order	O(1), one counter	Orders concurrent writes arbitrarily; can’t merge conflicts
Vector clock	Yes — flags concurrent events	O(n), one counter per node	Metadata grows with the cluster; needs pruning
TrueTime (Spanner)	N/A — gives real ordering	GPS + atomic clocks per datacenter	Commit-wait adds latency; needs special hardware

Only the vector clock detects concurrency in software; only TrueTime trusts physical time — and pays hardware and commit-wait latency for it.

TrueTime: pay hardware to make the wall clock honest

Google’s Spanner takes the opposite bet: instead of giving up on physical time, make it trustworthy and quantify your ignorance. TrueTime equips every datacenter with time masters backed by GPS receivers and atomic clocks; each machine syncs against them roughly every 30 seconds, and TT.now() returns not a single instant but an interval [earliest, latest] that is guaranteed to contain the true time. The half-width of that interval is the uncertainty ε, so the interval spans 2ε. The original Spanner paper reported ε sawtoothing between about 1 and 7 ms across each sync interval; Google’s more recent reported numbers put it under 1 ms at the 99th percentile, occasionally widening to single-digit milliseconds when a time master is slow to reach.

The clever part is commit-wait. Spanner assigns a commit timestamp s, then deliberately waits until TT.now().earliest > s, that is, until it is certain s lies in the past on every machine. Because s is chosen no earlier than TT.now().latest, the wait is roughly 2ε: a few milliseconds at most, and it overlaps with Paxos replication so it often costs little extra wall time. The payoff is external consistency (linearizability across the whole database): if transaction T1 commits before T2 starts, T1’s timestamp is guaranteed smaller than T2’s, globally and with no skew caveat. Spanner converts clock uncertainty from a silent correctness bug into an explicit, bounded latency cost.
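Commit-wait itself is a short loop. The sketch below is a toy, not Spanner's implementation: `tt_now` fakes TrueTime by wrapping a local clock in a fixed assumed bound `EPSILON_S` (treated here as the maximum distance between the local clock and true time), `time.monotonic()` stands in for that local clock, and `commit_wait` is a hypothetical name.

```python
import time
from dataclasses import dataclass

EPSILON_S = 0.004  # assumed bound on clock error for this sketch (4 ms)


@dataclass(frozen=True)
class TTInterval:
    earliest: float
    latest: float


def tt_now() -> TTInterval:
    """Toy TrueTime: the true time is assumed to lie within EPSILON_S of the local clock."""
    t = time.monotonic()  # stand-in for a local clock
    return TTInterval(t - EPSILON_S, t + EPSILON_S)


def commit_wait() -> float:
    """Pick a commit timestamp, then wait until it is certainly in the past."""
    s = tt_now().latest              # no earlier than the true time right now
    while tt_now().earliest <= s:    # not yet certain that s has passed
        time.sleep(0.0005)
    return s


start = time.monotonic()
s = commit_wait()
elapsed = time.monotonic() - start
print(f"waited {elapsed * 1000:.1f} ms for a bound of {EPSILON_S * 1000:.0f} ms")
```

In this toy the wait is always a little over twice the bound, since the loop exits only once the local clock has advanced past the starting reading by more than 2 × `EPSILON_S`. A tighter bound directly buys lower commit latency, which is the reason for investing in the clock hardware.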

Pick the best fit

A multi-region key-value store needs to resolve concurrent writes to the same key without silently dropping a user's update. What ordering mechanism fits?

Quiz

A Cassandra cluster uses last-write-wins on client timestamps. One node's clock is 1.5s behind. A user's update routed through that node disappears after a refresh. Why did the client see no error?

Quiz

What does Spanner's commit-wait actually do, and what does it buy?

Order the steps

Order the chain of events that turns clock skew into silent data loss under last-write-wins:

1 One node's clock drifts (or NTP steps it) behind the rest of the fleet
2 A genuinely newer write is routed through that node and stamped with a past timestamp
3 Conflict resolution compares timestamps and keeps the existing (higher-timestamp) row
4 The newer write is discarded — but the client still receives a success response
5 Weeks later a customer reports the change vanished; no log or metric flagged it

The highest timestamp belongs to the fastest clock, not the latest write — so last-write-wins drops the real update with no error.

Recall before you leave

01
A teammate says 'we run NTP on every node, so last-write-wins by timestamp is safe.' Explain why that reasoning is wrong and what actually happens.
02
When would you reach for vector clocks over Lamport timestamps, and what does it cost you?

Recap

Across a fleet, wall-clock time is not an ordering primitive: quartz drift plus NTP steps (which can jump a clock backward) leave you with skew of milliseconds normally and seconds to minutes when something breaks. Any system that resolves conflicts by comparing client timestamps (last-write-wins in Cassandra, Riak, DynamoDB LWW mode) turns that skew into silent data loss, because the highest timestamp belongs to the fastest clock, not the latest write, and the client still gets a success.

Logical clocks fix ordering by measuring causality instead of physical time. Lamport timestamps give a cheap O(1) order that is consistent with causality but cannot detect concurrency, so they impose an arbitrary order on concurrent writes. Vector clocks carry one counter per node, detect concurrency, and let the store keep conflicting siblings, at O(n) metadata that grows with the cluster.

Google Spanner takes the third path: TrueTime exposes clock uncertainty as a bounded interval (ε on the order of milliseconds, via GPS and atomic clocks), and commit-wait deliberately sleeps out that window to deliver external consistency, converting a silent correctness bug into an explicit, bounded latency cost.

Now when you see a distributed store resolving conflicts by comparing timestamps (in config, in a design doc, or in a postmortem), your first question is: what happens when one node’s clock drifts? If the answer is “silent data loss with a success response,” you know which primitive to reach for.

Apply this

Put this lesson to work on a real build.

Collaborative cursors: Show every connected user's live cursor and selection in a shared document, conflict-free, over WebSocket.

Job scheduler: A cron + backoff job runner with at-least-once delivery, idempotent handlers, and visibility timeouts, so no job is silently lost even when workers crash mid-execution.