
Event sourcing: the append-only log as source of truth

Store the immutable stream of state-changing events, not the current state — current state is a left-fold over the log. You buy audit, time-travel, and replay; you pay in versioning, GDPR, snapshots, and eventual consistency.

A customer disputes a charge: “I never set my plan to annual.” Support pulls the row — plan: annual — and shrugs. But the accounts table only holds the current value; the UPDATE that set it overwrote whatever was there before, and the application logs rotated out two weeks ago. Nobody can say what the plan was on the 14th, who changed it, or whether a buggy webhook did it. In an event-sourced system this is a thirty-second query: replay the stream up to that timestamp and read the answer. The difference is not a better log — it’s that the log is the database.

State is a left-fold over events

A normal CRUD table stores the latest snapshot and destroys history on every UPDATE. Event sourcing inverts this: the append-only event store holds an ordered, immutable sequence of facts (AccountOpened, PlanChanged, CardDeclined), and the current state is just what you get by folding a reducer over them from the beginning: state = events.reduce(apply, initial). The events are the source of truth; the “current state” is a derived cache you can throw away and recompute at any time.

That single inversion is where every benefit and every cost comes from. You get a complete audit trail for free (the log is the audit). You get temporal queries: state as-of any past instant, by folding only the events up to that timestamp. You get debugging by replay: copy the production stream into staging, run it through the new code, and watch the exact bug reproduce. The reproduction is deterministic only if the reducer is a pure function of the events, with no clock reads, random values, or external calls inside it. None of this is bolted on; it falls out of never throwing data away.
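A minimal sketch of the fold and of an as-of query, assuming a hypothetical event shape (a type, a timestamp expressed as a day of the month, and a data payload) and a hand-written reducer. The field names and the sample log are illustrative, not taken from any particular event store.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Event:
    type: str
    at: int      # hypothetical timestamp: day of the month
    data: dict


def apply(state: dict, event: Event) -> dict:
    """Pure reducer: builds a new state and never mutates the old one."""
    if event.type == "AccountOpened":
        return {"plan": event.data["plan"], "card_ok": True}
    if event.type == "PlanChanged":
        return {**state, "plan": event.data["plan"], "changed_by": event.data["by"]}
    if event.type == "CardDeclined":
        return {**state, "card_ok": False}
    return state  # unknown event types leave the state untouched


def fold(events, until=None):
    """Current state when `until` is None, otherwise state as of that timestamp."""
    selected = [e for e in events if until is None or e.at <= until]
    return reduce(apply, selected, {})


log = [
    Event("AccountOpened", 1, {"plan": "monthly"}),
    Event("PlanChanged", 16, {"plan": "annual", "by": "webhook"}),
    Event("CardDeclined", 20, {}),
]

current = fold(log)               # folds all three events
as_of_14th = fold(log, until=14)  # folds only the events up to day 14
```

Here `as_of_14th` still shows the monthly plan, and `current` records both the new plan and who changed it, which is the answer support could not recover from the CRUD row.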

The append-only constraint is load-bearing. You never UPDATE or DELETE an event. A mistake is corrected by appending a compensating event (PlanCorrected), the same way an accountant never erases a ledger entry — they post a reversing one. This is why event stores optimize for one thing: fast appends and fast sequential reads of a stream.
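The ledger analogy can be sketched with a tiny in-memory store that exposes only append and read. The class and the event fields below are hypothetical; the point is that the mistaken PlanChanged event stays in the log and a PlanCorrected event is posted after it.

```python
class EventStore:
    """In-memory append-only log. There is deliberately no update or delete."""

    def __init__(self):
        self._events = []

    def append(self, event: dict) -> int:
        self._events.append(dict(event))  # store a copy of the caller's dict
        return len(self._events)          # the new stream version

    def read(self) -> tuple:
        # Hand out copies so callers cannot mutate the stored history.
        return tuple(dict(e) for e in self._events)


def current_plan(events):
    """Fold the stream down to the plan currently in effect."""
    plan = None
    for e in events:
        if e["type"] in ("AccountOpened", "PlanChanged", "PlanCorrected"):
            plan = e["plan"]
    return plan


store = EventStore()
store.append({"type": "AccountOpened", "plan": "monthly"})
store.append({"type": "PlanChanged", "plan": "annual"})  # the mistake
store.append({"type": "PlanCorrected", "plan": "monthly", "reason": "buggy webhook"})

history = store.read()
plan_now = current_plan(history)
```

The state is corrected, yet the audit trail still shows that the account was briefly on the annual plan and why it was reversed.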

Event sourcing is not “Kafka” and not a change log

This is the distinction seniors get wrong most often. A change-data-capture log or a plain Kafka topic records events, but that alone is not event sourcing. The defining property is that the log is the authoritative source of truth and state is rebuilt from it — not a side-channel of changes emitted after a database already committed the truth elsewhere.

Kafka can serve as an event store, but with sharp caveats. Log compaction (cleanup.policy=compact) keeps only the latest value per key, which directly destroys the history event sourcing depends on; you must use delete-policy topics with retention disabled, that is cleanup.policy=delete with the topic-level retention.ms=-1 (log.retention.ms is the broker-wide default) and retention.bytes left at -1, not compacted ones. Per-aggregate optimistic concurrency (“append only if I’m at version N”) has no clean primitive in Kafka, whereas a purpose-built store like EventStoreDB makes expectedVersion a first-class append condition. The common production pattern: keep the raw event topic forever as the source, and publish derived current-state to a separate compacted topic for read models. The compacted topic is a projection, never the truth.
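To make “append only if I’m at version N” concrete, here is a minimal in-memory sketch. It is not the EventStoreDB client API; the class, the method names, and the `expected_version` parameter are assumptions chosen to mirror the idea. A real store would have to perform the check and the append atomically.

```python
class ConcurrencyError(Exception):
    """Raised when the stream moved on since the writer last read it."""


class StreamStore:
    def __init__(self):
        self._streams = {}  # stream_id -> list of events

    def version(self, stream_id: str) -> int:
        return len(self._streams.get(stream_id, []))

    def append(self, stream_id: str, event: dict, expected_version: int) -> int:
        stream = self._streams.setdefault(stream_id, [])
        if len(stream) != expected_version:
            raise ConcurrencyError(
                f"{stream_id}: expected version {expected_version}, found {len(stream)}"
            )
        stream.append(event)
        return len(stream)


store = StreamStore()
store.append("account-42", {"type": "AccountOpened"}, expected_version=0)

# Two writers read the stream at the same version, then both try to append.
seen_by_a = store.version("account-42")
seen_by_b = store.version("account-42")

store.append("account-42", {"type": "PlanChanged", "plan": "annual"},
             expected_version=seen_by_a)
try:
    store.append("account-42", {"type": "PlanChanged", "plan": "monthly"},
                 expected_version=seen_by_b)
    conflict = False
except ConcurrencyError:
    conflict = True  # writer B must reload, re-validate, and retry
```

The losing writer is rejected rather than silently interleaved, which is the guarantee a plain topic append does not give you per aggregate.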

Property	CRUD table	Change log / CDC	Event sourcing
Source of truth	Current row	The DB it tails	The event log itself
History	Lost on UPDATE	Often time-limited	Complete, forever
State as-of past time	Impossible	Hard / partial	Fold up to timestamp
Rebuild a new view	Backfill scripts	Limited by retention	Replay the whole log

CQRS: projections are disposable read models

You cannot serve a query like “show the dashboard” by folding the entire log on every request — that would be ruinously slow. So event sourcing almost always pairs with CQRS: the write side appends events; the read side runs projections that consume the stream and materialize purpose-built read models (a SQL table, an Elasticsearch index, a denormalized cache). Each projection is a tiny program: for every event, update its own table. Because projections are derived, they are disposable — drop the table, replay the log, get it back. Need a brand-new view six months in? Write a projection and replay history through it; the data was always there.
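A sketch of why projections are disposable, assuming a hypothetical log of account events with `account` and `plan` fields. Both functions are illustrative read models: the first backs a dashboard, the second is a view invented later and filled by replaying the same history.

```python
from collections import Counter


def project_plan_counts(events) -> dict:
    """Read model 1: how many accounts are currently on each plan."""
    plan_of = {}
    for e in events:
        if e["type"] in ("AccountOpened", "PlanChanged"):
            plan_of[e["account"]] = e["plan"]
    return dict(Counter(plan_of.values()))


def project_change_counts(events) -> dict:
    """Read model 2, added later: number of plan changes per account."""
    changes = Counter()
    for e in events:
        if e["type"] == "PlanChanged":
            changes[e["account"]] += 1
    return dict(changes)


log = [
    {"type": "AccountOpened", "account": "a1", "plan": "monthly"},
    {"type": "AccountOpened", "account": "a2", "plan": "monthly"},
    {"type": "PlanChanged", "account": "a1", "plan": "annual"},
]

dashboard = project_plan_counts(log)
dashboard = None                          # drop the read model entirely
rebuilt = project_plan_counts(log)        # replay the log to get it back
new_view = project_change_counts(log)     # a brand-new view over old history
```

Nothing was migrated or backfilled to produce `new_view`; the log already held everything the new projection needed.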

Two properties make this safe in production. First, projections must be idempotent: the same event may be delivered more than once (retries, at-least-once delivery), so applying it twice must equal applying it once. The standard mechanism is to track the last processed event version per stream and ignore any event whose version is <= the last seen. Second, the read side is eventually consistent: there is real lag between an event being appended and the projection catching up. EventStoreDB and similar stores document this read-model lag explicitly — the read model “converges to the correct state given time,” it is not guaranteed current at any instant.

Together these two properties mean you can safely replay, retry, and rebuild without corrupting the read model — but you must design each screen around the lag, because it is always there.
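The version-tracking mechanism can be sketched as follows. The projection counts plan changes per stream, an update that would double-count under redelivery without the guard. The event fields (`stream`, `version`, `type`) and the class are hypothetical, and the sketch assumes events for one stream arrive in version order apart from duplicates.

```python
class PlanChangeCounter:
    """Projection that stays correct under at-least-once delivery."""

    def __init__(self):
        self.changes = {}       # stream_id -> number of plan changes (the read model)
        self.last_version = {}  # stream_id -> last event version applied

    def handle(self, event: dict) -> bool:
        stream = event["stream"]
        if event["version"] <= self.last_version.get(stream, 0):
            return False  # duplicate or stale delivery: ignore it
        if event["type"] == "PlanChanged":
            self.changes[stream] = self.changes.get(stream, 0) + 1
        self.last_version[stream] = event["version"]
        return True


e1 = {"stream": "account-42", "version": 1, "type": "AccountOpened"}
e2 = {"stream": "account-42", "version": 2, "type": "PlanChanged"}
e3 = {"stream": "account-42", "version": 3, "type": "PlanChanged"}

projection = PlanChangeCounter()
# e2 is redelivered and e1 shows up again after a retry.
outcomes = [projection.handle(e) for e in (e1, e2, e2, e1, e3)]
```

In a real system the read-model update and the stored version would typically be committed in the same transaction, so a crash between the two cannot leave them out of step.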

The append-only log (Event 1 → Event 2 → Event 3) is the source of truth. Current state is a left-fold over it; a read model is a separate projection over the same log. Both are derived — drop them and replay the events to rebuild.

Why this works

That lag is a UX problem, not just an infra one. A user clicks “Save”, you append the event, then redirect to a list rendered from the projection — which hasn’t caught up, so their change isn’t there. The naive fix (poll until it appears) leaks the architecture to the user. The real fix is to return the new state optimistically from the command result, or read-your-own-writes from the write side for that one screen, and let the projection settle behind the scenes.

When you first adopt event sourcing the benefits feel obvious; the costs only bite you later, when a schema changes, a user asks to be forgotten, or a stream grows to millions of events. This is where event sourcing earns its reputation for being hard to operate.

Schema versioning is unavoidable and permanent: you can never drop support for an old event shape, because events written in that shape live in the log forever and must still be replayable. When OrderPlaced gains a field, every historical OrderPlaced lacks it. The standard answer is upcasting: a chain of transformation functions that lift an old event shape to the current shape at deserialization time, so the rest of your code only ever sees the latest version. Upcasters accrete; they are code you carry indefinitely.
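A minimal sketch of an upcaster chain, assuming each stored event carries a hypothetical `schema` number. The two schema changes shown (adding `currency`, then renaming `amount` to `amount_cents`) and the default value are invented for illustration.

```python
def order_placed_v1_to_v2(e: dict) -> dict:
    # v2 added `currency`; historical events get an assumed default.
    return {**e, "schema": 2, "currency": "USD"}


def order_placed_v2_to_v3(e: dict) -> dict:
    # v3 replaced `amount` (decimal units) with `amount_cents` (integer).
    rest = {k: v for k, v in e.items() if k != "amount"}
    return {**rest, "schema": 3, "amount_cents": int(round(e["amount"] * 100))}


# (event type, schema it upgrades from) -> upcaster
UPCASTERS = {
    ("OrderPlaced", 1): order_placed_v1_to_v2,
    ("OrderPlaced", 2): order_placed_v2_to_v3,
}


def upcast(event: dict) -> dict:
    """Lift an event to the latest shape at read time; the stored event is untouched."""
    while (event["type"], event["schema"]) in UPCASTERS:
        event = UPCASTERS[(event["type"], event["schema"])](event)
    return event


stored = {"type": "OrderPlaced", "schema": 1, "order_id": "o-1", "amount": 12.5}
latest = upcast(stored)
```

Each new schema version adds one more function to the chain, which is what “upcasters accrete” means in practice: the v1 upcaster can never be retired while v1 events remain in the log.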

GDPR’s “right to be forgotten” collides head-on with an immutable log. You legally must erase a user’s personal data, but the log is append-only and you cannot rewrite history without breaking every downstream replay and audit guarantee. The dominant technique is crypto-shredding: encrypt each user’s PII with a per-user key stored outside the log; to “forget” them, throw away the key, rendering the ciphertext permanently unrecoverable while the event structure stays intact. Note the sharp caveat seniors must flag: regulators may still treat undeletable encrypted PII as personal data, so crypto-shredding is a pragmatic mitigation, not a guaranteed legal slam-dunk — get counsel involved.
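The mechanics of crypto-shredding can be sketched as below. The cipher here is a toy (XOR against a SHAKE-256 keystream) used only so the example runs on the standard library; it is an assumption of this sketch and not suitable for production, where a vetted authenticated cipher and a real key-management service would take its place. `KeyVault`, `seal`, and `unseal` are hypothetical names.

```python
import hashlib
import secrets


def _toy_cipher(key: bytes, nonce: bytes, data: bytes) -> bytes:
    # TOY ONLY: XOR with a SHAKE-256 keystream. Stand-in for a real cipher.
    stream = hashlib.shake_256(key + nonce).digest(len(data))
    return bytes(a ^ b for a, b in zip(data, stream))


class KeyVault:
    """Per-user keys, held outside the event log."""

    def __init__(self):
        self._keys = {}

    def key_for(self, user_id: str) -> bytes:
        if user_id not in self._keys:
            self._keys[user_id] = secrets.token_bytes(32)
        return self._keys[user_id]

    def get(self, user_id: str):
        return self._keys.get(user_id)

    def shred(self, user_id: str) -> None:
        self._keys.pop(user_id, None)  # the "forget" operation


def seal(vault: KeyVault, user_id: str, plaintext: str) -> dict:
    nonce = secrets.token_bytes(16)
    ct = _toy_cipher(vault.key_for(user_id), nonce, plaintext.encode())
    return {"nonce": nonce.hex(), "ct": ct.hex()}


def unseal(vault: KeyVault, user_id: str, sealed: dict):
    key = vault.get(user_id)
    if key is None:
        return None  # key was shredded: the PII is unrecoverable
    raw = _toy_cipher(key, bytes.fromhex(sealed["nonce"]), bytes.fromhex(sealed["ct"]))
    return raw.decode()


vault = KeyVault()
log = [{"type": "AccountOpened", "user": "u1",
        "email": seal(vault, "u1", "ada@example.com")}]

before = unseal(vault, "u1", log[0]["email"])
vault.shred("u1")                              # erasure request arrives
after = unseal(vault, "u1", log[0]["email"])
```

The log is never rewritten: the event, its type, and its position all survive, and only the ability to read the email field is gone.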

Finally, replay cost is unbounded if you do nothing. Folding millions of events to load one aggregate gets slow, and a full rebuild over a very large history (terabytes of price ticks in a financial system, say) can take far longer than a request can wait. The fix is snapshotting: periodically persist the folded state at version N, then on load read the snapshot and replay only events after N. Snapshots are a pure performance optimization, and also a footgun: if a snapshot drifts from the events (the reducer logic changed, or a write got lost mid-stream), it silently serves wrong state and masks the bug, because loads start from the snapshot and never re-derive the state it covers. Defensive teams checksum snapshots and rebuild on mismatch.
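One way to sketch snapshot-plus-tail loading with a drift check, under stated assumptions: the snapshot records a hash of its state, and an out-of-band audit re-derives the state at version N and compares. The balance reducer, the field names, and the audit function are hypothetical; this is one reading of “checksum snapshots”, not the only one.

```python
import hashlib
import json

INITIAL = {"balance": 0}


def apply(state: dict, event: dict) -> dict:
    return {"balance": state["balance"] + event["amount"]}


def fold(events, state=INITIAL) -> dict:
    for e in events:
        state = apply(state, e)
    return state


def state_hash(state: dict) -> str:
    return hashlib.sha256(json.dumps(state, sort_keys=True).encode()).hexdigest()


def take_snapshot(events, version: int) -> dict:
    state = fold(events[:version])
    return {"version": version, "state": state, "state_hash": state_hash(state)}


def load(events, snapshot=None) -> dict:
    """Snapshot + tail when a snapshot is given, full replay otherwise."""
    if snapshot is None:
        return fold(events)
    return fold(events[snapshot["version"]:], snapshot["state"])


def snapshot_is_valid(events, snapshot) -> bool:
    """Audit job: recompute the state at version N and compare all three hashes."""
    fresh = state_hash(fold(events[:snapshot["version"]]))
    return fresh == snapshot["state_hash"] == state_hash(snapshot["state"])


events = [{"amount": 100}, {"amount": -30}, {"amount": 5}, {"amount": 20}]
snap = take_snapshot(events, version=3)

fast = load(events, snap)                         # replays only the last event
drifted = {**snap, "state": {"balance": 999}}     # simulate a corrupted snapshot
wrong = load(events, drifted)                     # silently wrong: the footgun
safe = load(events, drifted if snapshot_is_valid(events, drifted) else None)
```

The audit costs a full replay, so it belongs in a background job rather than on the load path; on mismatch the snapshot is discarded and rebuilt from the events.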

Pick the best fit

A user exercises GDPR erasure. Their PII is embedded across hundreds of immutable events in the append-only store. Pick the approach a senior actually ships.

Quiz

Loading one aggregate now folds 4 million events and takes too long. What's the senior fix?

Quiz

Why must a projection that builds a read model be idempotent?

Order the steps

Order the lifecycle of a write in an event-sourced + CQRS system:

1 Command arrives; load the aggregate by folding its events (or snapshot + tail)
2 Validate the command against the current folded state
3 Append the resulting event(s) to the stream with an expected-version check
4 Projections consume the new event asynchronously and update read models
5 Queries read the (eventually consistent) read model, not the raw log
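The five steps above can be sketched end to end in a few dozen lines. Everything here is a hypothetical in-memory stand-in: the store, the outbox queue that simulates asynchronous delivery, and the `change_plan` command handler.

```python
from collections import deque


class ConcurrencyError(Exception):
    pass


class EventStore:
    def __init__(self):
        self.streams = {}
        self.outbox = deque()  # events not yet consumed by projections

    def read(self, stream_id):
        return list(self.streams.get(stream_id, []))

    def append(self, stream_id, event, expected_version):
        stream = self.streams.setdefault(stream_id, [])
        if len(stream) != expected_version:
            raise ConcurrencyError(stream_id)
        stream.append(event)
        self.outbox.append((stream_id, event))


def apply(state, event):
    if event["type"] in ("AccountOpened", "PlanChanged"):
        return {"plan": event["plan"]}
    return state


def fold(events):
    state = None
    for e in events:
        state = apply(state, e)
    return state


def change_plan(store, stream_id, new_plan):
    events = store.read(stream_id)               # step 1: load by folding
    state = fold(events)
    if state is None:                            # step 2: validate
        raise ValueError("account does not exist")
    if state["plan"] == new_plan:
        raise ValueError("plan unchanged")
    event = {"type": "PlanChanged", "plan": new_plan}
    store.append(stream_id, event, expected_version=len(events))  # step 3
    return apply(state, event)  # new state handed back to the caller


read_model = {}


def run_projections(store):                      # step 4: runs later, asynchronously
    while store.outbox:
        stream_id, event = store.outbox.popleft()
        if event["type"] in ("AccountOpened", "PlanChanged"):
            read_model[stream_id] = event["plan"]


def query_plan(stream_id):                       # step 5: reads the read model only
    return read_model.get(stream_id)


store = EventStore()
store.append("account-42", {"type": "AccountOpened", "plan": "monthly"}, expected_version=0)
run_projections(store)

result = change_plan(store, "account-42", "annual")
stale = query_plan("account-42")   # projection has not caught up yet
run_projections(store)
fresh = query_plan("account-42")
```

The gap between `stale` and `fresh` is the read-model lag, and `result` is the command returning the new state directly so the caller does not have to wait for the projection.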

Recall before you leave

01 A colleague says “we already publish events to Kafka, so we’re event-sourced.” Explain why that may be false and what would actually make it event sourcing.
02 Walk through why schema versioning in event sourcing is permanent, and how upcasting handles it without rewriting history.

Recap

Event sourcing stores the immutable, append-only stream of state-changing events as the source of truth, and derives current state as a left-fold over that log; the “current state” is a disposable cache you can recompute at will. That single inversion buys a complete audit trail, temporal queries (state as-of any past instant), and deterministic debugging by replay — none bolted on, all falling out of never destroying data. It is distinct from a CDC change log or a plain Kafka topic, where a database elsewhere holds the truth; in true event sourcing the log is authoritative, and Kafka can only serve if you avoid compaction and keep infinite retention. CQRS pairs naturally: projections consume the stream into purpose-built, disposable read models that must be idempotent and are only eventually consistent, so reads lag the write side. The hard parts are permanent: you can never delete an old event shape, so schema changes are handled by upcasting; GDPR erasure against an immutable log is handled by crypto-shredding (with a real legal caveat); and unbounded replay cost is bounded by snapshotting, which itself becomes a footgun if a snapshot ever drifts from the events it claims to summarize. Now when you hear a teammate say “our Kafka topic is our source of truth,” you know exactly which question to ask first: can you drop the database and rebuild purely from that topic?
