Engineering Practice ENG · 01 · 02

Test doubles: London vs Detroit, and the over-mocking trap

Mockist (London) tests assert interactions; classicist (Detroit) tests assert state with real collaborators. Mock the boundaries — network, clock, payment — and use real objects inside. Over-mocking couples tests to call structure: they break on refactor and stay green on bugs.

ENG Middle ◷ 16 min


A team mocks everything. Every collaborator in every unit test is a mock with expect(...).toHaveBeenCalledWith(...). Their suite is 2,400 tests, all green, 91% coverage. Then a refactor that splits one service into two — pure internal reorganization, identical behavior — turns 600 tests red in a single afternoon. None of them caught a bug; they caught the fact that the internal calls moved. Worse, a month earlier a real pricing bug shipped to production with the suite fully green, because the mocks returned canned values and nobody had asserted the actual computed total. The suite was measuring the wrong thing the whole time.

Five kinds of double, and the one that bites

“Mock” gets used as a catch-all, but xUnit Test Patterns names five distinct test doubles, and the distinction decides whether your test is robust or brittle. A dummy is a placeholder passed but never used. A stub returns canned answers to calls — it feeds state in. A fake is a working lightweight implementation (an in-memory repository standing in for Postgres). A spy records how it was called so you can inspect it after. A mock is pre-programmed with expectations: it asserts that specific calls happened, in a specific way, and fails the test if they didn’t.
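A minimal hand-rolled sketch of the five doubles, with no mocking library involved. Every name here (`placeOrder`, `RateSource`, `OrderRepo`, `Mailer`, `Logger`, and the double classes) is hypothetical and exists only to illustrate the vocabulary above.

```typescript
// Hypothetical collaborators, invented for this illustration.
interface Logger { log(message: string): void; }
interface RateSource { rateFor(currency: string): number; }
interface OrderRepo { save(id: string, total: number): void; find(id: string): number | undefined; }
interface Mailer { send(to: string, subject: string): void; }

// Code under test: the logger is only touched on the error path.
function placeOrder(
  id: string, amount: number, currency: string,
  rates: RateSource, repo: OrderRepo, mailer: Mailer, logger: Logger,
): number {
  if (amount <= 0) {
    logger.log(`rejected order ${id}`);
    throw new Error("amount must be positive");
  }
  const total = amount * rates.rateFor(currency);
  repo.save(id, total);
  mailer.send("buyer@example.com", `Order ${id}`);
  return total;
}

// Dummy: fills a parameter slot; the happy path never uses it.
const dummyLogger: Logger = { log: () => { throw new Error("dummy was used"); } };

// Stub: returns a canned answer, feeding state into the code under test.
const stubRates: RateSource = { rateFor: () => 2 };

// Fake: a working lightweight implementation (in memory instead of a database).
class InMemoryOrderRepo implements OrderRepo {
  private rows = new Map<string, number>();
  save(id: string, total: number): void { this.rows.set(id, total); }
  find(id: string): number | undefined { return this.rows.get(id); }
}

// Spy: records how it was called so the test can inspect it afterwards.
class SpyMailer implements Mailer {
  readonly sent: Array<{ to: string; subject: string }> = [];
  send(to: string, subject: string): void { this.sent.push({ to, subject }); }
}

// Mock: carries its own expectation; verify() fails the test if it was not met.
class MockMailer implements Mailer {
  private recipients: string[] = [];
  constructor(private readonly expectedTo: string) {}
  send(to: string, _subject: string): void { this.recipients.push(to); }
  verify(): void {
    if (this.recipients.length !== 1 || this.recipients[0] !== this.expectedTo) {
      throw new Error(`expected exactly one send to ${this.expectedTo}`);
    }
  }
}

// State-oriented use: stub feeds input, fake and spy are inspected afterwards.
const repo = new InMemoryOrderRepo();
const spy = new SpyMailer();
const total = placeOrder("o-1", 10, "EUR", stubRates, repo, spy, dummyLogger);

// Interaction-oriented use: the mock itself decides pass or fail.
const mock = new MockMailer("buyer@example.com");
placeOrder("o-2", 5, "EUR", stubRates, new InMemoryOrderRepo(), mock, dummyLogger);
mock.verify();
```

Only `MockMailer` makes a claim about how `placeOrder` talks to its collaborators; the other four would keep working if the function reached the same result a different way.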

The first four feed inputs or observe outputs; the mock is the one that asserts interactions, and that is where brittleness comes from. A mock bakes the production code’s call structure into the test’s pass/fail condition. The moment you change how the code achieves a result, even with identical observable behavior, the mock’s expectations no longer match and the test fails. Stubs and fakes don’t do this; they just supply data and let you assert the final state. A spy inherits the same coupling as soon as the test asserts on the calls it recorded. Reaching for a mock when a stub would do is the single most common way teams manufacture fragile tests, and the brittleness accumulates invisibly until a harmless refactor turns hundreds of tests red.
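To make the brittleness concrete, here is a small sketch under invented names (`TaxTable`, `totalV1`, `totalV2`). The two functions compute the same total; the second is a behavior-preserving refactor that looks the rate up once instead of once per item. An interaction-style check pins the call count and breaks; a stub-style check asserts the result and survives.

```typescript
// Hypothetical collaborator, invented for this illustration.
interface TaxTable { rateFor(region: string): number; }

// Version 1: looks the rate up once per item.
function totalV1(prices: number[], region: string, tax: TaxTable): number {
  return prices.reduce((sum, p) => sum + p * (1 + tax.rateFor(region)), 0);
}

// Version 2: same observable behavior, but the rate is looked up once.
function totalV2(prices: number[], region: string, tax: TaxTable): number {
  const rate = tax.rateFor(region);
  return prices.reduce((sum, p) => sum + p, 0) * (1 + rate);
}

type TotalFn = (prices: number[], region: string, tax: TaxTable) => number;

// Double that counts calls so a test can assert on the interaction.
class CountingTaxTable implements TaxTable {
  calls = 0;
  rateFor(_region: string): number {
    this.calls += 1;
    return 0.25;
  }
}

// Mock-style check: passes only if rateFor was called once per item.
function interactionTest(totalFn: TotalFn): boolean {
  const tax = new CountingTaxTable();
  totalFn([100, 200], "EU", tax);
  return tax.calls === 2;
}

// Stub-style check: the double only supplies data; the assertion is on the result.
function stateTest(totalFn: TotalFn): boolean {
  const stub: TaxTable = { rateFor: () => 0.25 };
  return totalFn([100, 200], "EU", stub) === 375;
}

const results = {
  interactionV1: interactionTest(totalV1), // true
  interactionV2: interactionTest(totalV2), // false: same behavior, different calls
  stateV1: stateTest(totalV1),             // true
  stateV2: stateTest(totalV2),             // true
};
```

The red result for `interactionV2` reports no defect. It reports only that the call structure moved.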

London (mockist) vs Detroit (classicist)

The two schools are a real, named disagreement. The London / mockist school works outside-in: a unit is one class, you mock all its collaborators, and you test by asserting the interactions between objects — the right messages were sent in the right order. The Detroit / classicist (also Chicago) school works inside-out: a unit is a behavior that may span several real objects, you use real collaborators wherever you can, mock only what you must, and assert final state rather than calls.

The practical consequence is what survives a refactor. Classicist tests assert outcomes, so they afford ruthless refactoring — you can restructure internals freely and the test stays green as long as the result is right. Mockist tests assert call structure, so they pin the implementation: they catch design intent precisely but break whenever the internal collaboration changes. London’s strength is fast outside-in design feedback and tiny isolated units; its failure mode is exactly the 600-red-tests afternoon — a suite so coupled to structure that refactoring becomes prohibitively expensive.

Aspect	London / mockist	Detroit / classicist
Unit =	A single class	A behavior across real collaborators
Collaborators	Mocked	Real wherever possible
Asserts	Interactions (which calls happened)	Final state / output
Survives refactor?	Often breaks — pinned to call structure	Yes — only breaks on behavior change
Failure mode	Brittle suite, costly refactors	Harder to localize a failure
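The two columns can be sketched as two tests of the same unit. The `OrderService`, `Inventory`, `Ledger`, `StockRoom`, and `AccountLedger` names are hypothetical, and the doubles are written by hand rather than with a mocking library. This is one possible shape for each style, not a canonical one.

```typescript
// Hypothetical collaborators, invented for this illustration.
interface Inventory { reserve(sku: string, qty: number): void; }
interface Ledger { charge(customer: string, amount: number): void; }

class OrderService {
  constructor(private readonly inventory: Inventory, private readonly ledger: Ledger) {}
  place(customer: string, sku: string, qty: number, unitPrice: number): void {
    this.inventory.reserve(sku, qty);
    this.ledger.charge(customer, qty * unitPrice);
  }
}

// London style: every collaborator is a double; the assertion is on
// which messages were sent, and in which order.
function londonTest(): boolean {
  const messages: string[] = [];
  const inventory: Inventory = {
    reserve: (sku, qty) => { messages.push(`reserve ${sku} ${qty}`); },
  };
  const ledger: Ledger = {
    charge: (customer, amount) => { messages.push(`charge ${customer} ${amount}`); },
  };
  new OrderService(inventory, ledger).place("alice", "sku-1", 2, 20);
  return messages.join("; ") === "reserve sku-1 2; charge alice 40";
}

// Real collaborators: fast, deterministic, owned by the team.
class StockRoom implements Inventory {
  private stock = new Map<string, number>([["sku-1", 10]]);
  reserve(sku: string, qty: number): void {
    const left = (this.stock.get(sku) ?? 0) - qty;
    if (left < 0) throw new Error("out of stock");
    this.stock.set(sku, left);
  }
  available(sku: string): number { return this.stock.get(sku) ?? 0; }
}

class AccountLedger implements Ledger {
  private owed = new Map<string, number>();
  charge(customer: string, amount: number): void {
    this.owed.set(customer, (this.owed.get(customer) ?? 0) + amount);
  }
  balance(customer: string): number { return this.owed.get(customer) ?? 0; }
}

// Detroit style: real collaborators; the assertion is on final state.
function detroitTest(): boolean {
  const stock = new StockRoom();
  const ledger = new AccountLedger();
  new OrderService(stock, ledger).place("alice", "sku-1", 2, 20);
  return stock.available("sku-1") === 8 && ledger.balance("alice") === 40;
}
```

Both tests pass today. If `place` later charged before reserving, only `londonTest` would go red, which is useful when the order is a requirement and noise when it is an implementation detail.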

The boundary rule resolves it

The senior synthesis isn’t “pick a school.” It’s a boundary rule: mock what you can’t run, use real objects for what you can. Mock the things that are slow, nondeterministic, or have side effects you can’t undo — the network, the clock, the payment processor, the email sender, the third-party API. For collaborators you own and that run fast and pure, use the real thing or a fake; asserting final state over real internal objects gives you a test that breaks only when behavior breaks. This is mostly the classicist position with London’s discipline applied at the system edges, and it dissolves the over-mocking trap: you never mock a value object or a pure helper, because there is nothing to isolate from.
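A sketch of the boundary rule, borrowing the `PriceCalculator`, `TaxRule`, and `CurrencyApi` names from this lesson's exercise. Everything else is an assumption made for illustration: the method names, the `Clock` interface, the `Quote` shape, and above all the synchronous `rate` call, which stands in for what would be an asynchronous HTTP request in real code.

```typescript
// Boundary: third-party HTTP in production.
// Assumption: simplified to a synchronous call to keep the sketch short.
interface CurrencyApi { rate(from: string, to: string): number; }

// Boundary: the system clock is nondeterministic, so it gets a double too.
interface Clock { now(): Date; }

// Owned, pure, and fast: use the real object, never a double.
class TaxRule {
  constructor(private readonly rate: number) {}
  apply(net: number): number { return net * (1 + this.rate); }
}

interface Quote { total: number; currency: string; quotedAt: string; }

class PriceCalculator {
  constructor(
    private readonly tax: TaxRule,
    private readonly fx: CurrencyApi,
    private readonly clock: Clock,
  ) {}

  quote(netUsd: number, currency: string): Quote {
    const gross = this.tax.apply(netUsd);
    return {
      total: gross * this.fx.rate("USD", currency),
      currency,
      quotedAt: this.clock.now().toISOString(),
    };
  }
}

// Doubles only at the edges: a stubbed exchange rate and a fixed clock.
const stubFx: CurrencyApi = { rate: () => 2 };
const fixedClock: Clock = { now: () => new Date("2024-01-01T00:00:00.000Z") };

// The real TaxRule takes part in the computation, so the assertion on
// quote.total checks actual arithmetic rather than a canned number.
const quote = new PriceCalculator(new TaxRule(0.25), stubFx, fixedClock).quote(100, "EUR");
// quote.total is 100 * 1.25 * 2 = 250
```

If `TaxRule` were split into two classes tomorrow, this test would need no changes beyond the constructor call, because nothing in it asserts how the tax was computed.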

The decisive question for any double is: if I refactor internals without changing behavior, will this test fail? For code you control, it should not. If it will, you are using a mock where a stub or a real object belongs. That is over-mocking, and it is precisely what turns a green suite into 600 red tests that caught no bug, while the real pricing error slipped through because the mocks returned canned numbers no one ever checked against a real computation.

Why this works

The reason over-mocking feels productive is that it makes every test fast, isolated, and trivially deterministic — you control every input. But that control is the trap: a test where you stub every collaborator is, in the limit, a test that the code calls the methods you told it to call. It can’t catch an integration bug between two real objects, and it can’t catch a wrong result if you stubbed the result. The mocks turn the test into a mirror of the implementation, so it reflects every structural change as a failure and reflects none of the behavioral truth. Real collaborators are slower and harder to set up, but they are the only way the test can disagree with the code.
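The "mirror" effect can be shown with a deliberately planted bug. In this hypothetical sketch (`checkoutTotal`, `DiscountPolicy`, and `TenPercentOff` are invented names), the caller expects a discount amount but the real collaborator returns a percentage. The over-mocked test stays green because it supplies its own canned value; the test with the real collaborator is the only one that can disagree with the code.

```typescript
// Hypothetical collaborator, invented for this illustration.
interface DiscountPolicy { discountFor(subtotal: number): number; }

// The caller assumes discountFor returns an amount of money.
function checkoutTotal(subtotal: number, policy: DiscountPolicy): number {
  return subtotal - policy.discountFor(subtotal);
}

// Real collaborator with a deliberate bug: it returns the percentage (10)
// instead of the amount (10% of the subtotal).
class TenPercentOff implements DiscountPolicy {
  discountFor(_subtotal: number): number {
    return 10;
  }
}

// Over-mocked test: a canned return value plus an interaction check.
// It never runs the real policy, so the bug is invisible to it.
function overMockedTest(): boolean {
  const seen: number[] = [];
  const double: DiscountPolicy = {
    discountFor: (subtotal) => {
      seen.push(subtotal);
      return 20; // canned: what the test author assumed the policy returns
    },
  };
  const total = checkoutTotal(200, double);
  return seen.length === 1 && seen[0] === 200 && total === 180;
}

// State test with the real collaborator: 10% off 200 should be 180.
function realCollaboratorTest(): boolean {
  return checkoutTotal(200, new TenPercentOff()) === 180;
}

const overMockedPasses = overMockedTest();     // true: green despite the bug
const realTestPasses = realCollaboratorTest(); // false: 190 !== 180, bug caught
```

The failing result is the valuable one here: it is the test reporting a real disagreement between two objects that each look correct in isolation.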

Pick the best fit

You're testing a PriceCalculator that uses a TaxRule object (pure, owned by you) and a CurrencyApi (third-party HTTP). How should you double each?

Quiz

What distinguishes a mock from a stub, and why does it matter for brittleness?

Quiz

A pure-refactor split of one service into two turns 600 mock-heavy tests red, none catching a bug. What's the root cause?

Order the steps

Order how a green-but-useless over-mocked suite degrades:

1. Every collaborator is mocked, including pure objects you own
2. Tests assert which calls happened, not the final result
3. Stubbed return values mean the real computation is never checked
4. A real bug ships to prod with the suite fully green
5. A behavior-preserving refactor turns hundreds of tests red for no bug

London (mockist) mocks all collaborators and asserts interactions — pinned to call structure. Detroit (classicist) uses real collaborators and asserts final state — survives refactors. The boundary rule synthesizes both: mock the network/clock/payment boundary, use real objects for code you own.

Recall before you leave

01
Explain London vs Detroit TDD and how the boundary rule resolves the disagreement.
02
How can an over-mocked suite be fully green at 91% coverage and still let a real bug ship while breaking on a no-op refactor?

Recap

“Mock” is a catch-all, but the five test doubles differ in ways that decide robustness: dummies, stubs, fakes, and spies feed inputs or observe outputs, while a mock asserts that specific interactions happened — and that assertion on calls is the source of brittleness. The London (mockist) school treats a unit as one class, mocks every collaborator, and asserts interactions; the Detroit/Chicago (classicist) school treats a unit as a behavior across real objects, mocks only what it must, and asserts final state. Classicist tests survive refactors because they check outcomes; mockist tests pin the implementation and break when collaboration changes, which is how a behavior-preserving refactor turns 600 tests red with no bug among them while a real pricing error ships green behind canned stub values. The senior resolution is the boundary rule: mock what you can’t run — network, clock, payment, email, third-party APIs — and use real objects or fakes for fast, pure code you own, asserting final state. Now when you see a wave of red tests after a pure refactor, ask whether the failures are detecting broken behavior or broken call structure — the answer tells you exactly which doubles to replace.


