Engineering Practice
Red-green-refactor is a design loop, not a testing ritual
Priya is asked to add a “cancel order” feature. She writes the test first and immediately hits a wall: to assert the outcome she has to construct an OrderService that needs a database connection, a payment gateway, an email client, and a feature-flag service. That is four collaborators just to test one boolean rule, and the test cannot be written cleanly. A teammate writing the tests afterward would have noticed the same problem only after wiring cancel() into a controller, two jobs, and an admin script. The unwritable test is the design review: OrderService is doing too much, and TDD said so before a single line of production code for the feature existed.
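A minimal sketch of that friction, and of one direction it might push the design. Everything named here is hypothetical: the `Order` type, the `can_cancel` function, the status values, and the rule itself are assumptions for illustration, since the scenario only says the rule is a single boolean.

```python
from dataclasses import dataclass

# What the first attempt demands of the test (hypothetical constructor):
#   service = OrderService(db, payment_gateway, email_client, feature_flags)
#   assert service.cancel(order_id) is True
# Four collaborators must be built before one boolean rule can be checked.


@dataclass(frozen=True)
class Order:
    status: str  # assumed values: "placed", "shipped", "cancelled"


def can_cancel(order: Order) -> bool:
    """Assumed rule for illustration: only orders not yet shipped can be cancelled."""
    return order.status == "placed"


# With the rule pulled out, the test needs no database, gateway, email client, or flags.
assert can_cancel(Order(status="placed")) is True
assert can_cancel(Order(status="shipped")) is False
```

Extracting the rule is only one possible response to the friction. The point is that the test made the cost of the original shape visible before anything depended on it.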
The first step is design pressure, not coverage
The loop is deliberately tiny: write the smallest failing test (red), write the least code that makes it pass (green), then improve the code without changing behavior (refactor). The value most people miss lives in the red step. Writing the test before the implementation forces you to call your own code from the outside before it exists: you choose the function’s name, its arguments, its return shape, and its failure mode while you are still a consumer, not an author. Code that is awkward to call is awkward to test first, so a bad interface surfaces in seconds instead of after it has been wired into fifteen call sites that all now have to change.
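The red and green steps above can be sketched as follows. The function name `cancel_order`, the dictionary shape of an order, and the choice of `ValueError` as the failure mode are all assumptions made for this example. They are exactly the decisions the test forces you to make as a consumer.

```python
# Red: these tests are written first. At this point cancel_order does not exist,
# so running them fails, and writing them has already fixed the function's name,
# argument, return shape, and failure mode.
def test_placed_order_is_cancelled():
    result = cancel_order({"id": 7, "status": "placed"})
    assert result == {"id": 7, "status": "cancelled"}


def test_shipped_order_cannot_be_cancelled():
    try:
        cancel_order({"id": 7, "status": "shipped"})
    except ValueError:
        return
    raise AssertionError("expected ValueError for a shipped order")


# Green: the least code that makes both tests pass. Nothing speculative is added.
def cancel_order(order: dict) -> dict:
    if order["status"] != "placed":
        raise ValueError("only placed orders can be cancelled")
    return {**order, "status": "cancelled"}


# Run the tests directly; a runner such as pytest would also collect them.
test_placed_order_is_cancelled()
test_shipped_order_cannot_be_cancelled()
```

If the first test had been painful to write, for example because the order could not be built without a live database, that pain would have been the signal to change the interface before writing the implementation.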
This reframes the whole practice. People argue about TDD as if the question were “does writing tests first find more bugs?” That misses it. The mechanism is feedback latency on design: a hard-to-construct object, a method that needs six mocks to exercise, a function whose return value you can’t assert without reaching into internals — these are design smells, and test-first surfaces them at the cheapest possible moment, before the design has metastasized into the codebase. The arXiv comparative case study found the design effect was the consistent signal across teams; the bug-count effect was noisier.
Skipping refactor is how the loop rots
The most common failure is stopping at green. Under deadline pressure the ritual collapses to “test passes, ship it,” and the third step — refactor — quietly disappears. That is the step where duplication gets removed and tangled names get fixed while the tests still guard behavior. Skip it for a quarter and you have a green CI sitting on top of a codebase nobody wants to touch. Green is not health. The tests passing tells you the behavior is preserved; it tells you nothing about whether the code is still shaped like something a human can change.
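As an illustration of what the refactor step does, here is a sketch with a hypothetical pricing rule (the tiers, the percentages, and all function names are invented for the example). The first version is what "stop at green" leaves behind; the second is the same behavior with the duplication removed, checked by the same assertions.

```python
# Green version: passes the tests, but the discount arithmetic is repeated per tier.
def price_green(amount_cents: int, tier: str) -> int:
    if tier == "gold":
        return amount_cents - amount_cents * 20 // 100
    if tier == "silver":
        return amount_cents - amount_cents * 10 // 100
    return amount_cents


# Refactored version: identical behavior, with the tier data moved into one table.
DISCOUNT_PERCENT_BY_TIER = {"gold": 20, "silver": 10}


def price_refactored(amount_cents: int, tier: str) -> int:
    percent = DISCOUNT_PERCENT_BY_TIER.get(tier, 0)
    return amount_cents - amount_cents * percent // 100


def check_behavior(price) -> None:
    """The same behavioral assertions guard both versions."""
    assert price(10_000, "gold") == 8_000
    assert price(10_000, "silver") == 9_000
    assert price(10_000, "basic") == 10_000


check_behavior(price_green)
check_behavior(price_refactored)
```

Both versions are green. Only the second is easy to extend when a new tier arrives, and that difference is invisible to CI.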
There is a real cost here, and pretending otherwise is dishonest. Industry data puts test-first at roughly 15–22% more upfront time than test-after on the same feature. The trade is front-loaded design feedback and a regression net against later rework — which is why TDD’s return is highest on long-lived, frequently-changed code and weakest on throwaway scripts. We resolve that tension fully in lesson 04; for now the point is that the refactor step is where most of the long-term payoff actually lands, and it is the first thing teams cut.
| Loop step | What you actually do | What it buys you | Cost of skipping it |
|---|---|---|---|
| Red | Write the smallest failing test, calling your API from outside | Design feedback before code exists; a bad interface fails here | You discover the bad API after 15 call sites depend on it |
| Green | Write the least code that passes — even something crude | Forces you to confront the real requirement, not a guess | Over-engineering for cases no test demands |
| Refactor | Clean names and duplication while tests stay green | Most of the long-term payoff; safe restructuring | Green CI over rotting code nobody will touch |
Couple to behavior, or the suite becomes a liability
The quiet killer is what you assert. A test coupled to behavior — “a refund for a fully-shipped order is rejected” — survives any refactor that preserves the outcome, so it protects you while you improve the code. A test coupled to implementation — “RefundService calls inventoryClient.check() exactly once, then ledger.post()” — breaks the instant you reorganize internals, even when nothing the customer sees has changed. The team learns that red is usually noise, and starts ignoring it. That is precisely how a suite slides from asset to liability: the fragile-test anti-pattern documented in xUnit Test Patterns.
The mirror failure is just as dangerous: an implementation-coupled test can stay green while the real behavior is broken, because it only verifies that some internal dance happened, not that the right outcome was produced. Assert through the public surface, assert observable results, and mock only true external boundaries. A suite that breaks exactly when behavior breaks — and only then — is the one developers trust enough to act on.
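Both failure modes can be seen in a small sketch. The `RefundService` below is a hypothetical implementation written for this example, as are the status strings and the deliberately broken variant. The two collaborators are treated as external boundaries and replaced with `unittest.mock.Mock` objects.

```python
from unittest.mock import Mock


class RefundService:
    """Hypothetical service: rejects refunds for fully shipped orders."""

    def __init__(self, inventory_client, ledger):
        self.inventory_client = inventory_client
        self.ledger = ledger

    def refund(self, order_id: str, amount_cents: int) -> bool:
        if self.inventory_client.check(order_id) == "fully_shipped":
            return False
        self.ledger.post(order_id, amount_cents)
        return True


class BrokenRefundService(RefundService):
    """Deliberately buggy: makes the same calls but ignores the shipping status."""

    def refund(self, order_id: str, amount_cents: int) -> bool:
        self.inventory_client.check(order_id)
        self.ledger.post(order_id, amount_cents)
        return True


def make_service(cls, shipping_status: str):
    inventory = Mock()
    inventory.check.return_value = shipping_status
    ledger = Mock()
    return cls(inventory, ledger), inventory, ledger


def implementation_coupled_test(cls) -> bool:
    # Verifies only that the internal calls happened, not what the caller gets back.
    service, inventory, ledger = make_service(cls, "in_warehouse")
    service.refund("o-1", 500)
    return inventory.check.call_count == 1 and ledger.post.call_count == 1


def behavior_test(cls) -> bool:
    # Verifies the observable outcome: a fully shipped order is not refunded.
    service, _, _ = make_service(cls, "fully_shipped")
    return service.refund("o-1", 500) is False


# The implementation-coupled test stays green even for the broken service.
assert implementation_coupled_test(RefundService)
assert implementation_coupled_test(BrokenRefundService)

# The behavior test passes for the correct service and catches the broken one.
assert behavior_test(RefundService)
assert not behavior_test(BrokenRefundService)
```

The call-count test would also fail if a correct refactor cached the inventory lookup or batched ledger writes, even though no customer-visible outcome changed. The behavior test would keep passing through both.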
Why this works
“Test first” sounds like a discipline about correctness, but its real product is a usable interface. When you are forced to consume your own API before writing it, you feel the friction your callers will feel — the four constructor dependencies, the method that can’t be called without a live database, the return value you can’t inspect. Those frictions are design defects, and the test is the first place they become concrete and cheap to fix. You are not writing a test; you are running a design review with yourself as the harshest user.
You write a test first and find you need to construct four collaborators (DB, payment, email, flags) just to exercise one cancel-order rule. What is the test telling you?
What is the primary thing the 'red' step of red-green-refactor buys a senior engineer?
A test asserts RefundService calls inventoryClient.check() once then ledger.post() once. Why is this the fragile-test anti-pattern?
Order the red-green-refactor loop the way a senior runs it for a new rule:
1. Red: write the smallest failing test, calling the not-yet-built API from outside
2. Feel the friction: if it's hard to construct, fix the design before continuing
3. Green: write the least code that makes the test pass
4. Refactor: clean names and remove duplication while tests stay green
5. Assert behavior, not internals, so the test survives the next refactor
- 01 A colleague says 'TDD is just about getting coverage up — I'll write the tests after, same result.' What's the senior counter?
- 02 Why does 'assert behavior, not implementation' decide whether a suite is an asset or a liability?
Red-green-refactor is a design loop, not a coverage ritual. The red step is the load-bearing one: writing the test first makes you the first consumer of your own API, so a bad interface — an object that needs four collaborators, a method that can’t run without a database — fails the test in seconds instead of after fifteen call sites depend on it. That is design feedback at the cheapest possible moment, and the comparative case study found it the most consistent effect of TDD. The green step forces you to confront the real requirement; the refactor step is where most of the long-term payoff lands, because it removes duplication and fixes names while the tests still guard behavior — and it is the first step teams cut under deadline, leaving green CI over code nobody will touch. Test-first costs roughly 15–22% more upfront, which is why its return is highest on long-lived code and weakest on throwaway work. Above all, couple tests to behavior, never to implementation: implementation-coupled tests break on harmless refactors and can stay green while the real outcome is broken, training the team to ignore red. Assert observable outcomes through the public surface and the suite breaks exactly when behavior breaks.