Engineering Practice ENG · 01 · 01

Red-green-refactor is a design loop, not a testing ritual

Writing the test first is design pressure, not coverage hygiene. You call your own API before it exists, so a bad interface shows up in seconds instead of after fifteen call sites. Most of the long-term payoff then lands in the refactor step, which is the step teams skip under deadline.

ENG Junior ◷ 15 min

Priya is asked to add a “cancel order” feature. She writes the test first and immediately hits a wall: to assert the outcome she has to construct an OrderService that needs a database connection, a payment gateway, an email client, and a feature-flag service. That is four collaborators just to test one boolean rule, and the test cannot be written cleanly. A teammate working test-after would have noticed this only after wiring cancel() into a controller, two jobs, and an admin script. The unwritable test is the design review: OrderService is doing too much, and TDD said so before a single line of the cancel feature existed.
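A minimal sketch of the friction Priya hits, and one possible response to it. Everything here is a hypothetical illustration: the constructor signature, the Order type, the status names, and the can_cancel rule are assumptions, not the lesson's actual codebase.

```python
from dataclasses import dataclass


class OrderService:
    """Hypothetical service with the four collaborators from the scenario."""

    def __init__(self, db, payment_gateway, email_client, feature_flags):
        self.db = db
        self.payment_gateway = payment_gateway
        self.email_client = email_client
        self.feature_flags = feature_flags


# What the first test would have to look like: four stand-ins must be built
# before a single assertion about the cancel rule can be written.
#
#   service = OrderService(FakeDb(), FakeGateway(), FakeEmail(), FakeFlags())
#   assert service.cancel(order_id) is True
#
# One possible response (an assumption, not the only fix): pull the rule out
# into a pure function that needs no collaborators at all.


@dataclass(frozen=True)
class Order:
    status: str  # assumed statuses: "placed", "shipped", "cancelled"


def can_cancel(order: Order) -> bool:
    """Assumed rule for illustration: only a placed order can be cancelled."""
    return order.status == "placed"


def test_placed_order_can_be_cancelled():
    assert can_cancel(Order(status="placed")) is True


def test_shipped_order_cannot_be_cancelled():
    assert can_cancel(Order(status="shipped")) is False
```

With the rule extracted, the test needs one small value object instead of four collaborators, and OrderService can shrink to orchestration around it.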

The first step is design pressure, not coverage

The loop is deliberately tiny: write the smallest failing test (red), write the least code that makes it pass (green), then improve the code without changing behavior (refactor). The value most people miss starts in the red step. Writing the test before the implementation forces you to call your own code from the outside before it exists: you choose the function’s name, its arguments, its return shape, and its failure mode while you are still a consumer, not an author. Code that is awkward to call is awkward to test first, so a bad interface surfaces in seconds instead of after it has been wired into fifteen call sites that all have to change.
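The red and green steps might look like the sketch below. The function cancellation_fee, its argument, the one-hour window, and the 500-cent fee are all invented for illustration; the point is only the order in which things get written.

```python
# RED: these tests are written first. At that moment cancellation_fee does
# not exist, so running them fails with NameError. Writing them still forces
# four decisions from the caller's side: the name, the argument
# (hours_since_order), the return shape (a fee in cents), and the failure
# mode (ValueError for a negative duration).
def test_cancellation_is_free_within_the_first_hour():
    assert cancellation_fee(hours_since_order=0.5) == 0


def test_late_cancellation_costs_a_flat_fee():
    assert cancellation_fee(hours_since_order=3) == 500


def test_negative_duration_is_rejected():
    try:
        cancellation_fee(hours_since_order=-1)
    except ValueError:
        return
    raise AssertionError("expected ValueError for a negative duration")


# GREEN: the least code that makes all three tests pass. The magic numbers
# are deliberate at this stage; cleaning them up belongs to the refactor step.
def cancellation_fee(hours_since_order):
    if hours_since_order < 0:
        raise ValueError("hours_since_order must not be negative")
    if hours_since_order < 1:
        return 0
    return 500
```

If the call in the first test had read awkwardly, for example by needing a service object just to compute a number, that would have been the signal to change the interface before writing any implementation.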

This reframes the whole practice. People argue about TDD as if the question were “does writing tests first find more bugs?” That misses it. The mechanism is feedback latency on design. A hard-to-construct object, a method that needs six mocks to exercise, a function whose return value you can’t assert without reaching into internals: these are design smells, and test-first surfaces them at the cheapest possible moment, before the design has spread through the codebase. A comparative case study published on arXiv found that the design effect was the consistent signal across teams, while the bug-count effect was noisier.

Skipping refactor is how the loop rots

The most common failure is stopping at green. Under deadline pressure the ritual collapses to “test passes, ship it,” and the third step — refactor — quietly disappears. That is the step where duplication gets removed and tangled names get fixed while the tests still guard behavior. Skip it for a quarter and you have a green CI sitting on top of a codebase nobody wants to touch. Green is not health. The tests passing tells you the behavior is preserved; it tells you nothing about whether the code is still shaped like something a human can change.
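As a rough sketch of the step that gets cut, here is a crude version that reached green next to a refactored one. The function names, the item shape, and the 10% member discount are assumptions made up for this example. The same behavioral check runs against both versions, which is what makes the cleanup safe.

```python
# Crude code that reached green: duplicated arithmetic and unclear names.
def total_v1(items, is_member):
    if is_member:
        t = 0
        for i in items:
            t = t + i["price_cents"] * i["qty"]
        return t - t * 10 // 100
    else:
        t = 0
        for i in items:
            t = t + i["price_cents"] * i["qty"]
        return t


# Refactored: duplication removed, names fixed, behavior unchanged.
MEMBER_DISCOUNT_PERCENT = 10  # assumed value for illustration


def subtotal_cents(items):
    return sum(item["price_cents"] * item["qty"] for item in items)


def order_total_cents(items, is_member):
    subtotal = subtotal_cents(items)
    if not is_member:
        return subtotal
    return subtotal - subtotal * MEMBER_DISCOUNT_PERCENT // 100


def check_behavior(total_fn):
    """The guard: identical assertions, whichever implementation is passed in."""
    items = [{"price_cents": 1000, "qty": 2}, {"price_cents": 250, "qty": 1}]
    assert total_fn(items, is_member=False) == 2250
    assert total_fn(items, is_member=True) == 2025
    assert total_fn([], is_member=True) == 0


check_behavior(total_v1)           # green before the refactor
check_behavior(order_total_cents)  # still green after it
```

Stopping at total_v1 ships the same behavior, but the next person to change the discount has to find and edit it in two places.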

There is a real cost here, and pretending otherwise is dishonest. Industry data puts test-first at roughly 15–22% more upfront time than test-after on the same feature. The trade is front-loaded design feedback and a regression net against later rework — which is why TDD’s return is highest on long-lived, frequently-changed code and weakest on throwaway scripts. We resolve that tension fully in lesson 04; for now the point is that the refactor step is where most of the long-term payoff actually lands, and it is the first thing teams cut.

Loop step	What you actually do	What it buys you	Cost of skipping it
Red	Write the smallest failing test, calling your API from outside	Design feedback before code exists; a bad interface fails here	You discover the bad API after 15 call sites depend on it
Green	Write the least code that passes — even something crude	Forces you to confront the real requirement, not a guess	Over-engineering for cases no test demands
Refactor	Clean names and duplication while tests stay green	Most of the long-term payoff; safe restructuring	Green CI over rotting code nobody will touch

Couple to behavior, or the suite becomes a liability

The quiet killer is what you assert. A test coupled to behavior — “a refund for a fully-shipped order is rejected” — survives any refactor that preserves the outcome, so it protects you while you improve the code. A test coupled to implementation — “RefundService calls inventoryClient.check() exactly once, then ledger.post()” — breaks the instant you reorganize internals, even when nothing the customer sees has changed. The team learns that red is usually noise, and starts ignoring it. That is precisely how a suite slides from asset to liability: the fragile-test anti-pattern documented in xUnit Test Patterns.
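The difference can be sketched with Python's standard unittest.mock. The lesson names RefundService, inventoryClient.check(), and ledger.post() but not their signatures, so the method arguments, the "fully_shipped" status, the RefundRejected exception, and both fakes below are assumptions. The refactored class changes only internals: a helper method and keyword arguments.

```python
from unittest.mock import Mock, call


class RefundRejected(Exception):
    """Raised when a refund is not allowed (hypothetical name)."""


class RefundService:
    def __init__(self, inventory_client, ledger):
        self._inventory_client = inventory_client
        self._ledger = ledger

    def refund(self, order_id, amount_cents):
        if self._inventory_client.check(order_id) == "fully_shipped":
            raise RefundRejected(order_id)
        self._ledger.post(order_id, -amount_cents)


class RefactoredRefundService(RefundService):
    """Same observable behavior; internals reorganized."""

    def refund(self, order_id, amount_cents):
        self._reject_if_fully_shipped(order_id)
        self._ledger.post(order_id=order_id, amount_cents=-amount_cents)

    def _reject_if_fully_shipped(self, order_id):
        if self._inventory_client.check(order_id=order_id) == "fully_shipped":
            raise RefundRejected(order_id)


class FakeInventoryClient:
    def __init__(self, status):
        self._status = status

    def check(self, order_id):
        return self._status


class FakeLedger:
    def __init__(self):
        self.entries = []

    def post(self, order_id, amount_cents):
        self.entries.append((order_id, amount_cents))


def check_behavior(service_cls):
    """Coupled to behavior: a fully shipped order is rejected, nothing posted."""
    ledger = FakeLedger()
    service = service_cls(FakeInventoryClient("fully_shipped"), ledger)
    try:
        service.refund("order-1", 1200)
    except RefundRejected:
        assert ledger.entries == []
        return
    raise AssertionError("expected RefundRejected")


def check_call_sequence(service_cls):
    """Coupled to implementation: exact calls, exact order, exact arguments."""
    boundary = Mock()
    boundary.inventory.check.return_value = "in_warehouse"
    service_cls(boundary.inventory, boundary.ledger).refund("order-1", 1200)
    assert boundary.mock_calls == [
        call.inventory.check("order-1"),
        call.ledger.post("order-1", -1200),
    ]
```

check_behavior passes for both classes. check_call_sequence passes for RefundService and raises AssertionError for RefactoredRefundService, even though no customer-visible outcome changed.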

The mirror failure is just as dangerous: an implementation-coupled test can stay green while the real behavior is broken, because it only verifies that some internal dance happened, not that the right outcome was produced. Assert through the public surface, assert observable results, and mock only true external boundaries. A suite that breaks exactly when behavior breaks — and only then — is the one developers trust enough to act on.
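The mirror failure can be sketched the same way. BuggyRefundService and the helper names below are hypothetical, as are the method signatures and the "fully_shipped" status. The service has a genuine bug: it asks the inventory client and then discards the answer.

```python
from unittest.mock import Mock


class BuggyRefundService:
    """Hypothetical service with a real bug: the inventory answer is ignored."""

    def __init__(self, inventory_client, ledger):
        self._inventory_client = inventory_client
        self._ledger = ledger

    def refund(self, order_id, amount_cents):
        self._inventory_client.check(order_id)  # result discarded: the bug
        self._ledger.post(order_id, -amount_cents)


class FakeInventoryClient:
    def check(self, order_id):
        return "fully_shipped"


class FakeLedger:
    def __init__(self):
        self.entries = []

    def post(self, order_id, amount_cents):
        self.entries.append((order_id, amount_cents))


def interaction_check_is_green():
    """Verifies only that the internal dance happened."""
    inventory_client, ledger = Mock(), Mock()
    BuggyRefundService(inventory_client, ledger).refund("order-1", 1200)
    inventory_client.check.assert_called_once()
    ledger.post.assert_called_once()
    return True


def behavior_check_is_green():
    """Verifies the outcome: a fully shipped order must post nothing."""
    ledger = FakeLedger()
    BuggyRefundService(FakeInventoryClient(), ledger).refund("order-1", 1200)
    return ledger.entries == []


print(interaction_check_is_green())  # True: the suite stays green
print(behavior_check_is_green())     # False: money was refunded anyway
```

Only the check that asserts the observable result notices that a fully shipped order was refunded.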

Why this works

“Test first” sounds like a discipline about correctness, but its real product is a usable interface. When you are forced to consume your own API before writing it, you feel the friction your callers will feel — the four constructor dependencies, the method that can’t be called without a live database, the return value you can’t inspect. Those frictions are design defects, and the test is the first place they become concrete and cheap to fix. You are not writing a test; you are running a design review with yourself as the harshest user.

Pick the best fit

You write a test first and find you need to construct four collaborators (DB, payment, email, flags) just to exercise one cancel-order rule. What is the test telling you?

Quiz

What is the primary thing the 'red' step of red-green-refactor buys a senior engineer?

Quiz

A test asserts RefundService calls inventoryClient.check() once then ledger.post() once. Why is this the fragile-test anti-pattern?

Order the steps

Order the red-green-refactor loop the way a senior runs it for a new rule:

1 Red: write the smallest failing test, calling the not-yet-built API from outside
2 Feel the friction — if it's hard to construct, fix the design before continuing
3 Green: write the least code that makes the test pass
4 Refactor: clean names and remove duplication while tests stay green
5 Assert behavior, not internals, so the test survives the next refactor

Three-step cycle: write the smallest failing test (red), write the least code to pass (green), then clean without changing behavior (refactor) — and loop back to red for the next rule.

Recall before you leave

01 A colleague says 'TDD is just about getting coverage up — I'll write the tests after, same result.' What's the senior counter?
02 Why does 'assert behavior, not implementation' decide whether a suite is an asset or a liability?

Recap

Red-green-refactor is a design loop, not a coverage ritual. The red step is the load-bearing one: writing the test first makes you the first consumer of your own API, so a bad interface — an object that needs four collaborators, a method that can’t run without a database — fails the test in seconds instead of after fifteen call sites depend on it. That is design feedback at the cheapest possible moment, and the comparative case study found it the most consistent effect of TDD. The green step forces you to confront the real requirement; the refactor step is where most of the long-term payoff lands, because it removes duplication and fixes names while the tests still guard behavior — and it is the first step teams cut under deadline, leaving green CI over code nobody will touch. Test-first costs roughly 15–22% more upfront, which is why its return is highest on long-lived code and weakest on throwaway work. Above all, couple tests to behavior, never to implementation: implementation-coupled tests break on harmless refactors and can stay green while the real outcome is broken, training the team to ignore red. Assert observable outcomes through the public surface and the suite breaks exactly when behavior breaks.

Connected lessons

unlocks

Test doubles: London vs Detroit, and the over-mocking trap (Middle)

deepens into

Test doubles: London vs Detroit, and the over-mocking trap (Middle)

Apply this

Put this lesson to work on a real build.

URL shortener at scale

Build a URL shortener that survives real traffic, then run it: deploy it, watch it, and work the incident when one hot link melts your cache.