Engineering Practice ENG · 02 · 01

The integration-testing dilemma

Spinning every service up together works for two services and collapses for twenty: end-to-end (e2e) tests are slow, fail whenever any hop is unhealthy, and the interactions to cover grow roughly as N², so teams mute the suite and it catches nothing. The test pyramid breaks at the service boundary, where unit tests mock the very thing that breaks.

ENG Junior ◷ 15 min

At 02:14 the orders service starts throwing 500s. A provider team renamed a field from total_cents to amount_cents and shipped it — a clean, well-tested change, green on their own suite. Nothing in their CI knew that three downstream consumers read total_cents. The team had a “full integration environment,” but it had been red for nine days and everyone had stopped looking at it. The first real signal anyone got was a pager, in production, on a Friday. The fix took ten minutes. Finding out it had broken cost an outage — because the only test that could have caught it was the one nobody trusted anymore.

The intuitive answer doesn’t scale

Ask “will these two services still talk to each other?” and the obvious answer is: run them both, send a real request, check the response. That instinct is correct — for two services. The trouble is what happens as the system grows. With N services that call each other, the number of interaction pairs you’d want to exercise grows roughly with N², and an end-to-end test of any one journey needs every hop in that journey to be healthy at the same moment. A request that fans out through gateway → orders → pricing → inventory → payments only passes its e2e test when all five services, plus their databases and queues, are simultaneously up, migrated, and seeded with the right data.
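To put rough numbers on that growth, here is a minimal sketch. It assumes the worst case in which any service may call any other, which real systems rarely reach, and `interaction_pairs` is a hypothetical helper written for this illustration.

```python
from itertools import combinations


def interaction_pairs(n_services: int) -> int:
    """Number of unordered service pairs that could interact: n * (n - 1) / 2."""
    return n_services * (n_services - 1) // 2


for n in (2, 5, 10, 20):
    print(f"{n:>2} services -> {interaction_pairs(n):>3} possible pairs")

# Cross-check the formula by enumerating the pairs in the five-hop journey.
journey = ["gateway", "orders", "pricing", "inventory", "payments"]
enumerated_pairs = len(list(combinations(journey, 2)))
assert enumerated_pairs == interaction_pairs(len(journey))
```

Going from 2 to 20 services multiplies the service count by 10 but the possible pairs by 190, which is the N² pressure described above.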

That “all healthy at once” requirement is the quiet killer. Each service has, say, a 98% chance of being green in the shared environment at any moment. String five together and the journey is green only ~90% of the time; string twelve together and you’re under 80%. The environment isn’t broken because of your change — it’s broken because someone else’s migration is half-applied, or a seed script failed, or a dependency is mid-deploy. You inherit all of that noise on every run.
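The arithmetic behind those percentages can be reproduced in a few lines. This sketch assumes each hop is green independently of the others, which is a simplification (real outages are often correlated); the 98% figure is the illustrative one from the paragraph above.

```python
def journey_green_probability(per_service_green: float, hops: int) -> float:
    """Chance that every hop is green at the same moment, assuming independence."""
    return per_service_green ** hops


for hops in (1, 5, 12, 20):
    p = journey_green_probability(0.98, hops)
    print(f"{hops:>2} hops: green {p:.1%} of the time")
# Prints 98.0%, 90.4%, 78.5% and 66.8% respectively.
```

Nothing about any single service got worse between 5 hops and 20; only the number of things that must be healthy at once changed.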

Slow, flaky, and therefore ignored

Those two forces, combinatorial setup and shared fragility, produce the three symptoms every team on a big e2e suite recognizes. Slow: booting a half-dozen services, a database, and a message broker before a single assertion runs pushes suites to 40+ minutes, so they move out of the inner loop and into a nightly job nobody watches. Flaky: because any hop can time out, a large share of runs (often cited as around 1 in 5) fail for reasons that have nothing to do with the change under test. Ignored: a suite that cries wolf four times for every real bug gets muted, retried until green, or quarantined. A muted test catches nothing, which is strictly worse than no test, because it still costs you the runtime and gives false comfort.
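A quick calculation shows how a 1-in-5 flake rate turns into roughly four false alarms per real bug. The 20% flake rate comes from the text; the 5% real-bug rate is an assumption chosen purely for illustration, and the sketch treats flakes and real bugs as independent.

```python
def red_run_breakdown(flake_rate: float, real_bug_rate: float) -> dict:
    """Split runs into green, red-from-flake-only, and red-with-a-real-bug."""
    real = real_bug_rate
    flaky_only = flake_rate * (1 - real_bug_rate)  # red, but no real bug present
    green = 1 - real - flaky_only
    return {"green": green, "flaky_only": flaky_only, "real": real}


breakdown = red_run_breakdown(flake_rate=0.20, real_bug_rate=0.05)  # 0.05 is assumed
false_alarms_per_real_bug = breakdown["flaky_only"] / breakdown["real"]
print(f"false alarms per real bug: {false_alarms_per_real_bug:.1f}")  # 3.8
```

Under these assumptions a red run is a real bug only about one time in five, which is the point at which people stop opening the build log.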

This is why mature platforms invert the classic advice. Companies such as Netflix and Spotify are often cited as having reshaped the test “pyramid” into a honeycomb or diamond at the service layer: a thin cap of true end-to-end journeys (a commonly quoted rule of thumb is roughly 5-10% of total tests), with the bulk of cross-service confidence pushed down into something faster and more isolated. The question this whole unit answers is: what is that faster, more isolated thing?

Test layer	What it boots	Speed / determinism	Catches the 02:14 rename?
Unit test	Nothing — the boundary is mocked	Milliseconds, deterministic	No — the mock returns the old shape
End-to-end	All services + DB + queue together	Minutes, flaky on any hop	Yes — if the env is green, which it isn’t
The missing layer	One side, against a recorded agreement	Seconds, deterministic	Yes — and at the author’s desk

The boundary is exactly where the pyramid has a hole

Here is the subtle part. The classic test pyramid says “lots of unit tests, fewer integration tests, very few e2e.” Inside a monolith that works, because a unit test exercises real in-process calls between modules. Across a network boundary it springs a leak: a unit test of the orders consumer mocks the pricing provider. The mock returns whatever shape the consumer’s author believed pricing returns. The day pricing renames a field, the consumer’s unit tests stay green — they’re asserting against the author’s stale belief, not against reality. The thing most likely to break (the wire contract between services) is the one thing unit tests deliberately stub out.
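The blind spot is easy to reproduce. In this sketch, `order_total_cents`, `get_quote`, and `RenamedPricing` are hypothetical names invented for the illustration; only the `total_cents` to `amount_cents` rename comes from the incident described earlier.

```python
from unittest.mock import Mock


def order_total_cents(pricing_client, order_id: str) -> int:
    """Consumer code in the orders service: reads total_cents from pricing."""
    quote = pricing_client.get_quote(order_id)
    return quote["total_cents"]


# Unit test: the boundary is mocked with the shape the consumer's author believes in.
mock_pricing = Mock()
mock_pricing.get_quote.return_value = {"total_cents": 1999}
assert order_total_cents(mock_pricing, "order-1") == 1999  # green before AND after the rename


class RenamedPricing:
    """Stand-in for the real provider after it ships the rename."""

    def get_quote(self, order_id: str) -> dict:
        return {"amount_cents": 1999}


try:
    order_total_cents(RenamedPricing(), "order-1")
    broke_against_real_provider = False
except KeyError:
    broke_against_real_provider = True

print("unit test green, real call broke:", broke_against_real_provider)
```

The mock is not wrong as a unit-testing tool; it simply encodes a belief about the provider that nothing ever re-checks.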

So you’re squeezed from both sides. Unit tests are fast and reliable but blind to the boundary by construction. End-to-end tests can see the boundary but are too slow and too flaky to gate every deploy. The renamed-field outage falls straight through the gap. What’s needed is a test that checks the boundary specifically — the shape and semantics of the requests and responses two services exchange — without booting both services together. That is the shape of the problem the rest of this unit solves. When you spot a cross-service failure that unit tests missed and e2e was too broken to catch, you’re already staring at this gap.
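As a rough illustration of what checking the boundary on its own could look like, the sketch below compares a provider response against a recorded list of the fields each consumer reads. The consumer names and the ad-hoc expectation format are assumptions made for this example, not the format of any real contract-testing tool, and it covers shape only, not semantics.

```python
# Recorded agreement: the fields each consumer reads from the pricing response.
# Consumer names are hypothetical; the source only says three consumers read total_cents.
RECORDED_EXPECTATIONS = {
    "orders": {"total_cents": int},
    "invoicing": {"total_cents": int},
    "reporting": {"total_cents": int},
}


def find_violations(provider_response: dict, expectations: dict) -> list:
    """Return one message per expected field that is missing or has the wrong type."""
    problems = []
    for consumer, fields in expectations.items():
        for name, expected_type in fields.items():
            if name not in provider_response:
                problems.append(f"{consumer}: missing field {name!r}")
            elif not isinstance(provider_response[name], expected_type):
                problems.append(f"{consumer}: field {name!r} is not {expected_type.__name__}")
    return problems


before_rename = {"total_cents": 1999}
after_rename = {"amount_cents": 1999}

assert find_violations(before_rename, RECORDED_EXPECTATIONS) == []
for problem in find_violations(after_rename, RECORDED_EXPECTATIONS):
    print(problem)  # one line per consumer that still reads total_cents
```

Only the provider's response is needed to run this, so a check of this kind could sit in the provider's own pipeline and fail before the rename ships, with no shared environment involved.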

Why this works

“Just keep the integration environment green” sounds like a discipline problem, but it’s a structural one. A shared environment’s uptime is the product of every service’s individual uptime, so it degrades multiplicatively as you add services — and its health is owned by everyone, which means it’s owned by no one. Telling teams to try harder doesn’t change the math. The only durable fix is to stop requiring all services to be simultaneously healthy to learn whether two of them agree.

Pick the best fit

A platform has 18 services with dense HTTP dependencies. The e2e suite takes 45 minutes and fails ~20% of runs for reasons unrelated to the change. The team wants reliable cross-service compatibility feedback. What's the most sound direction?

Quiz

Why does an end-to-end suite's reliability degrade as you add more services to a journey?

Quiz

Why do unit tests fail to catch a provider renaming a field that a consumer reads?

Order the steps

Order how an e2e-only strategy decays into the 02:14 outage:

1 Two services integrate; a shared e2e environment is stood up to test them
2 More services join; the journey needs all of them healthy at once
3 Suite slows past 40 minutes and flakes ~1-in-5 for unrelated reasons
4 The team mutes or stops watching the env; it sits red for days
5 A provider renames a field; the only test that would catch it is muted; prod pages at 02:14

Unit tests mock the boundary so they stay green; e2e sees it but gets muted. The rename falls through the gap between both layers.

Recall before you leave

01
A colleague says 'our integration environment just needs more discipline to stay green.' Explain why that's a structural problem, not a discipline one.
02
Where exactly does the test pyramid 'break' at a service boundary, and why does that let a field rename reach production?

Recap

The instinct to test integration by running every service together is right for two services and wrong for twenty. End-to-end suites grow combinatorially: with N interacting services the pairs to exercise scale with N², and any one journey passes only when every hop is healthy at the same instant, so a shared environment’s reliability is the product of its parts and degrades multiplicatively as you add services. The result is the familiar trio: slow (40+ minute runs), flaky (~1-in-5 runs failing for reasons unrelated to the change), and therefore muted. A muted test catches nothing while still costing runtime.

The classic pyramid doesn’t save you, because at a network boundary unit tests mock the provider and assert against the author’s stale belief about its shape, leaving them green the day the provider renames a field. So you’re squeezed between a fast-but-blind layer and a sees-it-but-untrusted layer, and a cross-service rename falls through the gap into a 2 a.m. pager.

What’s needed is a layer that checks the boundary itself, meaning the shape and semantics of what two services exchange, without requiring both to be booted together. That layer is contract testing, and building it up is what the rest of this unit does. Now when you inherit a muted integration environment and a colleague asks “why can’t we just fix it?”, you’ll know it’s a structural math problem, not a discipline one, and you’ll know what to reach for instead.
