Engineering Practice ENG · 03 · 01

Code review: what humans catch, what tooling should, and why latency is the real cost

Review exists to catch design and intent, not style — automate style. The dominant cost is latency: a PR waiting a day blocks the author. And defect detection collapses past a few hundred lines, so big diffs get rubber-stamped.

ENG Junior ◷ 16 min

A 1,800-line PR lands Friday at 5pm: a new billing flow, a schema migration, and a refactor, all in one diff. Two reviewers leave four comments — a rename, a missing semicolon, a “nit: prefer const” — and approve. Three weeks later the migration double-charges a cohort of customers because a retry path wasn’t idempotent. Nobody saw it. The diff was too big to actually read, so the reviewers reviewed what was easy to see: trivia. The real flaw was in the design, and the design is exactly what a giant LGTM’d PR hides.

Review catches what tooling can’t

A senior reviewer’s scarce attention should go to the things a machine cannot judge: is this the right design, does it actually do what the ticket intended, will the next engineer understand it, is the error path correct, does it open a security hole. Google’s own standard puts design first — “the most important thing to cover in a review” — because design choices compound and are expensive to undo later, while a misplaced brace costs nothing.

Everything a tool can decide deterministically should never reach a human. Formatting, import order, naming conventions, lint rules, unused variables — run a formatter and a linter in CI and make them blocking. The moment a person types “nit: spacing,” you’ve spent the most expensive review resource you have (senior judgement) on the cheapest possible problem, and you’ve taught the author that review is about surface. Google’s engineering culture is explicit: automate style enforcement so style never reaches review, then train reviewers to spend their attention on correctness, security, and architecture.

Concern	Who should catch it	Why
Formatting, import order, spacing	Formatter (CI)	Deterministic; zero judgement; should never reach a human
Unused vars, obvious bugs, lint rules	Linter / static analysis (CI)	Machine-checkable; blocking gate before review starts
Is this the right design?	Human reviewer	Needs context, intent, system knowledge — no tool can judge it
Does it match the ticket’s intent?	Human reviewer	Correct code that solves the wrong problem still ships a bug
Maintainability, naming-for-meaning	Human reviewer	“Will the next dev understand this?” is a judgement call
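
The first two rows of the table can be sketched as a small gate that CI runs before any reviewer is requested. This is a minimal sketch under assumptions, not a prescribed setup: the lesson names no specific formatter or linter, so the `ruff` commands are an assumed tool choice, and `run_gate` and `main` are hypothetical helpers.

```python
import subprocess

# Assumed tool choice: swap in whatever formatter and linter your repo uses.
DEFAULT_CHECKS = [
    ["ruff", "format", "--check", "."],  # fails if any file would be reformatted
    ["ruff", "check", "."],              # lint rules, unused variables, and so on
]


def run_gate(checks):
    """Run every check and return the list of commands that failed.

    All checks run even after one fails, so the author sees every
    mechanical problem in a single CI pass.
    """
    failed = []
    for cmd in checks:
        try:
            result = subprocess.run(cmd)
        except FileNotFoundError:
            # A missing tool counts as a failure rather than a silent pass.
            failed.append(cmd)
            continue
        if result.returncode != 0:
            failed.append(cmd)
    return failed


def main(checks):
    """Return a process exit code: 0 if every check passed, 1 otherwise."""
    failures = run_gate(checks)
    for cmd in failures:
        print("blocking check failed:", " ".join(cmd))
    return 1 if failures else 0
```

In CI the job would end with something like `sys.exit(main(DEFAULT_CHECKS))`, so a non-zero exit blocks the merge and the style problem never reaches a human.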

Latency is the cost nobody measures

The number that matters most isn’t defect count; it’s turnaround. A PR is blocked work: the author can’t merge, often can’t start the next thing cleanly, and every hour it sits is an hour of someone’s progress frozen. The data is stark. Analysis across roughly 8 million pull requests found elite teams close the full review cycle in under 26 hours, while lagging teams let PRs sit a week before review even starts. The single biggest chunk of that time isn’t reviewing; it’s the initial pickup, the gap before anyone looks. Distributed teams lose another 8–16 hours per round to timezone gaps alone.
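
To see how much of a cycle is pickup rather than reviewing, it helps to split the timestamps. The sketch below is illustrative only: `PullRequestTimes` and its three fields are hypothetical names, and it assumes you can export an opened, first-review, and merged timestamp for each PR from your code host.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class PullRequestTimes:
    opened: datetime        # PR marked ready for review
    first_review: datetime  # first reviewer comment or approval
    merged: datetime        # merge time


def pickup_time(pr: PullRequestTimes) -> timedelta:
    """Idle gap before anyone looks at the PR."""
    return pr.first_review - pr.opened


def cycle_time(pr: PullRequestTimes) -> timedelta:
    """Full review cycle, from opened to merged."""
    return pr.merged - pr.opened


def pickup_share(pr: PullRequestTimes) -> float:
    """Fraction of the cycle spent waiting for the first look."""
    total = cycle_time(pr)
    if total <= timedelta(0):
        raise ValueError("merged must be later than opened")
    return pickup_time(pr) / total


# Opened at 09:00, first looked at 24 hours later, merged 6 hours after that.
example = PullRequestTimes(
    opened=datetime(2024, 1, 1, 9, 0),
    first_review=datetime(2024, 1, 2, 9, 0),
    merged=datetime(2024, 1, 2, 15, 0),
)
print(pickup_time(example), cycle_time(example), pickup_share(example))
```

In this made-up example, 24 of the 30 hours are pickup, so four fifths of the cycle was nobody looking. Tracking the pickup figure separately is what shows whether the fix is faster reviewing or faster starting.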

That idle time has a price. The same analysis put the cost of review-bottleneck waiting at roughly 5.8 hours per developer per week, around $238K/year of frozen time for a ten-person team — and that’s waiting only, not the review itself. This is why DORA treats lead time as a core health metric: slow review directly degrades throughput, and a team optimising purely for “thorough review” while ignoring latency is trading a small defect win for a large flow loss.
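
The arithmetic behind those figures is easy to rerun with your own numbers. This is a back-of-envelope sketch: the 52 working weeks and the hourly rate derived at the end are assumptions of this example, not figures from the analysis, and both function names are hypothetical.

```python
def annual_wait_hours(hours_per_dev_week: float, team_size: int,
                      weeks_per_year: int = 52) -> float:
    """Total hours per year a team spends waiting on review."""
    return hours_per_dev_week * team_size * weeks_per_year


def implied_hourly_cost(annual_cost: float, annual_hours: float) -> float:
    """Cost per hour that would make the two published figures agree."""
    return annual_cost / annual_hours


# 5.8 hours per developer per week, ten developers, 52 weeks (assumed).
hours = annual_wait_hours(5.8, 10)          # about 3,016 hours a year
rate = implied_hourly_cost(238_000, hours)  # roughly 79 dollars per hour
print(round(hours), round(rate, 2))
```

Under these assumptions, the $238K figure corresponds to a loaded cost of about $79 per developer-hour. Substituting your own team size and rate gives a local estimate of what the waiting costs.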

Why this works

The fastest fix for review latency is almost never “review faster” — it’s “make PRs smaller.” A 50-line PR is picked up in minutes because it’s cheap to context-switch into; a 1,500-line PR sits because nobody has a free hour. Small PRs attack both problems at once: faster pickup and higher defect detection. Latency and quality are usually the same lever, not a tradeoff.

Small-PR economics: defect detection falls off a cliff

SmartBear’s study at Cisco (2,500 reviews over 3.2 million lines of code, the largest of its kind) found the sweet spot is 200–400 lines of code per review. Inside that band, with the review spread over no more than 60–90 minutes, you get a 70–90% defect yield. Push past 400 lines and detection drops sharply; the human stops reading and starts skimming. Pace matters too: reviewers going slower than ~400 LOC/hour were above average at finding defects, but past ~450 LOC/hour, defect density came in below average in 87% of cases.
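
The pace and duration limits can be turned into a quick self-check for a finished review. This is a sketch under assumptions: `inspection_rate` and `review_warnings` are hypothetical helpers, the 90-minute and 400 LOC/hour defaults are taken from the figures in the paragraph above, and treating them as hard thresholds is a simplification.

```python
def inspection_rate(loc_reviewed: int, minutes_spent: float) -> float:
    """Lines of code reviewed per hour."""
    if minutes_spent <= 0:
        raise ValueError("minutes_spent must be positive")
    return loc_reviewed * 60 / minutes_spent


def review_warnings(loc_reviewed: int, minutes_spent: float,
                    max_minutes: float = 90, max_rate: float = 400) -> list:
    """Return plain-language warnings for a review that was too long or too fast."""
    warnings = []
    if minutes_spent > max_minutes:
        warnings.append("session longer than %d minutes" % max_minutes)
    rate = inspection_rate(loc_reviewed, minutes_spent)
    if rate > max_rate:
        warnings.append("pace of %.0f LOC/hour is above %d" % (rate, max_rate))
    return warnings


print(review_warnings(300, 60))    # 300 LOC/hour inside one hour: no warnings
print(review_warnings(1800, 30))   # 3,600 LOC/hour: flagged as skimming
```

A warning here does not mean the review was wrong, only that the numbers look more like skimming than reading.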

This is the mechanism behind the rubber stamp. A reviewer faced with a 1,800-line diff doesn’t have the budget to hold it all in their head, so they do the only thing that fits: they comment on the things that are locally visible (a name, a style nit) and approve. The giant PR doesn’t get more scrutiny for being big — it gets less, because attention doesn’t scale with size and the reviewer’s cognitive limit is fixed. Big diffs are where real design flaws hide precisely because they’re the diffs nobody can fully read.

PR size	What actually happens	Defect detection
`LOC < 200`	Read fully, fast pickup, real design discussion	High
`200–400 LOC`	Sweet spot: 60–90 min, full attention	70–90% yield
`400–800 LOC`	Attention thins; skimming begins	Falling
`LOC > 800`	Rubber stamp: nits + LGTM, design unread	Collapses
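
The bands in the table can be applied to a team's recent PRs to see where its diffs actually fall. A minimal sketch: the band labels and the `review_band` and `band_counts` helpers are hypothetical, and because the table leaves the exact edges open, this example assumes 400 and 800 belong to the lower band.

```python
from collections import Counter


def review_band(loc_changed: int) -> str:
    """Map a PR's changed line count to one of the four bands in the table."""
    if loc_changed < 0:
        raise ValueError("loc_changed cannot be negative")
    if loc_changed < 200:
        return "read fully"
    if loc_changed <= 400:
        return "sweet spot"
    if loc_changed <= 800:
        return "attention thins"
    return "rubber-stamp risk"


def band_counts(pr_sizes) -> Counter:
    """Count how many PRs fall into each band."""
    return Counter(review_band(size) for size in pr_sizes)


# Hypothetical sizes for one sprint's PRs, including the 1,800-line one.
print(band_counts([40, 120, 350, 600, 1800]))
```

If most of the count sits in the last two bands, the team's review process is probably producing approvals rather than scrutiny, whatever the comment threads look like.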

Every review is a bet on where to spend a fixed budget. Crank thoroughness to the maximum and you slow the whole team’s flow; optimise purely for throughput and defects leak. The senior move is to spend depth where the risk is — the design, the data migration, the auth boundary, the money path — and to let low-risk diffs through fast. Google’s standard codifies this with a deliberately humane rule: approve once the change definitely improves overall code health, even if it isn’t perfect. “Not perfect” is not a reason to block; “makes the codebase worse” is.
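
Spending depth where the risk is can be prompted by a simple path check that tells the reviewer which diffs deserve a slow read. This is only a sketch: the path patterns are hypothetical and every repository will need its own, and a path match is a hint for a human, not a judgement about the design.

```python
from fnmatch import fnmatchcase

# Hypothetical path patterns for the high-risk areas named above.
HIGH_RISK_PATTERNS = {
    "data migration": ["migrations/*", "*.sql"],
    "auth boundary": ["auth/*", "*/auth/*"],
    "money path": ["billing/*", "*/billing/*", "payments/*"],
}


def risk_areas(changed_files) -> list:
    """Return the sorted names of high-risk areas touched by a diff."""
    hits = set()
    for path in changed_files:
        for area, patterns in HIGH_RISK_PATTERNS.items():
            if any(fnmatchcase(path, pattern) for pattern in patterns):
                hits.add(area)
    return sorted(hits)


def review_depth(changed_files) -> str:
    """Suggest 'deep' when any high-risk area is touched, else 'fast'."""
    return "deep" if risk_areas(changed_files) else "fast"


print(risk_areas(["billing/retry.py", "docs/readme.md"]))
print(review_depth(["docs/readme.md"]))
```

A diff that touches only documentation gets the fast lane, while one that touches the retry path in billing is flagged for the kind of attention that would have caught a non-idempotent retry.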

The second axis is cultural: review as knowledge-sharing versus review as gatekeeping. Done well, review spreads context — the author learns the system, the reviewer learns the change, and the bus factor goes up. Done as a status game, it becomes bikeshedding (Parkinson’s Law of Triviality: people argue endlessly about the colour of the bike shed because it’s the only part they understand) and gatekeeping, where a reviewer blocks to assert control rather than to improve the code. The tell is a thread with twelve comments about naming on a PR whose actual design flaw shipped untouched. A review that nitpicks trivia while waving through a flawed design has inverted its entire purpose.

Pick the best fit

A teammate opens a 1,400-line PR bundling a feature, a refactor, and a migration. It's been blocking their next task for two days. As reviewer, what's the senior call?

Quiz

A reviewer leaves six comments on import ordering and variable spacing, then approves. What's the core problem?

Quiz

Per the SmartBear/Cisco study, what happens to defect detection as a single review grows past ~400 LOC?

Order the steps

Order a healthy review pipeline so human attention lands where it's worth most:

1 Formatter + linter run in CI and block — style/format never reaches a human
2 Keep the PR small (aim ~200–400 LOC) so it's readable in one sitting
3 Pick it up fast — initial latency is the biggest part of turnaround
4 Spend reviewer depth on the highest-risk parts: design, migration, auth, money path
5 Approve once it definitely improves code health — 'not perfect' isn't a block

CI filters the mechanical so reviewers spend their budget on design; changes loop back for another pass until approved.

Recall before you leave

01
A teammate says 'thorough review means catching everything, so bigger reviews should catch more bugs.' Explain why the data says the opposite.
02
Why is review latency, not defect count, often the metric a senior watches first — and what's the cheapest way to improve it?

Recap

Code review earns its keep on the things tooling can’t judge — design, intent, maintainability, the correct error and money paths — so push everything deterministic (format, lint, style) into a blocking CI gate and never spend a human on it. The cost nobody measures is latency: a PR in review is blocked work, elite teams close the cycle in under 26 hours while laggards let it sit a week, and the waiting alone burns ~5.8 hours per developer per week. Defect detection peaks at 200–400 LOC and collapses past ~400, which is why giant PRs get a nits-and-LGTM rubber stamp while their real design flaws ship untouched. The senior balance is to spend depth where the risk lives, approve once the change definitely improves code health rather than chasing perfection, and treat review as knowledge-sharing rather than a gatekeeping status game. Smaller PRs are the one move that improves quality and flow at the same time.

