Performance PERF · 08 · 01

The performance loop: discipline, not a project

Performance regresses by default. The eight-step loop — observe, profile, classify, predict, fix, verify, enforce, repeat — is the discipline that keeps a service fast year over year.

PERF Junior ◷ 10 min

A team fixed their p99 from 1.2 s to 200 ms. They called it done, shipped, and moved on. Six months later p99 is 900 ms again. No single regression — just new features, new libraries, a bigger JSON response from an upstream service. The fix was right. The discipline was missing.

By the end of this lesson you will know the exact mechanism that causes a team’s hard-won fix to quietly disappear — and the one step that prevents it.

Why performance regresses by default

Every new feature adds bytes, queries, or allocations. Every dependency upgrade ships new code paths. Every schema change can turn a fast query into a slow one. Without a mechanism to catch these additions, performance degrades continuously.
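To see why no single change stands out, it helps to work out how small the average per-release regression has to be. The sketch below uses the p99 figures from the opening story (200 ms drifting to 900 ms); the count of 26 weekly releases and the assumption of uniform, multiplicative growth are illustrative and not from the lesson.

```python
def per_release_growth(start_ms: float, end_ms: float, releases: int) -> float:
    """Uniform multiplicative growth per release that takes start_ms to end_ms."""
    if releases <= 0:
        raise ValueError("releases must be positive")
    return (end_ms / start_ms) ** (1.0 / releases)


# 200 ms -> 900 ms comes from the opening story.
# 26 releases (one per week for six months) is an assumption for illustration.
growth = per_release_growth(200.0, 900.0, 26)
print(f"average p99 growth per release: {(growth - 1) * 100:.1f}%")
```

Under these assumptions each release adds about 6% to p99, an amount that is easy to lose in run-to-run noise unless something measures it on every change.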

A one-time optimisation has an effective half-life of three to six months. After that window, the accumulated changes from new feature work undo the gains. Teams that treat performance as a project reach their target, then drift back. Teams that treat performance as a discipline stay there.

The difference is one mechanism: the loop.

The discipline is not more expensive; it is cheaper. It frontloads the cost into CI gates and observability instead of paying 20–40% of engineering time in recurring firefights.

The eight-step performance loop

When you internalise these eight steps, any slowdown stops being a mystery and becomes a structured search. Here is the sequence every senior engineer runs:

1. Observe: a symptom surfaces, such as SLO burn, a RUM regression, a user complaint, or a dashboard alert. This tells you that something is wrong, not what.
2. Profile: capture data that matches the symptom. Use a CPU flame graph for CPU spikes, an allocation profile for memory growth, a network waterfall for slow page loads, and a bundle analyzer for client-side bloat.
3. Classify: name the bottleneck's family. The families are CPU-algorithmic, allocation-bound, cache-bound, lock-bound, I/O-bound (N+1), syscall-bound (batching), JIT-deopt, and bundle-bound. Each family has a known fix set.
4. Predict: use Amdahl’s law to estimate how much the headline metric will improve if you fix this hotspot. If the predicted gain is too small to move the metric meaningfully toward your SLO target, this is not the right hotspot; return to step 2.
5. Fix: from the family’s playbook, pick the technique that matches the specific shape of the hotspot. Apply only the predicted change; no scope creep.
6. Verify: re-profile under the same load. Confirm both that the local hotspot shrank and that the headline metric improved.
7. Enforce: add a CI gate, alert, or runbook entry that prevents this exact regression from returning.
8. Move on: find the next bottleneck. The loop never ends; it shifts between layers.

Together these eight steps form a closed feedback system: observe turns a vague symptom into a specific signal, enforce converts a one-time fix into permanent protection, and move-on keeps the cycle alive. Skip enforce, and step 5 is work you will repeat in six months.
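The predict step can be sketched with Amdahl's law: if a hotspot accounts for a fraction p of total time and the fix makes it s times faster, the overall speedup is 1 / ((1 - p) + p / s). The numbers below (a 1200 ms request, a hotspot at 30% of it, a 3x local speedup, a 400 ms target) and the function names are hypothetical. Applying the law to a p99 figure is an approximation, since it strictly describes time on a single execution path.

```python
def amdahl_speedup(fraction: float, local_speedup: float) -> float:
    """Overall speedup when `fraction` of total time becomes `local_speedup` times faster."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if local_speedup <= 0:
        raise ValueError("local_speedup must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


def predicted_latency_ms(current_ms: float, fraction: float, local_speedup: float) -> float:
    """Headline latency expected after the fix, under Amdahl's law."""
    return current_ms / amdahl_speedup(fraction, local_speedup)


# Hypothetical inputs: the hotspot is 30% of a 1200 ms request and the fix makes it 3x faster.
predicted = predicted_latency_ms(1200.0, 0.30, 3.0)
target_ms = 400.0  # hypothetical SLO target
print(f"predicted: {predicted:.0f} ms, meets target alone: {predicted <= target_ms}")
```

Here the prediction is roughly 960 ms, so this hotspot alone cannot reach the target. That is the signal to look for a larger fraction before writing any fix.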

Step	Action	Output feeds
1. Observe	Notice the symptom	Which service / metric to profile
2. Profile	Capture the right data stream	Hot function / span name
3. Classify	Name the bottleneck family	Fix playbook to pull from
4. Predict	Amdahl estimate of headline gain	Go/no-go on this hotspot
5. Fix	Apply the matching technique	Changed code / config
6. Verify	Re-profile under same load	Confirmed or reverted
7. Enforce	CI gate / alert / runbook	Regression-proof deploy
8. Move on	Find next bottleneck	Next iteration of step 1
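As one possible shape for the enforce step, a CI job can compute p99 from load-test latencies and fail the build when it exceeds a budget. This is a minimal sketch: the function names, the 300 ms budget, and the synthetic samples are assumptions, and a real gate would read samples from whatever load-test tool the team already uses.

```python
import math


def percentile(samples_ms, pct):
    """Nearest-rank percentile of a non-empty list of latencies."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = math.ceil(pct * len(ordered) / 100.0)
    return ordered[max(rank, 1) - 1]


def gate_exit_code(samples_ms, budget_ms, pct=99.0):
    """Return 0 if the percentile is within budget, 1 otherwise."""
    observed = percentile(samples_ms, pct)
    if observed > budget_ms:
        print(f"FAIL: p{pct:g} = {observed:.0f} ms exceeds budget {budget_ms:.0f} ms")
        return 1
    print(f"OK: p{pct:g} = {observed:.0f} ms within budget {budget_ms:.0f} ms")
    return 0


# Synthetic stand-in for load-test output: 100 requests between 101 and 200 ms.
samples = [100.0 + i for i in range(1, 101)]
code = gate_exit_code(samples, budget_ms=300.0)
# In a CI job the script would finish with sys.exit(code) so a breach fails the build.
```

Returning the exit code from a function, and leaving the actual exit to the caller, keeps the check easy to unit-test.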

The kitchen metaphor

Performance is like cleaning a kitchen, not painting a room. Painting once is fine. A kitchen cleaned once gets dirty as cooking happens; you clean continuously.

Each piece in this chapter is a tool: profiler, hot-paths classifier, GC fixer, N+1 detector, batcher, bundle analyzer. None alone keeps the kitchen clean; the loop does.

▸Why this works

Teams without the loop end up with “why is the site slow now?” meetings every six months, each consuming 5 to 20 engineer-days. Teams with the loop have steady metrics year over year. The loop does not demand more engineer-time in total; it frontloads the investment into CI gates and observability rather than deferring it to incident response.

Bea and Sven’s quarter

Bea joins a team where the service was fast a year ago. Now p99 is 1.2 s, up from 200 ms. Sven walks her through the loop: profile shows GC pressure at 18%, an N+1 in /orders adds 50 queries per request, /dashboard bundle grew 800 KB over six months. No single crisis — three separate slow accumulations.

They run the loop on each bottleneck one at a time: logger allocation fix (week 1), query deduplication (weeks 2–3), bundle code-split (week 4). After a month, p99 is 280 ms. CI gates keep the work alive through the next quarter of feature shipping.
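The N+1 pattern in the story can be reproduced in a few lines. The sketch below uses an in-memory SQLite database with a hypothetical orders/items schema and counts queries by hand; it shows one common remedy (a single JOIN), which is not necessarily the technique Bea and Sven used.

```python
import sqlite3


def make_db():
    """Build a hypothetical schema: 50 orders with 2 items each."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT);
        CREATE TABLE items (id INTEGER PRIMARY KEY, order_id INTEGER, sku TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO orders (id, customer) VALUES (?, ?)",
        [(i, f"c{i}") for i in range(1, 51)],
    )
    conn.executemany(
        "INSERT INTO items (order_id, sku) VALUES (?, ?)",
        [(i, f"sku-{i}-{j}") for i in range(1, 51) for j in range(2)],
    )
    return conn


def items_n_plus_one(conn):
    """One query for the orders, then one more query per order."""
    orders = conn.execute("SELECT id FROM orders ORDER BY id").fetchall()
    queries = 1
    result = {}
    for (order_id,) in orders:
        rows = conn.execute(
            "SELECT sku FROM items WHERE order_id = ? ORDER BY id", (order_id,)
        ).fetchall()
        queries += 1
        result[order_id] = [sku for (sku,) in rows]
    return result, queries


def items_joined(conn):
    """Same data fetched with a single JOIN."""
    rows = conn.execute(
        "SELECT o.id, i.sku FROM orders o "
        "LEFT JOIN items i ON i.order_id = o.id ORDER BY o.id, i.id"
    ).fetchall()
    result = {}
    for order_id, sku in rows:
        skus = result.setdefault(order_id, [])
        if sku is not None:  # orders without items still get an empty list
            skus.append(sku)
    return result, 1


conn = make_db()
slow, slow_queries = items_n_plus_one(conn)
fast, fast_queries = items_joined(conn)
print(f"N+1: {slow_queries} queries, joined: {fast_queries} query, same data: {slow == fast}")
```

With 50 orders the first version issues 51 queries, matching the 50 extra queries per request in the story, while the joined version issues one.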

Quiz

A team applied a performance fix and shipped. Six months later, performance is worse than before the fix. Most likely cause?

Order the steps

Order the eight steps of the performance loop a senior engineer runs every time:

1 Notice the symptom — SLO burn, RUM regression, profile alert
2 Open the profile — identify the hot path with concrete numbers
3 Classify the hotspot: CPU, allocation, cache, lock, I/O, syscall, JIT, bundle
4 Predict headline metric impact using Amdahl
5 Apply only the predicted change; no scope creep
6 Re-profile under same load; verify both local frame and headline metric improved
7 Add a CI gate or alert so this regression cannot return invisibly
8 Document and move to the next bottleneck

Complete the analogy

Fill in the blank: performance is the _______ of the codebase — measured continuously, enforced at every commit, owned by every engineer.

Quiz

What does it mean to treat performance as a 'loop' rather than a 'project'?

The loop never ends — it shifts between layers. Enforcement is the step that converts a one-time fix into a durable property.

Recall before you leave

01
Why does a one-time performance fix have a half-life of three to six months?
02
What is the role of the 'enforce' step, and why is it the most important of the eight?
03
In Bea and Sven's scenario, three separate bottlenecks accumulated over six months. What prevented the team from noticing each one as it appeared?

Recap

Performance regresses by default. Every new feature, dependency, and deploy adds bytes, queries, or allocations without anyone noticing. A one-time optimisation has a half-life of three to six months before accumulated changes undo the gains. The performance loop — observe, profile, classify, predict, fix, verify, enforce, repeat — converts the one-time fix into a durable property. The critical step is enforcement: CI gates that fail any PR reintroducing the same regression class. Teams without the loop reach a performance crisis every six to eighteen months and rebuild from scratch; teams with it maintain steady metrics year over year at a cost of five to ten percent of engineering time, versus twenty to forty percent in crisis mode. Now when you see p99 slowly climbing again after a fix, you will know exactly which step of the loop is missing — and what gate to add.
