
Performance

The performance loop: discipline, not a project

Crux: Performance regresses by default. The eight-step loop (observe, profile, classify, predict, fix, verify, enforce, repeat) is the discipline that keeps a service fast year over year.

A team fixed their p99 from 1.2 s to 200 ms. They called it done, shipped, and moved on. Six months later p99 is 900 ms again. No single regression — just new features, new libraries, a bigger JSON response from an upstream service. The fix was right. The discipline was missing.

Why performance regresses by default

Every new feature adds bytes, queries, or allocations. Every dependency upgrade ships new code paths. Every schema change can turn a fast query into a slow one. Without a mechanism to catch these additions, performance degrades continuously.

A one-time optimisation has an effective half-life of three to six months: within that window, the accumulated changes from new feature work erode the gains. Teams that treat performance as a project get there, then drift back. Teams that treat performance as a discipline stay there.
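To make the drift concrete, here is a minimal sketch of how small per-release slowdowns compound. The 5% slowdown per release is an illustrative assumption, not a measured figure, and the function names are hypothetical.

```python
def latency_after_releases(baseline_ms: float, growth_per_release: float, releases: int) -> float:
    """Latency after `releases` releases, each multiplying latency by (1 + growth)."""
    return baseline_ms * (1.0 + growth_per_release) ** releases


def releases_until(baseline_ms: float, growth_per_release: float, threshold_ms: float) -> int:
    """Count the releases needed before latency reaches `threshold_ms`."""
    if growth_per_release <= 0 and baseline_ms < threshold_ms:
        raise ValueError("latency never reaches the threshold without positive growth")
    latency = baseline_ms
    releases = 0
    while latency < threshold_ms:
        latency *= 1.0 + growth_per_release
        releases += 1
    return releases


if __name__ == "__main__":
    # Assumed numbers: a 200 ms baseline that gets 5% slower with every release.
    print(releases_until(200.0, 0.05, 400.0))
```

Under these assumed numbers, latency doubles after 15 releases even though no single release looks like a regression.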

The difference is one mechanism: the loop.

The eight-step performance loop

Every performance investigation, regardless of layer, follows the same structure:

  1. Observe: a symptom surfaces: SLO burn, RUM regression, user complaint, dashboard alert. This tells you something is wrong, not what.
  2. Profile: capture data appropriate to the symptom. CPU flame graph for CPU spikes, allocation profile for memory growth, network waterfall for slow page loads, bundle analyzer for client-side bloat.
  3. Classify: name the bottleneck by family: CPU-algorithmic, allocation-bound, cache-bound, lock-bound, I/O-bound (N+1), syscall-bound (batching), JIT-deopt, bundle-bound. Each family has a known fix set.
  4. Predict: use Amdahl’s law to estimate how much the headline metric will improve if you fix this hotspot. If even the predicted result falls short of your SLO target, this is not the right hotspot; return to step 2.
  5. Fix: from the family’s playbook, pick the technique that matches the specific shape of the hotspot. Apply only the predicted change; no scope creep.
  6. Verify: re-profile under the same load. Confirm both that the local hotspot shrank AND that the headline metric improved.
  7. Enforce: add a CI gate, alert, or runbook entry that prevents this exact regression from returning.
  8. Move on: find the next bottleneck. The loop never ends; it shifts between layers.

| Step | Action | Output feeds |
| --- | --- | --- |
| 1. Observe | Notice the symptom | Which service / metric to profile |
| 2. Profile | Capture the right data stream | Hot function / span name |
| 3. Classify | Name the bottleneck family | Fix playbook to pull from |
| 4. Predict | Amdahl estimate of headline gain | Go/no-go on this hotspot |
| 5. Fix | Apply the matching technique | Changed code / config |
| 6. Verify | Re-profile under same load | Confirmed or reverted |
| 7. Enforce | CI gate / alert / runbook | Regression-proof deploy |
| 8. Move on | Find next bottleneck | Next iteration of step 1 |
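
The predict step can be sketched with Amdahl's law: if a hotspot accounts for a fraction p of the time and the fix makes it s times faster, the new time is the old time multiplied by (1 - p) + p / s. The sketch below assumes that the profile's time fraction carries over to the headline metric, which is only an approximation for tail percentiles; the function names and numbers are hypothetical.

```python
def predicted_latency_ms(current_ms: float, hotspot_fraction: float, local_speedup: float) -> float:
    """Amdahl estimate of the headline latency after speeding up one hotspot."""
    if not 0.0 <= hotspot_fraction <= 1.0:
        raise ValueError("hotspot_fraction must be between 0 and 1")
    if local_speedup <= 0.0:
        raise ValueError("local_speedup must be positive")
    untouched = 1.0 - hotspot_fraction          # share of time the fix does not affect
    improved = hotspot_fraction / local_speedup  # share of time left in the hotspot
    return current_ms * (untouched + improved)


def worth_fixing(current_ms: float, hotspot_fraction: float,
                 local_speedup: float, target_ms: float) -> bool:
    """Go/no-go: does the predicted latency reach the target?"""
    return predicted_latency_ms(current_ms, hotspot_fraction, local_speedup) <= target_ms


if __name__ == "__main__":
    # Assumed numbers: 1200 ms headline, hotspot is 50% of the time, fix is 5x faster.
    print(predicted_latency_ms(1200.0, 0.5, 5.0))
    print(worth_fixing(1200.0, 0.5, 5.0, 300.0))
```

With these assumed inputs the estimate is about 720 ms, and no speedup of a hotspot that covers half the time can push the result below 600 ms. That ceiling is what makes the prediction a useful go/no-go signal before any code is written.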

The kitchen metaphor

Performance is like cleaning a kitchen, not painting a room. Painting once is fine. A kitchen cleaned once gets dirty as cooking happens; you clean continuously.

The seven pieces in this chapter supply the tools: profiler, hot-paths classifier, GC fixer, N+1 detector, batcher, bundle analyzer. None alone keeps the kitchen clean; the loop does.

Why this works

Teams without the loop end up with “why is the site slow now?” meetings every six months, each consuming 5 to 20 engineer-days. Teams with the loop have steady metrics year over year. The difference in total engineer-time is small — the discipline just frontloads the investment into CI gates and observability rather than deferring it to incident response.
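
A CI gate of the kind described here can be very small. The following is a minimal sketch, assuming latency samples in milliseconds have already been collected by a load test; the 250 ms budget and the function names are hypothetical, and the nearest-rank percentile is one of several reasonable definitions.

```python
import sys


def p99(samples_ms: list[float]) -> float:
    """Nearest-rank 99th percentile of the samples."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    # Integer ceiling of 0.99 * n, avoiding floating-point rounding.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def gate(samples_ms: list[float], budget_ms: float) -> int:
    """Return a process exit code: 0 if p99 is within budget, 1 otherwise."""
    observed = p99(samples_ms)
    if observed > budget_ms:
        print(f"FAIL: p99 {observed:.1f} ms exceeds budget {budget_ms:.1f} ms")
        return 1
    print(f"OK: p99 {observed:.1f} ms within budget {budget_ms:.1f} ms")
    return 0


if __name__ == "__main__":
    # Assumed data: 98 fast requests and 2 slow ones, checked against a 250 ms budget.
    samples = [100.0] * 98 + [500.0, 500.0]
    sys.exit(gate(samples, budget_ms=250.0))
```

Returning a non-zero exit code is what lets a CI system fail the pull request, so the regression is caught at review time rather than in an incident.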

Bea and Sven’s quarter

Bea joins a team where the service was fast a year ago. Now p99 is 1.2 s, up from 200 ms. Sven walks her through the loop: profile shows GC pressure at 18%, an N+1 in /orders adds 50 queries per request, /dashboard bundle grew 800 KB over six months. No single crisis — three separate slow accumulations.

They run the loop on each bottleneck one at a time: logger allocation fix (week 1), query deduplication (weeks 2-3), bundle code-split (week 4). After a month, p99 is 280 ms. CI gates keep the work alive through the next quarter of feature shipping.
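
The N+1 pattern in the story can be illustrated with a small sketch. The schema, table names, and functions below are hypothetical, using an in-memory SQLite database; the story does not say how the team fixed /orders, so the single JOIN shown here is one common remedy, not necessarily theirs.

```python
import sqlite3


def make_db(n_orders: int) -> sqlite3.Connection:
    """Build an in-memory database with one customer per order (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER)")
    for i in range(1, n_orders + 1):
        conn.execute("INSERT INTO customers VALUES (?, ?)", (i, f"customer-{i}"))
        conn.execute("INSERT INTO orders VALUES (?, ?)", (i, i))
    return conn


def names_n_plus_one(conn: sqlite3.Connection) -> tuple[dict[int, str], int]:
    """One query for the orders, then one more query per order."""
    orders = conn.execute("SELECT id, customer_id FROM orders ORDER BY id").fetchall()
    queries = 1
    result = {}
    for order_id, customer_id in orders:
        row = conn.execute(
            "SELECT name FROM customers WHERE id = ?", (customer_id,)
        ).fetchone()
        queries += 1
        result[order_id] = row[0]
    return result, queries


def names_batched(conn: sqlite3.Connection) -> tuple[dict[int, str], int]:
    """The same data fetched with a single JOIN."""
    rows = conn.execute(
        "SELECT o.id, c.name FROM orders o "
        "JOIN customers c ON c.id = o.customer_id ORDER BY o.id"
    ).fetchall()
    return dict(rows), 1


if __name__ == "__main__":
    conn = make_db(50)
    slow, slow_queries = names_n_plus_one(conn)
    fast, fast_queries = names_batched(conn)
    print(slow == fast, slow_queries, fast_queries)
```

Both functions return the same mapping, but with 50 orders the first issues 51 queries and the second issues one. Counting queries per request in this way is also a natural metric to put behind a CI gate.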

Quiz

A team applied a performance fix and shipped. Six months later, performance is worse than before the fix. Most likely cause?

Order the steps

Order the eight steps of the performance loop a senior engineer runs every time:

  1. Notice the symptom: SLO burn, RUM regression, profile alert
  2. Open the profile and identify the hot path with concrete numbers
  3. Classify the hotspot: CPU, allocation, cache, lock, I/O, syscall, JIT, bundle
  4. Predict headline metric impact using Amdahl
  5. Apply only the predicted change; no scope creep
  6. Re-profile under same load; verify both local frame and headline metric improved
  7. Add a CI gate or alert so this regression cannot return invisibly
  8. Document and move to the next bottleneck

Complete the analogy

Fill in the blank: performance is the _______ of the codebase — measured continuously, enforced at every commit, owned by every engineer.

Quiz

What does it mean to treat performance as a 'loop' rather than a 'project'?

Recall before you leave

  1. Why does a one-time performance fix have a half-life of three to six months?
  2. What is the role of the 'enforce' step, and why is it the most important of the eight?
  3. In Bea and Sven's scenario, three separate bottlenecks accumulated over six months. What prevented the team from noticing each one as it appeared?

Recap

Performance regresses by default. Every new feature, dependency, and deploy adds bytes, queries, or allocations without anyone noticing. A one-time optimisation has a half-life of three to six months before accumulated changes undo the gains. The performance loop — observe, profile, classify, predict, fix, verify, enforce, repeat — converts the one-time fix into a durable property. The critical step is enforcement: CI gates that fail any PR reintroducing the same regression class. Teams without the loop reach a performance crisis every six to eighteen months and rebuild from scratch; teams with it maintain steady metrics year over year at a cost of five to ten percent of engineering time, versus twenty to forty percent in crisis mode.
