Culture, economics, and org-scale performance

Crux: Error budgets quantify the tradeoff between performance and feature velocity. Doubling throughput roughly halves the AWS bill for that workload. Cultural mechanics (PR criteria, EM OKRs, blameless retros) compound indefinitely.

A VP of Engineering inherits an org with p99 at 800 ms and LCP at 3.5 s. Every quarter, there are two or three SLO-burning incidents, each consuming 20 to 40 engineer-days. She has 6 months and $500k. The question is not how to fix the current incidents — it is how to make performance a property of the org that outlasts her tenure.

Error budgets: the operational tradeoff

Google’s SRE book formalised the tradeoff between reliability (performance included) and feature velocity via error budgets.

An SLO defines the target: “99.9% of /checkout requests under 200 ms over 30 days.” The error budget is the allowed shortfall: 0.1% of requests, which expressed as time is about 43 minutes per 30-day window (0.001 × 43,200 minutes). When the budget is healthy, the team can ship features faster because more risk is tolerated. When the budget is exhausted, the team must focus on performance and reliability until it recovers.
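The 43-minute figure is plain arithmetic, and a minimal sketch makes it easy to recompute for other targets. The function names `error_budget_minutes` and `allowed_bad_requests` are hypothetical, and the 30-day window is the one assumed in the example SLO above.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Time-based error budget, in minutes, for an SLO target such as 0.999."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    window_minutes = window_days * 24 * 60  # 30 days -> 43,200 minutes
    return (1.0 - slo_target) * window_minutes


def allowed_bad_requests(slo_target: float, total_requests: int) -> int:
    """Request-based budget: how many requests may miss the latency target."""
    # round() removes floating-point noise before converting to an int
    return int(round((1.0 - slo_target) * total_requests, 6))


if __name__ == "__main__":
    print(round(error_budget_minutes(0.999), 1))    # 43.2
    print(allowed_bad_requests(0.999, 10_000_000))  # 10000
```

For a request-based SLO like the /checkout example, the request count is the more direct measure; the minutes figure is the equivalent when the budget is tracked as time.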

This converts the “should we optimise or ship features?” argument into a quantitative tradeoff. Every release ships with a predicted error budget impact. Releases that would burn more than a set percentage of the remaining budget require explicit risk acceptance.
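One way to picture the release rule is as a small gate function. This is a sketch under stated assumptions, not a description of any particular team's tooling: the names `BudgetState` and `release_decision` are hypothetical, and the 25% threshold is an illustrative policy value that each team would set for itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetState:
    total_minutes: float     # full error budget for the window
    consumed_minutes: float  # budget already burned in this window

    @property
    def remaining_minutes(self) -> float:
        return max(self.total_minutes - self.consumed_minutes, 0.0)


def release_decision(
    state: BudgetState,
    predicted_burn_minutes: float,
    max_share_of_remaining: float = 0.25,  # assumed policy threshold
) -> str:
    """Classify a release by the share of remaining budget it is predicted to burn."""
    if state.remaining_minutes <= 0.0:
        return "freeze"  # budget exhausted: performance and reliability work only
    share = predicted_burn_minutes / state.remaining_minutes
    if share > max_share_of_remaining:
        return "needs_risk_acceptance"
    return "ship"
```

With 33.2 minutes remaining, a release predicted to burn 5 minutes ships, while one predicted to burn 10 minutes (about 30% of what is left) needs explicit sign-off.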

The budget ladder: Service-level SLOs sit at the top. Route-level budgets (per-page bundle size, per-endpoint query count, per-service allocation rate) sit in the middle. Feature-level budgets sit at the bottom. Every PR is accountable to the budget closest to it. When per-route and per-feature budgets are met, the headline SLO is usually met too. When sub-budgets drift, the headline SLO soon follows.
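The ladder can be sketched as data plus one check that reports violations starting from the rung closest to the PR. The `Budget` type, the `violated_budgets` function, and the metric names (`p99_ms`, `bundle_kb`, `queries_per_request`) with their limits are all hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass

# Lower number = closer to the PR, so it is reported first.
LEVEL_ORDER = {"feature": 0, "route": 1, "service": 2}


@dataclass(frozen=True)
class Budget:
    level: str    # "feature", "route", or "service"
    metric: str   # name of the measured quantity
    limit: float  # maximum allowed value


def violated_budgets(measured: dict, budgets: list) -> list:
    """Return exceeded budgets, ordered from the closest rung to the furthest."""
    hits = [
        b for b in budgets
        if b.metric in measured and measured[b.metric] > b.limit
    ]
    return sorted(hits, key=lambda b: LEVEL_ORDER[b.level])


BUDGETS = [
    Budget("service", "p99_ms", 200),
    Budget("route", "bundle_kb", 170),
    Budget("feature", "queries_per_request", 5),
]
```

In this sketch a PR that pushes `queries_per_request` to 8 and `bundle_kb` to 190 fails at the feature and route rungs while the service-level p99 still looks healthy, which is the early warning the ladder is meant to provide.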

The economics of performance

Performance is a cost lever, not just a user-experience lever.

Infrastructure cost: a service with 2x better throughput needs roughly half the compute for the same load, and AWS bills scale with vCPU, memory, and bandwidth. Concrete examples:

  • Discord’s structured logger rewrite cut per-request allocations by 90%, dropping Go GC overhead from 20% to under 2%. Infrastructure cost for the chat service fell 40%.
  • Shopify’s storefront LCP optimisation (bundle audit + lazy-load) brought mobile LCP down from 4.5 s to 1.9 s. Bounce rate dropped 12%, a change attributed directly to page speed.
  • Stripe’s server-side profile-first programme returns an estimated $5 to $10 in saved infrastructure cost for every engineer-hour invested.
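The claim that doubling throughput halves the infrastructure for the same load can be checked with a small capacity model. This is a simplified sketch: the function names, the 30% headroom, and the load and price figures are hypothetical, and it models only instance count, not bandwidth or storage.

```python
def instances_needed(peak_rps: int, rps_per_instance: int, headroom_pct: int = 30) -> int:
    """Instances required to serve peak load plus headroom (integer ceiling)."""
    if rps_per_instance <= 0:
        raise ValueError("rps_per_instance must be positive")
    required = peak_rps * (100 + headroom_pct)  # load with headroom, scaled by 100
    capacity = rps_per_instance * 100           # per-instance capacity, same scale
    return -(-required // capacity)             # ceiling division on integers


def monthly_cost(peak_rps: int, rps_per_instance: int,
                 cost_per_instance: int, headroom_pct: int = 30) -> int:
    """Monthly compute cost for the fleet, in the same currency as cost_per_instance."""
    return instances_needed(peak_rps, rps_per_instance, headroom_pct) * cost_per_instance


if __name__ == "__main__":
    before = monthly_cost(10_000, 500, 150)    # 26 instances
    after = monthly_cost(10_000, 1_000, 150)   # 13 instances
    print(before, after)                       # 3900 1950
```

Because instance counts round up, small fleets will not always land on exactly half; the saving approaches 50% as the fleet grows.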

Engineering velocity: teams with mature performance discipline spend 5 to 10% of engineering time on performance as steady-state maintenance. Teams without it spend 20 to 40% in crisis mode — mostly reactive. The difference is 15 to 30 percentage points of engineering capacity, permanently freed for product work.

Recruiting and retention: fast software is a competitive differentiator. Engineers who join teams known for performance discipline stay longer and produce more. The measurement is indirect but the correlation is strong across multiple company studies.

Investment | Return | Payback window
Observability stack (~$500/mo OSS) | MTTR cut 50–80%, incidents caught earlier | First prevented incident
4 CI gates (week of eng time) | 90% of known regressions prevented at PR time | First quarter
2x throughput improvement (1–2 months eng) | 50% infra cost reduction for that workload | 3–6 months of saved cloud bill
Performance culture (ongoing) | 5–10% eng time on perf vs 20–40% crisis mode | 12–24 months
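The payback window for the throughput row is the engineering cost divided by the monthly saving. The sketch below uses hypothetical inputs (1.5 engineer-months at an assumed $15,000 per engineer-month and an assumed $10,000 monthly cloud bill for the workload); only the 50% saving comes from the table.

```python
def payback_months(investment_cost: float, monthly_bill: float,
                   saving_fraction: float) -> float:
    """Months of saved cloud bill needed to recover a one-off investment."""
    monthly_saving = monthly_bill * saving_fraction
    if monthly_saving <= 0:
        raise ValueError("the investment never pays back without a positive saving")
    return investment_cost / monthly_saving


if __name__ == "__main__":
    engineering_cost = 1.5 * 15_000  # assumed: 1.5 engineer-months of work
    print(payback_months(engineering_cost, 10_000, 0.5))  # 4.5
```

With these assumed figures the work pays back in 4.5 months, inside the 3 to 6 month range in the table; a smaller bill or a more expensive project stretches the window proportionally.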

Toil reduction: converting firefighting to infrastructure

SRE’s toil framework asks: what manual work is repeated, automatable, and grows with scale? Performance firefighting is classic toil: get paged, triage manually, fix, repeat in three months. Observability, CI gates, and runbooks convert that loop into infrastructure.

A healthy team holds toil under 50% of engineering time per SRE’s guidance. Many teams sit at 70 to 80% pre-discipline and 20 to 30% post. The investment in observability, gates, and runbooks pays back not just in fewer incidents but in reclaimed engineer-time that is permanently redirected to product work.

Measure it: track the number of performance incidents per quarter and the average engineer-hours per incident. In the first quarter of a well-run programme, both drop 50 to 70% from the baseline. After 12 months, they stabilise near zero for known failure classes.
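The two tracked numbers can be computed from a plain incident log. This is a minimal sketch: the `Incident` record, the function names, and the sample figures are hypothetical, and a real programme would pull the same fields from its incident tracker.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    quarter: str           # e.g. "2024-Q1"
    engineer_hours: float  # total hours spent on triage and fix


def quarterly_summary(incidents: list) -> dict:
    """Incident count and average engineer-hours per incident, keyed by quarter."""
    hours_by_quarter = defaultdict(list)
    for incident in incidents:
        hours_by_quarter[incident.quarter].append(incident.engineer_hours)
    return {
        quarter: {"count": len(hours), "avg_hours": sum(hours) / len(hours)}
        for quarter, hours in hours_by_quarter.items()
    }


def pct_drop(baseline: float, current: float) -> float:
    """Percentage reduction from a non-zero baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - current) / baseline * 100.0


LOG = [
    Incident("2024-Q1", 240), Incident("2024-Q1", 160), Incident("2024-Q1", 320),
    Incident("2024-Q2", 200),
]
```

In this sample log the incident count falls from three to one between quarters, a drop of about 67%, while average hours per incident falls from 240 to 200.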

Distributed ownership: avoiding the bottleneck

The anti-pattern: centralise all performance work into one “performance team.” This team becomes a bottleneck — every product team waits for it, every regression is someone else’s problem until the crisis arrives.

The pattern that scales: each role owns its domain.

  • Frontend engineers: pieces 02 + 07 (hot paths + bundle budgets). Own per-route CWV.
  • Backend engineers: pieces 02 + 04 + 05 + 06 (hot paths + GC + N+1 + batching). Own per-endpoint latency and query budgets.
  • SRE / DevOps: piece 01 (profile-first infrastructure, continuous profiling). Build and maintain CI gates.
  • Platform engineers: piece 03 (cache vs big-O — fundamental patterns). Maintain shared observability stack.

The platform team builds the infrastructure that lets every team own their performance. Without distribution, performance degrades silently as product teams ship features without accountability. With distribution, every PR is checked against budgets, every team retros on regressions, and the platform team accelerates rather than blocks.

Cultural mechanics that make it stick

Three practices build durable culture:

1. Performance in every PR review, not a separate phase. The PR template includes a checklist item: “performance impact considered.” Code reviewers are trained to spot the signals covered in the seven pieces (lazy loading skipped, N+1 introduced, unnecessary allocations) and ask about them. Quarterly engineering surveys ask “did your reviewer flag performance?”, which measures whether the culture is sticking.

2. Engineering manager OKRs include performance. EM OKRs include “maintain or improve route SLOs” alongside delivery metrics. Senior engineer promotion criteria include “demonstrated performance improvements to systems they own.” Without this, engineers see performance work as career-distracting when it competes with feature velocity. With it, performance work is career-supporting.

3. Blameless retros after every performance incident, always ending with a new gate. Retro structure: what was the symptom, what was the root cause, what gate would have caught it earlier, who owns adding the gate. The accumulated CI gates and runbook entries become the team’s institutional memory — new engineers inherit it on day one instead of re-discovering the same failure modes.
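The four-question retro structure in the third practice lends itself to a record that cannot be closed until a gate and an owner are named. The sketch below is one possible shape; the `Retro` type, its field names, and the example strings are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Retro:
    symptom: str     # what was observed
    root_cause: str  # why it happened
    new_gate: str    # what check would have caught it earlier
    gate_owner: str  # who adds the gate

    def missing_fields(self) -> list:
        """Names of fields left blank, in declaration order."""
        return [name for name, value in vars(self).items() if not value.strip()]

    def is_closable(self) -> bool:
        """A retro is only done once every question has an answer."""
        return not self.missing_fields()


draft = Retro(
    symptom="p99 spike on /checkout",
    root_cause="N+1 query in cart loader",
    new_gate="",
    gate_owner="",
)
```

Here `draft.missing_fields()` returns `["new_gate", "gate_owner"]`, so the retro stays open until someone commits to a gate.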

Why this works

The hardest part of performance culture is making it self-sustaining after the initial push. The key mechanism is making performance “table stakes” — a property assumed in every PR, not argued about in every incident. Teams that reach this point typically have: (a) visible performance metrics on the engineering all-hands dashboard, (b) explicit gate failures in CI with clear owners, (c) a history of engineers being recognised for performance contributions in reviews. Without all three, performance culture decays within 12 to 18 months of the initial programme. With all three, it compounds.

Culture and economics numbers
Eng-time on perf (crisis mode, no discipline): 20–40%
Eng-time on perf (steady state, with discipline): 5–10%
Infrastructure cost reduction from 2x throughput improvement: ~50%
Stripe infrastructure ROI per eng-hour on profiling: $5–10 saved
Error budget (99.9% SLO over 30 days): ~43 min/month allowed
Toil ratio pre-discipline (typical): 70–80%
Toil ratio post-discipline (target): under 30%
Quiz

A team's SLO is 99.9% (error budget: 0.1% shortfall allowed). After a deploy, p99 regressions consume 80% of the month's budget in 4 days. What is the senior response?

Quiz

Why does centralising all performance work in a dedicated 'performance team' fail to scale?

Recall before you leave
  1. How do error budgets convert the 'optimise vs ship features' argument into a quantitative tradeoff?
  2. Describe the three cultural practices that make performance discipline self-sustaining, and why each is necessary.
  3. What does 'distributed ownership' of performance mean, and why does it scale better than a centralised performance team?
Recap

Error budgets, introduced in Google’s SRE book, quantify the tradeoff between reliability and feature velocity. An SLO of 99.9% over 30 days gives 43 minutes of allowed degradation per month; when the budget is healthy, the team can ship faster; when it is exhausted, performance work takes priority. The economics of performance discipline are compelling: 2x throughput halves the AWS bill for that workload, continuous profiling returns $5–10 in infrastructure savings per engineer-hour at Stripe’s scale, and teams with mature discipline spend 5–10% of time on performance versus 20–40% in crisis mode. Cultural mechanics — PR review criteria, EM OKRs, blameless retros ending with new gates — are the highest-leverage investment because they compound indefinitely and survive team turnover. Distributed ownership prevents the centralised-team bottleneck: platform builds the infrastructure, every product team owns its layer, and performance becomes table stakes rather than a periodic crisis.


Trademarks belong to their respective owners. Editorial reference only.