Incident to enforcement: SLO burn to verified fix in 35 minutes

A complete worked example: a /checkout p99 SLO burn, diagnosed via continuous profiling and distributed traces, fixed in 35 minutes, and armoured with enforcement gates so the class of regression cannot return.

An SLO burn alert fires at 10:47 am. Without continuous profiling and distributed traces, the on-call spends the afternoon guessing. With them, she has a root cause and a fix deployed by 11:22 am. The 35-minute window is not luck — it is the previous investment in observability paying off under pressure.

The incident: /checkout p99 burns the SLO

At 10:47 am, an alert fires: “checkout SLO p99 burn rate 14x — 12 hours of error budget consumed in 6 hours.”

The on-call engineer opens the unified dashboard. No deploys to the checkout service in the last hour. Traffic at baseline. This is not a traffic spike, and checkout itself has not changed, so something changed in a service that checkout depends on.

Step 1: Observe — 60 seconds

Prometheus: /checkout p99 jumped from 240 ms to 1.4 s, starting shortly after 10:30 am. RPS flat. Error rate flat. Pure latency degradation.

RUM: client-side LCP unaffected. This is a server-side problem, not a bundle regression.

Step 2: Profile — 90 seconds

Pyroscope filtered to /checkout over the last two hours:

runtime.scanobject:   22%   (baseline: 4%)
runtime.mallocgc:      9%   (baseline: 2%)
encoding/json.Marshal: 24%  (baseline: 4%)
handlers.Checkout:      6%
pgx.Query:             12%

GC frames jumped from 6% to 31%. JSON Marshal jumped from 4% to 24%. Something is producing much more garbage and serialising much larger payloads than before.
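The comparison above is simple arithmetic over the flame-graph shares. A minimal sketch in Go, assuming the shares have already been read off the profile by hand (the `frame` type and `gcShare` helper are hypothetical and not part of any Pyroscope API; only frames with a stated baseline are included):

```go
package main

import (
	"fmt"
	"strings"
)

// frame holds one function's share of CPU samples, in percent.
type frame struct {
	name     string
	baseline float64
	current  float64
}

// gcShare sums the baseline and current shares of Go runtime frames.
// Treating every "runtime." frame as GC/allocator work is a simplification
// that happens to hold for the two runtime frames in this profile.
func gcShare(frames []frame) (baseline, current float64) {
	for _, f := range frames {
		if strings.HasPrefix(f.name, "runtime.") {
			baseline += f.baseline
			current += f.current
		}
	}
	return baseline, current
}

func main() {
	frames := []frame{
		{"runtime.scanobject", 4, 22},
		{"runtime.mallocgc", 2, 9},
		{"encoding/json.Marshal", 4, 24},
	}
	b, c := gcShare(frames)
	fmt.Printf("GC frames: %.0f%% -> %.0f%%\n", b, c)
}
```

Summing by family rather than reading individual frames is what makes the 6% to 31% jump visible at a glance.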

Step 3: Classify — 60 seconds

Allocation-bound (family: GC / piece 04), with the serialisation overhead pointing at an unusually large response being produced, not just processed. This is the N+1 family pattern (piece 05) applied to inter-service payloads: something upstream started returning bulk data where a summary was sufficient.

Amdahl on the combined GC + Marshal frames (55%): even if that work were eliminated entirely, the best-case speedup is 1 / (1 - 0.55) ≈ 2.2x. That would bring p99 from 1.4 s to about 630 ms, still above the 200 ms SLO. The root cause is not just “GC is hot”; it is whatever is causing the GC and Marshal pressure.
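The Amdahl bound can be checked in a few lines. This is a sketch under one loud assumption: that request latency scales with the CPU share shown in the profile, which is only approximately true. The function names are illustrative.

```go
package main

import "fmt"

// maxSpeedup returns the Amdahl upper bound on overall speedup when a
// fraction p (0 <= p < 1) of the work is eliminated entirely.
func maxSpeedup(p float64) float64 {
	return 1 / (1 - p)
}

// bestCaseLatencyMs returns the latency left over if fraction p of the
// current latency disappeared. Assumes latency is proportional to CPU share.
func bestCaseLatencyMs(currentMs, p float64) float64 {
	return currentMs * (1 - p)
}

func main() {
	const hotFraction = 0.31 + 0.24 // GC frames + json.Marshal from the profile
	fmt.Printf("max speedup: %.2fx\n", maxSpeedup(hotFraction))
	fmt.Printf("best-case p99: %.0f ms\n", bestCaseLatencyMs(1400, hotFraction))
}
```

Using the unrounded speedup, the best case is about 630 ms (636 ms if the factor is first rounded to 2.2x). Either way the conclusion is the same: tuning the hot frames alone cannot restore the baseline.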

Step 4: Trace correlation — 5 minutes

Open Tempo, filter to slow /checkout traces in the last two hours. The waterfall shows:

[span] handle_request          1820ms
  [span] auth_check               5ms
  [span] order_query             340ms   (baseline: 12ms)
  [span] inventory_check         850ms   (baseline: 80ms)
  [span] payment_call            280ms   (baseline: 90ms)
  [span] serialise_response      340ms   (baseline: 8ms)

Every downstream span and the serialisation step degraded at the same time, by factors ranging from about 3x (payment_call) to over 40x (serialise_response). This is the pattern of an upstream service returning a response that is much larger than expected: every consumer step inflates along with it.
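Putting a number on each span's slowdown makes the waterfall easier to compare against baseline. A small sketch, using only the spans for which the trace shows a baseline (the `span` type and `slowdown` helper are hypothetical and not a Tempo API):

```go
package main

import "fmt"

// span holds a span's baseline and current duration in milliseconds.
type span struct {
	name       string
	baselineMs float64
	currentMs  float64
}

// slowdown returns how many times slower the span is than its baseline.
func slowdown(s span) float64 {
	return s.currentMs / s.baselineMs
}

func main() {
	spans := []span{
		{"order_query", 12, 340},
		{"inventory_check", 80, 850},
		{"payment_call", 90, 280},
		{"serialise_response", 8, 340},
	}
	for _, s := range spans {
		fmt.Printf("%-20s %5.1fx\n", s.name, slowdown(s))
	}
}
```

The factors are not identical, but every span is several times over its baseline at once, which is the signal that matters: one slow dependency would inflate one span, not all of them.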

Check deploy history: inventory-service rolled out at 10:31 am, 16 minutes before the alert fired. Diff: added include_full_sku_details: true to the /inventory response. Previously it returned SKU IDs only; now it returns full SKU objects. Response payload: 8 KB → 85 KB per call.

The /checkout service receives 85 KB, deserialises it (more allocations), uses only about 0.1% of the data, then serialises its own response including the inflated inventory data. All three cost centres (GC, Marshal, downstream latency) trace to a single cause.

Signal	Observation	Implication
Pyroscope GC frames	6% → 31%	Much more garbage produced per request
Pyroscope json.Marshal	4% → 24%	Much larger payload being serialised
Tempo: inventory_check span	80ms → 850ms	Upstream service much slower or returning more
Deploy log: inventory-service	10:31 am deploy	Payload grew 8 KB → 85 KB per call
Combined	All degraded spans correlate	Single root cause: inventory-service deploy

Step 5: Fix — 15 minutes

Two parallel actions:

Roll back the inventory-service deploy via the deploy console.
Add a defensive check in /checkout: reject any inventory response over 64 KB with an error before processing.

Both ship within 15 minutes.
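The defensive check might look like the following sketch. It assumes the inventory response arrives as an `io.Reader` (for example an HTTP response body); `readLimited`, `errTooLarge`, and `bodyOfSize` are hypothetical names, and the real handler is not shown in this lesson.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
)

// maxInventoryBytes is the 64 KB ceiling chosen in the incident fix.
const maxInventoryBytes = 64 << 10

var errTooLarge = errors.New("inventory response exceeds size limit")

// readLimited reads the body but refuses anything longer than limit bytes.
// It reads at most limit+1 bytes, so an oversized body is detected without
// buffering all of it in memory.
func readLimited(r io.Reader, limit int64) ([]byte, error) {
	body, err := io.ReadAll(io.LimitReader(r, limit+1))
	if err != nil {
		return nil, err
	}
	if int64(len(body)) > limit {
		return nil, errTooLarge
	}
	return body, nil
}

// bodyOfSize builds a dummy body of n bytes for the demonstration.
func bodyOfSize(n int) io.Reader {
	return strings.NewReader(strings.Repeat("x", n))
}

func main() {
	_, err := readLimited(bodyOfSize(8<<10), maxInventoryBytes)
	fmt.Println("8 KB body:", err)
	_, err = readLimited(bodyOfSize(85<<10), maxInventoryBytes)
	fmt.Println("85 KB body:", err)
}
```

Failing fast before deserialisation is the point of the design: the oversized payload never reaches the JSON decoder, so it cannot generate the garbage that drove the GC frames up.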

Step 6: Verify — 10 minutes

Pyroscope after rollback: scanobject returns to 4%, mallocgc to 2%, json.Marshal to 4%. Profile baseline restored.

Prometheus: /checkout p99 at 235 ms within 5 minutes of the rollback completing. SLO burn rate drops to 0.1x.

Both the local profile and the headline metric confirm the fix.
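The verification rule, that the fix only counts once both the profile and the headline metric are back at baseline, can be written as a single check. A sketch with a hypothetical 10% tolerance; the lesson does not state a tolerance, and the `signal` type is illustrative:

```go
package main

import "fmt"

// signal pairs a baseline value with the value observed after the fix.
type signal struct {
	name     string
	baseline float64
	current  float64
}

// restored reports whether current is at or below baseline plus tol,
// where tol is a fraction (0.10 means 10%).
func restored(s signal, tol float64) bool {
	return s.current <= s.baseline*(1+tol)
}

// allRestored is true only if every signal is back within tolerance.
func allRestored(signals []signal, tol float64) bool {
	for _, s := range signals {
		if !restored(s, tol) {
			return false
		}
	}
	return true
}

func main() {
	after := []signal{
		{"runtime.scanobject %", 4, 4},
		{"runtime.mallocgc %", 2, 2},
		{"encoding/json.Marshal %", 4, 4},
		{"/checkout p99 ms", 240, 235},
	}
	fmt.Println("fix verified:", allRestored(after, 0.10))
}
```

Requiring every signal to pass guards against a fix that improves the profile while the user-facing latency stays degraded, or the reverse.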

Step 7: Enforce — the sprint after

Three enforcement actions, all taken in the following sprint:

A. PR gate: inter-service response size. Any PR to any service that changes a response schema must pass a contract test asserting the body size does not exceed 2x the current median. PRs that grow a response by more than 2x require explicit SRE review.

B. Production alert: per-endpoint payload size p99. Add a metric tracking p99 response body bytes per downstream call. Alert if it grows more than 50% week-over-week. The next time this class of regression appears, the alert routes to the responsible team’s on-call before the SLO burns.

C. Runbook entry. “p99 spike on /checkout with all spans degraded proportionally → check upstream deploys in the last 2 hours for response-schema changes. Check inventory-service specifically.” The next on-call resolves this in under 5 minutes instead of starting from scratch.
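The threshold logic behind actions A and B can be sketched in plain Go. All names and the sample sizes are hypothetical. In practice the week-over-week comparison would more likely live in the metrics system's alerting rules than in application code, so treat this only as an illustration of the two thresholds.

```go
package main

import (
	"fmt"
	"sort"
)

// median returns the median of sizes without modifying the input.
func median(sizes []int) float64 {
	s := append([]int(nil), sizes...)
	sort.Ints(s)
	n := len(s)
	if n == 0 {
		return 0
	}
	if n%2 == 1 {
		return float64(s[n/2])
	}
	return float64(s[n/2-1]+s[n/2]) / 2
}

// needsReview implements action A: flag a response body larger than
// 2x the current median. With no history, any non-empty body is flagged.
func needsReview(candidateBytes int, recentSizes []int) bool {
	return float64(candidateBytes) > 2*median(recentSizes)
}

// p99 returns the nearest-rank 99th percentile of sizes (0 if empty).
func p99(sizes []int) int {
	if len(sizes) == 0 {
		return 0
	}
	s := append([]int(nil), sizes...)
	sort.Ints(s)
	rank := (99*len(s) + 99) / 100 // ceil(0.99 * n) in integer arithmetic
	return s[rank-1]
}

// shouldAlert implements action B: alert on more than 50% growth in the
// p99 body size compared with the previous week.
func shouldAlert(lastWeekP99, thisWeekP99 int) bool {
	return float64(thisWeekP99) > 1.5*float64(lastWeekP99)
}

func main() {
	recent := []int{7900, 8000, 8100, 8200, 8300} // hypothetical body sizes in bytes
	fmt.Println("85 KB body needs review:", needsReview(85000, recent))
	fmt.Println("12 KB body needs review:", needsReview(12000, recent))
	fmt.Println("alert on 8 KB -> 85 KB:", shouldAlert(8000, 85000))
}
```

Both checks are relative to recent history rather than a fixed byte limit, so each endpoint keeps its own normal size and is only flagged when it departs from it.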

Total elapsed from page to verified fix: 35 minutes. Total elapsed to enforcement: one sprint (1 week). Engineer-hours across incident response, postmortem, and enforcement: 6 hours. Without continuous profiling and traces: an estimated 4 to 8 hours for the incident alone.

The observability investment is the difference between a 35-minute resolve-and-enforce loop and an afternoon of guessing that recurs.

Why this works

The enforcement step is what separates discipline from firefighting. The incident consumed 35 minutes. Without the gate, the exact same class of regression — a different service’s response growing unexpectedly — will consume another 35 minutes in 3 to 6 months. With the gate, every future PR that changes a response schema is checked. The one-time cost of adding the gate is under 4 engineer-hours. The recurring cost it prevents is 35+ minutes per incident, potentially dozens of times.

Incident loop numbers

Time from page to verified fix (with full stack): 35 minutes
Time from page to root cause (without continuous profiling): 4–8 hours
Inventory-service response payload growth: 8 KB → 85 KB
p99 degradation on /checkout: 240 ms → 1.4 s (slowest sampled trace: 1820 ms)
p99 after rollback (within 5 min): 235ms
Engineer-hours to add enforcement gates: ~4 hours
Average incidents prevented per enforcement gate: 3–8 per year

Quiz

In the /checkout incident, the profile showed GC and json.Marshal both spiking. A naive response would fix the allocation hotspot. Why is that wrong?

Quiz

Why does the enforcement step (adding CI gates and runbook entries) matter more than the fix itself?

Order the steps

Order the diagnostic steps in the /checkout incident from first observation to confirmed root cause:

1 SLO alert: p99 at 1.4s, RPS flat, RUM unaffected — server-side only
2 Profile shows GC frames 6% → 31%, json.Marshal 4% → 24%
3 Classify: allocation-bound + serialisation-bound; root cause must be upstream
4 Trace: every downstream span degraded proportionally — upstream bulk payload
5 Deploy log: inventory-service shipped at 10:31 am with full SKU details in response
6 Rollback inventory-service; profile and p99 return to baseline within 5 minutes

Page to verified fix: 35 minutes. The enforcement gate is what stops the same class of upstream-payload regression from paging the next on-call.

Recall before you leave

01 Walk through how each of the five observability signals contributed to resolving the /checkout incident.
02 The /checkout incident combined family 04 (GC) and family 05 (N+1/bulk payload). How does cross-family identification change the fix strategy?
03 What are the three enforcement actions taken after the /checkout incident, and what class of regression does each prevent?

Recap

The /checkout incident ran from page to verified fix in 35 minutes because the five-signal observability stack was in place. Metrics named the service and severity. RUM ruled out client-side causes. Profiles showed GC and serialisation pressure — two effects of one cause. Traces showed all downstream spans degrading proportionally, pointing at an upstream service. Deploy history confirmed the inventory-service payload grew from 8 KB to 85 KB 16 minutes before the SLO burn. The fix was a rollback plus a defensive size check. The enforcement step — PR gate, payload-size alert, runbook entry — prevents the entire class of upstream payload regressions from becoming incidents in the future. This is what distinguishes a discipline from firefighting: the incident retro ends with a gate, not just a fix. Now when you see all downstream spans degrade proportionally in a trace, your first question is “what upstream changed?” — and your first action after the fix is “what gate prevents this class from returning?”
