
Observability capstone: instrument a service and debug an incident

Capstone — instrument a multi-service system end-to-end with correlated logs, metrics, traces, and profiles, add an SLO burn-rate alert, then debug a real incident through the funnel.

Level: Senior · Time: 240 min

Reading about the funnel is not the same as walking a real burn alert down to a git commit. Stand up a small two- or three-service system, instrument it with all four signals correlated by trace-id, wire an SLO burn-rate alert, then inject a fault and resolve it through the funnel — with evidence at every layer.

Goal

Turn the whole chapter into one working system: a service emitting correlated logs, metrics, traces, and profiles through OTel; an SLO with a multi-window burn-rate alert; trace propagation across every hop; continuous profiling; and a documented incident walked from page to root cause through the funnel.
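As a rough illustration of what "correlated by trace-id" means for the log signal, the sketch below stamps every log record with the active trace-id using only the Python standard library. The names `current_trace_id`, `TraceIdFilter`, `build_logger`, and the `checkout` logger are hypothetical. In a real setup the trace-id would come from the OTel SDK's current span context rather than a hand-rolled context variable.

```python
import contextvars
import io
import json
import logging

# Hypothetical holder for the active trace-id. With the OTel SDK this value
# would be read from the current span context instead.
current_trace_id = contextvars.ContextVar("current_trace_id", default="")


class TraceIdFilter(logging.Filter):
    """Attach the active trace-id to every record so logs can be joined to traces."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line with the trace-id as a first-class field."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": record.trace_id,
        })


def build_logger(stream):
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.addFilter(TraceIdFilter())
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def handle_request(logger, trace_id):
    """Simulate one request: set the trace-id, log, then restore the context."""
    token = current_trace_id.set(trace_id)
    try:
        logger.info("payment authorised")
    finally:
        current_trace_id.reset(token)
```

Calling `handle_request(build_logger(io.StringIO()), some_trace_id)` writes a JSON line whose `trace_id` field matches the request, which is the property the evidence for the log layer needs to show.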

Project

Objective

Build (or take) a small multi-service system — at least a frontend caller plus two backend services, one calling the other — instrument it end-to-end with OpenTelemetry so all four signals join by trace-id, define one SLO with a multi-window burn-rate alert, then inject a realistic fault and resolve it through the SLO → RED → trace → profile → git blame funnel, proving each step with captured evidence.
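For the multi-hop part of the objective, the sketch below shows what each service has to do with the W3C `traceparent` header: keep the trace-id and flags, and mint a new span-id for the outgoing call. It is a hand-written illustration of the header format (`00-<trace-id>-<parent-id>-<flags>`), not the OTel propagator itself. The function names are hypothetical, and OTel's own propagators would normally do this for you.

```python
import re
import secrets

# version "00", 32-hex trace-id, 16-hex parent span-id, 2-hex flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """Start a new trace at the edge (for example, the frontend caller)."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(incoming):
    """Build the header for an outgoing call from the one that was received.

    The trace-id and flags are preserved; only the span-id changes.
    A missing or malformed header starts a fresh trace, which is exactly
    how orphaned spans appear when a hop drops the header.
    """
    match = TRACEPARENT_RE.match(incoming or "")
    if match is None or set(match.group(1)) == {"0"} or set(match.group(2)) == {"0"}:
        return new_traceparent()
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


def trace_id_of(header):
    """Return the trace-id portion of a traceparent header, or None if invalid."""
    match = TRACEPARENT_RE.match(header or "")
    return match.group(1) if match else None
```

A quick propagation check for the project is to compare `trace_id_of` at the frontend, the first backend, and the second backend. All three should match, and any hop that returns a different value is where the header is being lost.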

Requirements

Acceptance criteria

A single trace-id demonstrably joins all four signals for one chosen request — show the log line, the metric exemplar, the trace tree, and the profile, each carrying the same trace-id.
The multi-window burn-rate alert fires within minutes of the injected fault and clears after the fix; include the alert definition and screenshots/exports of fire and clear.
An incident write-up that walks SLO → RED → trace → profile → git blame, with one captured artefact per funnel layer, ending at the specific commit/function that caused the burn.
Trace propagation is verified across every hop (no orphaned spans, one connected trace tree), and tail sampling is shown to keep the failing/slow traces while dropping the baseline.
A blameless postmortem of about one page: timeline from T+0 to resolution, root cause stated as a system failure, and tracked action items.
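The burn-rate criterion above can be sketched as plain arithmetic. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target), and a multi-window alert pages only when both a long and a short window exceed the threshold. The 99.9% target and the 14.4 threshold (roughly 2% of a 30-day budget spent in one hour, paired with a 1 h long window and a 5 min short window) are commonly cited example values and an assumption here, not something this lesson prescribes. `Window` and `should_page` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Request counts observed over one lookback window."""
    total: int
    errors: int


def burn_rate(window, slo_target):
    """How many times faster than 'exactly on budget' the errors are arriving."""
    if window.total == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (window.errors / window.total) / error_budget


def should_page(long_window, short_window, slo_target=0.999, threshold=14.4):
    """Page only if BOTH windows burn faster than the threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening, so the alert clears soon after the fix.
    """
    return (
        burn_rate(long_window, slo_target) >= threshold
        and burn_rate(short_window, slo_target) >= threshold
    )
```

With a 2% error ratio in both windows the burn rate is about 20 and the alert fires. Once the short window returns to zero errors, `should_page` becomes false even though the long window still carries the earlier errors, which is the fire-then-clear behaviour the acceptance criterion asks you to capture.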

Senior stretch

Add cost discipline: enforce a metric cardinality budget (reject unbounded labels like user_id in CI), apply tiered log retention, and report the per-million-requests observability cost before and after.
Add a second fault class (an error spike rather than a latency spike) and show the same funnel localises it — proving the order is fault-agnostic.
Run a scheduled game day: have a teammate inject an unknown fault while you are on-call, time your MTTD and MTTR, and update the runbook from what slowed you down.
Add a PII-scrubbing collector processor and a pre-commit secret/PII scan; demonstrate that an accidentally logged email or db.statement is redacted before it leaves the node.
Compute the ROI for your system: estimate revenue/min, multiply by the MTTR delta the funnel bought versus a no-trace baseline, and write the one-paragraph CFO answer.
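For the cardinality-budget stretch item, one possible shape for the CI check is a small script that inspects declared metric labels and fails the build when it finds an unbounded one. This is a minimal sketch under stated assumptions: the denylist, the per-metric label budget, and the `metrics` dictionary layout are all hypothetical, and a real check would read the definitions from wherever your services declare their instruments.

```python
# Assumed denylist of labels whose value space grows without bound.
# Trace-ids belong in exemplars, not in metric labels.
FORBIDDEN_LABELS = {"user_id", "request_id", "trace_id", "email"}

# Hypothetical budget: maximum number of labels allowed on one metric.
MAX_LABELS_PER_METRIC = 6


def check_metric(name, labels):
    """Return a list of human-readable violations for one metric definition."""
    problems = []
    for label in labels:
        if label in FORBIDDEN_LABELS:
            problems.append(f"{name}: label '{label}' is unbounded")
    if len(labels) > MAX_LABELS_PER_METRIC:
        problems.append(
            f"{name}: {len(labels)} labels exceeds budget of {MAX_LABELS_PER_METRIC}"
        )
    return problems


def check_all(metrics):
    """Check every metric; `metrics` maps metric name to its list of label names."""
    problems = []
    for name in sorted(metrics):
        problems.extend(check_metric(name, metrics[name]))
    return problems


def exit_code(metrics):
    """Print violations and return a process exit code for the CI job."""
    problems = check_all(metrics)
    for problem in problems:
        print(problem)
    return 1 if problems else 0
```

A CI step could load the metric definitions and call `sys.exit(exit_code(metrics))`, so that a pull request adding `user_id` as a label fails before it reaches production.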

Recap

This capstone turns the chapter into one working system and rehearses the loop you will run in every real incident. Instrument all four signals and join them by trace-id, propagate the traceparent across every hop, and define an SLO with a multi-window burn-rate alert. Then walk SLO → RED → trace → profile → git blame to a specific commit, capturing evidence at each layer and closing with a blameless postmortem. Build it once on a toy system and the production version becomes muscle memory; the funnel stays the same even as the tools change.
