Observability
Observability capstone: multiple-choice synthesis
Six questions that force you to wire the whole track together — three pillars, OTel, RED/USE, SLO budgets, trace propagation, profiling. Each one is a decision you make mid-incident or mid-design-review, not a definition to recite.
Confirm you can compose the chapter into one system: the funnel that drives MTTR, the trace-id that joins four signals, the sampling and cardinality levers that hold the bill, and the arithmetic that justifies the spend.
An on-call is paged 'checkout SLO burn 14x'. RED shows p99 Duration up, Rate and Errors flat. What is the correct next move in the funnel, and why that one?
A request slows down. RED Duration shows it, the trace shows inventory.lookup owns 1.3s, the profile shows json.Marshal ate the CPU, and the logs hold the error detail. What single property makes all four signals answer one query about THIS request?
A team head-samples traces at 5% at the SDK to cut cost. During an incident they cannot find traces for the failing requests. What went wrong, and what is the fix?
The observability bill doubled in a week with no traffic change. One team added user_id as a metric label and started logging full request bodies at INFO. Which two distinct failure modes is this, and which signal does each hit?
Two teams share identical tooling. Team A detects an outage in 30s and resolves in 20min; Team B detects in 8min and resolves in 12min. For the same severity, which team's users suffer less total impact, and what does that say about where to invest?
A CFO asks why observability costs $150k/year. The org runs ~30 incidents/year, each now resolved ~25 min faster than the pre-instrumentation baseline, at roughly $5k/min of revenue exposure. What is the strongest answer?
The chapter composes into one system. The funnel (SLO burn → RED → trace → profile → git blame) drives diagnosis, and each step is forced by the previous one. The trace-id propagated by W3C traceparent is what makes four signals answer one per-request query. Cost discipline is three independent levers — cardinality limits, tail sampling, tiered retention — each hitting a different signal. MTTD usually outweighs MTTR for user impact because the detection window is fully unmitigated. And the spend is justified by arithmetic: avoided-outage cost from real MTTR deltas, typically 10–30x. Every answer here is a drill-down from those through-lines.