Three pillars: build a navigable observability surface

Hands-on project: instrument one service with all three signals, wire join keys and exemplars, then prove a sub-30-second metric-to-trace-to-log triage and a cardinality guardrail.

Level: Senior. Estimated time: 240 min.

Reading about the three pillars is not the same as pivoting from a metric spike to the one slow trace in under 30 seconds. Instrument a real service with all three signals, wire the join keys that make them compose, and prove the triage walk — and the guardrails — with evidence at every step.

Goal

Turn the unit’s mental model into a working observability surface: emit metrics, logs, and traces from one service; connect them with OpenTelemetry Semantic Conventions and exemplars; demonstrate a single-click cross-pillar triage; and defend the metrics tier against a cardinality bomb.
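To make "join keys" concrete, here is a minimal, stdlib-only sketch of the three records one request might produce. It does not call the OpenTelemetry SDK. The function names `emit_signals` and `join_keys_match` and the record shapes are hypothetical, chosen for illustration. The key names `service.name` and `http.route` and the metric name `http.server.request.duration` follow OpenTelemetry Semantic Conventions.

```python
# Hypothetical sketch: plain dicts stand in for what an SDK would export.
JOIN_KEYS = ("service.name", "http.route")


def emit_signals(service_name, route, trace_id, duration_s):
    """Build the metric point, log record, and span for one request."""
    shared = {"service.name": service_name, "http.route": route}
    metric_point = {
        "name": "http.server.request.duration",
        # Only bounded labels go here; trace_id is never a metric label.
        "labels": dict(shared),
        "value": duration_s,
        # The exemplar is the metric-to-trace link.
        "exemplar": {"trace_id": trace_id, "value": duration_s},
    }
    # The log carries trace_id as a field: the trace-to-log link.
    log_record = {**shared, "trace_id": trace_id, "message": "request handled"}
    span = {"trace_id": trace_id, "attributes": dict(shared)}
    return metric_point, log_record, span


def join_keys_match(metric_point, log_record, span):
    """True when every join key has the same name and value in all three signals."""
    for key in JOIN_KEYS:
        if key not in metric_point["labels"]:
            return False
        if not (metric_point["labels"][key] == log_record.get(key) == span["attributes"].get(key)):
            return False
    return True
```

The design point is that the high-cardinality identifier (`trace_id`) rides on the exemplar, the log field, and the span, while the metric labels stay limited to a small, bounded set of values.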

Project

Objective

Take a small HTTP service (your own or a starter) and instrument it with all three signals through OpenTelemetry so that a single metric spike can be navigated to the exact slow request's trace and log lines in under 30 seconds — then prove the metrics tier survives a deliberately injected cardinality bomb.

Requirements

Acceptance criteria

A recorded triage walk that goes metric → trace → log in under 30 seconds, screenshots or terminal captures at each hop, all linked by the same trace_id.
Evidence that the three signals share identical join-key names (show the metric label, the log field, and the span attribute side by side for http.route and service.name).
The latency histogram shows clickable exemplars, and no metric carries a per-user or per-customer identity label.
A demonstration that the cardinality guardrail fires on the injected unbounded label (CI failure or alert + labeldrop), with the series-creation rate returning to baseline within one scrape interval.
A one-paragraph write-up naming, for each of the three signals, the question it answered cheapest during the triage and the cost axis it would have blown if misused.
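The cardinality guardrail in the criteria above could be prototyped as a CI-style check before wiring it into a real pipeline. The sketch below is an assumption-laden illustration, not a Prometheus or collector API: `find_unbounded_labels`, `drop_labels`, and the `budget` parameter are hypothetical names, and each series is modelled as a plain dict of label name to value.

```python
from collections import defaultdict


def find_unbounded_labels(series, budget):
    """Return {label: distinct_value_count} for labels whose count exceeds budget."""
    values = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            values[name].add(value)
    return {name: len(vals) for name, vals in values.items() if len(vals) > budget}


def drop_labels(series, to_drop):
    """Mimic a labeldrop step: remove offending labels, then collapse duplicates."""
    kept = {
        tuple(sorted((k, v) for k, v in labels.items() if k not in to_drop))
        for labels in series
    }
    return [dict(items) for items in sorted(kept)]


if __name__ == "__main__":
    # Injected bomb: a per-user label multiplies two routes into 1000 series.
    bomb = [
        {"http.route": route, "user_id": f"u{i}"}
        for route in ("/cart", "/pay")
        for i in range(500)
    ]
    offenders = find_unbounded_labels(bomb, budget=100)
    print(offenders)                            # labels that should fail CI
    print(len(drop_labels(bomb, offenders)))    # series count after the drop
```

In CI, a non-empty result from `find_unbounded_labels` would fail the build. At runtime, the equivalent is an alert on series-creation rate plus a relabel rule that drops the offending label.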

Senior stretch

Add an on-call runbook page: the cheapest-signal-first triage order, the per-signal failure modes with their detection metrics, and the cardinality-budget review process.
Add a PII guardrail: a Vector or Fluent Bit redactor for common patterns (email, token, phone) on the log pipeline, plus an audit query that ranks high-cardinality string fields per signal.
Stand up an Observability 2.0 wide-event path alongside the 1.0 three-backend stack: emit one wide event per request to a columnar store (ClickHouse/Honeycomb), reconstruct the same triage with GROUP BY + filter + trace_id join, and compare the developer experience and cost shape against the three-backend version.
Add a tail-sampling load comparison: measure collector CPU and memory under head-only vs head+tail sampling at the same traffic, and show that tail-sampling cost scales with raw traffic while stored volume does not.
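For the PII guardrail, it can help to prototype the redaction patterns in a scripting language before translating them into Vector or Fluent Bit configuration. The sketch below is only that: the patterns are deliberately simple illustrations, not production-grade detectors, and `REDACTIONS` and `redact` are hypothetical names. The phone pattern in particular will also match other long digit runs, so test it against real log samples.

```python
import re

# Ordered list of (pattern, replacement). Illustrative patterns only.
REDACTIONS = [
    # Email addresses such as user@example.com
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "[EMAIL]"),
    # Bearer tokens as they appear in logged Authorization headers
    (re.compile(r"(?i)\bbearer\s+[A-Za-z0-9._~+/=-]+"), "[TOKEN]"),
    # Loose phone match: optional +, then at least nine digits/separators
    (re.compile(r"\+?\d[\d\s().-]{7,}\d"), "[PHONE]"),
]


def redact(line):
    """Apply every redaction pattern to one log line and return the result."""
    for pattern, replacement in REDACTIONS:
        line = pattern.sub(replacement, line)
    return line


if __name__ == "__main__":
    print(redact("user a@b.com auth Bearer abc.def-123 tel +1 (555) 010-9999"))
```

Running the same sample lines through the prototype and through the real pipeline redactor gives a quick regression check that the two agree.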

Recap

This is the loop you run when you own a service’s observability: emit all three signals, wire them with OpenTelemetry Semantic Conventions join keys and exemplars so a metric spike navigates to the exact trace and log in seconds, sample traces so you never lose an error, and guard the metrics tier against cardinality bombs before they OOM the TSDB at 03:00. Doing it once on a small service makes the production version — and the 1.0-vs-2.0 cost decision — muscle memory.
