awesome-everything RU
↑ Back to the climb

Observability

What the three signals are: logs, metrics, and traces

Crux The three telemetry shapes production systems emit, why each exists, and how a five-minute triage works by picking the right one.
Your altitude — climbing toward senior
ZeroJuniorMiddleSenior
You are at junior altitude — the surface
◷ 10 min

An API’s p99 jumps from 80 ms to 1.2 s at 14:02 UTC. You have three different signals you can reach for. Choosing the wrong one first costs you an extra hour. Choosing the right one first costs you five minutes.

The three signals

Cindy Sridharan’s 2018 O’Reilly book Distributed Systems Observability crystallised the vocabulary: production systems emit three telemetry shapes.

  • Metrics — numeric measurements aggregated over time with a small fixed set of labels. Pre-aggregated, cheap to query across years of history, blind to any dimension you did not label in advance.
  • Logs — event records (timestamp + payload). Preserve every field you wrote, at the cost of ingestion bytes and query scan time.
  • Traces — the causal chain of one request across services. Preserve causality and timing per span, at the cost of sampling — storing every trace would multiply storage by an order of magnitude over the original traffic.

The three are not redundant. Each preserves something different and pays a different cost. A team with only one signal cannot triage end-to-end.

SignalWhat it preservesWhat it discardsCost axis
MetricsAggregate trends, percentilesAny unlabelled dimensionCardinality (series count)
LogsEvery field in every eventNothing — that is the problemIngestion bytes & retention
TracesCausal chain of one requestUnsampled requestsStorage per span × traffic

The kitchen metaphor

A restaurant kitchen maps the three signals clearly. Metrics are wall dials — covers per hour, average ticket time, error rate. Cheap to read, always-on, but they only show what you labelled in advance. Logs are order tickets on the rail — every dish, every modifier, every table. Bulky, but when a customer complains you go to their ticket. Traces are the story of one specific order: prep at station A, sauté at B, plate-up, runner, table. Sampled — tracing every dish would bury you — but a cold dish lights the whole path.

The triage walk

Bea · Browser gets paged: checkout error rate jumped to 3%. She opens metrics — sees the spike at 14:02 UTC. The metric says “yes, errors are real” but not “why.” Sven · Origin server pulls logs filtered to status=5xx — sees a stack trace pointing at a payment-gateway timeout. Logs gave the cause. To know whether the request retried or wrote stale state, they pick a trace_id from the log line, open the span tree, and see the gateway timing out after 30 s with no retry. Three signals, three jobs, seconds each.

The rule: metrics are the alarm, logs are the diagnostic, traces are the deep dive.

Why this works

The three pillars exist because no single storage shape is cheap on all three of: long retention, high cardinality, and full request fidelity. Metrics excel at long retention; logs excel at high cardinality; traces excel at full request fidelity. Each one optimises for the questions its cost model handles cheaply, and offloads the rest to the other two.

Order the steps

Order the typical incident triage steps from cheapest signal to deepest:

  1. 1 Glance at metrics dashboards (RED, USE) — confirm the symptom is real
  2. 2 Open the alert source — identify the affected service and time window
  3. 3 Filter logs to errors in the window — read messages and stack traces
  4. 4 Pick one failed trace_id from a log line
  5. 5 Open the trace — see every span and where time was spent
  6. 6 If still unresolved, reach for profiles or replay the request
Quiz

A service emits a counter labelled by route, method, and status_class. Which of the three signals is this?

Quiz

Why do production teams sample traces instead of storing every one?

Complete the analogy

Fill in the blank: _______ are the wall dials — cheap to read, always-on, but they only show what you labelled in advance.

Recall before you leave
  1. 01
    In two sentences, why are the three signals not interchangeable?
  2. 02
    What is the triage order for an unknown production incident, and why?
  3. 03
    Why does the kitchen metaphor work for the three signals?
Recap

The three production telemetry signals are metrics (aggregated numeric measurements with fixed labels), logs (timestamped event records preserving every field), and traces (causal chains of spans for individual requests). Each exists because no single storage shape is cheap on long retention, high cardinality, and full request fidelity simultaneously. In incident triage, metrics confirm the symptom, logs give the diagnostic message, and traces identify the exact span where time was lost. The next lessons unpack the cost model and mechanics of each signal in depth.

Connected lessons
appears again in268
Continue the climb ↑Metrics and cardinality: the cost model of a time-series database
shortcuts expand
search
K
prev piece
k
next piece
j
cycle tier
t
this menu
?
sources3
expand
  1. 01
  2. 02
  3. 03

Trademarks belong to their respective owners. Editorial reference only.