What the three signals are: logs, metrics, and traces

The three telemetry shapes production systems emit, why each exists, and how a five-minute triage works by picking the right one.

An API’s p99 jumps from 80 ms to 1.2 s at 14:02 UTC. You have three different signals you can reach for. Choosing the wrong one first costs you an extra hour. Choosing the right one first costs you five minutes.

The three signals

Cindy Sridharan’s 2018 O’Reilly book Distributed Systems Observability crystallised the vocabulary: production systems emit three telemetry shapes.

Metrics: numeric measurements aggregated over time with a small, fixed set of labels. Pre-aggregated, cheap to query across years of history, blind to any dimension you did not label in advance.
Logs: event records (timestamp + payload). Preserve every field you wrote, at the cost of ingestion bytes and query scan time.
Traces: the causal chain of one request across services. Preserve causality and timing per span, at the cost of sampling: each request produces one span per hop, so storing every trace scales storage with traffic times call depth.

Together, each signal fills the blind spot of the other two: metrics see the trend but not the record, logs see the record but not the path, traces see the path but not the population. The three are not redundant. Each preserves something different and pays a different cost. A team with only one signal cannot triage end-to-end.
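To make the three shapes concrete, here is a minimal Python sketch of one request emitting all three. The sink names (`request_counter`, `log_lines`, `spans`) and the `record_request` helper are hypothetical stand-ins for a real metrics client, log pipeline, and tracer; they are not any particular library's API.

```python
import json
from collections import Counter

# Hypothetical in-memory sinks (assumptions for illustration only).
request_counter = Counter()  # metric: one integer per label combination
log_lines = []               # logs: one JSON record per event
spans = []                   # trace: one record per unit of work

def record_request(ts, trace_id, route, method, status, duration_ms, user_id):
    # Metric: only the pre-declared labels survive; user_id and trace_id are dropped.
    request_counter[(route, method, f"{status // 100}xx")] += 1

    # Log: every field we write is preserved, and every byte is paid for.
    log_lines.append(json.dumps({
        "ts": ts, "trace_id": trace_id, "route": route, "method": method,
        "status": status, "duration_ms": duration_ms, "user_id": user_id,
    }))

    # Trace: a span records timing and parentage for this one request.
    spans.append({
        "trace_id": trace_id, "span_id": "root", "parent_id": None,
        "name": f"{method} {route}", "duration_ms": duration_ms,
    })

record_request("14:02:31", "t-101", "/checkout", "POST", 504, 30050, "u-1")
record_request("14:02:45", "t-102", "/checkout", "POST", 504, 30012, "u-2")

print(request_counter[("/checkout", "POST", "5xx")])  # 2: the trend, without the users
print(json.loads(log_lines[0])["user_id"])            # u-1: the record, without the path
```

The counter can answer "how many 5xx on checkout" cheaply but cannot say which user was affected; the log line can, and the span is what a tracer would later extend with child spans for each downstream call.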

Metrics (aggregates and counters): is anything wrong?
Logs (discrete events): what exactly happened?
Traces (request spans): where did time go?

Three complementary shapes: metrics raise the alarm, logs give the diagnostic, traces follow one request across services.

Signal	What it preserves	What it discards	Cost axis
Metrics	Aggregate trends, percentiles	Any unlabelled dimension	Cardinality (series count)
Logs	Every field in every event	Nothing — that is the problem	Ingestion bytes & retention
Traces	Causal chain of one request	Unsampled requests	Storage per span × traffic
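The cardinality cost in the metrics row can be estimated with simple arithmetic: the worst-case number of time series for one metric is the product of the distinct values of each label. The sketch below uses made-up label counts (40 routes, 5 methods, 5 status classes, 100,000 users) purely as assumptions.

```python
from math import prod

# Hypothetical number of distinct values per label (assumed figures).
label_values = {"route": 40, "method": 5, "status_class": 5}

def series_count(labels):
    """Upper bound on time series: product of distinct values per label."""
    return prod(labels.values())

base = series_count(label_values)
with_user = series_count({**label_values, "user_id": 100_000})

print(base)       # 1000
print(with_user)  # 100000000
```

Adding one unbounded label such as a user ID turns a thousand series into a hundred million, which is why that kind of dimension usually belongs in logs or traces rather than in metric labels.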

The kitchen metaphor

A restaurant kitchen maps the three signals clearly. Metrics are wall dials — covers per hour, average ticket time, error rate. Cheap to read, always-on, but they only show what you labelled in advance. Logs are order tickets on the rail — every dish, every modifier, every table. Bulky, but when a customer complains you go to their ticket. Traces are the story of one specific order: prep at station A, sauté at B, plate-up, runner, table. Sampled — tracing every dish would bury you — but a cold dish lights the whole path.

The triage walk

Bea gets paged: checkout error rate jumped to 3%. She opens metrics and sees the spike at 14:02 UTC. The metric says “yes, errors are real” but not “why.” Sven pulls logs filtered to status=5xx and sees a stack trace pointing at a payment-gateway timeout. The logs give the cause. To learn whether the request retried or wrote stale state, they pick a trace_id from the log line, open the span tree, and see the gateway call timing out after 30 s with no retry. Three signals, three jobs, seconds each.
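The walk above can be sketched over a few hand-written records. All data here is invented for illustration (the trace IDs, span names, and the 1% threshold are assumptions), and plain lists stand in for what would be queries against real metrics, log, and trace backends.

```python
# Hypothetical telemetry for the checkout incident (assumed values).
error_rate_by_minute = {"14:00": 0.002, "14:01": 0.003, "14:02": 0.030, "14:03": 0.031}

logs = [
    {"ts": "14:01:40", "status": 200, "trace_id": "t-100", "msg": "checkout ok"},
    {"ts": "14:02:31", "status": 504, "trace_id": "t-101", "msg": "payment-gateway timeout"},
    {"ts": "14:02:45", "status": 200, "trace_id": "t-102", "msg": "checkout ok"},
]

spans = [
    {"trace_id": "t-101", "name": "POST /checkout", "parent": None, "duration_ms": 30_050},
    {"trace_id": "t-101", "name": "cart.load", "parent": "POST /checkout", "duration_ms": 40},
    {"trace_id": "t-101", "name": "payment-gateway.charge", "parent": "POST /checkout", "duration_ms": 30_000},
]

# Step 1, metrics: confirm the symptom and find when it started.
THRESHOLD = 0.01  # assumed alert threshold
spike_minute = next(m for m in sorted(error_rate_by_minute)
                    if error_rate_by_minute[m] > THRESHOLD)

# Step 2, logs: filter to 5xx records at or after the spike and read the message.
errors = [rec for rec in logs if rec["status"] >= 500 and rec["ts"][:5] >= spike_minute]
cause = errors[0]["msg"]

# Step 3, trace: follow one failed request and find the slowest child span.
trace_id = errors[0]["trace_id"]
children = [s for s in spans if s["trace_id"] == trace_id and s["parent"] is not None]
slowest = max(children, key=lambda s: s["duration_ms"])
gateway_attempts = sum(1 for s in children if s["name"] == "payment-gateway.charge")

print(spike_minute, cause, slowest["name"], gateway_attempts)
```

Each step narrows the search for the next one: the metric supplies the time window, the log supplies the trace_id, and the trace supplies the span. A single `payment-gateway.charge` span of about 30 s is what "timed out with no retry" looks like in this toy data.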

The rule: metrics are the alarm, logs are the diagnostic, traces are the deep dive.

Why this works

The three pillars exist because no single storage shape is cheap on all three of: long retention, high cardinality, and full request fidelity. Metrics excel at long retention; logs excel at high cardinality; traces excel at full request fidelity. Each one optimises for the questions its cost model handles cheaply, and offloads the rest to the other two.

Each signal is cheap on exactly one cost axis and pays for the other two — which is why one signal can never replace the other two.
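On the trace side, that cost is usually paid through sampling. The following is a minimal sketch of deterministic head sampling keyed on the trace ID, plus a rough storage estimate. The function names and the traffic figures (1,000 requests per second, 20 spans per request) are assumptions for illustration; production tracers ship their own samplers.

```python
import hashlib

def keep_trace(trace_id: str, sample_rate: float) -> bool:
    """Keep roughly `sample_rate` of traces; the same trace_id always gets the same answer."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")   # integer in [0, 2**64)
    return bucket < int(sample_rate * 2**64)     # integer compare avoids float edge cases

def spans_stored_per_day(requests_per_second, spans_per_request, sample_rate):
    """Rough span volume: traffic x call depth x fraction kept."""
    return requests_per_second * 86_400 * spans_per_request * sample_rate

kept = sum(keep_trace(f"trace-{i}", 0.10) for i in range(10_000))
print(kept)  # close to 1000, since about 10% of IDs fall under the threshold

print(spans_stored_per_day(1_000, 20, 1.0))   # every trace stored
print(spans_stored_per_day(1_000, 20, 0.01))  # 1% sampled: about a hundredth of that
```

Hashing the trace ID rather than drawing a random number means every service that sees the same ID makes the same keep-or-drop decision, so a sampled trace stays complete across hops. The trade-off is the one in the table: unsampled requests are simply gone.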

Order the steps

Order the typical incident triage steps from cheapest signal to deepest:

1 Glance at metrics dashboards (RED, USE) — confirm the symptom is real
2 Open the alert source — identify the affected service and time window
3 Filter logs to errors in the window — read messages and stack traces
4 Pick one failed trace_id from a log line
5 Open the trace — see every span and where time was spent
6 If still unresolved, reach for profiles or replay the request

Quiz

A service emits a counter labelled by route, method, and status_class. Which of the three signals is this?

Quiz

Why do production teams sample traces instead of storing every one?

Complete the analogy

Fill in the blank: _______ are the wall dials — cheap to read, always-on, but they only show what you labelled in advance.

Recall before you leave

01 In two sentences, why are the three signals not interchangeable?
02 What is the triage order for an unknown production incident, and why?
03 Why does the kitchen metaphor work for the three signals?

Recap

The three production telemetry signals are metrics (aggregated numeric measurements with fixed labels), logs (timestamped event records preserving every field), and traces (causal chains of spans for individual requests). Each exists because no single storage shape is cheap on long retention, high cardinality, and full request fidelity simultaneously. In incident triage, metrics confirm the symptom, logs give the diagnostic message, and traces identify the exact span where time was lost. Now when you see an alert fire, you will instinctively reach for metrics first to confirm the symptom, then logs to read the cause, then a trace to see where time was spent — rather than spending an hour in the wrong signal.
