Observability OBS · 01 · 05

Join keys and exemplars: making the three signals compose

How shared attribute names connect metrics, logs, and traces into a navigable surface, and how exemplars bridge aggregate metrics to individual traces.

OBS Middle ◷ 12 min

A metric shows checkout p99 spiking. You open the log search — but your metric uses route="/checkout", the log uses http_path="/checkout", and the trace uses http.route="/checkout". Three different field names, three disconnected searches. The pivot that should take 10 seconds takes 20 minutes.

The join key problem

Three signals on three different backends are useless if you cannot navigate between them. The bridge is a small set of shared attribute keys that appear identically in metrics, logs, and traces. When they diverge, even subtly, every dashboard needs ad-hoc query translation and every on-call runbook becomes fragile.

The standard set, formalised in OpenTelemetry Semantic Conventions:

service.name — which service emitted this
trace_id + span_id — which request, which step
http.route — the templated route (e.g. /orders/{id}, not the full URL /orders/42)
http.request.method, http.response.status_code
db.system, db.operation, db.namespace (the stable database conventions rename the first two to db.system.name and db.operation.name)
messaging.system, messaging.destination.name

When metrics carry the same service.name and http.route as log lines, and log lines carry the trace_id of the request that emitted them, and traces carry the same http.route attribute as the metric — that is when “click from metric spike to log line to trace” works in under 30 seconds. Without those join keys you have three disconnected dashboards and no narrative.
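Divergent names can be caught mechanically before they reach a dashboard. The following is a minimal sketch, not part of any OpenTelemetry tooling: the alias table and the function names are hypothetical, and the sample attributes mirror the checkout example from the introduction.

```python
# Hypothetical alias table: names a team may have drifted to, mapped to the
# canonical join key they should have used.
ALIASES = {
    "route": "http.route",
    "http_path": "http.route",
    "service": "service.name",
    "status": "http.response.status_code",
}


def find_divergent_keys(signal_name, attributes):
    """Return (signal, found_key, canonical_key) for every aliased key."""
    problems = []
    for key in attributes:
        canonical = ALIASES.get(key)
        if canonical is not None:
            problems.append((signal_name, key, canonical))
    return problems


def check_signals(signals):
    """signals maps a signal name to the attribute dict it emits."""
    report = []
    for name, attributes in signals.items():
        report.extend(find_divergent_keys(name, attributes))
    return report


# The three signals from the opening scenario.
signals = {
    "metric": {"service.name": "checkout", "route": "/checkout"},
    "log": {"service.name": "checkout", "http_path": "/checkout"},
    "trace": {"service.name": "checkout", "http.route": "/checkout"},
}

report = check_signals(signals)
for signal, found, expected in report:
    print(f"{signal}: rename {found!r} to {expected!r}")
```

A check like this could run in CI against sample telemetry, so a renamed field is flagged at review time rather than discovered during an incident.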

Join key	In metrics	In logs	In traces
service.name	label	field	resource attribute
trace_id	exemplar (sampled)	mandatory field	root identifier
http.route	label (templated)	field	span attribute
http.response.status_code	label (class)	field (full code)	span attribute (full code)

The whole payoff of join keys is speed, and it is fragile: a single inconsistent name turns a sub-30-second triage into a 20-minute multi-tab search.
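The per-signal differences in the table can be sketched as three projections of one request context. This is an illustration under assumptions: the helper names and the sample context are hypothetical, and a real setup would usually let the instrumentation library fill these fields.

```python
def status_class(status_code):
    """Reduce a full status code to its class, e.g. 503 -> '5xx'."""
    return f"{status_code // 100}xx"


def metric_labels(ctx):
    # Metrics keep cardinality low: templated route and status class only.
    # (Some metric backends may render the dots in these names differently.)
    return {
        "service.name": ctx["service.name"],
        "http.route": ctx["http.route"],
        "http.response.status_code": status_class(ctx["http.response.status_code"]),
    }


def log_fields(ctx, message):
    # Logs carry the full status code and the mandatory trace_id.
    return {
        "message": message,
        "service.name": ctx["service.name"],
        "trace_id": ctx["trace_id"],
        "http.route": ctx["http.route"],
        "http.response.status_code": ctx["http.response.status_code"],
    }


def span_attributes(ctx):
    # service.name lives on the trace resource, so it is not repeated per span.
    return {
        "http.route": ctx["http.route"],
        "http.response.status_code": ctx["http.response.status_code"],
    }


# Hypothetical request context shared by all three projections.
request_ctx = {
    "service.name": "checkout",
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "http.route": "/orders/{id}",
    "http.response.status_code": 503,
}

labels = metric_labels(request_ctx)
fields = log_fields(request_ctx, "upstream timeout")
attrs = span_attributes(request_ctx)
print(labels["http.response.status_code"])
```

Because all three projections read from the same context, the key names cannot drift apart; only the level of detail differs per signal.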

Exemplars: the metric-to-trace bridge

An exemplar is a sampled trace_id attached to a histogram bucket observation. When http_request_duration_seconds records a 1.5 s observation in the bucket above 1 s, the metrics client can optionally attach the trace_id of the request that produced that observation. Grafana renders these as dots on the histogram heatmap; clicking a dot opens the corresponding trace.
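To make the mechanism concrete, here is a toy histogram that remembers one exemplar per bucket. It is a sketch of the idea only, not the Prometheus or OpenTelemetry client API: the class, its methods, and the trace ids are hypothetical, and it stores per-bucket counts rather than the cumulative counts Prometheus exposes.

```python
import bisect
import math


class ExemplarHistogram:
    """Toy histogram that keeps the most recent exemplar per bucket."""

    def __init__(self, upper_bounds):
        # Upper bounds are inclusive; +Inf catches everything above the last.
        self.upper_bounds = sorted(upper_bounds) + [math.inf]
        self.counts = [0] * len(self.upper_bounds)
        self.exemplars = [None] * len(self.upper_bounds)

    def observe(self, value, trace_id=None):
        # bisect_left returns the first bucket whose upper bound is >= value.
        index = bisect.bisect_left(self.upper_bounds, value)
        self.counts[index] += 1
        if trace_id is not None:
            # Overwrite: one sampled trace_id per bucket is enough to pivot.
            self.exemplars[index] = {"trace_id": trace_id, "value": value}


hist = ExemplarHistogram([0.1, 0.5, 1.0])
hist.observe(0.08, trace_id="trace-fast")
hist.observe(0.09)  # not every observation needs an exemplar
hist.observe(1.5, trace_id="4bf92f3577b34da6a3ce929d0e0e4736")

slow_exemplar = hist.exemplars[-1]  # the bucket above 1 s
print(slow_exemplar)
```

The cost stays bounded because the histogram holds at most one trace_id per bucket, no matter how many requests it aggregates.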

Exemplars (defined in the OpenMetrics exposition format, supported in Prometheus since v2.26 in 2021 behind the exemplar-storage feature flag, and now standard in OTel exporters) eliminate the gap that defined the pre-exemplar era: “metrics say something is slow but I have no example request to drill into.” The pivot is:

1. Metric histogram spike → click exemplar dot
2. Exemplar opens the trace for the specific slow request
3. Trace span shows db.query took 1.3 s
4. Log filtered to trace_id=<same> shows the full query text

No manual timestamp correlation. No searching across dashboards. One click chain. Steps 1–4 only work because each signal carries the same trace_id; without that shared join key, step 2 would require manually guessing which trace to look at from a timestamp window — and you would be wrong half the time.
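The last hop of that chain depends on every log line carrying the trace_id of the request that emitted it. One way to get there with only the Python standard library is a logging filter that reads the id from a context variable. This is a sketch under assumptions: the variable name, logger name, and log format are hypothetical, and in practice a tracing SDK would supply the id.

```python
import contextvars
import io
import logging

# Hypothetical context variable; a tracing SDK would normally supply the id.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copy the active trace_id onto every log record."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True  # never drop the record, only annotate it


def build_logger(stream):
    logger = logging.getLogger("checkout")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("trace_id=%(trace_id)s %(levelname)s %(message)s")
    )
    handler.addFilter(TraceIdFilter())
    logger.addHandler(handler)
    return logger


stream = io.StringIO()
logger = build_logger(stream)

# Simulate handling one request inside a known trace.
token = current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
try:
    logger.info("db.query took 1.3 s")
finally:
    current_trace_id.reset(token)

output = stream.getvalue()
print(output, end="")
```

With the id on every line, the log search for a slow request becomes an exact match on trace_id instead of a guess based on a timestamp window.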

OpenTelemetry: the integration layer

The OpenTelemetry project (CNCF, born 2019 from a merger of OpenCensus and OpenTracing) standardises three things:

API — the surface engineers write code against: tracer.start_span(), meter.create_counter().
SDK — the per-language implementation that builds OTLP messages.
OTLP — OpenTelemetry Protocol: binary protobuf over gRPC or HTTP, roughly 50–70% smaller than JSON over HTTP.

The pitch: instrument once, route anywhere. Application code calls the OTel API, the SDK builds OTLP messages, the OTel Collector forwards them to any backend — Honeycomb, Datadog, Jaeger, ClickHouse, Loki, Prometheus — anything that speaks OTLP or has a contrib exporter. Without OTel Semantic Conventions, every team invents their own field names and joins fail at integration time.

Why this works

The OTel Semantic Conventions document is the single most load-bearing piece of the observability stack. It is not the SDK or the wire format — those are plumbing. The field names are the semantics. Adopting them across all three signals — even if your storage backends are Prometheus + Loki + Jaeger — is the highest-leverage investment in observability hygiene because it is what makes the three pillars compose rather than fight.

The three signals each carry the same trace_id, so a metric spike, its log lines, and its trace are one click apart. An exemplar attaches a sampled trace_id to a slow histogram bucket, opening the exact trace behind the spike.

Order the steps

Walk the three-pillar triage for 'p99 checkout latency spiked in us-east-1 only':

1 Open the RED histogram filtered by region — confirm us-east-1 p99=1.4s, eu-west-1=90ms
2 Click the exemplar dot on the >1s histogram bucket
3 Open the trace — see 1.1s spent in the payment-gateway span
4 Filter logs to service=payment-proxy, region=us-east-1, around the trace timestamp
5 Read the log line: 'gateway upstream connect timeout, retrying'
6 Fix: add region as a metric label, add gateway.upstream.host as a span attribute
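Seen as data, the walk above is a chain of lookups keyed on shared attributes. The toy sketch below uses in-memory structures in place of three backends; every record, id, and name in it is invented for illustration, and the log hop filters on trace_id rather than on a timestamp window.

```python
# Invented sample data standing in for the metric, trace, and log backends.
EXEMPLARS = {("checkout", "us-east-1", ">1s"): "trace-7f3a"}

TRACES = {
    "trace-7f3a": [
        {"name": "POST /checkout", "parent": None,
         "service.name": "checkout", "duration_s": 1.4},
        {"name": "payment-gateway", "parent": "POST /checkout",
         "service.name": "payment-proxy", "duration_s": 1.1},
        {"name": "cart-lookup", "parent": "POST /checkout",
         "service.name": "cart", "duration_s": 0.05},
    ]
}

LOGS = [
    {"trace_id": "trace-7f3a", "service.name": "payment-proxy",
     "message": "gateway upstream connect timeout, retrying"},
    {"trace_id": "trace-0000", "service.name": "payment-proxy",
     "message": "ok"},
]


def triage(service, region, bucket):
    """Follow exemplar -> trace -> slowest child span -> matching log lines."""
    trace_id = EXEMPLARS[(service, region, bucket)]
    child_spans = [s for s in TRACES[trace_id] if s["parent"] is not None]
    slowest = max(child_spans, key=lambda s: s["duration_s"])
    messages = [
        record["message"]
        for record in LOGS
        if record["trace_id"] == trace_id
        and record["service.name"] == slowest["service.name"]
    ]
    return trace_id, slowest["name"], messages


result = triage("checkout", "us-east-1", ">1s")
print(result)
```

Each hop succeeds only because the key used for the lookup (trace_id, service.name) is spelled the same way in the structure on both sides.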

Quiz

A team's metrics use route='/checkout', logs use http_path='/checkout', and traces use http.route='/checkout'. What is the consequence?

Quiz

Why are the OpenTelemetry Semantic Conventions load-bearing even for teams that stay on Observability 1.0 backends (Prometheus, Loki, Jaeger)?

Recall before you leave

01 Name the five most important join keys for three-signal triage and where each appears.
02 Explain the exemplar click-through chain from a slow metric to a log line.
03 Why did OpenTelemetry emerge from a merger of OpenCensus and OpenTracing, and what does it standardise that neither predecessor did alone?

Recap

Three observability signals on separate backends are only useful when they share join keys — field names that appear identically in metric labels, log fields, and trace span attributes. OpenTelemetry Semantic Conventions defines these canonical names: service.name, trace_id, http.route, db.system, and others. Without them, pivoting from a metric spike to the matching log line requires manual translation; with them it is a single click. Exemplars extend this: a trace_id attached to a histogram bucket observation bridges an aggregate metric spike directly to one concrete slow request’s trace. OpenTelemetry (API + SDK + OTLP wire format) is the substrate that makes all three signals share these join keys in practice across any backend combination. Now when you instrument a new service, you will check that its metric labels, log fields, and span attributes all use the same OTel semantic convention names — because you know a single inconsistency turns a 30-second triage into a 20-minute search.

Connected lessons

builds on

Traces and sampling: the cost model of distributed tracing (middle)

unlocks

Observability 2.0: wide events and the cost shift (senior)
