
Observability

Join keys and exemplars: making the three signals compose

Crux: How shared attribute names connect metrics, logs, and traces into a navigable surface, and how exemplars bridge aggregate metrics to individual traces.

A metric shows checkout p99 spiking. You open the log search — but your metric uses route="/checkout", the log uses http_path="/checkout", and the trace uses http.route="/checkout". Three different field names, three disconnected searches. The pivot that should take 10 seconds takes 20 minutes.

The join key problem

Three signals on three different backends are useless if you cannot navigate between them. The bridge is a small set of shared attribute keys that appear identically in metrics, logs, and traces. When they diverge, even subtly, every dashboard needs ad-hoc query translation and every on-call runbook becomes fragile.
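
The "ad-hoc query translation" cost can be made concrete with a small sketch. The alias table and the `normalize` helper below are hypothetical, not part of any library; they only illustrate the shim a team ends up maintaining when field names diverge.

```python
# Hypothetical alias table: divergent field names mapped to one shared key.
ALIASES = {
    "route": "http.route",      # name used by the metric in the opening example
    "http_path": "http.route",  # name used by the log in the opening example
}


def normalize(record: dict) -> dict:
    """Return a copy of the record with aliased keys renamed to the shared key."""
    return {ALIASES.get(key, key): value for key, value in record.items()}


metric_labels = {"route": "/checkout"}
log_fields = {"http_path": "/checkout", "level": "error"}
span_attrs = {"http.route": "/checkout"}

# After normalisation all three records can be joined on the same key.
normalized = [normalize(r) for r in (metric_labels, log_fields, span_attrs)]
```

Every new alias has to be added to this table by hand, which is the fragility the paragraph above describes; emitting the shared name at the source removes the shim entirely.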

The standard set, formalised in OpenTelemetry Semantic Conventions:

  • service.name — which service emitted this
  • trace_id + span_id — which request, which step
  • http.route — the templated route (e.g. /orders/{id}, not the full URL /orders/42)
  • http.request.method, http.response.status_code
  • db.system, db.operation, db.namespace (the stable database conventions name the first two db.system.name and db.operation.name)
  • messaging.system, messaging.destination.name
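
The difference between a templated route and a full URL path can be sketched as follows. This is a naive illustration that assumes identifiers are numeric or UUID-shaped path segments; `template_route` is a hypothetical helper, and where a web framework exposes the matched route template, that value is the better source.

```python
import re

# Assumption: an "ID-like" segment is all digits or a canonical UUID.
_ID_SEGMENT = re.compile(
    r"^(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
    r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$"
)


def template_route(path: str) -> str:
    """Replace ID-like path segments with {id} and drop any query string."""
    path = path.split("?", 1)[0]
    segments = path.split("/")
    return "/".join("{id}" if _ID_SEGMENT.match(seg) else seg for seg in segments)


print(template_route("/orders/42"))  # /orders/{id}
print(template_route("/checkout"))   # /checkout
```

Using the template rather than the raw path keeps the number of distinct http.route values bounded by the number of routes, not the number of orders.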

When metrics carry the same service.name and http.route as log lines, and log lines carry the trace_id of the request that emitted them, and traces carry the same http.route attribute as the metric — that is when “click from metric spike to log line to trace” works in under 30 seconds. Without those join keys you have three disconnected dashboards and no narrative.
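
A minimal sketch of that composition, using plain dictionaries rather than any real SDK: one hypothetical `emit` function builds the metric labels, the log record, and the span from the same values, so the keys cannot drift apart. The IDs are sample values.

```python
def emit(service: str, route: str, trace_id: str, span_id: str, message: str):
    """Build one metric label set, one log record, and one span from shared keys."""
    shared = {"service.name": service, "http.route": route}

    metric_labels = dict(shared)  # no trace_id here: it would explode cardinality
    log_record = {**shared, "trace_id": trace_id, "span_id": span_id,
                  "message": message}
    span = {
        "trace_id": trace_id,
        "span_id": span_id,
        "resource": {"service.name": service},  # resource attribute
        "attributes": {"http.route": route},    # span attribute
    }
    return metric_labels, log_record, span


metric_labels, log_record, span = emit(
    service="checkout",
    route="/checkout",
    trace_id="4bf92f3577b34da6a3ce929d0e0e4736",  # sample value
    span_id="00f067aa0ba902b7",                   # sample value
    message="request finished",
)
```

The metric joins to the log on service.name and http.route; the log joins to the span on trace_id and span_id.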

| Join key | In metrics | In logs | In traces |
|---|---|---|---|
| service.name | label | field | resource attribute |
| trace_id | exemplar (sampled) | mandatory field | root identifier |
| http.route | label (templated) | field | span attribute |
| http.response.status_code | label (class) | field (full code) | span attribute (full code) |
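
The last row says metrics carry the status class while logs and traces carry the full code. A minimal sketch of that reduction, assuming the usual "2xx"/"5xx" spelling for the class (the helper name is hypothetical):

```python
def status_class(status_code: int) -> str:
    """Collapse a full HTTP status code into its class, e.g. 503 -> '5xx'."""
    if not 100 <= status_code <= 599:
        raise ValueError(f"not an HTTP status code: {status_code}")
    return f"{status_code // 100}xx"


full_code = 503
metric_label = status_class(full_code)  # low-cardinality label for the metric
log_field = full_code                   # logs and spans keep the exact code
```

A likely reason for the split is label cardinality: five classes per route are cheap to store as time series, while the exact code remains available on the log line and span once you have pivoted to them.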

Exemplars: the metric-to-trace bridge

An exemplar is a sampled trace_id attached to a histogram bucket observation. When http_request_duration_seconds records a 1.5 s observation in the bucket above 1 s, the histogram client optionally attaches one trace_id that produced that observation. Grafana renders these as dots on the histogram heatmap; clicking a dot opens the corresponding trace.
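
The mechanism can be sketched with a toy histogram in plain Python. This is not the Prometheus or OTel client API; `ToyHistogram` is a hypothetical class that keeps per-bucket (non-cumulative) counts and remembers the most recent exemplar per bucket, which is one possible sampling policy among several.

```python
import bisect


class ToyHistogram:
    """Toy histogram that stores at most one exemplar per bucket."""

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds)
        # One extra bucket catches values above the last bound (+Inf).
        self.counts = [0] * (len(self.upper_bounds) + 1)
        self.exemplars = [None] * (len(self.upper_bounds) + 1)

    def observe(self, value, trace_id=None):
        # bisect_left gives the first bucket whose upper bound is >= value.
        index = bisect.bisect_left(self.upper_bounds, value)
        self.counts[index] += 1
        if trace_id is not None:
            # Assumed policy: the latest traced observation wins.
            self.exemplars[index] = {"trace_id": trace_id, "value": value}
        return index


hist = ToyHistogram([0.1, 0.5, 1.0])
hist.observe(0.09)                              # fast request, no trace sampled
slow_bucket = hist.observe(1.5, trace_id="t-123")  # lands above the 1 s bound
```

Here `hist.exemplars[slow_bucket]` holds the trace_id a UI would open when the dot for that bucket is clicked.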

Exemplars are defined in the OpenMetrics specification; Prometheus has supported them since v2.26 (2021) behind the exemplar-storage feature flag, and OTel SDKs and exporters can emit them. They eliminate the gap that defined the pre-exemplar era: “metrics say something is slow but I have no example request to drill into.” The pivot is:

  1. Metric histogram spike → click exemplar dot
  2. Exemplar opens the trace for the specific slow request
  3. Trace span shows db.query took 1.3 s
  4. Log filtered to trace_id=<same> shows the full query text

No manual timestamp correlation. No searching across dashboards. One click chain.
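
The click chain above amounts to three lookups keyed by one trace_id. The sketch below uses hypothetical in-memory lists as stand-ins for the metric, trace, and log backends; all names and values are illustrative.

```python
# Hypothetical stand-ins for the three backends.
EXEMPLARS = {"le=+Inf": {"trace_id": "t-123", "value": 1.5}}
SPANS = [
    {"trace_id": "t-123", "parent": None, "name": "POST /checkout", "duration_s": 1.5},
    {"trace_id": "t-123", "parent": "POST /checkout", "name": "db.query", "duration_s": 1.3},
    {"trace_id": "t-123", "parent": "POST /checkout", "name": "render", "duration_s": 0.1},
    {"trace_id": "t-999", "parent": None, "name": "POST /checkout", "duration_s": 0.04},
]
LOGS = [
    {"trace_id": "t-123", "message": "db.query finished in 1.3 s"},
    {"trace_id": "t-999", "message": "db.query finished in 0.01 s"},
]


def pivot(bucket: str):
    """Follow exemplar -> trace -> slowest child span -> logs for that trace."""
    trace_id = EXEMPLARS[bucket]["trace_id"]                      # steps 1-2
    children = [s for s in SPANS
                if s["trace_id"] == trace_id and s["parent"] is not None]
    slowest = max(children, key=lambda s: s["duration_s"])       # step 3
    messages = [entry["message"] for entry in LOGS
                if entry["trace_id"] == trace_id]                 # step 4
    return trace_id, slowest["name"], messages


print(pivot("le=+Inf"))
```

No timestamps appear anywhere in `pivot`; the trace_id alone carries the join.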

OpenTelemetry: the integration layer

The OpenTelemetry project (CNCF, born 2019 from a merger of OpenCensus and OpenTracing) standardises three things:

  • API — the surface engineers write code against: tracer.start_span(), meter.create_counter().
  • SDK — the per-language implementation that builds OTLP messages.
  • OTLP — OpenTelemetry Protocol: binary protobuf over gRPC or HTTP, roughly 50–70% smaller than JSON over HTTP.

The pitch: instrument once, route anywhere. Application code calls the OTel API, the SDK builds OTLP messages, the OTel Collector forwards them to any backend — Honeycomb, Datadog, Jaeger, ClickHouse, Loki, Prometheus — anything that speaks OTLP or has a contrib exporter. Without OTel Semantic Conventions, every team invents their own field names and joins fail at integration time.

Why this works

The OTel Semantic Conventions document is the single most load-bearing piece of the observability stack. It is not the SDK or the wire format — those are plumbing. The field names are the semantics. Adopting them across all three signals — even if your storage backends are Prometheus + Loki + Jaeger — is the highest-leverage investment in observability hygiene because it is what makes the three pillars compose rather than fight.

Order the steps

Walk the three-pillar triage for 'p99 checkout latency spiked in us-east-1 only':

  1. Open the RED histogram filtered by region: confirm us-east-1 p99=1.4s, eu-west-1=90ms
  2. Click the exemplar dot on the >1s histogram bucket
  3. Open the trace and see 1.1s spent in the payment-gateway span
  4. Filter logs to service=payment-proxy, region=us-east-1, around the trace timestamp
  5. Read the log line: 'gateway upstream connect timeout, retrying'
  6. Fix: add region as a metric label, add gateway.upstream.host as a span attribute

Quiz

A team's metrics use route='/checkout', logs use http_path='/checkout', and traces use http.route='/checkout'. What is the consequence?

Quiz

Why are the OpenTelemetry Semantic Conventions load-bearing even for teams that stay on "Observability 1.0" backends (Prometheus, Loki, Jaeger)?

Recall before you leave
  1. Name the five most important join keys for three-signal triage and where each appears.
  2. Explain the exemplar click-through chain from a slow metric to a log line.
  3. Why did OpenTelemetry emerge from a merger of OpenCensus and OpenTracing, and what does it standardise that neither predecessor did alone?
Recap

Three observability signals on separate backends are only useful when they share join keys — field names that appear identically in metric labels, log fields, and trace span attributes. OpenTelemetry Semantic Conventions defines these canonical names: service.name, trace_id, http.route, db.system, and others. Without them, pivoting from a metric spike to the matching log line requires manual translation; with them it is a single click. Exemplars extend this: a trace_id attached to a histogram bucket observation bridges an aggregate metric spike directly to one concrete slow request’s trace. OpenTelemetry (API + SDK + OTLP wire format) is the substrate that makes all three signals share these join keys in practice across any backend combination.
