Observability

Traces and sampling: the cost model of distributed tracing

Crux: How a distributed trace is built, why sampling is mandatory, and the engineering tradeoffs between head-based and tail-based strategies.

A checkout request touches seven services before responding. The request takes 1.2 s in total, but the metrics say only “p99 is up.” A single distributed trace shows exactly which service’s span consumed 1.1 s of that time. Without sampling, storing a trace for every request at 1 000 req/s means 86.4 million traces per day, which is roughly 860 million to 8.6 billion spans at 10–100 spans per request.

How a trace is built

A request entering the edge is assigned a trace_id (128-bit, globally unique). Each service it touches creates a span:

  • operation_name
  • start_time, duration_ms
  • attributes (key-value pairs: http.route, db.system, error.type, etc.)
  • span_id (unique within the trace)
  • parent_span_id (links to the calling service’s span)

The trace_id and current span_id propagate forward via the W3C traceparent header: traceparent: 00-<trace_id>-<parent_span_id>-<flags>. The collector reconstructs the tree from spans; the UI renders a waterfall showing where wall-clock time was spent.
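The header layout described above can be sketched as a small parser and formatter. This is a minimal illustration of the version-00 traceparent format, not any SDK's implementation; the names `TraceContext`, `parse_traceparent`, `format_traceparent`, and `child_context` are hypothetical.

```python
import re
import secrets
from typing import NamedTuple, Optional

# Version 00 layout: 2 hex version, 32 hex trace-id, 16 hex parent-id, 2 hex flags.
_TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    trace_id: str   # 32 lowercase hex characters (128 bits)
    span_id: str    # 16 lowercase hex characters (64 bits)
    sampled: bool   # lowest bit of the trace-flags byte


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Parse a version-00 traceparent header; return None if it is malformed."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero identifiers are treated as invalid.
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return TraceContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def format_traceparent(ctx: TraceContext) -> str:
    """Serialize the context for the next outgoing request."""
    return f"00-{ctx.trace_id}-{ctx.span_id}-{'01' if ctx.sampled else '00'}"


def child_context(parent: TraceContext) -> TraceContext:
    """Same trace_id, fresh span_id, inherited sampling decision."""
    return TraceContext(parent.trace_id, secrets.token_hex(8), parent.sampled)
```

A service would call `parse_traceparent` on the incoming header, create its own span with `child_context`, and send `format_traceparent` of that child on every outgoing call, so the receiver sees this service's span_id as its parent.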

Field            What it identifies                   Example
trace_id         The entire end-to-end request        128-bit ID (32 hex characters), shared across all services
span_id          One operation within the trace       64-bit ID (16 hex characters), unique per span
parent_span_id   The span that called this one        Set to the calling span’s span_id
traceparent      W3C header propagating the context   00-{trace_id}-{span_id}-01
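To make the tree reconstruction concrete, here is a minimal sketch of how a collector or UI might turn a flat list of spans into waterfall rows using only span_id and parent_span_id. The `Span` dataclass and `build_waterfall` function are hypothetical names for illustration; real backends also handle clock skew, late spans, and very large traces.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Span:
    span_id: str
    parent_span_id: Optional[str]  # None for the root span
    operation_name: str
    start_ms: float
    duration_ms: float


def build_waterfall(spans: List[Span]) -> List[Tuple[int, Span]]:
    """Return (depth, span) rows: parents before children, siblings by start time."""
    known_ids = {s.span_id for s in spans}
    children = defaultdict(list)
    roots = []
    for span in spans:
        if span.parent_span_id is None or span.parent_span_id not in known_ids:
            # True root, or an orphan whose parent span never arrived.
            roots.append(span)
        else:
            children[span.parent_span_id].append(span)

    rows: List[Tuple[int, Span]] = []

    def walk(span: Span, depth: int) -> None:
        rows.append((depth, span))
        for child in sorted(children[span.span_id], key=lambda c: c.start_ms):
            walk(child, depth + 1)

    for root in sorted(roots, key=lambda r: r.start_ms):
        walk(root, 0)
    return rows
```

Spans can arrive in any order; the parent links alone are enough to recover the nesting, and the depth value is what a UI would use to indent each bar in the waterfall.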

Why sampling is mandatory

Storage is proportional to spans per request × request rate. A request in a mid-size architecture produces 10–100 spans. At 1 000 req/s:

  • 10 spans/req × 1 000 req/s × 86 400 s = 864 M spans/day (minimum)
  • 100 spans/req × 1 000 req/s × 86 400 s = 8.64 B spans/day (complex service graph)

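Uncompressed, each span is 1–5 KB. At the volumes above that is roughly 0.9 TB per day (864 M spans × 1 KB) to 43 TB per day (8.64 B spans × 5 KB) from a single service fleet. Storing 100% of traces can produce more telemetry bytes than the requests themselves carried, and the storage bill grows in lockstep with traffic.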
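The arithmetic behind these figures can be checked with a few lines. This is a back-of-the-envelope sketch that assumes 1 KB = 1 000 bytes and uses the span sizes quoted above; the function names are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def daily_spans(spans_per_request: int, requests_per_second: int) -> int:
    """Spans emitted per day with no sampling."""
    return spans_per_request * requests_per_second * SECONDS_PER_DAY


def daily_terabytes(spans_per_day: int, bytes_per_span: int) -> float:
    """Uncompressed volume in decimal terabytes (1 TB = 1e12 bytes)."""
    return spans_per_day * bytes_per_span / 1e12


low_spans = daily_spans(10, 1_000)     # 864 million
high_spans = daily_spans(100, 1_000)   # 8.64 billion

low_tb = daily_terabytes(low_spans, 1_000)    # about 0.86 TB/day at 1 KB per span
high_tb = daily_terabytes(high_spans, 5_000)  # about 43 TB/day at 5 KB per span
```

The range spans almost two orders of magnitude, which is why the spans-per-request figure matters as much as the request rate when estimating cost.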

Head-based vs tail-based sampling

Head-based sampling decides at trace start whether to keep the trace. If the decision is “drop,” the SDK does not emit spans at all — zero collector cost, zero network bytes.

  • Typical rate: 1–10% of requests
  • Cost: predictable, low overhead
  • Weakness: uniform random sample under-represents rare events (errors, slow tails)
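A head-based decision is often made deterministically from the trace_id, so every service that sees the same trace reaches the same answer. The sketch below is similar in spirit to ratio-based samplers but is not any SDK's actual code; `head_sample` is a hypothetical name, and it assumes trace_ids are uniformly random.

```python
def head_sample(trace_id_hex: str, rate: float) -> bool:
    """Decide at trace start whether to keep a trace.

    Uses the low 64 bits of the trace_id as a uniform bucket, so the
    decision is repeatable for a given trace_id. Assumes the trace_id
    was generated randomly; IDs derived from request properties would
    bias the sample.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    bucket = int(trace_id_hex[-16:], 16)
    return bucket < int(rate * 2**64)
```

At a 5% rate and a 0.1% error rate, only about 1 in 20 000 requests is both an error and sampled, which is the under-representation the weakness above describes.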

Tail-based sampling buffers spans until the trace completes, then decides based on full context — was there an error, did latency exceed a threshold?

  • Always keeps 100% of error traces and slow-tail traces
  • Drops successful fast traces at a low base rate (0.5–5%)
  • Cost: every span still flows through the collector even if eventually dropped — collector CPU and memory scale with raw traffic, not sampled volume
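The buffering behaviour in the list above can be sketched as a small in-memory sampler. This is an illustrative toy, not a collector implementation: `BufferedSpan` and `TailSampler` are hypothetical names, the longest span duration stands in for trace latency, and a real collector would also need a wait window, eviction, and routing of all spans of a trace to the same instance.

```python
import random
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class BufferedSpan:
    trace_id: str
    duration_ms: float
    is_error: bool = False


class TailSampler:
    def __init__(self, slow_ms: float = 1000.0, base_rate: float = 0.01,
                 rng: Optional[random.Random] = None) -> None:
        self.slow_ms = slow_ms
        self.base_rate = base_rate
        self.rng = rng or random.Random()
        # Every span is held here until its trace is decided, so memory
        # scales with raw traffic, not with what is finally stored.
        self.buffer: Dict[str, List[BufferedSpan]] = defaultdict(list)

    def add(self, span: BufferedSpan) -> None:
        self.buffer[span.trace_id].append(span)

    def decide(self, trace_id: str) -> List[BufferedSpan]:
        """Call once the trace is considered complete; returns spans to keep."""
        spans = self.buffer.pop(trace_id, [])
        if not spans:
            return []
        if any(s.is_error for s in spans):
            return spans  # keep every error trace
        if max(s.duration_ms for s in spans) >= self.slow_ms:
            return spans  # keep every slow trace
        # Fast, successful trace: keep only a small baseline fraction.
        return spans if self.rng.random() < self.base_rate else []
```

Note that `add` does work for every span regardless of the eventual outcome; that unconditional cost is the reason tail-based sampling is more expensive at the collector than head-based sampling.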

The production pattern combines both: head-based at 10–20% baseline to limit collector input; tail-based policies on top to keep 100% of errors and slow traces from what arrives.
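One consequence of the phrase “from what arrives” is worth working through: once head sampling runs first, the tail policies only see the head-sampled fraction. The sketch below assumes the head decision is uniform and independent of whether a trace is an error or slow; `combined_pipeline` and its parameter names are hypothetical.

```python
from typing import Tuple


def combined_pipeline(head_rate: float, tail_base_rate: float,
                      interesting_fraction: float) -> Tuple[float, float, float]:
    """Return (collector_input, stored, interesting_coverage) as fractions of all traces.

    interesting_fraction is the share of traces that are errors or slow.
    """
    collector_input = head_rate
    # Of what arrives, keep all interesting traces plus a baseline of the rest.
    kept_of_arrived = interesting_fraction + (1 - interesting_fraction) * tail_base_rate
    stored = head_rate * kept_of_arrived
    # Only head-sampled error/slow traces can be kept by the tail stage.
    interesting_coverage = head_rate
    return collector_input, stored, interesting_coverage


# Example: 20% head rate, 1% tail baseline, 2% of traces are errors or slow.
collector_input, stored, coverage = combined_pipeline(0.20, 0.01, 0.02)
```

With these example inputs the collector handles 20% of traffic, about 0.6% of all traces are stored, and 20% of all error or slow traces are retained; raising coverage of rare events means raising the head rate and paying for the extra collector load.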

Why this works

The W3C traceparent header’s sampled flag (the lowest bit of the final trace-flags byte: 01 = sampled, 00 = not sampled) is how the head-based decision propagates downstream. If an upstream service sends 00, downstream services that honour the parent’s decision do not emit spans, which keeps collector load proportional to the sampling rate rather than raw traffic. This is what the Elastic Observability Labs post-mortem (2024) identified as the correct mechanism, and it is also why naive head sampling that uses request-path-derived trace_ids can accidentally correlate the sampling decision with request properties.
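The propagation rule can be sketched as a parent-based decision: inherit the incoming flag when a valid header exists, and only make a fresh head-based decision at the root. This is a simplified illustration under those assumptions, not a specific SDK's sampler; `parent_based_decision` is a hypothetical name.

```python
import re
import secrets
from typing import Optional, Tuple

_TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parent_based_decision(traceparent: Optional[str],
                          root_rate: float) -> Tuple[bool, str]:
    """Return (record_spans, traceparent_to_send_downstream)."""
    match = _TRACEPARENT_RE.match(traceparent) if traceparent else None
    if match is not None:
        # A caller already decided: keep its trace_id and inherit its flag.
        trace_id = match.group(1)
        sampled = bool(int(match.group(3), 16) & 0x01)
    else:
        # No valid parent: this service is the root and decides by ratio.
        trace_id = secrets.token_hex(16)
        sampled = int(trace_id[-16:], 16) < int(root_rate * 2**64)
    span_id = secrets.token_hex(8)  # this service's own span
    flags = "01" if sampled else "00"
    return sampled, f"00-{trace_id}-{span_id}-{flags}"
```

Because the flag is copied forward unchanged, a single drop decision at the edge suppresses span emission across the whole call graph without any coordination between services.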

Trace volume and sampling numbers

  • Spans per request (mid-size architecture): 10–100
  • Spans per day (1k req/s, 10 spans/req): ~864 M
  • Head-based sampling, typical rate: 1–10%
  • Tail-based sampling in production, error traces kept: 100%
  • Tail-based sampling in production, slow-tail traces (p99+) kept: 100%
  • Tail-based sampling in production, baseline of successful traces kept: 0.5–5%
  • Tail-sample buffer window: 30–60 s per trace
  • OTLP wire size vs JSON: ~50–70% smaller

Quiz

A service emits 100% of traces and stores them all. The bill triples within a week. What is the most common production fix?

Quiz

Why is tail-based sampling more expensive in collector overhead than head-based, even when both keep roughly the same final stored volume?

Quiz

A request enters the system with traceparent ending in '-00' (sampled flag = 0). A downstream service wants to record its span anyway. Which specification defines what must change for further downstream services to also record their spans?

Recall before you leave

  1. What are the four fields that link spans into a trace tree, and what does each one identify?
  2. Why is tail-based sampling more expensive in collector overhead than head-based, given the same final stored volume?
  3. A service emits 10–100 spans per request at 1 000 req/s. Estimate the daily span count and explain why 100% storage is not viable.

Recap

A distributed trace is a tree of spans connected by trace_id and parent_span_id, propagated across service boundaries via the W3C traceparent header. The collector reconstructs the tree; the UI shows a waterfall of service calls and their durations. Storage scales with span count times traffic — a mid-size service at 1 000 req/s produces hundreds of millions of spans per day, making 100% retention economically impossible. Head-based sampling is cheapest (zero collector cost for dropped traces) but misses rare error events; tail-based sampling keeps 100% of errors and slow-tail traces but buffers every span in collector memory until the trace completes, so its overhead scales with raw traffic not sampled volume. Production pattern: head-based at 10–20% combined with tail-based policies for errors and slow tails.
