
Traces and sampling: the cost model of distributed tracing

How a distributed trace is built, why sampling is mandatory, and the engineering tradeoffs between head-based and tail-based strategies.


A checkout request touches seven services before responding. The request takes 1.2 s in total, but the metric only says “p99 is up.” A single distributed trace shows exactly which service’s span consumed 1.1 s of that time. Keeping every trace is another matter: at 1 000 req/s and 10 spans per request, unsampled tracing produces roughly 864 million spans per day.

How a trace is built

A request entering the edge is assigned a trace_id (128-bit, globally unique). Each service it touches creates a span:

operation_name
start_time, duration_ms
attributes (key-value pairs: http.route, db.system, error.type, etc.)
span_id (unique within the trace)
parent_span_id (links to the calling service’s span)

The trace_id and current span_id propagate forward via the W3C traceparent header: traceparent: 00-<trace_id>-<parent_span_id>-<flags>. The collector reconstructs the tree from spans; the UI renders a waterfall showing where wall-clock time was spent.
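The reconstruction step described above can be sketched as follows. This is a minimal illustration, assuming spans arrive as plain records carrying the fields listed earlier; the `Span` class, the `build_tree` and `render_waterfall` helpers, and the sample values are hypothetical and not part of any tracing SDK.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    # Hypothetical record mirroring the span fields listed above.
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]  # None marks the root span
    operation_name: str
    start_ms: float
    duration_ms: float


def build_tree(spans):
    """Return (root_span, children_by_parent_span_id) for one trace."""
    children = defaultdict(list)
    root = None
    for span in spans:
        if span.parent_span_id is None:
            root = span
        else:
            children[span.parent_span_id].append(span)
    # Sort siblings by start time so the waterfall reads top to bottom.
    for siblings in children.values():
        siblings.sort(key=lambda s: s.start_ms)
    return root, children


def render_waterfall(span, children, depth=0):
    """Return one indented text line per span, depth-first."""
    lines = [f"{'  ' * depth}{span.operation_name} {span.duration_ms:.0f} ms"]
    for child in children.get(span.span_id, []):
        lines.extend(render_waterfall(child, children, depth + 1))
    return lines


# Illustrative spans only; they arrive out of order, as they would at a collector.
spans = [
    Span("t1", "b", "a", "cart.load", 10, 40),
    Span("t1", "a", None, "POST /checkout", 0, 1200),
    Span("t1", "c", "a", "recommendations.fetch", 60, 1100),
]
root, children = build_tree(spans)
print("\n".join(render_waterfall(root, children)))
```

Arrival order does not matter here: the tree is rebuilt purely from `span_id` and `parent_span_id`, which is why those two fields must survive every service hop.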

Field	What it identifies	Example
trace_id	The entire end-to-end request	128-bit ID (32 hex characters), shared across all services
span_id	One operation within the trace	64-bit ID (16 hex characters), unique per span
parent_span_id	Who called this span	Set to calling service’s span_id
traceparent	W3C header propagating the context	`00-{trace_id}-{span_id}-01`
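To make the header layout in the table concrete, here is a small parsing sketch. It assumes the version `00` layout of the W3C header (four lowercase hex fields separated by hyphens); `TraceContext`, `parse_traceparent`, and `format_traceparent` are illustrative names, not functions from a real library.

```python
import re
from typing import NamedTuple

# Version 00 layout only: version-trace_id-parent_span_id-flags, lowercase hex.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceContext(NamedTuple):
    version: str
    trace_id: str
    parent_span_id: str
    sampled: bool


def parse_traceparent(header: str) -> TraceContext:
    """Split a traceparent header into its fields, rejecting malformed input."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, parent_span_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_span_id == "0" * 16:
        raise ValueError("all-zero trace_id or span_id is invalid")
    # The sampled flag is the least significant bit of the flags byte.
    sampled = bool(int(flags, 16) & 0x01)
    return TraceContext(version, trace_id, parent_span_id, sampled)


def format_traceparent(trace_id: str, span_id: str, sampled: bool) -> str:
    """Build the header a service sends to the next hop."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(ctx.trace_id, ctx.parent_span_id, ctx.sampled)
```

A real service would rely on its tracing SDK's propagator rather than hand-rolled parsing; the sketch only shows which bytes carry which field.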

Why sampling is mandatory

Before you decide on a sampling strategy, it helps to understand exactly why 100% retention is off the table — the numbers make the decision obvious.

Storage is proportional to spans per request × request rate. A request through a mid-size architecture produces 10–100 spans. At 1 000 req/s:

10 spans/req × 1 000 req/s × 86 400 s = 864 M spans/day (minimum)
100 spans/req × 1 000 req/s × 86 400 s = 8.64 B spans/day (complex service graph)

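Uncompressed, each span is 1–5 KB. At the rates above that is roughly 0.9 TB/day (864 M spans × 1 KB) to 43 TB/day (8.64 B spans × 5 KB) of trace data from a single service fleet. Storing and indexing 100% of traces at that volume typically costs far more than the handful of traces anyone actually reads can justify.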
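The arithmetic can be reproduced with a few lines of Python. This is a back-of-the-envelope sketch, assuming decimal units (1 KB = 1 000 bytes, 1 TB = 10^12 bytes) and ignoring compression and index overhead; the function names are illustrative.

```python
SECONDS_PER_DAY = 86_400


def spans_per_day(spans_per_request: int, requests_per_second: int) -> int:
    """Daily span count at a constant request rate."""
    return spans_per_request * requests_per_second * SECONDS_PER_DAY


def storage_tb_per_day(span_count: int, bytes_per_span: int) -> float:
    """Uncompressed daily volume in decimal terabytes."""
    return span_count * bytes_per_span / 1e12


low = spans_per_day(10, 1_000)    # 864 000 000 spans
high = spans_per_day(100, 1_000)  # 8 640 000 000 spans

print(f"{low:,} spans/day  -> {storage_tb_per_day(low, 1_000):.2f} TB/day at 1 KB/span")
print(f"{high:,} spans/day -> {storage_tb_per_day(high, 5_000):.1f} TB/day at 5 KB/span")
```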

Head-based vs tail-based sampling

Head-based sampling decides at trace start whether to keep the trace. If the decision is “drop,” the SDK does not emit spans at all — zero collector cost, zero network bytes.

Typical rate: 1–10% of requests
Cost: predictable, low overhead
Weakness: uniform random sample under-represents rare events (errors, slow tails)
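A head-based decision can be sketched as a deterministic function of the trace_id, so that any service evaluating the same trace reaches the same answer. This is a simplified illustration in the spirit of ratio-based samplers, not the algorithm of any specific SDK; `new_trace_id` and `head_sample` are hypothetical helpers, and it assumes trace_ids are uniformly random.

```python
import secrets


def new_trace_id() -> str:
    """Generate a random 128-bit trace_id as 32 hex characters."""
    return secrets.token_hex(16)


def head_sample(trace_id: str, rate: float) -> bool:
    """Keep the trace if the low 64 bits of its ID fall below rate * 2**64."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    threshold = int(rate * (1 << 64))
    return int(trace_id[-16:], 16) < threshold


# At a 10% rate roughly one trace in ten is kept; the exact count varies per run.
kept = sum(head_sample(new_trace_id(), 0.10) for _ in range(10_000))
print(f"kept {kept} of 10000 traces")
```

Because the decision ignores everything except the ID, an error trace has the same 10% chance of surviving as a healthy one, which is the weakness listed above.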

Tail-based sampling buffers spans until the trace completes, then decides based on full context — was there an error, did latency exceed a threshold?

Always keeps 100% of error traces and slow-tail traces
Drops successful fast traces at a low base rate (0.5–5%)
Cost: every span still flows through the collector even if eventually dropped — collector CPU and memory scale with raw traffic, not sampled volume
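The buffering and the keep-or-drop policy can be sketched like this. It is a toy in-memory model under stated assumptions: spans are plain dicts with `trace_id`, `duration_ms`, and an optional `error` key; the caller signals trace completion explicitly (real collectors usually wait out a time window instead); and the 1 000 ms slow threshold and 1% base rate are placeholder values. `TailSampler` is a hypothetical class, not a collector API.

```python
import random
from collections import defaultdict


class TailSampler:
    def __init__(self, slow_threshold_ms=1000.0, base_rate=0.01, rng=None):
        self.slow_threshold_ms = slow_threshold_ms
        self.base_rate = base_rate
        self.rng = rng or random.Random()
        self.buffer = defaultdict(list)  # trace_id -> buffered span dicts

    def add_span(self, span):
        # Every span is buffered, whether or not its trace is kept later.
        self.buffer[span["trace_id"]].append(span)

    def finish_trace(self, trace_id):
        """Decide with full context; return the kept spans, or [] if dropped."""
        spans = self.buffer.pop(trace_id, [])
        if not spans:
            return []
        has_error = any(s.get("error") for s in spans)
        is_slow = max(s["duration_ms"] for s in spans) >= self.slow_threshold_ms
        if has_error or is_slow or self.rng.random() < self.base_rate:
            return spans
        return []


sampler = TailSampler()
sampler.add_span({"trace_id": "t1", "duration_ms": 35.0})
sampler.add_span({"trace_id": "t1", "duration_ms": 12.0, "error": True})
print(len(sampler.finish_trace("t1")))  # the error trace is kept in full
```

Note that `add_span` does the same work for traces that end up dropped, which is where the collector cost described above comes from.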

The production pattern combines both: head-based at 10–20% baseline to limit collector input; tail-based policies on top to keep 100% of errors and slow traces from what arrives.
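One consequence of layering the two is worth working through: the tail policy only sees what the head sampler lets through, so "100% of errors" means 100% of the errors that arrive. The sketch below assumes the head decision is uniformly random and independent of whether a trace contains an error; the function name and the example rates are illustrative.

```python
def effective_retention(head_rate: float, tail_keep_rate: float) -> float:
    """Fraction of all traces in a class that end up stored."""
    return head_rate * tail_keep_rate


head_rate = 0.10  # head-based baseline
classes = {
    "error traces": 1.0,       # tail policy keeps all that arrive
    "slow traces": 1.0,
    "fast successful": 0.01,   # tail base rate
}

for name, tail_rate in classes.items():
    share = effective_retention(head_rate, tail_rate)
    print(f"{name}: {share:.1%} of all such traces are stored")

# Collector input also shrinks: at 1 000 req/s and 10 spans/req,
# a 10% head rate means about 1 000 spans/s arrive instead of 10 000.
print(f"collector input: {1_000 * 10 * head_rate:.0f} spans/s")
```

With these numbers about 10% of all error traces are stored. If that is too few, the head rate is the knob to raise, at the price of more collector input.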

Head-based decides at trace start and costs nothing for dropped traces, but misses rare errors. Tail-based buffers every span and keeps 100% of errors and slow traces — at collector overhead that scales with raw traffic, not kept volume.

Why this works

The W3C traceparent header ends with an 8-bit trace-flags field whose lowest bit is the sampled flag (01 = sampled, 00 = not sampled); this is how the head-based decision propagates downstream. When the flag arrives as 00, downstream services that respect the parent’s decision (for example, through a parent-based sampler) do not emit spans, which keeps collector load proportional to the sampling rate rather than raw traffic. This is what the Elastic Observability Labs post-mortem (2024) identified as the correct mechanism. It is also why naive head sampling with trace_ids derived from the request path is risky: the sampling decision can accidentally become correlated with request properties.
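Propagation of the sampled flag across hops can be sketched as below. This is a simplified model of parent-based behaviour, assuming well-formed version `00` headers and skipping validation; `propagate` and its `root_decision` parameter are hypothetical names.

```python
import secrets
from typing import Optional, Tuple


def propagate(incoming: Optional[str], root_decision: bool) -> Tuple[bool, str]:
    """Return (record_span, outgoing traceparent) for one service hop."""
    span_id = secrets.token_hex(8)  # fresh 64-bit span_id for this hop
    if incoming is None:
        # No caller context: this service is the root and makes the head decision.
        trace_id = secrets.token_hex(16)
        sampled = root_decision
    else:
        _version, trace_id, _parent_span_id, flags = incoming.split("-")
        # Inherit the caller's decision from bit 0 of the flags byte.
        sampled = bool(int(flags, 16) & 0x01)
    flags_out = "01" if sampled else "00"
    return sampled, f"00-{trace_id}-{span_id}-{flags_out}"


# The edge drops the trace; two downstream hops inherit that decision.
record_edge, header = propagate(None, root_decision=False)
record_cart, header = propagate(header, root_decision=True)
record_db, header = propagate(header, root_decision=True)
print(record_edge, record_cart, record_db)
```

The `root_decision` argument is ignored whenever a caller context exists, so one decision at the edge governs the whole trace.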

Figure (trace waterfall): Each bar is one span; a parent's bar fully spans its children in time (parent_span_id links them). The waterfall makes the slow recommendations span obvious. Head sampling decides at the root before children run; tail sampling buffers the whole tree, then keeps it because a span crossed the slow-latency policy.

Trace volume and sampling numbers

Spans per request (mid-size architecture): 10–100
Spans per day (1k req/s, 10 spans/req): ~864 M
Head-based sampling typical rate: 1–10%
Tail-based retention, error traces: 100%
Tail-based retention, slow-tail traces (p99+): 100%
Tail-based retention, successful fast traces (baseline): 0.5–5%
Tail-sample buffer window: 30–60 s per trace
OTLP (protobuf) payload size vs JSON: ~50–70% smaller
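The buffer window translates directly into collector memory. A rough steady-state estimate, assuming every span is held for the full window, decimal units, and no per-span bookkeeping overhead (all simplifications), looks like this; `tail_buffer_bytes` is an illustrative helper.

```python
def tail_buffer_bytes(requests_per_second: int, spans_per_request: int,
                      window_seconds: int, bytes_per_span: int) -> int:
    """Approximate bytes held in the tail-sampling buffer at steady state."""
    return requests_per_second * spans_per_request * window_seconds * bytes_per_span


small = tail_buffer_bytes(1_000, 10, 30, 1_000)    # 10 spans/req, 30 s, 1 KB
large = tail_buffer_bytes(1_000, 100, 60, 5_000)   # 100 spans/req, 60 s, 5 KB

print(f"low end:  {small / 1e9:.1f} GB buffered")
print(f"high end: {large / 1e9:.1f} GB buffered")
```

Neither figure depends on how many traces are eventually kept, only on raw traffic and the window length.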

Quiz

A service emits 100% of traces and stores them all. The bill triples within a week. What is the most common production fix?

Quiz

Why is tail-based sampling more expensive in collector overhead than head-based, even when both keep roughly the same final stored volume?

Quiz

A request enters the system with traceparent ending in '-00' (sampled flag = 0). A downstream service wants to record its span anyway. Which specification defines what must change for further downstream services to also record their spans?

Recall before you leave

01 What are the four fields that link spans into a trace tree, and what does each one identify?
02 Why is tail-based sampling more expensive in collector overhead than head-based, given the same final stored volume?
03 A service emits 10–100 spans per request at 1 000 req/s. Estimate the daily span count and explain why 100% storage is not viable.

Recap

A distributed trace is a tree of spans connected by trace_id and parent_span_id, propagated across service boundaries via the W3C traceparent header. The collector reconstructs the tree; the UI shows a waterfall of service calls and their durations. Storage scales with spans per request times request rate: at 1 000 req/s and 10–100 spans per request, that is 864 million to 8.64 billion spans per day, which makes 100% retention uneconomical. Head-based sampling is cheapest (zero collector cost for dropped traces) but misses rare error events. Tail-based sampling keeps 100% of errors and slow-tail traces but buffers every span in collector memory until the trace completes, so its overhead scales with raw traffic, not sampled volume. Production pattern: head-based at 10–20% combined with tail-based policies for errors and slow tails. When you see a tracing bill spike or an incident where the relevant trace is missing, you will know which of the two sampling layers failed and which config knob to turn first.


