Observability

Head sampling and tail sampling: deciding which traces survive

Crux: Head sampling is cheap but blind. It makes the keep/drop decision at trace start without seeing the outcome. Tail sampling sees the whole trace before deciding, catching errors and slow requests that head sampling drops, at the cost of collector RAM and routing discipline.

Your service handles 10,000 requests per second. Storing every trace is prohibitively expensive. But if you sample randomly at 1%, the one slow request that triggered the customer complaint has a 99% chance of being silently discarded.

Head-based sampling

The keep/drop decision is made at trace start and encoded in the sampled bit of the traceparent header's trace-flags field (01 means sampled). The decision is propagated to all downstream services, which honour it by default.
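To make the flag concrete, here is a minimal sketch of reading the sampled bit from a W3C `traceparent` header. The function and type names are illustrative, not part of any SDK, and the parser only handles the four-field version `00` layout.

```python
from typing import NamedTuple


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> TraceParent:
    """Split a version-00 traceparent header and read the sampled bit."""
    # Layout: version "-" trace-id (32 hex) "-" parent-id (16 hex) "-" flags (2 hex)
    version, trace_id, parent_id, flags = header.strip().split("-")
    if len(trace_id) != 32 or len(parent_id) != 16 or len(flags) != 2:
        raise ValueError(f"malformed traceparent: {header!r}")
    # trace-flags is one byte in hex; its lowest bit is the sampled flag.
    sampled = bool(int(flags, 16) & 0x01)
    return TraceParent(version, trace_id, parent_id, sampled)
```

A header ending in `-01` parses with `sampled=True`; one ending in `-00` parses with `sampled=False`.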

Common strategies:

  • Probabilistic: keep N% of traces (1–5% typical). Simple, predictable, scales linearly with traffic.
  • Rate-limiting: keep at most K traces per second regardless of traffic volume.
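Both strategies can be sketched in a few lines. This is an illustration under stated assumptions, not how any particular SDK implements them: the probabilistic sampler derives its decision from the low 64 bits of the trace-id (assumed to be uniformly random), and the rate limiter is a plain token bucket with an injectable clock. All names are hypothetical.

```python
import time


def keep_probabilistic(trace_id_hex: str, ratio: float) -> bool:
    """Keep a trace if its id falls in the lowest `ratio` share of the id space."""
    # Deriving the decision from the trace-id means the same trace always
    # gets the same answer, no matter where the function is evaluated.
    bound = int(ratio * (1 << 64))
    return int(trace_id_hex[-16:], 16) < bound


class RateLimitingSampler:
    """Token bucket: keep at most `traces_per_second` traces per second."""

    def __init__(self, traces_per_second: float, clock=time.monotonic):
        self.rate = traces_per_second
        self.clock = clock
        self.tokens = traces_per_second  # start with a full bucket
        self.last = clock()

    def keep(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at one second's budget.
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

The probabilistic sampler's output volume grows with traffic, while the rate limiter's stays flat. That is the trade-off between the two bullets above.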

Cost: cheap. The decision is made once at the root span, before any work is done. Unsampled requests record and export no spans, saving CPU, network, and storage.

Drawback: the decision is blind to the outcome. A slow or error-prone request that happens to be in the unsampled 99% is invisible in the tracing backend. If an incident hits 0.5% of traffic and you sample at 1%, you keep only about 1% of the incident traces: at 10,000 requests per second that is 50 incident requests per second, of which roughly one every two seconds survives. If the incident is brief, you might keep none.
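The arithmetic behind this drawback can be worked through directly. The sketch below assumes each request is sampled independently at the head rate; the function name and the one-second incident duration are illustrative choices.

```python
def incident_sampling_odds(req_per_s: float, incident_fraction: float,
                           sample_rate: float, duration_s: float):
    """Return (expected incident traces kept, probability of keeping none)."""
    incident_requests = req_per_s * incident_fraction * duration_s
    expected_kept = incident_requests * sample_rate
    # Each incident request is dropped independently with probability 1 - sample_rate.
    p_keep_none = (1.0 - sample_rate) ** incident_requests
    return expected_kept, p_keep_none


# 10,000 req/s, incident on 0.5% of traffic, 1% head sampling, 1-second incident
expected, p_none = incident_sampling_odds(10_000, 0.005, 0.01, 1.0)
print(f"expected kept: {expected:.2f}, chance of keeping none: {p_none:.0%}")
```

Under these assumptions a one-second incident produces 50 affected requests, and the chance that not one of them is sampled is about 60%.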

The sampled flag: when the flag is 01, downstream services record and export their spans. When it is 00, OTel SDKs create spans internally but do not export them by default. This is consistent sampling: either the whole trace is kept or none of it, never a fragment. A downstream service may override the incoming flag (for example, always sample its own errors), but partial overrides produce fragmentary traces that are nearly useless for debugging.
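The default behaviour amounts to "follow the parent if there is one". A minimal sketch, assuming the incoming trace-flags byte has already been parsed to an integer (or is `None` for a root request); `should_record` and `root_sampler` are hypothetical names, not an SDK API.

```python
from typing import Callable, Optional


def should_record(incoming_flags: Optional[int],
                  root_sampler: Callable[[], bool]) -> bool:
    """Honour the caller's sampled bit; only root requests make a fresh decision."""
    if incoming_flags is None:
        # No traceparent arrived: this service starts the trace and decides.
        return root_sampler()
    # A parent exists: copy its decision so the trace stays whole.
    return bool(incoming_flags & 0x01)
```

Because every non-root service copies the decision instead of making its own, a trace is either recorded end to end or not at all.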

Tail-based sampling

The OTel Collector buffers all spans for a trace-id for a configurable decision window (30s–5min), then decides whether to keep the trace based on policies:

  • Error present → keep 100%.
  • Duration > threshold → keep 100%.
  • Specific attribute (e.g. user.tier=premium) → keep 100%.
  • Probabilistic top-up → keep 1% of the rest for baseline visibility.
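The policy list above can be read as an ordered chain evaluated once the window closes. The sketch below is an illustration only, not the Collector's implementation: the `Span` shape, the 1,000 ms threshold, and the `user.tier` check are assumptions, and the trace is assumed to contain at least one span.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, Optional, Sequence


@dataclass
class Span:
    start_ms: float
    end_ms: float
    is_error: bool = False
    attributes: dict = field(default_factory=dict)


def tail_decision(spans: Sequence[Span],
                  latency_threshold_ms: float = 1000.0,  # hypothetical threshold
                  baseline_ratio: float = 0.01,
                  rng: Callable[[], float] = random.random) -> Optional[str]:
    """Return the name of the policy that keeps the trace, or None to drop it."""
    if any(s.is_error for s in spans):
        return "error"
    # Trace duration: earliest span start to latest span end.
    duration_ms = max(s.end_ms for s in spans) - min(s.start_ms for s in spans)
    if duration_ms > latency_threshold_ms:
        return "latency"
    if any(s.attributes.get("user.tier") == "premium" for s in spans):
        return "attribute"
    # Probabilistic top-up for everything the other policies did not match.
    if rng() < baseline_ratio:
        return "baseline"
    return None
```

Returning the policy name rather than a bare boolean makes it easy to count how many kept traces each policy contributes.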

Advantages:

  • Catches every error trace, even at 0.1% traffic rate.
  • Catches every slow trace above the latency threshold.
  • Selects traces by outcome or attribute rather than by chance, which is why tail sampling is the dominant pattern at high-traffic services.

Cost: the collector must hold every trace’s spans in memory until decision time.

Memory model: active_traces × avg_spans_per_trace × bytes_per_span. At 50,000 in-flight traces × 100 spans × 1 KB per span = 5 GB RAM. The decision window directly controls the RAM footprint.
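The memory model is a single multiplication, which makes it easy to test sizing assumptions before deploying. A minimal sketch with a hypothetical function name; it counts raw span bytes only and ignores any per-trace bookkeeping overhead in the collector.

```python
def tail_sampler_ram_bytes(active_traces: int, avg_spans_per_trace: int,
                           bytes_per_span: int) -> int:
    """Estimate buffered span bytes: active_traces x spans/trace x bytes/span."""
    return active_traces * avg_spans_per_trace * bytes_per_span


# The example above, taking 1 KB as 1,000 bytes
ram = tail_sampler_ram_bytes(50_000, 100, 1_000)
print(f"{ram / 1e9:.1f} GB")  # 5.0 GB
```

Each factor scales the result linearly, so doubling the average spans per trace doubles the footprint even when the trace count is unchanged.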

Load-balancing exporter requirement: with multiple collector replicas, random span distribution scatters a trace’s spans across different instances. Each instance only sees fragments and cannot make a correct keep/drop decision. The solution is a load-balancing exporter that hashes by trace-id and routes all spans for one trace to the same collector instance. This is mandatory for tail sampling to work correctly.
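The routing idea can be sketched as a hash of the trace-id mapped onto the list of collector instances. This is a simplified modulo scheme for illustration; it does not show how a real load-balancing exporter handles replicas joining or leaving, and the function name and hostnames are hypothetical.

```python
import hashlib
from typing import Sequence


def pick_collector(trace_id_hex: str, collectors: Sequence[str]) -> str:
    """Map a trace-id to one collector so all of its spans land together."""
    digest = hashlib.sha256(bytes.fromhex(trace_id_hex)).digest()
    index = int.from_bytes(digest[:8], "big") % len(collectors)
    return collectors[index]


collectors = ["collector-0:4317", "collector-1:4317", "collector-2:4317"]
trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
# Every span of this trace, from any service, resolves to the same instance.
assert pick_collector(trace_id, collectors) == pick_collector(trace_id, collectors)
```

The key property is determinism: the target depends only on the trace-id and the collector list, never on which service emitted the span.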

| Dimension | Head sampling | Tail sampling |
| --- | --- | --- |
| Decision time | At trace start (head) | After all spans complete (tail) |
| Sees outcome? | No | Yes (error, latency, attributes) |
| Collector RAM | Minimal | Proportional to active traces × spans × span size |
| Routing requirement | None (stateless) | Load-balancing exporter (trace-id hash) |
| Misses error traces? | Yes (at rate = 1 − sample rate) | No (if error policy = 100%) |

The hybrid pattern (dominant in production)

Head-sample at 100% (every request enters the pipeline), then tail-sample by policy:

  • Error traces → 100% keep.
  • Latency > 99th percentile threshold → 100% keep.
  • Everything else → 1% probabilistic.

This gives the storage volume control that head sampling is normally used for, together with the selectivity of tail sampling. The cost is one piece of additional infrastructure: the tail-sampling Collector tier, with a load-balancing exporter and sufficient RAM.

At 10k req/s with a 30s decision window, 10 spans/trace, and 1 KB/span: 10,000 × 30 × 10 × 1,024 B ≈ 3 GB of collector RAM in total. Doable with 4–8 collector replicas.
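The same estimate can be split per replica. The sketch below assumes one trace per request, that every trace stays buffered for the full decision window, and that trace-id hashing spreads traces evenly across replicas; the function name is hypothetical.

```python
def hybrid_tier_ram(req_per_s: int, window_s: int, spans_per_trace: int,
                    bytes_per_span: int, replicas: int):
    """Return (total buffered bytes, bytes per replica) for the tail tier."""
    # Traces in flight = arrival rate x time each trace waits for a decision.
    in_flight_traces = req_per_s * window_s
    total = in_flight_traces * spans_per_trace * bytes_per_span
    return total, total / replicas


for replicas in (4, 8):
    total, per_replica = hybrid_tier_ram(10_000, 30, 10, 1_024, replicas)
    print(f"{replicas} replicas: {total / 1e9:.2f} GB total, "
          f"{per_replica / 1e6:.0f} MB each")
```

Under these assumptions each replica buffers roughly 770 MB with four replicas and roughly 380 MB with eight, before any headroom for traffic spikes.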

Why this works

The hybrid pattern is why “we need to keep all error traces” and “we can’t afford to store everything” are not mutually exclusive. Head sampling at 100% admits every request into the pipeline without committing any storage; the tail-sampling tier then applies the 100%-for-errors policy and discards most of the rest. Engineers who try to solve this with head sampling alone either store everything (expensive) or miss errors (unreliable). The two-tier design satisfies both constraints.

Sampling cost reference numbers
  • Typical head sample rate: 0.5–5% of traces
  • Typical tail decision window: 30s–5 min
  • Tail-sampler RAM at 50k traces × 100 spans × 1 KB: ~5 GB
  • Tail-sampler RAM at 10k req/s, 30s window, 10 spans, 1 KB: ~3 GB
  • Load-balancing exporter routing key: trace-id hash
  • Consistent sampling: a trace is kept at 100% or 0%, never as a fragment
Quiz

A team chooses tail-based sampling so they can keep all error traces. What is the operational catch they must plan for?

Quiz

When traceparent arrives with the sampled flag set to `00`, what should the receiving service do by default?

Quiz

A tail-sampling Collector OOMs every few hours. The metrics show trace count is steady but spans-per-trace is growing. What is the likely cause?

Recall before you leave
  1. Why does head sampling miss error traces and what is the rate of missing?
  2. Explain the load-balancing exporter and why tail sampling breaks without it.
  3. Describe the hybrid head-100% + tail-policy pattern and when each tier acts.
Recap

Head sampling makes the keep/drop decision at trace start using the traceparent sampled flag, propagating the decision to all downstream services. It is cheap, because unsampled requests record and export no spans, but blind to outcomes: a 1% head rate drops 99% of error traces alongside 99% of normal ones. Tail sampling buffers all spans in the OTel Collector until the decision window closes (30s–5min), then applies policies: keep all errors, keep all slow traces, keep 1% baseline. The cost is collector RAM (active-traces × spans/trace × bytes/span) and a mandatory load-balancing exporter that routes all spans for one trace to the same collector instance. The hybrid head-100% + tail-policy pattern is the production standard: head at 100% feeds everything into the pipeline; the tail tier decides what to persist.

