Sampling consistency and the tail-sampling Collector tier

Consistent sampling uses deterministic trace-id hashing so all services agree on keep/drop without coordination. The tail-sampling Collector tier requires a load-balancing exporter, a RAM budget model, and an explicit cap to prevent OOMs from long-running traces.

A tail-sampling Collector OOMs every few hours during peak traffic. The team bumps the memory limit and it OOMs again. Nobody asked: why does memory grow, what controls it, and what caps it?

Sampling consistency across services

If you already have head and tail sampling, why does consistent sampling matter as a separate concern? Because without it, service A might keep a trace while service B drops it — and the backend stitches together a fragment that looks like a complete trace but is missing half the story. “Consistent sampling” means all spans of one trace are either all-sampled or all-dropped. No partial traces — either the backend has a complete trace or it has nothing.

The mechanism: a probabilistic head sampler computes a deterministic hash of the trace-id, takes it modulo 100, and keeps the trace if the result falls below the configured percentage. Because the hash is deterministic and the trace-id is identical in every service (propagated in the traceparent header), every service that uses the same hash function and the same sampling rate independently arrives at the same keep/drop decision. No coordination is needed.
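
The keep/drop rule described above can be sketched in a few lines. This is a minimal illustration, assuming every service uses the same hash and the same keep percentage; `keep_trace` and the choice of SHA-256 are hypothetical and are not the algorithm of any particular SDK.

```python
import hashlib
import random


def keep_trace(trace_id_hex: str, keep_percent: int) -> bool:
    """Deterministic keep/drop: the same trace-id always lands in the same bucket."""
    digest = hashlib.sha256(bytes.fromhex(trace_id_hex)).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return bucket < keep_percent


# Three services see the same trace-id (propagated via traceparent) and decide
# independently; the decisions are identical by construction.
trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
decisions = [keep_trace(trace_id, keep_percent=10) for _service in ("A", "B", "C")]

# Over many random trace-ids the kept share approaches keep_percent.
rng = random.Random(0)
sample_ids = [f"{rng.getrandbits(128):032x}" for _ in range(10_000)]
kept_share = sum(keep_trace(t, 10) for t in sample_ids) / len(sample_ids)
```

If service B were configured with a different `keep_percent` than service A, the buckets would still match but the decisions could differ, which is exactly how partial traces appear.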

W3C Trace Context Level 2 (2024) formalises this with the random-trace-id flag (bit 1 of the trace-flags field, value 0x02): when set, it asserts that at least the rightmost 7 bytes of the trace-id were generated randomly, which makes hash-based consistent sampling safe to rely on. Consistent hash-based samplers treat this flag as a prerequisite; if it is absent, they may fall back to a simpler approach.
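
As a rough sketch of how a sampler might check that prerequisite, the snippet below reads the trace-flags field of a traceparent header. The function name `parse_trace_flags` is hypothetical, and the parsing assumes a well-formed header with no validation.

```python
def parse_trace_flags(traceparent: str) -> dict:
    """Split a traceparent header and decode the two flag bits discussed here."""
    _version, trace_id, _parent_id, flags_hex = traceparent.split("-")[:4]
    flags = int(flags_hex, 16)
    return {
        "trace_id": trace_id,
        "sampled": bool(flags & 0x01),          # bit 0: sampled
        "random_trace_id": bool(flags & 0x02),  # bit 1: random-trace-id
    }


# Flags 03 = sampled + random-trace-id; flags 01 = sampled only.
with_random_flag = parse_trace_flags(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-03"
)
without_random_flag = parse_trace_flags(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
)
```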

Why partial traces are worse than no traces:

A trace missing service B’s span looks like service A called service C directly. Latency attribution is wrong.
Percentile calculations over partial traces produce systematic bias — short spans are over-represented.
On-call engineers diagnose based on what they see; partial traces lead to confident wrong diagnoses.

The tail-sampler memory model

The tail-sampling Collector holds all spans of every active trace in memory until the decision window closes. Memory usage:

RAM = active_traces × avg_spans_per_trace × bytes_per_span
    = (in_flight_request_rate × decision_window_seconds) × avg_spans × span_size

At 10k req/s, 30s window, 10 spans/trace, 1 KB/span: 10,000 × 30 × 10 × 1,024 bytes ≈ 3.07 × 10^9 bytes, or ~3 GB for a Collector instance that receives all of that traffic
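
The arithmetic above can be turned into a small calculator for experimenting with each factor. This is a back-of-the-envelope sketch: it assumes one trace per request and ignores per-trace bookkeeping overhead, and the function name is hypothetical.

```python
def tail_sampler_ram_bytes(
    requests_per_second: int,
    decision_window_s: int,
    avg_spans_per_trace: int,
    bytes_per_span: int,
) -> int:
    """Estimate buffered span bytes: active traces x spans per trace x span size."""
    active_traces = requests_per_second * decision_window_s
    return active_traces * avg_spans_per_trace * bytes_per_span


baseline = tail_sampler_ram_bytes(10_000, 30, 10, 1_024)         # 3_072_000_000
shorter_window = tail_sampler_ram_bytes(10_000, 10, 10, 1_024)   # 1_024_000_000
span_explosion = tail_sampler_ram_bytes(10_000, 30, 100, 1_024)  # 30_720_000_000
```

Changing one argument at a time shows how a tenfold jump in spans per trace outweighs any realistic increase in the memory limit.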

Every factor is an independent OOM lever:

Factor	What inflates it	Mitigation
active_traces (request rate)	Traffic spike	Scale replicas
spans_per_trace	Long-running batch jobs, recursive instrumentation	Break long traces with span-links
bytes_per_span	Large attribute values, excessive metadata	Attribute value size limits
decision_window	Conservatively set too long	Shorten the window; tune per-service SLA; 30s is typical

Each lever is independent — you can fix a span explosion without scaling replicas, and you can fix a traffic spike without changing span structure. When the collector OOMs, inspect which of these four is growing before reaching for a larger instance.

Collector RAM is a product, not a single number: 10k req/s × 30s × 10 spans × 1 KB ≈ 3 GB. Each of the four levers is an independent OOM cause with its own fix.

The load-balancing exporter requirement

Multiple Collector replicas are standard in production (5–20 is common). Without routing, spans for one trace scatter randomly across replicas — no instance sees the full trace and none can make a correct policy decision.

The OTel Collector ships a loadbalancing exporter that hashes the trace-id and routes every span of a trace to the same replica. Replica discovery mechanisms: a static list, DNS (a hostname that resolves to every replica), or the Kubernetes endpoints API.
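
To make the difference between round-robin and trace-id routing concrete, here is a toy sketch. It assumes three replicas with hypothetical names and uses a plain hash-modulo router for brevity; the real exporter's hashing scheme may differ.

```python
import hashlib
from itertools import cycle

REPLICAS = ["collector-0", "collector-1", "collector-2"]  # hypothetical names


def route_by_trace_id(trace_id: str) -> str:
    """Pick a replica from the trace-id alone, so every span of a trace agrees."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    return REPLICAS[int.from_bytes(digest[:8], "big") % len(REPLICAS)]


# Interleaved spans from two traces, in arrival order: (trace-id, span name).
spans = [("ab", "A1"), ("cd", "C1"), ("ab", "A2"), ("cd", "C2"), ("ab", "A3")]


def replicas_per_trace(assign) -> dict:
    """Return, for each trace-id, the set of replicas that received its spans."""
    seen = {}
    for trace_id, _span in spans:
        seen.setdefault(trace_id, set()).add(assign(trace_id))
    return seen


rr = cycle(REPLICAS)
round_robin = replicas_per_trace(lambda _trace_id: next(rr))  # ignores trace-id
hashed = replicas_per_trace(route_by_trace_id)
```

With round-robin, trace `ab` ends up spread over all three replicas; with trace-id routing, each trace is seen by exactly one.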

Operational gotchas:

Scale events: when a Collector replica is added or removed, the hash ring re-balances. Spans in flight for traces that hash to the rebalanced shard may arrive at the new replica before earlier spans — the tail sampler sees a partial trace and may decide prematurely. Mitigation: short decision windows (30s), graceful drain on scale-down.
Replica loss: if a Collector replica dies mid-decision-window, its in-flight traces are lost. Tail sampling does not persist to disk by default. This is acceptable: losing traces during a Collector failure is a known operational trade-off.
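
The scale-event gotcha can be illustrated with a generic consistent-hash ring. This is a textbook sketch, not the exporter's actual implementation; the class name, replica names, and the 100 virtual nodes per replica are all assumptions made for the example.

```python
import bisect
import hashlib


def _point(value: str) -> int:
    """Map a string to a position on the ring."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, replicas, vnodes=100):
        # Each replica owns several virtual points to even out the distribution.
        self._ring = sorted(
            (_point(f"{replica}#{i}"), replica)
            for replica in replicas
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def route(self, trace_id: str) -> str:
        # The first ring point clockwise from the trace-id's position owns it.
        idx = bisect.bisect_right(self._points, _point(trace_id)) % len(self._ring)
        return self._ring[idx][1]


trace_ids = [f"{i:032x}" for i in range(2_000)]
before = HashRing([f"collector-{n}" for n in range(5)])
after = HashRing([f"collector-{n}" for n in range(6)])  # scale-up adds collector-5

# Traces whose owner changed: their later spans arrive at a different replica.
moved = [t for t in trace_ids if before.route(t) != after.route(t)]
```

Only a fraction of trace-ids change owner on a scale-up, and all of them move to the new replica, but any of those traces that were mid-window are split across two Collectors until the window closes.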

The load-balancing exporter hashes by trace-id, so every span of trace ab (A1, A2, A3) lands on the same Collector (2); a different trace-id (cd) routes elsewhere. Only when one Collector holds the complete trace can the tail-sampler decide keep/drop correctly.

Preventing tail-sampler OOMs

A tail-sampler without a cap will buffer indefinitely until the process OOMs. The tail_sampling processor exposes num_traces as a hard cap on the number of active traces. When the cap is reached, the sampler begins evicting the oldest in-progress traces (incomplete traces are discarded).
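
One way to pick a value for the cap is to work backwards from the memory budget using the RAM formula. The sketch below is an assumption-laden estimate: the function name is hypothetical, and the 50% headroom factor is an arbitrary illustrative choice to leave room for runtime overhead, not a recommended value.

```python
def num_traces_for_budget(
    ram_budget_bytes: int,
    avg_spans_per_trace: int,
    bytes_per_span: int,
    headroom: float = 0.5,  # assumed share of the budget usable for span buffers
) -> int:
    """Largest trace count whose buffered spans fit in the usable budget."""
    usable_bytes = ram_budget_bytes * headroom
    return int(usable_bytes // (avg_spans_per_trace * bytes_per_span))


# 2 GiB memory limit, 10 spans/trace, 1 KB/span, half the budget for buffers.
cap = num_traces_for_budget(2 * 1024**3, 10, 1_024)
```

Under these assumptions the result is a little over 100,000 traces; a span explosion to 100 spans per trace would shrink the safe cap tenfold.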

Slack’s 2023 incident: tail-sampling Collectors took down their tracing pipeline during a major incident because num_traces was uncapped. The postmortem added the cap, added a separate high-priority always-keep tier for critical traces, and added an alert on the rate of otelcol_processor_dropped_spans.

The full configuration pattern:

processors:
  tail_sampling:
    decision_wait: 30s
    num_traces: 100000           # hard cap — evict oldest if exceeded
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-traces
        type: latency
        latency:
          threshold_ms: 2000
      - name: probabilistic
        type: probabilistic
        probabilistic:
          sampling_percentage: 1

Alert metric	What it means	Threshold
otelcol_processor_tail_sampling_sampling_traces_on_memory	Active in-flight traces	Alert at 80% of num_traces cap
otelcol_processor_dropped_spans	Spans dropped due to cap	Alert at any non-zero rate
container_memory_working_set_bytes (Collector pod)	Actual RAM used	Alert at 85% of memory limit
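
The three thresholds in the table can be expressed as a simple check, for example inside a runbook script. This is an illustrative sketch: the function name and alert labels are hypothetical, and the inputs are assumed to have been read from the metrics listed above.

```python
def firing_alerts(
    traces_on_memory: int,
    num_traces_cap: int,
    dropped_spans_per_s: float,
    working_set_bytes: int,
    memory_limit_bytes: int,
) -> list:
    """Return the names of the alerts whose thresholds are currently crossed."""
    alerts = []
    if traces_on_memory >= 0.80 * num_traces_cap:
        alerts.append("active-traces-near-cap")
    if dropped_spans_per_s > 0:
        alerts.append("spans-dropped")
    if working_set_bytes >= 0.85 * memory_limit_bytes:
        alerts.append("memory-near-limit")
    return alerts


healthy = firing_alerts(40_000, 100_000, 0.0, 1 * 1024**3, 4 * 1024**3)
near_cap = firing_alerts(85_000, 100_000, 0.0, 2 * 1024**3, 4 * 1024**3)
cap_firing = firing_alerts(100_000, 100_000, 120.0, 3_900_000_000, 4 * 1024**3)
```

The order of the checks mirrors the order in which the symptoms usually appear: active traces approach the cap, the cap starts evicting, and only then does the container approach its memory limit.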

Trace it

Tail-sampling Collector OOMs every few hours. Trace the root cause.

Step 1: tail-sampler OOM. What is the memory model?
Step 2: check the metrics. Which factor is growing?
Step 3: spans-per-trace blew up. Typical cause?
Step 4: trace count blew up. Typical cause?
Step 5: durable fix?

Tail-sampling Collector reference numbers

W3C Trace Context Level 1: Recommendation 2020-02
W3C Trace Context Level 2 (random-trace-id flag): Candidate Recommendation, 2024
Typical tail decision window: 30s
Collector RAM at 50k traces × 100 spans × 1 KB: ~5 GB
Typical production Collector replicas: 5–20
Routing key for load-balancing exporter: trace-id (deterministic hash)

Quiz

A service emits 5% orphan spans for internal (non-entry-point) services. The team adds tail sampling. The orphan rate stays. What's misunderstood?

Quiz

A Collector replica is removed during a scale-down event. What happens to traces whose decision window has not yet closed on that replica?

Recall before you leave

01 How does deterministic trace-id hashing achieve consistent sampling without coordination between services?
02 Explain the tail-sampler OOM failure mode and the three independent mitigations.
03 Why must the load-balancing exporter use trace-id hashing rather than round-robin or least-connections?

Recap

Consistent sampling uses a deterministic hash(trace-id) mod 100 so that all services sharing the same sampling rate independently agree on keep/drop, which avoids partial traces. The W3C Level 2 random-trace-id flag makes this safe by asserting that the trace-id was randomly generated. The tail-sampling Collector’s RAM is request_rate × decision_window × spans/trace × bytes/span; every factor is an independent OOM lever. A hard num_traces cap prevents unbounded growth; alerts on otelcol_processor_dropped_spans catch when the cap fires. The load-balancing exporter must use trace-id hashing, because round-robin scatters a trace’s spans across replicas and makes policy decisions wrong. Long-running traces (batch jobs accumulating thousands of spans) must be broken into sub-traces linked via span-links to stay within the decision window. When your tail-sampling Collector starts climbing in memory, you now know which metric to check first and which configuration knob to turn, because you understand the formula, not just the symptom.
