Observability

Sampling consistency and the tail-sampling Collector tier

Crux: Consistent sampling uses deterministic trace-id hashing so all services agree on keep/drop without coordination. The tail-sampling Collector tier requires a load-balancing exporter, a RAM budget model, and an explicit cap to prevent OOMs from long-running traces.

A tail-sampling Collector OOMs every few hours during peak traffic. The team bumps the memory limit and it OOMs again. Nobody asked: why does memory grow, what controls it, and what caps it?

Sampling consistency across services

“Consistent sampling” means all spans of one trace are either all-sampled or all-dropped. No partial traces — either the backend has a complete trace or it has nothing.

The mechanism: probabilistic head samplers derive the decision from a deterministic function of the trace-id, conceptually hash(trace-id) mod 100 compared against the sampling percentage. Because the function is deterministic and the trace-id is identical in every service (propagated in the traceparent header), every service that uses the same function and the same sampling rate independently arrives at the same keep/drop decision. No coordination is needed.
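A minimal Python sketch of that decision. The SHA-256 hash, the `keep_trace` name, and the bucket comparison are illustrative assumptions; real SDK samplers use their own threshold functions, so this models the idea rather than any library's implementation.

```python
import hashlib


def keep_trace(trace_id_hex: str, sampling_percentage: int) -> bool:
    """Return True when the trace-id hashes into the sampled bucket range."""
    digest = hashlib.sha256(bytes.fromhex(trace_id_hex)).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return bucket < sampling_percentage


# Three services see the same trace-id (propagated via traceparent)
# and each one decides on its own, with no shared state.
trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
decisions = [keep_trace(trace_id, 10) for _service in ("a", "b", "c")]
all_agree = len(set(decisions)) == 1
```

The agreement only holds while every service uses the same function and the same percentage; a service configured with a different rate can keep a trace its neighbours drop.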

W3C Trace Context Level 2 (2024) formalises this with the random-trace-id flag (bit 1 of trace-flags, value 0x02): when set, at least the rightmost 7 bytes of the trace-id were generated randomly, which makes it safe to derive the sampling decision from those bytes. Consistent hash-based samplers read this flag as a prerequisite; if it is absent, they may fall back to a simpler approach.
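As a rough illustration, the flag can be read straight from the traceparent header, whose layout is version-traceid-parentid-flags. The function name `parse_trace_flags` and the returned dictionary shape are hypothetical; only the header layout and the two flag bits come from the specification.

```python
SAMPLED_FLAG = 0x01          # bit 0 of trace-flags
RANDOM_TRACE_ID_FLAG = 0x02  # bit 1 of trace-flags (Trace Context Level 2)


def parse_trace_flags(traceparent: str) -> dict:
    """Split a traceparent header and decode the two trace-flags bits."""
    _version, trace_id, parent_id, flags_hex = traceparent.split("-")[:4]
    if len(trace_id) != 32 or len(parent_id) != 16 or len(flags_hex) != 2:
        raise ValueError("malformed traceparent")
    flags = int(flags_hex, 16)
    return {
        "trace_id": trace_id,
        "sampled": bool(flags & SAMPLED_FLAG),
        "random_trace_id": bool(flags & RANDOM_TRACE_ID_FLAG),
    }


header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-03"
parsed = parse_trace_flags(header)
```

A sampler built on this would check `random_trace_id` before trusting the trace-id bytes as a source of randomness.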

Why partial traces are worse than no traces:

  • A trace missing service B’s span looks like service A called service C directly. Latency attribution is wrong.
  • Percentile calculations over partial traces produce systematic bias — short spans are over-represented.
  • On-call engineers diagnose based on what they see; partial traces lead to confident wrong diagnoses.

The tail-sampler memory model

The tail-sampling Collector holds all spans of every active trace in memory until the decision window closes. Memory usage:

RAM = active_traces × avg_spans_per_trace × bytes_per_span
    = (in_flight_request_rate × decision_window_seconds) × avg_spans × span_size

Example: at 10k req/s, a 30s window, 10 spans/trace, and 1 KB/span, 10,000 × 30 × 10 × 1,024 B ≈ 3.07 GB for a Collector instance that receives all of that traffic.
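The budget model is easy to keep as a small calculator. This sketch uses the formula as written; the function names are hypothetical, and real in-memory span size is usually larger than the wire size, so treat the result as a lower bound.

```python
def tail_sampler_ram_bytes(traces_per_second, decision_window_seconds,
                           avg_spans_per_trace, bytes_per_span):
    """Estimate buffered bytes: active traces x spans per trace x bytes per span."""
    active_traces = traces_per_second * decision_window_seconds
    return active_traces * avg_spans_per_trace * bytes_per_span


def max_traces_for_budget(ram_budget_bytes, avg_spans_per_trace, bytes_per_span):
    """Invert the model: how many active traces fit in a given RAM budget."""
    return ram_budget_bytes // (avg_spans_per_trace * bytes_per_span)


# 10k req/s, 30s window, 10 spans/trace, 1 KB/span
estimate = tail_sampler_ram_bytes(10_000, 30, 10, 1_024)

# Largest trace count that fits a hypothetical 2 GiB buffer budget
cap = max_traces_for_budget(2 * 1024**3, 10, 1_024)
```

Running the inverse form against the pod's memory limit gives a starting value for a trace-count cap, leaving headroom for everything else the Collector holds in memory.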

Every factor is an independent OOM lever:

Factor | What inflates it | Mitigation
active_traces | Traffic spike, too-long window | Scale replicas, shorten window
spans_per_trace | Long-running batch jobs, recursive instrumentation | Break long traces with span-links
bytes_per_span | Large attribute values, excessive metadata | Attribute value size limits
decision_window | Conservatively set too long | Tune per-service SLA; 30s is typical

The load-balancing exporter requirement

Multiple Collector replicas are standard in production (5–20 is common). Without routing, spans for one trace scatter randomly across replicas — no instance sees the full trace and none can make a correct policy decision.

The OTel Collector ships a loadbalancing exporter that hashes by trace-id and routes every span of a trace to the same replica. Backend discovery is handled by a resolver: a static list, DNS (for example a headless Kubernetes service), or the Kubernetes endpoints API.
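The difference between trace-id routing and round-robin can be sketched in a few lines. The replica names are hypothetical, and the modulo mapping is a simplification: the exporter itself uses a consistent hash ring, not this function.

```python
import hashlib
from itertools import cycle

REPLICAS = ["collector-0", "collector-1", "collector-2"]  # hypothetical names


def route_by_trace_id(trace_id_hex, replicas):
    """Pick a replica from the trace-id alone, so every span agrees."""
    digest = hashlib.sha256(bytes.fromhex(trace_id_hex)).digest()
    return replicas[int.from_bytes(digest[:8], "big") % len(replicas)]


def replicas_seen(trace_id_hex, span_count, replicas, strategy):
    """Return the set of replicas that receive spans of one trace."""
    round_robin = cycle(replicas)
    seen = set()
    for _ in range(span_count):
        if strategy == "trace_id":
            seen.add(route_by_trace_id(trace_id_hex, replicas))
        else:
            seen.add(next(round_robin))
    return seen


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
hashed = replicas_seen(trace_id, 4, REPLICAS, "trace_id")
scattered = replicas_seen(trace_id, 4, REPLICAS, "round_robin")
```

With trace-id routing the four spans land on a single replica; with round-robin they spread over all three, so no tail sampler holds the complete trace.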

Operational gotchas:

  • Scale events: when a Collector replica is added or removed, the hash ring re-balances. Traces that hash to a rebalanced shard get split: spans already buffered stay on the old replica while later spans go to the new one, so each tail sampler sees a partial trace and may decide on incomplete data. Mitigation: short decision windows (30s), graceful drain on scale-down.
  • Replica loss: if a Collector replica dies mid-decision-window, its in-flight traces are lost. Tail sampling does not persist to disk by default. This is acceptable: losing traces during a Collector failure is a known operational trade-off.
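To get a feel for how many traces a scale-down touches, the sketch below uses rendezvous hashing as a stand-in for the exporter's hash ring (an assumption made for brevity, not the exporter's actual algorithm). Replica names and trace-ids are synthetic.

```python
import hashlib


def owner(trace_id_hex, replicas):
    """Rendezvous hashing: the replica with the highest score owns the trace."""
    def score(replica):
        return hashlib.sha256(f"{replica}:{trace_id_hex}".encode()).digest()
    return max(replicas, key=score)


before = ["collector-0", "collector-1", "collector-2", "collector-3"]
after = [r for r in before if r != "collector-3"]  # scale-down removes one replica

trace_ids = [f"{i:032x}" for i in range(2_000)]  # synthetic trace-ids
moved = [t for t in trace_ids if owner(t, before) != owner(t, after)]
moved_fraction = len(moved) / len(trace_ids)
```

Only traces owned by the removed replica change owner, roughly a quarter of them here. Those are the traces at risk of being split or lost if the replica is not drained first.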

Preventing tail-sampler OOMs

A tail sampler without an effective cap keeps buffering until the process OOMs. The tail_sampling processor exposes num_traces as a hard cap on the number of traces held in memory (the upstream default is 50,000, which can already exceed a small memory limit). When the cap is reached, the sampler evicts the oldest in-progress traces; evicted traces that are still incomplete are discarded.
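The eviction behaviour can be modelled with an ordered map. This is a sketch of the idea only: the class name `CappedTraceBuffer` is hypothetical and the processor's internal data structures differ.

```python
from collections import OrderedDict


class CappedTraceBuffer:
    """Hold spans per trace, evicting the oldest trace once the cap is hit."""

    def __init__(self, num_traces):
        if num_traces < 1:
            raise ValueError("num_traces must be at least 1")
        self.num_traces = num_traces
        self._traces = OrderedDict()  # trace_id -> spans, oldest trace first
        self.evicted = 0

    def add_span(self, trace_id, span):
        if trace_id not in self._traces:
            if len(self._traces) >= self.num_traces:
                self._traces.popitem(last=False)  # discard the oldest trace
                self.evicted += 1
            self._traces[trace_id] = []
        self._traces[trace_id].append(span)

    def __len__(self):
        return len(self._traces)

    def __contains__(self, trace_id):
        return trace_id in self._traces


buf = CappedTraceBuffer(num_traces=2)
for trace_id, span in [("t1", "a"), ("t2", "b"), ("t1", "c"), ("t3", "d")]:
    buf.add_span(trace_id, span)
```

After the fourth span, `t1` has been evicted even though it received a span recently: age is measured from the first span of the trace, which is why long-running traces are the first casualties of a tight cap.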

Slack’s 2023 incident: tail-sampling Collectors took down their tracing pipeline during a major incident because num_traces was uncapped. The postmortem added the cap, added a separate high-priority always-keep tier for critical traces, and added an alert on the rate of otelcol_processor_dropped_spans.

The full configuration pattern:

processors:
  tail_sampling:
    decision_wait: 30s
    num_traces: 100000           # hard cap — evict oldest if exceeded
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-traces
        type: latency
        latency:
          threshold_ms: 2000
      - name: probabilistic
        type: probabilistic
        probabilistic:
          sampling_percentage: 1
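
How the three policies combine can be sketched in Python. The assumptions here are that a trace is kept as soon as any policy matches, that trace latency is the latest span end minus the earliest span start, and that the threshold comparison is inclusive; the `Span` type and `matching_policy` function are hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Span:
    status: str    # "OK", "ERROR", or "UNSET"
    start_ms: int
    end_ms: int


def matching_policy(trace_id_hex, spans, threshold_ms=2000, sampling_percentage=1):
    """Return the name of the first policy that keeps the trace, or None."""
    if not spans:
        return None
    if any(s.status == "ERROR" for s in spans):
        return "errors"
    duration_ms = max(s.end_ms for s in spans) - min(s.start_ms for s in spans)
    if duration_ms >= threshold_ms:
        return "slow-traces"
    digest = hashlib.sha256(bytes.fromhex(trace_id_hex)).digest()
    if int.from_bytes(digest[:8], "big") % 100 < sampling_percentage:
        return "probabilistic"
    return None


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
failed = matching_policy(trace_id, [Span("OK", 0, 40), Span("ERROR", 10, 30)])
slow = matching_policy(trace_id, [Span("OK", 0, 2500)])
```

Errors and slow traces are always kept, and the 1% probabilistic policy supplies a baseline of ordinary traffic.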

Alert metric | What it means | Threshold
otelcol_processor_tail_sampling_sampling_traces_on_memory | Active in-flight traces | Alert at 80% of num_traces cap
otelcol_processor_dropped_spans | Spans dropped due to cap | Alert at any non-zero rate
container_memory_working_set_bytes (Collector pod) | Actual RAM used | Alert at 85% of memory limit

Trace it

Tail-sampling Collector OOMs every few hours. Trace the root cause.

  1. Tail-sampler OOM: what is the memory model?
  2. Check the metrics: which factor is growing?
  3. Spans-per-trace blew up: typical cause?
  4. Trace count blew up: typical cause?
  5. Durable fix?

Tail-sampling Collector reference numbers

W3C Trace Context Level 1: Recommendation 2020-02
W3C Trace Context Level 2 (random-trace-id flag): Recommendation 2024
Typical tail decision window: 30s
Collector RAM at 50k traces × 100 spans × 1 KB: ~5 GB
Typical production Collector replicas: 5–20
Routing key for load-balancing exporter: trace-id (deterministic hash)

Quiz

A service emits 5% orphan spans for internal (non-entry-point) services. The team adds tail sampling. The orphan rate stays. What's misunderstood?

Quiz

A Collector replica is removed during a scale-down event. What happens to traces whose decision window has not yet closed on that replica?

Recall before you leave
  1. How does deterministic trace-id hashing achieve consistent sampling without coordination between services?
  2. Explain the tail-sampler OOM failure mode and the three independent mitigations.
  3. Why must the load-balancing exporter use trace-id hashing rather than round-robin or least-connections?

Recap

Consistent sampling uses a deterministic hash(trace-id) mod 100 so that all services using the same function and rate independently agree on keep/drop, which avoids partial traces caused by disagreeing samplers. The W3C Level 2 random-trace-id flag makes this safe by asserting that the trace-id carries enough randomness. The tail-sampling Collector’s RAM is active_traces × spans/trace × bytes/span, where active_traces = request rate × decision_window; every factor is an independent OOM lever. A hard num_traces cap prevents unbounded growth; alerts on otelcol_processor_dropped_spans catch when the cap fires. The load-balancing exporter must use trace-id hashing, because round-robin scatters a trace’s spans across replicas and no replica can then make a correct policy decision. Long-running traces (batch jobs accumulating thousands of spans) must be broken into sub-traces linked via span-links to stay within the decision window.


Trademarks belong to their respective owners. Editorial reference only.