Observability
Sampling consistency and the tail-sampling Collector tier
A tail-sampling Collector OOMs every few hours during peak traffic. The team bumps the memory limit and it OOMs again. Nobody asked: why does memory grow, what controls it, and what caps it?
Sampling consistency across services
“Consistent sampling” means all spans of one trace are either all-sampled or all-dropped. No partial traces — either the backend has a complete trace or it has nothing.
The mechanism: probabilistic head-samplers use a deterministic hash of the trace-id modulo 100 to decide which traces to keep. Because the hash is deterministic and the trace-id is the same across all services (propagated by traceparent), every service that uses the same hash function and sampling rate independently arrives at the same keep/drop decision. No coordination is needed.
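The hash-and-bucket decision can be sketched in a few lines. This is a minimal illustration, not the algorithm of any particular SDK: the function name `keep_trace` and the choice of SHA-256 are assumptions made for the example, and real samplers typically compare bits of the trace-id against a threshold instead.

```python
import hashlib


def keep_trace(trace_id_hex: str, sampling_percentage: int) -> bool:
    """Return True if the trace falls inside the sampled bucket range."""
    # Hash the raw 16-byte trace-id. SHA-256 is an assumption for illustration.
    digest = hashlib.sha256(bytes.fromhex(trace_id_hex)).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return bucket < sampling_percentage


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"

# Three services evaluate the same trace-id with the same rate, independently.
decisions = [keep_trace(trace_id, 10) for _ in ("service-a", "service-b", "service-c")]
```

Every entry in `decisions` is identical because the only input is the trace-id. A service running at a lower percentage keeps a subset of what a higher-percentage service keeps, which is why mismatched rates still produce partial traces.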
W3C Trace Context Level 2 (2024) formalises this with the random-trace-id flag (bit 1, value 02) in the traceparent trace-flags field: when set, it asserts that at least the rightmost 7 bytes of the trace-id were generated randomly, which makes hash-based consistent sampling safe to rely on. Consistent hash-based samplers treat this flag as a prerequisite; if it is absent, they cannot assume the trace-id is uniformly distributed and may fall back to a simpler approach.
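Reading the flag is a bit test on the last field of the traceparent header. The sketch below assumes a well-formed header of the form `version-traceid-parentid-flags`; the helper name `parse_trace_flags` is hypothetical and input validation is omitted.

```python
SAMPLED_FLAG = 0x01          # bit 0: the caller sampled this trace
RANDOM_TRACE_ID_FLAG = 0x02  # bit 1: trace-id was generated randomly (Level 2)


def parse_trace_flags(traceparent: str) -> dict:
    """Split a traceparent header and decode the two flag bits."""
    _version, trace_id, _parent_id, flags_hex = traceparent.split("-")[:4]
    flags = int(flags_hex, 16)
    return {
        "trace_id": trace_id,
        "sampled": bool(flags & SAMPLED_FLAG),
        "random_trace_id": bool(flags & RANDOM_TRACE_ID_FLAG),
    }


header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-03"
info = parse_trace_flags(header)
```

A sampler built along these lines would only trust a hash of the trace-id when `info["random_trace_id"]` is true.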
Why partial traces are worse than no traces:
- A trace missing service B’s span looks like service A called service C directly. Latency attribution is wrong.
- Percentile calculations over partial traces produce systematic bias — short spans are over-represented.
- On-call engineers diagnose based on what they see; partial traces lead to confident wrong diagnoses.
The tail-sampler memory model
The tail-sampling Collector holds all spans of every active trace in memory until the decision window closes. Memory usage:
RAM = active_traces × avg_spans_per_trace × bytes_per_span
    = (in_flight_request_rate × decision_window_seconds) × avg_spans_per_trace × bytes_per_span

At 10k req/s, 30s window, 10 spans/trace, 1 KB/span:
10,000 × 30 × 10 × 1,024 bytes ≈ 3 GB per Collector instance
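The same arithmetic as a small helper, useful for trying out different windows or span sizes before changing a deployment. The function and parameter names are illustrative, and the estimate ignores per-trace bookkeeping overhead inside the Collector.

```python
def tail_sampler_ram_bytes(
    request_rate_per_s: int,
    decision_window_s: int,
    avg_spans_per_trace: int,
    bytes_per_span: int,
) -> int:
    """Estimate buffered span bytes held by one tail-sampling instance."""
    active_traces = request_rate_per_s * decision_window_s
    return active_traces * avg_spans_per_trace * bytes_per_span


estimate = tail_sampler_ram_bytes(10_000, 30, 10, 1_024)
print(f"{estimate / 1e9:.2f} GB")  # prints "3.07 GB"

# Halving the decision window halves the estimate.
shorter_window = tail_sampler_ram_bytes(10_000, 15, 10, 1_024)
```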
Every factor is an independent OOM lever:
| Factor | What inflates it | Mitigation |
|---|---|---|
| active_traces | Traffic spike, too-long window | Scale replicas, shorten window |
| spans_per_trace | Long-running batch jobs, recursive instrumentation | Break long traces with span-links |
| bytes_per_span | Large attribute values, excessive metadata | Attribute value size limits |
| decision_window | Conservatively set too long | Tune per-service SLA; 30s is typical |
The load-balancing exporter requirement
Multiple Collector replicas are standard in production (5–20 is common). Without routing, spans for one trace scatter randomly across replicas — no instance sees the full trace and none can make a correct policy decision.
The OTel Collector ships a loadbalancing exporter that hashes the trace-id and sends every span of a trace to the same replica. Backend discovery options: a static list, DNS (resolving a hostname to the current set of replica IPs), or the Kubernetes endpoints API.
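The difference between trace-id routing and round-robin shows up with a single trace of six spans. This is a simplified sketch: the replica names are hypothetical, and plain hash-modulo stands in for the exporter's actual ring-based routing.

```python
import hashlib
from itertools import cycle

REPLICAS = ["collector-0", "collector-1", "collector-2"]  # hypothetical names


def route_by_trace_id(trace_id_hex: str, replicas: list) -> str:
    """Pick a replica from the trace-id alone (simplified: hash modulo N)."""
    digest = hashlib.sha256(bytes.fromhex(trace_id_hex)).digest()
    return replicas[int.from_bytes(digest[:8], "big") % len(replicas)]


# Six spans that all belong to the same trace.
spans = [("4bf92f3577b34da6a3ce929d0e0e4736", f"span-{i}") for i in range(6)]

# Trace-id routing: every span lands on one replica.
hashed_targets = {route_by_trace_id(tid, REPLICAS) for tid, _ in spans}

# Round-robin: the same six spans are spread across all replicas.
round_robin = cycle(REPLICAS)
rr_targets = {next(round_robin) for _ in spans}
```

`hashed_targets` contains one replica, so that replica can evaluate latency and error policies over the whole trace. `rr_targets` contains all three, and each one holds only a fragment.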
Operational gotchas:
- Scale events: when a Collector replica is added or removed, the hash ring rebalances and some trace-ids map to a different replica. For a trace already in flight, earlier spans sit on the old replica while later spans arrive at the new one, so each replica sees a partial trace and may make the wrong decision. Mitigation: short decision windows (30s), graceful drain on scale-down.
- Replica loss: if a Collector replica dies mid-decision-window, its in-flight traces are lost. Tail sampling does not persist to disk by default. This is acceptable: losing traces during a Collector failure is a known operational trade-off.
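How many in-flight traces a scale event can split is easy to estimate with a toy hash ring. The sketch below is a generic consistent-hash ring with virtual nodes, written for illustration only; the class name, replica names, and virtual-node count are assumptions and do not mirror the exporter's internals.

```python
import bisect
import hashlib


def _h(key: str) -> int:
    """Map a string to a 64-bit position on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, replicas, vnodes=100):
        # Each replica owns several points on the ring to even out the load.
        self._points = sorted((_h(f"{r}#{v}"), r) for r in replicas for v in range(vnodes))
        self._keys = [p for p, _ in self._points]

    def owner(self, trace_id: str) -> str:
        # The first ring point clockwise from the trace-id's position owns it.
        idx = bisect.bisect(self._keys, _h(trace_id)) % len(self._points)
        return self._points[idx][1]


before = HashRing([f"collector-{i}" for i in range(5)])
after = HashRing([f"collector-{i}" for i in range(6)])  # scale-up by one replica

trace_ids = [format(i, "032x") for i in range(10_000)]
moved = [t for t in trace_ids if before.owner(t) != after.owner(t)]
moved_fraction = len(moved) / len(trace_ids)
```

Going from five to six replicas moves roughly one sixth of the trace-ids, all of them to the new replica. Any of those traces that were mid-window at the time of the change end up split between two replicas.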
Preventing tail-sampler OOMs
A tail-sampler without a cap will buffer indefinitely until the process OOMs. The tail_sampling processor exposes num_traces as a hard cap on the number of active traces. When the cap is reached, the sampler begins evicting the oldest in-progress traces (incomplete traces are discarded).
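The cap-and-evict behaviour described above can be modelled with an ordered map. This is a conceptual sketch, not the processor's implementation: `BoundedTraceBuffer` and its fields are hypothetical names, and the real processor also runs policy evaluation when the decision window closes.

```python
from collections import OrderedDict


class BoundedTraceBuffer:
    """Buffer spans per trace, evicting the oldest trace once the cap is hit."""

    def __init__(self, num_traces: int):
        self.num_traces = num_traces
        self._traces = OrderedDict()  # trace_id -> spans, ordered by first arrival
        self.dropped_traces = 0

    def add_span(self, trace_id: str, span: str) -> None:
        if trace_id not in self._traces:
            if len(self._traces) >= self.num_traces:
                # Cap reached: discard the oldest in-progress trace.
                self._traces.popitem(last=False)
                self.dropped_traces += 1
            self._traces[trace_id] = []
        self._traces[trace_id].append(span)

    def __len__(self) -> int:
        return len(self._traces)

    def __contains__(self, trace_id: str) -> bool:
        return trace_id in self._traces


buf = BoundedTraceBuffer(num_traces=2)
for trace_id, span in [("t1", "a"), ("t2", "b"), ("t1", "c"), ("t3", "d")]:
    buf.add_span(trace_id, span)
```

After the fourth span, `t1` has been evicted even though it received a span recently, because eviction here follows first-arrival order. Memory stays bounded at the cost of losing the oldest incomplete traces.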
Slack’s 2023 incident: tail-sampling Collectors took down their tracing pipeline during a major incident because num_traces was uncapped. The postmortem added the cap, a separate high-priority always-keep tier for critical traces, and an alert on the rate of otelcol_processor_dropped_spans.
The full configuration pattern:
processors:
  tail_sampling:
    decision_wait: 30s
    num_traces: 100000 # hard cap: evict oldest if exceeded
    policies:
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-traces
        type: latency
        latency:
          threshold_ms: 2000
      - name: probabilistic
        type: probabilistic
        probabilistic:
          sampling_percentage: 1

| Alert metric | What it means | Threshold |
|---|---|---|
| otelcol_processor_tail_sampling_count_traces_on_memory | Active in-flight traces | Alert at 80% of num_traces cap |
| otelcol_processor_dropped_spans | Spans dropped due to cap | Alert at any non-zero rate |
| container_memory_working_set_bytes (Collector pod) | Actual RAM used | Alert at 85% of memory limit |
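The three thresholds in the table can be expressed as one check, for example in a test harness or a dry run before wiring up real alert rules. The function and argument names below are illustrative assumptions; in production these conditions would normally live in the alerting system, not in application code.

```python
def evaluate_alerts(
    traces_in_memory: int,
    num_traces_cap: int,
    dropped_spans_per_s: float,
    memory_used_bytes: int,
    memory_limit_bytes: int,
) -> list:
    """Return the names of the alert conditions that currently fire."""
    alerts = []
    # Integer arithmetic avoids floating-point edge cases at the boundary.
    if traces_in_memory * 100 >= 80 * num_traces_cap:
        alerts.append("traces_near_cap")
    if dropped_spans_per_s > 0:
        alerts.append("spans_dropped")
    if memory_used_bytes * 100 >= 85 * memory_limit_bytes:
        alerts.append("memory_near_limit")
    return alerts


healthy = evaluate_alerts(40_000, 100_000, 0.0, 2 * 1024**3, 4 * 1024**3)
stressed = evaluate_alerts(85_000, 100_000, 12.5, 3_800_000_000, 4 * 1024**3)
```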
Tail-sampling Collector OOMs every few hours. Trace the root cause.
- W3C Trace Context Level 1: Recommendation, 2020-02
- W3C Trace Context Level 2 (random-trace-id flag): Candidate Recommendation, 2024
- Typical tail decision window: 30s
- Collector RAM at 50k traces × 100 spans × 1 KB: ~5 GB
- Typical production Collector replicas: 5–20
- Routing key for load-balancing exporter: trace-id (deterministic hash)
A service emits 5% orphan spans for internal (non-entry-point) services. The team adds tail sampling. The orphan rate stays. What's misunderstood?
A Collector replica is removed during a scale-down event. What happens to traces whose decision window has not yet closed on that replica?
- 01 How does deterministic trace-id hashing achieve consistent sampling without coordination between services?
- 02 Explain the tail-sampler OOM failure mode and the three independent mitigations.
- 03 Why must the load-balancing exporter use trace-id hashing rather than round-robin or least-connections?
Consistent sampling uses a deterministic hash(trace-id) mod 100 so that all services independently agree on keep/drop, which prevents partial traces. The W3C Level 2 random-trace-id flag makes this safe by asserting that the trace-id is random. The tail-sampling Collector’s RAM is active_traces × spans/trace × bytes/span, where active_traces = request rate × decision_window; every factor is an independent OOM lever. A hard num_traces cap prevents unbounded growth; alerts on otelcol_processor_dropped_spans catch when the cap fires. The load-balancing exporter must use trace-id hashing because round-robin scatters a trace’s spans across replicas, so no replica can make a correct policy decision. Long-running traces (batch jobs accumulating thousands of spans) must be broken into sub-traces linked via span-links to stay within the decision window.