Crux: Read real Collector YAML, a tail-sampling config, a gctrace-style log line, and a manual-span snippet, then pick the behaviour or the highest-leverage fix.
OTel problems are diagnosed in YAML, log lines, and span code — not in slide diagrams. Read each artefact, predict the behaviour, and choose the fix a senior platform engineer would make first.
Goal
Practise the loop you run in every OTel incident: read the Collector config or the span code, predict what it does under load, and reach for the change that actually fixes the failure mode.
Under a traffic spike this pipeline OOM-kills the gateway. What is wrong, and what is the fix?
Heads-up: batch and tail_sampling coexist fine; tail_sampling buffers traces, batch groups the kept ones for export. The defect is that memory_limiter runs last, too late to protect against the buffering.
Heads-up: Rate-limiting masks the symptom. The structural bug is memory_limiter running after the processor that consumes the most memory. Reorder it first.
Heads-up: A second exporter does nothing for inbound buffering memory. The OOM is in the tail_sampling buffer, governed by processor order and num_traces, not export fan-out.
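The ordering rule above can be expressed as a small lint check. This is a minimal sketch, not a Collector feature: the function name `check_processor_order` and the two example lists are hypothetical, and the "broken" list is only an assumed reconstruction of a pipeline where memory_limiter runs last. It relies on the Collector convention that a component ID has the form `type` or `type/name`.

```python
from typing import List


def check_processor_order(processors: List[str]) -> List[str]:
    """Return ordering problems for a pipeline's processor list (empty if none)."""
    # Component IDs may carry a name suffix, e.g. "memory_limiter/gateway".
    types = [p.split("/")[0] for p in processors]
    problems = []
    if "memory_limiter" not in types:
        problems.append("memory_limiter is missing")
    elif types[0] != "memory_limiter":
        # Anything ahead of memory_limiter can buffer without protection.
        problems.append("memory_limiter must be the first processor")
    return problems


# Hypothetical processor lists, in the order they would appear in the YAML.
broken = ["tail_sampling", "batch", "memory_limiter"]
fixed = ["memory_limiter", "tail_sampling", "batch"]

print(check_processor_order(broken))
print(check_processor_order(fixed))
```

A check like this could run in CI against the rendered config, so a reordering mistake is caught before it reaches the gateway.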
At a sustained 2,000 traces/sec the gateway starts dropping traces before decisions are made, and otelcol_processor_tail_sampling_sampling_traces_on_memory sits above num_traces. What is the sizing error?
Heads-up: Raising decision_wait makes it worse: a longer window means more concurrent in-flight traces, so the buffer needs to be even larger. The fix is sizing num_traces for the rate × window, not extending the window.
Heads-up: The baseline percentage controls how many non-error, non-slow traces are kept after the decision. It does not control how many traces are buffered while waiting. Buffer pressure is set by peak_rate × decision_wait.
Heads-up: Policies are OR-ed: a trace kept by any policy is kept once. They do not double-count or cause the buffer overflow; undersized num_traces does.
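The sizing rule (peak_rate × decision_wait, with headroom) is simple arithmetic, sketched below. The helper name `required_num_traces`, the 30-second and 60-second windows, and the 1.5 safety factor are assumptions chosen for illustration; only the 2,000 traces/sec rate comes from the question.

```python
import math


def required_num_traces(peak_traces_per_sec: float,
                        decision_wait_sec: float,
                        safety_factor: float = 1.5) -> int:
    """Traces that can be in flight during one decision window, plus headroom."""
    return math.ceil(peak_traces_per_sec * decision_wait_sec * safety_factor)


# 2,000 traces/sec with an assumed 30 s decision window.
print(required_num_traces(2000, 30))   # 2000 * 30 * 1.5 = 90000

# Doubling the window doubles the buffer that is needed.
print(required_num_traces(2000, 60))   # 2000 * 60 * 1.5 = 180000
```

The second call shows why extending decision_wait is not a fix: the required buffer grows linearly with the window.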
Reading these four Collector self-metrics together, what is the diagnosis?
Heads-up: otelcol_exporter_send_failed_spans is 0, so the backend is accepting exports fine. The drops are from memory_limiter under RAM pressure, an upstream capacity problem, not a backend outage.
Heads-up: refused_spans here is back-pressure: the receiver refuses because the processor chain (memory_limiter) is shedding load near the RAM ceiling. The receiver config is fine; the Collector is simply out of memory headroom.
Heads-up: Non-zero dropped_spans and refused_spans with RSS at 97% of limit is the signature of an under-provisioned gateway, not steady-state health. A healthy gateway shows these at zero.
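The reasoning in these answers can be written down as a small decision function. This is a sketch under stated assumptions: the `GatewaySnapshot` fields, the 0.9 RSS threshold, the label strings, and the sample counter values are all hypothetical; only the pattern (send_failed at zero, dropped and refused non-zero, RSS at 97% of limit) comes from the text.

```python
from dataclasses import dataclass


@dataclass
class GatewaySnapshot:
    dropped_spans: int           # spans dropped inside the processor chain
    refused_spans: int           # spans the receiver refused (back-pressure)
    send_failed_spans: int       # spans the exporter failed to send
    rss_fraction_of_limit: float # resident memory / configured limit


def diagnose(s: GatewaySnapshot, rss_threshold: float = 0.9) -> str:
    """Classify a snapshot using the pattern described in the answers above."""
    if s.send_failed_spans > 0:
        return "backend-or-export-problem"
    shedding = s.dropped_spans > 0 or s.refused_spans > 0
    if shedding and s.rss_fraction_of_limit >= rss_threshold:
        return "memory-bound-gateway"
    if shedding:
        return "shedding-load-cause-unclear"
    return "healthy"


# Illustrative values only; the real numbers depend on the gateway.
snapshot = GatewaySnapshot(dropped_spans=1200, refused_spans=800,
                           send_failed_spans=0, rss_fraction_of_limit=0.97)
print(diagnose(snapshot))
```

The order of the checks matters: export failures are ruled out first, because only then do drops and refusals point upstream at memory headroom.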
This manual span has two defects a senior reviewer flags immediately. What are they?
Heads-up: Setting attributes inside the try is fine and normal. The real defects are the missing span.end() (the span never flushes) and the unrecorded exception (the error never appears on the span).
Heads-up: The span name is acceptable and does not affect sampling correctness. The defects are lifecycle: the span is never ended and the error is never recorded on it.
Heads-up: await is fine inside a span; OTel context propagates across awaits via the runtime's context manager. The actual bugs are the missing end() and the unrecorded exception.
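The corrected lifecycle can be sketched as follows. To keep the example runnable without dependencies it uses `StandInSpan`, a hypothetical stand-in that mimics only the shape of a span (set_attribute, record_exception, set_status, end); it is not the OpenTelemetry SDK, where set_status takes a status object rather than a string. The `charge_card` and `handle_payment` names are also invented for illustration.

```python
import asyncio
from typing import Any, Dict, List


class StandInSpan:
    """Hypothetical stand-in mimicking the shape of a span, for illustration."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.attributes: Dict[str, Any] = {}
        self.exceptions: List[BaseException] = []
        self.status = "UNSET"
        self.ended = False

    def set_attribute(self, key: str, value: Any) -> None:
        self.attributes[key] = value

    def record_exception(self, exc: BaseException) -> None:
        self.exceptions.append(exc)

    def set_status(self, status: str) -> None:
        self.status = status

    def end(self) -> None:
        self.ended = True


async def charge_card(amount: int) -> None:
    """Hypothetical downstream call that always fails, to exercise the error path."""
    await asyncio.sleep(0)
    raise RuntimeError("card declined")


async def handle_payment(span: StandInSpan, amount: int) -> None:
    try:
        span.set_attribute("payment.amount", amount)  # fine inside the try
        await charge_card(amount)                     # await inside a span is fine
    except Exception as exc:
        span.record_exception(exc)  # defect 2 fixed: the error lands on the span
        span.set_status("ERROR")
        raise
    finally:
        span.end()                  # defect 1 fixed: the span always ends


span = StandInSpan("handle_payment")
try:
    asyncio.run(handle_payment(span, 42))
except RuntimeError:
    pass  # the caller still sees the failure; the span has captured it too

print(span.ended, span.status, len(span.exceptions))
```

Putting end() in the finally block means the span closes on both the success and the failure path, and re-raising after recording keeps the caller's error handling unchanged.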
Recap
Every OTel artefact reads the same way. In a Collector pipeline, processor order is correctness: put memory_limiter first or you OOM. Tail-sampling num_traces must be sized for peak_rate × decision_wait × safety, not for the baseline sampling percentage. The Collector's own self-metrics distinguish a memory-bound gateway (dropped/refused high, send_failed zero) from a backend outage (send_failed high). A manual span must always end() in a finally and record exceptions, or it leaks and hides the very errors it was meant to capture. Read the config, predict the failure mode, fix the structural cause.