OTel: config and trace reading

Read real Collector YAML, a tail-sampling config, a Collector self-metrics dump, and a manual-span snippet, then pick the behaviour or the highest-leverage fix.

OBS Senior ◷ 14 min

OTel problems are diagnosed in YAML, log lines, and span code — not in slide diagrams. Read each artefact, predict the behaviour, and choose the fix a senior platform engineer would make first.

Goal

Practise the loop you run in every OTel incident: read the Collector config or the span code, predict what it does under load, and reach for the change that actually fixes the failure mode.

Snippet 1 — the processor order

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, memory_limiter, batch]
      exporters: [otlp/backend]

Quiz

Under a traffic spike this pipeline OOM-kills the gateway. What is wrong, and what is the fix?
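The ordering rule behind this quiz (memory_limiter has to come first so it can push back before anything else buffers data) can be checked mechanically. The sketch below is a hypothetical lint helper, not something the Collector ships; the function name `checkProcessorOrder` and the message wording are assumptions made for illustration.

```typescript
// Hypothetical lint helper: flags a traces pipeline whose memory_limiter
// is missing or is not the first processor. Not part of the Collector.
function checkProcessorOrder(processors: string[]): string[] {
  const problems: string[] = [];

  // Collector component IDs may carry a "/name" suffix, e.g. "memory_limiter/gateway".
  const idx = processors.findIndex(
    (p) => p === "memory_limiter" || p.startsWith("memory_limiter/")
  );

  if (idx === -1) {
    problems.push("memory_limiter is missing from the pipeline");
  } else if (idx !== 0) {
    const ahead = processors.slice(0, idx).join(", ");
    problems.push(
      `memory_limiter is at position ${idx}; ${ahead} runs before the limiter can apply backpressure`
    );
  }
  return problems;
}

// The order from Snippet 1, and the same processors with the limiter moved first.
const asWritten = checkProcessorOrder(["tail_sampling", "memory_limiter", "batch"]);
const reordered = checkProcessorOrder(["memory_limiter", "tail_sampling", "batch"]);
```

Here `asWritten` holds one problem naming tail_sampling, and `reordered` is empty. A check like this could run in CI against rendered Collector configs, assuming the YAML is parsed into a plain list of processor names first.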

Snippet 2 — the tail-sampling policy

processors:
  tail_sampling:
    decision_wait: 30s
    num_traces: 50000
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow
        type: latency
        latency: { threshold_ms: 1000 }
      - name: baseline
        type: probabilistic
        probabilistic: { sampling_percentage: 1 }

Quiz

At a sustained 2,000 traces/sec the gateway starts dropping traces before decisions are made, and otelcol_processor_tail_sampling_sampling_traces_on_memory sits pinned at the num_traces ceiling. What is the sizing error?
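The arithmetic behind this quiz is short enough to write down. The sketch below applies the rule of thumb peak_rate × decision_wait × safety to the numbers in Snippet 2. The helper name `requiredNumTraces` and the 1.5 safety factor are assumptions for illustration, and the quiz's sustained rate stands in for the peak rate, which may be higher in practice.

```typescript
// Traces that must be held in memory while the sampler waits to decide:
// every trace that arrives during decision_wait, plus headroom.
function requiredNumTraces(
  peakTracesPerSec: number,
  decisionWaitSec: number,
  safetyFactor: number
): number {
  return Math.ceil(peakTracesPerSec * decisionWaitSec * safetyFactor);
}

const configured = 50000; // num_traces from Snippet 2

// No headroom at all: 2,000 traces/sec held for 30 s.
const inFlight = requiredNumTraces(2000, 30, 1.0); // 60000

// With an assumed 1.5x safety factor for bursts.
const withHeadroom = requiredNumTraces(2000, 30, 1.5); // 90000

// true: the buffer is too small even before any headroom is added.
const undersized = inFlight > configured;
```

Because 60,000 traces are in flight at any moment and only 50,000 fit, the oldest traces are evicted before their 30 s window closes. Either a larger num_traces (with memory to match) or a shorter decision_wait brings the two back in line.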

Snippet 3 — the Collector self-metrics line

otelcol_processor_dropped_spans{processor="memory_limiter"} 18432
otelcol_receiver_refused_spans{receiver="otlp"} 9120
otelcol_exporter_send_failed_spans{exporter="otlp/backend"} 0
process_resident_memory_bytes 1.93e+09   # limit 2.0e+09

Quiz

Reading these four Collector self-metrics together, what is the diagnosis?
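The reading pattern for these four lines can be sketched as a small classifier: failed exports point at the backend, while drops and refusals with resident memory close to the limit point at the gateway itself. The type and function names below are hypothetical, and the 90% "near the limit" threshold is an assumption. Real counters are cumulative, so in practice the comparison would be made on rates over a window rather than on raw totals.

```typescript
// Hypothetical shape for the four values in Snippet 3.
interface SelfMetrics {
  droppedSpans: number;     // otelcol_processor_dropped_spans{processor="memory_limiter"}
  refusedSpans: number;     // otelcol_receiver_refused_spans{receiver="otlp"}
  sendFailedSpans: number;  // otelcol_exporter_send_failed_spans{exporter="otlp/backend"}
  rssBytes: number;         // process_resident_memory_bytes
  memoryLimitBytes: number; // the limit noted in the snippet's comment
}

type Diagnosis = "backend-outage" | "memory-bound" | "inconclusive";

function diagnose(m: SelfMetrics): Diagnosis {
  // Assumed threshold: treat RSS at 90% of the limit or more as memory pressure.
  const nearLimit = m.rssBytes / m.memoryLimitBytes >= 0.9;

  // Export failures mean the backend (or the path to it) is the problem.
  if (m.sendFailedSpans > 0) return "backend-outage";

  // Exports succeed but data is shed at the front door under memory pressure.
  if ((m.droppedSpans > 0 || m.refusedSpans > 0) && nearLimit) return "memory-bound";

  return "inconclusive";
}

// The values from Snippet 3: RSS is at 96.5% of the limit.
const snippetMetrics: SelfMetrics = {
  droppedSpans: 18432,
  refusedSpans: 9120,
  sendFailedSpans: 0,
  rssBytes: 1.93e9,
  memoryLimitBytes: 2.0e9,
};
const verdict = diagnose(snippetMetrics); // "memory-bound"
```

The ordering of the checks is a design choice: a failing exporter can itself cause memory pressure as queues fill, so export failures are ruled out before the gateway is blamed.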

Snippet 4 — the manual span

const span = tracer.startSpan("fraud.check");
try {
  const score = await fraud.evaluate(order);
  span.setAttribute("fraud.score", score);
  if (score > threshold) throw new FraudError(order.id);
  return score;
} catch (err) {
  throw err;
}

Quiz

This manual span has two defects a senior reviewer flags immediately. What are they?
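One way to repair the snippet, sketched under some assumptions: the span is always ended in a finally block, and the exception is recorded and the status set to error before rethrowing. To keep the example runnable without `@opentelemetry/api` installed, it uses a minimal stand-in interface whose method names mirror the real Span API (`setAttribute`, `recordException`, `setStatus`, `end`). The `RecordingSpan` class, the `checkFraud` signature, and passing the span and evaluator in as parameters are all illustrative choices, not the lesson author's code.

```typescript
type AttrValue = number | string | boolean;

// Stand-in for the subset of the OpenTelemetry Span interface used here.
interface SpanLike {
  setAttribute(key: string, value: AttrValue): void;
  recordException(err: Error): void;
  setStatus(status: { code: number; message?: string }): void;
  end(): void;
}

// Numeric value of SpanStatusCode.ERROR in @opentelemetry/api.
const SPAN_STATUS_ERROR = 2;

// Hypothetical in-memory span so the behaviour can be observed.
class RecordingSpan implements SpanLike {
  ended = false;
  statusCode = 0; // 0 corresponds to SpanStatusCode.UNSET
  exceptions: Error[] = [];
  attributes: Record<string, AttrValue> = {};

  setAttribute(key: string, value: AttrValue): void { this.attributes[key] = value; }
  recordException(err: Error): void { this.exceptions.push(err); }
  setStatus(status: { code: number; message?: string }): void { this.statusCode = status.code; }
  end(): void { this.ended = true; }
}

class FraudError extends Error {}

async function checkFraud(
  span: SpanLike,
  evaluate: () => Promise<number>,
  threshold: number
): Promise<number> {
  try {
    const score = await evaluate();
    span.setAttribute("fraud.score", score);
    if (score > threshold) throw new FraudError("fraud score above threshold");
    return score;
  } catch (err) {
    // Defect 1 fixed: the error is attached to the span instead of vanishing.
    const error = err instanceof Error ? err : new Error(String(err));
    span.recordException(error);
    span.setStatus({ code: SPAN_STATUS_ERROR, message: error.message });
    throw err;
  } finally {
    // Defect 2 fixed: the span ends on success and on failure alike.
    span.end();
  }
}
```

With the real API the same shape applies: `tracer.startSpan("fraud.check")` supplies the span, and `SpanStatusCode.ERROR` replaces the local constant.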

Recap

Every OTel artefact reads the same way. In a Collector pipeline, processor order is correctness: put memory_limiter first or the gateway OOMs. Tail-sampling num_traces must be sized for peak_rate × decision_wait × safety, not for baseline traffic. The Collector’s own self-metrics distinguish a memory-bound gateway (dropped and refused spans high, send_failed zero) from a backend outage (send_failed high). A manual span must always call end() in a finally block and record exceptions, or it leaks and hides the very errors it was meant to capture. Read the config, predict the failure mode, fix the structural cause.
