Observability

Operating the OTel Collector: reliability, version skew, failure modes, and governance

Crux: The Collector is critical-path observability infrastructure. An HA gateway (3+ replicas), a persistent queue, meta-monitoring, conservative version upgrades, and Semantic Convention governance are the disciplines that prevent silent telemetry loss.

The Collector fails. Application errors spike but no alerts fire, no traces appear in the backend. The on-call engineer checks dashboards — all green, because the dashboards depend on the same Collector that just failed. OTel self-monitoring is not optional.

Reliability patterns

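HA gateway (minimum 3 replicas): A single gateway pod failure loses the in-flight spans buffered in that pod's memory. With three replicas, the loss of one pod is survivable because clients retry against the remaining pods. Run the replicas behind a Kubernetes Service or cloud load balancer. Agents that use the loadbalancing exporter (needed when the gateways do tail sampling, so that all spans of a trace reach the same pod) must be able to see individual pod addresses, typically through a headless Service with the dns resolver or through the k8s resolver; scale-up and scale-down are then picked up without reconfiguring the agents.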
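As a rough sketch of the agent side, the fragment below configures the loadbalancing exporter to discover gateway pods through DNS. The Service name `otel-gateway-headless`, the `observability` namespace, and the plaintext TLS setting are assumptions for illustration; the DNS resolver only sees individual pods if the Service is headless.

```yaml
# Agent-side exporter (sketch). Hostname and namespace are hypothetical.
exporters:
  loadbalancing:
    routing_key: traceID        # keep all spans of a trace on one gateway
    protocol:
      otlp:
        tls:
          insecure: true        # assumption: plaintext traffic inside the cluster
    resolver:
      dns:
        hostname: otel-gateway-headless.observability.svc.cluster.local
        port: 4317
```

When the set of resolved addresses changes, the exporter rebuilds its routing, which is the source of the re-routing failure mode discussed later in this lesson.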

Persistent queue — the file_storage extension provides a disk-backed buffer that survives Collector restarts. Configure it on the gateway’s export pipelines to absorb 5-15 minutes of backend slowdown without dropping spans:

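extensions:
  file_storage:
    # Must exist and be writable. Mount a persistent volume here so the
    # queue survives pod rescheduling, not only process restarts.
    directory: /var/otel/queue

exporters:
  otlp/primary:
    endpoint: backend:4317
    sending_queue:
      enabled: true
      storage: file_storage   # ID of the extension declared above
      queue_size: 10000       # by default counted in export requests (batches), not spans

service:
  # An extension that is configured but not listed here is never started.
  extensions: [file_storage]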
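To check whether a queue_size of 10000 really covers the 5-15 minute target, a back-of-the-envelope estimate helps. The numbers below are assumptions, not measurements: an export rate of 100,000 spans/s, batches of 8,192 spans (the batch processor's default send_batch_size), and a queue_size that counts export requests rather than spans.

```latex
\[
\text{requests/s} \approx \frac{100{,}000\ \text{spans/s}}{8{,}192\ \text{spans/request}} \approx 12.2
\qquad
\text{buffer time} \approx \frac{10{,}000\ \text{requests}}{12.2\ \text{requests/s}} \approx 820\ \text{s} \approx 13.7\ \text{min}
\]
```

Under these assumptions the queue sits near the top of the target window. Smaller batches or a higher export rate shorten the buffer time proportionally, so the estimate is worth redoing with measured values.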

Health checks — liveness and readiness probes against the health_check extension on port 13133. Do not let a slow or overloaded Collector be considered ready; it will continue receiving spans it cannot process.
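A minimal sketch of the wiring, assuming the extension's default port and root path. The probe timings are illustrative, and the second document is a container-spec fragment rather than a complete manifest.

```yaml
# Collector config: enable the health_check extension on its default port.
extensions:
  health_check:
    endpoint: 0.0.0.0:13133

service:
  extensions: [health_check]
---
# Container spec fragment for the gateway pod (timings are illustrative).
livenessProbe:
  httpGet:
    path: /
    port: 13133
  initialDelaySeconds: 10
readinessProbe:
  httpGet:
    path: /
    port: 13133
  periodSeconds: 5
```

A successful response here mainly indicates that the Collector has started, so overload is usually better caught by the self-monitoring alerts than by the probe alone.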

Self-monitoring — scrape the Collector’s /metrics endpoint (port 8888) and alert on:

  • otelcol_processor_dropped_spans rate > 0: memory_limiter engaging; warn immediately (some Collector versions report this as otelcol_processor_refused_spans instead, so check which name your version emits)
  • otelcol_receiver_refused_spans rate > 0: back-pressure at the receiver; correlates with memory_limiter
  • otelcol_exporter_send_failed_spans rate > 0: backend connectivity problem
  • otelcol_exporter_queue_size / otelcol_exporter_queue_capacity > 80%: exporter backlog building; backend slow
  • otelcol_processor_tail_sampling_count_traces_on_memory vs num_traces: buffer exhaustion approaching
  • process_resident_memory_bytes vs configured limit: approaching OOM
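The alerts above could be expressed as Prometheus alerting rules along the following lines. This is a sketch: the rule names, `for` durations, and severities are assumptions, and the metric names are taken from the list as written (depending on the Collector version, counters may carry a `_total` suffix).

```yaml
# Prometheus alerting rules (sketch). Verify metric names against /metrics first.
groups:
  - name: otel-collector-self-monitoring
    rules:
      - alert: CollectorReceiverRefusingSpans
        expr: rate(otelcol_receiver_refused_spans[5m]) > 0
        for: 5m
        labels:
          severity: warning
      - alert: CollectorExportFailing
        expr: rate(otelcol_exporter_send_failed_spans[5m]) > 0
        for: 5m
        labels:
          severity: critical
      - alert: CollectorQueueNearlyFull
        expr: otelcol_exporter_queue_size / otelcol_exporter_queue_capacity > 0.8
        for: 10m
        labels:
          severity: warning
```

For these rules to be useful during a Collector outage, the Prometheus instance that evaluates them should scrape port 8888 directly and not receive its data through the Collector it is watching.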

Resource sizing: a commodity gateway pod (4 CPU, 8 GB RAM) handles ~100-200k spans/sec with tail sampling. Size for peak load with 2× headroom. Set CPU requests low and keep RAM requests and limits tight; configure memory_limiter below the container memory limit so that it engages before the Linux OOM killer does.
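One way to express that sizing in a pod spec is sketched below. The values come from the 4 CPU / 8 GB example; the choice to omit a CPU limit is an assumption of this sketch, not something the guidance above prescribes.

```yaml
# Gateway container resources (sketch; values are illustrative).
resources:
  requests:
    cpu: "1"          # low CPU request; no CPU limit here, so the pod can burst
    memory: 8Gi
  limits:
    memory: 8Gi       # request equal to limit keeps the memory budget predictable
```

The memory_limiter threshold would then sit somewhat below 8Gi, leaving room for the Go runtime and short spikes.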

Reliability concern | Solution | Alert
Pod crash | 3+ replicas behind Service | PodRestartCount > 1/hr
Backend slowdown | Persistent queue (5-15 min) | queue_size > 80% capacity
Memory spike | memory_limiter drops before OOM | dropped_spans rate > 0
Pipeline lag | Monitor (ObservedTimestamp - Timestamp) p99 | p99 lag > 60s

Version skew and stability strategy

OTel is many independently versioned components: the spec (v1.x), each language SDK (varies), each Collector binary (v0.x with rapid releases), each Semantic Convention domain (HTTP 1.x, DB 1.x, etc.).

Compatibility: SDKs are forward-compatible with newer Collectors across multiple minor versions; OTLP is stable. The Collector has a notion of stable and beta components — production setups stick to stable receivers, processors, exporters.

Strategy:

  1. Pin SDK and Collector versions in deployment manifests
  2. Upgrade quarterly with a canary before fleet-wide rollout
  3. Track Semantic Convention versions per service so dashboards know what attribute names to expect
  4. Use the OTel Operator for Collector upgrades: updating the OpenTelemetryCollector custom resource triggers a rolling restart, which avoids downtime as long as several replicas are running
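A pinned gateway managed by the Operator might look roughly like this. The resource name, namespace, and image tag are illustrative, the API version depends on the installed Operator release, and the embedded config is deliberately minimal.

```yaml
# OpenTelemetryCollector custom resource (sketch; names and tag are hypothetical).
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: gateway
  namespace: observability
spec:
  mode: deployment
  replicas: 3
  image: otel/opentelemetry-collector-contrib:0.100.0   # pinned; bumped via canary
  upgradeStrategy: none      # assumption: upgrades happen only when this file changes
  config:
    receivers:
      otlp:
        protocols:
          grpc: {}
    exporters:
      otlp/primary:
        endpoint: backend:4317
    service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [otlp/primary]
```

Keeping this file in the GitOps repository makes the pinned version reviewable, and a version bump becomes a one-line diff that can be applied to a canary resource first.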

Production failure modes

(a) Collector OOM under tail sampling: Gateway buffer grows past memory limit because decision_wait is too long or trace volume spiked. Mitigation: memory_limiter before tail_sampling; alert on dropped_spans; right-size num_traces for peak rate × decision_wait × 2.
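The mitigation for (a) can be sketched as a gateway pipeline fragment. The peak rate of 2,500 new traces per second, the 10 s decision_wait, and both sampling policies are assumptions chosen to make the arithmetic concrete; the receiver and exporter are assumed to be defined elsewhere in the same config.

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80         # of the memory available to the container
    spike_limit_percentage: 20
  tail_sampling:
    decision_wait: 10s
    # Sizing from the text: peak new traces/s x decision_wait x 2.
    # Assumed peak of 2,500 traces/s: 2500 x 10 x 2 = 50,000.
    num_traces: 50000
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: sample-rest
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling]   # limiter runs first
      exporters: [otlp/primary]
```

Processor order in the pipeline list is execution order, so placing memory_limiter first lets it push back on incoming data before the tail-sampling buffer grows further.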

(b) Tail-sample re-routing on scale events: Gateway pool scales up, loadbalancing exporter’s hash ring re-shuffles, in-flight traces lose some spans. Mitigation: pre-warm new pods, scale conservatively, use longer convergence windows on the loadbalancing exporter.

(c) OTLP version mismatch: A Collector that lags behind the SDKs sending to it encounters an unknown field from a newer OTLP proto; it may silently drop attributes or the whole record. Mitigation: maintain an SDK and Collector compatibility matrix; stage upgrades; upgrade the Collector before the SDKs it receives from, never the other way round.

(d) Auto-instrumentation footprint regression: A new minor version of the OTel Java Agent adds an instrumentation that slows a critical library. Mitigation: canary the agent upgrade; monitor p99 latency on the affected service; use per-instrumentation opt-out flags (OTEL_INSTRUMENTATION_X_ENABLED=false).
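As an illustration of the opt-out flag, the `X` placeholder is replaced by the upper-cased instrumentation name. The fragment below uses the JDBC instrumentation purely as an example; which name applies depends on the library that regressed.

```yaml
# Container env fragment for the instrumented service (sketch).
env:
  - name: OTEL_INSTRUMENTATION_JDBC_ENABLED   # jdbc chosen only as an example
    value: "false"
```

Disabling one instrumentation at a time on the canary makes it easier to confirm which one is responsible for the latency change.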

(e) Cardinality leak via auto-instrumentation: Auto-instrumented HTTP client adds url.full (the raw URL with query parameters) as an attribute, exploding cardinality at the metrics backend. Mitigation: configure the instrumentation to use http.route (templated) instead of url.full; strip query strings via an attributes processor at the Collector.
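One way to strip query strings at the Collector is sketched below. Note that it uses the transform processor with an OTTL statement instead of the attributes processor named above, since regex replacement is more direct there; the processor name `transform/strip_query` is arbitrary.

```yaml
processors:
  transform/strip_query:
    error_mode: ignore
    trace_statements:
      - context: span
        statements:
          # Remove everything from the first "?" onward in url.full.
          - replace_pattern(attributes["url.full"], "\\?.*", "")
```

This reduces cardinality from query parameters only. Path segments that embed IDs still vary per request, which is why switching to a templated attribute at the instrumentation remains the primary fix.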

Semantic Convention governance

Semantic Conventions are how every team’s telemetry composes at fleet scale. Governance failures are expensive:

  • Team-A names a field route
  • Team-B names it http_route
  • Team-C names it http.route (the correct Semantic Convention name)
  • Cross-team dashboards use http.route — teams A and B are invisible

Pattern: platform team publishes a per-language wrapper that pre-configures Semantic Convention attribute extraction. New services import the wrapper; CI lint rejects raw SDK usage in new code. The wrapper handles:

  • HTTP route extraction (matched template, not raw URL)
  • DB system tagging (db.system=postgresql, not “psql”)
  • Redaction deny-lists
  • Trace-context mixins for logs

Quarterly audit: check top-10 most-used attribute names per service for Semantic Convention drift. The audit output is the platform team’s backlog.

Why this works

Why is the Collector’s release cadence (~monthly) faster than the spec’s? The spec defines stable contracts (OTLP, signal data models, Semantic Conventions) that must evolve slowly for backward compatibility. The Collector is one implementation of those contracts, so it can add receivers, processors, exporters, and OTTL capabilities in minor versions without changing the spec. Production teams therefore pin the Collector version and upgrade quarterly rather than monthly, because even stable Collector releases occasionally change default behaviour in processors.

Quiz

A Collector gateway pod's resident memory is at 1.92 GB of a 2 GB limit. otelcol_processor_dropped_spans is non-zero and otelcol_processor_tail_sampling_count_traces_on_memory is at 62,400 (num_traces configured as 50,000). What is the root cause and durable fix?

Quiz

A new minor version of the OTel Java Agent adds an instrumentation for the company's internal RPC library. After upgrading, p99 latency on the order service rises 8%. What is the investigation and mitigation?

Order the steps

Order the operational steps for a safe OTel Collector version upgrade:

  1. Check the Collector changelog for default-behaviour changes in processors used in production
  2. Update the Collector version in the OTel Operator custom resource for a canary gateway replica
  3. Monitor canary for 24h: dropped_spans, refused_spans, exporter latency, tail_sampling buffer size
  4. If canary is clean, apply the custom resource update to remaining gateway replicas (rolling restart)
  5. Update the pinned Collector version in the deployment manifests / GitOps repo
  6. Add the upgrade to the quarterly SDK + Collector version audit

Recall before you leave

  1. Name five Collector self-monitoring metrics and what each indicates.
  2. What is a cardinality leak in the context of OTel auto-instrumentation, and how do you detect and fix it?
  3. Why does the OTel Collector version (v0.x) upgrade more frequently than the OTel spec, and what does this mean for production upgrade strategy?

Recap

The OTel Collector is critical-path observability infrastructure: if it fails, the observability stack fails silently. Production reliability requires three or more gateway replicas behind a load balancer, a persistent disk-backed queue (sized to absorb 5-15 minutes of backend slowdown), health-check probes via the health_check extension, and self-monitoring: alert on dropped_spans, refused_spans, exporter failures, queue saturation, and tail_sampling buffer exhaustion. Version skew between SDKs and Collectors is managed by pinning versions and upgrading quarterly via canary. Common failure modes: OOM under tail sampling (fix: resize num_traces for peak_rate × decision_wait × 2); tail-sample re-routing during scale events (fix: pre-warm pods, scale conservatively); OTLP version mismatch (fix: staged upgrades); auto-instrumentation latency regression (fix: opt-out per instrumentation); cardinality leak from url.full (fix: switch to http.route). Semantic Convention governance (a per-language SDK wrapper plus CI lint) is the highest-leverage platform investment for preventing cross-team dashboard breakage.
