
Operating the OTel Collector: reliability, version skew, failure modes, and governance

The Collector is critical-path observability infrastructure. HA gateway (3+ replicas), persistent queue, meta-monitoring, conservative version upgrades, and Semantic Convention governance — these are the disciplines that prevent silent telemetry loss.


The Collector fails. Application errors spike but no alerts fire, no traces appear in the backend. The on-call engineer checks dashboards — all green, because the dashboards depend on the same Collector that just failed. OTel self-monitoring is not optional.

Reliability patterns

When the Collector fails, you lose visibility at the exact moment you need it most. This lesson covers the five practices that prevent that, and the metrics that warn you before you hit the wall.

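HA gateway, minimum 3 replicas: A single gateway pod failure loses all in-flight spans buffered in that pod. With three replicas, a one-pod failure is survivable because clients retry against the remaining pods. Run the pool behind a Kubernetes Service or cloud load balancer. The loadbalancing exporter on agents has to see the individual gateway pods in order to route by trace ID, so point its resolver at a headless Service (DNS resolver) or use the k8s resolver; the backend list then follows scale-up and scale-down without any change on the agents.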
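A minimal sketch of the agent-side exporter, assuming the gateway pool is exposed through a headless Service so that DNS returns one address per pod. The hostname is hypothetical, and `tls.insecure` is set only to keep the fragment short.

```yaml
exporters:
  loadbalancing:
    # Keep every span of a trace on the same gateway pod,
    # which tail sampling on the gateway depends on.
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true  # assumption: plaintext traffic inside the cluster
    resolver:
      dns:
        # Hypothetical headless Service name; it must resolve to pod IPs,
        # not to a single virtual IP.
        hostname: otel-gateway-headless.observability.svc.cluster.local
```

Pointing the resolver at a regular ClusterIP Service would hand the exporter a single address, and trace-aware routing across the pool would be lost.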

Persistent queue — the file_storage extension provides a disk-backed buffer that survives Collector restarts. Configure it on the gateway’s export pipelines to absorb 5-15 minutes of backend slowdown without dropping spans:

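extensions:
  file_storage:
    # Directory must exist and be writable. Mount a persistent volume
    # here if the queue should also survive pod rescheduling.
    directory: /var/otel/queue

exporters:
  otlp/primary:
    endpoint: backend:4317
    sending_queue:
      # Reference the extension by its component ID to persist the queue.
      storage: file_storage
      # Capacity is counted in requests (batches) by default, not spans.
      queue_size: 10000

service:
  # The extension must be enabled here for the storage reference to resolve.
  extensions: [file_storage]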

Health checks — liveness and readiness probes against the health_check extension on port 13133. Do not let a slow or overloaded Collector be considered ready; it will continue receiving spans it cannot process.
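As a rough sketch, the probes can be wired up with two fragments: the Collector side that exposes the endpoint, and the Kubernetes container spec that polls it. The probe timings are illustrative assumptions, not recommendations from this lesson.

```yaml
# Collector config: expose the health endpoint on the default port.
extensions:
  health_check:
    endpoint: 0.0.0.0:13133

service:
  extensions: [health_check]
---
# Kubernetes container spec fragment for the gateway pod.
livenessProbe:
  httpGet:
    path: /
    port: 13133
readinessProbe:
  httpGet:
    path: /
    port: 13133
  periodSeconds: 5      # assumed value
  failureThreshold: 2   # assumed value
```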

Self-monitoring — scrape the Collector’s /metrics endpoint (port 8888) and alert on:

otelcol_processor_dropped_spans rate > 0: memory_limiter engaging; warn immediately
otelcol_receiver_refused_spans rate > 0: back-pressure at the receiver; correlates with memory_limiter
otelcol_exporter_send_failed_spans rate > 0: backend connectivity problem
otelcol_exporter_queue_size / otelcol_exporter_queue_capacity > 80%: exporter backlog building; backend slow
otelcol_processor_tail_sampling_sampling_traces_on_memory vs num_traces: buffer exhaustion approaching
process_resident_memory_bytes vs configured limit: approaching OOM
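The first few alerts above could be expressed as Prometheus alerting rules along these lines. This is a sketch under assumptions: the metric names are taken from the list as written (depending on Collector version and scrape setup, counters may appear with a `_total` suffix), the `pod` label assumes Kubernetes relabeling, and the `for` durations and severities are placeholders.

```yaml
groups:
  - name: otel-collector-self-monitoring
    rules:
      - alert: CollectorDroppingSpans
        # Any sustained drop rate means memory_limiter is shedding load.
        expr: sum by (pod) (rate(otelcol_processor_dropped_spans[5m])) > 0
        for: 2m
        labels:
          severity: warning
      - alert: CollectorExportFailing
        # Export failures point at backend connectivity.
        expr: sum by (pod) (rate(otelcol_exporter_send_failed_spans[5m])) > 0
        for: 5m
        labels:
          severity: critical
      - alert: CollectorQueueNearlyFull
        # Backlog above 80% of capacity: the backend is not keeping up.
        expr: otelcol_exporter_queue_size / otelcol_exporter_queue_capacity > 0.8
        for: 5m
        labels:
          severity: warning
```

These rules must be evaluated by a Prometheus that does not itself depend on the Collector being healthy, otherwise the alerts fail along with it.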

Resource sizing: as a rough guide, a commodity gateway pod (4 CPU, 8 GB RAM) handles ~100-200k spans/sec with tail sampling. Size for peak load with 2× headroom. CPU requests can be set low, but keep memory requests and limits tight, and set memory_limiter below the container limit so that it engages before the Linux OOM killer does.
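One way to keep memory_limiter ahead of the OOM killer is to express its limits as a percentage of the memory available to the container. The percentages below are assumptions for illustration; on the 8 GB pod described above they work out to a hard limit of about 6.4 GB and a soft limit of about 4.8 GB.

```yaml
processors:
  memory_limiter:
    # How often memory usage is measured.
    check_interval: 1s
    # Hard limit: 80% of the memory available to the container.
    limit_percentage: 80
    # Soft limit = hard limit minus spike limit (80% - 20% = 60%);
    # data starts being refused once usage crosses the soft limit.
    spike_limit_percentage: 20
```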

Reliability concern	Solution	Alert
Pod crash	3+ replicas behind Service	PodRestartCount > 1/hr
Backend slowdown	Persistent queue (5-15 min)	queue_size > 80% capacity
Memory spike	memory_limiter drops before OOM	dropped_spans rate > 0
Pipeline lag	Monitor (ObservedTimestamp - Timestamp) p99	p99 lag > 60s

Version skew and stability strategy

OTel is a collection of independently versioned components: the spec (v1.x), each language SDK (varies), each Collector binary (v0.x with rapid releases), and each Semantic Convention domain (HTTP 1.x, DB 1.x, etc.).

Compatibility: SDKs are forward-compatible with newer Collectors across multiple minor versions, and OTLP itself is stable. Collector components carry individual stability levels (development, alpha, beta, stable); production setups favour the most stable receivers, processors, and exporters available.

Strategy:

Pin SDK and Collector versions in deployment manifests
Upgrade quarterly with a canary before fleet-wide rollout
Track Semantic Convention versions per service so dashboards know what attribute names to expect
Use the OTel Operator for Collector upgrades: updating the OpenTelemetryCollector custom resource triggers a rolling restart, which avoids downtime when the gateway runs multiple replicas
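A pinned gateway managed by the Operator might look roughly like the resource below. It assumes an Operator version that serves the `v1beta1` API (older installs use `v1alpha1`, where `config` is a string). The resource name, namespace, and image tag are illustrative, and the pipeline is reduced to the bare minimum.

```yaml
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: gateway             # hypothetical name
  namespace: observability  # hypothetical namespace
spec:
  mode: deployment
  replicas: 3
  # Pin an explicit tag; changing this field is what rolls the pods.
  image: otel/opentelemetry-collector-contrib:0.104.0  # example tag only
  config:
    receivers:
      otlp:
        protocols:
          grpc: {}
    exporters:
      otlp/primary:
        endpoint: backend:4317
    service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [otlp/primary]
```

Keeping this file in the GitOps repo makes the pinned version and the upgrade history reviewable in one place.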

Production failure modes

(a) Collector OOM under tail sampling: The gateway's trace buffer grows past the memory limit because decision_wait is too long or trace volume spiked. Mitigation: place memory_limiter before tail_sampling in the pipeline; alert on dropped_spans; size num_traces as peak new traces per second × decision_wait (in seconds) × 2.

(b) Tail-sample re-routing on scale events: Gateway pool scales up, loadbalancing exporter’s hash ring re-shuffles, in-flight traces lose some spans. Mitigation: pre-warm new pods, scale conservatively, use longer convergence windows on the loadbalancing exporter.

(c) OTLP version mismatch: An SDK upgraded ahead of the Collector sends a field from a newer OTLP proto that the older Collector does not recognise; the Collector may silently drop attributes or the whole record. Mitigation: SDK and Collector compatibility matrix; staged upgrades; upgrade the Collector before the SDKs that send to it, never the other way round.

(d) Auto-instrumentation footprint regression: A new minor version of the OTel Java Agent adds an instrumentation that slows a critical library. Mitigation: canary the agent upgrade; monitor p99 latency on the affected service; use per-instrumentation opt-out flags (OTEL_INSTRUMENTATION_X_ENABLED=false).

(e) Cardinality leak via auto-instrumentation: Auto-instrumented HTTP client adds url.full (the raw URL with query parameters) as an attribute, exploding cardinality at the metrics backend. Mitigation: configure the instrumentation to use http.route (templated) instead of url.full; strip query strings via an attributes processor at the Collector.
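For failure mode (e), the Collector-side stripping of query strings can be sketched as follows. Note the substitution: this uses the transform processor (OTTL) rather than the attributes processor named above, because a regex rewrite is straightforward to express there. The processor instance name `transform/strip_query` is hypothetical.

```yaml
processors:
  transform/strip_query:
    # Skip spans where the statement cannot be applied instead of failing.
    error_mode: ignore
    trace_statements:
      - context: span
        statements:
          # Remove everything from the first "?" onward in url.full.
          - replace_pattern(attributes["url.full"], "\\?.*", "")
```

This only caps the damage at the Collector; the durable fix remains configuring the instrumentation itself to emit a templated route.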

These five failure modes share a pattern: each silently loses telemetry at scale, yet each is detectable before it causes an outage, through a self-monitoring metric, a canary upgrade, or a cardinality alert, but only if you instrument the Collector itself as rigorously as you instrument your application.
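The sizing rule from failure mode (a) can be worked through with assumed numbers: a peak of 2,000 new traces per second and a decision_wait of 10 s give 2,000 × 10 × 2 = 40,000 for num_traces. The fragment below is a sketch under those assumptions. The two policies are placeholders, and the memory_limiter settings, the otlp receiver, and the exporter are assumed to be defined elsewhere in the same config.

```yaml
processors:
  batch: {}
  tail_sampling:
    decision_wait: 10s
    # Assumed peak of 2000 new traces/s: 2000 x 10 s x 2 = 40000.
    num_traces: 40000
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: sample-the-rest
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

service:
  pipelines:
    traces:
      receivers: [otlp]
      # memory_limiter runs first so it can refuse data before the
      # tail-sampling buffer pushes the process past its memory limit.
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp/primary]
```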

Semantic Convention governance

Semantic Conventions are how every team’s telemetry composes at fleet scale. Governance failures are expensive:

Team-A names a field route
Team-B names it http_route
Team-C names it http.route (the correct Semantic Convention name)
Cross-team dashboards use http.route — teams A and B are invisible

Pattern: platform team publishes a per-language wrapper that pre-configures Semantic Convention attribute extraction. New services import the wrapper; CI lint rejects raw SDK usage in new code. The wrapper handles:

HTTP route extraction (matched template, not raw URL)
DB system tagging (db.system=postgresql, not “psql”)
Redaction deny-lists
Trace-context mixins for logs

Quarterly audit: check top-10 most-used attribute names per service for Semantic Convention drift. The audit output is the platform team’s backlog.
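While teams migrate to the wrapper, drift found by the audit can be papered over at the Collector. The fragment below is a stopgap sketch, not the wrapper approach described above: it copies the non-standard `route` attribute from the Team-A example into `http.route` when the latter is missing, then drops the old key. It assumes `route` carries nothing but the HTTP route on the affected spans, and the processor instance name is hypothetical.

```yaml
processors:
  transform/semconv_stopgap:
    error_mode: ignore
    trace_statements:
      - context: span
        statements:
          # Promote the non-standard key only when the standard one is absent.
          - set(attributes["http.route"], attributes["route"]) where attributes["http.route"] == nil and attributes["route"] != nil
          # Drop the non-standard key so dashboards see a single name.
          - delete_key(attributes, "route")
```

A rewrite like this hides the drift rather than fixing it, so each rule should have an owner and a removal date in the platform backlog.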

Why this works

Why is the Collector’s release cadence (~monthly) faster than the spec’s? The spec defines stable contracts (OTLP, signal data models, Semantic Conventions) that must evolve slowly for backward compatibility. The Collector is an implementation detail — it can add processors, receivers, and exporters in minor versions without breaking the spec. This means the Collector frequently ships new functionality (a new receiver, a new processor, a new OTTL capability) while the underlying spec contract stays stable. Production teams pin the Collector version and upgrade quarterly — not monthly — because even stable Collector releases occasionally change default behaviour in processors.

Quiz

A Collector gateway pod's resident memory is at 1.92 GB of a 2 GB limit. otelcol_processor_dropped_spans is non-zero and otelcol_processor_tail_sampling_sampling_traces_on_memory is at 62,400 (num_traces configured as 50,000). What is the root cause and durable fix?

Quiz

A new minor version of the OTel Java Agent adds an instrumentation for the company's internal RPC library. After upgrading, p99 latency on the order service rises 8%. What is the investigation and mitigation?

Order the steps

Order the operational steps for a safe OTel Collector version upgrade:

1 Check the Collector changelog for default-behaviour changes in processors used in production
2 Update the Collector version in the OpenTelemetryCollector custom resource for a canary gateway replica
3 Monitor canary for 24h: dropped_spans, refused_spans, exporter latency, tail_sampling buffer size
4 If canary is clean, apply the custom resource update to the remaining gateway replicas (rolling restart)
5 Update the pinned Collector version in the deployment manifests / GitOps repo
6 Add the upgrade to the quarterly SDK + Collector version audit

Two-tier deployment: lightweight agents (one per host or as sidecars) forward via the loadbalancing exporter to a load-balanced gateway pool (3+ replicas) that does tail sampling and batching before export. Scaling and HA live in the gateway tier; agents stay thin.

Recall before you leave

01 Name five Collector self-monitoring metrics and what each indicates.
02 What is a cardinality leak in the context of OTel auto-instrumentation, and how do you detect and fix it?
03 Why does the OTel Collector version (v0.x) upgrade more frequently than the OTel spec, and what does this mean for production upgrade strategy?

Recap

The OTel Collector is critical-path observability infrastructure: if it fails, the observability stack fails silently. Production reliability requires three or more gateway replicas behind a load balancer, a persistent disk-backed queue (enough to absorb 5-15 minutes of backend slowdown), health-check probes via the health_check extension, and self-monitoring: alert on dropped_spans, refused_spans, exporter failures, queue saturation, and tail_sampling buffer exhaustion. Version skew between SDKs and Collectors is managed by pinning versions and upgrading quarterly via canary. Common failure modes: OOM under tail sampling (fix: resize num_traces for peak_rate × decision_wait × 2); tail-sample re-routing during scale events (fix: pre-warm pods, scale conservatively); OTLP version mismatch (fix: staged upgrades); auto-instrumentation latency regression (fix: opt-out per instrumentation); cardinality leak from url.full (fix: switch to http.route). Semantic Convention governance, implemented as a per-language SDK wrapper plus a CI lint, is the highest-leverage platform investment for preventing cross-team dashboard breakage. When otelcol_processor_dropped_spans goes non-zero during an incident, suspect the Collector first, then check its memory usage and the tail_sampling buffer to see why it is shedding load.


