Observability
Operating the OTel Collector: reliability, version skew, failure modes, and governance
The Collector fails. Application errors spike but no alerts fire, no traces appear in the backend. The on-call engineer checks dashboards — all green, because the dashboards depend on the same Collector that just failed. OTel self-monitoring is not optional.
Reliability patterns
HA gateway, minimum 3 replicas: A single gateway pod failure loses all in-flight spans buffered in that pod. With three replicas, a one-pod failure is survivable because clients retry against the remaining pods. Run the replicas behind a Kubernetes Service or cloud load balancer; the loadbalancing exporter on agents targets the service endpoint, so scale-up and scale-down are transparent to agents.
Persistent queue — the file_storage extension provides a disk-backed buffer that survives Collector restarts. Configure it on the gateway’s export pipelines to absorb 5-15 minutes of backend slowdown without dropping spans:
    extensions:
      file_storage:
        directory: /var/otel/queue        # disk-backed buffer location

    exporters:
      otlp/primary:
        endpoint: backend:4317
        sending_queue:
          storage: file_storage           # persist the queue through the extension above
          queue_size: 10000

    service:
      extensions: [file_storage]          # the extension must also be enabled here

Health checks: point liveness and readiness probes at the health_check extension on port 13133. Do not let a slow or overloaded Collector be considered ready; it will continue receiving spans it cannot process.
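The probe wiring described above could look roughly like the following. This is a minimal sketch, assuming the Collector runs as a Kubernetes container; the container name and probe timings are illustrative assumptions, not values from this lesson.

```yaml
# Collector config: enable the health_check extension on its usual port.
extensions:
  health_check:
    endpoint: 0.0.0.0:13133
service:
  extensions: [health_check]
---
# Pod spec fragment (hypothetical container name and timings).
containers:
  - name: otel-gateway
    ports:
      - containerPort: 13133
    livenessProbe:
      httpGet:
        path: /
        port: 13133
      periodSeconds: 10
    readinessProbe:
      httpGet:
        path: /
        port: 13133
      periodSeconds: 5
      failureThreshold: 2
```

Whether the endpoint actually turns unhealthy under overload depends on the extension version and its settings, so it is worth verifying that behaviour in a load test before relying on the readiness probe alone.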
Self-monitoring — scrape the Collector’s /metrics endpoint (port 8888) and alert on:
- otelcol_processor_dropped_spans rate > 0: memory_limiter engaging; warn immediately
- otelcol_receiver_refused_spans rate > 0: back-pressure at the receiver; correlates with memory_limiter
- otelcol_exporter_send_failed_spans rate > 0: backend connectivity problem
- otelcol_exporter_queue_size / queue capacity > 80%: exporter backlog building; backend slow
- otelcol_processor_tail_sampling_count_traces_on_memory vs num_traces: buffer exhaustion approaching
- process_resident_memory_bytes vs configured limit: approaching OOM
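Several of these alerts can be expressed as Prometheus alerting rules. The sketch below assumes a Prometheus instance that scrapes the Collector's port 8888 directly, so that alerting does not depend on the Collector it is watching. Group name, alert names, windows, and severities are assumptions, and some Collector versions expose these counters with a `_total` suffix, so check the names on your own /metrics endpoint.

```yaml
groups:
  - name: otel-collector-self-monitoring   # hypothetical group name
    rules:
      - alert: CollectorDroppingSpans
        expr: rate(otelcol_processor_dropped_spans[5m]) > 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: Collector is dropping spans (memory_limiter engaging)
      - alert: CollectorRefusingSpans
        expr: rate(otelcol_receiver_refused_spans[5m]) > 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: Receiver is applying back-pressure
      - alert: CollectorExportFailures
        expr: rate(otelcol_exporter_send_failed_spans[5m]) > 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: Exporter cannot reach the backend
      - alert: CollectorQueueNearCapacity
        # Queue fill ratio above 80% for five minutes.
        expr: otelcol_exporter_queue_size / otelcol_exporter_queue_capacity > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Exporter queue is more than 80% full
```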
Resource sizing — a commodity gateway pod (4 CPU, 8 GB RAM) handles ~100-200k spans/sec with tail sampling. Size for peak + 2× headroom. Set CPU requests low and RAM requests/limits tight (memory_limiter should engage before Linux OOM killer).
| Reliability concern | Solution | Alert |
|---|---|---|
| Pod crash | 3+ replicas behind Service | PodRestartCount > 1/hr |
| Backend slowdown | Persistent queue (5-15 min) | queue_size > 80% capacity |
| Memory spike | memory_limiter drops before OOM | dropped_spans rate > 0 |
| Pipeline lag | Monitor (ObservedTimestamp - Timestamp) p99 | p99 lag > 60s |
Version skew and stability strategy
OTel is many independently versioned components: the spec (v1.x), each language SDK (varies), each Collector binary (v0.x with rapid releases), each Semantic Convention domain (HTTP 1.x, DB 1.x, etc.).
Compatibility: SDKs are forward-compatible with newer Collectors across multiple minor versions; OTLP is stable. The Collector has a notion of stable and beta components — production setups stick to stable receivers, processors, exporters.
Strategy:
- Pin SDK and Collector versions in deployment manifests
- Upgrade quarterly with a canary before fleet-wide rollout
- Track Semantic Convention versions per service so dashboards know what attribute names to expect
- Use the OTel Operator for Collector upgrades: CRD update triggers a rolling restart, zero downtime
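A pinned, Operator-managed gateway might be declared along these lines. This is a sketch under assumptions: it uses the `v1beta1` OpenTelemetryCollector API (older Operator releases use `v1alpha1`, where `config` is a string), and the resource name and image tag are illustrative only.

```yaml
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: gateway                      # hypothetical name
spec:
  mode: deployment
  replicas: 3
  # Pinned image tag; changing it is what triggers the rolling restart.
  image: otel/opentelemetry-collector-contrib:0.100.0   # illustrative version
  config:
    receivers:
      otlp:
        protocols:
          grpc: {}
    exporters:
      otlp/primary:
        endpoint: backend:4317
    service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [otlp/primary]
```

Keeping this manifest in the GitOps repo makes the pinned version and the upgrade history reviewable in one place.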
Production failure modes
(a) Collector OOM under tail sampling: Gateway buffer grows past memory limit because decision_wait is too long or trace volume spiked. Mitigation: memory_limiter before tail_sampling; alert on dropped_spans; right-size num_traces for peak rate × decision_wait × 2.
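The mitigation in (a) can be sketched as a processor configuration. The decision_wait, the peak trace rate used in the sizing comment, the memory percentages, and the two sampling policies are all assumptions chosen for illustration; the `otlp` receiver and `otlp/primary` exporter are assumed to be defined elsewhere in the config.

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80          # assumption: start refusing data at 80% of available memory
    spike_limit_percentage: 20    # assumption
  tail_sampling:
    decision_wait: 10s            # assumption
    # num_traces ~ peak new traces/sec x decision_wait x 2
    # e.g. 2,500 traces/s x 10 s x 2 = 50,000 (hypothetical peak rate)
    num_traces: 50000
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: sample-rest
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

service:
  pipelines:
    traces:
      receivers: [otlp]
      # memory_limiter runs first so it can shed load before the
      # tail_sampling buffer grows past the pod's memory limit.
      processors: [memory_limiter, tail_sampling]
      exporters: [otlp/primary]
```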
(b) Tail-sample re-routing on scale events: Gateway pool scales up, loadbalancing exporter’s hash ring re-shuffles, in-flight traces lose some spans. Mitigation: pre-warm new pods, scale conservatively, use longer convergence windows on the loadbalancing exporter.
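For reference, an agent-side loadbalancing exporter keyed on trace ID looks roughly like this. The hostname is hypothetical, and the lesson does not name the setting behind "longer convergence windows"; the DNS resolver's `interval` is shown only as one related knob, which is an assumption.

```yaml
exporters:
  loadbalancing:
    routing_key: traceID          # keep all spans of a trace on one gateway pod
    protocol:
      otlp:
        tls:
          insecure: true          # assumption: plaintext inside the cluster
    resolver:
      dns:
        # Hypothetical headless Service that resolves to the gateway pod IPs.
        hostname: otel-gateway-headless.observability.svc.cluster.local
        interval: 30s             # re-resolve less often than the 5s default
```

A longer resolve interval reduces how often the hash ring changes, at the cost of agents taking longer to notice new or removed gateway pods.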
(c) OTLP version mismatch: An SDK upgraded ahead of the Collector sends a field from a newer OTLP proto that the Collector does not recognise; the Collector may silently drop attributes or the whole record. Mitigation: SDK and Collector compatibility matrix; staged upgrades; never upgrade SDKs ahead of the Collector they send to.
(d) Auto-instrumentation footprint regression: A new minor version of the OTel Java Agent adds an instrumentation that slows a critical library. Mitigation: canary the agent upgrade; monitor p99 latency on the affected service; use per-instrumentation opt-out flags (OTEL_INSTRUMENTATION_X_ENABLED=false).
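The per-instrumentation opt-out in (d) is set through the service's environment. A minimal sketch as a Deployment fragment, where the container name and agent path are hypothetical and `JDBC` stands in for whichever instrumentation name is causing the regression:

```yaml
# Deployment fragment for the affected service (hypothetical names and paths).
containers:
  - name: order-service
    env:
      - name: JAVA_TOOL_OPTIONS
        value: "-javaagent:/otel/opentelemetry-javaagent.jar"
      # Disable a single instrumentation by name; JDBC is only an example.
      - name: OTEL_INSTRUMENTATION_JDBC_ENABLED
        value: "false"
```

Disabling one instrumentation at a time on the canary makes it possible to attribute the latency change before deciding whether to keep the opt-out.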
(e) Cardinality leak via auto-instrumentation: Auto-instrumented HTTP client adds url.full (the raw URL with query parameters) as an attribute, exploding cardinality at the metrics backend. Mitigation: configure the instrumentation to use http.route (templated) instead of url.full; strip query strings via an attributes processor at the Collector.
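Stripping the query string at the Collector can be sketched as follows. Note the substitution: this example uses the transform processor's OTTL `replace_pattern` function rather than the attributes processor named above, because it expresses the regex rewrite directly. The processor instance name is an assumption.

```yaml
processors:
  transform/strip_query:          # hypothetical instance name
    error_mode: ignore
    trace_statements:
      - context: span
        statements:
          # Drop everything from the first "?" onward in url.full.
          - replace_pattern(attributes["url.full"], "\\?.*", "")
    metric_statements:
      - context: datapoint
        statements:
          - replace_pattern(attributes["url.full"], "\\?.*", "")
```

This only removes query parameters; path segments that embed IDs still vary per request, which is why the templated http.route remains the preferred attribute for metrics.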
Semantic Convention governance
Semantic Conventions are how every team’s telemetry composes at fleet scale. Governance failures are expensive:
- Team-A names a field route
- Team-B names it http_route
- Team-C names it http.route (the correct Semantic Convention name)
- Cross-team dashboards use http.route, so teams A and B are invisible
Pattern: platform team publishes a per-language wrapper that pre-configures Semantic Convention attribute extraction. New services import the wrapper; CI lint rejects raw SDK usage in new code. The wrapper handles:
- HTTP route extraction (matched template, not raw URL)
- DB system tagging (db.system=postgresql, not “psql”)
- Redaction deny-lists
- Trace-context mixins for logs
Quarterly audit: check top-10 most-used attribute names per service for Semantic Convention drift. The audit output is the platform team’s backlog.
Why this works
Why is the Collector’s release cadence (~monthly) faster than the spec’s? The spec defines stable contracts (OTLP, signal data models, Semantic Conventions) that must evolve slowly for backward compatibility. The Collector is an implementation detail — it can add processors, receivers, and exporters in minor versions without breaking the spec. This means the Collector frequently ships new functionality (a new receiver, a new processor, a new OTTL capability) while the underlying spec contract stays stable. Production teams pin the Collector version and upgrade quarterly — not monthly — because even stable Collector releases occasionally change default behaviour in processors.
A Collector gateway pod's resident memory is at 1.92 GB of a 2 GB limit. otelcol_processor_dropped_spans is non-zero and otelcol_processor_tail_sampling_count_traces_on_memory is at 62,400 (num_traces configured as 50,000). What is the root cause and durable fix?
A new minor version of the OTel Java Agent adds an instrumentation for the company's internal RPC library. After upgrading, p99 latency on the order service rises 8%. What is the investigation and mitigation?
Order the operational steps for a safe OTel Collector version upgrade:
1. Check the Collector changelog for default-behaviour changes in processors used in production
2. Update the Collector version in the OTel Operator CRD for a canary gateway replica
3. Monitor canary for 24h: dropped_spans, refused_spans, exporter latency, tail_sampling buffer size
4. If canary is clean, apply the CRD update to remaining gateway replicas (rolling restart)
5. Update the pinned Collector version in the deployment manifests / GitOps repo
6. Add the upgrade to the quarterly SDK + Collector version audit
1. Name five Collector self-monitoring metrics and what each indicates.
2. What is a cardinality leak in the context of OTel auto-instrumentation, and how do you detect and fix it?
3. Why does the OTel Collector version (v0.x) upgrade more frequently than the OTel spec, and what does this mean for production upgrade strategy?
The OTel Collector is critical-path observability infrastructure: if it fails, the observability stack fails silently. Production reliability requires three or more gateway replicas behind a load balancer, a persistent disk-backed queue (5-15 minutes of absorb capacity for backend slowdowns), health-check probes via the health_check extension, and self-monitoring — alert on dropped_spans, refused_spans, exporter failures, queue saturation, and tail_sampling buffer exhaustion. Version skew between SDKs and Collectors is managed by pinning versions and upgrading quarterly via canary. Common failure modes: OOM under tail sampling (fix: resize num_traces for peak_rate × decision_wait × 2); tail-sample re-routing during scale events (fix: pre-warm pods, scale conservatively); OTLP version mismatch (fix: staged upgrades); auto-instrumentation latency regression (fix: opt-out per instrumentation); cardinality leak from url.full (fix: switch to http.route). Semantic Convention governance — per-language SDK wrapper + CI lint — is the highest-leverage platform investment for preventing cross-team dashboard breakage.