Observability
Production propagation failures, span links, and platform design
GitHub carried a propagation regression for a quarter, during which 50% of internal traces were orphans. The tracing dashboards showed traces the whole time. Nobody noticed until an engineer spot-checked the orphan-span rate in a routine review.
Real production propagation failures
Uber 2019: a partial OTel rollout caused 30% of traces to break at the boundary between instrumented and uninstrumented services. Postmortem mandated a “no service ships without W3C propagation” gate enforced in CI. The pattern: instrumented services emit perfect spans; uninstrumented services emit orphans; the two groups show different trace depths in dashboards, but there is no automated alert on the boundary failure.
GitHub 2022: a custom HTTP client wrapper bypassed OTel’s hooks and silently dropped traceparent across half their internal services for a quarter before someone noticed the orphan-span rate had risen from 1% to 50%. The fix was a single line to wrap the client in the OTel-aware version. The lesson: custom wrappers are the most common propagation gap in mature services. The fix is always a one-liner; finding it takes a quarter.
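The wrapper gap can be sketched in a few lines. This is a minimal illustration using only the standard library, not the OTel API: `_current`, `start_span`, `inject`, and the two wrapper functions are hypothetical names standing in for the SDK's context storage and propagator.

```python
import contextvars
import secrets
from typing import Optional

# Hypothetical stand-in for the SDK's "current span context" storage.
_current = contextvars.ContextVar("current_span_context", default=None)


def start_span(parent: Optional[tuple] = None) -> tuple:
    """Return (trace_id, span_id); reuse the parent's trace_id when there is one."""
    trace_id = parent[0] if parent else secrets.token_hex(16)
    ctx = (trace_id, secrets.token_hex(8))
    _current.set(ctx)
    return ctx


def inject(headers: dict) -> dict:
    """Add a W3C traceparent header for the active span, if any."""
    ctx = _current.get()
    if ctx is not None:
        trace_id, span_id = ctx
        # version 00, 32-hex trace-id, 16-hex parent-id, flags 01 (sampled)
        headers["traceparent"] = f"00-{trace_id}-{span_id}-01"
    return headers


def broken_wrapper_headers(extra: dict) -> dict:
    # Builds headers from scratch and never calls inject(): the context is dropped.
    return {"user-agent": "internal-client", **extra}


def fixed_wrapper_headers(extra: dict) -> dict:
    # The one-line fix: route the headers through the propagator.
    return inject({"user-agent": "internal-client", **extra})
```

Both wrappers return a working set of headers, which is why the broken one survives code review and functional tests: the request succeeds either way, and only the downstream trace is affected.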
Slack 2023: tail-sampling Collectors OOMed and took down their tracing pipeline during a major incident — precisely when tracing was most needed. Postmortem added num_traces caps and a separate always-keep tier for high-priority traces. A monitoring gap: OTel Collector health metrics were not on any SLO dashboard.
Datadog 2024 customer report: a large Java workload had a thread pool that didn’t carry context across submitted tasks, so 80% of background-task traces were orphans. Fix: switch to a CurrentTraceContext-aware executor. The bug was present for months; it was discovered during a quarterly orphan-rate review.
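The same class of bug can be reproduced in Python, offered here as an analogue of the Java case rather than a reconstruction of it. The sketch assumes the trace context lives in a `contextvars.ContextVar`; `current_trace_id` and the helper names are hypothetical.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the active trace context.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


def background_task():
    # Reports which trace the task believes it belongs to.
    return current_trace_id.get()


def submit_plain(pool, fn):
    # The task runs in the worker thread's own context: the caller's trace is lost.
    return pool.submit(fn)


def submit_with_context(pool, fn):
    # Snapshot the caller's context at submit time and run the task inside it.
    ctx = contextvars.copy_context()
    return pool.submit(ctx.run, fn)


def demo():
    with ThreadPoolExecutor(max_workers=1) as pool:
        pool.submit(lambda: None).result()  # start the long-lived worker first
        current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
        lost = submit_plain(pool, background_task).result()
        kept = submit_with_context(pool, background_task).result()
    return lost, kept
```

`demo()` returns `None` for the plain submission and the trace-id for the context-aware one. A span started inside the plain task would have no parent to attach to, so it would be emitted as an orphan.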
The shared pattern: propagation bugs are silent. The dashboards keep showing traces. The only detection mechanism is a metric on orphan-span rate, and that metric must be on a dashboard and ideally on an alert — it is never surfaced automatically by OTel defaults.
Observability for propagation itself
The essential propagation health metrics:
| Metric | Normal | Signal when |
|---|---|---|
| orphan_span_rate by service.name | <1% (entry-points only) | Internal service >5% → propagation regression |
| invalid_traceparent_received count | ~0 | Any sustained rate → broken upstream |
| trace_id_per_second | Proportional to RPS × sample_rate | Sudden spike → fresh trace-ids (propagation lost) |
| broken_parent_count | <0.5% | Spans whose parent-id exists in no other span in the same trace |
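Two of these metrics can be derived directly from exported span records. The sketch below assumes a simplified `Span` record with only the fields needed here; a real pipeline would compute the same thing in the Collector or the backend's query language.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    service: str


def orphan_span_rate(spans):
    """Fraction of spans per service that have no parent_id."""
    total, orphans = defaultdict(int), defaultdict(int)
    for s in spans:
        total[s.service] += 1
        if s.parent_id is None:
            orphans[s.service] += 1
    return {svc: orphans[svc] / total[svc] for svc in total}


def broken_parent_count(spans):
    """Spans whose parent_id matches no span_id in the same trace."""
    ids_by_trace = defaultdict(set)
    for s in spans:
        ids_by_trace[s.trace_id].add(s.span_id)
    return sum(
        1 for s in spans
        if s.parent_id is not None and s.parent_id not in ids_by_trace[s.trace_id]
    )
```

Entry-point services legitimately start traces, so a high orphan rate there is expected. The signal is an internal service whose rate rises above its baseline.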
| Metric | Healthy threshold | Alert action |
|---|---|---|
| Orphan spans for internal services | <1% | Page if >5% for 10 min for specific service |
| invalid_traceparent_received | <0.01% | Ticket if non-zero rate sustained >5 min |
| broken_parent_count | <0.5% | Ticket if >2% for 10 min |
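The alert actions above all share one shape: a threshold that must be exceeded for a sustained window. A minimal sketch of that evaluation, assuming one metric sample per minute and a hypothetical `Rule` record (a real deployment would express this in its alerting system's rule language):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    metric: str
    threshold: float      # fraction, e.g. 0.05 for 5%
    sustain_minutes: int
    action: str           # "page" or "ticket"


# Thresholds taken from the table above.
RULES = [
    Rule("orphan_span_rate", 0.05, 10, "page"),
    Rule("broken_parent_count", 0.02, 10, "ticket"),
    Rule("invalid_traceparent_received", 0.0, 5, "ticket"),
]


def evaluate(rule, per_minute_values):
    """Return the rule's action if the latest sustain window is entirely above threshold."""
    window = per_minute_values[-rule.sustain_minutes:]
    if len(window) < rule.sustain_minutes:
        return None  # not enough data yet
    return rule.action if all(v > rule.threshold for v in window) else None
```

Requiring every sample in the window to exceed the threshold keeps a single noisy minute from paging anyone, at the cost of delaying detection by the length of the window.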
Span links: when the parent-child tree breaks down
The parent-child model assumes linear causality: A calls B, B calls C. This breaks in three scenarios:
- Batch processing: a consumer pulls 1,000 messages from Kafka and processes them in one batch. There is no single meaningful “parent” — 1,000 incoming trace contexts feed one batch span.
- Fan-in: multiple parallel sub-jobs converge at a join point. Each sub-job is a child of its own branch; the join point has multiple causal contributors.
- Async follow-ups: the originating request finishes and returns to the user, but spawns follow-ups that execute hours later. The original request’s context is closed; the follow-ups need a causal link without being children of a dead span.
OTel’s span-links solve all of these: a span declares additional SpanContext references it is causally related to but does not descend from. Tracing backends visualise links as dotted lines alongside the solid parent-child tree.
The senior pattern: any trace longer than 30 seconds or wider than 100 spans is a candidate for span-link refactoring. Split the long workflow into sub-traces where each sub-trace fits within the tail-sampling decision window, and use links to preserve causal lineage. This keeps traces small, keeps the sampler happy, and preserves the investigation chain.
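The batch case can be modelled without any SDK to show the shape of the data. This is a sketch with hypothetical types (`SpanContext` as a tuple, a minimal `Span` record); in OTel SDKs, links are typically supplied when the span is started.

```python
import secrets
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

SpanContext = Tuple[str, str]  # (trace_id, span_id)


@dataclass
class Span:
    name: str
    context: SpanContext
    parent: Optional[SpanContext] = None
    links: List[SpanContext] = field(default_factory=list)


def new_context(trace_id: Optional[str] = None) -> SpanContext:
    """Create a span context, starting a new trace unless a trace_id is given."""
    return (trace_id or secrets.token_hex(16), secrets.token_hex(8))


def process_batch(message_contexts: List[SpanContext]) -> Span:
    """One batch span in its own trace, linked to every producer context."""
    return Span(
        name="process-batch",
        context=new_context(),          # fresh trace: the batch has no single parent
        parent=None,
        links=list(message_contexts),   # causal references, not ancestry
    )
```

The batch trace stays at one span regardless of batch size, while each of the producer traces remains reachable from it through the link.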
Long-running traces and the 30-minute problem
Tail samplers have decision windows of 30s–5min. A batch job running for 30 minutes emits spans long after the decision window closes; the late spans look like orphans to the sampler.
Two production patterns:
- Break the work: split long workflows into sub-traces linked via span-links, each fitting in the decision window. Clean architecture, correct by construction.
- Backend late-span support: Tempo, Honeycomb, and Datadog all support late-arriving spans up to 24h after trace start. Skip tail sampling for long traces; use head sampling at 100% for batch workloads. Practical retrofit for legacy batch jobs.
The decision window is the lever to adjust when batch workloads break tail sampling. Tuning it upward increases Collector RAM; the right answer is usually to break the work.
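"Break the work" can be sketched as a greedy packing of workflow steps into sub-traces that each fit the window. The `SubTrace` record and `split_workflow` function are hypothetical, and the step durations are assumed to be known or estimated in advance.

```python
import secrets
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SubTrace:
    trace_id: str
    steps: List[str]
    duration_s: float
    linked_from: Optional[str]  # trace_id of the preceding sub-trace


def split_workflow(steps: List[Tuple[str, float]], window_s: float) -> List[SubTrace]:
    """Greedily pack (name, duration_s) steps into sub-traces no longer than window_s.

    A single step longer than window_s still gets its own sub-trace.
    """
    result: List[SubTrace] = []
    names: List[str] = []
    total = 0.0

    def flush():
        nonlocal names, total
        if names:
            prev = result[-1].trace_id if result else None
            result.append(SubTrace(secrets.token_hex(16), names, total, prev))
            names, total = [], 0.0

    for name, duration in steps:
        if names and total + duration > window_s:
            flush()
        names.append(name)
        total += duration
    flush()
    return result
```

Each sub-trace records the trace-id of its predecessor, which is the information a span-link would carry, so the investigation chain survives the split.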
A 0.5% orphan rate is detected for an internal service. Trace the root cause.
Diagnose a broken trace from tracing-backend output
# Query: trace_id == "4bf92f3577b34da6a3ce929d0e0e4736"
# Result: 7 spans total
# service span_id parent_id duration status
1 api-gateway 1a2b3c4d5e6f7890 - 18ms OK
2 auth 7890abcdef123456 1a2b3c4d5e6f7890 14ms OK
3 inventory abcdef1234567890 1a2b3c4d5e6f7890 1200ms OK
4 payment fedcba0987654321 - 80ms OK # ORPHAN
5 postgres-client 1111222233334444 fedcba0987654321 55ms OK
6 email-job 5555666677778888 - 240ms OK # ORPHAN
7 audit-log 9999aaaabbbbcccc - 12ms OK # ORPHAN
# Also separate orphan traces with single spans:
# trace_id 9981a... payment service, 78ms
# trace_id ab32c... email-job service, 280ms
# trace_id ff8e1... audit-log service, 14ms
The trace contains 7 spans, but 3 are orphans (no parent_id), and 3 single-span orphan traces with the same service names exist. What's happening?
Design end-to-end trace propagation for a new platform with 30 microservices, browser frontend, Kafka backbone, a service mesh, and a tail-sampling collector tier.
- Polyglot: 12 services Node.js, 10 Java, 5 Go, 3 Python.
- Browser frontend (React) issues fetch calls to the API gateway.
- Kafka used for async messaging between 8 of the services.
- Service mesh: Linkerd (Linux), used for HTTP and gRPC east-west.
- Sampling: 100% errors, 100% slow (>2s), 1% baseline.
- On-call must be able to view any user request as a single trace within 30s of completion.
- W3C TraceContext + Baggage default everywhere; B3 only for legacy interop, deprecated.
- OTel SDK registered before app startup in every service; CI gate verifies this.
- Kafka, gRPC, mesh all carry traceparent automatically via auto-instrumentation.
- Async boundaries (setTimeout, workers, callbacks) require explicit context.bind discipline.
- Tail-sampling collector with load-balancing exporter for trace-id consistency.
- Sampling rules: 100% errors + 100% slow + 1% baseline.
- Propagation has its own observability layer (orphan-span rate, invalid-traceparent count).
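The sampling rules and the trace-id routing in this design can be sketched as two pure functions. Both are simplified assumptions: `keep_trace` and `route` are hypothetical names, the baseline uses a plain numeric bucket of the trace-id, and the routing uses a modulo where a production load balancer would more likely use consistent hashing.

```python
from typing import List


def keep_trace(trace_id: str, has_error: bool, duration_ms: float,
               slow_ms: float = 2000.0, baseline: float = 0.01) -> bool:
    """Tail-sampling decision: keep all errors, all slow traces, and a baseline share."""
    if has_error or duration_ms > slow_ms:
        return True
    # Deterministic baseline: map the low 56 bits of the trace-id into [0, 1),
    # so the same trace-id always gets the same decision.
    bucket = int(trace_id[-14:], 16) / float(1 << 56)
    return bucket < baseline


def route(trace_id: str, collectors: List[str]) -> str:
    """Send every span of a trace to the same Collector (simplified modulo routing)."""
    return collectors[int(trace_id, 16) % len(collectors)]
```

Routing by trace-id matters because a tail sampler can only judge a trace it sees in full. If spans of one trace land on different Collectors, each one decides on a fragment.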
- Healthy orphan-span rate (internal services): ≤1% of all spans
- Healthy invalid_traceparent rate: ≤0.01%
- Healthy broken-parent rate: ≤0.5%
- Alert threshold, internal service orphan rate: >5% for 10 min
- GitHub 2022, orphan rate when regression discovered: 50% (from 1% baseline)
- Time to detect GitHub regression without alerting: >1 quarter
A batch processor pulls 1,000 messages from Kafka and processes them in one transaction. The engineer models this as one parent span with 1,000 child spans, one per message. After deploying, the tail-sampling Collector OOMs. What is the architectural fix?
A production team adds orphan-span-rate alerting. The alert fires for 'email-job' at 6% (baseline 0.5%). What is the first diagnostic step?
- 01 Explain why span-links exist and when a senior engineer reaches for them instead of parent-child relationships.
- 02 Describe three propagation health metrics that every production tracing deployment should monitor and the alert thresholds.
- 03 Outline the 8-layer platform design for end-to-end propagation in a polyglot 30-service system with Kafka, service mesh, and tail sampling.
Production propagation failures are silent. Uber (30% of traces broken at the instrumentation boundary), GitHub (50% orphan rate for a quarter), and Datadog customers (80% background-task orphans) all failed this way, and Slack lost its tracing pipeline to a Collector OOM in the middle of an incident. The shared pattern: dashboards show traces, just not connected ones, and no metric was alerting on the disconnection. The fix is to observe propagation health with its own RED-equivalent metrics (orphan-span rate by service, invalid-traceparent count, broken-parent rate) and alert on them. Span-links solve the cases the parent-child tree cannot: batch fan-in, async follow-ups, workflows longer than the decision window. Long-running traces must be broken into span-linked sub-traces that fit the Collector’s decision window. Propagation is the invisible foundation every other observability feature depends on; treat its health as a first-class production metric.