
Production propagation failures, span links, and platform design

Propagation bugs are silent: Uber, GitHub, Slack, Datadog all had dashboards showing traces, just not the right ones. Span links solve fan-in and async follow-ups. Orphan rate and invalid-traceparent count are the meta-layer that catches regressions before customers notice.

GitHub ran a propagation regression for a quarter where 50% of internal traces were orphans. The tracing dashboards showed traces the whole time. Nobody noticed until an engineer spot-checked the orphan-span rate in a routine review.

Real production propagation failures

Uber 2019: a partial OTel rollout caused 30% of traces to break at the boundary between instrumented and uninstrumented services. Postmortem mandated a “no service ships without W3C propagation” gate enforced in CI. The pattern: instrumented services emit perfect spans; uninstrumented services emit orphans; the two groups show different trace depths in dashboards, but there is no automated alert on the boundary failure.

GitHub 2022: a custom HTTP client wrapper bypassed OTel’s hooks and silently dropped traceparent across half their internal services. A quarter passed before someone noticed that the orphan-span rate had risen from 1% to 50%. The fix was a single line: swap the wrapper for the OTel-aware client. The lesson: custom wrappers are the most common propagation gap in mature services. The fix is usually a one-liner; finding it can take a quarter.

Slack 2023: tail-sampling Collectors OOMed and took down their tracing pipeline during a major incident — precisely when tracing was most needed. Postmortem added num_traces caps and a separate always-keep tier for high-priority traces. A monitoring gap: OTel Collector health metrics were not on any SLO dashboard.

Datadog 2024 customer report: a large Java workload had a thread pool that didn’t carry context across submitted tasks, so 80% of background-task traces were orphans. Fix: switch to a CurrentTraceContext-aware executor. The bug was present for months; it was discovered during a quarterly orphan-rate review.
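The thread-pool failure mode is not specific to Java. As a minimal sketch of the same bug and the same style of fix in Python, the example below uses the standard-library contextvars module as a stand-in for a tracer's current-context slot. The names current_trace_id, background_task and submit_with_context are hypothetical and are not part of any tracing SDK.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the tracer's "current trace id" slot.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


def background_task():
    """Return the trace id that is visible inside the worker thread."""
    return current_trace_id.get()


def submit_with_context(pool, fn, *args):
    """Submit fn so that it runs inside a copy of the caller's context."""
    ctx = contextvars.copy_context()
    return pool.submit(ctx.run, fn, *args)


def demo():
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Start the worker before any request context exists, as a
        # long-lived pool created at application startup would.
        pool.submit(lambda: None).result()

        token = current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
        try:
            # Plain submit: the worker runs with its own context, so the
            # trace id set by the caller is not visible to the task.
            naive = pool.submit(background_task).result()
            # Context-aware submit: the caller's context travels with the task.
            wrapped = submit_with_context(pool, background_task).result()
        finally:
            current_trace_id.reset(token)
    return naive, wrapped
```

A task submitted the plain way would start a fresh trace, which is exactly how background-task orphans appear; wrapping the submission is the Python analogue of switching to a context-aware executor.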

The shared pattern: propagation bugs are silent. The dashboards keep showing traces. The only detection mechanism is a metric on orphan-span rate, and that metric must be on a dashboard and ideally on an alert — it is never surfaced automatically by OTel defaults.

These are not 2% blips: 30–80% of traces were silently disconnected while dashboards kept showing traces, undetected for months to a full quarter. The size of the gap from the ≤1% healthy floor is exactly why orphan-span rate needs its own alert.

Figure: broken versus fixed propagation. Top row (broken): the custom client drops traceparent, so payment finds no incoming context and starts a fresh trace T2. Its span is an orphan, the parent link to auth is severed, and one logical request now appears as two disconnected traces. Bottom row (fix): swap in the OTel-aware client so that it forwards the header; payment continues trace T1 with parent=auth.
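The broken and fixed paths can be sketched in a few lines. This is a toy model under stated assumptions, not real SDK code: handle_request, broken_client and aware_client are hypothetical helpers, and only the traceparent layout (version, trace-id, parent-id, flags) follows the W3C format.

```python
import secrets


def handle_request(service, headers):
    """Continue the incoming trace if traceparent is present, else start a new one."""
    traceparent = headers.get("traceparent")
    if traceparent:
        _version, trace_id, parent_id, _flags = traceparent.split("-")
    else:
        # No incoming context: a fresh trace id and no parent (an orphan
        # when this happens on an internal service).
        trace_id, parent_id = secrets.token_hex(16), None
    return {
        "service": service,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "span_id": secrets.token_hex(8),
    }


def broken_client(span, extra_headers):
    """Custom wrapper that builds its own headers and forgets trace context."""
    return dict(extra_headers)


def aware_client(span, extra_headers):
    """Wrapper that forwards the caller's context as a traceparent header."""
    traceparent = f"00-{span['trace_id']}-{span['span_id']}-01"
    return {**extra_headers, "traceparent": traceparent}


def run_scenario():
    auth = handle_request("auth", {})
    body_headers = {"content-type": "application/json"}
    orphan = handle_request("payment", broken_client(auth, body_headers))
    child = handle_request("payment", aware_client(auth, body_headers))
    return auth, orphan, child
```

In the broken path payment ends up with a different trace_id and no parent; in the fixed path it shares auth's trace_id and records auth's span_id as its parent.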

Observability for propagation itself

The essential propagation health metrics:

Metric	Normal	Signal when
`orphan_span_rate` by `service.name`	<1% (entry-points only)	Internal service >5% → propagation regression
`invalid_traceparent_received` count	~0	Any sustained rate → broken upstream
`trace_id_per_second`	Proportional to RPS × sample_rate	Sudden spike → fresh trace-ids (propagation lost)
`broken_parent_count` (spans whose parent-id matches no other span in the same trace)	<0.5%	>2% for 10 min → parent spans dropped or never exported
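As a rough sketch of how the orphan and broken-parent figures could be derived from exported span records: the span dict keys (trace_id, span_id, parent_id, service) and the entry_services set are assumptions made for this example, not a backend schema.

```python
from collections import Counter


def propagation_metrics(spans, entry_services):
    """Compute per-service orphan rate and a broken-parent count.

    spans: list of dicts with trace_id, span_id, parent_id, service.
    entry_services: services where a parentless root span is legitimate.
    """
    # Index every span id by trace so parent lookups stay within one trace.
    ids_by_trace = {}
    for span in spans:
        ids_by_trace.setdefault(span["trace_id"], set()).add(span["span_id"])

    total = Counter()
    orphans = Counter()
    broken_parents = 0
    for span in spans:
        total[span["service"]] += 1
        if span["parent_id"] is None:
            # Parentless spans are expected only at entry points.
            if span["service"] not in entry_services:
                orphans[span["service"]] += 1
        elif span["parent_id"] not in ids_by_trace[span["trace_id"]]:
            # The parent id points at a span that is missing from the trace.
            broken_parents += 1

    orphan_rate = {svc: orphans[svc] / count for svc, count in total.items()}
    return orphan_rate, broken_parents
```

Grouping the rate by service is the important design choice: a fleet-wide average can stay under 1% while one internal service sits at 50%.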

Metric	Healthy threshold	Alert action
Orphan spans for internal services	<1%	Page if >5% for 10 min for a specific service
invalid_traceparent_received	<0.01%	Ticket if non-zero rate sustained >5 min
broken_parent_count	<0.5%	Ticket if >2% for 10 min
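The "for 10 min" qualifier matters as much as the threshold, because it keeps a single noisy scrape from paging anyone. A minimal sketch of that rule, assuming the metric arrives as (timestamp in seconds, value) samples sorted by time; sustained_breach is a hypothetical helper, not an alerting-system API.

```python
def sustained_breach(samples, threshold, min_duration_s):
    """Return True if the value stays above threshold for min_duration_s.

    samples: list of (timestamp_s, value) tuples sorted by timestamp.
    Any sample at or below the threshold resets the breach window.
    """
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp
            if timestamp - breach_start >= min_duration_s:
                return True
        else:
            breach_start = None
    return False


# Orphan rate of 6% scraped once a minute for ten minutes.
steady = [(t, 0.06) for t in range(0, 601, 60)]
# Same series, but one healthy sample in the middle resets the window.
dipped = [(t, 0.01 if t == 300 else 0.06) for t in range(0, 601, 60)]
```

With a 5% threshold and a 600-second window, the steady series would page and the dipped series would not.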

Span links: when the parent-child tree breaks down

The parent-child model assumes linear causality: A calls B, B calls C. This breaks in three scenarios:

Batch processing: a consumer pulls 1,000 messages from Kafka and processes them in one batch. There is no single meaningful “parent” — 1,000 incoming trace contexts feed one batch span.
Fan-in: multiple parallel sub-jobs converge at a join point. Each sub-job is a child of its own branch; the join point has multiple causal contributors.
Async follow-ups: the originating request finishes and returns to the user, but spawns follow-ups that execute hours later. The original request’s context is closed; the follow-ups need a causal link without being children of a dead span.

OTel’s span-links solve all of these: a span declares additional SpanContext references it is causally related to but does not descend from. Tracing backends visualise links as dotted lines alongside the solid parent-child tree.
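To make the distinction between a parent and a link concrete, here is a small data-model sketch of the batch case. SpanContext and Span are simplified stand-ins defined for this example rather than SDK classes, and start_batch_span is a hypothetical helper.

```python
import secrets
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class SpanContext:
    trace_id: str
    span_id: str


@dataclass
class Span:
    name: str
    context: SpanContext
    parent: Optional[SpanContext] = None                    # at most one parent
    links: List[SpanContext] = field(default_factory=list)  # any number of links


def parse_traceparent(header):
    """Extract the trace id and span id from a W3C traceparent value."""
    _version, trace_id, span_id, _flags = header.split("-")
    return SpanContext(trace_id, span_id)


def start_batch_span(message_headers):
    """Create one batch span in its own trace, linked to every producer context."""
    links = [
        parse_traceparent(headers["traceparent"])
        for headers in message_headers
        if "traceparent" in headers
    ]
    context = SpanContext(secrets.token_hex(16), secrets.token_hex(8))
    return Span("process-batch", context, parent=None, links=links)
```

The batch span has no parent at all; causality is carried entirely by the links. One practical caveat: OpenTelemetry SDKs cap the number of links per span (the default limit is 128), so a 1,000-message batch would need that limit raised or the links deduplicated.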

The senior pattern: any trace longer than 30 seconds or wider than 100 spans is a candidate for span-link refactoring. Split the long workflow into sub-traces where each sub-trace fits within the tail-sampling decision window, and use links to preserve causal lineage. This keeps traces small, keeps the sampler happy, and preserves the investigation chain.

Long-running traces and the 30-minute problem

Tail samplers have decision windows of 30s–5min. A batch job running for 30 minutes emits spans long after the decision window closes; the late spans look like orphans to the sampler.
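A toy simulation shows how quickly a long job outruns the window. It assumes a simplified sampler that starts its timer when the first span of a trace arrives and treats anything arriving after the window as late; classify_arrivals is a hypothetical function, not Collector code.

```python
def classify_arrivals(arrivals, decision_wait_s):
    """Count spans that arrive inside and after each trace's decision window.

    arrivals: list of (arrival_time_s, trace_id) tuples sorted by time.
    The window for a trace opens when its first span arrives.
    """
    first_seen = {}
    counts = {}
    for arrival_time, trace_id in arrivals:
        window_start = first_seen.setdefault(trace_id, arrival_time)
        bucket = counts.setdefault(trace_id, {"buffered": 0, "late": 0})
        if arrival_time - window_start <= decision_wait_s:
            bucket["buffered"] += 1
        else:
            bucket["late"] += 1
    return counts


# A web request finishes well inside a 30 s window.
web = [(0.0, "web"), (0.1, "web"), (0.5, "web")]
# A batch job emits one span per minute for 30 minutes.
batch = [(float(t), "batch") for t in range(0, 1801, 60)]
```

With a 30-second window, every web span is buffered in time, while the batch trace has one span inside the window and thirty outside it.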

Two production patterns:

Break the work: split long workflows into sub-traces linked via span-links, each fitting in the decision window. Clean architecture, correct by construction.
Backend late-span support: Tempo, Honeycomb, and Datadog all support late-arriving spans up to 24h after trace start. Skip tail sampling for long traces; use head sampling at 100% for batch workloads. Practical retrofit for legacy batch jobs.
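The first pattern can be sketched as a packing problem: group consecutive steps so that each group fits the decision window, and give each group a link back to the previous one. This is an illustrative greedy split with hypothetical names; a real workflow would split on meaningful business boundaries rather than purely on duration.

```python
def split_into_subtraces(step_durations_s, window_s):
    """Pack consecutive steps into sub-traces that each fit the window.

    Returns a list of dicts: the step indexes in each sub-trace and the
    index of the previous sub-trace it should link to (None for the first).
    """
    groups = []
    current, used = [], 0.0
    for index, duration in enumerate(step_durations_s):
        if duration > window_s:
            raise ValueError(f"step {index} alone exceeds the decision window")
        if current and used + duration > window_s:
            groups.append(current)
            current, used = [], 0.0
        current.append(index)
        used += duration
    if current:
        groups.append(current)

    return [
        {"subtrace": n, "steps": steps, "links_to": n - 1 if n > 0 else None}
        for n, steps in enumerate(groups)
    ]


plan = split_into_subtraces([10, 15, 20, 5, 25, 30], window_s=30)
```

Each entry in plan would become its own trace, with a span link pointing at the root span of the sub-trace named in links_to, so the investigation chain survives even though no single trace spans the whole job.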

The decision window is the lever to adjust when batch workloads break tail sampling. Tuning it upward increases Collector RAM; the right answer is usually to break the work.

Trace it

A 0.5% orphan rate is detected for an internal service. Trace the root cause.

Step 1: is a 0.5% orphan rate normal, or a signal?
Step 2: filter orphans by service.name. What's the pattern?
Step 3: one specific service is the source. Inspect inbound traffic. What should you look for?
Step 4: traceparent is absent on requests from one upstream client. Why?
Step 5: what is the durable fix?
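For the inbound-traffic inspection in steps 3 and 4, a small classifier helps separate requests with a missing header from requests with a malformed one. The sketch below checks only the version 00 layout of the W3C traceparent header and assumes header names are already lower-cased; classify_traceparent and tally_by_client are hypothetical helpers.

```python
import re
from collections import Counter

# version(2 hex) - trace-id(32 hex) - parent-id(16 hex) - flags(2 hex), lowercase.
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def classify_traceparent(headers):
    """Return 'missing', 'invalid' or 'valid' for a request's traceparent."""
    value = headers.get("traceparent")
    if value is None:
        return "missing"
    match = _TRACEPARENT.fullmatch(value.strip())
    if match is None:
        return "invalid"
    version, trace_id, parent_id, _flags = match.groups()
    # Version ff is forbidden, and all-zero ids are invalid by specification.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return "invalid"
    return "valid"


def tally_by_client(requests):
    """Count classifications per upstream client.

    requests: list of (client_name, headers) tuples.
    """
    counts = Counter()
    for client, headers in requests:
        counts[(client, classify_traceparent(headers))] += 1
    return counts
```

A tally where a single client accounts for nearly all of the "missing" results points at that client's HTTP wrapper rather than at the receiving service.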

Debug this

Diagnose a broken trace from tracing-backend output

# Query: trace_id == "4bf92f3577b34da6a3ce929d0e0e4736"
# Result: 7 spans total

#  service           span_id           parent_id           duration   status
1  api-gateway       1a2b3c4d5e6f7890  -                   18ms       OK
2  auth              7890abcdef123456  1a2b3c4d5e6f7890    14ms       OK
3  inventory         abcdef1234567890  1a2b3c4d5e6f7890    1200ms     OK
4  payment           fedcba0987654321  -                   80ms       OK    # ORPHAN
5  postgres-client   1111222233334444  fedcba0987654321    55ms       OK
6  email-job         5555666677778888  -                   240ms      OK    # ORPHAN
7  audit-log         9999aaaabbbbcccc  -                   12ms       OK    # ORPHAN

# Also separate orphan traces with single spans:
# trace_id 9981a... payment service, 78ms
# trace_id ab32c... email-job service, 280ms
# trace_id ff8e1... audit-log service, 14ms

The trace contains 7 spans, but 3 of them (payment, email-job, audit-log) have no parent_id even though they are not the root span, and 3 single-span orphan traces from the same three services also exist. What is happening?

Design challenge

Design end-to-end trace propagation for a new platform with 30 microservices, browser frontend, Kafka backbone, a service mesh, and a tail-sampling collector tier.

Polyglot: 12 services Node.js, 10 Java, 5 Go, 3 Python.
Browser frontend (React) issues fetch calls to the API gateway.
Kafka used for async messaging between 8 of the services.
Service mesh: Linkerd (Linux), used for HTTP and gRPC east-west.
Sampling: 100% errors, 100% slow (>2s), 1% baseline.
On-call must be able to view any user request as a single trace within 30s of completion.

Reference answer

Layer 1: propagator standardisation. Mandate OpenTelemetry SDKs everywhere; the default propagator is the W3C TraceContext + Baggage composite. Drop B3 for outbound; accept B3 on inbound during a 90-day deprecation. Per-language SDK choice: opentelemetry-js (Node and browser), opentelemetry-java (Java services), opentelemetry-go (Go), opentelemetry-python (Python). Browser: use the opentelemetry-js web SDK with fetch instrumentation; restrict propagation to same-origin and explicit-allowlist CORS origins so that traceparent doesn't leak to third parties.
Layer 2: HTTP propagation. Auto-instrumentation on every service: register OTel before app code starts (Node: -r flag; Java: javaagent; Go: explicit init; Python: opentelemetry-instrument). The service mesh (Linkerd) passes the headers through untouched and can emit its own mesh-hop spans for network visibility; copying context from an inbound request onto the outbound calls it triggers remains the job of the in-process SDK.
Layer 3: Kafka propagation. Producer side: OTel Kafka instrumentation injects traceparent into record headers. Consumer side: extract traceparent on poll and start a new span with parent = producer's span-id. When one consumer batches 100 messages, use span-links rather than direct parent-child: create one batch span and link it to all incoming contexts.
Layer 4: async-boundary discipline. Code review checklist: any callback, setTimeout, setImmediate, queueMicrotask, worker dispatch, or Promise chain crossing a logical boundary must be wrapped with context.bind or equivalent. Custom thread pools must use OTel context-aware executors. A linter rule flags raw setTimeout in HTTP-handler scope.
Layer 5: collector tier. OpenTelemetry Collector deployed as a stateless agent layer (1 per node, DaemonSet) plus a stateful tail-sampling tier (5 replicas) behind a load-balancing exporter (trace-id hash). Decision window 30s. Memory budget per tail-sampler: 4 GB at an expected 10k req/s aggregate. Policies: keep all errors, keep all traces >2s end-to-end, 1% probabilistic for everything else.
Layer 6: backend. The tracing backend (Tempo, Honeycomb, Datadog, or Jaeger) ingests sampled spans; retention is 7 days fine-grained and 30 days sampled summary, with long-term archival to object storage.
Layer 7: observability of propagation. Metrics: orphan_span_rate by service, invalid_traceparent_received, broken_parent_count, trace_id_per_second. Alert when the orphan rate for a specific internal service exceeds 5% for 10 minutes. A dashboard panel shows propagation health alongside RED + USE.
Layer 8: CI tests. End-to-end propagation test: send a request through a chain of 5 services and assert that at least 5 spans share one trace-id. Run on every PR touching HTTP, Kafka, or worker code.
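The reason the collector tier needs trace-id routing is that a tail sampler can only judge a trace it sees in full. A minimal sketch of the idea, assuming five hypothetical replica names and a plain hash-modulo scheme; a production router would typically use consistent hashing so that resizing the tier moves fewer traces.

```python
import hashlib

# Hypothetical replica names for the stateful tail-sampling tier.
REPLICAS = [f"tail-sampler-{n}" for n in range(5)]


def pick_replica(trace_id, replicas=REPLICAS):
    """Map a trace id to one replica so all of its spans land together."""
    digest = hashlib.sha256(trace_id.encode("ascii")).digest()
    return replicas[int.from_bytes(digest[:8], "big") % len(replicas)]


def route(spans, replicas=REPLICAS):
    """Group span ids by the replica that would receive them.

    spans: list of dicts with trace_id and span_id.
    """
    routed = {}
    for span in spans:
        target = pick_replica(span["trace_id"], replicas)
        routed.setdefault(target, []).append(span["span_id"])
    return routed
```

Because the routing key is the trace id and not the source node, spans emitted by different services for the same request converge on the same sampler regardless of which agent forwarded them.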

Should cover

W3C TraceContext + Baggage default everywhere; B3 only for legacy interop, deprecated.
OTel SDK registered before app startup in every service; CI gate verifies this.
Kafka, gRPC, mesh all carry traceparent automatically via auto-instrumentation.
Async boundaries (setTimeout, workers, callbacks) require explicit context.bind discipline.
Tail-sampling collector with load-balancing exporter for trace-id consistency.
Sampling rules: 100% errors + 100% slow + 1% baseline.
Propagation has its own observability layer (orphan-span rate, invalid-traceparent count).
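The three sampling rules in the checklist can be read as an ordered decision over a complete trace. The sketch below is one possible reading under stated assumptions: spans are dicts with hypothetical keys trace_id, start_ms, end_ms and status, and the 1% baseline is made deterministic by bucketing on the first 8 hex characters of the trace id, which is a simplification chosen for this example.

```python
def keep_trace(spans, baseline_rate=0.01, slow_threshold_ms=2000):
    """Return the reason a trace is kept, or None if it is dropped.

    spans: non-empty list of dicts with trace_id, start_ms, end_ms, status.
    """
    # Rule 1: keep every trace that contains an error.
    if any(span["status"] == "ERROR" for span in spans):
        return "error"

    # Rule 2: keep every trace slower than the end-to-end threshold.
    start = min(span["start_ms"] for span in spans)
    end = max(span["end_ms"] for span in spans)
    if end - start > slow_threshold_ms:
        return "slow"

    # Rule 3: keep a deterministic fraction of the rest, keyed on trace id
    # so that every evaluation of the same trace gives the same answer.
    bucket = int(spans[0]["trace_id"][:8], 16) / 0x100000000
    return "baseline" if bucket < baseline_rate else None
```

Ordering matters: errors and slow traces are checked first so that the 1% baseline never discards a trace that on-call would want to see.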

Propagation health thresholds

Healthy orphan-span rate (internal services): ≤1% of all spans
Healthy invalid_traceparent rate: ≤0.01%
Healthy broken-parent rate: ≤0.5%
Alert threshold (internal service orphan rate): >5% for 10 min
GitHub 2022 orphan rate when the regression was discovered: 50% (from a 1% baseline)
Time to detect the GitHub regression without alerting: about one quarter

Quiz

A batch processor pulls 1,000 messages from Kafka and processes them in one transaction. The engineer models this as one parent span with 1,000 child spans, one per message. After deploying, the tail-sampling Collector OOMs. What is the architectural fix?

Quiz

A production team adds orphan-span-rate alerting. The alert fires for 'email-job' at 6% (baseline 0.5%). What is the first diagnostic step?

Recall before you leave

01 Explain why span-links exist and when a senior engineer reaches for them instead of parent-child relationships.
02 Describe three propagation health metrics that every production tracing deployment should monitor and the alert thresholds.
03 Outline the 8-layer platform design for end-to-end propagation in a polyglot 30-service system with Kafka, service mesh, and tail sampling.

Recap

Production propagation failures are silent: Uber (30% broken traces for months), GitHub (50% orphan rate for a quarter), and Datadog customers (80% background-task orphans) all had dashboards full of traces that were not connected, and Slack lost its tracing pipeline to a Collector OOM in the middle of an incident. The shared pattern: no metric was alerting on the health of tracing itself. The fix is to observe propagation health with its own RED-equivalent metrics (orphan-span rate by service, invalid-traceparent count, broken-parent rate) and alert on them. Span-links solve the cases the parent-child tree cannot: batch fan-in, async follow-ups, and workflows that outlast the decision window. Long-running traces must be broken into span-linked sub-traces that fit the Collector’s decision window. Propagation is the invisible foundation every other observability feature depends on; treat its health as a first-class production metric. When you inherit a new service or review a tracing setup, your first question should be “what is the orphan-span rate?”, because that single number tells you whether the propagation foundation is solid or quietly crumbling.
