
Observability

Production propagation failures, span links, and platform design

Crux: Propagation bugs are silent. Uber, GitHub, Slack, and Datadog all had dashboards showing traces, just not the right ones. Span links solve fan-in and async follow-ups. Orphan rate and invalid-traceparent count are the meta-layer that catches regressions before customers notice.

GitHub ran with a propagation regression for a quarter, during which 50% of internal traces were orphans. The tracing dashboards showed traces the whole time. Nobody noticed until an engineer spot-checked the orphan-span rate in a routine review.

Real production propagation failures

Uber 2019: a partial OTel rollout caused 30% of traces to break at the boundary between instrumented and uninstrumented services. Postmortem mandated a “no service ships without W3C propagation” gate enforced in CI. The pattern: instrumented services emit perfect spans; uninstrumented services emit orphans; the two groups show different trace depths in dashboards, but there is no automated alert on the boundary failure.

GitHub 2022: a custom HTTP client wrapper bypassed OTel’s hooks and silently dropped traceparent across half their internal services for a quarter before someone noticed the orphan-span rate had risen from 1% to 50%. The fix was a single line to wrap the client in the OTel-aware version. The lesson: custom wrappers are the most common propagation gap in mature services. The fix is often a one-liner; finding it can take a quarter.
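To make the wrapper gap concrete, here is a minimal sketch of what a client wrapper has to do to keep the chain intact: validate the inbound traceparent and forward it with the new span's id. It assumes the version-00 W3C header layout only; `parse_traceparent` and `outgoing_headers` are hypothetical helper names, not part of any OTel SDK.

```python
import re

# W3C Trace Context, version 00: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT_RE = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


def parse_traceparent(value):
    """Return (trace_id, parent_id, sampled), or None if the header is invalid."""
    match = _TRACEPARENT_RE.fullmatch(value.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero ids are forbidden by the spec.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)


def outgoing_headers(headers, inbound_traceparent, new_span_id):
    """Hypothetical wrapper helper: copy headers and forward the trace context."""
    out = dict(headers)
    parsed = parse_traceparent(inbound_traceparent or "")
    if parsed is None:
        # Nothing valid to forward; the callee will start a fresh trace.
        return out
    trace_id, _, sampled = parsed
    flags = "01" if sampled else "00"
    out["traceparent"] = f"00-{trace_id}-{new_span_id}-{flags}"
    return out
```

A wrapper that builds its own header dict and skips the equivalent of `outgoing_headers` is exactly the silent failure described above: requests still succeed, and every callee starts a new trace. A `None` result from the parser is also a natural place to increment an invalid-traceparent counter.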

Slack 2023: tail-sampling Collectors OOMed and took down their tracing pipeline during a major incident — precisely when tracing was most needed. Postmortem added num_traces caps and a separate always-keep tier for high-priority traces. A monitoring gap: OTel Collector health metrics were not on any SLO dashboard.

Datadog 2024 customer report: a large Java workload had a thread pool that didn’t carry context across submitted tasks, so 80% of background-task traces were orphans. Fix: switch to a CurrentTraceContext-aware executor, that is, one that captures the submitting thread's trace context and restores it on the worker thread. The bug was present for months; it was discovered during a quarterly orphan-rate review.
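The same failure mode can be reproduced in a few lines. The sketch below is a Python analogue of the Java incident, not the customer's code: `current_trace_id` is a hypothetical stand-in for a tracer's "current context" slot, and `submit_with_context` is an illustrative helper. On standard CPython builds, pool worker threads do not see the submitting thread's context variables unless the context is captured explicitly.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a tracer's "current trace id" slot.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


def read_trace_id():
    """What a background task would see when it asks for the current trace."""
    return current_trace_id.get()


def submit_with_context(executor, fn, *args):
    """Capture the caller's context now and run fn inside it on the worker."""
    ctx = contextvars.copy_context()
    return executor.submit(ctx.run, fn, *args)


def demo():
    current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Plain submit: the worker typically runs with an empty context.
        plain = pool.submit(read_trace_id).result()
        # Context-aware submit: the worker sees the submitter's trace id.
        wrapped = submit_with_context(pool, read_trace_id).result()
    return plain, wrapped
```

The design point is that the context is captured at submit time, on the submitting thread. Capturing it inside the task is too late, because by then the code is already running on the worker.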

The shared pattern: propagation bugs are silent. The dashboards keep showing traces. The only detection mechanism is a metric on orphan-span rate, and that metric must be on a dashboard and ideally on an alert — it is never surfaced automatically by OTel defaults.
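As a rough illustration of that metric, the sketch below computes an orphan-span rate per service from a batch of exported spans. The span shape (dicts with `service` and `parent_id` keys) and the function name are assumptions made for the example; a real pipeline would derive the same number in the Collector or the backend's query language.

```python
from collections import Counter


def orphan_span_rate(spans, entry_points):
    """Return {service: share of its spans that are roots} for internal services.

    A root span (parent_id is None) is expected at an entry point such as an
    API gateway; in any other service it is counted as an orphan.
    """
    totals, orphans = Counter(), Counter()
    for span in spans:
        service = span["service"]
        if service in entry_points:
            continue  # roots are legitimate here, so skip the service entirely
        totals[service] += 1
        if span["parent_id"] is None:
            orphans[service] += 1
    return {service: orphans[service] / totals[service] for service in totals}
```

Grouping by service is what makes the number actionable: a fleet-wide average can hide one service that has lost propagation completely.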

Observability for propagation itself

The essential propagation health metrics:

| Metric | Normal | Signal when |
| --- | --- | --- |
| orphan_span_rate by service.name | <1% (entry-points only) | Internal service >5% → propagation regression |
| invalid_traceparent_received count | ~0 | Any sustained rate → broken upstream |
| trace_id_per_second | Proportional to RPS × sample_rate | Sudden spike → fresh trace-ids (propagation lost) |
| broken_parent_count | <0.5% | Spans whose parent-id exists in no other span in the same trace |

| Healthy state | Threshold | Alert action |
| --- | --- | --- |
| Orphan spans for internal services | <1% | Page if >5% for 10 min for specific service |
| invalid_traceparent_received | <0.01% | Ticket if non-zero rate sustained >5 min |
| broken_parent_count | <0.5% | Ticket if >2% for 10 min |
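Two of the rows above can be sketched directly: counting broken parents, and evaluating a "more than X for N minutes" rule. The span shape and both function names are assumptions for illustration, and the alert check assumes one metric sample per minute.

```python
def broken_parent_count(spans):
    """Count spans whose parent_id matches no span_id in the same trace."""
    ids_by_trace = {}
    for span in spans:
        ids_by_trace.setdefault(span["trace_id"], set()).add(span["span_id"])
    return sum(
        1
        for span in spans
        if span["parent_id"] is not None
        and span["parent_id"] not in ids_by_trace[span["trace_id"]]
    )


def sustained_breach(samples, threshold, min_consecutive):
    """True if the most recent min_consecutive samples all exceed threshold."""
    if len(samples) < min_consecutive:
        return False
    return all(value > threshold for value in samples[-min_consecutive:])
```

With per-minute samples, `sustained_breach(rates, 0.05, 10)` corresponds to the "page if >5% for 10 min" rule. Requiring every sample in the window to breach keeps a single noisy minute from paging anyone.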

The parent-child model assumes linear causality: A calls B, B calls C. This breaks in three scenarios:

  1. Batch processing: a consumer pulls 1,000 messages from Kafka and processes them in one batch. There is no single meaningful “parent” — 1,000 incoming trace contexts feed one batch span.
  2. Fan-in: multiple parallel sub-jobs converge at a join point. Each sub-job is a child of its own branch; the join point has multiple causal contributors.
  3. Async follow-ups: the originating request finishes and returns to the user, but spawns follow-ups that execute hours later. The original request’s context is closed; the follow-ups need a causal link without being children of a dead span.

OTel’s span-links solve all of these: a span declares additional SpanContext references it is causally related to but does not descend from. Tracing backends visualise links as dotted lines alongside the solid parent-child tree.
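The data model behind links is small enough to sketch with plain dataclasses. This is an illustrative model, not the OTel API: `SpanContext`, `Span`, and `start_batch_span` are defined here for the example, although real SDKs expose the same idea by accepting a list of links when a span is started.

```python
import secrets
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class SpanContext:
    trace_id: str  # 32 hex chars
    span_id: str   # 16 hex chars


@dataclass
class Span:
    name: str
    context: SpanContext
    parent: Optional[SpanContext] = None  # at most one parent
    links: List[SpanContext] = field(default_factory=list)  # any number of links


def start_batch_span(name, message_contexts):
    """Start a new root span for a batch and link it to each producer context."""
    ctx = SpanContext(trace_id=secrets.token_hex(16), span_id=secrets.token_hex(8))
    return Span(name=name, context=ctx, parent=None, links=list(message_contexts))
```

The batch span gets its own trace id and no parent, so it never inflates any producer's trace, yet every producer context remains reachable through the links. Real SDKs cap the number of links per span, so very large batches may need to be chunked or sampled.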

The senior pattern: any trace longer than 30 seconds or wider than 100 spans is a candidate for span-link refactoring. Split the long workflow into sub-traces where each sub-trace fits within the tail-sampling decision window, and use links to preserve causal lineage. This keeps traces small, keeps the sampler happy, and preserves the investigation chain.

Long-running traces and the 30-minute problem

Tail samplers have decision windows of 30s–5min. A batch job running for 30 minutes emits spans long after the decision window closes; the late spans look like orphans to the sampler.
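A small simulation shows the scale of the problem. It assumes, as a simplification, that the decision window opens at trace start and that times are seconds since that start; `split_by_decision_window` is a hypothetical helper, not a Collector API.

```python
def split_by_decision_window(span_end_times, trace_start, decision_wait):
    """Partition span end times into (seen before the decision, late arrivals)."""
    deadline = trace_start + decision_wait
    on_time = [t for t in span_end_times if t <= deadline]
    late = [t for t in span_end_times if t > deadline]
    return on_time, late


# A 30-minute job that finishes one span per minute, with a 5-minute window.
job_spans = [60 * minute for minute in range(1, 31)]
on_time, late = split_by_decision_window(job_spans, trace_start=0, decision_wait=300)
```

Under these assumptions only the first five spans reach the sampler before it decides; the remaining twenty-five arrive after the trace has already been kept or dropped.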

Two production patterns:

  1. Break the work: split long workflows into sub-traces linked via span-links, each fitting in the decision window. Clean architecture, correct by construction.
  2. Backend late-span support: Tempo, Honeycomb, and Datadog all support late-arriving spans up to 24h after trace start. Skip tail sampling for long traces; use head sampling at 100% for batch workloads. Practical retrofit for legacy batch jobs.

The decision window is the lever to adjust when batch workloads break tail sampling. Tuning it upward increases Collector RAM; the right answer is usually to break the work.
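One way to picture "break the work" is a planner that groups sequential steps into sub-traces that each fit the window and chains them by link. This is a sketch under stated assumptions: step durations are known up front, the plan is a list of plain dicts, and `plan_sub_traces` and the `linked_from` field are names invented for the example.

```python
import secrets


def plan_sub_traces(step_durations, window_seconds):
    """Greedily group sequential steps into sub-traces of at most window_seconds.

    A single step longer than the window still gets its own sub-trace, since
    it cannot be split here. Each sub-trace records the trace id of its
    predecessor so the causal chain survives the split.
    """
    groups, current, elapsed = [], [], 0.0
    for duration in step_durations:
        if current and elapsed + duration > window_seconds:
            groups.append(current)
            current, elapsed = [], 0.0
        current.append(duration)
        elapsed += duration
    if current:
        groups.append(current)

    planned, previous = [], None
    for steps in groups:
        trace_id = secrets.token_hex(16)
        planned.append({"trace_id": trace_id, "steps": steps, "linked_from": previous})
        previous = trace_id
    return planned
```

For thirty one-minute steps and a five-minute window this yields six sub-traces, each of which completes inside the sampler's window, with each one linked back to the sub-trace before it.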

Trace it

A 0.5% orphan rate is detected for an internal service. Trace the root cause.

  1. A 0.5% orphan rate: is this normal or a signal?
  2. Filter orphans by service.name. What's the pattern?
  3. One specific service is the source. Inspect inbound traffic: what should you look for?
  4. traceparent is absent on requests from one upstream client. Why?
  5. What is the durable fix?

Debug this

Diagnose a broken trace from tracing-backend output

```log
# Query: trace_id == "4bf92f3577b34da6a3ce929d0e0e4736"
# Result: 7 spans total

#  service           span_id           parent_id           duration   status
1  api-gateway       1a2b3c4d5e6f7890  -                   18ms       OK
2  auth              7890abcdef123456  1a2b3c4d5e6f7890    14ms       OK
3  inventory         abcdef1234567890  1a2b3c4d5e6f7890    1200ms     OK
4  payment           fedcba0987654321  -                   80ms       OK    # ORPHAN
5  postgres-client   1111222233334444  fedcba0987654321    55ms       OK
6  email-job         5555666677778888  -                   240ms      OK    # ORPHAN
7  audit-log         9999aaaabbbbcccc  -                   12ms       OK    # ORPHAN

# Also separate orphan traces with single spans:
# trace_id 9981a... payment service, 78ms
# trace_id ab32c... email-job service, 280ms
# trace_id ff8e1... audit-log service, 14ms
```

The trace contains 7 spans, but 3 of them are orphans (no parent_id), and 3 separate single-span orphan traces exist for the same service names. What's happening?

Design challenge

Design end-to-end trace propagation for a new platform with 30 microservices, browser frontend, Kafka backbone, a service mesh, and a tail-sampling collector tier.

  • Polyglot: 12 services Node.js, 10 Java, 5 Go, 3 Python.
  • Browser frontend (React) issues fetch calls to the API gateway.
  • Kafka used for async messaging between 8 of the services.
  • Service mesh: Linkerd (Linux), used for HTTP and gRPC east-west.
  • Sampling: 100% errors, 100% slow (>2s), 1% baseline.
  • On-call must be able to view any user request as a single trace within 30s of completion.
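The sampling requirement in the list above can be sketched as a single decision function evaluated once a trace is complete. This is an illustration of the policy, not a Collector configuration: `keep_trace` is a hypothetical name, and deriving the baseline bucket from the trace id is an assumption chosen so that every Collector instance reaches the same verdict for the same trace.

```python
def keep_trace(trace_id, has_error, duration_ms, baseline_rate=0.01, slow_ms=2000):
    """Tail-sampling policy sketch: keep all errors, all slow traces, 1% of the rest."""
    if has_error:
        return True
    if duration_ms > slow_ms:
        return True
    # Deterministic baseline: map the trace id onto 10,000 buckets and keep
    # the lowest baseline_rate share of them.
    bucket = int(trace_id, 16) % 10000
    return bucket / 10000 < baseline_rate
```

The order of the checks matters only for readability here, since any matching rule keeps the trace. The important property is that the error and latency rules need the whole trace, which is why they belong in the tail-sampling tier rather than in the services.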
Propagation health thresholds

  • Healthy orphan-span rate (internal services): ≤1% of all spans
  • Healthy invalid_traceparent rate: ≤0.01%
  • Healthy broken-parent rate: ≤0.5%
  • Alert threshold, internal service orphan rate: >5% for 10 min
  • GitHub 2022, orphan rate when regression discovered: 50% (from 1% baseline)
  • Time to detect GitHub regression without alerting: >1 quarter

Quiz

A batch processor pulls 1,000 messages from Kafka and processes them in one transaction. The engineer models this as one parent span with 1,000 child spans, one per message. After deploying, the tail-sampling Collector OOMs. What is the architectural fix?

Quiz

A production team adds orphan-span-rate alerting. The alert fires for 'email-job' at 6% (baseline 0.5%). What is the first diagnostic step?

Recall before you leave
  1. Explain why span-links exist and when a senior engineer reaches for them instead of parent-child relationships.
  2. Describe three propagation health metrics that every production tracing deployment should monitor and the alert thresholds.
  3. Outline the 8-layer platform design for end-to-end propagation in a polyglot 30-service system with Kafka, service mesh, and tail sampling.

Recap

Production propagation failures are silent: Uber (30% of traces broken at the instrumentation boundary), GitHub (50% orphan rate for a quarter), and Datadog customers (80% background-task orphans) all failed this way, and Slack lost its tracing pipeline to a Collector OOM in the middle of an incident. The shared pattern: dashboards show traces, just not connected ones, and no metric was alerting on the disconnection. The fix is to observe propagation health with its own RED-equivalent metrics (orphan-span rate by service, invalid-traceparent count, broken-parent rate) and alert on them. Span-links solve the cases the parent-child tree cannot: batch fan-in, async follow-ups, and workflows that outlast the decision window. Long-running traces must be broken into span-linked sub-traces that fit the Collector’s decision window. Propagation is the invisible foundation every other observability feature depends on; treat its health as a first-class production metric.


Trademarks belong to their respective owners. Editorial reference only.