Observability
What is trace propagation and why broken propagation is worse than none
A customer opens a support ticket: “checkout took 30 seconds.” Your tracing tool shows traces for every service — but each one is a single span, unconnected to anything else. You have all the data and none of the answers.
What trace propagation is
Trace propagation is the practice of passing a small identifier from one service to the next on every request, so that all the work done across many services for one user action gets stitched into a single picture.
Without propagation, a slow checkout looks like 50 separate stories. With it — one trace, top-to-bottom, navigable in 30 seconds.
The identifier is carried in an HTTP header called traceparent, defined by the W3C Trace Context specification. Every service that receives a request reads the traceparent, uses it as the parent for its own work, generates a new span-id for itself, and writes a new traceparent before making any outbound call of its own. The trace-id stays constant across every hop; span-ids form a parent-child tree.
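To make the header layout concrete, here is a minimal sketch of parsing and formatting a version `00` traceparent value using only the Python standard library. The names `TraceParent`, `parse_traceparent`, `format_traceparent`, and `new_span_id` are hypothetical helpers for illustration, not part of any tracing library; a production service would normally rely on its tracing SDK instead.

```python
import re
import secrets
from typing import NamedTuple, Optional

# Version "00" layout: version, 32-hex trace-id, 16-hex parent-id, 2-hex flags.
_TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceParent(NamedTuple):
    trace_id: str   # constant across every hop
    parent_id: str  # span-id of the caller
    flags: str      # e.g. "01" means the caller sampled this trace


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed fields, or None if the header is malformed."""
    match = _TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero identifiers are invalid in W3C Trace Context.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return TraceParent(trace_id, parent_id, flags)


def format_traceparent(tp: TraceParent) -> str:
    """Serialize the fields back into a header value."""
    return f"00-{tp.trace_id}-{tp.parent_id}-{tp.flags}"


def new_span_id() -> str:
    """Generate a random 64-bit span-id as 16 hex characters."""
    return secrets.token_hex(8)
```

A value such as `00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01` parses into a 32-character trace-id, a 16-character parent span-id, and the flags `01`.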
The relay-race metaphor
Think of an Amazon delivery with one tracking number. The package leaves a warehouse, hops between sorting facilities, rides on three different trucks, and finally arrives at your door. Each hop scans the same tracking number, recording where it was, when, and what the next hop is.
If any one stop forgets to scan, the tracking page goes silent and you have no idea where the package is — even if it eventually arrives.
Trace propagation is the scanning. Every service must:
- Read the incoming traceparent (extract the trace-id and parent span-id).
- Create its own span (new span-id, parent = the incoming span-id).
- Write a new traceparent before any outbound call (same trace-id, its own span-id as the new parent-id).
Miss any one of these steps and the chain breaks.
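The three steps above can be sketched as two small functions. This is an illustration under stated assumptions, not how any particular SDK implements it: `Span`, `start_span`, and `outbound_headers` are hypothetical names, header keys are assumed to be already lowercased, and the incoming header is only length-checked rather than fully validated.

```python
import secrets
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None when this span starts a new trace
    flags: str
    name: str


def start_span(incoming_headers: Dict[str, str], name: str) -> Span:
    """Steps 1 and 2: read the incoming traceparent, then create a child span."""
    parts = incoming_headers.get("traceparent", "").split("-")
    if len(parts) == 4 and len(parts[1]) == 32 and len(parts[2]) == 16:
        trace_id, parent_id, flags = parts[1], parts[2], parts[3]
    else:
        # No usable header: this service becomes the root of a new trace.
        # Assumption: new traces are marked as sampled ("01").
        trace_id, parent_id, flags = secrets.token_hex(16), None, "01"
    return Span(trace_id, secrets.token_hex(8), parent_id, flags, name)


def outbound_headers(span: Span) -> Dict[str, str]:
    """Step 3: same trace-id, this span's own id as the new parent-id."""
    return {"traceparent": f"00-{span.trace_id}-{span.span_id}-{span.flags}"}
```

A handler would call `start_span(request_headers, "inventory")` on entry and merge `outbound_headers(span)` into every outbound request it makes. Skipping the second call is exactly the broken link described above.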
A concrete scenario
An on-call engineer gets a customer support ticket: “checkout took 30 seconds.” She opens her tracing tool, types the request-id from the support ticket, and pulls up one trace. She sees: 50 ms in the API gateway, 80 ms in the auth service, 28 seconds in the inventory service waiting on a database query, 200 ms in payment, 100 ms back to the user. The 28-second bottleneck is named precisely.
Without propagation she would have had to manually correlate 50 log entries across 7 services and guess which ones came from this user. With one trace she knows in 30 seconds.
Why broken propagation is worse than no tracing at all
Without any tracing, you know you have no traces and you fall back to logs. With broken propagation, every service emits spans but none link to each other — the dashboard claims you are observing the system, but each trace covers only one service.
You think you are debugging end-to-end, but you are actually debugging in fragments. The missing link makes the slow service invisible: a request that is fast in service A and slow in service B shows up as a fast trace in A and a separate slow trace in B, with no causal link between them. Operators waste hours suspecting the wrong service.
The common failure pattern: a team adds tracing to a service but forgets to enable OpenTelemetry (OTel) HTTP-client auto-instrumentation in it, so its outbound requests carry no traceparent header. Every downstream span then starts a fresh trace; the dashboard shows traces, but each is one span deep. Customers report slowness and the team cannot find where the time went, because the trace they need is silently split into 50 pieces.
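The fragmentation is easy to reproduce in a toy simulation. The sketch below is a simplified model, not real instrumentation: `run_chain` and `count_traces` are hypothetical names, and the `injects` mapping stands in for whether each service's HTTP client writes a traceparent header on outbound calls.

```python
import secrets
from typing import Dict, List, Tuple


def run_chain(services: List[str], injects: Dict[str, bool]) -> List[Tuple[str, str]]:
    """Pass one request through the services in order.

    Returns (service, trace_id) for each hop. A service missing from
    `injects` is assumed to propagate correctly.
    """
    headers: Dict[str, str] = {}
    hops: List[Tuple[str, str]] = []
    for service in services:
        incoming = headers.get("traceparent")
        # Continue the caller's trace if a header arrived, else start a new one.
        trace_id = incoming.split("-")[1] if incoming else secrets.token_hex(16)
        span_id = secrets.token_hex(8)
        hops.append((service, trace_id))
        if injects.get(service, True):
            headers = {"traceparent": f"00-{trace_id}-{span_id}-01"}
        else:
            # Un-instrumented HTTP client: the outbound call carries no header.
            headers = {}
    return hops


def count_traces(hops: List[Tuple[str, str]]) -> int:
    """Number of distinct traces the backend would show for one request."""
    return len({trace_id for _, trace_id in hops})
```

With four services that all inject the header, `count_traces` returns 1. If only one service in the middle fails to inject, the same request appears as 2 unrelated traces; if none inject, it appears as 4.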
| Propagation state | What you see in the dashboard | What you can actually debug |
|---|---|---|
| No tracing at all | Nothing | Logs only — you know you’re guessing |
| Broken propagation | Traces everywhere, each 1 span deep | Nothing end-to-end — but the dashboard claims you can |
| Correct propagation | Full tree: API → auth → inventory → payment | Exact bottleneck in 30 seconds |
A trace is propagated across services by which HTTP header (in the W3C standard)?
What is the most common production failure of trace propagation?
Order what happens when a request travels through three services with correct propagation:
- 1 Client A generates a new trace-id and a span-id, builds the traceparent header
- 2 Client A makes an HTTP request to Service B with the traceparent header
- 3 Service B extracts the trace-id, creates its own span (new span-id, parent = client's span-id)
- 4 Service B calls Service C: builds a fresh traceparent with the same trace-id but its own span-id as new parent-id
- 5 Service C extracts the trace-id, creates its own span (parent = B's span-id), does its work
- 6 Each service emits its span to the tracing backend; backend stitches by trace-id
- 7 Dashboard shows the full tree: A → B → C, each span sharing one trace-id
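The stitching step at the end of this sequence can be sketched as follows. This is an assumed, simplified model of what a tracing backend does, not any vendor's implementation: `Span` and `stitch` are hypothetical names, and the identifiers in the example are shortened for readability rather than real 32- and 16-character hex values.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional


class Span(NamedTuple):
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    service: str


def stitch(spans: List[Span]) -> Dict[str, List[str]]:
    """Group spans by trace-id and render each trace as indented lines."""
    by_trace: Dict[str, List[Span]] = defaultdict(list)
    for span in spans:
        by_trace[span.trace_id].append(span)

    rendered: Dict[str, List[str]] = {}
    for trace_id, group in by_trace.items():
        known_ids = {s.span_id for s in group}
        children: Dict[str, List[Span]] = defaultdict(list)
        roots: List[Span] = []
        for s in group:
            if s.parent_id in known_ids:
                children[s.parent_id].append(s)
            else:
                roots.append(s)  # no parent, or the parent span never arrived

        lines: List[str] = []

        def walk(span: Span, depth: int) -> None:
            lines.append("  " * depth + span.service)
            for child in children[span.span_id]:
                walk(child, depth + 1)

        for root in roots:
            walk(root, 0)
        rendered[trace_id] = lines
    return rendered


# Spans may arrive in any order; the tree is rebuilt from the ids alone.
spans = [
    Span("t1", "c", "b", "Service C"),
    Span("t1", "a", None, "Client A"),
    Span("t1", "b", "a", "Service B"),
]
print("\n".join(stitch(spans)["t1"]))
# Client A
#   Service B
#     Service C
```

If Service B had not written a traceparent, Service C's span would carry a different trace-id and would land in its own single-line trace instead of under Service B.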
Fill in the blank: the standard HTTP header carrying the trace identifier across services is named _______.
- 01 In one paragraph: why is missing trace propagation worse than no tracing at all?
- 02 What three things must every service do when it receives a request with a traceparent header?
- 03 Name the three states of tracing and what each means for debuggability.
Trace propagation stitches all the work done for one user request across every service into a single navigable trace. The W3C Trace Context standard does this with a 55-byte traceparent HTTP header carrying a 128-bit trace-id that stays constant across every hop. Every service reads the incoming header, creates a child span, and writes a new header before its own outbound calls. Miss any one hop and the trace splits into disconnected single-span orphans — a state that is actively worse than no tracing because dashboards report normal visibility while hiding the real bottleneck from the engineer who needs it most.