Observability OBS · 06 · 01

What trace propagation is and why broken propagation is worse than none

Trace propagation passes one shared identifier across every service a request touches — miss a single hop and the trace silently splits into orphan fragments that look fine on dashboards but hide the real bottleneck.

A customer opens a support ticket: “checkout took 30 seconds.” Your tracing tool shows traces for every service — but each one is a single span, unconnected to anything else. You have all the data and none of the answers.

What trace propagation is

Trace propagation is the practice of passing a small identifier from one service to the next on every request, so that all the work done across many services for one user action gets stitched into a single picture.

Without propagation, a slow checkout looks like 50 separate stories. With it — one trace, top-to-bottom, navigable in 30 seconds.

The identifier is carried in an HTTP header called traceparent, defined by the W3C Trace Context specification. Every service that receives a request reads the traceparent, uses it as the parent for its own work, generates a new span-id for itself, and writes a new traceparent before making any outbound call of its own. The trace-id stays constant across every hop; span-ids form a parent-child tree.
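To make the header format concrete, here is a minimal sketch of reading and writing a traceparent value in plain Python. It assumes the version 00 layout from the W3C specification (version, trace-id, parent-id, and flags, joined by hyphens). The function names `parse_traceparent`, `format_traceparent`, and `new_span_id` are hypothetical helpers for illustration, not part of any library, and the sample value is the example used in the specification. Real services would normally leave this to their tracing SDK.

```python
import re
import secrets

# Version 00 layout: version-traceid-parentid-flags, all lowercase hex.
# Simplified: future versions with extra fields are rejected here.
TRACEPARENT_RE = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(value):
    """Return (trace_id, parent_span_id, flags), or None if the header is unusable."""
    match = TRACEPARENT_RE.fullmatch(value.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero trace-id or parent-id are invalid.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return trace_id, parent_id, flags


def format_traceparent(trace_id, span_id, flags="01"):
    """Build the header value a service sends on its outbound calls."""
    return f"00-{trace_id}-{span_id}-{flags}"


def new_span_id():
    """8 random bytes rendered as 16 hex characters."""
    return secrets.token_hex(8)


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
trace_id, parent_id, flags = parse_traceparent(incoming)
outgoing = format_traceparent(trace_id, new_span_id(), flags)

print(len(incoming))                   # 55
print(outgoing[:36] == incoming[:36])  # True: version and trace-id are unchanged
```

Only the 16-character span-id segment differs between `incoming` and `outgoing`; the trace-id is copied through untouched.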

The package-tracking metaphor

Think of an Amazon delivery with one tracking number. The package leaves a warehouse, hops between sorting facilities, rides on three different trucks, and finally arrives at your door. Each hop scans the same tracking number, recording where it was, when, and what the next hop is.

If any one stop forgets to scan, the tracking page goes silent and you have no idea where the package is — even if it eventually arrives.

Trace propagation is the scanning. Every service must:

1 Read the incoming traceparent (extract the trace-id and parent span-id).
2 Create its own span (new span-id, parent = the incoming span-id).
3 Write a new traceparent before any outbound call (same trace-id, its own span-id as the new parent-id).

Together these three steps maintain an unbroken causal chain. Miss any one of them and the chain breaks: skip step 3, for example, and the next service sees no traceparent, generates a fresh trace-id, and the whole downstream branch becomes invisible to anyone investigating the original request.

A creates span a1 and sends trace=t1 with parent=a1. B continues t1 as a child of a1, makes its own span b1, and forwards trace=t1 with parent=b1. C continues t1 as a child of b1. The trace-id (t1) is unbroken across every hop; each service's own span-id becomes the parent-id that the next service sees.
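The read, create, write sequence and the A to B to C hand-off can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not how any particular SDK implements it: `handle_request` is a hypothetical function, headers are a plain dict with a lowercase key, the header is assumed to be well-formed, and a span is just a dict.

```python
import secrets


def handle_request(incoming_headers, service_name):
    """Apply the three propagation steps for one inbound request."""
    # Read: take the trace-id and parent span-id from the incoming header.
    header = incoming_headers.get("traceparent")
    if header is not None:
        _version, trace_id, parent_span_id, flags = header.split("-")
    else:
        # No header: this service is the root and starts a new trace.
        trace_id, parent_span_id, flags = secrets.token_hex(16), None, "01"

    # Create: this service's own span, a child of the caller's span.
    span = {
        "service": service_name,
        "trace_id": trace_id,
        "span_id": secrets.token_hex(8),
        "parent_id": parent_span_id,
    }

    # Write: same trace-id, this span's id as the new parent-id.
    outbound_headers = {"traceparent": f"00-{trace_id}-{span['span_id']}-{flags}"}
    return span, outbound_headers


span_a, headers_to_b = handle_request({}, "A")
span_b, headers_to_c = handle_request(headers_to_b, "B")
span_c, _ = handle_request(headers_to_c, "C")

print(span_a["trace_id"] == span_b["trace_id"] == span_c["trace_id"])  # True
print(span_b["parent_id"] == span_a["span_id"])                        # True
print(span_c["parent_id"] == span_b["span_id"])                        # True
```

If B returned an empty dict instead of `outbound_headers`, C would take the `else` branch and start an unrelated trace, which is exactly the break described above.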

A concrete scenario

An on-call engineer gets a customer support ticket: “checkout took 30 seconds.” She opens her tracing tool, types the request-id from the support ticket, and pulls up one trace. She sees: 50 ms in the API gateway, 80 ms in the auth service, 28 seconds in the inventory service waiting on a database query, 200 ms in payment, 100 ms back to the user. The 28-second bottleneck is named precisely.

With one connected trace, the 28-second inventory query is instantly visible as the bottleneck: every other hop takes 200 ms or less. Broken propagation hides exactly this picture.
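The arithmetic behind that conclusion is small enough to check by hand. The sketch below uses the durations from the scenario and assumes, for simplicity, that the five steps run one after another without overlapping; the service labels are illustrative names, not identifiers from a real system.

```python
# Span durations from the checkout scenario, in milliseconds.
spans = [
    ("api-gateway", 50),
    ("auth", 80),
    ("inventory", 28_000),
    ("payment", 200),
    ("response", 100),
]

total_ms = sum(duration for _, duration in spans)
slowest_service, slowest_ms = max(spans, key=lambda item: item[1])
share = slowest_ms / total_ms

print(total_ms)                          # 28430
print(slowest_service, f"{share:.1%}")   # inventory 98.5%
```

The inventory span accounts for nearly all of the measured time, which is why a connected trace points at it immediately.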

Without propagation she would have had to manually correlate 50 log entries across 7 services and guess which ones came from this user. With one trace she knows in 30 seconds.

Why broken propagation is worse than no tracing at all

Without any tracing, you know you have no traces and you fall back to logs. With broken propagation, every service emits spans but none link to each other — the dashboard claims you are observing the system, but each trace covers only one service.

When you see a dashboard full of single-span traces from internal services, that is not “some tracing is better than none”; that is broken propagation masquerading as coverage. You think you are debugging end-to-end and you are actually debugging in fragments. The missing link makes the slow service invisible: a request that is fast in service A and slow in service B looks like a fast trace in A and a separate slow trace in B with no causal link. Operators waste hours suspecting the wrong service.

The common failure pattern: a team adds tracing to its services but forgets to enable OTel HTTP-client auto-instrumentation, so outbound calls carry no traceparent. Every downstream service starts a fresh trace; the dashboard shows traces, but each is one span deep. Customers report slowness and the team cannot find where the time went, because the trace they need is silently split into 50 pieces.
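The failure pattern can be reproduced with a small simulation. This is a sketch in plain Python rather than real instrumentation: `run_chain` and `group_by_trace` are hypothetical helpers, the service names are made up, and a service listed in `forgets_to_inject` stands in for one whose outbound HTTP client is not instrumented.

```python
import secrets
from collections import defaultdict


def run_chain(services, forgets_to_inject=()):
    """Simulate one request passing through `services`; return the emitted spans."""
    spans, headers = [], {}
    for name in services:
        header = headers.get("traceparent")
        if header:
            _, trace_id, parent_id, _ = header.split("-")
        else:
            trace_id, parent_id = secrets.token_hex(16), None  # fresh trace
        span_id = secrets.token_hex(8)
        spans.append({"service": name, "trace_id": trace_id, "parent_id": parent_id})
        # An uninstrumented client sends no traceparent to the next service.
        if name in forgets_to_inject:
            headers = {}
        else:
            headers = {"traceparent": f"00-{trace_id}-{span_id}-01"}
    return spans


def group_by_trace(spans):
    """Group service names by trace-id, the way a backend groups spans."""
    traces = defaultdict(list)
    for span in spans:
        traces[span["trace_id"]].append(span["service"])
    return list(traces.values())


chain = ["gateway", "auth", "inventory", "payment"]
healthy = group_by_trace(run_chain(chain))
one_gap = group_by_trace(run_chain(chain, forgets_to_inject={"auth"}))
all_gaps = group_by_trace(run_chain(chain, forgets_to_inject=set(chain)))

print(healthy)   # [['gateway', 'auth', 'inventory', 'payment']]
print(one_gap)   # [['gateway', 'auth'], ['inventory', 'payment']]
print(all_gaps)  # [['gateway'], ['auth'], ['inventory'], ['payment']]
```

A single uninstrumented client is enough to cut the trace in two, and the downstream half carries no hint that it belongs to the original request.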

Propagation state	What you see in the dashboard	What you can actually debug
No tracing at all	Nothing	Logs only — you know you’re guessing
Broken propagation	Traces everywhere, each 1 span deep	Nothing end-to-end — but the dashboard claims you can
Correct propagation	Full tree: API → auth → inventory → payment	Exact bottleneck in 30 seconds
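One rough way to tell the three states apart from exported span data is to look at how many traces contain exactly one span. The sketch below is a heuristic under stated assumptions: spans are dicts with a `trace_id` key, and both the function names and the 0.9 threshold are hypothetical choices for illustration. A low ratio does not prove propagation is correct everywhere, since a single broken hop produces multi-span fragments.

```python
from collections import Counter


def single_span_ratio(spans):
    """Fraction of traces that contain exactly one span."""
    spans_per_trace = Counter(span["trace_id"] for span in spans)
    if not spans_per_trace:
        return 0.0
    singles = sum(1 for count in spans_per_trace.values() if count == 1)
    return singles / len(spans_per_trace)


def propagation_state(spans, threshold=0.9):
    """Map span data onto the three states in the table (threshold is an assumption)."""
    if not spans:
        return "No tracing at all"
    if single_span_ratio(spans) >= threshold:
        return "Broken propagation"
    return "Correct propagation"


fragments = [{"trace_id": f"t{i}"} for i in range(50)]  # 50 one-span traces
connected = [{"trace_id": "t1"} for _ in range(5)]      # one 5-span trace

print(propagation_state([]))         # No tracing at all
print(propagation_state(fragments))  # Broken propagation
print(propagation_state(connected))  # Correct propagation
```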

Quiz

A trace is propagated across services by which HTTP header (in the W3C standard)?

Quiz

What is the most common production failure of trace propagation?

Order the steps

Order what happens when a request travels through three services with correct propagation:

1 Client A generates a new trace-id and a span-id, builds the traceparent header
2 Client A makes an HTTP request to Service B with the traceparent header
3 Service B extracts the trace-id, creates its own span (new span-id, parent = client's span-id)
4 Service B calls Service C: builds a fresh traceparent with the same trace-id but its own span-id as new parent-id
5 Service C extracts the trace-id, creates its own span (parent = B's span-id), does its work
6 Each service emits its span to the tracing backend; backend stitches by trace-id
7 Dashboard shows the full tree: A → B → C, each span sharing one trace-id
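The stitching in steps 6 and 7 amounts to grouping spans by trace-id and linking each span to its parent. The sketch below shows one way a backend could do that. The span dict shape, the short ids (t1, a1, b1, c1), and the helper names `build_tree` and `render` are assumptions for illustration, not a real backend's data model.

```python
from collections import defaultdict


def build_tree(spans, trace_id):
    """Return (roots, children) for one trace, linking spans by parent_id."""
    in_trace = [s for s in spans if s["trace_id"] == trace_id]
    known_ids = {s["span_id"] for s in in_trace}
    children = defaultdict(list)
    roots = []
    for span in in_trace:
        parent = span["parent_id"]
        if parent in known_ids:
            children[parent].append(span)
        else:
            roots.append(span)  # no parent, or the parent span never arrived
    return roots, children


def render(roots, children, depth=0):
    """Indent each span under its parent, depth-first."""
    lines = []
    for span in roots:
        lines.append("  " * depth + span["service"])
        lines.extend(render(children[span["span_id"]], children, depth + 1))
    return lines


spans = [  # arrival order at the backend is arbitrary
    {"service": "C", "trace_id": "t1", "span_id": "c1", "parent_id": "b1"},
    {"service": "A", "trace_id": "t1", "span_id": "a1", "parent_id": None},
    {"service": "B", "trace_id": "t1", "span_id": "b1", "parent_id": "a1"},
]
tree_lines = render(*build_tree(spans, "t1"))
print("\n".join(tree_lines))
# A
#   B
#     C
```

Because the tree is rebuilt from ids alone, the order in which services report their spans does not matter, but a span with a different trace-id can never be attached to it.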

Complete the analogy

Fill in the blank: the standard HTTP header carrying the trace identifier across services is named _______.

Recall before you leave

01 In one paragraph: why is broken trace propagation worse than no tracing at all?
02 What three things must every service do when it receives a request with a traceparent header?
03 Name the three states of tracing and what each means for debuggability.

Recap

Trace propagation stitches all the work done for one user request across every service into a single navigable trace. The W3C Trace Context standard does this with a 55-byte traceparent HTTP header carrying a 128-bit trace-id that stays constant across every hop. Every service reads the incoming header, creates a child span, and writes a new header before its own outbound calls. Miss any one hop and the trace splits into disconnected single-span orphans — a state that is actively worse than no tracing because dashboards report normal visibility while hiding the real bottleneck from the engineer who needs it most. Now when you encounter a “30-second checkout” complaint, your first question is whether the trace shows a connected tree or a field of orphan singles — the difference tells you whether you have debugging data or debugging theatre.
