Observability

Trace propagation: stitch a broken system into one trace

Crux: a hands-on project. Build a polyglot multi-service system, deliberately break propagation across the HTTP, async, and Kafka boundaries, then make every user request one connected trace and prove it with an orphan-rate metric.
Estimated time: 240 min

Reading about orphan traces is not the same as pulling a fragmented system back into one picture. Build a small multi-service flow, watch it shatter into single-span orphans at the HTTP, async, and queue boundaries, then close every gap — and prove the fix with the one metric that does not lie.

Goal

Turn the unit’s mental model into a reproducible engineering loop: instrument propagation end-to-end, reproduce each class of orphan, fix it at the boundary where it is born (a propagated header on the HTTP hop, context.bind across the in-process async gap, inject/extract across the queue), and verify with an orphan-span-rate metric plus a tail-sampling tier that keeps every error trace.
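A minimal sketch of the inject/extract fix at the queue boundary, in plain Python with no tracing library. It assumes the W3C traceparent header format and Kafka-style message headers as a list of (key, bytes) pairs. The names SpanContext, inject, extract, and start_consumer_span are hypothetical and chosen for illustration; they are not the OpenTelemetry API, whose propagators would do this work in a real service.

```python
import re
import secrets
from typing import List, NamedTuple, Optional, Tuple

Headers = List[Tuple[str, bytes]]  # assumed shape of Kafka-style headers


class SpanContext(NamedTuple):
    trace_id: str  # 32 lowercase hex chars
    span_id: str   # 16 lowercase hex chars
    sampled: bool


# Simplification: only version "00" is accepted here.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject(ctx: SpanContext, headers: Headers) -> None:
    """Producer side: write the active span context into the message headers."""
    flags = "01" if ctx.sampled else "00"
    value = f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"
    headers.append(("traceparent", value.encode("ascii")))


def extract(headers: Headers) -> Optional[SpanContext]:
    """Consumer side: read the parent context, or None if absent or invalid."""
    for key, value in headers:
        if key != "traceparent":
            continue
        match = _TRACEPARENT.match(value.decode("ascii", errors="replace"))
        if match is None:
            return None
        trace_id, span_id, flags = match.groups()
        # All-zero ids are invalid in the W3C format.
        if set(trace_id) == {"0"} or set(span_id) == {"0"}:
            return None
        return SpanContext(trace_id, span_id, bool(int(flags, 16) & 1))
    return None


def start_consumer_span(headers: Headers) -> dict:
    """Start the consumer span as a child of the extracted context."""
    parent = extract(headers)
    if parent is None:
        # Nothing to continue: a fresh trace-id is minted, which is the orphan.
        return {"trace_id": secrets.token_hex(16),
                "span_id": secrets.token_hex(8),
                "parent_id": None}
    return {"trace_id": parent.trace_id,
            "span_id": secrets.token_hex(8),
            "parent_id": parent.span_id}
```

When the producer skips inject, start_consumer_span has nothing to extract and starts a new root, which reproduces the single-span orphan at the queue boundary. Adding the inject call on the producer side is the whole fix; nothing downstream can reconstruct the missing parent.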

Project
Objective

Build a multi-service request flow that crosses an HTTP hop, an in-process async boundary, and a Kafka queue, then make any user request appear as one connected trace within 30s of completion — driving the internal orphan-span rate from a deliberately broken baseline to under 1%, proven by measurement.
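The in-process async boundary is the easiest of the three to overlook, so here is a small sketch of how it breaks and how binding the context repairs it. It uses Python's standard contextvars module as an analog of context.bind; the variable current_trace_id and the functions deferred_work and handle_request are hypothetical names, and a long-lived worker pool created at startup is an assumption about how the deferred work is scheduled.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the tracer's "active context" slot.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)

# Long-lived pool, started before any request context exists.
pool = ThreadPoolExecutor(max_workers=1)
pool.submit(lambda: None).result()  # force the worker thread to spawn now


def deferred_work():
    """Runs on the worker thread; reports which trace it believes it is in."""
    return current_trace_id.get()


def handle_request(trace_id, bind):
    """Set the request's trace context, then hand work to the pool."""
    current_trace_id.set(trace_id)
    if bind:
        # Capture the caller's context and run the task inside that copy.
        ctx = contextvars.copy_context()
        future = pool.submit(ctx.run, deferred_work)
    else:
        # The worker thread has its own context and never sees the trace id.
        future = pool.submit(deferred_work)
    return future.result()
```

Calling handle_request("abc", bind=False) returns None, meaning any span started in deferred_work would open a new trace, while bind=True returns "abc". The fix lives at the hand-off point, not inside the deferred function.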

Requirements
Acceptance criteria
  • A before/after orphan-span-rate table per service: the broken baseline (each of the three boundaries shown producing orphans) versus the fixed state under 1% for internal services, measured from the metric, not estimated.
  • A backend screenshot or span dump of one user request rendered as a single connected trace — gateway, HTTP worker span, deferred-work span, and Kafka consumer span all sharing one trace-id with a correct parent_id chain.
  • Proof the tail-sampling tier keeps an injected error trace and a slow trace while dropping ~99% of baseline traffic, with the num_traces cap visible in the config and the load-balancing exporter routing by trace-id.
  • A one-paragraph write-up naming each orphan's root cause and the exact layer the fix belonged at (HTTP client wrapper, context.bind, inject/extract) — and why no amount of sampling could have repaired the lineage.
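The before/after table in the first criterion needs a concrete definition of an orphan. The sketch below computes a per-service orphan-span rate from a span dump, under two assumptions that are not stated in the brief: each span is a dict with service, trace_id, span_id, and parent_id keys, and a span counts as an orphan if its parent is missing from its own trace or if it is a root emitted by a service that should never start a trace. The function name and the entry_services parameter are hypothetical.

```python
from collections import defaultdict


def orphan_rate_by_service(spans, entry_services):
    """Return {service: orphan_spans / total_spans} for a list of span dicts."""
    # Index the span ids seen in each trace so parents can be looked up.
    ids_by_trace = defaultdict(set)
    for span in spans:
        ids_by_trace[span["trace_id"]].add(span["span_id"])

    total = defaultdict(int)
    orphans = defaultdict(int)
    for span in spans:
        service = span["service"]
        total[service] += 1
        parent_id = span.get("parent_id")
        if parent_id is None:
            # Roots are legitimate only at the entry point (e.g. the gateway).
            is_orphan = service not in entry_services
        else:
            # A parent id that never arrived in this trace is a broken link.
            is_orphan = parent_id not in ids_by_trace[span["trace_id"]]
        orphans[service] += int(is_orphan)

    return {service: orphans[service] / total[service] for service in total}


dump = [
    {"service": "gateway", "trace_id": "t1", "span_id": "g1", "parent_id": None},
    {"service": "worker", "trace_id": "t1", "span_id": "w1", "parent_id": "g1"},
    {"service": "consumer", "trace_id": "t1", "span_id": "c1", "parent_id": "w1"},
    # Missing inject/extract: the consumer started its own trace.
    {"service": "consumer", "trace_id": "t2", "span_id": "c2", "parent_id": None},
]
rates = orphan_rate_by_service(dump, entry_services={"gateway"})
```

Running the same function over the broken baseline and the fixed system gives the two columns of the table, measured rather than estimated.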
Senior stretch
  • Add a CI gate: an end-to-end test that drives a request through all services and asserts the resulting trace has the expected span count linked by one trace-id, failing the build if the orphan rate regresses.
  • Add a service mesh (Linkerd, or an Envoy-based mesh such as Istio) in front of the HTTP hop, enable mesh-hop spans, and show the three-span view (client app, sidecar, server app). Then prove that the mesh still does not fix the Kafka orphan.
  • Add a browser frontend that issues the initial fetch with OTel-JS, and restrict traceparent propagation to same-origin and an explicit CORS allowlist so the header does not leak to third-party endpoints.
  • Reproduce the long-running-trace OOM: emit a trace that outlives the 30s decision window, watch the late spans become orphans, then refactor it into span-linked sub-traces and show the collector RAM stays bounded.
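For the CI gate, the assertion itself can stay small. The sketch below checks a collected span dump for one trace-id, the expected span count, a single root, and an unbroken parent chain. How the spans are collected in the test (for example from an in-memory exporter or a backend query) is left out, and the function name and span dict shape are assumptions made for illustration.

```python
def check_connected_trace(spans, expected_span_count):
    """Return a list of problems; an empty list means one connected trace."""
    problems = []

    trace_ids = {span["trace_id"] for span in spans}
    if len(trace_ids) != 1:
        problems.append(f"expected 1 trace-id, found {len(trace_ids)}")

    if len(spans) != expected_span_count:
        problems.append(
            f"expected {expected_span_count} spans, found {len(spans)}")

    roots = [span for span in spans if span["parent_id"] is None]
    if len(roots) != 1:
        problems.append(f"expected 1 root span, found {len(roots)}")

    span_ids = {span["span_id"] for span in spans}
    for span in spans:
        parent_id = span["parent_id"]
        if parent_id is not None and parent_id not in span_ids:
            problems.append(f"{span['name']}: parent {parent_id} is missing")

    return problems


def assert_connected_trace(spans, expected_span_count):
    """Fail the test (and the build) if propagation regressed."""
    problems = check_connected_trace(spans, expected_span_count)
    assert not problems, "; ".join(problems)
```

In the end-to-end test, drive one request through all services, wait for the spans to flush, and call assert_connected_trace with the span count the flow is expected to produce. Returning the full list of problems rather than stopping at the first makes a failed build point at the boundary that regressed.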
Recap

This is the loop you will run on every real propagation incident: instrument end-to-end, reproduce each orphan class at its boundary, and fix it where it is born — an OTel-aware HTTP client, context.bind across the in-process async gap, inject/extract across the queue — never at the dashboard or the sampler. Verify with the orphan-span rate, the one metric OTel will not surface for you, and run the kept traces through a capped, trace-id-routed tail-sampling tier. Doing it once on a toy system makes the production version, where the gap hides for a quarter, something you catch in an afternoon.
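The capped, trace-id-routed tail-sampling tier can be modeled in a few lines to make its three moving parts explicit. This is a toy model, not the collector's implementation: route, TraceBuffer, and keep_trace are hypothetical names, the 500 ms latency threshold and 1% baseline ratio are assumed values, and only the num_traces cap mirrors a name used in this project.

```python
import hashlib
from collections import OrderedDict


def route(trace_id, collectors):
    """Send every span of a trace to the same collector instance."""
    digest = hashlib.sha256(trace_id.encode("ascii")).digest()
    return collectors[int.from_bytes(digest[:8], "big") % len(collectors)]


class TraceBuffer:
    """Holds undecided traces in memory, bounded by num_traces."""

    def __init__(self, num_traces):
        self.num_traces = num_traces
        self._traces = OrderedDict()

    def add(self, span):
        trace_id = span["trace_id"]
        if trace_id not in self._traces and len(self._traces) >= self.num_traces:
            # Cap reached: evict the oldest trace so memory stays bounded.
            self._traces.popitem(last=False)
        self._traces.setdefault(trace_id, []).append(span)

    def pop(self, trace_id):
        """Remove and return the buffered spans once a decision is due."""
        return self._traces.pop(trace_id, [])

    def __len__(self):
        return len(self._traces)


def keep_trace(spans, slow_ms=500, baseline_ratio=0.01):
    """Keep every error trace and slow trace, plus a small baseline sample."""
    if any(span["status"] == "ERROR" for span in spans):
        return True
    start = min(span["start_ms"] for span in spans)
    end = max(span["end_ms"] for span in spans)
    if end - start >= slow_ms:
        return True
    # Deterministic baseline: derive a value in [0, 1) from the trace-id.
    bucket = int(spans[0]["trace_id"][-8:], 16) / 2**32
    return bucket < baseline_ratio
```

The decision in keep_trace is only correct if it sees the whole trace, which is why route must key on the trace-id. It also shows why sampling cannot repair lineage: an orphan arrives under a different trace-id, so it is routed and judged as a separate trace.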

