Observability
Trace propagation: stitch a broken system into one trace
Reading about orphan traces is not the same as pulling a fragmented system back into one picture. Build a small multi-service flow, watch it shatter into single-span orphans at the HTTP, async, and queue boundaries, then close every gap — and prove the fix with the one metric that does not lie.
Turn the unit’s mental model into a reproducible engineering loop: instrument propagation end-to-end, reproduce each class of orphan, fix it at the boundary (header, context.bind, inject/extract), and verify with an orphan-span-rate metric plus a tail-sampling tier that keeps every error trace.
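The header and inject/extract fixes named above both carry the same W3C `traceparent` value across a boundary. The following is a minimal sketch of that mechanism using only the Python standard library, not the OpenTelemetry API: `SpanContext`, `new_root`, `child_of`, `inject`, and `extract` are hypothetical names chosen for illustration, and a plain dict stands in for HTTP headers or Kafka record headers (real Kafka headers are key/bytes pairs).

```python
import re
import secrets
from typing import NamedTuple, Optional

# W3C Trace Context, version 00: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class SpanContext(NamedTuple):
    trace_id: str  # 32 hex characters
    span_id: str   # 16 hex characters
    sampled: bool


def new_root() -> SpanContext:
    """Start a new trace (what an un-instrumented boundary does by accident)."""
    return SpanContext(secrets.token_hex(16), secrets.token_hex(8), True)


def child_of(parent: SpanContext) -> SpanContext:
    """Keep the trace-id, mint a fresh span-id for the downstream span."""
    return SpanContext(parent.trace_id, secrets.token_hex(8), parent.sampled)


def inject(ctx: SpanContext, carrier: dict) -> None:
    """Write the context into the outgoing headers (producer / client side)."""
    flags = "01" if ctx.sampled else "00"
    carrier["traceparent"] = f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


def extract(carrier: dict) -> Optional[SpanContext]:
    """Read the context from incoming headers (consumer / server side)."""
    match = _TRACEPARENT_RE.fullmatch(carrier.get("traceparent", ""))
    if match is None:
        return None  # missing or malformed header: caller would start an orphan
    trace_id, span_id, flags = match.groups()
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None  # all-zero ids are invalid
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


# Producer side: inject before publishing the message.
producer_ctx = new_root()
headers: dict = {}
inject(producer_ctx, headers)

# Consumer side: extract, then continue the same trace instead of starting a new one.
parent = extract(headers)
consumer_ctx = child_of(parent) if parent is not None else new_root()
```

In the sketch, a consumer that skips `extract` falls through to `new_root()`, which is the queue-boundary orphan in miniature: the span is valid, but its trace-id no longer matches the request that caused it.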
Build a multi-service request flow that crosses an HTTP hop, an in-process async boundary, and a Kafka queue, then make any user request appear as one connected trace within 30s of completion. Drive the internal orphan-span rate from a deliberately broken baseline to under 1%, and prove it by measurement.
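One way to compute the orphan-span rate from a span dump is sketched below. The definition is an assumption made for this example: a span counts as an orphan if its parent_id points at a span missing from the same trace, or if it is a root span emitted by any service other than the entry point. The field names (`trace_id`, `span_id`, `parent_id`, `service`), the service names, and the function `orphan_rate_by_service` are all illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, TypedDict


class Span(TypedDict):
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    service: str


def orphan_rate_by_service(
    spans: Iterable[Span], entry_service: str = "gateway"
) -> Dict[str, float]:
    """Return orphan spans / total spans for each service in the dump."""
    spans = list(spans)
    known = {(s["trace_id"], s["span_id"]) for s in spans}
    total: Dict[str, int] = defaultdict(int)
    orphans: Dict[str, int] = defaultdict(int)
    for s in spans:
        total[s["service"]] += 1
        if s["parent_id"] is None:
            # Only the entry point is allowed to start a trace.
            is_orphan = s["service"] != entry_service
        else:
            # The parent must exist inside the same trace.
            is_orphan = (s["trace_id"], s["parent_id"]) not in known
        if is_orphan:
            orphans[s["service"]] += 1
    return {svc: orphans[svc] / total[svc] for svc in total}


# Broken baseline: every boundary starts a fresh trace.
broken = [
    Span(trace_id="t1", span_id="a", parent_id=None, service="gateway"),
    Span(trace_id="t2", span_id="b", parent_id=None, service="worker"),
    Span(trace_id="t3", span_id="c", parent_id=None, service="deferred"),
    Span(trace_id="t4", span_id="d", parent_id=None, service="consumer"),
]
# Fixed state: one trace-id, unbroken parent chain.
fixed = [
    Span(trace_id="t1", span_id="a", parent_id=None, service="gateway"),
    Span(trace_id="t1", span_id="b", parent_id="a", service="worker"),
    Span(trace_id="t1", span_id="c", parent_id="b", service="deferred"),
    Span(trace_id="t1", span_id="d", parent_id="c", service="consumer"),
]

broken_rates = orphan_rate_by_service(broken)
fixed_rates = orphan_rate_by_service(fixed)
```

Run over a real dump, the same per-service ratio is what would populate the before/after table; the hand-written spans here only show the two extremes.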
Deliverables
- A before/after orphan-span-rate table per service: the broken baseline (each of the three boundaries shown producing orphans) versus the fixed state under 1% for internal services, measured from the metric, not estimated.
- A backend screenshot or span dump of one user request rendered as a single connected trace — gateway, HTTP worker span, deferred-work span, and Kafka consumer span all sharing one trace-id with a correct parent_id chain.
- Proof that the tail-sampling tier keeps an injected error trace and a slow trace while dropping ~99% of baseline traffic, with the num_traces cap visible in the config and the load-balancing exporter routing by trace-id.
- A one-paragraph write-up naming each orphan's root cause and the exact layer the fix belonged at (HTTP client wrapper, context.bind, inject/extract) — and why no amount of sampling could have repaired the lineage.
Stretch goals
- Add a CI gate: an end-to-end test that drives a request through all services and asserts the resulting trace has the expected span count linked by one trace-id, failing the build if the orphan rate regresses.
- Add a service mesh (Linkerd or Envoy) in front of the HTTP hop, enable mesh-hop spans, and show the three-span view (client app, sidecar, server app) — then prove the mesh still does not fix the Kafka orphan.
- Add a browser frontend that issues the initial fetch with OTel-JS, and restrict traceparent propagation to same-origin and an explicit CORS allowlist so the header does not leak to third-party endpoints.
- Reproduce the long-running-trace OOM: emit a trace that outlives the 30s decision window, watch the late spans become orphans, then refactor it into span-linked sub-traces and show the collector RAM stays bounded.
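The CI gate in the list above could rest on an assertion helper along these lines. This is a sketch under assumptions: the spans are taken to be already collected as plain dicts (from an in-memory exporter or a backend query, which is not shown), and `assert_single_connected_trace` and its field names are hypothetical.

```python
from typing import Dict, List, Optional

SpanDict = Dict[str, Optional[str]]


def _check(condition: bool, message: str) -> None:
    # Raise explicitly so the gate still fires when Python runs with -O.
    if not condition:
        raise AssertionError(message)


def assert_single_connected_trace(
    spans: List[SpanDict], expected_span_count: int
) -> None:
    """Fail unless the spans form exactly one tree under one trace-id."""
    trace_ids = {s["trace_id"] for s in spans}
    _check(len(trace_ids) == 1, f"expected one trace-id, got {sorted(trace_ids)}")
    _check(
        len(spans) == expected_span_count,
        f"expected {expected_span_count} spans, got {len(spans)}",
    )
    roots = [s for s in spans if s["parent_id"] is None]
    _check(len(roots) == 1, f"expected one root span, got {len(roots)}")

    by_id = {s["span_id"]: s for s in spans}
    for span in spans:
        # Walk up the parent chain; it must reach the root without gaps or cycles.
        seen = set()
        current = span
        while current["parent_id"] is not None:
            _check(current["span_id"] not in seen, "cycle in parent chain")
            seen.add(current["span_id"])
            parent = by_id.get(current["parent_id"])
            _check(parent is not None, f"span {current['span_id']} has a missing parent")
            current = parent


# The four spans the finished flow is expected to produce for one request.
good_trace = [
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "name": "gateway"},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a", "name": "http-worker"},
    {"trace_id": "t1", "span_id": "c", "parent_id": "b", "name": "deferred-work"},
    {"trace_id": "t1", "span_id": "d", "parent_id": "c", "name": "kafka-consumer"},
]
assert_single_connected_trace(good_trace, expected_span_count=4)
```

Checking the parent chain as well as the trace-id matters: spans can share a trace-id and still render as disconnected fragments if one parent_id points nowhere.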
This is the loop you will run on every real propagation incident: instrument end-to-end, reproduce each orphan class at its boundary, and fix it where it is born — an OTel-aware HTTP client, context.bind across the in-process async gap, inject/extract across the queue — never at the dashboard or the sampler. Verify with the orphan-span rate, the one metric OTel will not surface for you, and run the kept traces through a capped, trace-id-routed tail-sampling tier. Doing it once on a toy system makes the production version, where the gap hides for a quarter, something you catch in an afternoon.
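The in-process async gap can be reproduced without any tracing library. The sketch below uses Python's standard `contextvars` as a stand-in for the tracing context; capturing the context and running the deferred work inside it plays the role that context.bind plays in the text. It is an analogue under that assumption, not the OpenTelemetry API, and `current_trace_id`, `deferred_work`, and `run_request` are hypothetical names.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor
from typing import Optional

# Stand-in for the active trace context of the current request.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


def deferred_work() -> Optional[str]:
    """Runs on a pool thread; reports which trace it believes it belongs to."""
    return current_trace_id.get()


def run_request(pool: ThreadPoolExecutor, trace_id: str, bind: bool) -> Optional[str]:
    """Handle one request and hand part of the work to the pool."""

    def handle() -> Optional[str]:
        current_trace_id.set(trace_id)
        if bind:
            # Capture the caller's context and run the deferred work inside it.
            ctx = contextvars.copy_context()
            future = pool.submit(ctx.run, deferred_work)
        else:
            # The pool thread runs in its own context and never sees the trace-id.
            future = pool.submit(deferred_work)
        return future.result()

    # Each request gets its own context so values do not leak between requests.
    return contextvars.copy_context().run(handle)


TRACE_ID = "4bf92f3577b34da6a3ce929d0e0e4736"

with ThreadPoolExecutor(max_workers=1) as pool:
    # Start the worker thread before any request exists, as a long-lived pool would.
    pool.submit(lambda: None).result()
    unbound = run_request(pool, TRACE_ID, bind=False)  # context lost: orphan
    bound = run_request(pool, TRACE_ID, bind=True)     # context carried across
```

In the unbound case the deferred work has no trace-id to attach to, so an instrumented version would start a new root span. That is why the repair belongs at the hand-off itself and cannot be made later by the sampler.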