Observability
Trace propagation: multiple-choice review
Six questions that cut across the whole unit. Each one mirrors a call you make in a real incident: not a definition to recite, but a propagation decision to weigh while a dashboard lies to you about how much you can see.
Confirm you can connect the W3C header format, consistent sampling, tail-sampling collector economics, async-boundary failures, and propagation observability — the synthesis the individual lessons built toward.
A service receives 'traceparent: 99-aaaaaaaa-bbbbbbbb-01'. Per the W3C spec, what must it do, and why does this rule matter for the whole fleet?
A team runs probabilistic head sampling at 1% across 12 services with no central coordinator. How do all 12 independently agree to keep or drop the same trace?
A tail-sampling collector OOMs every few hours. Trace count is flat but spans-per-trace is climbing. What is the cause, and which 'fix' makes it worse?
A Node service uses OTel HTTP auto-instrumentation. Work deferred via setTimeout, and messages it publishes to Kafka, both show up as orphan traces. What is the correct two-part fix?
An internal service's orphan-span rate sits at 5% for a quarter while dashboards show full-looking traces. The team adds tail sampling hoping to clean it up. What is the core misunderstanding?
A consumer pulls 1,000 Kafka messages and processes them in one batch. Modelling this as one parent span with 1,000 children OOMs the collector. What is the idiomatic fix and what does it preserve?
The unit’s through-line is one chain. The 55-byte W3C traceparent stitches services together; an invalid header means the receiver discards it and starts a fresh trace; and a uniformly random trace-id lets every service hash its way to the same keep/drop verdict, which is what makes sampling consistent without a coordinator. The tail-sampling collector trades RAM (roughly active traces × spans per trace × bytes per span, held for the decision window) for outcome-aware selection, so long-running traces and fan-in must be broken up with span links. Async boundaries are the leading source of orphans: setTimeout needs context.bind, and queues need inject/extract. Because propagation fails silently, the orphan-span rate is the only metric that tells you the dashboard is lying.
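The first links of that chain, rejecting a malformed header and then deriving a keep/drop verdict from the trace-id alone, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not OpenTelemetry's implementation: `TraceContext`, `parseTraceparent`, and `keepTrace` are hypothetical names, the parser only accepts the 55-character version-00 shape (the spec's forward-compatibility rules for longer future-version headers are left out), and comparing the low 48 bits of the trace-id against the sampling ratio is one possible deterministic scheme chosen here for simplicity.

```typescript
// Sketch only: these names are illustrative, not part of any OpenTelemetry API.

interface TraceContext {
  traceId: string;  // 32 lowercase hex characters
  parentId: string; // 16 lowercase hex characters
  sampled: boolean; // bit 0 of trace-flags
}

// Version-00 layout: 2 + 1 + 32 + 1 + 16 + 1 + 2 = 55 characters.
const TRACEPARENT_SHAPE =
  /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Returns the parsed context, or null when the caller should ignore the
// header and start a new trace.
function parseTraceparent(header: string | undefined): TraceContext | null {
  if (header === undefined) return null;
  const match = TRACEPARENT_SHAPE.exec(header.trim());
  if (match === null) return null; // wrong field lengths or characters
  const [, version, traceId, parentId, flags] = match;
  if (version === "ff") return null; // version ff is forbidden
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null; // all-zero ids are invalid
  return { traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Deterministic head-sampling verdict: every service that sees the same
// trace-id and uses the same ratio reaches the same answer, with no coordinator.
// Assumption: the low 48 bits of the trace-id are uniformly random.
function keepTrace(traceId: string, ratio: number): boolean {
  const low48 = parseInt(traceId.slice(-12), 16); // below 2^48, exact as a JS number
  return low48 < ratio * Math.pow(2, 48);
}

// The header from the first question has fields of the wrong length.
const incoming = parseTraceparent("99-aaaaaaaa-bbbbbbbb-01");
console.log(incoming === null ? "invalid: start a new trace" : "continue trace");
// prints "invalid: start a new trace"
```

Because `keepTrace` is a pure function of the trace-id and the ratio, the verdict only diverges across services when they are configured with different ratios or the trace-id is not actually random, which is why the uniform-randomness assumption carries so much weight.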