Baggage and async boundaries: carrying context across queues and callbacks
A developer writes setTimeout(() => doWork(), 100) inside a request handler. The deferred work runs fine, emits spans, and shows up in the tracing dashboard — as a completely separate trace with no connection to the request that triggered it.
Baggage: arbitrary key-values across the journey
Baggage is a second header (baggage) defined by W3C, propagated identically to traceparent. It carries free-form key-value pairs: tenant-id, region, feature-flag selection, A/B test cohort. Baggage is read by downstream services and can be attached as span attributes or used by code (for example, to enforce per-tenant quotas).
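To make the wire format concrete, here is a minimal sketch of how a baggage header value could be built and read back. The helper names `serializeBaggage` and `parseBaggage` are hypothetical, the example values are invented, and the sketch ignores the optional per-entry properties (`;key=value`) that the W3C format allows. In a real service the OTel propagator does this work.

```javascript
// Hypothetical helpers illustrating the W3C baggage wire format:
// comma-separated key=value members, with values percent-encoded.
function serializeBaggage(entries) {
  return Object.entries(entries)
    .map(([key, value]) => `${key}=${encodeURIComponent(value)}`)
    .join(',');
}

function parseBaggage(headerValue) {
  const entries = {};
  for (const member of headerValue.split(',')) {
    // Drop optional properties after ';' (not handled in this sketch).
    const [pair] = member.split(';');
    const eq = pair.indexOf('=');
    if (eq === -1) continue; // skip malformed members
    const key = pair.slice(0, eq).trim();
    const value = pair.slice(eq + 1).trim();
    // Note: decodeURIComponent throws on malformed escapes; a real parser would guard this.
    if (key) entries[key] = decodeURIComponent(value);
  }
  return entries;
}

const header = serializeBaggage({
  'tenant-id': 't_8f3a',
  region: 'eu-west-1',
  cohort: 'checkout v2',
});
console.log(header); // tenant-id=t_8f3a,region=eu-west-1,cohort=checkout%20v2
console.log(parseBaggage(header));
```

A downstream service would typically copy selected entries onto span attributes or read them in application code, as described above.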
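The catch: the spec gives you no protective ceiling. The W3C Baggage specification only defines a floor that platforms must propagate (at least 64 entries and 8192 bytes in total); beyond that, size is limited only by the carrier protocol. HTTP header limits (typically 8–64 KB, server-configurable) are the practical cap.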
Two real risks:
- Latency bloat. Baggage is propagated on every request hop, every queue message, every async boundary. Each hop pays the cost twice: serialise on outbound, parse on inbound. A 16-KB baggage payload on 10k req/s of internal traffic means roughly 160 MB/s of pure baggage transport plus parsing time at every hop.
- PII leakage. Baggage flows to every downstream service, including third-party integrations and SaaS APIs that might log or persist headers, often in less-audited contexts than the source database. Anything in baggage is effectively semi-public within and beyond your engineering org.
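The bandwidth figure in the first risk is plain multiplication. The sketch below writes it out so the numbers can be swapped for your own traffic; the function name and the decimal convention (1 KB = 1000 bytes) are assumptions for illustration.

```javascript
// Back-of-the-envelope baggage bandwidth for one hop, in decimal megabytes per second.
function baggageMegabytesPerSecond(bytesPerRequest, requestsPerSecond) {
  return (bytesPerRequest * requestsPerSecond) / 1e6;
}

console.log(baggageMegabytesPerSecond(16_000, 10_000)); // 160 (the 16-KB case above)
console.log(baggageMegabytesPerSecond(200, 10_000)); // 2 (a handful of short tags)
```

The cost applies per hop, so a request that crosses five services pays it five times.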
Discipline: baggage holds operational tags, never identity data.
What to put in baggage: tenant-id (opaque token), region, feature-flag state, A/B cohort name, request-source label. What never to put in baggage: user emails, credit-card or payment tokens, API keys, auth tokens, customer PII, internal secrets.
| Safe in baggage | Never in baggage |
|---|---|
| tenant-id (opaque token) | user email or full user-id |
| region / availability-zone | credit-card or payment token |
| feature-flag state (e.g. checkout-v2=true) | API keys or auth tokens |
| A/B cohort name | any customer PII |
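One way to enforce this discipline in code is to pass baggage through an allowlist and a byte budget before it leaves the service. The sketch below is an assumption about how such a guard could look: `ALLOWED_KEYS`, `MAX_BAGGAGE_BYTES`, and `sanitizeBaggage` are hypothetical names, the key spellings and the 512-byte budget are invented for illustration, and none of this is part of an OTel API.

```javascript
// Hypothetical outbound guard: keep only allowlisted keys and stay under a byte budget.
const ALLOWED_KEYS = new Set(['tenant-id', 'region', 'feature-flags', 'ab-cohort', 'request-source']);
const MAX_BAGGAGE_BYTES = 512; // assumed budget; tune per system

function sanitizeBaggage(entries) {
  const kept = {};
  let bytes = 0;
  for (const [key, value] of Object.entries(entries)) {
    if (!ALLOWED_KEYS.has(key)) continue; // drop anything not explicitly allowed
    // Allowlisted keys are ASCII and the value is percent-encoded,
    // so string length equals byte length here.
    const cost = `${key}=${encodeURIComponent(value)}`.length + 1; // +1 for the ',' separator
    if (bytes + cost > MAX_BAGGAGE_BYTES) continue; // skip entries that would exceed the budget
    kept[key] = value;
    bytes += cost;
  }
  return kept;
}

const safe = sanitizeBaggage({
  'tenant-id': 't_8f3a',
  region: 'eu-west-1',
  'user-email': 'alice@example.com', // PII: not on the allowlist, so it is dropped
});
console.log(safe); // { 'tenant-id': 't_8f3a', region: 'eu-west-1' }
```

An allowlist is used rather than a blocklist so that a newly added key is dropped by default until someone decides it is safe to propagate.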
Async boundaries: the killer
HTTP propagation is the easy case — OTel auto-instrumentation handles it for supported HTTP servers and clients. The hard cases are queues, timers, callbacks.
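The Node.js setTimeout trap. The in-process “current context” in Node.js is tied to the code that is currently executing, but setTimeout, setImmediate, and queueMicrotask schedule work to run later on a different call stack. Unless a context manager tracks that asynchronous hop, the original request’s context is gone when the callback fires. The OTel Node SDK’s AsyncLocalStorage-based context manager does track native timers and promises, so the trap bites when no context manager is registered, when it is misconfigured, or when a library queues callbacks itself. In those cases OTel creates a new root span (a fresh trace-id) for the deferred work. The result: a trace that ends abruptly at the setTimeout call site, and a separate orphan trace for the deferred work.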
The fix: wrap the callback with context.bind(ctx, fn) or run it inside context.with(ctx, fn):
```javascript
const { context } = require('@opentelemetry/api');
// Assumes an Express `app` and a `doDeferredWork` function defined elsewhere.

// BROKEN: request context lost across the setTimeout
app.post('/checkout', async (req, res) => {
  setTimeout(() => {
    doDeferredWork(); // new trace; not linked to request
  }, 100);
  res.status(202).end();
});

// FIXED: bind context across the boundary
app.post('/checkout', async (req, res) => {
  const ctx = context.active(); // capture the request's context now
  setTimeout(context.bind(ctx, () => {
    doDeferredWork(); // now sees the request's trace-id
  }), 100);
  res.status(202).end();
});
```

context.bind(ctx, fn) returns a wrapped function that restores the captured context when invoked. The same pattern applies to setImmediate, queueMicrotask, callbacks passed to third-party libraries, and any custom in-process scheduler or job queue that invokes callbacks later.
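To see what binding actually does, it can help to strip the idea down to a few lines. The sketch below is not the OTel implementation: `active`, `withContext`, and `bind` are hypothetical stand-ins for `context.active`, `context.with`, and `context.bind`, built on a deliberately naive holder that only knows the context while the scoping call is on the stack. A plain array (`pending`) stands in for the timer queue so the example runs synchronously.

```javascript
// Naive call-stack-scoped context: `current` is set only while withContext() is on the stack.
let current;

function active() {
  return current;
}

function withContext(ctx, fn) {
  const previous = current;
  current = ctx;
  try {
    return fn();
  } finally {
    current = previous; // restore on the way out, even if fn throws
  }
}

// Returns a wrapper that re-enters the captured context whenever it is called.
function bind(ctx, fn) {
  return (...args) => withContext(ctx, () => fn(...args));
}

// Stand-in for a timer queue: callbacks run after the scheduling stack has unwound.
const pending = [];
const defer = (fn) => pending.push(fn);
const runPending = () => {
  while (pending.length) pending.shift()();
};

const seen = [];
const record = () => seen.push(active()?.traceId ?? 'none');

withContext({ traceId: 'abc' }, () => {
  defer(record); // unbound: will run with no context
  defer(bind(active(), record)); // bound: will run with the captured context
});
runPending();

console.log(seen); // [ 'none', 'abc' ]
```

The unbound callback sees no context because `withContext` has already returned by the time it runs, while the bound one re-enters the captured context first. That is the whole trick behind binding a callback before handing it to a scheduler.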
Kafka and message-queue propagation
When work crosses a process boundary via a message queue (Kafka, RabbitMQ, SQS, Cloud Pub/Sub), the in-process context cannot carry across. The trace-id must be written into the message’s wire format and read back on the other side:
- Producer side: call propagator.inject(context, message.headers) before publishing. This writes traceparent (and baggage) into the message’s header map.
- Consumer side: call context = propagator.extract(message.headers) after polling. This restores the trace-id, allowing the consumer to create a child span linked to the producer’s span.
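The two sides can be sketched without any tracing library to show what travels in the message. The `inject` and `extract` functions below are hypothetical stand-ins limited to the traceparent header, the message shape is invented, and the ids are sample values. A real propagator also handles baggage and stricter validation, such as rejecting all-zero ids.

```javascript
// Hypothetical stand-ins for a propagator's inject/extract, limited to traceparent.
// Format: version "00", 32-hex trace-id, 16-hex parent span-id, 2-hex flags.
const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function inject(spanContext, headers) {
  headers.traceparent = `00-${spanContext.traceId}-${spanContext.spanId}-${spanContext.flags}`;
}

function extract(headers) {
  const raw = headers.traceparent;
  // Queue clients often deliver header values as Buffers; String() yields their text.
  const match = TRACEPARENT.exec(raw == null ? '' : String(raw));
  if (!match) return null; // no usable parent: the consumer would start a new trace
  return { traceId: match[1], parentSpanId: match[2], flags: match[3] };
}

// Producer: write the current span's identity into the message headers before publishing.
const message = { value: 'order-created', headers: {} };
inject(
  { traceId: '4bf92f3577b34da6a3ce929d0e0e4736', spanId: '00f067aa0ba902b7', flags: '01' },
  message.headers
);

// Consumer: read it back after polling and use it as the parent of the processing span.
const parent = extract(message.headers);
console.log(message.headers.traceparent);
// 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
console.log(parent.traceId); // 4bf92f3577b34da6a3ce929d0e0e4736
```

If `extract` returns null, the consumer has nothing to link to, which is exactly the orphan-trace symptom seen when the producer never injected.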
OTel’s Kafka instrumentation handles both sides automatically — but only if the instrumentation library is loaded and configured.
The critical distinction: context.bind keeps in-process context alive across an async boundary in the producer, but it does not write traceparent into the message. Kafka lives in a separate process. The consumer’s process has no shared memory with the producer; the only way to carry the trace-id is to inject it into the message headers before publishing.
1. Explain why baggage size discipline matters in production, and what should never go into baggage.
2. What is the difference between context.bind (in-process) and inject/extract (cross-process) for trace propagation?
3. Name four async boundaries in Node.js that require explicit context propagation and explain the fix for each.
The W3C baggage header propagates operational key-values (tenant-id, feature flags, A/B cohort) alongside the trace, but every hop pays serialise and parse costs, and the header flows to every downstream including third-party integrations. PII or secrets in baggage are semi-public within and beyond the engineering org. Async boundaries are the leading cause of split traces in production: setTimeout, setImmediate, and queueMicrotask fire in a different call stack with no OTel context; fix them with context.bind or context.with. Queue consumers (Kafka, RabbitMQ, SQS) cross a process boundary — the fix is propagator.inject() into message headers at the producer and propagator.extract() at the consumer; context.bind alone is not enough.