Observability
Async context per language, service mesh, B3 migration, and security
A Java service migrates from thread-pool-based request handling to Project Loom virtual threads. After the migration, 30% of traces are orphans. The OTel SDK version didn’t change. What broke?
Async context propagation: language by language
Each runtime has its own mechanism for “what is the current execution context?” — and OTel hooks into that mechanism. When you cross a primitive the runtime doesn’t carry context through automatically, context is silently lost.
Node.js: AsyncLocalStorage (available since Node 12.17 and 13.10) is the substrate. OTel hooks into it for in-process propagation. Pitfalls: callbacks handed to setTimeout, setImmediate, or queueMicrotask, and any third-party promise wrapper or callback queue that runs work in a different AsyncLocalStorage context, can lose trace context. OTel auto-instrumentation patches the common cases, but custom libraries can still break it. Fix: call context.bind(ctx, fn) before passing a callback across any async boundary.
Python: contextvars (PEP 567) is the substrate. It works automatically for asyncio tasks, but a manually spawned thread does not inherit the parent's context by default, so the OTel context from the parent thread is lost. Fix: snapshot the context in the parent with contextvars.copy_context() and run the thread target inside it, or pass the current OTel context to the child thread and set it via context.attach(ctx) on entry.
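The thread case can be reproduced with the standard library alone. The sketch below is a minimal illustration, not OTel code: `current_trace_id` is a hypothetical stand-in for the SDK's current-span slot, and `run_demo` is an invented helper. It compares a bare thread with one that runs inside a snapshot taken by `contextvars.copy_context()`.

```python
import contextvars
import threading

# Hypothetical stand-in for the tracing SDK's "current span" slot.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


def record(results, label):
    # Store whatever trace id is visible in the running thread's context.
    results[label] = current_trace_id.get()


def run_demo():
    results = {}
    current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")

    # Bare thread: on default CPython builds it starts with a fresh context,
    # so the value set above is typically not visible inside it.
    bare = threading.Thread(target=record, args=(results, "bare"))

    # Snapshot the parent's context and run the target inside the snapshot.
    snapshot = contextvars.copy_context()
    carried = threading.Thread(
        target=snapshot.run, args=(record, results, "carried")
    )

    for thread in (bare, carried):
        thread.start()
    for thread in (bare, carried):
        thread.join()
    return results
```

With the OTel SDK the same shape applies: capture the context in the parent, then make it current as the first action of the child thread.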
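Java: traditional ThreadLocal plus a Span.makeCurrent() try-with-resources block. Project Loom virtual threads are mostly transparent for simple cases, but two things need care. A newly started thread, virtual or platform, does not see the parent's ThreadLocal context unless the task is wrapped. And if a virtual thread is created inside a Scope, the Scope must outlive the virtual thread; otherwise the context is closed while the thread is still running. Fix: start virtual threads inside the Scope, keep the Scope open until they finish, and submit work through a context-aware executor (for example Context.taskWrapping(executor) or Context.current().wrap(task)).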
Go: explicit context.Context plumbing: every function takes ctx, every span lives in ctx. Go made the right architectural choice early; context never flows implicitly. The failure mode in Go is not losing context accidentally, but forgetting to thread ctx through a function call chain. Fix: pass ctx everywhere and back that up with linters. go vet and staticcheck catch some context misuse, and a context-specific linter such as contextcheck can flag call chains that drop ctx.
Browsers: zone.js (Angular’s solution for patching async primitives), or the TC39 AsyncContext proposal; OTel-JS supports both via plugins. Service workers and Web Workers require explicit context pass-through.
| Runtime | Context substrate | Common failure mode | Fix |
|---|---|---|---|
| Node.js | AsyncLocalStorage | setTimeout / custom async wrappers | context.bind(ctx, fn) |
| Python | contextvars (PEP 567) | Manual threading bypasses asyncio | contextvars.copy_context() on thread spawn |
| Java | ThreadLocal + Scope | Loom virtual threads outlive Scope | Create virtual thread inside Scope; use context-aware executor |
| Go | Explicit context.Context | ctx not threaded through a function | Pass ctx everywhere; vet/staticcheck |
B3 vs W3C: the migration story and safe sequence
Before W3C Trace Context, Twitter’s Zipkin/Brave used B3, with two variants: B3 multi-header (X-B3-TraceId, X-B3-SpanId, X-B3-Sampled as separate headers) and B3 single-header (the same fields combined into one b3 header, in the form {TraceId}-{SpanId}-{SamplingState}). B3’s original trace-id was 64-bit; it was later extended to 128-bit, which matches the width W3C Trace Context requires.
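To make the header shapes concrete, here is a minimal sketch that parses both B3 variants and renders the equivalent W3C traceparent value. The function names are invented for illustration, and the sketch assumes a 64-bit B3 trace-id is left-padded with zeros to reach 128 bits. It ignores X-B3-ParentSpanId, X-B3-Flags, and sampling-only b3 values.

```python
import re


def _is_hex(value, lengths):
    # True when value is lowercase hex with one of the allowed lengths.
    return len(value) in lengths and re.fullmatch(r"[0-9a-f]+", value) is not None


def parse_b3_multi(headers):
    """Read the separate X-B3-* headers into a small dict."""
    return {
        "trace_id": headers["X-B3-TraceId"].lower(),
        "span_id": headers["X-B3-SpanId"].lower(),
        "sampled": headers.get("X-B3-Sampled") == "1",
    }


def parse_b3_single(value):
    """Read a combined b3 header: {TraceId}-{SpanId}[-{SamplingState}[-{ParentSpanId}]]."""
    parts = value.lower().split("-")
    if len(parts) < 2:
        raise ValueError("sampling-only b3 values are not handled in this sketch")
    sampled = len(parts) > 2 and parts[2] in ("1", "d")  # "d" marks a debug trace
    return {"trace_id": parts[0], "span_id": parts[1], "sampled": sampled}


def to_traceparent(ctx):
    """Render version-00 traceparent: 00-{trace-id}-{parent-id}-{flags}."""
    if not _is_hex(ctx["trace_id"], (16, 32)) or not _is_hex(ctx["span_id"], (16,)):
        raise ValueError("malformed B3 ids")
    trace_id = ctx["trace_id"].zfill(32)  # left-pad a 64-bit id to 128 bits
    flags = "01" if ctx["sampled"] else "00"
    return f"00-{trace_id}-{ctx['span_id']}-{flags}"
```

For example, a multi-header request carrying the 64-bit id `463ac35c9f6413ad` becomes a traceparent whose trace-id field is sixteen zeros followed by that id.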
Safe migration sequence:
1. Audit: identify every service still emitting B3-only headers.
2. Deploy the composite propagator everywhere: register W3C TraceContext + B3 multi + B3 single at every service. Extract inbound from every format and write W3C outbound; keep B3 outbound only toward downstreams not yet confirmed to read W3C. This is the “read-both, write-W3C” phase.
3. Verify: confirm the orphan-span rate doesn’t regress.
4. Phase out B3 outbound: after downstream services are confirmed to read W3C, disable B3 outbound at each upstream.
5. Remove the B3 extractor: after a quarter at zero B3 inbound, replace the composite propagator with W3C-only.
What goes wrong if steps are skipped:
- Skipping step 2: upstream sends W3C while downstream only reads B3 → traces split.
- Skipping step 3: a propagation regression goes silent for weeks.
- Skipping step 5: double header bytes per request indefinitely.
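The read-both behaviour can be sketched as a chain of extractors tried in order, with W3C first. This is an illustrative model rather than the OTel composite propagator itself: all names are invented, and the `inbound_formats` counter is a hypothetical metric of the kind that would show when B3 inbound has dropped to zero. Only the W3C write is shown; whether B3 is also written during the transition is left out.

```python
from collections import Counter

# Hypothetical metric: which header format each inbound request arrived with.
inbound_formats = Counter()


def extract_w3c(h):
    parts = h.get("traceparent", "").split("-")
    if len(parts) != 4 or len(parts[1]) != 32 or len(parts[2]) != 16:
        return None
    try:
        sampled = (int(parts[3], 16) & 1) == 1
    except ValueError:
        return None
    return parts[1], parts[2], sampled


def extract_b3_multi(h):
    trace_id, span_id = h.get("x-b3-traceid"), h.get("x-b3-spanid")
    if not trace_id or not span_id:
        return None
    return trace_id.zfill(32), span_id, h.get("x-b3-sampled") == "1"


def extract_b3_single(h):
    parts = h.get("b3", "").split("-")
    if len(parts) < 2:
        return None
    sampled = len(parts) > 2 and parts[2] in ("1", "d")
    return parts[0].zfill(32), parts[1], sampled


# Order matters: W3C wins when a request carries both formats.
EXTRACTORS = [
    ("w3c", extract_w3c),
    ("b3multi", extract_b3_multi),
    ("b3single", extract_b3_single),
]


def extract(headers):
    """Return (trace_id, parent_span_id, sampled) or None, and count the format."""
    lowered = {key.lower(): value for key, value in headers.items()}
    for name, extractor in EXTRACTORS:
        ctx = extractor(lowered)
        if ctx is not None:
            inbound_formats[name] += 1
            return ctx
    inbound_formats["none"] += 1
    return None


def inject_w3c(ctx, span_id):
    """Write the outbound W3C header for a span that continues ctx."""
    trace_id, _, sampled = ctx
    return {"traceparent": f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"}
```

Counting the format on every extraction gives a direct signal for the final step: the B3 extractors can be removed once the B3 counters stay at zero for the agreed period.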
Trace context across service mesh
Envoy, Linkerd, Istio, Cilium data planes participate in tracing two ways:
- Pass-through (always): the sidecar forwards traceparent/tracestate/baggage headers on every HTTP and gRPC request transparently.
- Emit their own spans (optional but recommended): when enabled, the sidecar creates a span for the network hop, showing sidecar latency, connection pooling, and TLS handshake timing distinct from application latency.
Configuration: the mesh proxy needs the tracing-collector address and the sampling decision (usually inherit the incoming flag). The mesh’s sampling decision must agree with the application’s; mismatched rates produce inconsistent traces.
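"Inherit the incoming flag" means reading the sampled bit from the last field of traceparent. The sketch below shows that decision in isolation; the function names are invented, and `own_decision` stands for whatever the proxy would decide by its own configured rate when no usable header arrives.

```python
def incoming_sampled(traceparent):
    """Return True or False from the trace-flags byte, or None if unparseable."""
    parts = traceparent.strip().split("-")
    if len(parts) < 4 or len(parts[3]) != 2:
        return None
    try:
        flags = int(parts[3], 16)
    except ValueError:
        return None
    return (flags & 0x01) == 0x01  # bit 0 of trace-flags is the sampled flag


def proxy_sampling_decision(traceparent, own_decision):
    """Inherit the caller's decision when present; otherwise use the proxy's own."""
    inherited = incoming_sampled(traceparent) if traceparent else None
    return own_decision if inherited is None else inherited
```

Following the incoming flag keeps the mesh span and the application spans of one request either all recorded or all dropped.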
The limit: service mesh only handles HTTP and gRPC. Queue consumers, timers, and fire-and-forget callbacks still require explicit application-level propagation. The mesh is not a substitute for OTel SDK instrumentation; it adds a network-hop span, it does not replace application spans.
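For queue traffic, the application has to carry the context itself: the producer writes traceparent into the message headers and the consumer reads it back before starting its span. The sketch below models this with an in-process `queue.Queue` as a stand-in for a broker; the message shape and the function names are assumptions made for illustration, not a Kafka client API.

```python
import queue
import secrets


def new_span_id():
    return secrets.token_hex(8)  # 8 random bytes give 16 hex characters


def publish(q, payload, trace_id, parent_span_id, sampled=True):
    """Producer side: copy the trace context into the message headers."""
    flags = "01" if sampled else "00"
    headers = {"traceparent": f"00-{trace_id}-{parent_span_id}-{flags}"}
    q.put({"headers": headers, "payload": payload})


def consume(q):
    """Consumer side: rebuild the parent link from the message headers."""
    message = q.get_nowait()
    parts = message["headers"].get("traceparent", "").split("-")
    if len(parts) == 4:
        trace_id, parent_span_id = parts[1], parts[2]
    else:
        # No usable header: the consumer span starts a new, orphaned trace.
        trace_id, parent_span_id = secrets.token_hex(16), None
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "span_id": new_span_id(),
        "payload": message["payload"],
    }
```

If the producer skips the header, the consumer falls into the orphan branch regardless of what the mesh forwards on HTTP and gRPC hops.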
Why this works
When the mesh emits its own span, you gain a three-span view of one HTTP call: client app, mesh sidecar, server app. This lets you distinguish “the application was slow” from “the mesh was slow” — a critical distinction during incidents involving sidecar upgrades, connection pool exhaustion, or mTLS certificate renewal storms.
Security: the trace-id as a tracking identifier
Trace-ids are unique per request, 128 bits of entropy, propagated in HTTP headers and visible to anyone who can inspect traffic between the client and the origin. This makes them powerful debugging tools and equally powerful potential trackers.
The risk: if a third party (a CDN, a marketing pixel, a CSP-allowed analytics service) can read the traceparent header from the user’s outbound requests, it can correlate user activity across sites that share the same tracing infrastructure.
Mitigations:
- The W3C spec recommends that user-facing services do not propagate traceparent in responses (responses are to the user, not part of an upstream call).
- Browser-side OTel SDKs should limit propagation to same-origin and explicitly-allowed CORS origins (for example, the propagateTraceHeaderCorsUrls option of the OTel-JS fetch and XHR instrumentations).
- Production teams maintain an allowlist of downstream hostnames that receive traceparent and audit it quarterly.
- Baggage applies identically: anything in baggage is observable by every downstream, including third parties.
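The allowlist mitigation amounts to an origin check before any trace header is attached. The following sketch shows one way to express it; `ALLOWED_ORIGINS`, the example hostnames, and the function names are all hypothetical, and the comparison is a plain string match on scheme and host, so an explicit default port such as `:443` would need normalising in real use.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist of cross-origin backends that may receive trace headers.
ALLOWED_ORIGINS = {"https://api.example.com"}


def origin_of(url):
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}".lower()


def should_propagate(request_url, page_origin, allowed=ALLOWED_ORIGINS):
    """Allow same-origin requests and explicitly allowlisted origins only."""
    target = origin_of(request_url)
    return target == page_origin.lower() or target in allowed


def outbound_headers(request_url, page_origin, trace_headers):
    """Attach traceparent and baggage only when the target passes the check."""
    if should_propagate(request_url, page_origin):
        return dict(trace_headers)
    return {}
```

Baggage goes through the same gate as traceparent, so a third-party request leaves the page with neither header.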
| Fact | Value |
|---|---|
| Default OTel propagator | TraceContext + Baggage composite |
| B3 single-header trace-id width (original) | 64 bits (later extended to 128) |
| Service-mesh sidecar tracing overhead | ~1–2% extra CPU |
| Per-request header bytes (traceparent + small tracestate) | ~80–200 bytes |
| W3C Trace Context Level 1 | Recommendation 2020-02 |
| W3C Trace Context Level 2 | Recommendation 2024 |
A team migrates from B3 to W3C propagation. They deploy W3C-write on upstream services before deploying W3C-read on downstream services. What happens?
A service mesh (Envoy) is configured to propagate traceparent and emit its own mesh-hop spans. After enabling it, the team still sees orphan traces for some Kafka consumers. Why?
- 01 A Java service migrates to Project Loom virtual threads and orphan-span rate spikes to 30%. Diagnose and fix.
- 02 Describe the safe 5-step sequence for migrating from B3 to W3C TraceContext and what breaks if you skip step 2.
- 03 What is the traceparent privacy risk in browser applications and what are the three mitigations?
Context propagation in each runtime hooks into a different substrate: Node’s AsyncLocalStorage, Python’s contextvars, Java’s ThreadLocal plus Scope, Go’s explicit context.Context. Each has its own failure mode when you cross a primitive the runtime doesn’t auto-carry — the fix is always explicit context binding at that boundary. B3 migration to W3C requires read-both before write-W3C, verified by orphan-rate monitoring. Service mesh passes traceparent for HTTP/gRPC transparently and can emit mesh-hop spans, but does not instrument queue consumers — those still require application-level inject/extract. The traceparent header in browser requests is a tracking vector if propagated to cross-origin third parties; restrict it to same-origin and explicitly-allowed CORS origins.