Networking & Protocols

Observability: distributed traces, USE/RED, and sampling

Crux: OpenTelemetry with W3C Trace Context propagation reveals which hop ate the 500 ms; the USE and RED methods discipline instrumentation; head-based versus tail-based sampling controls the cost of capturing traces at 1 M req/s.

p95 latency tripled overnight from 80 ms to 240 ms. You have logs from the load balancer, the application, and the database. Three logs, three clocks, three grep sessions. Without distributed traces, answering “where did the 160 ms go” can take hours. With OpenTelemetry traces, it can take 30 seconds: look at the span that grew, and you have a suspect.

OpenTelemetry and W3C Trace Context

Modern observability instruments every network hop with a shared identifier. The W3C Trace Context standard defines two HTTP headers:

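  • traceparent: 00-<trace-id>-<parent-id>-<trace-flags>, that is, version 00, the globally unique 16-byte trace ID (32 hex characters), the current span’s 8-byte ID (16 hex characters; the spec names this field parent-id because the receiver treats it as its parent), and one byte of flags whose lowest bit marks the trace as sampled.
  • tracestate: a list of vendor-specific key=value entries (Datadog trace ID, Jaeger flags, etc.).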
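
The traceparent layout described above can be checked mechanically. The following is a minimal sketch, not a full implementation of the spec: it accepts only version 00, and the names `TraceParent` and `parse_traceparent` are made up for this illustration. The sample header value is the one used as an example in the W3C Trace Context specification.

```python
import re
from typing import NamedTuple, Optional

# Version 00 layout: "00", 32 hex chars, 16 hex chars, 2 hex chars, dash-separated.
_TRACEPARENT_V00 = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceParent(NamedTuple):
    trace_id: str   # 16 bytes as 32 lowercase hex characters
    parent_id: str  # 8 bytes as 16 lowercase hex characters (the sender's span ID)
    sampled: bool   # lowest bit of the flags byte


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is not a valid version-00 value."""
    match = _TRACEPARENT_V00.match(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are explicitly invalid.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return TraceParent(trace_id, parent_id, bool(int(flags, 16) & 0x01))


example = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(example)
```

A receiver that gets `None` back would normally start a fresh trace rather than fail the request.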

Every component in the request path — browser, CDN edge (via Cloudflare Workers), load balancer, application server, database call, external API call — reads the incoming traceparent, creates a child span with a new span-id, does its work, emits timing data, and passes the same trace-id forward in its outbound request headers.

The result: an observability backend (Tempo, Jaeger, Honeycomb, Datadog APM) can display the entire request as a waterfall of spans (each hop, each dependency, each database query, each external API call) with the start time and duration that each component recorded. Finding “where did 500 ms go” reduces to finding the longest span whose time is not explained by its children.

Debug this

OpenTelemetry trace for a single slow request — find the bottleneck

Trace 7a8f3c... duration 587ms

span: HTTP request                                0–587ms   (587ms)
  span: DNS lookup                                0–4ms     (4ms)
  span: TCP connect                               4–32ms    (28ms)
  span: TLS handshake                             32–67ms   (35ms)
  span: HTTP request send                         67–69ms   (2ms)
  span: Server processing                         69–512ms  (443ms)
    span: Auth middleware                         69–73ms   (4ms)
    span: Database query SELECT users             73–78ms   (5ms)
    span: Database query SELECT user_settings     78–82ms   (4ms)
    span: External API call third-party.com       82–489ms  (407ms)
    span: Response serialisation                  489–512ms (23ms)
  span: HTTP response receive                     512–587ms (75ms)

Total request is 587 ms. Which span is the bottleneck and what is your action item?
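
One way to answer this mechanically is to rank spans by self time, meaning a span's duration minus the time covered by its direct children. The sketch below re-enters the trace above by hand; the `Span` class and `walk` helper are hypothetical, and the self-time calculation assumes sibling spans do not overlap, which holds for this trace.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Span:
    name: str
    start_ms: int
    end_ms: int
    children: List["Span"] = field(default_factory=list)

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms

    @property
    def self_ms(self) -> int:
        # Time not explained by direct children (assumes children do not overlap).
        return self.duration_ms - sum(c.duration_ms for c in self.children)


def walk(span: Span) -> Iterator[Span]:
    """Yield the span and all of its descendants, depth first."""
    yield span
    for child in span.children:
        yield from walk(child)


trace = Span("HTTP request", 0, 587, [
    Span("DNS lookup", 0, 4),
    Span("TCP connect", 4, 32),
    Span("TLS handshake", 32, 67),
    Span("HTTP request send", 67, 69),
    Span("Server processing", 69, 512, [
        Span("Auth middleware", 69, 73),
        Span("Database query SELECT users", 73, 78),
        Span("Database query SELECT user_settings", 78, 82),
        Span("External API call third-party.com", 82, 489),
        Span("Response serialisation", 489, 512),
    ]),
    Span("HTTP response receive", 512, 587),
])

bottleneck = max(walk(trace), key=lambda s: s.self_ms)
print(bottleneck.name, bottleneck.self_ms)  # External API call third-party.com 407
```

The external API call accounts for 407 of 587 ms, about 69% of the request, while the two parent spans have zero self time. Ranking by raw duration instead would point at "HTTP request" or "Server processing", which only contain the problem.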

Propagation through the CDN edge

CDN edge runtimes (Cloudflare Workers, Fastly Compute, AWS CloudFront Functions) can take part in trace propagation, though how much each can record varies by platform. When a request arrives at the edge:

  1. Edge reads traceparent from the incoming request.
  2. Creates an edge span with its own span-id.
  3. Records time-to-first-byte from origin.
  4. Passes traceparent (with original trace-id, edge’s span-id) to the origin.
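
The header rewrite in steps 1, 2 and 4 can be sketched in a few lines. This is an illustration only: `forward_traceparent` is a hypothetical helper, it assumes the incoming header is already a valid version-00 value, and it leaves out span recording and export, which depend on the edge platform.

```python
import secrets


def forward_traceparent(incoming: str) -> str:
    """Build the traceparent the edge sends to the origin.

    The trace ID and flags are kept; only the span ID is replaced with the
    edge span's own ID, so the origin's span becomes a child of the edge span.
    """
    version, trace_id, _caller_span_id, flags = incoming.split("-")
    edge_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    while set(edge_span_id) == {"0"}:    # an all-zero span ID is invalid
        edge_span_id = secrets.token_hex(8)
    return f"{version}-{trace_id}-{edge_span_id}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = forward_traceparent(incoming)
print(outgoing)
```

The edge would also record `00f067aa0ba902b7` as the parent of its own span, which is what lets the backend stitch the waterfall together.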

Result: the trace shows the CDN edge as a distinct span. If the edge adds 5 ms of its own time and the origin span is 400 ms, you know to optimise the origin, not the CDN. Without trace propagation, you see one 405 ms total and guess at the breakdown.

| Component | Trace header action | What it records |
|---|---|---|
| Browser | Generates root trace-id + span-id | Navigation timing (LCP, DNS, TCP, TLS) |
| CDN edge worker | Reads traceparent, creates child span | Edge cache hit/miss, origin RTT |
| Load balancer | Passes traceparent, records routing | Backend selection, queue time |
| Application server | Reads, creates child span per handler | Auth, business logic, DB calls |
| Database driver | Records query + execution plan | Query text, rows examined, index hit |
| External API client | Passes traceparent outbound | Dependency latency, error rate |

USE and RED operational frameworks

Two disciplined instrumentation frameworks prevent “metric sprawl” — collecting 500 metrics and not knowing which matter.

USE method (Brendan Gregg) for resources (CPU, memory, disk, network interfaces, connection pools):

  • Utilization — what fraction of the resource is in use? (CPU 80%, connection pool 95%)
  • Saturation — is work queuing because the resource is full? (request queue depth, run queue length)
  • Errors — is the resource failing? (TCP errors, disk errors, OOM kills)
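
As a sketch of how the three USE questions map onto one concrete resource, here is a database connection pool. The `PoolSnapshot` fields are assumptions about what a pool might expose; real pool libraries name and report these differently.

```python
from dataclasses import dataclass


@dataclass
class PoolSnapshot:
    max_connections: int
    in_use: int
    waiting_requests: int  # callers blocked waiting for a free connection
    checkout_errors: int   # failed or timed-out checkouts in the interval


def use_report(pool: PoolSnapshot) -> dict:
    return {
        # Utilization: fraction of the resource currently in use.
        "utilization": pool.in_use / pool.max_connections,
        # Saturation: work queued because the resource is full.
        "saturation": pool.waiting_requests,
        # Errors: the resource itself failing.
        "errors": pool.checkout_errors,
    }


report = use_report(PoolSnapshot(max_connections=20, in_use=19,
                                 waiting_requests=7, checkout_errors=2))
print(report)
```

A pool at 95% utilization with an empty wait queue is busy but healthy; the same utilization with seven waiters means requests are already paying for the shortage.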

RED method (Tom Wilkie) for services (APIs, microservices):

  • Rate — how many requests per second?
  • Errors — what fraction return errors?
  • Duration — what is the latency distribution (p50, p95, p99)?
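
The three RED signals can be derived from nothing more than a list of finished requests. The sketch below makes two assumptions that the text does not fix: only 5xx responses count as errors, and percentiles use the nearest-rank method. The function names are illustrative.

```python
import math
from typing import Iterable, List, Tuple


def percentile(sorted_values: List[float], p: int) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def red_report(requests: Iterable[Tuple[int, float]], window_s: float) -> dict:
    """requests: (status_code, duration_ms) pairs observed during window_s seconds."""
    reqs = list(requests)
    if not reqs:
        raise ValueError("no requests in window")
    durations = sorted(duration for _, duration in reqs)
    errors = sum(1 for status, _ in reqs if status >= 500)  # assumption: 5xx only
    return {
        "rate_rps": len(reqs) / window_s,
        "error_ratio": errors / len(reqs),
        "p50_ms": percentile(durations, 50),
        "p95_ms": percentile(durations, 95),
        "p99_ms": percentile(durations, 99),
    }


# 100 requests over 10 s with durations 1..100 ms; the 5 slowest return 500.
sample = [(500 if ms > 95 else 200, float(ms)) for ms in range(1, 101)]
red = red_report(sample, window_s=10)
print(red)
```

In production these numbers usually come from histogram metrics rather than raw request lists, but the definitions are the same.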

Using both together. RED tells you what is broken (the service's error rate spiked). USE tells you why (CPU saturation caused by a CPU-bound handler, or connection pool saturation because the DB is slow). Together they can cut MTTR from hours to minutes.

Sampling strategies

Tracing every request at 1 M req/s produces terabytes of trace data per day — cost-prohibitive. Two strategies balance completeness against cost:

Head-based sampling. Decide at request entry whether to trace, typically a fixed percentage (e.g., 1% of requests). Cheap and deterministic: the decision is recorded in the sampled bit of the traceparent flags and propagated to all downstream components, which honour it. Downside: errors and slow requests are sampled at the same 1%, so about 99% of them are dropped, and a rare incident is likely to leave no trace at all.
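
A common way to make the head decision deterministic is to derive it from the trace ID itself, so that any component given the same ID and rate reaches the same answer. The sketch below follows that idea; it is written in the spirit of ratio-based samplers and is not claimed to match any SDK's exact algorithm. `head_sample` and `trace_flags` are hypothetical names.

```python
def head_sample(trace_id_hex: str, rate: float) -> bool:
    """Decide at request entry whether to trace, using only the trace ID."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    # Treat the low 64 bits of the (random) trace ID as a uniform number
    # and keep the trace if it falls below rate * 2**64.
    low64 = int(trace_id_hex[-16:], 16)
    return low64 < int(rate * (1 << 64))


def trace_flags(trace_id_hex: str, rate: float) -> str:
    """Flags byte for the outgoing traceparent: 01 if sampled, else 00."""
    return "01" if head_sample(trace_id_hex, rate) else "00"


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
print(trace_flags(trace_id, 0.01))
```

Downstream services do not need to repeat the calculation; they read the flags byte and follow it.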

Tail-based sampling. Buffer all spans in memory, decide after seeing the request outcome:

  • 100% of requests with error status (4xx, 5xx)
  • 100% of requests with duration > threshold (p99 cutoff)
  • 0.1% of fast, successful requests (baseline)

Implemented in the OpenTelemetry Collector by the tail_sampling processor. Downside: every span must be buffered for the decision window (typically 30–60 s), so memory grows with the number of in-flight traces. At 1 M req/s with a 30 s window, that is 30 M traces in memory, each holding several spans. It is manageable only with sharding that routes all spans of a trace to the same collector instance, for example by trace ID.
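
The keep-or-drop policy listed above, and the buffer it implies, can be sketched as plain functions. This illustrates the decision logic only and is not the Collector's implementation; `CompletedTrace`, `keep_trace` and `buffered_spans` are hypothetical names, and the 12 spans per trace figure is taken from the sample trace earlier in this lesson.

```python
from dataclasses import dataclass


@dataclass
class CompletedTrace:
    trace_id: str      # 32 hex characters
    status_code: int   # final HTTP status of the request
    duration_ms: float


def keep_trace(trace: CompletedTrace, slow_threshold_ms: float,
               baseline_rate: float = 0.001) -> bool:
    """Tail decision, made after the request outcome is known."""
    if trace.status_code >= 400:              # 100% of 4xx/5xx
        return True
    if trace.duration_ms > slow_threshold_ms:  # 100% of slow requests
        return True
    # Baseline: keep a small, deterministic fraction of fast successes.
    low64 = int(trace.trace_id[-16:], 16)
    return low64 < int(baseline_rate * (1 << 64))


def buffered_spans(req_per_s: int, window_s: int, spans_per_trace: int) -> int:
    """Spans held in memory while waiting out the decision window."""
    return req_per_s * window_s * spans_per_trace


print(buffered_spans(1_000_000, 30, 12))  # 360000000
```

At 12 spans per trace the 30 M in-flight traces become 360 M buffered spans, which is why the per-instance share after sharding is the number that matters for capacity planning.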

Adaptive sampling. Adjusts sample rate dynamically based on system load or time of day. During incidents, bumps to 100% for error traces; during quiet periods, reduces to 0.01%.
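
One simple adaptive policy, offered here as an assumption rather than the only approach, is to hold the number of sampled traces per second roughly constant and let the rate float with traffic. The function name and the default floor of 0.01% (the quiet-period figure mentioned above) are illustrative.

```python
def adaptive_rate(observed_rps: float, target_sampled_per_s: float,
                  floor: float = 0.0001, ceiling: float = 1.0) -> float:
    """Sampling rate that aims for a fixed budget of sampled traces per second."""
    if observed_rps <= 0:
        return ceiling  # no traffic: sampling everything costs nothing
    rate = target_sampled_per_s / observed_rps
    return max(floor, min(ceiling, rate))


print(adaptive_rate(1_000_000, 1_000))    # 0.001
print(adaptive_rate(500, 1_000))          # 1.0
print(adaptive_rate(100_000_000, 1_000))  # 0.0001
```

The incident behaviour described above, keeping 100% of error traces, would sit alongside this as a separate rule rather than being folded into the rate.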

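Right pattern: head-based sampling for steady-state baseline data (cheap); tail-based sampling for errors + slow requests (guarantees traces for what matters); adaptive for cost management under variable load. One caveat: a tail sampler can only keep traces whose spans reach it, so services whose errors must always be captured need to export all of their spans to the collector rather than dropping them at the head.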

Why this works

Pure head-based sampling misses critical incidents: a 1% sample is unlikely to include the one rare request that hit a 500 ms database query. Pure tail-based sampling has a prohibitive memory cost at high traffic unless it is properly sharded. The combination (head for the bulk, tail for errors) covers actionable events at acceptable cost.

Trace it

A senior SRE is paged: p95 latency tripled overnight. Trace the diagnosis using distributed tracing.

  1. Which dashboard do you open first?
  2. The trace shows 160 ms spent in the 'tls.handshake' span at the edge. Was the edge unhealthy?
  3. Confirm with ALPN + resumption metrics: tls_resumption_rate dropped from 80% to 5%.
  4. What is the immediate mitigation?
  5. What is the post-mortem fix?

Quiz

  1. Why does pure head-based sampling miss the incidents you most need traces for?
  2. The USE method applies to resources. Which of these correctly uses USE for a database connection pool?

Recall before you leave

  1. Explain why distributed tracing requires both head-based and tail-based sampling in production.
  2. What does W3C Trace Context define, and how does it propagate through a CDN edge?
  3. How does the USE method differ from RED, and when do you use each?

Recap

OpenTelemetry and the W3C Trace Context standard propagate a single trace ID through every hop in a request — browser, CDN edge, load balancer, application, database — surfacing a span waterfall that makes “where did 500 ms go” a 30-second question instead of a multi-hour log archaeology session. CDN edges (Cloudflare Workers, Fastly Compute) now participate in trace propagation, making edge latency measurable as a distinct span. The USE method (Utilization, Saturation, Errors) instruments resources; RED (Rate, Errors, Duration) instruments services — together they discipline you to collect metrics that drive actions. Sampling at 1 M req/s requires combining head-based sampling (low cost, steady-state baseline) with tail-based sampling (100% of errors + slow requests at the OTel Collector), because pure head-based misses the incidents you most need traces for.

Trademarks belong to their respective owners. Editorial reference only.