Networking & Protocols NET · 12 · 06

Observability: distributed traces, USE/RED, and sampling

OpenTelemetry W3C Trace Context propagation reveals which hop ate the 500 ms; USE and RED methods discipline instrumentation; head-based vs tail-based sampling controls the cost of capturing traces at 1M req/s.

NET Senior ◷ 15 min

p95 latency tripled overnight from 80 ms to 240 ms. You have logs from the load balancer, the application, and the database. Three logs, three clocks, three grep sessions. Without distributed traces, “where did the 160 ms go” takes hours. With OpenTelemetry, it takes 30 seconds — look at the span that grew, and you have a suspect.

OpenTelemetry and W3C Trace Context

Before distributed tracing existed, diagnosing a slow request meant correlating timestamps across separate log files — an error-prone process that scales badly when you have ten services. Modern observability instruments every network hop with a shared identifier. The W3C Trace Context standard defines two HTTP headers:

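traceparent: 00-<trace-id>-<span-id>-<flags>, where 00 is the version, trace-id is 16 bytes (32 hex characters), span-id is 8 bytes (16 hex characters), and flags is one byte whose lowest bit marks the trace as sampled. It carries the globally unique trace ID and the current span’s ID.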
tracestate: vendor-specific metadata (Datadog trace ID, Jaeger flags, etc.).
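The header layout above can be sketched as a small parser and child-header builder. This is a minimal illustration, not the OpenTelemetry SDK: the names TraceContext, parse_traceparent, and child_traceparent are hypothetical, only version 00 is handled, and only the sampled bit of the flags is propagated.

```python
import re
import secrets
from typing import NamedTuple, Optional

# Version "00", 32 hex chars of trace-id, 16 hex chars of span-id, 2 hex chars of flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):  # hypothetical helper type
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the parsed context, or None if the header is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero identifiers are invalid in W3C Trace Context.
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return TraceContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_traceparent(parent: TraceContext) -> str:
    """Build the outbound header: same trace-id, fresh span-id, same sampled bit."""
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex chars
    flags = "01" if parent.sampled else "00"
    return f"00-{parent.trace_id}-{new_span_id}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
ctx = parse_traceparent(incoming)
outgoing = child_traceparent(ctx)
```

The outbound header keeps the trace-id and replaces only the span-id, which is what lets a backend stitch the hops into one trace.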

Every component in the request path — browser, CDN edge (via Cloudflare Workers), load balancer, application server, database call, external API call — reads the incoming traceparent, creates a child span with a new span-id, does its work, emits timing data, and passes the same trace-id forward in its outbound request headers.

The result: an observability backend (Tempo, Jaeger, Honeycomb, Datadog APM) can display the entire request as a waterfall of spans covering each hop, dependency, database query, and external API call, with start times and durations. Finding “where did 500 ms go” reduces to finding the longest span that has no equally slow child beneath it, since a parent span usually includes the time of the children it wraps.

In the trace below, the external API span (407 ms) accounts for 69% of the 587 ms request. The longest leaf bar is the bottleneck, and the waterfall turns finding it into a 30-second glance instead of a log hunt.

Debug this

OpenTelemetry trace for a single slow request — find the bottleneck

Trace 7a8f3c... duration 587ms

span: HTTP request                              0–587ms (587ms)
span: DNS lookup                              0–4ms    (4ms)
span: TCP connect                             4–32ms   (28ms)
span: TLS handshake                           32–67ms  (35ms)
span: HTTP request send                       67–69ms  (2ms)
span: Server processing                       69–512ms (443ms)
  span: Auth middleware                       69–73ms  (4ms)
  span: Database query SELECT users           73–78ms  (5ms)
  span: Database query SELECT user_settings   78–82ms  (4ms)
  span: External API call third-party.com     82–489ms (407ms)
  span: Response serialisation                489–512ms (23ms)
span: HTTP response receive                   512–587ms (75ms)

Total request is 587 ms. Which span is the bottleneck and what is your action item?
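One way to answer this mechanically is to compute each span's self time (its duration minus the time spent in its direct children) and pick the largest. The sketch below encodes the trace above; the Span class and helper names are hypothetical, the parent links for the top-level spans are an assumption (the log only indents the server-side children), and the subtraction assumes children run sequentially without overlap.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:  # hypothetical structure for illustration
    name: str
    start_ms: int
    end_ms: int
    parent: Optional[str] = None  # assumed nesting

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


TRACE: List[Span] = [
    Span("HTTP request", 0, 587),
    Span("DNS lookup", 0, 4, "HTTP request"),
    Span("TCP connect", 4, 32, "HTTP request"),
    Span("TLS handshake", 32, 67, "HTTP request"),
    Span("HTTP request send", 67, 69, "HTTP request"),
    Span("Server processing", 69, 512, "HTTP request"),
    Span("Auth middleware", 69, 73, "Server processing"),
    Span("Database query SELECT users", 73, 78, "Server processing"),
    Span("Database query SELECT user_settings", 78, 82, "Server processing"),
    Span("External API call third-party.com", 82, 489, "Server processing"),
    Span("Response serialisation", 489, 512, "Server processing"),
    Span("HTTP response receive", 512, 587, "HTTP request"),
]


def self_time_ms(span: Span, spans: List[Span]) -> int:
    """Duration not explained by direct children (assumes no overlap)."""
    child_total = sum(s.duration_ms for s in spans if s.parent == span.name)
    return span.duration_ms - child_total


def find_bottleneck(spans: List[Span]) -> Span:
    return max(spans, key=lambda s: self_time_ms(s, spans))


worst = find_bottleneck(TRACE)
share = round(100 * worst.duration_ms / TRACE[0].duration_ms)
print(f"{worst.name}: {worst.duration_ms} ms ({share}% of the request)")
# prints "External API call third-party.com: 407 ms (69% of the request)"
```

Ranking by self time rather than raw duration keeps the root and "Server processing" spans from winning just because they wrap everything else.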

Propagation through the CDN edge

CDN edges (Cloudflare Workers, Fastly Compute, AWS CloudFront Functions) now support trace propagation. When a request arrives at the edge:

Edge reads traceparent from the incoming request.
Creates an edge span with its own span-id.
Records time-to-first-byte from origin.
Passes traceparent (with original trace-id, edge’s span-id) to the origin.

Result: the trace shows CDN edge latency as a distinct span. If the edge span is 5 ms and the origin span is 400 ms, you know to optimise origin, not CDN. Without trace propagation, you see one 405 ms total and guess at the breakdown.
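The four edge steps can be sketched in plain Python. This is not a Cloudflare or Fastly API: handle_at_edge and fetch_origin are hypothetical names, fetch_origin is assumed to return once the first byte from origin arrives, and starting a new sampled trace when no valid header is present is an assumption about edge policy.

```python
import secrets
import time
from typing import Callable, Dict, Tuple

Headers = Dict[str, str]


def handle_at_edge(
    request_headers: Headers,
    fetch_origin: Callable[[Headers], None],  # hypothetical origin call
) -> Tuple[Headers, dict]:
    # Step 1: read traceparent from the incoming request.
    parts = request_headers.get("traceparent", "").split("-")
    if len(parts) == 4 and len(parts[1]) == 32 and len(parts[2]) == 16:
        trace_id, parent_span_id, flags = parts[1], parts[2], parts[3]
    else:
        # Assumption: with no valid header, the edge starts a new sampled trace.
        trace_id, parent_span_id, flags = secrets.token_hex(16), None, "01"

    # Step 2: create the edge span with its own span-id.
    edge_span_id = secrets.token_hex(8)

    # Step 4: forward the original trace-id with the edge's span-id.
    outbound = dict(request_headers)
    outbound["traceparent"] = f"00-{trace_id}-{edge_span_id}-{flags}"

    # Step 3: time the origin fetch.
    started = time.monotonic()
    fetch_origin(outbound)
    origin_ttfb_ms = (time.monotonic() - started) * 1000.0

    edge_span = {
        "trace_id": trace_id,
        "span_id": edge_span_id,
        "parent_span_id": parent_span_id,
        "origin_ttfb_ms": origin_ttfb_ms,
    }
    return outbound, edge_span
```

Because the origin receives the edge's span-id as its parent, the origin span nests under the edge span in the waterfall, which is what separates the 5 ms from the 400 ms.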

Component	Trace header action	What it records
Browser	Generates root trace-id + span-id	Navigation timing (LCP, DNS, TCP, TLS)
CDN edge worker	Reads traceparent, creates child span	Edge cache hit/miss, origin RTT
Load balancer	Passes traceparent, records routing	Backend selection, queue time
Application server	Reads, creates child span per handler	Auth, business logic, DB calls
Database driver	Records query + execution plan	Query text, rows examined, index hit
External API client	Passes traceparent outbound	Dependency latency, error rate

USE and RED operational frameworks

When you instrument a new service, the temptation is to collect everything. Two disciplined instrumentation frameworks prevent “metric sprawl” — collecting 500 metrics and not knowing which matter.

USE method (Brendan Gregg) for resources (CPU, memory, disk, network interfaces, connection pools):

Utilization — what fraction of the resource is in use? (CPU 80%, connection pool 95%)
Saturation — is work queuing because the resource is full? (request queue depth, run queue length)
Errors — is the resource failing? (TCP errors, disk errors, OOM kills)
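Applied to a database connection pool, the three USE questions might map to numbers as in the sketch below. PoolSnapshot and its field names are hypothetical; a real pool library exposes its own counters under different names.

```python
from dataclasses import dataclass


@dataclass
class PoolSnapshot:  # hypothetical counters read from a connection pool
    max_connections: int
    in_use: int
    waiting_requests: int  # callers queued for a free connection
    failed_acquires: int   # timeouts or connection errors
    total_acquires: int


def use_report(pool: PoolSnapshot) -> dict:
    """Utilization, Saturation, Errors for one pool."""
    error_rate = (
        pool.failed_acquires / pool.total_acquires if pool.total_acquires else 0.0
    )
    return {
        "utilization": pool.in_use / pool.max_connections,
        "saturation": pool.waiting_requests,
        "error_rate": error_rate,
    }


report = use_report(
    PoolSnapshot(
        max_connections=20,
        in_use=19,
        waiting_requests=7,
        failed_acquires=3,
        total_acquires=1000,
    )
)
```

Here utilization is 95% and seven callers are queuing, so the pool is saturated even though few acquisitions fail outright.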

RED method (Tom Wilkie) for services (APIs, microservices):

Rate — how many requests per second?
Errors — what fraction return errors?
Duration — what is the latency distribution (p50, p95, p99)?
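The three RED numbers can be derived from a window of request records, as in this sketch. The function names are hypothetical, percentiles use the nearest-rank method, and counting only 5xx responses as errors is an assumption (some teams also count 4xx).

```python
import math
from typing import Dict, List, Tuple


def percentile(sorted_values: List[float], p: int) -> float:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def red_summary(
    requests: List[Tuple[float, int]],  # (duration_ms, http_status)
    window_s: float,
) -> Dict[str, float]:
    if not requests:
        raise ValueError("need at least one request in the window")
    durations = sorted(duration for duration, _ in requests)
    # Assumption: only 5xx responses count as service errors.
    errors = sum(1 for _, status in requests if status >= 500)
    return {
        "rate_per_s": len(requests) / window_s,
        "error_fraction": errors / len(requests),
        "p50_ms": percentile(durations, 50),
        "p95_ms": percentile(durations, 95),
        "p99_ms": percentile(durations, 99),
    }


# 100 requests over 10 s with durations 1..100 ms; two of them fail with 500.
sample = [(float(ms), 500 if ms in (40, 90) else 200) for ms in range(1, 101)]
summary = red_summary(sample, window_s=10.0)
```

In production these values usually come from histograms in a metrics system rather than raw records; the sketch only shows what each letter measures.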

Using both together. RED tells you what is broken (the service error rate spiked). USE tells you why (CPU saturation from a CPU-bound handler, or connection pool saturation because the DB is slow). Together they can cut MTTR substantially, because the symptom and its likely cause are visible side by side.

Sampling strategies

Tracing every request at 1 M req/s produces terabytes of trace data per day — cost-prohibitive. Two strategies balance completeness against cost:

Head-based sampling. Decide at request entry whether to trace, typically with a fixed percentage (e.g., 1% of requests). Cheap and deterministic: the decision is recorded in the sampled flag of traceparent and propagated to all downstream components, so a trace is captured either at every hop or at none. Downside: the decision is made before the outcome is known, so at 1% roughly 99% of errors and slow requests go untraced, and a rare failure may leave no trace at all.
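A head-based decision can be made deterministic by deriving it from the trace-id itself, so every service that applies the same ratio reaches the same answer. The sketch below compares the low 64 bits of the trace-id against a threshold; head_sample is a hypothetical name and this is an illustration of the idea, not the exact algorithm of any particular SDK.

```python
def head_sample(trace_id_hex: str, ratio: float) -> bool:
    """Decide at request entry, using only the trace-id and a fixed ratio."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    low64 = int(trace_id_hex[-16:], 16)   # last 8 bytes of the trace-id
    threshold = int(ratio * (1 << 64))    # share of the 64-bit space to keep
    return low64 < threshold


def sampled_flags(trace_id_hex: str, ratio: float) -> str:
    """Flags field to place in the outbound traceparent header."""
    return "01" if head_sample(trace_id_hex, ratio) else "00"


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
keep_at_one_percent = head_sample(trace_id, 0.01)
```

Nothing about the request outcome enters the decision, which is why it is cheap and also why it cannot favour errors or slow requests.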

Tail-based sampling. Buffer all spans in memory, decide after seeing the request outcome:

100% of requests with error status (4xx, 5xx)
100% of requests with duration > threshold (p99 cutoff)
0.1% of fast, successful requests (baseline)

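Implemented in the OpenTelemetry Collector by the tail_sampling processor. Downside: every span must be buffered for the decision window (typically 30–60 s), so memory grows with the number of in-flight traces, and all spans of a trace must reach the same Collector instance. At 1 M req/s with a 30 s window, that is 30 M traces in memory (and several times that many spans, since each trace holds multiple spans), which is manageable only with sharding by trace-id.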
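The three tail-sampling policies and the buffer arithmetic can be sketched as follows. This is plain Python, not the Collector's tail_sampling configuration; FinishedTrace, keep_trace, and buffered_traces are hypothetical names, and the figure of 12 spans per trace is borrowed from the example trace earlier in this lesson as an assumption.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class FinishedTrace:  # hypothetical summary built once all spans have arrived
    trace_id: str
    status_code: int
    duration_ms: float


def keep_trace(
    trace: FinishedTrace,
    slow_threshold_ms: float,
    baseline_ratio: float = 0.001,  # 0.1% of fast, successful requests
    rng: Callable[[], float] = random.random,
) -> bool:
    if trace.status_code >= 400:              # keep every 4xx and 5xx
        return True
    if trace.duration_ms > slow_threshold_ms:  # keep every slow request
        return True
    return rng() < baseline_ratio              # small random baseline


def buffered_traces(requests_per_s: int, window_s: int) -> int:
    """Traces held in memory while waiting for the decision window to close."""
    return requests_per_s * window_s


traces_in_memory = buffered_traces(1_000_000, 30)
spans_in_memory = traces_in_memory * 12  # assumption: 12 spans per trace
```

With these assumptions the buffer holds 30 million traces, or 360 million spans, which is why the decision tier has to be sharded across Collector instances.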

Adaptive sampling. Adjusts sample rate dynamically based on system load or time of day. During incidents, bumps to 100% for error traces; during quiet periods, reduces to 0.01%.
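One possible adaptive policy is to hold the number of stored traces per second roughly constant and let the ratio follow traffic. The sketch below is one such rule under that assumption; adaptive_ratio and the target of 1,000 traces per second are hypothetical, and the 0.01% floor mirrors the quiet-period figure above.

```python
def adaptive_ratio(
    observed_rps: float,
    target_traces_per_s: float,
    floor: float = 0.0001,   # 0.01%, never sample less than this
    ceiling: float = 1.0,    # 100%, keep everything when traffic is low
) -> float:
    """Sampling ratio that aims for a fixed volume of stored traces."""
    if observed_rps <= 0:
        return ceiling
    return max(floor, min(ceiling, target_traces_per_s / observed_rps))


peak = adaptive_ratio(observed_rps=1_000_000, target_traces_per_s=1_000)
quiet = adaptive_ratio(observed_rps=100, target_traces_per_s=1_000)
```

At 1 M req/s the rule settles on 0.1%, and at 100 req/s it keeps every trace. An incident override (forcing 100% for error traces) would sit on top of this as a separate rule.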

Right pattern: head-based sampling for steady-state baseline data (cheap); tail-based sampling for errors + slow requests (guarantees traces for what matters); adaptive for cost management under variable load.

Why this works

Pure head-based sampling misses critical incidents: a 1% sample has only a 1% chance of capturing any single rare event, such as one 500 ms database query. Pure tail-based sampling has prohibitive memory cost at high traffic unless properly sharded. The combination (head for the bulk, tail for errors and slow requests) covers the actionable events at acceptable cost.

Trace it

A senior SRE is paged: p95 latency tripled overnight. Trace the diagnosis using distributed tracing.

Step 1: which dashboard do you open first?
Step 2: trace shows 160 ms spent in 'tls.handshake' span at the edge. Was the edge unhealthy?
Step 3: confirm with ALPN + resumption metrics. tls_resumption_rate dropped from 80% to 5%.
Step 4: immediate mitigation?
Step 5: post-mortem fix?

Quiz

Why does pure head-based sampling miss the incidents you most need traces for?

The USE method applies to resources. Which of these correctly uses USE for a database connection pool?

One trace-id stitches every hop into a single span waterfall, so finding where 500 ms went reduces to spotting the longest span — no grep across three logs. Here the External API child span is the bottleneck.

Recall before you leave

01 Explain why distributed tracing requires both head-based and tail-based sampling in production.
02 What does W3C Trace Context define, and how does it propagate through a CDN edge?
03 How does the USE method differ from RED, and when do you use each?

Recap

OpenTelemetry and the W3C Trace Context standard propagate a single trace ID through every hop in a request — browser, CDN edge, load balancer, application, database — surfacing a span waterfall that makes “where did 500 ms go” a 30-second question instead of a multi-hour log archaeology session. CDN edges (Cloudflare Workers, Fastly Compute) now participate in trace propagation, making edge latency measurable as a distinct span. The USE method (Utilization, Saturation, Errors) instruments resources; RED (Rate, Errors, Duration) instruments services — together they discipline you to collect metrics that drive actions. Sampling at 1 M req/s requires combining head-based sampling (low cost, steady-state baseline) with tail-based sampling (100% of errors + slow requests at the OTel Collector), because pure head-based misses the incidents you most need traces for. Now when you see a p95 spike in production, your first move is opening the trace waterfall — not grepping logs — and the longest span tells you exactly where to focus.
