
Observability

Golden signals, dashboard layout, and service mesh auto-RED

Crux: How Google’s four golden signals extend RED, what a senior RED+USE dashboard looks like, how service meshes emit RED without application code, and how RED transfers to non-HTTP protocols.

A team has five services. Each RED dashboard looks different — different metric names, different panel layouts, different label structures. When an incident crosses services, the on-call spends 30 minutes reading documentation before starting triage. A single consistent dashboard pattern across all services would reduce that to seconds.

The four golden signals

Google’s SRE book (published in 2016, though the practice it documents predates RED and USE in spirit) names four signals: Latency, Traffic, Errors, Saturation. The overlap with RED is direct: Latency = Duration, Traffic = Rate, Errors = Errors. The novel piece is Saturation: “how full” the service is relative to a known capacity limit.

Unlike USE’s Saturation (which is about physical resources), the golden-signal Saturation is at the service level: in-flight request count, queue length, active session count, connection pool occupancy. A service can be at 50% CPU but have its connection pool at 100% — the service is full, users are queueing, and USE on the host shows nothing.
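A minimal sketch of the contrast above, assuming a hypothetical connection pool that exposes its in-use count, its configured limit, and the number of callers waiting. The `PoolSnapshot` type and its field names are illustrative and do not come from any particular client library.

```python
from dataclasses import dataclass


@dataclass
class PoolSnapshot:
    """Point-in-time view of a hypothetical connection pool."""
    in_use: int    # connections currently checked out
    max_size: int  # configured pool limit
    waiting: int   # callers queued for a free connection


def pool_saturation(snap: PoolSnapshot) -> float:
    """Demand relative to capacity; values above 1.0 mean callers are queueing."""
    if snap.max_size <= 0:
        raise ValueError("max_size must be positive")
    return (snap.in_use + snap.waiting) / snap.max_size


# Host-level USE looks healthy while the service-level signal does not.
host_cpu_utilisation = 0.50
snap = PoolSnapshot(in_use=20, max_size=20, waiting=7)
print(f"cpu={host_cpu_utilisation:.0%} pool_saturation={pool_saturation(snap):.2f}")
# prints: cpu=50% pool_saturation=1.35
```

Counting waiters in the numerator is a design choice: it lets the value exceed 1.0, so a dashboard can distinguish "exactly full" from "full and queueing".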

Many production teams treat RED + Saturation as the practical evolution of the SRE-book signals:

Signal | RED | Golden signals
Request rate | Rate | Traffic
Failed requests | Errors | Errors
Response latency | Duration | Latency
Capacity headroom | (implicit in Rate plateau) | Saturation

The senior dashboard layout

A well-structured service dashboard reads top-to-bottom during an incident:

Row 1: RED  — Rate panel | Errors panel | Duration p50/p95/p99 panel
Row 2: USE  — CPU util+PSI | Memory util+PSI | Disk %util+queue
Row 3: Downstreams — DB response time + errors | Cache hit rate | Queue depth

Reading rhythm:

  1. Which RED metric moved?
  2. Which USE row matches the timeline?
  3. Which downstream dependency followed the same timeline?
  4. Click the exemplar to confirm in a trace.

The key discipline: RED and USE share the same time axis in the same dashboard. Cross-service correlation requires consistent label keys (service, route, region) so panels can be filtered by the same label set.
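One way to enforce a shared label set is to generate every panel query from the same label dictionary. The sketch below builds PromQL strings in Python. The metric names `http_requests_total` and `http_request_duration_seconds_bucket` and the `code` label are assumptions about how the services are instrumented, not something the lesson prescribes.

```python
def selector(labels: dict[str, str], extra: str = "") -> str:
    """Render a PromQL label selector with keys in a stable order."""
    parts = [f'{key}="{value}"' for key, value in sorted(labels.items())]
    if extra:
        parts.append(extra)
    return "{" + ",".join(parts) + "}"


def red_queries(labels: dict[str, str], window: str = "5m") -> dict[str, str]:
    """Return Rate, Errors, and Duration queries filtered by one label set."""
    sel = selector(labels)
    err_sel = selector(labels, 'code=~"5.."')  # assumed status-code label
    return {
        "rate": f"sum(rate(http_requests_total{sel}[{window}]))",
        "errors": (
            f"sum(rate(http_requests_total{err_sel}[{window}]))"
            f" / sum(rate(http_requests_total{sel}[{window}]))"
        ),
        "duration_p99": (
            "histogram_quantile(0.99, sum by (le) "
            f"(rate(http_request_duration_seconds_bucket{sel}[{window}])))"
        ),
    }


for name, query in red_queries({"service": "checkout", "region": "eu"}).items():
    print(f"{name}: {query}")
```

Because every panel is derived from the same dictionary, changing the `service` or `region` value refilters the whole row at once.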

Alerting severity split

The mature pattern is to alert on user-facing symptoms (RED) at page-grade and on internal cause signals (USE) at warning-grade:

Alert | Metric | Severity
Latency SLO breach | p99 Duration > 200 ms for 5 min | Page
Error rate spike | Errors > 1% for 2 min | Page
CPU saturation | PSI cpu some > 20% for 10 min | Warning / Slack
Memory pressure | PSI memory full > 0 for 5 min | Warning / Slack
Disk headroom | Disk > 90% for 5 min | Warning / Slack

The split is deliberate. RED alerts wake humans because users are affected. USE alerts notify a slower channel so capacity teams plan ahead — not so on-call wakes up every time CPU touches 70%. This one architectural choice is among the strongest levers against alert fatigue.

Layer | Channel | Reason
RED alert (user impact) | PagerDuty / on-call page | User is already unhappy
USE alert (resource headroom) | Slack / ticket | Plan capacity before it becomes a RED incident
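The severity split can be sketched as a small rule evaluator. This is an illustration only: it assumes one sample per minute, uses hypothetical metric keys (`p99_ms`, `error_ratio`, `psi_cpu_some_pct`), and takes three thresholds from the alert table above. A real setup would normally express these as alerting rules in the monitoring system rather than in application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    metric: str       # key into the sample dictionary (hypothetical names)
    threshold: float
    for_samples: int  # consecutive breaching samples, one sample per minute
    channel: str      # "page" wakes on-call, "slack" does not


RULES = [
    Rule("Latency SLO breach", "p99_ms", 200.0, 5, "page"),
    Rule("Error rate spike", "error_ratio", 0.01, 2, "page"),
    Rule("CPU saturation", "psi_cpu_some_pct", 20.0, 10, "slack"),
]


def fires(rule: Rule, samples: list[float]) -> bool:
    """True when the latest `for_samples` samples all exceed the threshold."""
    window = samples[-rule.for_samples:]
    return len(window) == rule.for_samples and all(v > rule.threshold for v in window)


def route(series: dict[str, list[float]]) -> list[tuple[str, str]]:
    """Return (channel, alert name) for every rule that currently fires."""
    return [(r.channel, r.name) for r in RULES if fires(r, series.get(r.metric, []))]


series = {
    "p99_ms": [180, 210, 220, 230, 240, 250],
    "error_ratio": [0.002, 0.003, 0.002, 0.02, 0.004, 0.003],
    "psi_cpu_some_pct": [25, 26, 27, 28, 29, 30],
}
print(route(series))  # prints: [('page', 'Latency SLO breach')]
```

In this example the single error spike does not page because it was not sustained for two samples, and the CPU pressure does not notify yet because only six of the required ten minutes have elapsed.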

Service mesh auto-RED

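Service meshes (Istio, Linkerd) emit RED metrics from their sidecar proxies without application code changes. Envoy, the proxy Istio uses, exposes downstream_rq_total, downstream_rq_xx (downstream_rq_2xx, downstream_rq_5xx, and so on), and downstream_rq_time from its HTTP connection manager, which give Rate, Errors-by-status, and a Duration histogram for inbound traffic; the cluster-scoped upstream_rq_* counterparts cover outbound calls. Both sets are labelled consistently across the fleet.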

Advantage: a polyglot estate (Node, Go, Python, JVM, Rust) gets one RED dashboard pattern even though languages disagree on Prometheus client conventions.

Catch: the sidecar sees only what the network sees. A request that the application answers with a 2xx status but serves incorrectly (for example, a 200 carrying an error body) still counts as a success in mesh metrics.

Production pattern: run both layers — sidecar RED for breadth and consistency, application-emitted RED for business-logic fidelity (e.g. a payment-success counter labelled by gateway that the sidecar cannot see). Keep label keys aligned across both layers so dashboards join them.
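To make the gap between the two layers concrete, the sketch below compares a mesh-level success ratio with a business-level one over the same window. All counter names and values are hypothetical, and the example assumes declined payments are returned with an HTTP 200 status.

```python
def ratio(part: int, total: int) -> float:
    """Safe division for counter deltas."""
    return part / total if total else 0.0


# Hypothetical counter deltas for one service over the same five-minute window.
mesh = {"2xx": 990, "5xx": 10}          # what the sidecar observes
app = {                                  # what the application emits
    "payment_success": 900,
    "payment_declined_by_gateway": 90,   # answered with HTTP 200
    "payment_error": 10,
}

mesh_success = ratio(mesh["2xx"], sum(mesh.values()))
business_success = ratio(app["payment_success"], sum(app.values()))

print(f"mesh success:     {mesh_success:.1%}")      # prints: mesh success:     99.0%
print(f"business success: {business_success:.1%}")  # prints: business success: 90.0%
```

The nine-point difference is invisible to the sidecar, which is why the application-emitted counter is worth keeping alongside mesh RED.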

RED transfers across protocols

RED’s request-centric shape transfers to other protocols; what changes is the definition of “request” and “error”, and where Duration is measured:

Protocol | Rate | Errors | Duration
HTTP | req/s | HTTP 5xx | p99 response time
gRPC | RPCs/s | status != OK | end-to-end for unary, first-byte for streaming
Queues (Kafka, SQS) | messages/s consumed | dead-letter rate | publish-to-consume wait time
Batch jobs | jobs/min | failed jobs | wall-clock per job
Serverless | invocations/s | error rate + throttle rate | includes cold-start tail

For async and queue-based services, queue depth (backlog) is the Saturation signal that pure RED misses when Duration is measured only around job processing. A worker with a Duration p99 of 10 s per job may sit behind a 10-minute backlog, so the user waits roughly 10 min + 10 s. RED captures the job-processing time; Saturation captures the queue wait before processing begins.
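The arithmetic behind that example can be sketched as follows. The estimate divides backlog size by drain rate, which is an assumption that holds only when the consumption rate is roughly steady. The figures (6000 queued jobs, 10 jobs/s) are chosen to reproduce the 10-minute backlog and are not from a real system.

```python
def estimated_wait_seconds(backlog_items: int, drain_rate_per_s: float) -> float:
    """Approximate time a newly enqueued item waits before processing starts."""
    if drain_rate_per_s <= 0:
        return float("inf")  # nothing is draining, so the wait is unbounded
    return backlog_items / drain_rate_per_s


processing_p99_s = 10.0  # what RED Duration reports for the worker
wait_s = estimated_wait_seconds(backlog_items=6000, drain_rate_per_s=10.0)
user_perceived_s = wait_s + processing_p99_s

print(f"queue wait: {wait_s:.0f} s, user-perceived: {user_perceived_s:.0f} s")
# prints: queue wait: 600 s, user-perceived: 610 s
```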

Why this works

Batch and async systems need a separate queue-depth metric that RED alone cannot provide. The pattern is to emit queue_depth_seconds — the age of the oldest unprocessed item in wall-clock seconds. If this grows, users are waiting longer than the job Duration suggests. This is the Saturation signal at the service level, complementing USE’s resource-level Saturation.
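A minimal in-process sketch of such a gauge, assuming a FIFO queue where the oldest item is always at the head. The `AgedQueue` class is hypothetical. For a broker such as Kafka or SQS the enqueue timestamp would instead have to travel with the message, which this sketch does not cover.

```python
import time
from collections import deque
from typing import Any, Callable


class AgedQueue:
    """FIFO queue that can report the age of its oldest unprocessed item."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._items: deque[tuple[float, Any]] = deque()
        self._clock = clock  # injectable so the behaviour can be tested

    def put(self, item: Any) -> None:
        self._items.append((self._clock(), item))

    def get(self) -> Any:
        _, item = self._items.popleft()
        return item

    def depth_seconds(self) -> float:
        """Value to export as the queue_depth_seconds gauge; 0.0 when empty."""
        if not self._items:
            return 0.0
        return self._clock() - self._items[0][0]


# Demonstration with a fake clock so the numbers are deterministic.
fake_now = [100.0]
queue = AgedQueue(clock=lambda: fake_now[0])
queue.put("job-a")
fake_now[0] = 130.0
queue.put("job-b")
fake_now[0] = 160.0
print(queue.depth_seconds())  # prints: 60.0
queue.get()
print(queue.depth_seconds())  # prints: 30.0
```

Reporting age rather than item count keeps the gauge meaningful when job sizes vary: ten slow jobs and ten thousand fast ones can represent the same wait.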

Quiz

What does 'Saturation' mean in the four golden signals, and how does it differ from USE's Saturation?

Quiz

An on-call engineer sees RED Duration p99 climb but Rate and Errors are flat. Which USE row should they check first?

Order the steps

Order the RED + USE incident response steps in production:

  1. Page fires: symptom description includes service name and severity
  2. Open the service's RED dashboard, identify which of R / E / D moved
  3. Read the time series: single spike or sustained drift?
  4. Cross-reference USE on the same hosts and on direct downstream dependencies
  5. If RED-D moved with USE-CPU saturation: capacity issue → scale
  6. If RED-E moved with USE-Errors on a dependency: dependency failure → failover
  7. If both RED and USE look fine but pager fired: revisit the alert source, suspect false positive

Recall before you leave

  1. What does Google's SRE book add to RED that RED itself does not cover?
  2. What is the practical advantage and limitation of service-mesh auto-RED?
  3. For a Kafka consumer, what are Rate, Errors, and Duration, and what extra signal is needed?

Recap

Google’s four golden signals — Latency, Traffic, Errors, Saturation — extend RED by adding a service-level Saturation dimension: how full the service is relative to its logical capacity (connection pool, in-flight count, queue depth), not just physical resources. A mature RED+USE dashboard puts RED panels in the top row, USE panels in the middle, and downstream dependency RED+USE in the bottom row, all on the same time axis with consistent label keys. Alerting discipline is the other half: RED alerts page on-call (user is affected), USE alerts go to Slack or a ticket (capacity team plans ahead). Service meshes emit RED for free at the sidecar level — consistent across languages — but only see what the network sees; application-emitted RED is needed for business-logic fidelity. RED’s shape transfers to gRPC, queues, batch jobs, and serverless with minimal adjustment — only the definition of “request” and “error” changes.

Trademarks belong to their respective owners. Editorial reference only.