
Golden signals, dashboard layout, and service mesh auto-RED

How Google’s four golden signals extend RED, what a senior RED+USE dashboard looks like, how service meshes emit RED without application code, and how RED transfers to non-HTTP protocols.

A team has five services. Each RED dashboard looks different — different metric names, different panel layouts, different label structures. When an incident crosses services, the on-call spends 30 minutes reading documentation before starting triage. A single consistent dashboard pattern across all services would reduce that to seconds.

The four golden signals

Google’s SRE book (published after RED and USE were named, though the practice it documents is older) names four signals: Latency, Traffic, Errors, Saturation. The overlap with RED is direct: Latency = Duration, Traffic = Rate, Errors = Errors. The new piece is Saturation: “how full” the service is relative to a known capacity limit.

Unlike USE’s Saturation (which is about physical resources), the golden-signal Saturation is at the service level: in-flight request count, queue length, active session count, connection pool occupancy. A service can be at 50% CPU but have its connection pool at 100% — the service is full, users are queueing, and USE on the host shows nothing.
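A minimal sketch of service-level Saturation as described above: each logical limit reports occupancy over capacity, and the service is treated as being as full as its fullest limit. The gauge names and numbers are hypothetical, chosen to mirror the connection-pool example.

```python
from dataclasses import dataclass


@dataclass
class CapacityGauge:
    """One service-level capacity limit and its current usage."""
    name: str    # hypothetical gauge name, e.g. "db_pool"
    in_use: int  # current occupancy: connections, in-flight requests, queued items
    limit: int   # configured maximum for that resource

    def saturation(self) -> float:
        """Fraction of capacity in use, from 0.0 (idle) to 1.0 (full)."""
        if self.limit <= 0:
            raise ValueError(f"{self.name}: limit must be positive")
        return min(self.in_use / self.limit, 1.0)


def service_saturation(gauges: list[CapacityGauge]) -> tuple[str, float]:
    """Return the most saturated limit and its saturation value."""
    worst = max(gauges, key=lambda g: g.saturation())
    return worst.name, worst.saturation()


gauges = [
    CapacityGauge("in_flight_requests", in_use=40, limit=200),
    CapacityGauge("db_pool", in_use=20, limit=20),
    CapacityGauge("worker_queue", in_use=5, limit=100),
]
print(service_saturation(gauges))  # ('db_pool', 1.0)
```

Host CPU never appears in this calculation, which is the point: the pool is full even when USE on the host looks healthy.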

Many production teams treat RED + Saturation as the practical evolution of the SRE-book signals:

Signal	RED	Golden signals
Request rate	Rate	Traffic
Failed requests	Errors	Errors
Response latency	Duration	Latency
Capacity headroom	(implicit in Rate plateau)	Saturation

The senior dashboard layout

If your dashboards look different across services, you are paying a hidden tax: every new incident requires re-learning the layout instead of reading the signal. The layout below is the same for every service, every environment — the only thing that changes is the data.

A well-structured service dashboard reads top-to-bottom during an incident:

Row 1: RED  — Rate panel | Errors panel | Duration p50/p95/p99 panel
Row 2: USE  — CPU util+PSI | Memory util+PSI | Disk %util+queue
Row 3: Downstreams — DB response time + errors | Cache hit rate | Queue depth

Reading rhythm:

1 Which RED metric moved?
2 Which USE row matches the timeline?
3 Which downstream dependency followed the same timeline?
4 Click the exemplar to confirm in a trace.

The key discipline: RED and USE share the same time axis in the same dashboard. Cross-service correlation requires consistent label keys (service, route, region) so panels can be filtered by the same label set.
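One way to enforce the shared layout is to generate every dashboard from a single template, so services cannot drift apart. The sketch below is illustrative only: the dictionary shape, the `build_dashboard` helper, and the service and region names are assumptions, not any dashboard tool's real schema.

```python
# Shared filter keys named in the text.
LABEL_KEYS = ("service", "route", "region")

# The three-row layout, defined once for the whole fleet.
LAYOUT = [
    ("RED", ["Rate", "Errors", "Duration p50/p95/p99"]),
    ("USE", ["CPU util+PSI", "Memory util+PSI", "Disk %util+queue"]),
    ("Downstreams", ["DB response time + errors", "Cache hit rate", "Queue depth"]),
]


def build_dashboard(service: str, region: str) -> dict:
    """Return the same layout for any service; only the filter values differ."""
    return {
        "title": f"{service} ({region})",
        "filters": {"service": service, "route": "*", "region": region},
        "rows": [{"name": row, "panels": list(panels)} for row, panels in LAYOUT],
    }


checkout = build_dashboard("checkout", "eu-west-1")
search = build_dashboard("search", "us-east-1")

# Layout is identical across services; only the data selection changes.
print(checkout["rows"] == search["rows"])        # True
print(tuple(checkout["filters"]) == LABEL_KEYS)  # True
```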

Figure: golden-signals dashboard, one row per signal.
Latency: top-line SLO panel (Duration p50 / p95 / p99)
Traffic: how much demand arrives (Rate, req/s)
Errors: fraction of requests failing (Errors, 5xx %)
Saturation: how full vs capacity (in-flight, queue, pool %)

A golden-signals dashboard reads top-to-bottom: latency/SLO first (what the user feels), then traffic, errors, and saturation, with one row per signal.

Alerting severity split

The mature pattern is to alert on user-facing symptoms (RED) at page-grade and on internal cause signals (USE) at warning-grade:

Alert	Metric	Severity
Latency SLO breach	p99 Duration > 200 ms for 5 min	Page
Error rate spike	Errors > 1% for 2 min	Page
CPU saturation	PSI cpu some > 20% for 10 min	Warning / Slack
Memory pressure	PSI memory full > 0 for 5 min	Warning / Slack
Disk headroom	Disk space used > 90% for 5 min	Warning / Slack
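Every rule in the table has the form "value above threshold for N minutes". A minimal sketch of that sustained-breach check follows, assuming samples arrive as (timestamp, value) pairs, oldest first; the `breached_for` helper is hypothetical and not any alerting system's API.

```python
def breached_for(samples, threshold, min_duration_s):
    """True if the most recent samples exceed threshold continuously
    for at least min_duration_s seconds.

    samples: list of (unix_timestamp_s, value) tuples, oldest first.
    """
    if not samples:
        return False
    latest_ts = samples[-1][0]
    breach_start = None
    # Walk backwards from the newest sample until the breach is interrupted.
    for ts, value in reversed(samples):
        if value <= threshold:
            break
        breach_start = ts
    if breach_start is None:
        return False
    return latest_ts - breach_start >= min_duration_s


# p99 latency in ms, sampled every 60 s.
sustained = [(0, 150), (60, 210), (120, 220), (180, 230),
             (240, 250), (300, 260), (360, 240)]
interrupted = [(0, 210), (60, 190), (120, 220), (180, 230),
               (240, 250), (300, 260), (360, 240)]

print(breached_for(sustained, threshold=200, min_duration_s=300))    # True
print(breached_for(interrupted, threshold=200, min_duration_s=300))  # False
```

The duration requirement is what keeps a single slow scrape from paging anyone.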

The split is deliberate. RED alerts wake humans because users are affected. USE alerts notify a slower channel so capacity teams plan ahead — not so on-call wakes up every time CPU touches 70%. This one architectural choice is among the strongest levers against alert fatigue.

RED pages because the user is already unhappy; USE warns so the capacity team can plan before it becomes an incident. Routing symptoms and causes to different severities is one of the strongest levers against alert fatigue.

Layer	Channel	Reason
RED alert (user impact)	PagerDuty / on-call page	User is already unhappy
USE alert (resource headroom)	Slack / ticket	Plan capacity before it becomes a RED incident
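The routing rule is small enough to sketch directly. The alert records and channel names below are illustrative assumptions; real routing would live in whatever alert manager the team runs.

```python
# Symptom alerts (RED) page a human; cause alerts (USE) go to a slower channel.
ROUTES = {"RED": "page", "USE": "slack"}

# Hypothetical alert records mirroring the severity table.
ALERTS = [
    {"name": "Latency SLO breach", "layer": "RED"},
    {"name": "Error rate spike", "layer": "RED"},
    {"name": "CPU saturation", "layer": "USE"},
    {"name": "Memory pressure", "layer": "USE"},
    {"name": "Disk headroom", "layer": "USE"},
]


def channel_for(alert: dict) -> str:
    """Pick the notification channel from the alert's layer, never from its name."""
    return ROUTES[alert["layer"]]


paged = [a["name"] for a in ALERTS if channel_for(a) == "page"]
print(paged)  # ['Latency SLO breach', 'Error rate spike']
```

Keying the route on the layer rather than on individual alert names means a new USE alert cannot page by accident.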

Service mesh auto-RED

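Service meshes (Istio, Linkerd) emit RED metrics at the sidecar proxy without application code changes. With Envoy, the proxy Istio deploys, downstream_rq_total, the status-class counters downstream_rq_2xx through downstream_rq_5xx, and the downstream_rq_time histogram give Rate, Errors-by-status-class, and Duration per HTTP connection manager (the listener side); the matching upstream_rq_* stats give the same view per upstream cluster. Either way, the metrics are labelled consistently across the fleet.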

Advantage: a polyglot estate (Node, Go, Python, JVM, Rust) gets one RED dashboard pattern even though languages disagree on Prometheus client conventions.

Catch: the sidecar sees only what the network sees. A request that succeeds at the sidecar but is mis-served by buggy application code shows as 2xx in mesh metrics.

Production pattern: run both layers — sidecar RED for breadth and consistency, application-emitted RED for business-logic fidelity (e.g. a payment-success counter labelled by gateway that the sidecar cannot see). Keep label keys aligned across both layers so dashboards join them.
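To make the sidecar half concrete: the proxy exposes cumulative counters, and Rate and Errors fall out of the difference between two scrapes. The sketch below assumes snapshots are plain dictionaries keyed by Envoy-style counter names; the `red_from_counters` helper and the sample numbers are hypothetical.

```python
def red_from_counters(prev: dict, curr: dict, interval_s: float):
    """Derive request rate (req/s) and 5xx error ratio from two counter snapshots."""

    def delta(key: str) -> int:
        d = curr[key] - prev[key]
        # A negative delta means the counter reset (e.g. proxy restart);
        # in that case count from zero instead of reporting a negative rate.
        return d if d >= 0 else curr[key]

    total = delta("downstream_rq_total")
    errors = delta("downstream_rq_5xx")
    rate = total / interval_s
    error_ratio = errors / total if total else 0.0
    return rate, error_ratio


prev = {"downstream_rq_total": 1000, "downstream_rq_5xx": 10}
curr = {"downstream_rq_total": 1600, "downstream_rq_5xx": 16}

rate, error_ratio = red_from_counters(prev, curr, interval_s=60)
print(rate, round(error_ratio, 4))  # 10.0 0.01
```

The same function works on application-emitted counters, provided both layers use the same keys, which is why aligned label and metric names matter.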

RED transfers across protocols

RED’s request-centric shape transfers to every protocol; only the definition of “request” and “error” changes:

Protocol	Rate	Errors	Duration
HTTP	req/s	HTTP 5xx	p99 response time
gRPC	RPCs/s	status != OK	end-to-end for unary, first-byte for streaming
Queues (Kafka, SQS)	messages/s consumed	dead-letter rate	processing time per message (queue wait is tracked separately as backlog)
Batch jobs	jobs/min	failed jobs	wall-clock per job
Serverless	invocations/s	error rate + throttle rate	includes cold-start tail
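The table can be read as "one recorder, different error predicates". A minimal sketch under that reading: `RedRecorder` is a hypothetical class, and the outcome values (HTTP status ints, gRPC status-code strings, process exit codes) are simplifications for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class RedRecorder:
    """Counts requests, errors, and durations for one protocol."""
    is_error: Callable[[Any], bool]  # protocol-specific definition of "error"
    requests: int = 0
    errors: int = 0
    durations_s: list = field(default_factory=list)

    def observe(self, outcome: Any, duration_s: float) -> None:
        """Record one finished request, RPC, message, or job."""
        self.requests += 1
        if self.is_error(outcome):
            self.errors += 1
        self.durations_s.append(duration_s)

    def error_ratio(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


# Only the predicate changes between protocols.
http = RedRecorder(is_error=lambda status: status >= 500)
grpc = RedRecorder(is_error=lambda code: code != "OK")
batch = RedRecorder(is_error=lambda exit_code: exit_code != 0)

for status in (200, 404, 503, 200):
    http.observe(status, 0.05)
for code in ("OK", "UNAVAILABLE"):
    grpc.observe(code, 0.02)
batch.observe(0, 42.0)

print(http.error_ratio(), grpc.error_ratio(), batch.error_ratio())  # 0.25 0.5 0.0
```

Note that the HTTP 404 is not counted as an error here, matching the table's "HTTP 5xx" definition.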

For async and queue-based services, queue depth (backlog) is the Saturation signal that pure RED misses. A consumer with a job Duration p99 of 10 s may sit behind a 10-minute backlog, so the user waits 10 min + 10 s. RED captures the job-processing time; Saturation captures the queue wait before processing begins.

Why this works

Batch and async systems need a separate queue-depth metric that RED alone cannot provide. The pattern is to emit queue_depth_seconds — the age of the oldest unprocessed item in wall-clock seconds. If this grows, users are waiting longer than the job Duration suggests. This is the Saturation signal at the service level, complementing USE’s resource-level Saturation.
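A minimal sketch of that metric, assuming the queue can report the enqueue timestamps of its unprocessed items. Both helper functions are hypothetical, and adding the oldest item's age to the job Duration is only a rough estimate of what a user experiences.

```python
def queue_depth_seconds(enqueued_at: list[float], now: float) -> float:
    """Age in seconds of the oldest unprocessed item; 0.0 for an empty queue."""
    if not enqueued_at:
        return 0.0
    return max(0.0, now - min(enqueued_at))


def estimated_user_wait_seconds(depth_s: float, job_duration_s: float) -> float:
    """Rough end-to-end wait: time spent queued plus time spent processing."""
    return depth_s + job_duration_s


# Enqueue times (seconds) of three pending items, observed at now = 1000.
pending = [400.0, 700.0, 990.0]
depth = queue_depth_seconds(pending, now=1000.0)

print(depth)                                     # 600.0
print(estimated_user_wait_seconds(depth, 10.0))  # 610.0
```

With a 10 s job Duration, RED alone would report 10 s while the oldest item has already waited ten minutes.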

Quiz

What does 'Saturation' mean in the four golden signals, and how does it differ from USE's Saturation?

Quiz

An on-call engineer sees RED Duration p99 climb but Rate and Errors are flat. Which USE row should they check first?

Order the steps

Order the RED + USE incident response steps in production:

1 Page fires — symptom description includes service name and severity
2 Open the service's RED dashboard, identify which of R / E / D moved
3 Read the time series — single spike or sustained drift?
4 Cross-reference USE on the same hosts and on direct downstream dependencies
5 If RED-D moved with USE-CPU saturation: capacity issue → scale
6 If RED-E moved with USE-Errors on a dependency: dependency failure → failover
7 If both RED and USE look fine but pager fired: revisit the alert source, suspect false positive
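The branching in steps 5 to 7 can be sketched as a small decision function. The inputs are simplified assumptions: a set of RED letters that moved plus two booleans from the USE cross-reference. The fallback branch is an addition for cases the list does not cover.

```python
def triage(red_moved: set[str], use_cpu_saturated: bool,
           dependency_errors: bool) -> str:
    """Map RED and USE observations to the next action in the ordered steps."""
    if not red_moved and not use_cpu_saturated and not dependency_errors:
        # Step 7: nothing moved, so question the alert itself.
        return "suspect false positive: revisit the alert source"
    if "D" in red_moved and use_cpu_saturated:
        # Step 5: Duration rose together with CPU saturation.
        return "capacity issue: scale"
    if "E" in red_moved and dependency_errors:
        # Step 6: Errors rose together with errors on a dependency.
        return "dependency failure: fail over"
    # Not covered by the listed steps: continue the correlation work.
    return "no rule matched: keep correlating RED with USE and downstreams"


print(triage({"D"}, use_cpu_saturated=True, dependency_errors=False))
# capacity issue: scale
print(triage({"E"}, use_cpu_saturated=False, dependency_errors=True))
# dependency failure: fail over
print(triage(set(), use_cpu_saturated=False, dependency_errors=False))
# suspect false positive: revisit the alert source
```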

Recall before you leave

01 What does Google's SRE book add to RED that RED itself does not cover?
02 What is the practical advantage and limitation of service-mesh auto-RED?
03 For a Kafka consumer, what are Rate, Errors, and Duration, and what extra signal is needed?

Recap

Google’s four golden signals — Latency, Traffic, Errors, Saturation — extend RED by adding a service-level Saturation dimension: how full the service is relative to its logical capacity (connection pool, in-flight count, queue depth), not just physical resources. A mature RED+USE dashboard puts RED panels in the top row, USE panels in the middle, and downstream dependency RED+USE in the bottom row, all on the same time axis with consistent label keys. Alerting discipline is the other half: RED alerts page on-call (user is affected), USE alerts go to Slack or a ticket (capacity team plans ahead). Service meshes emit RED for free at the sidecar level — consistent across languages — but only see what the network sees; application-emitted RED is needed for business-logic fidelity. RED’s shape transfers to gRPC, queues, batch jobs, and serverless with minimal adjustment — only the definition of “request” and “error” changes. Now when you build a new service dashboard, you will resist the urge to make it unique — same layout, same label keys, and your future on-call self will find the answer in seconds instead of minutes.
