Observability OBS · 08 · 03

Cost discipline: keeping observability under 5% of infra spend

Three levers — cardinality limits on metrics, tail sampling on traces, and tiered log retention — keep observability cost at $1–5/million requests even as traffic grows 10x.

A 200-service engineering org sends you the observability bill: $680k/year, growing faster than traffic. The platform team has two quarters to cut it in half without losing debugging power. No new tooling. No vendor change. The difference between a well-tuned and an untuned org is not the backend — it is three levers and the discipline to apply them.

Why observability scales superlinearly without discipline

Observability cost scales linearly with traffic for the baseline signal volume, but superlinearly once label cardinality is left unbounded. One new unbounded label (e.g. user_id added to a metric) can multiply the metric series count by 10–100x overnight. Trace volume grows with request count. Log volume grows with request count and can spike on errors. Without governance, a 10x traffic increase can mean a 50–100x cost increase.

The industry benchmark for a well-tuned org: $1–5 per million requests for full four-signal observability. For an untuned org: $10–50 per million requests. The difference is engineering discipline, not vendor choice.
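To see where an org sits against that benchmark, divide annual spend by annual request volume. The sketch below assumes a steady average request rate; the 1,000 requests/second figure in the example is hypothetical and not taken from the lesson.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000


def cost_per_million_requests(annual_cost_usd: float, avg_requests_per_second: float) -> float:
    """Annual observability spend divided by annual request volume in millions."""
    requests_per_year = avg_requests_per_second * SECONDS_PER_YEAR
    return annual_cost_usd / (requests_per_year / 1_000_000)


def classify(cost_per_million: float) -> str:
    """Map a unit cost onto the benchmark bands quoted above."""
    if cost_per_million <= 5:
        return "well-tuned"
    if cost_per_million >= 10:
        return "untuned"
    return "in between"


# Hypothetical traffic level: 1,000 requests/second on a $680k/year bill.
unit_cost = cost_per_million_requests(680_000, 1_000)
print(f"${unit_cost:.2f} per million requests: {classify(unit_cost)}")
```

Tracking this number per quarter shows whether cost is growing faster than traffic, which a raw invoice total hides.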

The three cost levers

Lever 1: Cardinality limits on metrics

Every unique combination of label values creates one time-series. A metric http_requests_total{service, route, status_code} with 200 services × 500 routes × 600 status codes = 60 million series. The realistic cardinality budget is 5–10k active series per service.
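The series count is the product of the number of distinct values each label can take. A minimal sketch of that arithmetic follows; it gives the worst case where every combination occurs, and the five status classes (1xx to 5xx) are an assumption used for comparison.

```python
from math import prod


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Worst-case number of time series: one per combination of label values."""
    return prod(label_cardinalities.values())


# Figures from the example above.
raw = series_count({"service": 200, "route": 500, "status_code": 600})

# Same metric with exact codes collapsed into 5 status classes (assumed: 1xx-5xx).
bounded = series_count({"service": 200, "route": 500, "status_class": 5})
per_service = bounded // 200

print(raw)          # 60000000
print(bounded)      # 500000
print(per_service)  # 2500
```

Collapsing one label from 600 values to 5 brings this metric to 2,500 series per service, which leaves room under a 5–10k budget for other metrics.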

Rules that prevent cardinality explosions:

Use route templates, not full URLs (/users/{id} not /users/17384).
Use status classes, not exact codes (2xx, 4xx, 5xx not 200, 201, 404).
Never use user_id, request_id, email, or any unbounded value as a label.
Enforce limits in CI: a lint step rejects any new label whose values are unbounded.

Absent these rules, a single engineer adding one innocent-looking label can multiply your entire metric bill tenfold overnight. And without CI enforcement you will catch it only when the invoice arrives.
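A CI lint step for the rules above could start as a denylist check on label names. This is only a sketch: the function name, the `name{label, ...}` declaration format, and the denylist contents are assumptions for illustration, and a real check would also need to inspect label values, not just names.

```python
import re

# Label names the rules above call out as unbounded. A real denylist would be
# maintained by the platform team; this one is illustrative.
UNBOUNDED_LABELS = {"user_id", "request_id", "email"}

# Matches the label list inside a declaration like: metric_name{a, b, c}
LABEL_BLOCK = re.compile(r"\{([^}]*)\}")


def find_unbounded_labels(metric_declaration: str) -> list[str]:
    """Return denylisted label names found in a metric declaration."""
    match = LABEL_BLOCK.search(metric_declaration)
    if match is None:
        return []
    labels = [name.strip() for name in match.group(1).split(",") if name.strip()]
    return [name for name in labels if name in UNBOUNDED_LABELS]


offending = find_unbounded_labels("http_requests_total{service, user_id, endpoint}")
if offending:
    print(f"rejected: unbounded labels {offending}")
```

A CI job would run this over changed metric declarations and exit non-zero when the list is non-empty.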

Lever 2: Tail sampling on traces

Without tail sampling, a 200-service org at 10k requests/second generates ~10k trace trees per second. Keeping all of them at 30-day retention is what drives trace cost. Tail sampling policy:

100% of traces with any ERROR span.
100% of traces with total duration above the slow threshold (2 s typical).
5% random baseline of everything else.

Expected result: 80–90% trace volume reduction with little to no loss of debugging power. Errors and slow traces are the ones you actually open during an investigation; the 5% baseline provides context for rate/pattern analysis.
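The policy reduces to a per-trace keep/drop decision made once the whole trace has been seen. The sketch below shows that decision and the expected kept fraction; `TraceSummary` and its fields are hypothetical names, and in practice the decision runs in a collector that buffers spans until the trace completes.

```python
import random
from dataclasses import dataclass

SLOW_THRESHOLD_SECONDS = 2.0  # "slow" threshold from the policy above
BASELINE_RATE = 0.05          # 5% random baseline


@dataclass
class TraceSummary:
    """Hypothetical summary of a completed trace."""
    has_error_span: bool
    duration_seconds: float


def keep_trace(trace: TraceSummary, rng: random.Random) -> bool:
    """Keep all error and slow traces, plus a random 5% of the rest."""
    if trace.has_error_span:
        return True
    if trace.duration_seconds > SLOW_THRESHOLD_SECONDS:
        return True
    return rng.random() < BASELINE_RATE


def expected_keep_fraction(error_or_slow_fraction: float) -> float:
    """Fraction of traces retained, given the share that is error or slow."""
    return error_or_slow_fraction + (1 - error_or_slow_fraction) * BASELINE_RATE


# If 10% of traces are errors or slow, 14.5% are kept: an 85.5% reduction.
print(round(1 - expected_keep_fraction(0.10), 3))  # 0.855
```

Under this model the 80–90% range corresponds to roughly 5–15% of traces being errors or slow; a service with a higher error rate will see a smaller reduction.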

Lever 3: Tiered retention on logs and profiles

Not all data needs to be queryable at full resolution forever.

Tier	Retention	Cost level	Use case
Hot (full resolution)	7–14 days	High	Incident investigation
Summary / rollup	30–90 days	Medium	Trend analysis, SLO review
Archival (object storage)	90 days–7 years	Low	Compliance, rare deep dives

Logs in mature orgs: hot 14 days, summary 30 days, archival 90 days. Profiles: hot 30 days, downsampled 90 days. With tiers, extending retention adds only cheap storage, so cost grows far more slowly with retention time than it would if everything stayed hot.
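As a sketch, the log tiers can be expressed as age cut-offs. This assumes the three durations are cumulative ages (hot until day 14, summary until day 30, archival until day 90) and that data is deleted afterwards; the lesson does not state either explicitly.

```python
def log_tier_for_age(age_days: int) -> str:
    """Return the storage tier for a log record of the given age in days."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days <= 14:
        return "hot"       # full resolution, incident investigation
    if age_days <= 30:
        return "summary"   # rollups for trend analysis and SLO review
    if age_days <= 90:
        return "archival"  # object storage for compliance and rare deep dives
    return "expired"       # assumed: deleted once past the archival window


for age in (3, 20, 60, 120):
    print(age, log_tier_for_age(age))
```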

Each lever shrinks the stream before it reaches the expensive store: cardinality limits cut series, tail sampling drops 80–90% of traces, rollups summarise, and tiered retention ages data into cheap storage.

Signal	% of typical bill	Primary cost lever
Logs	~40%	Tiered retention + log-level audit
Metrics	~25%	Cardinality limits enforced in CI
Traces	~25%	Tail sampling (80–90% volume drop)
Profiles	~10%	Per-service sample rate (default 99 Hz)

The cost-cutting playbook applied

Back to the $680k/year platform team asked to cut costs by 50% in two quarters.

Step 1: Pareto analysis. Pull the bill broken down by service and signal type. The top 5 services account for ~60% of cost. The top signal type (usually logs) accounts for roughly 40–50% of cost.
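The Pareto step is a group-and-rank over bill line items. A minimal sketch, assuming the bill can be exported as (service, signal, cost) rows; the service names and amounts below are made up for illustration.

```python
from collections import defaultdict
from typing import Callable, Iterable

BillRow = tuple[str, str, float]  # (service, signal, annual cost in USD)


def top_n_share(rows: Iterable[BillRow], key: Callable[[BillRow], str], n: int) -> float:
    """Share of total cost carried by the n most expensive groups."""
    totals: dict[str, float] = defaultdict(float)
    for row in rows:
        totals[key(row)] += row[2]
    grand_total = sum(totals.values())
    if grand_total <= 0:
        raise ValueError("bill is empty")
    largest = sorted(totals.values(), reverse=True)[:n]
    return sum(largest) / grand_total


# Hypothetical line items, not real data.
ROWS: list[BillRow] = [
    ("checkout", "logs", 300.0),
    ("checkout", "traces", 100.0),
    ("search", "logs", 50.0),
    ("auth", "metrics", 50.0),
]

by_service = top_n_share(ROWS, key=lambda row: row[0], n=1)
by_signal = top_n_share(ROWS, key=lambda row: row[1], n=1)
print(f"top service: {by_service:.0%}, top signal: {by_signal:.0%}")
```

Running it by service and again by signal shows where the first fixes should go before anyone touches configuration.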

Step 2: Logs at $300k/year. Look for: verbose log levels in production (INFO or DEBUG instead of WARN); error messages logged on every retry; per-request log lines that duplicate trace span events; repeated identical messages. Fixes: log-level audit, dedup, move per-request info into span attributes, rate-limit noisy events.
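One way to rate-limit repeated identical messages is a filter on the standard `logging` module. The sketch below drops a message when the same logger, level, and text were already emitted within a window; the class name and the 60-second default are assumptions, and the in-memory map grows with the number of distinct messages, so a production version would need eviction.

```python
import logging
import time


class RepeatLimiter(logging.Filter):
    """Suppress a log message repeated within `window_seconds` of its last emit."""

    def __init__(self, window_seconds: float = 60.0, clock=time.monotonic):
        super().__init__()
        self.window_seconds = window_seconds
        self._clock = clock  # injectable so the behaviour can be tested
        self._last_emitted: dict[tuple[str, int, str], float] = {}

    def filter(self, record: logging.LogRecord) -> bool:
        key = (record.name, record.levelno, record.getMessage())
        now = self._clock()
        last = self._last_emitted.get(key)
        if last is not None and now - last < self.window_seconds:
            return False  # duplicate inside the window: drop it
        self._last_emitted[key] = now
        return True


logger = logging.getLogger("payments")  # hypothetical service logger
logger.addFilter(RepeatLimiter(window_seconds=60.0))
```

With the filter attached, a retry loop that logs the same error on every attempt emits one line per minute instead of one per retry.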

Step 3: Metrics at $200k. Find the top 20 metrics by series count. Most will have unbounded labels. Drop them via collector relabeling; route the dimension into traces or logs instead. Expected: 50–70% series reduction.
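Dropping a label merges every series that differed only in that label. The sketch below illustrates the effect on counter samples by summing values that share the remaining labels; it is an illustration of the outcome, not collector configuration, and the sample data is hypothetical.

```python
from collections import defaultdict

Sample = tuple[dict[str, str], float]  # (labels, counter value)


def drop_label(samples: list[Sample], label: str) -> dict[tuple, float]:
    """Remove one label and sum counter values that collapse into the same series."""
    merged: dict[tuple, float] = defaultdict(float)
    for labels, value in samples:
        remaining = tuple(sorted((k, v) for k, v in labels.items() if k != label))
        merged[remaining] += value
    return dict(merged)


# Hypothetical samples: three series before, two after dropping user_id.
SAMPLES: list[Sample] = [
    ({"route": "/users/{id}", "user_id": "1"}, 3.0),
    ({"route": "/users/{id}", "user_id": "2"}, 4.0),
    ({"route": "/health", "user_id": "1"}, 1.0),
]

merged = drop_label(SAMPLES, "user_id")
print(len(SAMPLES), "->", len(merged), "series")
```

Summing is only valid for counters; gauges and histograms need an aggregation that matches their semantics.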

Step 4: Traces at $150k with no tail sampling. Deploy tail-sampling policy (errors 100% + slow 100% + 5% baseline). Expected: 80–90% trace volume reduction.

Step 5: Profiles at $30k. Leave them. Profiling is under 5% of this bill (about 10% of a typical one); cutting it saves little and destroys CPU debugging power. Better to turn it on for services that lack it, which saves more by preventing future incidents.

Expected results: Logs −40% ($120k saved), metrics −50% ($100k saved), traces −85% ($127k saved). Total: $347k saved. Cost drops from $680k to $333k, a 51% reduction in two quarters.
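The savings arithmetic can be checked directly from the per-signal bill and the expected cut for each signal. The figures are the ones used in the steps above; the lesson rounds the trace saving of $127.5k down to $127k.

```python
# Annual bill per signal (USD) and expected percentage cut, from the playbook.
BILL = {"logs": 300_000, "metrics": 200_000, "traces": 150_000, "profiles": 30_000}
CUT_PERCENT = {"logs": 40, "metrics": 50, "traces": 85, "profiles": 0}

savings = {signal: BILL[signal] * CUT_PERCENT[signal] / 100 for signal in BILL}
total_before = sum(BILL.values())
total_saved = sum(savings.values())
total_after = total_before - total_saved
reduction = total_saved / total_before

print(savings)                            # logs 120000.0, metrics 100000.0, traces 127500.0, profiles 0.0
print(total_before, total_saved, total_after)  # 680000 347500.0 332500.0
print(f"{reduction:.1%}")                 # 51.1%
```

The remaining bill is about 49% of the original, which means the reduction is about 51% and the 50% target is met with a small margin.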

Tail sampling on traces is the single highest-leverage move: it saves $127k even though traces are less than a quarter of the bill, because an 85% cut on $150k of traces outweighs a 40% cut on $300k of logs.

MTTD vs MTTR: both matter, MTTD is the bigger lever

Mean Time To Detect (MTTD): how long between something breaking and an alert firing. Multi-window burn-rate alerts (unit 05) drive MTTD to 1–5 minutes.

Mean Time To Resolve (MTTR): how long between alert and fix. Funnel discipline drives MTTR to 3–10 minutes.

User-facing incident duration is roughly MTTD plus MTTR. With a resolve time of about 5 minutes, a 30-second MTTD keeps the incident to roughly 5 minutes for users; a 5-minute MTTD stretches the same incident to 10 minutes. Detection delay is pure added impact: every request sent before the alert fires fails while nobody is working on a fix. Measure both separately; show both on the team reliability scorecard.
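A simple additive model makes the relationship concrete. It assumes users are affected from the moment of failure until the fix lands, so impact is detection time plus resolve time; the ranges are the ones quoted above.

```python
def user_impact_minutes(mttd_minutes: float, mttr_minutes: float) -> float:
    """User-facing incident duration under a simple additive model."""
    return mttd_minutes + mttr_minutes


# Ranges from the text: MTTD 1-5 minutes, MTTR 3-10 minutes.
best_case = user_impact_minutes(1, 3)
worst_case = user_impact_minutes(5, 10)
detection_share_worst = 5 / worst_case

print(best_case, worst_case)            # 4 15
print(f"{detection_share_worst:.0%}")   # 33%
```

In the worst case a third of the user-facing time passes before anyone has been paged, which is why the two numbers are worth tracking separately.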

Production observability: typical cost parameters

Well-tuned org cost / million requests: $1–5
Untuned org cost / million requests: $10–50
Observability / total infra cost target: 3–7%
Cardinality budget per service: 5–10k active series
Trace volume reduction (tail sampling): 80–90%
Typical MTTD with burn-rate alerts: 1–5 minutes
Typical MTTR with funnel discipline: 3–10 minutes

Quiz

A platform team is asked to cut observability cost by 30% without losing debugging power. Which set of moves is most likely to deliver?

Quiz

A team adds a new metric: http_requests_total{service, user_id, endpoint}. The service handles 50k unique users and 200 endpoints. How many new time series does this create?

Recall before you leave

01. Name the three primary cost levers and state what each one controls.
02. Walk the cost-cutting playbook for a $680k/year observability spend that needs to reach $340k in two quarters.
03. Why is MTTD often a bigger lever than MTTR for user-facing impact?

Recap

Without discipline, observability cost scales superlinearly with cardinality: one unbounded label can multiply a metric’s series count 10–100x overnight. The industry benchmark is $1–5/million requests for a well-tuned org versus $10–50 for an untuned one, a 10x difference driven by data discipline, not vendor choice. Three levers control the bill: cardinality limits enforced in CI (no unbounded labels, route templates not URLs), tail sampling that keeps 100% of errors and slow traces while discarding 90%+ of baseline volume, and tiered retention that moves data from hot storage to cheap object storage as it ages. The cost-cutting playbook runs Pareto first (top 5 services drive 60% of cost), then addresses logs (−40%), metrics (−50%), and traces (−85%) in descending cost order. Profiling, at roughly 10% of a typical bill, is the last thing to touch. MTTD from burn-rate alerting and MTTR from funnel discipline add up, and together they keep most incidents under 10 minutes total for users. Now when you see a metric label proposal in a code review, your first question is: “What is the cardinality bound on this value?” If no one can answer it, the label does not merge.
