Three pillars: code and config reading

Read real instrumentation code, a log line, a PromQL alert, and a tail-sampling config; predict the observability behaviour and pick the highest-leverage fix.

Level: Senior · 14 min

Observability bugs are written in instrumentation code and config, not in dashboards. Read the snippet, predict what it does to your backend’s cost or correctness, then choose the fix a senior engineer makes first.

Goal

Practise the loop you run in every observability incident: read the instrumentation, predict the cardinality, volume, or sampling consequence, and reach for the highest-leverage fix before paging anyone.

Snippet 1 — the metric label

requestsTotal := prometheus.NewCounterVec(
    prometheus.CounterOpts{Name: "http_requests_total"},
    []string{"route", "method", "status_class", "customer_email"},
)
// in the handler:
requestsTotal.WithLabelValues(route, method, statusClass, req.CustomerEmail).Inc()

Quiz

This counter is deployed to a service with 100k+ active customers. What happens, and what is the correct fix that keeps per-customer drill-through?
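A quick way to reason about Snippet 1 is to multiply the number of distinct values each label can take. The sketch below does that arithmetic in Python. The route, method, and status-class counts and the `customer_tier` label are hypothetical numbers and names chosen for illustration, and the product is an upper bound, since only label combinations that are actually observed become series.

```python
from math import prod


def series_upper_bound(label_cardinalities):
    """Upper bound on series for one metric: product of per-label value counts."""
    return prod(label_cardinalities.values())


# Hypothetical cardinalities for the bounded labels in Snippet 1.
bounded = {"route": 50, "method": 5, "status_class": 5}

# Adding an unbounded label multiplies the bound by the number of customers.
with_email = {**bounded, "customer_email": 100_000}

# Assumed alternative: a bounded segment label (name and size are made up).
with_tier = {**bounded, "customer_tier": 4}

print(series_upper_bound(bounded))     # 1250
print(series_upper_bound(with_email))  # 125000000
print(series_upper_bound(with_tier))   # 5000
```

Even if only a small fraction of the combinations ever occurs, the `customer_email` label moves the metric from thousands of series toward millions, while a bounded segment label stays within a small constant factor.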

Snippet 2 — the retry log line

for attempt in range(max_retries):  # max_retries was just raised 3 -> 20
    try:
        return gateway.charge(order)
    except TimeoutError:
        logger.info("payment retry", extra={"order_id": order.id, "attempt": attempt})

Quiz

Traffic is flat but the logging bill jumped after max_retries went from 3 to 20. Which statement is correct, and what is the durable fix?
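In Snippet 2 the worst-case number of log lines per failing order equals `max_retries`, so raising it from 3 to 20 allows roughly 6.7 times more lines at the same traffic. One possible shape for the fix is sketched below: count every timeout in a metric and emit a single summary log line per order. `retry_metrics` is a plain `Counter` standing in for a real metrics client, `charge_with_retries` is a hypothetical wrapper name, and raising after the last attempt is an assumption about the desired behaviour, not something the original snippet does.

```python
import logging
from collections import Counter

logger = logging.getLogger("payments")

# Stand-in for a real metrics client; only the counting behaviour matters here.
retry_metrics = Counter()


def charge_with_retries(gateway, order, max_retries=20):
    """Charge an order, retrying on timeout, and log at most once per order."""
    for attempt in range(1, max_retries + 1):
        try:
            result = gateway.charge(order)
        except TimeoutError:
            # Count each timeout cheaply instead of logging it.
            retry_metrics["payment_timeouts_total"] += 1
            continue
        if attempt > 1:
            # One summary line, only when retries actually happened.
            logger.info(
                "payment succeeded after retries",
                extra={"order_id": order.id, "attempts": attempt},
            )
        return result
    # All attempts timed out: one summary line, then surface the failure.
    logger.error(
        "payment failed after retries",
        extra={"order_id": order.id, "attempts": max_retries},
    )
    raise TimeoutError(f"payment gateway timed out {max_retries} times")
```

With this shape, log volume is bounded by the number of orders rather than by orders times retries, and the retry rate is still visible through the counter.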

Snippet 3 — the cardinality alert

# alert expression
rate(prometheus_tsdb_head_series_created_total[5m]) > 1000

# triage query, run when it fires:
topk(10, count by (__name__) ({__name__=~".+"}))

Quiz

What does this alert detect, and what does the triage query tell you when it fires?
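The alert expression watches how fast new series are created in the head block, so it reacts to churn rather than to the total series count. The triage query then counts current series per metric name and returns the ten largest. As a rough illustration of what that query computes, the Python sketch below applies the same count-and-rank step to an in-memory list of label sets. The sample data and the function name `top_metrics_by_series` are made up for this example.

```python
from collections import Counter


def top_metrics_by_series(series, k=10):
    """Mimic topk(k, count by (__name__) (...)): series count per metric name."""
    counts = Counter(labels["__name__"] for labels in series)
    return counts.most_common(k)


# Hypothetical head block: each dict is the label set of one series.
head = (
    [
        {"__name__": "http_requests_total", "customer_email": f"u{i}@example.com"}
        for i in range(5000)
    ]
    + [{"__name__": "process_cpu_seconds_total", "instance": f"app-{i}"} for i in range(3)]
    + [{"__name__": "up", "instance": f"app-{i}"} for i in range(2)]
)

print(top_metrics_by_series(head, k=2))
# [('http_requests_total', 5000), ('process_cpu_seconds_total', 3)]
```

The ranking names the metric that holds the most series, not the label that caused the growth, so a follow-up count by individual label on that metric is usually the next step. The `{__name__=~".+"}` selector touches every series, which can make the query expensive on a large server.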

Snippet 4 — the tail-sampling config

policies:
  - name: errors-policy
    type: status_code
    status_code: {status_codes: [ERROR]}
  - name: slow-traces-policy
    type: latency
    latency: {threshold_ms: 1000}
  - name: baseline-policy
    type: probabilistic
    probabilistic: {sampling_percentage: 2}

Quiz

A reviewer asks: 'why three policies, and what is the cost we still pay even though we only store ~2% of successful traces?'
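The three policies in Snippet 4 can be read as an OR: a trace is kept if it has an error, or it is slow, or it wins the 2% lottery. The sketch below models that decision in Python, assuming the collector keeps a trace when any one policy matches and that the latency policy matches at or above the threshold. `Trace`, `keep_trace`, and `simulate` are hypothetical names, and the model ignores details such as the decision wait time.

```python
import random
from dataclasses import dataclass


@dataclass
class Trace:
    has_error: bool
    duration_ms: float


def keep_trace(trace, baseline_pct=2.0, threshold_ms=1000.0, rng=random.random):
    """Return True if any of the three policies would keep this trace."""
    if trace.has_error:                    # errors-policy
        return True
    if trace.duration_ms >= threshold_ms:  # slow-traces-policy (boundary assumed)
        return True
    return rng() * 100 < baseline_pct      # baseline-policy


def simulate(traces, rng=random.random):
    """Return (buffered, kept): every trace is buffered, only some are stored."""
    buffered = len(traces)  # the collector must hold each trace until it decides
    kept = sum(keep_trace(t, rng=rng) for t in traces)
    return buffered, kept


traffic = (
    [Trace(True, 120.0)] * 10      # failed requests
    + [Trace(False, 1500.0)] * 20  # slow but successful
    + [Trace(False, 80.0)] * 970   # fast and successful
)
buffered, kept = simulate(traffic)
print(buffered, kept)  # buffered is always 1000; kept is 30 plus roughly 2% of 970
```

The `buffered` figure is the cost that remains: because the decision needs the whole trace, every span still travels to the collector and sits in memory until the decision is made, whatever fraction is finally stored.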

Recap

Every observability problem here can be read in code and config. An unbounded label like customer_email is a cardinality bomb: drop it, and keep a bounded segment label plus an exemplar for per-customer drill-through. A per-iteration log line multiplies volume with the retry count even when traffic is flat: emit one summary line plus a metric counter. A rate alert on prometheus_tsdb_head_series_created_total catches the bomb before the OOM, and the topk query names the metric holding the most series. A tail-sampling config keeps 100% of errors and slow traces plus about 2% of the rest, while still paying collector cost proportional to raw traffic. Read the instrumentation, predict the cost axis, fix at the source.
