
Observability

Three pillars: code and config reading

Crux: Read real instrumentation code, a log line, a PromQL alert, and a tail-sampling config; predict the observability behaviour and pick the highest-leverage fix.

Observability bugs are written in instrumentation code and config, not in dashboards. Read the snippet, predict what it does to your backend’s cost or correctness, then choose the fix a senior engineer makes first.

Goal

Practise the loop you run in every observability incident: read the instrumentation, predict the cardinality, volume, or sampling consequence, and reach for the highest-leverage fix before paging anyone.

Snippet 1 — the metric label

requestsTotal := prometheus.NewCounterVec(
    prometheus.CounterOpts{Name: "http_requests_total"},
    []string{"route", "method", "status_class", "customer_email"},
)
// in the handler:
requestsTotal.WithLabelValues(route, method, statusClass, req.CustomerEmail).Inc()
Quiz

This counter is deployed to a service with 100k+ active customers. What happens, and what is the correct fix that keeps per-customer drill-through?
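A rough way to predict the consequence is to multiply the distinct values of each label, which gives an upper bound on the number of time series the counter can create. The sketch below does that arithmetic. Only the 100k customer figure comes from the quiz; the route, method, and status-class counts, and the `customer_tier` label standing in for a bounded segment, are hypothetical.

```python
from math import prod


def max_series(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical cardinalities; only the 100k customer count is from the quiz.
with_email = {"route": 50, "method": 5, "status_class": 5, "customer_email": 100_000}

# Same metric with the unbounded label swapped for a bounded segment (assumed name).
bounded = {"route": 50, "method": 5, "status_class": 5, "customer_tier": 4}

print(max_series(with_email))  # 125000000
print(max_series(bounded))     # 5000
```

Not every combination occurs in practice, so the real count is lower, but the gap between the two results is what matters: the bounded label keeps the metric cheap, and per-customer detail moves to exemplars, logs, or traces.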

Snippet 2 — the retry log line

for attempt in range(max_retries):  # max_retries was just raised 3 -> 20
    try:
        return gateway.charge(order)
    except TimeoutError:
        logger.info("payment retry", extra={"order_id": order.id, "attempt": attempt})
Quiz

Traffic is flat but the logging bill jumped after max_retries went from 3 to 20. Which statement is correct, and what is the durable fix?
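One possible shape for the durable fix is sketched below: count every timeout in a metric and emit a single summary log line per order, so log volume no longer scales with `max_retries`. The `charge_with_retries` helper, the `payment_retries_total` name, and the `collections.Counter` standing in for a real metrics client are all assumptions made for illustration. Unlike the original loop, which falls through and returns `None` when retries run out, this sketch raises.

```python
import logging
from collections import Counter

logger = logging.getLogger("payments")
retry_counter = Counter()  # stand-in for a real metrics counter (assumption)


def charge_with_retries(charge, order_id, max_retries):
    """Call charge() up to max_retries times; count each timeout, log once."""
    attempts = 0
    for attempt in range(max_retries):
        attempts = attempt + 1
        try:
            result = charge()
        except TimeoutError:
            # A counter increment is cheap; no log line per attempt.
            retry_counter["payment_retries_total"] += 1
            continue
        if attempt > 0:
            # One summary line, only when retries actually happened.
            logger.info("payment succeeded after retries",
                        extra={"order_id": order_id, "attempts": attempts})
        return result
    logger.warning("payment retries exhausted",
                   extra={"order_id": order_id, "attempts": attempts})
    raise TimeoutError(f"charge failed after {attempts} attempts")
```

With this shape an order emits at most one log line whether `max_retries` is 3 or 20, while the counter still shows the retry rate on a dashboard.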

Snippet 3 — the cardinality alert

# alert expression
rate(prometheus_tsdb_head_series_created_total[5m]) > 1000

# triage query, run when it fires:
topk(10, count by (__name__) ({__name__=~".+"}))
Quiz

What does this alert detect, and what does the triage query tell you when it fires?
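Two details help when reading this snippet. `rate()` returns a per-second average, so the threshold means more than 1000 new series per second sustained over the 5-minute window. The triage query counts active series per metric name and keeps the ten largest. The sketch below emulates that grouping in plain Python over a hypothetical list of label sets; it is not how Prometheus evaluates the query, only an illustration of what the result means.

```python
from collections import Counter


def top_metrics_by_series(series, k=10):
    """Emulate topk(k, count by (__name__)(...)): rank metric names by series count."""
    counts = Counter(labels["__name__"] for labels in series)
    return counts.most_common(k)


# Hypothetical head contents: one metric exploding on customer_email, one healthy.
series = (
    [{"__name__": "http_requests_total", "customer_email": f"user{i}@example.com"}
     for i in range(5)]
    + [{"__name__": "up", "instance": f"host{i}"} for i in range(2)]
)

print(top_metrics_by_series(series))  # [('http_requests_total', 5), ('up', 2)]

# New series implied by the alert threshold over one 5-minute window.
window_seconds = 5 * 60
print(1000 * window_seconds)  # 300000
```

The query names the metric but not the label responsible. A follow-up grouped by a suspect label on that metric narrows it further. Series churn, for example from frequent restarts, can also raise the creation rate, so a firing alert is a prompt to investigate rather than proof of a bad label.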

Snippet 4 — the tail-sampling config

policies:
  - name: errors-policy
    type: status_code
    status_code: {status_codes: [ERROR]}
  - name: slow-traces-policy
    type: latency
    latency: {threshold_ms: 1000}
  - name: baseline-policy
    type: probabilistic
    probabilistic: {sampling_percentage: 2}
Quiz

A reviewer asks: 'why three policies, and what is the cost we still pay even though we only store ~2% of successful traces?'
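The decision logic can be sketched as a simple OR over the three policies: a trace is kept if any one of them says keep. The code below is a minimal illustration, assuming the config targets the OpenTelemetry Collector's tail_sampling processor. The `keep_trace` function, the hash-based bucketing, and the treatment of a trace exactly at the threshold are assumptions, not the collector's actual implementation.

```python
import hashlib


def keep_trace(trace_id: str, has_error: bool, duration_ms: float,
               threshold_ms: float = 1000, sampling_percentage: float = 2) -> bool:
    """Keep a trace if any policy matches: error, slow, or the probabilistic baseline."""
    if has_error:                      # errors-policy: keep every failed trace
        return True
    if duration_ms >= threshold_ms:    # slow-traces-policy (boundary handling assumed)
        return True
    # baseline-policy: deterministic bucket from the trace ID, about 2% kept.
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < sampling_percentage * 100


# The decision needs the finished trace, so every span is received and buffered first.
fast_ok = [keep_trace(f"trace-{i}", has_error=False, duration_ms=50) for i in range(10_000)]
print(sum(fast_ok))  # roughly 200 of 10,000 healthy traces stored
```

All three inputs (`has_error`, `duration_ms`, the full trace) are known only after the trace completes. That is why the collector still has to ingest and hold every span until the decision is made, and why its network, CPU, and memory cost tracks raw traffic rather than the 2% that reaches storage.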

Recap

Every observability problem can be read in code and config. An unbounded label like customer_email is a cardinality bomb: drop it and keep a bounded segment label plus an exemplar for drill-through. A per-iteration log line multiplies volume regardless of traffic: emit one summary line plus a metric counter. A rate alert on prometheus_tsdb_head_series_created_total catches the bomb before the OOM, and the topk query names the culprit metric. A tail-sampling config keeps 100% of errors and slow traces while still paying collector cost proportional to raw traffic. Read the instrumentation, predict the cost axis, fix at the source.
