Crux Read real instrumentation code, a log line, a PromQL alert, and a tail-sampling config; predict the observability behaviour and pick the highest-leverage fix.
Observability bugs are written in instrumentation code and config, not in dashboards. Read the snippet, predict what it does to your backend’s cost or correctness, then choose the fix a senior engineer makes first.
Goal
Practise the loop you run in every observability incident: read the instrumentation, predict the cardinality, volume, or sampling consequence, and reach for the highest-leverage fix before paging anyone.
Snippet 1 — the metric label
requestsTotal := prometheus.NewCounterVec(
    prometheus.CounterOpts{Name: "http_requests_total"},
    []string{"route", "method", "status_class", "customer_email"},
)

// in the handler:
requestsTotal.WithLabelValues(route, method, statusClass, req.CustomerEmail).Inc()
Quiz
This counter is deployed to a service with 100k+ active customers. What happens, and what is the correct fix that keeps per-customer drill-through?
Heads-up The increment is free; the cost is the series-count memory. Every distinct customer_email is a separate series at ~3 KB in the head block — that is the cardinality bomb.
Heads-up Email strings are valid label values; '@' is not the problem. Unbounded cardinality is — Prometheus stores the series and runs out of memory.
Heads-up Scrape interval changes sample resolution, not the number of active series. The head block memory is driven by series count, which the unbounded label inflates.
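To make the cardinality arithmetic concrete, here is a minimal sketch that compares the worst-case series count with and without the per-customer label. The cardinalities for route, method, and status_class, and the replacement label customer_tier, are hypothetical. The ~3 KB per series figure is the rough estimate quoted above, not a measured value. The product is an upper bound, since real series only exist for label combinations that were actually observed. Even at one combination per customer, 100k customers at ~3 KB each is already roughly 300 MB of head memory for a single metric.

```python
from math import prod

# "~3 KB" per head-block series is the rough figure quoted in the text; an estimate only.
BYTES_PER_SERIES = 3 * 1024


def max_series(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


def head_memory_mib(label_cardinalities: dict[str, int]) -> float:
    """Worst-case head-block memory in MiB for one metric."""
    return max_series(label_cardinalities) * BYTES_PER_SERIES / 2**20


# Hypothetical cardinalities for the three bounded labels.
bounded = {"route": 20, "method": 4, "status_class": 5}

with_email = {**bounded, "customer_email": 100_000}  # unbounded: grows with the customer base
with_tier = {**bounded, "customer_tier": 4}          # hypothetical bounded segment

if __name__ == "__main__":
    for name, labels in [("customer_email", with_email), ("customer_tier", with_tier)]:
        print(f"{name}: {max_series(labels):,} series, ~{head_memory_mib(labels):,.1f} MiB")
```

Under these assumptions the bounded variant stays at a few MiB, while the email variant grows linearly with every new customer.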
Snippet 2 — the retry log line
for attempt in range(max_retries):  # max_retries was just raised 3 -> 20
    try:
        return gateway.charge(order)
    except TimeoutError:
        logger.info("payment retry", extra={"order_id": order.id, "attempt": attempt})
Quiz
Traffic is flat but the logging bill jumped after max_retries went from 3 to 20. Which statement is correct, and what is the durable fix?
Heads-up Encoding is not the issue; the line COUNT is. Plaintext would still emit one line per attempt and would be harder to query. Emit one summary line plus a metric instead.
Heads-up Retry count is business logic; changing it to fix a log bill is the wrong lever. The fix is log discipline: one summary line per logical operation, not one per iteration.
Heads-up Log volume tracks emitted LINES, not request count. A loop that logs per iteration multiplies volume with the iteration count even at flat traffic.
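One way to apply "one summary line per logical operation" is sketched below. It is an illustration under stated assumptions, not the lesson's reference answer: retry_metric is a plain collections.Counter standing in for a real metrics counter, and FlakyGateway, the metric name, and the log messages are hypothetical.

```python
import logging
from collections import Counter

logger = logging.getLogger("payments")

# Stand-in for a real metrics counter (assumption); a Counter keeps the sketch stdlib-only.
retry_metric = Counter()


def charge_with_retries(gateway, order_id, max_retries):
    """Retry on timeout, count timeouts in a metric, and log once per operation."""
    attempts = 0
    for attempt in range(max_retries):
        attempts = attempt + 1
        try:
            result = gateway.charge(order_id)
        except TimeoutError:
            # A counter increment instead of a log line per attempt.
            retry_metric["payment_timeouts_total"] += 1
            continue
        logger.info("payment charged", extra={"order_id": order_id, "attempts": attempts})
        return result
    # Retries exhausted: still exactly one line for the whole operation.
    logger.warning("payment gave up", extra={"order_id": order_id, "attempts": attempts})
    return None


class FlakyGateway:
    """Hypothetical gateway that times out a fixed number of times, then succeeds."""

    def __init__(self, timeouts):
        self.remaining = timeouts

    def charge(self, order_id):
        if self.remaining > 0:
            self.remaining -= 1
            raise TimeoutError
        return "ok"
```

With this shape, raising max_retries changes the value of the attempts field and the counter, not the number of log lines.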
Snippet 3 — the cardinality alert
# alert expression
rate(prometheus_tsdb_head_series_created_total[5m]) > 1000

# triage query, run when it fires:
topk(10, count by (__name__) ({__name__=~".+"}))
Quiz
What does this alert detect, and what does the triage query tell you when it fires?
Heads-up head_series_created_total counts SERIES creation, not requests. The query counts series per metric name, not endpoint latency. This is cardinality observability, not RED metrics.
Heads-up It fires on series-creation rate, the precursor to OOM. Restarting replays the WAL and reloads the same exploding series — the fix is metric_relabel_configs labeldrop, then removing the label in code.
Heads-up These are Prometheus TSDB metrics about series, not log ingest. Log runaway is detected by GB/day per service in the log vendor dashboard, a different signal.
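As a rough model of what the two expressions compute, the sketch below reproduces them over in-memory data. It is deliberately simplified: simple_rate ignores the counter-reset handling and extrapolation that PromQL's rate() performs, and the sample series are made up. Read this way, the threshold of 1000 means more than 1000 new series per second averaged over five minutes, or over 300,000 new series in the window.

```python
from collections import Counter


def simple_rate(first_value: float, last_value: float, window_seconds: float) -> float:
    """Per-second increase of a counter over a window (no reset handling)."""
    return (last_value - first_value) / window_seconds


def top_metric_names(series: list[dict[str, str]], k: int = 10) -> list[tuple[str, int]]:
    """Count series per metric name and return the k largest, like topk over count by (__name__)."""
    counts = Counter(labels["__name__"] for labels in series)
    return counts.most_common(k)


# Made-up head block: one metric exploded by a per-customer label, one healthy metric.
sample_series = [
    {"__name__": "http_requests_total", "customer_email": f"user{i}@example.com"}
    for i in range(5)
] + [{"__name__": "up", "job": "api"}]

if __name__ == "__main__":
    print(simple_rate(0, 360_000, 300) > 1000)  # 360k new series in 5 minutes
    print(top_metric_names(sample_series, k=2))
```

The metric name at the top of the list is where to look for the unbounded label.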
A reviewer asks: 'why three policies, and what is the cost we still pay even though we only store ~2% of successful traces?'
Heads-up Probabilistic 2% would randomly drop 98% of errors and slow traces — exactly the traces you need in an incident. The error and latency policies exist to guarantee 100% retention of those.
Heads-up It stores less but costs MORE at the collector: every span is buffered to make a context-aware decision, so collector resources scale with raw traffic. Head sampling drops at trace start and never buffers.
Heads-up The status_code policy keys on the span STATUS (OTel ERROR status), not the HTTP response code. A 200 with a recorded error status is kept; a 4xx that the app marks OK may not be.
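The three policies described above (error status, latency, and a ~2% probabilistic baseline) can be modelled as a single decision function. This is a sketch of the idea, not the collector's implementation: the Span type, the 500 ms threshold, and the injectable rng parameter are assumptions made for illustration. The function needs the complete list of spans for a trace, which is why a tail-sampling collector has to buffer every span before it can decide.

```python
import random
from dataclasses import dataclass


@dataclass
class Span:
    start_ms: float
    end_ms: float
    status: str  # span status: "OK", "ERROR", or "UNSET" (not the HTTP response code)


def keep_trace(spans, latency_threshold_ms=500.0, baseline_rate=0.02, rng=random.random):
    """Decide whether to keep a complete, non-empty trace.

    The threshold is hypothetical; the 2% baseline is the figure from the text.
    """
    # Error policy: any span with ERROR status keeps the whole trace.
    if any(s.status == "ERROR" for s in spans):
        return True
    # Latency policy: trace duration from earliest start to latest end.
    duration = max(s.end_ms for s in spans) - min(s.start_ms for s in spans)
    if duration >= latency_threshold_ms:
        return True
    # Probabilistic baseline for the remaining fast, successful traces.
    return rng() < baseline_rate
```

Passing a fixed rng, such as lambda: 0.99, makes the baseline branch deterministic, which helps when checking that errors and slow traces are kept regardless of the random draw.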
Recap
Every observability problem is read in code and config. An unbounded label like customer_email is a cardinality bomb: drop it and keep a bounded segment plus an exemplar. A per-iteration log line multiplies volume regardless of traffic: emit one summary line plus a metric counter. A head_series_created rate alert catches the bomb before the OOM, and the topk query names the culprit. A tail-sampling config keeps 100% of errors and slow traces while still paying collector cost proportional to raw traffic. Read the instrumentation, predict the cost axis, fix at the source.