Why production logs in 2026 are JSON-or-nothing, what a usable log schema actually contains, how levels and sampling control the bill, and why PII discipline and log injection are first-class engineering concerns — not afterthoughts.
The four pieces of OTel — the API your code calls, the SDK that builds telemetry, the Collector that processes and routes it, and OTLP that carries it — and how the layered model lets you instrument once and swap backends without rewriting code.
Why RED (Rate, Errors, Duration) describes services from the caller's side, USE (Utilization, Saturation, Errors) describes resources from the kernel's side, and why senior engineers run both — plus the cardinality tax that punishes naive labelling.
SLI, SLO, and error budgets: reliability in numbers
SLI is a good/total ratio; SLO is the target; error budget is 1 − SLO. MWMBR alerting, error budget policy, SLO platforms, and the cultural adoption pattern that turns arithmetic into decisions.
Trace propagation: the headers that stitch services together
Why the W3C traceparent header is the load-bearing 55-byte string that turns 50 disconnected services into one navigable trace, how baggage carries context across async boundaries, and how head vs tail sampling decide which traces survive.
Profiling: where the CPU and the bytes actually went
How sampling profilers turn an unfair share of CPU into a flame graph you can read in 60 seconds, how eBPF and continuous profiling watch production at 2-5% overhead, and how on-CPU vs off-CPU profiles answer different questions about the same slow request.
Putting it together: a production observability story
How RED + USE + SLO + traces + profiles compose into one debugging loop, how OpenTelemetry unifies four signals through one SDK and one wire format, and what 'observability that pays for itself' actually means at production scale.