Observability

How to see what your running system is doing — through logs, metrics, and traces — so that when something breaks at 3am you can actually find out why.

9 units · 87 lessons · ~50 h

Start from zero

Before the senior material: what observability even is, and the handful of words the rest of the track assumes you already know.

01 Start from zero: what observability actually is 10 min

Three pillars: metrics, logs, and traces

Metrics, logs, and traces each answer a different question most cheaply. Join keys and exemplars make them compose into one navigable surface.

01 What the three signals are: logs, metrics, and traces 10 min
02 Metrics and cardinality: the cost model of a time-series database 14 min
03 Logs and volume: the cost model of structured logging 12 min
04 Traces and sampling: the cost model of distributed tracing 13 min
05 Join keys and exemplars: making the three signals compose 12 min
06 Observability 2.0: wide events and the cost shift 13 min
07 Failure modes and engineering practice: cardinality budgets, PII, and sampling 14 min
08 Three pillars: multiple-choice review 13 min
09 Three pillars: free-recall review 14 min
10 Three pillars: code and config reading 14 min
11 Three pillars: build a navigable observability surface 240 min

Structured logging: schema, levels, redaction

Why production logs in 2026 are JSON-or-nothing, what a usable log schema actually contains, how levels and sampling control the bill, and why PII discipline and log injection are first-class engineering concerns — not afterthoughts.

01 Why structured logs exist: the diary vs the spreadsheet 8 min
02 The production log schema: fields every line must carry 12 min
03 Log levels and alert routing 10 min
04 Sampling strategies and log cost 12 min
05 PII redaction and log injection 12 min
06 Trace context propagation in logs 12 min
07 OTel Logs Data Model and audit logs as a subsystem 14 min
08 Structured logging: multiple-choice review 13 min
09 Structured logging: free-recall review 14 min
10 Structured logging: code and log reading 14 min
11 Structured logging: build a production logging pipeline 240 min

OpenTelemetry: API, SDK, Collector, OTLP

The four pieces of OTel — the API your code calls, the SDK that builds telemetry, the Collector that processes and routes it, and OTLP that carries it — and how the layered model lets you instrument once and swap backends without rewriting code.

01 What is OpenTelemetry: API, SDK, Collector, OTLP 10 min
02 OTel signals, Semantic Conventions, and the OTLP wire format 12 min
03 Auto-instrumentation and manual spans: the 80/20 of OTel 11 min
04 The OTel Collector: receivers, processors, exporters, and deployment patterns 13 min
05 Sampling strategies: head, tail, and parent-based 13 min
06 Vendor neutrality, eBPF instrumentation, the Operator, and browser/serverless OTel 14 min
07 Operating the OTel Collector: reliability, version skew, failure modes, and governance 15 min
08 OTel: multiple-choice review 13 min
09 OTel: free-recall review 13 min
10 OTel: config and trace reading 14 min
11 OTel: build a vendor-neutral pipeline 240 min

RED and USE: the two halves of every dashboard

Why RED (Rate, Errors, Duration) describes services from the caller's side, USE (Utilization, Saturation, Errors) describes resources from the kernel's side, and why senior engineers run both — plus the cardinality tax that punishes naive labelling.

01 RED and USE: two checklists, one triage discipline 10 min
02 Instrumenting RED in Prometheus: counters, histograms, and cardinality discipline 14 min
03 USE on Linux: CPU, memory, disk, network, and PSI 14 min
04 Golden signals, dashboard layout, and service mesh auto-RED 12 min
05 Cardinality as a cost driver: labels, PII, exemplars, and sampling 14 min
06 Native histograms, SLO tie-in, and production failure patterns 16 min
07 RED and USE: multiple-choice review 13 min
08 RED and USE: free-recall review 13 min
09 RED and USE: PromQL and signal reading 14 min
10 RED and USE: build the dashboard and triage an incident 240 min

SLI, SLO, and error budgets: reliability in numbers

An SLI is a good/total ratio; the SLO is the target; the error budget is 1 − SLO. This unit also covers multi-window multi-burn-rate (MWMBR) alerting, error budget policy, SLO platforms, and the cultural adoption pattern that turns arithmetic into decisions.

01 SLI, SLO, and the error budget: reliability by the numbers 12 min
02 Choosing SLIs and SLO targets: ratios, not feelings 14 min
03 Multi-window multi-burn-rate alerting: why AND beats OR 15 min
04 Error budget policy, latency SLOs, and composite journeys 16 min
05 SLO platforms and the 90-day rollout 13 min
06 Low-traffic SLOs and burn-rate math from first principles 17 min
07 Iceberg SLIs, composite SLO math, and SLA vs SLO 16 min
08 Production SLO failures, self-observability, security, and the big picture 18 min
09 SLO and error budgets: multiple-choice review 13 min
10 SLO and error budgets: free-recall review 14 min
11 SLO and error budgets: PromQL and rule reading 14 min
12 SLO and error budgets: instrument a journey end to end 240 min
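
To make the arithmetic in this unit's description concrete, here is a minimal Python sketch of the good/total ratio and the 1 − SLO budget. The function names and the "budget consumed" ratio are illustrative assumptions, not taken from the lessons; the consumed fraction is computed as the observed failure fraction divided by the allowed failure fraction, which is one common way to express it.

```python
def sli(good: int, total: int) -> float:
    """SLI as a good/total ratio over some measurement window."""
    if total <= 0:
        raise ValueError("total must be positive")
    return good / total


def error_budget(slo_target: float) -> float:
    """Allowed failure fraction for a target such as 0.999: 1 - SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def budget_consumed(good: int, total: int, slo_target: float) -> float:
    """Assumed definition: observed failure fraction / allowed failure fraction.

    A value of 1.0 means the budget for the window is fully spent.
    """
    observed_failures = 1.0 - sli(good, total)
    return observed_failures / error_budget(slo_target)


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests, 500 of them bad, 99.9% target.
    print(f"SLI:             {sli(999_500, 1_000_000):.4%}")
    print(f"Error budget:    {error_budget(0.999):.4%}")
    print(f"Budget consumed: {budget_consumed(999_500, 1_000_000, 0.999):.1%}")
```

With these made-up numbers, 500 bad requests against an allowance of about 1,000 spends roughly half the budget for the window.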

Trace propagation: the headers that stitch services together

Why the W3C traceparent header is the load-bearing 55-byte string that turns 50 disconnected services into one navigable trace, how baggage carries context across async boundaries, and how head and tail sampling decide which traces survive.

01 What is trace propagation and why broken propagation is worse than none 10 min
02 traceparent and tracestate: the W3C header format in full 13 min
03 Baggage and async boundaries: carrying context across queues and callbacks 14 min
04 Head sampling and tail sampling: deciding which traces survive 13 min
05 Sampling consistency and the tail-sampling Collector tier 14 min
06 Async context per language, service mesh, B3 migration, and security 16 min
07 Production propagation failures, span links, and platform design 18 min
08 Trace propagation: multiple-choice review 13 min
09 Trace propagation: free-recall review 14 min
10 Trace propagation: code and header reading 14 min
11 Trace propagation: stitch a broken system into one trace 240 min
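
As a rough illustration of where the 55 bytes in this unit's description come from, the sketch below parses a version-00 traceparent value in Python: a 2-hex-digit version, a 32-hex-digit trace ID, a 16-hex-digit parent ID, and 2 hex digits of flags, joined by three hyphens. The `TraceParent` type and `parse_traceparent` name are hypothetical, and the sketch handles only the version-00 shape rather than everything the W3C Trace Context specification allows for future versions.

```python
import re
from typing import NamedTuple, Optional

# version-traceid-parentid-flags, lowercase hex only: 2 + 1 + 32 + 1 + 16 + 1 + 2 = 55
_TRACEPARENT_RE = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed fields, or None if the header is malformed."""
    match = _TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the W3C Trace Context spec.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    # The lowest flag bit marks the trace as sampled by the caller.
    sampled = bool(int(flags, 16) & 0x01)
    return TraceParent(version, trace_id, parent_id, sampled)


if __name__ == "__main__":
    example = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    print(len(example), parse_traceparent(example))
```

Returning None on a malformed header, instead of raising, reflects the usual propagation choice of starting a fresh trace when incoming context cannot be trusted; that behaviour is a design assumption of this sketch.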

Profiling: where the CPU and the bytes actually went

How sampling profilers turn an unfair share of CPU into a flame graph you can read in 60 seconds, how eBPF and continuous profiling watch production at 2-5% overhead, and how on-CPU vs off-CPU profiles answer different questions about the same slow request.

01 Flame graphs: reading the picture that shows where time goes 12 min
02 Sampling vs instrumentation profiling: why 99 Hz wins in production 13 min
03 Profile types: CPU, memory, off-CPU, mutex — which one to reach for 15 min
04 Continuous profiling: always-on flame graphs with eBPF and trace-id correlation 16 min
05 How flame graphs are built from samples, and the production workflows that use them 15 min
06 Linux perf, eBPF internals, PGO, and the limits of sampling 18 min
07 Profiling in production: security, war stories, OTel profiles, and the infrastructure design 18 min
08 Profiling: multiple-choice review 13 min
09 Profiling: free-recall review 14 min
10 Profiling: profile and config reading 14 min
11 Profiling: from SLO to flame graph 240 min

Putting it together: a production observability story

How RED + USE + SLO + traces + profiles compose into one debugging loop, how OpenTelemetry unifies four signals through one SDK and one wire format, and what 'observability that pays for itself' actually means at production scale.

01 The debugging funnel: SLO → RED → trace → profile 10 min
02 OTel architecture: one SDK, four signals, one wire format 14 min
03 Cost discipline: keeping observability under 5% of infra spend 13 min
04 The incident loop: from pager to postmortem to prevention 14 min
05 Scale, security, and the ROI of observable systems 18 min
06 Observability capstone: multiple-choice synthesis 14 min
07 Observability capstone: free-recall review 13 min
08 Observability capstone: reading signals and queries 14 min
09 Observability capstone: instrument a service and debug an incident 240 min

Build with this track

Guided projects that exercise what you learn here.

◆ Projects

Collaborative cursors

Show every connected user's live cursor and selection in a shared document, conflict-free, over WebSocket.

A concurrent Go ingest service

Build a concurrent ingest/fan-out worker in Go — then operate it: bound the work, apply backpressure, make downstream calls survive failure, ship it in a minimal container, and work a goroutine-leak incident before it eats your memory.

Grounded RAG Service

A RAG demo that answers from a corpus is easy; a RAG service you'd trust in front of users is not. The hard part isn't retrieval, it's grounding: making the model say only what the retrieved text supports, attaching citations the reader can check, and proving with an eval set that the answers don't drift into confident fiction. You'll build the whole loop — chunk, embed, store, retrieve top-k, ground, cite, score — and feel exactly where it leaks.

Job scheduler

A cron + backoff job runner with at-least-once delivery, idempotent handlers, and visibility timeouts — so no job is silently lost even when workers crash mid-execution.

A Next.js app to production

Build a multi-tenant content app on the App Router — then run it: lock down auth and secrets, layer the caches, decide every edge-vs-node call, and work the incident when one tenant poisons a shared ISR page.

Mini OAuth 2.0 + PKCE login

Implement the authorization-code + PKCE flow end to end against a real provider, so you understand every redirect and token instead of trusting a library.

Async Python service, built and operated

Build an async FastAPI ingestion service that validates, pipelines, and survives load — then run it: package it, containerize it with correct PID-1 behaviour, and work the incident when a swallowed CancelledError quietly leaks tasks until the event loop starves.

Distributed rate limiter

React feature at scale

URL shortener at scale

Virtual data grid
