Observability
Structured logging: build a production logging pipeline
Reading about structured logging is not the same as standing up a pipeline that survives an audit and an incident. Build a small service, give it a real log schema, route the levels, sample the volume, redact the PII, correlate by trace_id, and split out the audit log — then prove each property with a query or a test, not a claim.
Turn the unit into a working pipeline: emit OTel-shaped JSON to stdout, let a collector handle sampling, redaction, and routing, and demonstrate that operational and audit logs end up in the right place with the right retention, immutability, access, and trace correlation.
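As a starting point for the "emit OTel-shaped JSON to stdout" step, here is a minimal sketch using only the Python standard library. The field names follow the OpenTelemetry log data model (Timestamp, SeverityText, SeverityNumber, Body, TraceId, Resource, Attributes) rather than any particular wire encoding. The names `current_trace_id`, `OTelJsonFormatter`, `build_logger`, and `emit_example` are hypothetical, and the sketch assumes the service sets the context variable from the inbound request.

```python
import contextvars
import io
import json
import logging
import sys

# Hypothetical context variable; a real service would set it from the inbound
# traceparent header or from its tracing SDK. All zeros means "no trace".
current_trace_id = contextvars.ContextVar("current_trace_id", default="0" * 32)

# Python level name -> (OTel SeverityText, OTel SeverityNumber).
SEVERITY = {
    "DEBUG": ("DEBUG", 5),
    "INFO": ("INFO", 9),
    "WARNING": ("WARN", 13),
    "ERROR": ("ERROR", 17),
    "CRITICAL": ("FATAL", 21),
}


class OTelJsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def __init__(self, service_name):
        super().__init__()
        self.service_name = service_name

    def format(self, record):
        text, number = SEVERITY.get(record.levelname, ("INFO", 9))
        line = {
            "Timestamp": int(record.created * 1e9),  # nanoseconds since epoch
            "SeverityText": text,
            "SeverityNumber": number,
            "Body": record.getMessage(),
            "TraceId": current_trace_id.get(),  # injected, never passed by hand
            "Resource": {"service.name": self.service_name},
            "Attributes": getattr(record, "attributes", {}),
        }
        return json.dumps(line)


def build_logger(stream=sys.stdout, service_name="demo-service"):
    logger = logging.getLogger("demo")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(OTelJsonFormatter(service_name))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def emit_example():
    """Emit one WARN line into a buffer and return it parsed."""
    buf = io.StringIO()
    log = build_logger(buf)
    token = current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
    try:
        log.warning("payment retry",
                    extra={"attributes": {"http.route": "/pay", "attempt": 2}})
    finally:
        current_trace_id.reset(token)
    return json.loads(buf.getvalue())
```

Context variables are copied into asyncio tasks when the task is created, so the trace_id follows the request there. Thread pools and message queues generally need the trace_id carried across explicitly, which is the async boundary the exercise asks you to prove.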
Take a small HTTP service (Node/Go/JVM/Python) and build a production-grade structured logging pipeline end to end: an OTel-shaped schema, level-driven routing, collector-tier sampling, two-layer PII redaction, automatic trace_id correlation, and a separated audit-log subsystem — proving each property with a query, a test, or a before/after measurement.
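The collector-tier sampling and routing decisions can be prototyped in plain Python before committing to a collector configuration. The sketch below is an illustration of the logic, not a collector plugin: `route`, `keep_record`, and `volume_table` are hypothetical names, the `log.type` attribute is an assumed convention for marking audit events, and the records are assumed to carry the OTel-shaped fields SeverityText, SeverityNumber, TraceId, and Attributes.

```python
import hashlib

WARN_SEVERITY_NUMBER = 13  # OTel SeverityNumber for WARN


def route(record):
    """Assumed routing contract: audit events go to their own index."""
    if record.get("Attributes", {}).get("log.type") == "audit":
        return "audit"
    return "operational"


def keep_record(record, info_keep_ratio=0.10):
    """Severity-aware sampling: never drop audit events or WARN and above."""
    if route(record) == "audit":
        return True
    if record["SeverityNumber"] >= WARN_SEVERITY_NUMBER:
        return True
    # Hash the trace_id so all lines of one trace are kept or dropped together
    # and the decision is reproducible across collector instances.
    digest = hashlib.sha256(record["TraceId"].encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < info_keep_ratio


def volume_table(records, info_keep_ratio=0.10):
    """Return (before, after) line counts per severity for a batch."""
    before, after = {}, {}
    for record in records:
        severity = record["SeverityText"]
        before[severity] = before.get(severity, 0) + 1
        if keep_record(record, info_keep_ratio):
            after[severity] = after.get(severity, 0) + 1
    return before, after
```

Hashing the trace_id rather than drawing a random number per line is a design choice: a sampled-in request keeps its whole story, which matters when the surviving INFO lines are read next to a trace.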
- A captured log line from each endpoint showing the full OTel schema, correct level, and a non-zero trace_id — including the async path's line carrying the inbound trace_id.
- A before/after volume table showing that success-path sampling cut INFO volume by ~90%, plus a query confirming that 100% of WARN/ERROR lines survived.
- Evidence that the sensitive endpoint leaks no PII: a query over the indexed logs for the test email/password/token returns zero hits, and the redaction holds even when the test deliberately dumps the whole request body on error.
- The CWE-117 (log injection) test passes: a comment containing a forged newline produces exactly one log record, not two.
- An audit event lands in the audit index (not the operational index), and a query shows that the two indexes have different retention and access policies.
- A one-paragraph write-up: which layer caught what (source deny-list vs collector scrubber), why sampling is severity-aware, and what breaks if audit and operational logs share an index.
- Add a trace_id-health panel and alert: compute, per service, the fraction of lines whose trace_id is all zeros, alert when it exceeds 1%, then deliberately break an async path and watch the alert fire.
- Add retention tiering: hot (7-15 days, fully indexed), warm (30-90 days, scan-only), and cold (S3 at roughly $0.023/GB-month), then show an old-incident query running against the warm tier.
- Add pattern-based sampling that collapses a chatty duplicate log template at the collector while keeping rare patterns at 100%, and measure the additional volume cut.
- Make the audit log tamper-evident: hash-chain or sign each audit record and write a verifier that detects a single altered entry.
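For the tamper-evident audit log, one possible shape is a SHA-256 hash chain in which each entry commits to the previous entry's hash. This is a minimal sketch under stated assumptions: the function names and the in-memory list are hypothetical, and a real subsystem would persist entries to append-only storage and anchor the latest hash somewhere the writer cannot modify.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the first entry's prev_hash


def _digest(prev_hash, event):
    # Canonical JSON so the same event always hashes to the same value.
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_audit(chain, event):
    """Append an event, linking it to the hash of the previous entry."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    entry = {
        "event": event,
        "prev_hash": prev_hash,
        "hash": _digest(prev_hash, event),
    }
    chain.append(entry)
    return entry


def verify_chain(chain):
    """Return the index of the first bad entry, or None if the chain is intact."""
    prev_hash = GENESIS
    for index, entry in enumerate(chain):
        expected = _digest(prev_hash, entry["event"])
        if entry["prev_hash"] != prev_hash or entry["hash"] != expected:
            return index
        prev_hash = entry["hash"]
    return None
```

Altering or removing an entry in the middle breaks every link after it, so the verifier reports the first affected index. Truncating the tail is not detectable from the chain alone, which is why the latest hash should also be recorded outside the log.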
This is the pipeline you stand up for every real service: emit OTel-shaped JSON to stdout, classify levels so routing follows the contract, auto-inject trace_id and survive async boundaries, redact PII at the source and again at the collector, prevent log injection structurally by passing input as typed fields, sample at the collector while keeping all WARN/ERROR, and split the audit log into its own pipeline with the right retention, immutability, and access. Doing it once on a toy service — with a query or test proving each property — makes the production version muscle memory and audit-ready.
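Two of the properties above, two-layer redaction and structural prevention of log injection, can be checked with a few lines of Python. The sketch below is illustrative only: the deny-list keys, the email pattern, and the function names are assumptions, and the "collector" layer is simulated by a function that re-parses the emitted line.

```python
import json
import re

# Hypothetical source-side deny-list; adjust to the service's real field names.
SENSITIVE_KEYS = {"password", "token", "authorization", "email"}
# Simplified email pattern used as the collector-side backstop.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_fields(value):
    """Source layer: mask values of deny-listed keys, recursing into nesting."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS
            else redact_fields(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact_fields(item) for item in value]
    return value


def source_line(body, attributes):
    """Serialize one record; user input travels only as typed field values,
    so json.dumps escapes any newline instead of starting a new line."""
    return json.dumps({"Body": body, "Attributes": redact_fields(attributes)})


def _scrub_strings(value):
    if isinstance(value, str):
        return EMAIL_RE.sub("[REDACTED_EMAIL]", value)
    if isinstance(value, dict):
        return {key: _scrub_strings(item) for key, item in value.items()}
    if isinstance(value, list):
        return [_scrub_strings(item) for item in value]
    return value


def collector_scrub(line):
    """Collector layer: pattern-scrub every string value, then re-serialize.
    Scrubbing parsed values avoids corrupting JSON escape sequences."""
    return json.dumps(_scrub_strings(json.loads(line)))
```

The source layer catches PII stored under known field names; the collector layer catches PII that slips into free text, such as an email typed into a comment or a note. Running the two separately in a test makes it easy to report which layer caught what.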