Observability
Sampling strategies and log cost
A team turns on structured logging. Six weeks later their log bill is three times higher than their compute bill. Every service, every request, every INFO line — indexed and billed. Logging costs more than serving traffic. The service is fine. The sampling policy is missing.
The cost equation
The cost of a log line is paid in three places: at write time (CPU + RAM for serialization), in transit (network egress + collector capacity), and at the backend (ingest GB + indexed-event count + retention bytes).
A modest service emitting one 1 KB log line per request at 1000 req/s produces about 1 MB/s of logs, or roughly 86 GB/day. At hosted log pricing ($0.10/GB ingest plus indexed-event cost), the bill compounds fast across dozens of services.
| Service load | Daily volume | Monthly ingest cost |
|---|---|---|
| 1000 req/s, 1 KB/log | ~86 GB | ~$260 (one service) |
| Same, with 1-in-10 INFO sampling | ~10 GB | ~$30 |
| 10 services at 1000 req/s | ~860 GB | ~$2600/month raw |
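The arithmetic behind the table can be reproduced with a few lines of Python. This is a minimal sketch: the function names are illustrative, the $0.10/GB price is the ingest figure quoted above, a month is taken as 30 days, 1 KB is treated as 1000 bytes, and `info_fraction` (the share of lines that are INFO) is an assumption you would measure for your own service. Indexed-event charges are not included.

```python
SECONDS_PER_DAY = 86_400


def daily_volume_gb(req_per_sec: float, bytes_per_log: float, logs_per_req: float = 1.0) -> float:
    """Raw log volume per day in GB (1 GB = 1e9 bytes)."""
    return req_per_sec * logs_per_req * bytes_per_log * SECONDS_PER_DAY / 1e9


def monthly_ingest_cost(daily_gb: float, price_per_gb: float = 0.10, days: int = 30) -> float:
    """Ingest-only cost; indexed-event pricing would be added on top."""
    return daily_gb * price_per_gb * days


def sampled_daily_gb(daily_gb: float, info_fraction: float, n: int) -> float:
    """Volume after keeping 1-in-n INFO lines and 100% of everything else.

    info_fraction is an assumed share of INFO lines in the raw stream.
    """
    info = daily_gb * info_fraction
    other = daily_gb - info
    return info / n + other


raw = daily_volume_gb(req_per_sec=1000, bytes_per_log=1000)
print(f"raw: {raw:.1f} GB/day, ${monthly_ingest_cost(raw):.0f}/month")
sampled = sampled_daily_gb(raw, info_fraction=1.0, n=10)
print(f"sampled: {sampled:.1f} GB/day, ${monthly_ingest_cost(sampled):.0f}/month")
```

If a small share of lines is WARN or ERROR and therefore kept at full rate, the sampled volume lands slightly above one tenth of the raw volume.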
Three sampling strategies
Success-path sampling keeps 1-in-N INFO lines for high-volume successful events while keeping 100% of WARN and ERROR. Typical N is 10 to 100. This is the first lever to pull: at N = 10 it cuts INFO volume by 90%, and at N = 100 by 99%, without touching failure forensics.
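The decision rule is small enough to sketch directly. The class below is a hypothetical illustration, not the API of any particular collector: it assumes each record is a dict with a `level` field, treats DEBUG and INFO as the sampled levels, and uses a counter rather than a random draw so the kept fraction is exact.

```python
SAMPLED_LEVELS = {"DEBUG", "INFO"}  # assumption: everything else is kept at 100%


class SuccessPathSampler:
    """Keep 1-in-n records at sampled levels; keep every other record."""

    def __init__(self, n: int) -> None:
        if n < 1:
            raise ValueError("n must be >= 1")
        self.n = n
        self._seen = 0  # number of sampled-level records observed so far

    def keep(self, record: dict) -> bool:
        level = str(record.get("level", "INFO")).upper()
        if level not in SAMPLED_LEVELS:
            return True  # WARN, ERROR and anything unknown pass through untouched
        keep = self._seen % self.n == 0
        self._seen += 1
        return keep


sampler = SuccessPathSampler(n=10)
kept_info = sum(sampler.keep({"level": "INFO"}) for _ in range(1000))
kept_error = sum(sampler.keep({"level": "ERROR"}) for _ in range(50))
print(kept_info, kept_error)
```

A random draw (`random.random() < 1 / n`) gives the same rate on average and avoids shared state, at the price of some variance in small windows.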
Pattern-based sampling keeps every distinct log pattern at full rate and samples the duplicates. Vector and Fluent Bit ship sampling filters that hash the message template — so a “retry attempt 1 of 5” pattern is kept at 1-in-100 while “payment_declined” (rare) stays at 100%.
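To make the idea concrete, here is a minimal sketch of template-keyed sampling. It is not the configuration or API of Vector or Fluent Bit. The regular expression used to derive a template, the `<*>` placeholder, and the `full_rate_first` parameter are all assumptions chosen for illustration.

```python
import re
from collections import defaultdict

# Assumption: long hex strings and runs of digits are the variable parts of a message.
_VARIABLE = re.compile(r"\b[0-9a-fA-F]{8,}\b|\d+")


def template_of(message: str) -> str:
    """Collapse variable tokens so repeated messages share one template."""
    return _VARIABLE.sub("<*>", message)


class PatternSampler:
    """Keep the first few occurrences of each template, then 1-in-n of the rest."""

    def __init__(self, n: int, full_rate_first: int = 5) -> None:
        self.n = n
        self.full_rate_first = full_rate_first
        # Unbounded in this sketch; a real pipeline would cap or expire entries.
        self._counts: dict = defaultdict(int)

    def keep(self, message: str) -> bool:
        key = template_of(message)
        seen = self._counts[key]
        self._counts[key] = seen + 1
        return seen < self.full_rate_first or seen % self.n == 0


sampler = PatternSampler(n=100, full_rate_first=1)
chatty = sum(sampler.keep(f"retry attempt {i % 5 + 1} of 5") for i in range(1000))
rare = sum(sampler.keep("payment_declined") for _ in range(1))
print(chatty, rare)
```

Because the counter is per template, a rare pattern never reaches the sampled regime while a chatty one is thinned almost immediately.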
Tail sampling for logs mirrors the trace pattern: buffer logs for a request window, then decide based on outcome. Keep all logs for failed requests and sample the successful ones. This is the most powerful strategy but requires a stateful buffer at the collector tier. As long as the buffer does not overflow or expire before the request completes, it preserves full failure context while discarding up to 99% of success-path volume.
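The buffer-then-decide logic can be sketched in a few lines. This is an illustration under stated assumptions, not a production collector: records are dicts, a request counts as failed if any of its records is at ERROR level, the caller signals completion with `finish`, and the class and method names are hypothetical. A real implementation also needs a timeout and a memory cap for requests that never finish.

```python
from collections import defaultdict

FAILURE_LEVELS = {"ERROR"}  # assumption: one ERROR record marks the request as failed


class TailSampler:
    """Buffer records per request; decide what to emit once the request finishes."""

    def __init__(self, success_keep_every: int) -> None:
        self.n = success_keep_every
        self._buffers: dict = defaultdict(list)
        self._successes = 0  # successful requests seen so far

    def add(self, request_id: str, record: dict) -> None:
        self._buffers[request_id].append(record)

    def finish(self, request_id: str) -> list:
        """Return the records to ship for this request (possibly none)."""
        records = self._buffers.pop(request_id, [])
        if any(str(r.get("level", "")).upper() in FAILURE_LEVELS for r in records):
            return records  # failed request: keep the full context
        keep = self._successes % self.n == 0
        self._successes += 1
        return records if keep else []


sampler = TailSampler(success_keep_every=100)
sampler.add("req-1", {"level": "INFO", "msg": "start"})
sampler.add("req-1", {"level": "ERROR", "msg": "upstream timeout"})
print(len(sampler.finish("req-1")))
```

The difference from success-path sampling is the unit of decision: whole requests are kept or dropped, so a kept request always has its complete log trail.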
Why this works
The pipeline tier (collector / agent) is the right place for sampling — not the application. Sampling at the collector keeps application code simple and lets the platform team manage the policy centrally. The anti-pattern is baking sampling into each service individually, which fragments the policy and makes it hard to change consistently across the fleet.
The shipping pipeline
Logs travel through three stages: emit (the application writes JSON to stdout or a logger SDK), collect (a sidecar agent or DaemonSet reads stdout, parses JSON, batches, applies sampling), ship (OTLP-HTTP or native protocol to the backend).
The collector layer — Fluent Bit, Vector, OTel Collector with the filelog receiver — does three things you do not want in the application: backpressure (buffer on disk if the backend is slow), enrichment (attach resource attributes from pod metadata), and redaction (strip PII patterns before they leave the host).
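Enrichment and redaction are easy to picture as a pure function over one record. The sketch below is illustrative only: the email pattern stands in for whatever PII patterns apply to your data, the `[REDACTED_EMAIL]` marker and the `resource` key are assumptions, and real collectors express this as configuration rather than Python.

```python
import re

# Assumption: email addresses are the only PII pattern handled in this sketch.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(value):
    """Recursively mask email addresses in strings, dicts and lists."""
    if isinstance(value, str):
        return EMAIL.sub("[REDACTED_EMAIL]", value)
    if isinstance(value, dict):
        return {k: redact(v) for k, v in value.items()}
    if isinstance(value, list):
        return [redact(v) for v in value]
    return value


def process(record: dict, resource: dict) -> dict:
    """Redact first, then attach resource attributes (e.g. pod metadata)."""
    out = redact(record)
    out["resource"] = dict(resource)
    return out


record = {"level": "INFO", "msg": "login by bob@example.com", "attempt": 1}
print(process(record, {"k8s.pod.name": "api-1"}))
```

Running redaction before the record leaves the host is the point: once a line reaches the backend, removing it usually means a deletion request rather than a regex.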
Production rule: emit JSON to stdout, let the platform handle everything after that.
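On the application side, that rule amounts to a formatter and a stdout handler. Here is a minimal sketch using only the Python standard library; the field names (`ts`, `level`, `logger`, `msg`) and the `fields` convention for structured attributes are assumptions, not a standard schema. Note that Python names the level `WARNING` rather than `WARN`.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        # Assumption: callers pass structured attributes via extra={"fields": {...}}.
        fields = getattr(record, "fields", None)
        if isinstance(fields, dict):
            payload.update(fields)
        return json.dumps(payload, separators=(",", ":"))


def build_logger(name: str = "app") -> logging.Logger:
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers = [handler]
    logger.propagate = False  # avoid duplicate lines via the root logger
    return logger


log = build_logger()
log.info("request served", extra={"fields": {"route": "/checkout", "status": 200}})
```

The application makes no sampling, batching or shipping decisions here; those stay with the collector.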
Retention tiering: hot, warm, cold
Indexed log storage at $0.10-$1.00/GB-month is too expensive for multi-month retention at scale. Mature stacks tier:
- Hot (last 7-15 days, fully indexed, sub-second query — Datadog Standard, Loki recent)
- Warm (30-90 days, partially indexed or scan-only — Datadog Flex, Splunk Frozen-Searchable)
- Cold (compliance retention in S3 or equivalent at $0.023/GB-month, restorable but not directly queryable)
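An incident under 7 days old runs against the hot tier with full query power. An investigation that reaches back a month or two runs against the warm tier, where a scan may take minutes and not every dimension is indexed. A question about what happened 6 months ago falls outside the 30-90 day warm window: that data sits in cold storage and must be restored before it can be queried.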
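Which tier a query lands on follows directly from the age of the data. The helper below is a small sketch that assumes the upper ends of the ranges in the tier list (15 days hot, 90 days warm); the function name and the boundary constants are illustrative and would be set by your own retention policy.

```python
from datetime import timedelta

HOT_DAYS = 15   # assumption: upper end of the 7-15 day hot window
WARM_DAYS = 90  # assumption: upper end of the 30-90 day warm window


def tier_for_age(age: timedelta) -> str:
    """Map the age of a log record to the tier that holds it."""
    if age <= timedelta(days=HOT_DAYS):
        return "hot"   # fully indexed, sub-second queries
    if age <= timedelta(days=WARM_DAYS):
        return "warm"  # partially indexed or scan-only
    return "cold"      # object storage; restore before querying


print(tier_for_age(timedelta(days=3)))
print(tier_for_age(timedelta(days=60)))
```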
| Reference figure | Value |
|---|---|
| Pino throughput (Node 24, 1 core) | ~140k msg/sec |
| Winston throughput (same workload) | ~20k msg/sec |
| Typical structured log size | ~0.5-2 KB |
| Service @ 1000 req/s, 1 log/req | ~86 GB / day |
| Datadog log ingest | ~$0.10 / GB |
| Datadog indexed events (standard tier) | ~$1.27 / million |
| Hot tier retention typical | 7-15 days |
| Cold tier (S3) cost | ~$0.023 / GB-month |
A team applies 1-in-10 sampling to all log lines including ERROR. What is the problem?
What is tail sampling for logs, and what makes it different from success-path sampling?
Order these log cost-control levers from cheapest to most complex to implement:
1. Set INFO as production default, turn off DEBUG globally
2. Apply success-path sampling (1-in-10 INFO, 100% WARN/ERROR) at the collector
3. Add pattern-based sampling to collapse chatty duplicate patterns
4. Configure retention tiering: hot 15d, warm 90d, cold S3
5. Implement tail sampling with per-request buffering at a central collector gateway
- 01. A service emits 1 KB JSON logs at 1000 req/s. What is the rough monthly ingest bill, and what is the standard cost-control lever to cut it by 90% without losing failure forensics?
- 02. Why does sampling belong at the collector tier rather than in the application?
- 03. What is the retention tiering model and why does the hot/warm/cold split matter for incident response?
Log cost compounds because every structured line is indexed and billed per event and per GB. A service at 1000 req/s emits ~86 GB/day — and most fleets have dozens of services. The three sampling levers: success-path sampling (1-in-10 INFO, 100% WARN/ERROR) cuts volume by 90% with zero loss of failure context; pattern-based sampling collapses chatty duplicate patterns at the collector; tail sampling buffers per-request logs and keeps everything for failures, sampling only successes. All three belong at the collector tier for centralized policy control. Pair sampling with retention tiering — hot (7-15d), warm (30-90d), cold (S3) — to keep the audit trail without the full indexed-storage bill.