Observability OBS · 02 · 04

Sampling strategies and log cost

Success-path, pattern-based, and tail sampling are the three levers that keep the log bill proportional to incidents, not to traffic.

OBS Middle ◷ 12 min

A team turns on structured logging. Six weeks later their log bill is three times higher than their compute bill. Every service, every request, every INFO line — indexed and billed. Logging costs more than serving traffic. The service is fine. The sampling policy is missing.

The cost equation

The cost of a log line is paid in three places: at write time (CPU + RAM for serialization), in transit (network egress + collector capacity), and at the backend (ingest GB + indexed-event count + retention bytes).

A modest service handling 1000 req/s and emitting one 1 KB log line per request produces about 1 MB/s of logs, or roughly 86 GB/day. At hosted log pricing (about $0.10/GB ingest, plus indexed-event cost), the bill compounds fast across dozens of services.

Service load	Daily volume	Monthly ingest cost
1000 req/s, 1 KB/log	~86 GB	~$260 (one service)
Same, with 1-in-10 INFO sampling	~10 GB	~$30
10 services at 1000 req/s, 1 KB/log	~860 GB	~$2600 (unsampled)

One service costs about $260 a month unsampled and about $30 with 1-in-10 INFO sampling; ten unsampled services cost about $2600. A single cheap lever (1-in-10 INFO) cuts the bill by roughly 90%.
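The arithmetic behind the table can be sketched as follows. This is a minimal illustration, assuming 1 KB = 1000 bytes, a 30-day month, and the $0.10/GB ingest price quoted above; the function names `daily_gb` and `monthly_ingest_usd` are hypothetical helpers, not part of any vendor tooling.

```python
SECONDS_PER_DAY = 86_400


def daily_gb(req_per_s: float, bytes_per_log: float, keep_fraction: float = 1.0) -> float:
    """Daily log volume in decimal GB (1 GB = 1e9 bytes)."""
    return req_per_s * bytes_per_log * SECONDS_PER_DAY * keep_fraction / 1e9


def monthly_ingest_usd(gb_per_day: float, usd_per_gb: float = 0.10, days: int = 30) -> float:
    """Ingest-only cost; indexed-event and retention charges are not included."""
    return gb_per_day * days * usd_per_gb


raw = daily_gb(1000, 1000)                         # one service, unsampled
sampled = daily_gb(1000, 1000, keep_fraction=0.1)  # assumes every line is INFO

print(f"raw:     {raw:.1f} GB/day, ${monthly_ingest_usd(raw):.0f}/month")
print(f"sampled: {sampled:.1f} GB/day, ${monthly_ingest_usd(sampled):.0f}/month")
print(f"fleet:   ${monthly_ingest_usd(10 * raw):.0f}/month for 10 services")
```

The pure 1-in-10 case comes out slightly under the table's ~10 GB and ~$30, which is plausible because WARN and ERROR lines are not sampled and still count toward the daily volume.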

Three sampling strategies

Success-path sampling keeps 1-in-N INFO lines for high-volume successful events while keeping 100% of WARN and ERROR. Typical N is 10 to 100. This is the first lever to pull: at N = 10 it cuts INFO volume by 90% (99% at N = 100) without touching failure forensics.
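A minimal sketch of the success-path rule, assuming each log record is a dict with a `level` field. The class name `SuccessPathSampler` is hypothetical, and the deterministic counter is a simplification chosen so the behaviour is easy to verify; real collectors often sample randomly or by hash instead.

```python
ALWAYS_KEEP = {"WARN", "WARNING", "ERROR"}  # never sampled


class SuccessPathSampler:
    """Keep every WARN/ERROR record and 1-in-N of everything else."""

    def __init__(self, n: int = 10) -> None:
        if n < 1:
            raise ValueError("n must be >= 1")
        self.n = n
        self._seen = 0  # count of sampled-level records seen so far

    def keep(self, record: dict) -> bool:
        level = str(record.get("level", "INFO")).upper()
        if level in ALWAYS_KEEP:
            return True
        self._seen += 1
        # Keep the 1st, (N+1)th, (2N+1)th ... record of the sampled levels.
        return (self._seen - 1) % self.n == 0


sampler = SuccessPathSampler(n=10)
stream = [{"level": "INFO", "msg": "ok"}] * 100 + [{"level": "ERROR", "msg": "boom"}] * 5
kept = [r for r in stream if sampler.keep(r)]
print(len(kept))
```

In this run 10 of the 100 INFO records and all 5 ERROR records survive, so the failure lines are untouched while INFO volume drops by 90%.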

Pattern-based sampling keeps rare log patterns at full rate and samples the high-volume duplicates. Collectors such as Vector and Fluent Bit ship sampling or throttling filters that can be keyed on the message template, so a chatty “retry attempt 1 of 5” pattern is kept at 1-in-100 while a rare “payment_declined” stays at 100%.
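The idea can be sketched in a few lines. This is an illustration under stated assumptions, not how Vector or Fluent Bit implement it: the template is derived by replacing digit runs with a placeholder, the first `keep_first` occurrences of each template pass through, and later ones are kept at 1-in-N. The names `template_of` and `PatternSampler` are hypothetical, and a real collector would also reset the counters per time window.

```python
import re
from collections import Counter

_DIGITS = re.compile(r"\d+")


def template_of(message: str) -> str:
    """Collapse variable numbers so similar messages share one key."""
    return _DIGITS.sub("<n>", message)


class PatternSampler:
    """Keep the first few lines per template, then 1-in-N of the rest."""

    def __init__(self, keep_first: int = 10, then_one_in: int = 100) -> None:
        self.keep_first = keep_first
        self.then_one_in = then_one_in
        self._counts: Counter = Counter()

    def keep(self, message: str) -> bool:
        key = template_of(message)
        self._counts[key] += 1
        count = self._counts[key]
        if count <= self.keep_first:
            return True  # rare patterns never leave this branch
        return (count - self.keep_first) % self.then_one_in == 0


sampler = PatternSampler()
retries = [f"retry attempt {i % 5 + 1} of 5" for i in range(1000)]
declines = [f"payment_declined order={i}" for i in range(3)]
kept_retries = [m for m in retries if sampler.keep(m)]
kept_declines = [m for m in declines if sampler.keep(m)]
print(len(kept_retries), len(kept_declines))
```

With these settings the chatty retry pattern shrinks from 1000 lines to 19, while all three `payment_declined` lines are kept.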

Tail sampling for logs mirrors the trace pattern: buffer logs for a request window, then decide based on outcome, keeping all logs for failed requests and sampling the successful ones. This is the most powerful strategy but requires a stateful buffer at the collector tier. As long as the buffer window covers the whole request and the buffer does not overflow, it loses no failure context while discarding up to 99% of success-path volume.
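A minimal in-memory sketch of the buffer-then-decide step, assuming every record carries a request id and that the caller signals when a request finishes and whether it failed. The class `TailSampler` and its methods are hypothetical; a production collector would additionally need a timeout to evict unfinished requests and a bound on buffer memory.

```python
from collections import defaultdict


class TailSampler:
    """Buffer records per request, then keep or drop them by outcome."""

    def __init__(self, success_one_in: int = 100) -> None:
        self.success_one_in = success_one_in
        self._buffers: dict = defaultdict(list)
        self._successes = 0

    def add(self, request_id: str, record: dict) -> None:
        self._buffers[request_id].append(record)

    def finish(self, request_id: str, failed: bool) -> list:
        """Return the records to forward for a completed request."""
        records = self._buffers.pop(request_id, [])
        if failed:
            return records  # failures keep their full context
        self._successes += 1
        if (self._successes - 1) % self.success_one_in == 0:
            return records  # 1-in-N successful requests kept whole
        return []


sampler = TailSampler(success_one_in=100)
forwarded = []
for i in range(200):
    rid = f"req-{i}"
    for step in ("start", "db", "done"):
        sampler.add(rid, {"request_id": rid, "msg": step})
    forwarded += sampler.finish(rid, failed=(i in (50, 150)))
print(len(forwarded))
```

The decision is made per request, not per line, so a kept request always arrives with all of its lines, which is what distinguishes this from sampling individual INFO records.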

Together these three strategies give you control over cost at different points in the pipeline: success-path sampling is the cheap first lever, pattern-based sampling handles chatty duplicates the first lever misses, and tail sampling is the precision instrument when you need a zero-loss failure guarantee. Without at least the first lever, your log bill tracks traffic volume, not incidents.

The sampling gate at the collector keeps every ERROR and rare event at 100%, samples high-volume INFO successes down to 1-in-N, and forwards only the reduced stream to indexed storage.

Why this works

The pipeline tier (collector / agent) is the right place for sampling — not the application. Sampling at the collector keeps application code simple and lets the platform team manage the policy centrally. The anti-pattern is baking sampling into each service individually, which fragments the policy and makes it hard to change consistently across the fleet.

The shipping pipeline

Logs travel through three stages: emit (the application writes JSON to stdout or a logger SDK), collect (a sidecar agent or DaemonSet reads stdout, parses JSON, batches, applies sampling), ship (OTLP-HTTP or native protocol to the backend).

The collector layer — Fluent Bit, Vector, OTel Collector with the filelog receiver — does three things you do not want in the application: backpressure (buffer on disk if the backend is slow), enrichment (attach resource attributes from pod metadata), and redaction (strip PII patterns before they leave the host).

Production rule: emit JSON to stdout, let the platform handle everything after that.
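On the application side, the rule above can look roughly like this. It is a sketch using only the Python standard library; the field names (`ts`, `level`, `msg`, `request_id`) and the `fields` key passed through `extra` are assumptions for illustration, not a schema any particular collector requires.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        # Structured fields supplied via extra={"fields": {...}} (assumed convention).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(name: str = "app", stream=None) -> logging.Logger:
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = build_logger()
logger.info("order placed", extra={"fields": {"request_id": "r-1", "status": 200}})
```

Note that the application does no sampling, batching, or shipping here; it only writes one JSON line per event and leaves the rest to the collector.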

Retention tiering: hot, warm, cold

Indexed log storage at $0.10-$1.00/GB-month is too expensive for multi-month retention at scale. Mature stacks tier:

Hot (last 7-15 days, fully indexed, sub-second query — Datadog Standard, Loki recent)
Warm (30-90 days, partially indexed or scan-only — Datadog Flex, Splunk Frozen-Searchable)
Cold (compliance retention in S3 or equivalent at $0.023/GB-month, restorable but not directly queryable)

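An incident under 7 days old runs against the hot tier with full query power. An investigation into “what happened two months ago” needs warm-tier queries that may take minutes per scan and may not have every dimension indexed. Anything older than the warm window, such as “what happened 6 months ago”, has to be restored from cold storage before it can be queried.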
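The tier boundaries listed above can be expressed as a small lookup. This sketch assumes the upper ends of the quoted windows (hot up to 15 days, warm up to 90 days); the helpers `tier_for_age` and `monthly_storage_usd` are hypothetical, and the prices are the per-GB-month figures from this section.

```python
def tier_for_age(age_days: float, hot_days: int = 15, warm_days: int = 90) -> str:
    """Return which retention tier holds a log line of the given age."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days <= hot_days:
        return "hot"
    if age_days <= warm_days:
        return "warm"
    return "cold"


def monthly_storage_usd(stored_gb: float, usd_per_gb_month: float) -> float:
    """Storage cost for keeping stored_gb resident for one month."""
    return stored_gb * usd_per_gb_month


for age in (3, 60, 180):
    print(age, "days ->", tier_for_age(age))

one_year_gb = 365 * 86.4  # a year of logs from one 1000 req/s service
print(f"indexed at $0.10: ${monthly_storage_usd(one_year_gb, 0.10):.0f}/month")
print(f"S3 at $0.023:     ${monthly_storage_usd(one_year_gb, 0.023):.0f}/month")
```

Even at the low end of the indexed price range, keeping a full year hot costs several times what the same bytes cost in cold storage, which is the case for tiering.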

Structured logs: cost and capacity numbers

Pino throughput (Node 24, 1 core): ~140k msg/sec
Winston throughput (same workload): ~20k msg/sec
Typical structured log size: ~0.5-2 KB
Service @ 1000 req/s, 1 log/req: ~86 GB / day
Datadog log ingest: ~$0.10 / GB
Datadog indexed events (standard tier): ~$1.27 / million
Hot tier retention typical: 7-15 days
Cold tier (S3) cost: ~$0.023 / GB-month
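Combining the numbers above gives a rough feel for the other two cost terms, indexed events and write-time CPU. This is back-of-envelope arithmetic only, assuming a 30-day month, every line indexed at the quoted list price, and that benchmark throughput translates linearly into core usage (which real workloads will not match exactly). The function names are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def monthly_events(req_per_s: float, logs_per_req: float = 1, days: int = 30) -> float:
    """Number of log events produced in a month."""
    return req_per_s * logs_per_req * SECONDS_PER_DAY * days


def indexed_usd(events: float, usd_per_million: float = 1.27) -> float:
    """Cost if every event were indexed at the per-million price."""
    return events / 1e6 * usd_per_million


def core_share(msg_per_s: float, logger_msg_per_s: float) -> float:
    """Fraction of one core spent logging, assuming linear scaling."""
    return msg_per_s / logger_msg_per_s


events = monthly_events(1000)
print(f"events/month:   {events:,.0f}")
print(f"indexed cost:   ${indexed_usd(events):,.0f}/month")
print(f"Pino core use:  {core_share(1000, 140_000):.1%}")
print(f"Winston core:   {core_share(1000, 20_000):.1%}")
```

Under these assumptions the indexed-event charge for one unsampled service is on the order of $3,300 a month, well above its ingest charge, so sampling before indexing matters more than the per-GB price alone suggests.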

Quiz

A team applies 1-in-10 sampling to all log lines including ERROR. What is the problem?

Quiz

What is tail sampling for logs, and what makes it different from success-path sampling?

Order the steps

Order these log cost-control levers from cheapest to most complex to implement:

1 Set INFO as production default, turn off DEBUG globally
2 Apply success-path sampling (1-in-10 INFO, 100% WARN/ERROR) at the collector
3 Add pattern-based sampling to collapse chatty duplicate patterns
4 Configure retention tiering: hot 15d, warm 90d, cold S3
5 Implement tail sampling with per-request buffering at a central collector gateway

Recall before you leave

01 A service emits 1 KB JSON logs at 1000 req/s. What is the rough monthly ingest bill, and what is the standard cost-control lever to cut it by 90% without losing failure forensics?
02 Why does sampling belong at the collector tier rather than in the application?
03 What is the retention tiering model and why does the hot/warm/cold split matter for incident response?

Recap

Log cost compounds because every structured line is indexed and billed per event and per GB. A service at 1000 req/s emits ~86 GB/day, and most fleets have dozens of services. The three sampling levers: success-path sampling (1-in-10 INFO, 100% WARN/ERROR) cuts INFO volume by about 90% with no loss of failure context; pattern-based sampling collapses chatty duplicate patterns at the collector; tail sampling buffers per-request logs, keeps everything for failures, and samples only successes. All three belong at the collector tier for centralized policy control. Pair sampling with retention tiering (hot 7-15d, warm 30-90d, cold in S3) to keep the audit trail without the full indexed-storage bill. When you see a log bill that tracks your traffic curve instead of your incident curve, you know which lever to reach for first.
