Sampling strategies and log cost

Crux: Success-path, pattern-based, and tail sampling are the three levers that keep the log bill proportional to incidents, not to traffic.

A team turns on structured logging. Six weeks later their log bill is three times higher than their compute bill. Every service, every request, every INFO line — indexed and billed. Logging costs more than serving traffic. The service is fine. The sampling policy is missing.

The cost equation

The cost of a log line is paid in three places: at write time (CPU + RAM for serialization), in transit (network egress + collector capacity), and at the backend (ingest GB + indexed-event count + retention bytes).

A modest service emitting one 1 KB log line per request at 1000 req/s writes about 1 MB/s, or roughly 86 GB/day. At hosted log pricing ($0.10/GB ingest plus indexed-event cost), the bill compounds fast across dozens of services.

Service load | Daily volume | Monthly ingest cost
1000 req/s, 1 KB/log | ~86 GB | ~$260 (one service)
Same, with 1-in-10 INFO sampling | ~10 GB | ~$30
10 services at 1000 req/s | ~860 GB | ~$2600/month raw
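
The arithmetic behind the table can be reproduced with a short sketch. The function names are illustrative, a month is taken as 30 days, and GB is decimal (1e9 bytes). The 98% INFO share used for the sampled row is an assumption chosen for illustration, not a figure from this lesson.

```python
SECONDS_PER_DAY = 86_400


def daily_volume_gb(req_per_sec: float, bytes_per_log: float,
                    logs_per_req: float = 1.0) -> float:
    """Raw log volume in decimal GB per day."""
    return req_per_sec * logs_per_req * bytes_per_log * SECONDS_PER_DAY / 1e9


def monthly_ingest_usd(gb_per_day: float, usd_per_gb: float = 0.10,
                       days: int = 30) -> float:
    """Ingest-only cost; indexed-event charges are billed separately."""
    return gb_per_day * days * usd_per_gb


def sampled_volume_gb(gb_per_day: float, info_share: float, n: int) -> float:
    """Volume after keeping 1-in-n INFO lines and 100% of everything else."""
    return gb_per_day * ((1.0 - info_share) + info_share / n)


raw = daily_volume_gb(req_per_sec=1000, bytes_per_log=1000)
print(round(raw, 1))                      # 86.4
print(round(monthly_ingest_usd(raw)))     # 259

# Assumed: 98% of lines are INFO, the rest are WARN/ERROR kept at full rate.
sampled = sampled_volume_gb(raw, info_share=0.98, n=10)
print(round(sampled, 1))                  # 10.2
print(round(monthly_ingest_usd(sampled))) # 31
```

Under that assumed mix, the unsampled WARN/ERROR lines are why the sampled row lands near 10 GB rather than exactly one tenth of 86 GB.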

Three sampling strategies

Success-path sampling keeps 1-in-N INFO lines for high-volume successful events while keeping 100% of WARN and ERROR. Typical N is 10 to 100. This is the first lever to pull: at N = 10 it cuts INFO volume by 90% without touching failure forensics.
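
A minimal sketch of success-path sampling, written as a plain Python filter rather than any real collector's configuration. The class name, the `level` field, and the choice of a deterministic counter (instead of random sampling) are assumptions made to keep the example easy to verify.

```python
from itertools import count


class SuccessPathSampler:
    """Keep 1-in-n INFO records; pass every other level through untouched."""

    def __init__(self, n: int = 10):
        self.n = n
        self._info_seen = count()  # counts INFO records only

    def keep(self, record: dict) -> bool:
        if record.get("level") != "INFO":
            return True  # WARN, ERROR and anything unexpected stay at 100%
        # Deterministic 1-in-n: keep the 1st, (n+1)th, (2n+1)th INFO record.
        return next(self._info_seen) % self.n == 0


sampler = SuccessPathSampler(n=10)
records = [{"level": "INFO", "msg": "ok"}] * 100 + [{"level": "ERROR", "msg": "boom"}] * 5
kept = [r for r in records if sampler.keep(r)]
print(len(kept))  # 15: 10 INFO + 5 ERROR
```

A counter gives an exact rate per process; a random draw gives the same rate on average and avoids coordinating state across collector instances.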

Pattern-based sampling keeps every distinct log pattern at full rate and samples the duplicates. Vector and Fluent Bit ship sampling filters that hash the message template — so a “retry attempt 1 of 5” pattern is kept at 1-in-100 while “payment_declined” (rare) stays at 100%.
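
The idea can be sketched in a few lines of Python. This is not how Vector or Fluent Bit implement it; the template rule (replace every run of digits with a placeholder), the `full_rate_limit` parameter, and the never-reset counters are all simplifying assumptions. A real pipeline would typically reset counts per time window.

```python
import re
from collections import defaultdict

_DIGITS = re.compile(r"\d+")


def template_of(message: str) -> str:
    """Collapse variable numbers so repeated messages share one key."""
    return _DIGITS.sub("<n>", message)


class PatternSampler:
    """Keep the first few occurrences of each template, then 1-in-n."""

    def __init__(self, full_rate_limit: int = 5, n: int = 100):
        self.full_rate_limit = full_rate_limit
        self.n = n
        self._counts = defaultdict(int)

    def keep(self, message: str) -> bool:
        key = template_of(message)
        self._counts[key] += 1
        seen = self._counts[key]
        if seen <= self.full_rate_limit:
            return True  # rare patterns never reach the sampled regime
        return (seen - self.full_rate_limit) % self.n == 0


sampler = PatternSampler()
print(template_of("retry attempt 1 of 5"))  # retry attempt <n> of <n>
kept = sum(sampler.keep(f"retry attempt {i % 5 + 1} of 5") for i in range(1000))
print(kept)                                 # 14 of 1000 chatty lines survive
print(sampler.keep("payment_declined"))     # True
```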

Tail sampling for logs mirrors the trace pattern: buffer logs for a request window, then decide based on outcome. Keep all logs for failed requests and sample the successful ones. This is the most powerful strategy but requires a stateful buffer at the collector tier. As long as the buffer holds every record until the request outcome is known, it preserves full failure context while discarding up to 99% of success-path volume.
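
A minimal in-memory sketch of the buffering logic, assuming each record carries a request ID and that something signals the request outcome. The class and method names are hypothetical. Eviction on timeout and a memory cap, which any production buffer needs, are deliberately left out.

```python
from collections import defaultdict


class TailSampler:
    """Buffer records per request; decide what to ship once the outcome is known."""

    def __init__(self, success_keep_every: int = 100):
        self.success_keep_every = success_keep_every
        self._buffers = defaultdict(list)
        self._successes = 0

    def add(self, request_id: str, record: dict) -> None:
        self._buffers[request_id].append(record)

    def finish(self, request_id: str, failed: bool) -> list:
        """Return the records to ship for this request (possibly none)."""
        records = self._buffers.pop(request_id, [])
        if failed:
            return records  # failures keep their full log trail
        self._successes += 1
        if self._successes % self.success_keep_every == 0:
            return records  # 1-in-N successful requests kept whole
        return []


sampler = TailSampler(success_keep_every=100)
shipped = []
for i in range(200):
    sampler.add(f"req-{i}", {"msg": "start"})
    sampler.add(f"req-{i}", {"msg": "done"})
    shipped += sampler.finish(f"req-{i}", failed=False)
sampler.add("req-bad", {"msg": "start"})
sampler.add("req-bad", {"msg": "db timeout"})
shipped += sampler.finish("req-bad", failed=True)
print(len(shipped))  # 6: two sampled successes (2 records each) + 2 failure records
```

Sampling whole requests rather than individual lines means a kept success still reads as a complete story.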

Why this works

The pipeline tier (collector / agent) is the right place for sampling — not the application. Sampling at the collector keeps application code simple and lets the platform team manage the policy centrally. The anti-pattern is baking sampling into each service individually, which fragments the policy and makes it hard to change consistently across the fleet.

The shipping pipeline

Logs travel through three stages: emit (the application writes JSON to stdout or a logger SDK), collect (a sidecar agent or DaemonSet reads stdout, parses JSON, batches, applies sampling), ship (OTLP-HTTP or native protocol to the backend).

The collector layer — Fluent Bit, Vector, OTel Collector with the filelog receiver — does three things you do not want in the application: backpressure (buffer on disk if the backend is slow), enrichment (attach resource attributes from pod metadata), and redaction (strip PII patterns before they leave the host).
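
To make enrichment and redaction concrete, here is a small Python sketch of what a collector does to each JSON line. It is illustrative only: the email regex is a deliberately simple stand-in for real PII rules, the `resource` key and the pod attribute name are assumptions, and backpressure (disk buffering) is not shown.

```python
import json
import re

# Simplistic email matcher, for illustration; real redaction rules are broader.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def process_line(line: str, resource: dict) -> str:
    """Parse one JSON log line, redact emails, attach resource attributes."""
    record = json.loads(line)
    for key, value in record.items():
        if isinstance(value, str):
            record[key] = EMAIL.sub("[REDACTED_EMAIL]", value)
    record["resource"] = resource  # enrichment from pod metadata
    return json.dumps(record, sort_keys=True)


raw = '{"level": "INFO", "msg": "login by alice@example.com"}'
print(process_line(raw, {"k8s.pod.name": "api-0"}))
```

Only top-level string fields are scanned here; nested objects would need a recursive walk.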

Production rule: emit JSON to stdout, let the platform handle everything after that.

Retention tiering: hot, warm, cold

Indexed log storage at $0.10-$1.00/GB-month is too expensive for multi-month retention at scale. Mature stacks tier:

  • Hot (last 7-15 days, fully indexed, sub-second query — Datadog Standard, Loki recent)
  • Warm (30-90 days, partially indexed or scan-only — Datadog Flex, Splunk Frozen-Searchable)
  • Cold (compliance retention in S3 or equivalent at $0.023/GB-month, restorable but not directly queryable)

An incident under 7 days old runs against hot tier with full query power. An investigation into “what happened 6 months ago” falls outside a 30-90 day warm window, so the data must first be restored from cold storage. Even within the warm window, queries may take minutes per scan and may not have every dimension indexed.
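
Which tier an investigation lands in is a function of how old the data is. The sketch below uses the upper bounds of the windows listed above (15 days hot, 90 days warm) as assumed cut-offs; real boundaries depend on the retention policy you configure.

```python
from datetime import timedelta

HOT_WINDOW = timedelta(days=15)   # assumed upper bound of the hot tier
WARM_WINDOW = timedelta(days=90)  # assumed upper bound of the warm tier


def tier_for(age: timedelta) -> str:
    """Return the storage tier that holds logs of the given age."""
    if age <= HOT_WINDOW:
        return "hot"   # fully indexed, sub-second queries
    if age <= WARM_WINDOW:
        return "warm"  # scan-only or partially indexed
    return "cold"      # must be restored before it can be queried


print(tier_for(timedelta(days=3)))   # hot
print(tier_for(timedelta(days=60)))  # warm
```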

Structured logs: cost and capacity numbers
Pino throughput (Node 24, 1 core): ~140k msg/sec
Winston throughput (same workload): ~20k msg/sec
Typical structured log size: ~0.5-2 KB
Service @ 1000 req/s, 1 log/req: ~86 GB / day
Datadog log ingest: ~$0.10 / GB
Datadog indexed events (standard tier): ~$1.27 / million
Hot tier retention typical: 7-15 days
Cold tier (S3) cost: ~$0.023 / GB-month
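
The per-event charge is easy to overlook next to the per-GB one. A rough sketch using the ~$1.27 per million figure above, assuming a 30-day month; the function name and the `indexed_fraction` parameter are illustrative.

```python
def monthly_indexed_usd(events_per_sec: float, usd_per_million: float = 1.27,
                        days: int = 30, indexed_fraction: float = 1.0) -> float:
    """Monthly cost of indexed events, separate from per-GB ingest."""
    events = events_per_sec * 86_400 * days * indexed_fraction
    return events / 1e6 * usd_per_million


# One service at 1000 req/s with one log line per request, everything indexed.
print(round(monthly_indexed_usd(1000)))                        # 3292
# Same service with only one tenth of events reaching the index.
print(round(monthly_indexed_usd(1000, indexed_fraction=0.1)))  # 329
```

At these list prices, indexing every event costs roughly an order of magnitude more than ingesting it, which is why sampling before the index matters more than the raw GB figure suggests.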
Quiz

A team applies 1-in-10 sampling to all log lines including ERROR. What is the problem?

Quiz

What is tail sampling for logs, and what makes it different from success-path sampling?

Order the steps

Order these log cost-control levers from cheapest to most complex to implement:

  1. Set INFO as production default, turn off DEBUG globally
  2. Apply success-path sampling (1-in-10 INFO, 100% WARN/ERROR) at the collector
  3. Add pattern-based sampling to collapse chatty duplicate patterns
  4. Configure retention tiering: hot 15d, warm 90d, cold S3
  5. Implement tail sampling with per-request buffering at a central collector gateway
Recall before you leave
  1. A service emits 1 KB JSON logs at 1000 req/s. What is the rough monthly ingest bill, and what is the standard cost-control lever to cut it by 90% without losing failure forensics?
  2. Why does sampling belong at the collector tier rather than in the application?
  3. What is the retention tiering model and why does the hot/warm/cold split matter for incident response?
Recap

Log cost compounds because every structured line is indexed and billed per event and per GB. A service at 1000 req/s emits ~86 GB/day — and most fleets have dozens of services. The three sampling levers: success-path sampling (1-in-10 INFO, 100% WARN/ERROR) cuts volume by 90% with zero loss of failure context; pattern-based sampling collapses chatty duplicate patterns at the collector; tail sampling buffers per-request logs and keeps everything for failures, sampling only successes. All three belong at the collector tier for centralized policy control. Pair sampling with retention tiering — hot (7-15d), warm (30-90d), cold (S3) — to keep the audit trail without the full indexed-storage bill.
