Observability 2.0: wide events and the cost shift

Crux: Why Charity Majors and the Honeycomb team argue that the three pillars are a late-2000s storage artifact, how wide events collapse them, and when the 1.0 split is still cost-optimal.

Consider a 40-service team that spends two engineer-weeks per quarter on cardinality budgets and three hours per incident on cross-pillar pivots. At that point the engineering cost of running three separate observability backends can exceed the SaaS bill of a unified one. The pillars were not wrong; they were the cost-optimal design before 2020.

The 2.0 critique

Charity Majors (Honeycomb co-founder, and the engineer who put “observability” on the map for distributed systems in 2017) argues in her 2024–2025 blog series that the three pillars are not a natural taxonomy; they are an artifact of late-2000s storage limits.

  • Metrics existed because you could not afford to keep every event.
  • Logs existed because you could not aggregate them at query time.
  • Traces existed because neither of the other two carried cross-service causality.

With columnar storage engines, cheap object storage (roughly $0.023 per GB-month on S3 Standard), and modern query planners, all three collapse back into one shape: wide structured events. That means one record per unit of work, carrying every field including high-cardinality ones (user_id, build_sha, feature_flag_state, region), stored in a single columnar engine and queried with arbitrary slicing.

Honeycomb’s product is this. ClickHouse-based stacks (SigNoz, Sentry’s new backend, Cloudflare’s internal stack) are this. The Honeycomb post “OpenTelemetry Is Not Three Pillars” explicitly recasts OTel signals as “ways to ship and store a single underlying telemetry model.”

What changes at the engineering level

In a 1.0 stack each request emits to three places:

  • Counter increments → metric SDK
  • Log lines → log SDK
  • Spans → trace SDK

Each signal is sampled or aggregated differently, fields are denormalised, and the join across them depends on shared join keys being plumbed consistently. Three billing meters, three operational surfaces.

In a 2.0 stack each request emits one wide event per service boundary — a JSON record with all attributes (user_id, route, status, duration, build_sha, feature_flags, trace_id) — and the backend computes metrics by GROUP BY at query time, retrieves logs by filtering, reconstructs traces by joining on trace_id. The columnar engine makes all three queries fast over the same data. Cardinality is no longer a cost driver in 2.0 — that is the headline shift.
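
To make the query-time derivation concrete, here is a minimal Python sketch, assuming a small in-memory list stands in for the columnar store. The event fields follow the attribute list above, except start_ms, which is added here so records can be ordered within a trace. The sample values and the helper names metric_by, logs_where, and trace are illustrative assumptions, not any vendor's API.

```python
from collections import defaultdict

# Hypothetical wide events: one record per unit of work, per service boundary.
EVENTS = [
    {"trace_id": "t1", "service": "gateway", "route": "/checkout", "status": 200,
     "duration_ms": 120, "user_id": "u42", "build_sha": "abc123", "start_ms": 0},
    {"trace_id": "t1", "service": "payments", "route": "/charge", "status": 200,
     "duration_ms": 80, "user_id": "u42", "build_sha": "def456", "start_ms": 20},
    {"trace_id": "t2", "service": "gateway", "route": "/checkout", "status": 500,
     "duration_ms": 300, "user_id": "u7", "build_sha": "abc123", "start_ms": 50},
    {"trace_id": "t2", "service": "payments", "route": "/charge", "status": 500,
     "duration_ms": 250, "user_id": "u7", "build_sha": "def456", "start_ms": 60},
]


def metric_by(events, key):
    """'Metrics': GROUP BY an arbitrary field, aggregated at query time."""
    groups = defaultdict(list)
    for event in events:
        groups[event[key]].append(event)
    return {
        value: {
            "count": len(rows),
            "errors": sum(1 for e in rows if e["status"] >= 500),
            "avg_ms": sum(e["duration_ms"] for e in rows) / len(rows),
        }
        for value, rows in groups.items()
    }


def logs_where(events, **filters):
    """'Logs': filter the same records on any combination of fields."""
    return [e for e in events if all(e.get(k) == v for k, v in filters.items())]


def trace(events, trace_id):
    """'Traces': join on trace_id and order the records by start time."""
    return sorted(logs_where(events, trace_id=trace_id), key=lambda e: e["start_ms"])


by_route = metric_by(EVENTS, "route")      # low-cardinality slice
by_user = metric_by(EVENTS, "user_id")     # high-cardinality slice, same code path
failed = logs_where(EVENTS, status=500)
t2_path = [e["service"] for e in trace(EVENTS, "t2")]
```

Grouping by user_id costs the same code path as grouping by route. That is the sense in which cardinality stops being a design-time constraint: no field has to be declared or budgeted before it can be sliced on.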

| Property | Observability 1.0 | Observability 2.0 |
|---|---|---|
| Storage shape | Three separate backends | One columnar store |
| Billing | series-month + GB-ingest + span-month | ~traffic-month + dimensionality |
| Cardinality as cost | Yes (OOM risk on TSDB) | No (query-time GROUP BY) |
| Cross-signal pivot | Join keys + exemplars required | Same data, arbitrary slice |
| Backend options | Many, open-source paths | Fewer, often SaaS-only |
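
The "cardinality as cost" row can be illustrated with rough arithmetic. The sketch below assumes the worst case, in which every combination of label values actually occurs, so the number of time series for one metric is the product of the per-label cardinalities. The label names and counts are hypothetical.

```python
from math import prod


def series_count(label_cardinalities):
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical label sets for a single request-duration metric.
base = {"route": 50, "status": 5, "region": 4}
with_user = {**base, "user_id": 100_000}

base_series = series_count(base)            # 50 * 5 * 4
exploded_series = series_count(with_user)   # the same, times 100_000 users
growth = exploded_series // base_series
```

Adding one high-cardinality label multiplies the series count in a TSDB, which is where the memory pressure comes from. In a wide-event store the row count tracks request volume instead, and user_id is just one more column on rows that already exist.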

The migration decision

The 2.0 architecture is not universally superior. The senior question is: “is the cost basis of 2.0 viable for my workload?”

2.0 wins when:

  • Cardinality budget overruns cost more engineer-hours than the bill saves
  • Cross-pillar pivot frustration bottlenecks incident response
  • Traffic volume and architectural sprawl drive the cost, not storage shape
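
The first two conditions can be turned into a rough break-even check. This is a sketch under stated assumptions: the 2 engineer-weeks per quarter and 3 hours per incident come from the scenario at the top of this piece, while the 40-hour week, the incident count, the loaded hourly rate, and the extra annual bill are hypothetical inputs to replace with your own numbers.

```python
HOURS_PER_ENGINEER_WEEK = 40  # assumption
QUARTERS_PER_YEAR = 4


def annual_toil_hours(budget_weeks_per_quarter, pivot_hours_per_incident, incidents_per_year):
    """Hours per year spent on cardinality budgets plus cross-pillar pivots."""
    budget_hours = budget_weeks_per_quarter * HOURS_PER_ENGINEER_WEEK * QUARTERS_PER_YEAR
    pivot_hours = pivot_hours_per_incident * incidents_per_year
    return budget_hours + pivot_hours


def favours_unified_backend(toil_hours, loaded_hourly_rate, extra_annual_bill):
    """True when the engineering time recovered is worth more than the extra bill."""
    return toil_hours * loaded_hourly_rate > extra_annual_bill


# Hypothetical: 24 incidents a year, $100/hour loaded cost, $30k/year extra SaaS spend.
toil = annual_toil_hours(2, 3, incidents_per_year=24)
verdict = favours_unified_backend(toil, loaded_hourly_rate=100, extra_annual_bill=30_000)
```

The calculation treats all of the toil as recoverable, so it is an upper bound on the benefit. A migration rarely removes every pivot hour, and it adds its own one-off cost.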

1.0 wins when:

  • Very long retention (5+ years) for low-cardinality metrics — pre-aggregation is hard to beat on cold storage
  • Strong open-source preference, avoiding SaaS lock-in
  • Budget is dominated by metrics, not log or trace volume
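
The long-retention case rests on pre-aggregation, which a short sketch can make tangible. It assumes a 15-second scrape interval and hourly rollups that keep min, max, sum, and count per bucket; both choices are illustrative, not a description of any particular TSDB's downsampling.

```python
def rollup(samples, bucket_seconds):
    """Downsample (timestamp, value) pairs into per-bucket (min, max, sum, count)."""
    buckets = {}
    for ts, value in samples:
        start = ts - ts % bucket_seconds
        lo, hi, total, n = buckets.get(start, (value, value, 0.0, 0))
        buckets[start] = (min(lo, value), max(hi, value), total + value, n + 1)
    return buckets


# One day of a single series sampled every 15 seconds (synthetic values).
samples = [(t, float(t % 60)) for t in range(0, 86_400, 15)]
hourly = rollup(samples, 3600)

reduction = len(samples) / len(hourly)  # raw points per stored rollup row
```

One day of raw samples shrinks to 24 rows per series, and the ratio holds for every further day of retention. The trade-off is that any dimension not kept at rollup time is gone, which is acceptable for low-cardinality SLO trends and not for ad-hoc investigation.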

Many shops run both: 2.0 backend for incident response and ad-hoc questions, 1.0 metrics tier (Prometheus + remote storage) for 5-year SLO trend dashboards and regulatory reports.

The migration path is dual-write, not rip-and-replace: point OTel exporters at both backends for 60–90 days, build 2.0 dashboards alongside 1.0 ones, decommission 1.0 only after every team’s on-call run-books have moved.
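
The dual-write idea can be sketched in a few lines. The classes below are hypothetical stand-ins written for illustration; in a real deployment the fan-out would usually live in exporter or collector configuration rather than in application code. The parity check is one possible signal for when decommissioning is safe, not a prescribed procedure.

```python
class MemorySink:
    """Hypothetical stand-in for a backend exporter; records what it receives."""

    def __init__(self, name):
        self.name = name
        self.received = []

    def export(self, event):
        self.received.append(event)


class DualWriter:
    """Send each event to every sink; a failing sink must not block the others."""

    def __init__(self, *sinks):
        self.sinks = sinks
        self.failures = {sink.name: 0 for sink in sinks}

    def emit(self, event):
        for sink in self.sinks:
            try:
                sink.export(event)
            except Exception:
                # Count the failure and keep going so the other backend still gets data.
                self.failures[sink.name] += 1


def parity(old, new):
    """Fraction of trace_ids seen by the old backend that the new one also has."""
    old_ids = {e["trace_id"] for e in old.received}
    new_ids = {e["trace_id"] for e in new.received}
    return len(old_ids & new_ids) / len(old_ids) if old_ids else 1.0


legacy, wide = MemorySink("legacy-1.0"), MemorySink("wide-2.0")
writer = DualWriter(legacy, wide)
for i in range(100):
    writer.emit({"trace_id": f"t{i}", "route": "/checkout", "status": 200})

parity_ratio = parity(legacy, wide)
```

Tracking parity and per-sink failures over the dual-write window gives a data-driven complement to the run-book checklist: the old backend goes away only once the new one demonstrably sees the same traffic.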

The vendor-lock-in question

1.0 instrumentation historically coupled application code to a vendor SDK (dd-trace, New Relic agent). Switching vendors meant rewriting instrumentation. The OTel bet: one open API + OTLP wire format = swap backends without touching application code.

In practice, OTel APIs cover traces and metrics well. The OTel Logs API stabilised in late 2023; SDKs in late 2024. Semantic Conventions for HTTP and database are GA; messaging and FaaS are still maturing. The honest senior take: OTel is the right bet for new instrumentation in 2026 if portability matters. OTel SDKs coexist with vendor SDKs during transition.

Observability 2.0 capacity numbers

  • Honeycomb event ingestion ceiling per dataset: ~1B events / hour
  • ClickHouse-based observability typical compression: 10:1 to 30:1
  • Wide-event row size, fully instrumented: ~2–10 KB / event
  • OTLP-gRPC vs HTTP JSON wire size: ~50–70% smaller, ~2–5x faster encode
  • OTel Collector throughput (commodity box): ~50–200k spans/sec
  • Dual-write migration window (typical): 60–90 days

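
These figures combine into a back-of-envelope storage estimate. The sketch below uses the row-size and compression ranges listed here and the $0.023 object-storage price quoted earlier; the traffic volume of 1 billion events per day and the 30-day retention are hypothetical. It covers storage only and ignores query compute, request charges, and any vendor margin.

```python
def monthly_storage_usd(events_per_day, retention_days, bytes_per_event,
                        compression_ratio, usd_per_gb_month=0.023):
    """Rough monthly object-storage cost for the retained, compressed events."""
    raw_bytes = events_per_day * retention_days * bytes_per_event
    stored_gb = raw_bytes / compression_ratio / 1e9  # decimal gigabytes
    return stored_gb * usd_per_gb_month


EVENTS_PER_DAY = 1_000_000_000  # hypothetical traffic
RETENTION_DAYS = 30             # hypothetical retention

# Best case: small rows (2 KB) with strong compression (30:1).
low = monthly_storage_usd(EVENTS_PER_DAY, RETENTION_DAYS, 2_000, 30)
# Worst case: large rows (10 KB) with weak compression (10:1).
high = monthly_storage_usd(EVENTS_PER_DAY, RETENTION_DAYS, 10_000, 10)
```

Under these assumptions the raw storage line lands between roughly $46 and $690 per month, a spread of 15x driven entirely by row size and compression. It suggests that storage itself is rarely the dominant term; ingest, query compute, and pricing model matter more.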
Quiz

  1. A 40-service team spends ~2 engineer-weeks/quarter on cardinality budgets and ~3 hours/incident on cross-pillar pivots. Which architecture is the better fit?
  2. What is the primary reason long-retention low-cardinality metrics are sometimes cheaper to keep in a 1.0 TSDB than in a 2.0 wide-event store?

Recall before you leave
  1. Articulate the 1.0 vs 2.0 cost difference in billing terms, and name the workload where each is the better economic choice.
  2. What are the three common failure modes during a 1.0 to 2.0 migration?
  3. What is the irreversible engineering cost of staying on 1.0 when cardinality budgets are routinely exceeded?
Recap

The three-pillar taxonomy emerged from late-2000s storage cost cliffs: you could not afford raw event retention (metrics), real-time aggregation (logs), or full-fidelity causality on every request (traces). With columnar storage and cheap object storage, wide structured events — one record per unit of work with all attributes — collapse all three into one queryable surface. 2.0 pricing is ~traffic-month plus dimensionality; cardinality is no longer a cost driver, eliminating the productivity tax of label budgets. 2.0 wins for incident response and ad-hoc investigation at 30+ services. 1.0 wins for very-long-retention low-cardinality dashboards where pre-aggregation is still the cheapest cold-storage shape. Many mature shops run both, and the migration path is dual-write over 60–90 days.

