
Observability 2.0: wide events and the cost shift

Why Charity Majors and the Honeycomb team argue that the three pillars are a late-2000s storage artifact, how wide events collapse them, and when the 1.0 split is still cost-optimal.

A 40-service team spends two engineer-weeks per quarter on cardinality budgets and three hours per incident on cross-pillar pivots. The engineering cost of three separate observability backends now exceeds the SaaS bill of a unified one. The pillars were not wrong — they were pre-2020 cost-optimal.

The 2.0 critique

Charity Majors (Honeycomb co-founder, the engineer who put “observability” on the map for distributed systems in 2017) argues in her 2024–2025 blog series that the three pillars are not a natural taxonomy; they are an artifact of late-2000s storage limits.

Metrics existed because you could not afford to keep every event.
Logs existed because you could not aggregate them at query time.
Traces existed because neither of the other two carried cross-service causality.

With columnar storage engines, cheap object storage (roughly $0.023 per GB-month for S3 Standard), and modern query planners, all three collapse back into one shape: wide structured events. That means one record per unit of work, carrying every field including high-cardinality ones (user_id, build_sha, feature_flag_state, region), stored in a single columnar engine and queried with arbitrary slicing.
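As a minimal sketch of what "one record per unit of work" could look like at the emit site, the snippet below builds a single flat event for one request. The helper name `build_wide_event` and all field values are hypothetical, chosen only to mirror the fields named above; a real service would hand the record to its telemetry exporter rather than print it.

```python
import json
import time
import uuid


def build_wide_event(route: str, status: int, duration_ms: float, **attrs) -> dict:
    """Return one flat record describing a single unit of work (hypothetical helper)."""
    event = {
        "timestamp": time.time(),
        # Reuse an incoming trace_id if one was passed, otherwise start a new one.
        "trace_id": attrs.pop("trace_id", uuid.uuid4().hex),
        "route": route,
        "status": status,
        "duration_ms": duration_ms,
    }
    # High-cardinality fields ride along as ordinary columns on the same record.
    event.update(attrs)
    return event


event = build_wide_event(
    "/checkout",
    200,
    41.7,
    user_id="u-8841",
    build_sha="3f9c2ab",
    feature_flag_state="new_cart=on",
    region="eu-west-1",
)
print(json.dumps(event, sort_keys=True))
```

The point of the shape is that adding another attribute is just another key on the same record, not a new metric, log format, or span type.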

Honeycomb’s product is built on wide events. So are ClickHouse-based stacks (SigNoz, Sentry’s new backend, Cloudflare’s internal stack). The Honeycomb post “OpenTelemetry Is Not Three Pillars” explicitly recasts OTel signals as “ways to ship and store a single underlying telemetry model.”

What changes at the engineering level

In a 1.0 stack each request emits to three places:

Counter increments → metric SDK
Log lines → log SDK
Spans → trace SDK

Each signal is sampled or aggregated differently, fields are denormalised, and the join across them depends on shared join keys being plumbed consistently. Three billing meters, three operational surfaces.

In a 2.0 stack each request emits one wide event per service boundary — a JSON record with all attributes (user_id, route, status, duration, build_sha, feature_flags, trace_id) — and the backend computes metrics by GROUP BY at query time, retrieves logs by filtering, reconstructs traces by joining on trace_id. The columnar engine makes all three queries fast over the same data. Cardinality is no longer a cost driver in 2.0 — that is the headline shift.
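The query-time derivation described above can be sketched with an in-memory SQLite table. SQLite is row-oriented and stands in here only so the example runs anywhere; the table layout, service names, and sample rows are assumptions made for illustration, not any vendor's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE events (
           trace_id TEXT, service TEXT, route TEXT, status INTEGER,
           duration_ms REAL, user_id TEXT, build_sha TEXT, start_ms INTEGER
       )"""
)
# One wide event per service boundary; two services per checkout request.
rows = [
    ("t1", "gateway", "/checkout", 200, 120.0, "u-1", "3f9c2ab", 0),
    ("t1", "payments", "/charge", 200, 80.0, "u-1", "3f9c2ab", 15),
    ("t2", "gateway", "/checkout", 500, 900.0, "u-2", "a71d004", 0),
    ("t2", "payments", "/charge", 500, 870.0, "u-2", "a71d004", 12),
    ("t3", "gateway", "/cart", 200, 30.0, "u-1", "a71d004", 0),
]
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)

# "Metric": error count per build, computed by GROUP BY at query time.
errors_by_build = conn.execute(
    "SELECT build_sha, SUM(status >= 500) AS errors, COUNT(*) AS total "
    "FROM events GROUP BY build_sha ORDER BY build_sha"
).fetchall()

# "Logs": plain filtering on any field, including a high-cardinality one.
user_errors = conn.execute(
    "SELECT service, route, status FROM events "
    "WHERE user_id = ? AND status >= 500 ORDER BY start_ms",
    ("u-2",),
).fetchall()

# "Trace": every event sharing a trace_id, ordered by start time.
trace = conn.execute(
    "SELECT service, route, duration_ms FROM events "
    "WHERE trace_id = ? ORDER BY start_ms",
    ("t2",),
).fetchall()

print(errors_by_build)  # [('3f9c2ab', 0, 2), ('a71d004', 2, 3)]
print(user_errors)      # [('gateway', '/checkout', 500), ('payments', '/charge', 500)]
print(trace)            # [('gateway', '/checkout', 900.0), ('payments', '/charge', 870.0)]
```

All three answers come from the same five rows; nothing was pre-aggregated, so a question nobody anticipated (errors by build_sha, say) needs no new instrumentation.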

Property	Observability 1.0	Observability 2.0
Storage shape	Three separate backends	One columnar store
Billing	series-month + GB-ingest + span-month	~traffic-month + dimensionality
Cardinality as cost	Yes — OOM risk on TSDB	No — query-time GROUP BY
Cross-signal pivot	Join keys + exemplars required	Same data, arbitrary slice
Backend options	Many, open-source paths	Fewer, often SaaS-only
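The cardinality row in the table can be made concrete with a little arithmetic. In a label-based TSDB the worst-case number of series for one metric is the product of the distinct values of each label, while a wide-event store keeps one row per request regardless of how many fields the row carries. The label counts and request volume below are made-up numbers for illustration, and the product is an upper bound; real series counts depend on which combinations actually occur.

```python
from math import prod

# Hypothetical distinct-value counts per label on a single request metric.
label_cardinality = {"route": 50, "status": 5, "region": 4, "build_sha": 20}


def max_series(cardinalities: dict) -> int:
    """Worst-case series count: every label combination is observed."""
    return prod(cardinalities.values())


baseline_series = max_series(label_cardinality)
series_with_user_id = max_series({**label_cardinality, "user_id": 100_000})

# In a wide-event store the row count tracks traffic, not label combinations.
requests_per_day = 5_000_000  # assumed traffic
rows_without_user_id = requests_per_day
rows_with_user_id = requests_per_day  # one more column, same number of rows

print(baseline_series)      # 20000
print(series_with_user_id)  # 2000000000
print(rows_with_user_id == rows_without_user_id)  # True
```

Adding user_id multiplies the worst-case series count by 100,000 in the 1.0 model, which is why it is usually forbidden as a metric label; in the 2.0 model it widens each row but does not create more of them.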

Figure: 1.0 (top row) shows three pre-aggregated pillars in separate backends, each fed and queried on its own; pivoting across them needs shared join keys. 2.0 (bottom row) shows one columnar wide-event store, one row per request with every field, from which metrics (GROUP BY), traces (join on trace_id), and logs (filter) are all derived at query time over the same data.

The migration decision

The 2.0 architecture is not universally superior. The senior question is: “is the cost basis of 2.0 viable for my workload?”

2.0 wins when:

Cardinality budget overruns cost more engineer-hours than the bill saves
Cross-pillar pivot frustration bottlenecks incident response
Traffic volume and architectural sprawl drive the cost, not storage shape

1.0 wins when:

Very long retention (5+ years) for low-cardinality metrics — pre-aggregation is hard to beat on cold storage
Strong open-source preference, avoiding SaaS lock-in
Budget is dominated by metrics, not log or trace volume

Many shops run both: 2.0 backend for incident response and ad-hoc questions, 1.0 metrics tier (Prometheus + remote storage) for 5-year SLO trend dashboards and regulatory reports.

The two lists above share a common pattern, and the choice is economic, not ideological: 2.0 wins when the hidden engineering cost of discipline (cardinality budgets, cross-pillar pivots) exceeds the SaaS premium; 1.0 wins when cheap queries over cold, low-cardinality data dominate the bill. If you find yourself spending more time managing label budgets than writing features, the economic crossover has passed.
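A rough way to test where a team sits relative to that crossover is to price the hidden engineering hours and compare them with the extra vendor spend. The sketch below uses the lesson's opening figures (2 engineer-weeks per quarter on cardinality budgets, 3 hours per incident on pivots); the hourly rate, incident count, and SaaS premium are placeholder assumptions, and it optimistically treats a 2.0 backend as removing those hours entirely.

```python
def quarterly_engineering_cost(
    cardinality_weeks: float,
    pivot_hours_per_incident: float,
    incidents_per_quarter: int,
    hourly_rate: float,
    hours_per_week: float = 40.0,
) -> float:
    """Price the hours spent on label budgets and cross-pillar pivots in one quarter."""
    budget_hours = cardinality_weeks * hours_per_week
    pivot_hours = pivot_hours_per_incident * incidents_per_quarter
    return (budget_hours + pivot_hours) * hourly_rate


def crossover_passed(engineering_cost: float, saas_premium: float) -> bool:
    """True when the hidden engineering cost exceeds the extra 2.0 spend."""
    return engineering_cost > saas_premium


cost = quarterly_engineering_cost(
    cardinality_weeks=2,         # from the lesson's opening scenario
    pivot_hours_per_incident=3,  # from the lesson's opening scenario
    incidents_per_quarter=10,    # assumption
    hourly_rate=100.0,           # assumption, fully loaded USD per hour
)
saas_premium_per_quarter = 9_000.0  # assumption: extra cost of 2.0 over the 1.0 bill

print(cost)                                              # 11000.0
print(crossover_passed(cost, saas_premium_per_quarter))  # True
```

The inputs matter more than the formula: a team with fewer incidents or a larger premium lands on the other side of the comparison.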

The migration path is dual-write, not rip-and-replace: point OTel exporters at both backends for 60–90 days, build 2.0 dashboards alongside 1.0 ones, decommission 1.0 only after every team’s on-call run-books have moved.
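The dual-write idea can be sketched as a small fan-out wrapper: every batch goes to both backends, and a failure in one must not block delivery to the other. This is a plain-Python illustration with hypothetical names (`FanOutExporter`, `flaky_legacy`), not the OpenTelemetry API; in a real deployment the same fan-out is typically configured by listing more than one exporter in an OTel Collector pipeline.

```python
from typing import Callable, Dict, List

Exporter = Callable[[List[dict]], None]


class FanOutExporter:
    """Send each batch to every backend, isolating failures per backend."""

    def __init__(self, exporters: Dict[str, Exporter]) -> None:
        self._exporters = exporters
        self.failures: Dict[str, int] = {name: 0 for name in exporters}

    def export(self, batch: List[dict]) -> None:
        for name, exporter in self._exporters.items():
            try:
                exporter(batch)
            except Exception:
                # Count the failure and keep going so the other backend still gets data.
                self.failures[name] += 1


# In-memory lists stand in for the 1.0 and 2.0 backends.
legacy_backend: List[dict] = []
wide_event_backend: List[dict] = []


def flaky_legacy(batch: List[dict]) -> None:
    raise ConnectionError("legacy backend unavailable")


healthy = FanOutExporter(
    {"legacy_1_0": legacy_backend.extend, "wide_events_2_0": wide_event_backend.extend}
)
healthy.export([{"trace_id": "t1", "route": "/checkout", "status": 200}])

degraded = FanOutExporter(
    {"legacy_1_0": flaky_legacy, "wide_events_2_0": wide_event_backend.extend}
)
degraded.export([{"trace_id": "t2", "route": "/checkout", "status": 500}])

print(len(legacy_backend), len(wide_event_backend))  # 1 2
print(degraded.failures)  # {'legacy_1_0': 1, 'wide_events_2_0': 0}
```

Tracking failures per backend also gives a simple parity signal during the dual-write window: if the two backends diverge in what they received, dashboards built on them will not agree either.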

The vendor-lock-in question

1.0 instrumentation historically coupled application code to a vendor SDK (dd-trace, New Relic agent). Switching vendors meant rewriting instrumentation. The OTel bet: one open API + OTLP wire format = swap backends without touching application code.

In practice, OTel APIs cover traces and metrics well. The OTel Logs API stabilised in late 2023, and the SDKs followed in late 2024. Semantic Conventions for HTTP and database are GA; messaging and FaaS are still maturing. The honest senior take: OTel is the right bet for new instrumentation in 2026 if portability matters. OTel SDKs can coexist with vendor SDKs during the transition.

Observability 2.0 capacity numbers

Honeycomb event ingestion ceiling per dataset: ~1B events / hour
ClickHouse-based observability typical compression: 10:1 to 30:1
Wide-event row size, fully-instrumented: ~2–10 KB / event
OTLP/gRPC (protobuf) vs OTLP/HTTP JSON: ~50–70% smaller on the wire, ~2–5x faster to encode
OTel Collector throughput (commodity box): ~50–200k spans/sec
Dual-write migration window (typical): 60–90 days
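These numbers combine into a back-of-envelope storage estimate. The sketch below takes the row size and compression ratio from the ranges above and the object-storage price quoted earlier in the lesson; the event rate and retention period are assumptions, and the result covers stored bytes only, not query compute, request charges, or vendor margin.

```python
def monthly_storage_cost(
    events_per_sec: float,
    bytes_per_event: float,
    compression_ratio: float,
    retention_days: int,
    usd_per_gb_month: float = 0.023,
) -> float:
    """Steady-state monthly cost of keeping retention_days of compressed events."""
    raw_bytes = events_per_sec * 86_400 * retention_days * bytes_per_event
    stored_gb = raw_bytes / compression_ratio / 1e9
    return stored_gb * usd_per_gb_month


cost = monthly_storage_cost(
    events_per_sec=10_000,   # assumption
    bytes_per_event=5_000,   # mid-range of the ~2-10 KB figure
    compression_ratio=20,    # mid-range of the 10:1 to 30:1 figure
    retention_days=30,       # assumption
)
print(round(cost, 2))  # 149.04
```

At these inputs roughly 130 TB of raw events compress to about 6.5 TB, which is why raw storage is rarely the dominant line item for short-retention wide events.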

Quiz

A 40-service team spends ~2 engineer-weeks/quarter on cardinality budgets and ~3 hours/incident on cross-pillar pivots. Which architecture is the better fit?

Quiz

What is the primary reason long-retention low-cardinality metrics are sometimes cheaper to keep in a 1.0 TSDB than in a 2.0 wide-event store?

Recall before you leave

01. Articulate the 1.0 vs 2.0 cost difference in billing terms, and name the workload where each is the better economic choice.
02. What are the three common failure modes during a 1.0 to 2.0 migration?
03. What is the irreversible engineering cost of staying on 1.0 when cardinality budgets are routinely exceeded?

Recap

The three-pillar taxonomy emerged from late-2000s storage cost cliffs: you could not afford raw event retention (metrics), real-time aggregation (logs), or full-fidelity causality on every request (traces). With columnar storage and cheap object storage, wide structured events — one record per unit of work with all attributes — collapse all three into one queryable surface. 2.0 pricing is ~traffic-month plus dimensionality; cardinality is no longer a cost driver, eliminating the productivity tax of label budgets. 2.0 wins for incident response and ad-hoc investigation at 30+ services. 1.0 wins for very-long-retention low-cardinality dashboards where pre-aggregation is still the cheapest cold-storage shape. Many mature shops run both, and the migration path is dual-write over 60–90 days. Now when someone on your team proposes a new metric label for user-level data, you will know whether to push back on cardinality grounds or to ask instead whether the team has crossed the threshold where a 2.0 backend would dissolve the constraint entirely.
