Observability
What is OpenTelemetry: API, SDK, Collector, OTLP
A company switches observability backends to cut cost. With the vendor SDK hard-wired, the migration estimate is four engineer-weeks. With OTel, it is one day — change an exporter config block, redeploy the Collector, done.
The four pieces
Before OTel, every observability vendor shipped a proprietary agent — Datadog’s dd-trace, New Relic’s agent, AppDynamics’s Java agent — and changing vendor meant rewriting instrumentation across every service. The CNCF’s answer is OpenTelemetry: four pieces layered so each is replaceable and each speaks a stable contract.
API — language-specific public interfaces (Java’s io.opentelemetry.api, Python’s opentelemetry.trace, Go’s go.opentelemetry.io/otel). This is what application code and third-party libraries call. It is intentionally lightweight: the default implementation is a no-op. The application code never imports a vendor library — it imports the OTel API.
SDK — the runtime piece that turns API calls into telemetry records. The SDK owns sampling decisions (which traces to keep), batching (when to flush to the exporter), and serialization (OTLP wire format). It is installed by the application owner, not the library author.
Collector — a standalone process (binary or container) that receives OTLP, runs configurable processors, and exports to one or more backends. This is the policy layer: tail sampling, redaction, multi-backend routing — all in YAML, outside the application.
OTLP — the wire format. Protobuf-encoded messages over gRPC or HTTP, defined in the OTel specification and stable across versions. Any pair of OTel-aware components communicate over OTLP.
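To make the wire format concrete, here is a minimal sketch of one span in OTLP's JSON encoding, which OTLP/HTTP accepts alongside binary protobuf. The service name, scope name, and span name are made up for illustration, and the helper `otlp_trace_payload` is hypothetical rather than part of any OTel package.

```python
import json
import secrets
import time


def otlp_trace_payload(service_name: str, span_name: str,
                       start_ns: int, end_ns: int) -> dict:
    """Build a one-span OTLP/JSON trace export request (illustrative helper)."""
    return {
        "resourceSpans": [{
            # The resource describes who produced the telemetry.
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": service_name}},
            ]},
            "scopeSpans": [{
                # Hypothetical instrumentation scope name.
                "scope": {"name": "example.instrumentation"},
                "spans": [{
                    "traceId": secrets.token_hex(16),  # 16 bytes, hex-encoded
                    "spanId": secrets.token_hex(8),    # 8 bytes, hex-encoded
                    "name": span_name,
                    "kind": 2,                         # 2 = SPAN_KIND_SERVER
                    # 64-bit integers travel as strings in protobuf JSON.
                    "startTimeUnixNano": str(start_ns),
                    "endTimeUnixNano": str(end_ns),
                }],
            }],
        }]
    }


now = time.time_ns()
payload = otlp_trace_payload("checkout", "POST /charge", now, now + 5_000_000)
print(json.dumps(payload, indent=2))
```

A body like this would be POSTed to the `/v1/traces` path of an OTLP/HTTP receiver (port 4318 by default) with `Content-Type: application/json`. Because the request shape is the same for every receiver, the sender does not need to know which backend ultimately stores the span.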
| Piece | Lives in | Role | Replaceable? |
|---|---|---|---|
| API | Application code | Stable public interface; no-op by default | No: the stable contract that app and library code depend on |
| SDK | Application runtime | Builds records, samples, batches, serializes | Yes: swap the SDK without touching app code |
| Collector | Sidecar / gateway | Process, route, export telemetry | Yes: update config, not code |
| OTLP | Wire format | Carries spans/metrics/logs between pieces | No: the stable protocol all pieces speak |
The postal metaphor
Picture a national postal system. The API is the mailbox at your house — you drop a letter in, you do not care how it gets sorted. The SDK is the local post-office staff who pick up the letter, weigh it, stamp it, put it in the right outgoing bag. The Collector is the regional sorting depot where mail from many houses meets, gets batched, filtered, and routed by destination. OTLP is the standard envelope and address format every depot understands. Change the destination country (the backend vendor)? Same envelope, different routing table.
The portability story
Bea the platform engineer is told the company is moving from Datadog to Honeycomb to cut cost. She panics: “Do we rewrite every service?” Sven the backend developer reassures her: “We are on OTel — application code calls only the OTel API, the SDK emits OTLP to the Collector, and the Collector exports to whichever backend the config says. We change one block in the Collector YAML — exporter, endpoint, API key — and re-deploy. No app code changes.” Two days later the migration is done.
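The change Sven describes can be sketched as follows. Collector configuration is written in YAML; the Python dict below only mirrors that layout so the edit can be shown as code. The exporter settings, endpoint, header name, and environment variable names are assumptions for illustration, and `switch_exporter` is a hypothetical helper, not a Collector feature.

```python
import copy

# Stand-in for the Collector YAML before the migration (assumed layout).
collector_config = {
    "receivers": {"otlp": {"protocols": {"grpc": {}, "http": {}}}},
    "processors": {"batch": {}},
    "exporters": {
        "datadog": {"api": {"key": "${env:DD_API_KEY}"}},
    },
    "service": {
        "pipelines": {
            "traces": {
                "receivers": ["otlp"],
                "processors": ["batch"],
                "exporters": ["datadog"],
            }
        }
    },
}


def switch_exporter(config: dict, old_name: str,
                    new_name: str, new_settings: dict) -> dict:
    """Return a copy of config with one exporter swapped in every pipeline."""
    new_config = copy.deepcopy(config)
    del new_config["exporters"][old_name]
    new_config["exporters"][new_name] = new_settings
    for pipeline in new_config["service"]["pipelines"].values():
        pipeline["exporters"] = [
            new_name if name == old_name else name
            for name in pipeline["exporters"]
        ]
    return new_config


migrated = switch_exporter(
    collector_config,
    old_name="datadog",
    new_name="otlp/honeycomb",
    new_settings={
        "endpoint": "api.honeycomb.io:443",  # assumed endpoint
        "headers": {"x-honeycomb-team": "${env:HONEYCOMB_API_KEY}"},
    },
)

print(migrated["service"]["pipelines"]["traces"]["exporters"])
```

The receivers and processors sections are identical before and after, which is the point of the story: applications keep sending OTLP to the same receiver, and only the exporter block and the pipeline's exporter list change.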
Why this works
The API/SDK split solves the third-party library problem. Before OTel, a library that wanted to emit telemetry had to pick a vendor (locking its users to that vendor) or build its own abstraction. With OTel, a library depends only on the OTel API package — a small set of interfaces with a no-op default — so it can emit spans without forcing any SDK on its users. The application owner installs whichever SDK they choose at deploy time. This is why “instrument once, route anywhere” is architecturally sound rather than just marketing.
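The split can be illustrated with a toy model in plain Python. This is not the real `opentelemetry` package: `NoOpTracer`, `RecordingTracer`, `get_tracer`, `set_tracer`, and `fetch_user` are all hypothetical names chosen to show the shape of the idea under simplified assumptions.

```python
# --- "API" layer: the only thing a library depends on -------------------
class NoOpTracer:
    """Default tracer: accepts calls and records nothing."""

    def start_span(self, name: str) -> None:
        pass


_tracer = NoOpTracer()


def get_tracer():
    """Return whichever tracer is currently installed."""
    return _tracer


def set_tracer(tracer) -> None:
    """Called once by the application owner to install an implementation."""
    global _tracer
    _tracer = tracer


# --- "SDK" layer: chosen and installed by the application owner ---------
class RecordingTracer:
    """Stand-in for a real SDK: keeps the names of the spans it sees."""

    def __init__(self) -> None:
        self.span_names = []

    def start_span(self, name: str) -> None:
        self.span_names.append(name)


# --- Third-party library: calls the API, knows nothing about the SDK ----
def fetch_user(user_id: int) -> dict:
    get_tracer().start_span("fetch_user")
    return {"id": user_id}


fetch_user(1)             # no SDK installed: the call works, nothing is recorded
sdk = RecordingTracer()
set_tracer(sdk)           # the application opts in at startup
fetch_user(2)
print(sdk.span_names)     # ['fetch_user']
```

The library function is identical in both calls. Only the application's startup code decides whether spans are recorded, which is the property that lets a library ship instrumentation without imposing an SDK or vendor on its users.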
Which four pieces make up the OTel architecture?
A team has 50 microservices instrumented with OTel and wants to add a new tracing backend. Where do they make the change?
Order the path of a single span from application code to the backend:
1. Application code calls tracer.start_span() (OTel API)
2. OTel SDK builds the span record (start time, attributes, trace context)
3. When the span ends, SDK passes it to a batch span processor
4. Processor batches spans and hands them to an exporter
5. Exporter sends OTLP-encoded data over gRPC or HTTP to the Collector
6. Collector receives, applies processors (tail sampling, redaction), batches
7. Collector exports to one or more backends (Datadog, Honeycomb, Tempo, ...)
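The path above can be traced with a small in-memory simulation. Everything here is a toy under stated assumptions: `Span`, `ToyBatchProcessor`, `ToyCollector`, and `redact_email` are hypothetical stand-ins, the "exporter" is a direct function call rather than OTLP over the network, and the batch flushes on size only, whereas a real batch processor would typically also flush on a timer.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Span:
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)


def redact_email(span: Span) -> Span:
    """Collector-side processor: mask the user.email attribute if present."""
    cleaned = {
        key: ("[redacted]" if key == "user.email" else value)
        for key, value in span.attributes.items()
    }
    return Span(span.name, cleaned)


class ToyCollector:
    """Steps 6-7: receive a batch, run processors, fan out to backends."""

    def __init__(self, processors: List[Callable[[Span], Span]],
                 backends: Dict[str, List[Span]]) -> None:
        self.processors = processors
        self.backends = backends

    def receive(self, batch: List[Span]) -> None:
        for process in self.processors:
            batch = [process(span) for span in batch]
        for stored in self.backends.values():
            stored.extend(batch)


class ToyBatchProcessor:
    """Steps 3-5: queue ended spans and export them in batches."""

    def __init__(self, export: Callable[[List[Span]], None],
                 max_batch: int = 2) -> None:
        self.export = export
        self.max_batch = max_batch
        self.queue: List[Span] = []

    def on_end(self, span: Span) -> None:
        self.queue.append(span)
        if len(self.queue) >= self.max_batch:
            batch, self.queue = self.queue, []
            self.export(batch)


backends = {"backend_a": [], "backend_b": []}
collector = ToyCollector(processors=[redact_email], backends=backends)
processor = ToyBatchProcessor(export=collector.receive, max_batch=2)

processor.on_end(Span("GET /cart", {"user.email": "bea@example.com"}))
delivered_after_one = len(backends["backend_a"])   # still queued in the SDK
processor.on_end(Span("POST /charge", {"order.id": "A-123"}))

print(delivered_after_one, len(backends["backend_a"]), len(backends["backend_b"]))
print(backends["backend_b"][0].attributes)
```

The first span stays in the SDK-side queue until the batch fills, and redaction happens once in the Collector rather than in each service. Both backends receive the same processed spans, which is how one instrumented fleet can feed several destinations at once.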
Fill in the blank: _______ is the standard envelope and address format every OTel-aware piece understands — change the recipient, the envelope stays the same.
- 01 In two sentences, what is the difference between the OTel API and the OTel SDK?
- 02 Why does OTel split the API from the SDK, and what concrete problem does this solve?
- 03 What is the vendor-neutrality contract in one sentence?
OpenTelemetry is four pieces stacked: the API (what application code imports — a stable, vendor-free interface), the SDK (the runtime that turns API calls into telemetry records, batches them, and exports OTLP), the Collector (a standalone process that receives OTLP, runs configurable processors like tail sampling and redaction, and exports to any backend), and OTLP (the protobuf wire format all pieces speak). In 2026 every major vendor accepts OTLP — Datadog, Honeycomb, Grafana Cloud, Elastic, Splunk, New Relic. The portability contract is: emit OTLP at the edge, change backends in a Collector YAML, never rewrite instrumentation.