Observability
OTel: multiple-choice review
Six questions that cut across the whole unit. Each mirrors a real architecture or incident decision — not a definition to recite, but where to make a change and why it lands there.
Confirm you can connect the four pieces — API, SDK, Collector, OTLP — to the decisions they govern: where portability lives, what makes signals correlate, how sampling and the Collector pipeline compose, and how the whole thing fails in production.
A platform team claims full OTel vendor-neutrality. Two facts surface in review: application code imports the vendor's dd-trace SDK directly, and a Collector sits in front of every service converting to OTLP. Are they neutral, and why?
A team needs tail sampling that keeps 100% of error traces. They run a single agent Collector as a DaemonSet (one per node) exporting straight to the backend. Why does this not work, and what is the minimal fix?
Two services emit HTTP telemetry. Service A tags the route as http.route; service B tags it as http_route. OTLP export succeeds for both and no SDK error fires. What actually breaks, and where?
In a production traces pipeline you have memory_limiter, batch, and tail_sampling. Why does memory_limiter belong first, and what is the consequence of putting tail_sampling first?
A metrics backend's cardinality explodes after a fleet rollout of OTel auto-instrumentation. The new dimension driving it is the HTTP client's url.full attribute. What is the root cause and the correct fix?
Your single OTel Collector gateway crashes during an incident. The on-call dashboards stay green and no alerts fire, yet user-facing errors are spiking. What is the lesson, and the structural fix?
The through-line: portability is decided at the application edge (emit OTLP there or nowhere), correlation is decided by Semantic Conventions (consistent names or invisible services), and the Collector is where policy lives — processor order protects it, sampling composes head-cheap with tail-curated, and the gateway must be HA and self-monitored because when the Collector fails, your visibility fails silently. Every production failure in this unit — eroded neutrality, split traces, naming drift, OOM, cardinality leaks, blind dashboards — resolves back to one of those four pieces.