Observability
Profiling in production: security, war stories, OTel profiles, and the infrastructure design
Stripe’s continuous profiler caught a regression two days after a deploy that no dashboard showed. A new feature flag was reading from disk on every request instead of from the in-memory cache. The CPU profile looked normal; the off-CPU profile showed the disk wait. The fix was one line. Without continuous profiling, detection would have taken weeks.
Profiles are security-sensitive artefacts
A profile contains function names (often private), call patterns, and sometimes allocation arguments — enough to reverse-engineer business logic. Some profilers capture argument values at allocation sites; poorly configured allocation profilers have leaked credentials.
In hostile contexts, a profile from a competitor’s binary can reveal proprietary algorithms; function names alone often telegraph what a service does. eBPF profilers running on shared kernels can in principle observe other tenants’ execution, which is why loading tracing eBPF programs requires explicit capabilities (CAP_BPF and CAP_PERFMON on Linux 5.8 and later, CAP_SYS_ADMIN before that).
Production discipline:
- Profiles are RBAC-gated by team (Pyroscope tenancy model).
- Retention limited to 30-90 days; exports require approval.
- Never shipped outside the organisation.
- eBPF agent runs with CAP_PERFMON only, not full root.
- Audit log of who pulled which profile.
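The access rules in this list can be sketched as a small gate in front of profile storage. This is a minimal illustration, not Pyroscope's tenancy API: `ProfileStore`, `team_members`, and the audit record fields are hypothetical names, and the 90-day limit is taken from the upper end of the retention range above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional, Set

MAX_RETENTION = timedelta(days=90)  # upper end of the 30-90 day retention rule


@dataclass
class ProfileStore:
    # Hypothetical in-memory stand-ins for a real RBAC backend and audit sink.
    team_members: Dict[str, Set[str]]
    audit_log: List[dict] = field(default_factory=list)

    def fetch(self, user: str, team: str, profile_id: str,
              captured_at: datetime, now: Optional[datetime] = None) -> bool:
        """Return True if the pull is allowed. Every attempt is audited."""
        now = now or datetime.now(timezone.utc)
        is_member = user in self.team_members.get(team, set())
        within_retention = now - captured_at <= MAX_RETENTION
        allowed = is_member and within_retention
        # Record denied attempts too, so the log shows who tried to pull what.
        self.audit_log.append({
            "user": user,
            "team": team,
            "profile": profile_id,
            "at": now.isoformat(),
            "allowed": allowed,
        })
        return allowed
```

Logging denied attempts as well as successful ones is a deliberate choice here: an audit trail that only records successes cannot show probing.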
Production war stories
Discord 2020: a chat service ran at 80% CPU with mysterious tail latency. CPU profile pointed at JSON serialisation. Switching to a faster JSON library dropped CPU to 30% and tail latency to baseline.
GitHub 2021: Ruby workers were OOMing on certain endpoints. Allocation profile showed a single template-rendering function allocating 200 MB per request because of an unbounded loop concatenating strings.
Stripe 2022: continuous profiling caught a regression two days after deploy. A new feature flag read from disk on every request instead of from in-memory cache. CPU profile looked normal; off-CPU profile showed the disk wait. Fix was one line.
Cloudflare 2023: a Worker runtime regression appeared in eBPF profiles as time spent in V8’s GC. The team rolled back a V8 update that introduced more aggressive collection.
Slack 2024: a PHP service was spending 30% of CPU in the autoloader. Profiler-guided opcache tuning cut it to 5%.
Every major engineering org has a profiling war story, and they share a common thread: dashboards looked normal while the profile showed the bottleneck. The fix was obvious from the flame graph and nearly impossible to find without one.
| Company / Year | Symptom | Profile type | Root cause |
|---|---|---|---|
| Discord 2020 | 80% CPU, tail latency | CPU flame graph | JSON serialisation hotspot |
| GitHub 2021 | OOM on endpoints | Allocation profile | String concat loop, 200 MB/req |
| Stripe 2022 | Post-deploy regression | Off-CPU profile | Feature flag disk read on every req |
| Cloudflare 2023 | Worker runtime regression | eBPF CPU profile | V8 GC update, more aggressive collection |
| Slack 2024 | High PHP CPU | CPU flame graph | Autoloader: 30% CPU, fixed with opcache |
OTel profile signal: the fourth pillar
OpenTelemetry is standardising profiles as a fourth signal (after logs, metrics, traces). The spec defines:
- A profile data model: samples with stacks, labels, and time ranges.
- A transport: OTLP profile signal (added in 2024).
- Integration with context propagation: trace-id tagging on every sample.
Adoption status: Datadog, Grafana, Honeycomb, and Splunk are implementing OTel profile ingestion. On the agent side, the OTel Collector and profilers emit OTel-formatted profiles. The OTel profile spec is in beta as of 2026, so most production deployments still use vendor-specific formats (pprof, JFR, Pyroscope-native). Choosing a tool today commits to a format for 2-3 years; the OTel trajectory is worth tracking.
The promise: cross-vendor portability and a unified collector pipeline — the same architecture as logs, metrics, and traces. The catch: the spec is young and implementations diverge at the edges.
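To make the data model above concrete, here is a rough sketch of a sample that carries a stack, labels, a time range, and trace context. The class and field names are illustrative assumptions and do not reproduce the actual OTel profiles schema.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, Tuple


@dataclass
class Sample:
    # Illustrative fields only; not the real OTel profile message layout.
    stack: Tuple[str, ...]   # frames, root first, leaf last
    value: int               # e.g. CPU nanoseconds attributed to this stack
    start_unix_nano: int     # start of the time range the sample covers
    end_unix_nano: int       # end of the time range
    labels: Dict[str, str] = field(default_factory=dict)


def with_trace_context(sample: Sample, trace_id: str, span_id: str) -> Sample:
    """Return a copy of the sample tagged with trace and span identifiers."""
    labels = dict(sample.labels)  # copy so the original sample is untouched
    labels.update({"trace_id": trace_id, "span_id": span_id})
    return replace(sample, labels=labels)
```

Tagging every sample with a trace id is what later allows a backend to answer "show me the profile for this span" with a simple label filter.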
Designing continuous profiling infrastructure
Consider a 200-service polyglot platform (Go, Java, Node, Python) that must surface deploy regressions within 1 hour and support trace-to-profile drill-down in under 30 seconds. One possible design, layer by layer:
Layer 1 — Collection: eBPF DaemonSet on every node (Parca-style or Pyroscope eBPF) as the universal baseline — covers all languages, one agent per node. Per-language agents as supplements: pprof for Go, async-profiler for Java, py-spy for Python. The eBPF agent is the catchall; per-language agents provide allocation and mutex profiles.
Layer 2 — Backend: self-hosted Pyroscope 2.0 cluster. Object storage (S3 / GCS) with 30-day fine-grained retention and 90-day downsampled. Symbol deduplication keeps per-service storage under 10 GB/month.
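One common way to deduplicate symbols is a string table: each function name is stored once and stacks refer to it by index, which is the approach the pprof format takes with its `string_table`. The sketch below illustrates that general idea only; the text does not say how Pyroscope implements deduplication, and `SymbolTable` is a hypothetical name.

```python
from typing import Dict, Iterable, List, Tuple


class SymbolTable:
    """Stores each function name once and hands out integer ids."""

    def __init__(self) -> None:
        self._index: Dict[str, int] = {}
        self.names: List[str] = []

    def intern(self, name: str) -> int:
        """Return the id for a name, adding it on first sight."""
        idx = self._index.get(name)
        if idx is None:
            idx = len(self.names)
            self._index[name] = idx
            self.names.append(name)
        return idx

    def encode(self, stack: Iterable[str]) -> Tuple[int, ...]:
        """Replace every frame name in a stack with its id."""
        return tuple(self.intern(frame) for frame in stack)

    def decode(self, ids: Iterable[int]) -> Tuple[str, ...]:
        """Recover the frame names for an encoded stack."""
        return tuple(self.names[i] for i in ids)
```

Because most stacks in a service share their upper frames, the table grows far more slowly than the number of samples.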
Layer 3 — Trace correlation: profiles carry trace-id and span-id labels. Grafana links trace span → Pyroscope filtered by trace-id. Sub-30-second drill.
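The drill-down in this layer amounts to a label filter followed by aggregation. A minimal sketch, assuming samples are plain `(stack, value, labels)` tuples and that the label key is `trace_id` (both are assumptions made for illustration); the output uses the folded-stack text convention that flame graph tools consume.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# (stack frames root-first, sample value, labels); an assumed shape for this sketch
Sample = Tuple[Tuple[str, ...], int, Dict[str, str]]


def folded_for_trace(samples: Iterable[Sample], trace_id: str) -> Dict[str, int]:
    """Aggregate the samples of one trace into folded stacks ("a;b;c" -> value)."""
    totals: Counter = Counter()
    for stack, value, labels in samples:
        if labels.get("trace_id") == trace_id:
            totals[";".join(stack)] += value
    return dict(totals)
```

In a real deployment the filter runs in the profiling backend's query engine rather than in application code.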
Layer 4 — Regression detection: CI job on every deploy: capture 5-minute profile of new version under canary traffic, diff against previous version’s profile, post flame-graph diff as PR comment, fail CI if a new function appears in top 5 by self-CPU. Hourly production diff against same-hour-yesterday baseline; Slack alert on shape changes.
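The CI rule described here can be sketched as a comparison of two top-N lists. This example makes two assumptions: a profile is a list of `(stack, value)` pairs, and "new" means a function that is in the candidate's top 5 by self-CPU but was not in the baseline's top 5. The function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Profile = Iterable[Tuple[Tuple[str, ...], int]]  # (stack root-first, CPU value)


def self_cpu(samples: Profile) -> Counter:
    """Attribute each sample's value to its leaf frame (self time)."""
    totals: Counter = Counter()
    for stack, value in samples:
        totals[stack[-1]] += value
    return totals


def top_n(totals: Counter, n: int = 5) -> List[str]:
    """Names of the n functions with the highest self time."""
    return [name for name, _ in totals.most_common(n)]


def new_hot_functions(baseline: Profile, candidate: Profile, n: int = 5) -> List[str]:
    """Functions in the candidate's top n that were not in the baseline's top n."""
    before = set(top_n(self_cpu(baseline), n))
    return [f for f in top_n(self_cpu(candidate), n) if f not in before]


def gate_exit_code(baseline: Profile, candidate: Profile, n: int = 5) -> int:
    """Return 1 (fail the CI job) if any new function entered the top n, else 0."""
    return 1 if new_hot_functions(baseline, candidate, n) else 0
```

A gate this strict will also fire on intentional changes, so in practice it likely needs an override path, such as a reviewer acknowledging the flame-graph diff.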
Layer 5 — Cost controls: sample rate per service configurable in service.yaml (default 99 Hz; drop to 19 Hz for cheap baseline services). Budget alert at 80% of monthly cost ceiling.
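Both cost controls reduce to a few lines of logic. In this sketch the parsed service.yaml is represented as a plain dict, and the `profiling` and `sample_rate_hz` keys are assumed names, since the text does not give the actual schema.

```python
DEFAULT_HZ = 99        # default sample rate from the text
BASELINE_HZ = 19       # reduced rate for cheap baseline services
ALERT_FRACTION = 0.80  # alert at 80% of the monthly cost ceiling


def sample_rate_hz(service_config: dict) -> int:
    """Read the per-service sample rate, falling back to the default.

    The "profiling" and "sample_rate_hz" keys are hypothetical.
    """
    profiling = service_config.get("profiling", {})
    return int(profiling.get("sample_rate_hz", DEFAULT_HZ))


def budget_alert(spend_to_date: float, monthly_ceiling: float) -> bool:
    """True once spend reaches 80% of the monthly ceiling."""
    if monthly_ceiling <= 0:
        raise ValueError("monthly_ceiling must be positive")
    return spend_to_date >= ALERT_FRACTION * monthly_ceiling
```

Dropping a service from 99 Hz to 19 Hz cuts its sample volume to roughly a fifth, which is where most of the saving comes from.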
- Trace-to-profile drill time: <30 seconds
- Deploy regression detection window: <1 hour
- Pager-to-git-blame MTTR: <90 seconds
- Storage per service per month: <10 GB (Pyroscope 2.0)
- eBPF capability required: CAP_PERFMON only
- Profile RBAC: per-team tenancy
A profile from your service leaks to a vendor's support team. What is the security concern?
The OTel profile signal is in beta as of 2026. What is the practical implication for teams choosing a profiling tool today?
- 01 Why are profiles treated as security-sensitive artefacts rather than just operational data?
- 02 Design the profiling CI gate for a 50-service platform to catch CPU regressions at deploy time.
- 03 What is the OTel profile signal and what does it standardise?
Profiles contain function names, call patterns, and sometimes allocation argument values — treat them as security-sensitive artefacts with RBAC, audit logs, and retention limits, never shared externally without approval. Five industry war stories (Discord, GitHub, Stripe, Cloudflare, Slack) follow the same pattern: dashboards showed normal, the profile showed the bottleneck, the fix was obvious from the flame graph. The OTel profile signal standardises profiles as the fourth observability pillar with a data model, OTLP transport, and trace-id integration; it is in beta as of 2026 but worth tracking when choosing tooling. Production profiling infrastructure for a 200-service polyglot fleet combines an eBPF DaemonSet (universal baseline), per-language native agents (depth), Pyroscope 2.0 self-hosted (storage), trace-id correlation (30-second drill), and CI differential profiles (1-hour regression detection). The cultural shift: senior on-call engineers in 2026 open the profile dashboard the same reflexive way they opened traces two years ago.