Performance
Detecting N+1: query logs, APM traces, and CI gates
The page is 1.2 seconds. The CPU flame graph is flat. Your colleague insists the profiler must be wrong. The profiler is not wrong — you are looking at the wrong instrument. N+1 lives in the off-CPU wait time between queries, not in any function.
Why CPU profiles miss N+1
A CPU profiler samples stack traces of running threads. The N+1 problem does not burn CPU in any single location. Each query dispatches in microseconds; the program then waits for the database (off-CPU); the response returns fast; the next query dispatches. The wait is the bottleneck, but the wait is off-CPU.
The profiler sees fast blips with nothing wide. The cumulative cost is spread across many fast call sites — not concentrated in one slow function. A CPU flame graph will look healthy while the page takes 1.2 seconds.
The right instruments
1. Database query log
The most direct instrument. Every major database can log every query with its duration and a timestamp.
# Postgres — log everything in development
log_min_duration_statement = 0
# MySQL slow query log
slow_query_log = 1
long_query_time = 0
# ORM-level logging
# Rails: config.log_level = :debug (ActiveRecord logger prints each query)
# Django: LOGGING = { 'loggers': { 'django.db.backends': { 'level': 'DEBUG' } } }
# SQLAlchemy: echo=TrueThe N+1 signature is unmistakable: dozens to hundreds of similar queries for one request, often the same SELECT differing only in the WHERE id = ? parameter.
2. APM tracing — the waterfall view
Application performance monitoring tools (Datadog APM, Grafana Tempo, New Relic, Honeycomb with auto-instrumentation) capture database spans inline with code execution spans.
The N+1 waterfall looks like a tall column of short parallel bars — 50 or 100 short database spans all within the same request’s trace. One glance at the trace view makes the pattern obvious in a way that logs alone do not.
Alert when per-endpoint p95 span count grows more than 30% week-over-week.
3. Per-request query counter
Custom middleware that counts queries per request and either logs or alerts on excess:
# Rails — middleware counts queries per request
ActiveSupport::Notifications.subscribe('sql.active_record') do
RequestStore[:query_count] ||= 0
RequestStore[:query_count] += 1
end
after_action do
if RequestStore[:query_count] > 20
Rails.logger.warn "QUERY_COUNT #{params[:controller]}##{params[:action]}: #{RequestStore[:query_count]}"
end
endExpose as a Prometheus histogram by endpoint. Alert when endpoint p99 exceeds 20 or grows more than 50% week-over-week.
| Instrument | Best for | Signature |
|---|---|---|
| CPU profiler | Compute hotspots | Wrong tool — misses N+1 |
| Query log | Confirm N+1, find call site | Many similar SELECTs per request |
| APM trace | Visual waterfall, production | Tall column of short DB spans |
| Query counter | Alerting, per-endpoint trends | queries_per_request p99 > 20 |
The CI gate — catch at PR time
The most leveraged anti-N+1 mechanism is catching regressions at PR time instead of in production. Teams running this gate report 70–90% reduction in N+1 incidents reaching production. Initial setup: 1–2 weeks. Ongoing maintenance: trivial — the gate runs in CI.
Rails: the bullet gem raises on any lazy-loaded relation accessed during request render. A controller test that triggers N+1 fails.
# spec/spec_helper.rb
Bullet.enable = true
Bullet.raise = true # turns N+1 into a test failureDjango: assertNumQueries(N) in the test framework counts exact queries per view:
def test_orders_page(self):
with self.assertNumQueries(2): # fails if N+1 sneaks in
response = self.client.get('/orders/')SQLAlchemy / Hibernate: enable query count statistics and assert per-test:
# SQLAlchemy — assert query count via events
# Hibernate — assert via Statistics.getPrepareStatementCount()The pattern across all of them: PR triggers tests, tests fire requests, query counter exceeds the per-endpoint threshold, test fails, PR cannot merge.
Why this works
Off-CPU profiling (eBPF off-CPU profiles, Java async-profiler in wall-clock mode, JFR socket-wait events) is a fourth option. It shows the application spending most wall-time waiting on the database socket, not running code. Off-CPU profiling is the definitive instrument for the class of problem where “program is slow but CPU profile is healthy” — which is the N+1 signature.
Walking a real query log
# tail -f log/development.log | grep -E "SELECT|Loading"
Started GET "/dashboard"
User Load (0.4ms) SELECT "users".* FROM "users" WHERE "users"."id" = 42
Project Load (0.6ms) SELECT "projects".* FROM "projects" WHERE "projects"."user_id" = 42 LIMIT 50
Task Load (0.3ms) SELECT "tasks".* FROM "tasks" WHERE "tasks"."project_id" = 1
Task Load (0.3ms) SELECT "tasks".* FROM "tasks" WHERE "tasks"."project_id" = 2
Task Load (0.4ms) SELECT "tasks".* FROM "tasks" WHERE "tasks"."project_id" = 3
... (47 more Task Load lines)
Comment Load (0.2ms) SELECT "comments".* FROM "comments" WHERE "comments"."task_id" = 1
Comment Load (0.2ms) SELECT "comments".* FROM "comments" WHERE "comments"."task_id" = 2
... (243 more Comment Load lines)
Completed 200 OK in 980ms (Views: 250.3ms | ActiveRecord: 689.4ms)Reading this: 980 ms total, 689 ms of which is ActiveRecord across 300+ queries. Two N+1 levels:
- Level 2 (projects → tasks): 50 separate Task Load queries, one per project.
- Level 3 (tasks → comments): 200–300 separate Comment Load queries, one per task.
Fix: projects.includes(tasks: :comments).limit(50) in Rails (or the equivalent in your ORM). From 300 queries to 4 — one for user, one for projects, one for tasks (IN), one for comments (IN). ActiveRecord time: 689 ms → ~15 ms.
An APM trace shows 142 database spans per request on the /dashboard endpoint, while p99 is 1.8 s. The spans are short (1–3 ms each). What is the diagnosis?
A PR adds a new serializer that accesses order.customer.name for each order in a list. Tests pass. What CI mechanism would have caught this N+1 before it shipped?
Order the layers of N+1 detection, from cheapest to find to hardest to find:
- 1 Development query log — enable ORM logger, count lines per request
- 2 CI gate — bullet/assertNumQueries fails PR before it ships
- 3 APM trace in staging — waterfall shows DB span columns
- 4 Production alert — queries_per_request p99 spikes on a live endpoint
- 01Why does a CPU profile miss N+1, and what three instruments surface it correctly?
- 02Describe the CI gate pattern for preventing N+1 regressions from shipping.
N+1 is off-CPU: the program waits between queries, not inside any function. That is why CPU profiles look healthy while the page is slow. The right instruments are the query log (repeated similar SELECTs), APM traces (tall column of short DB spans), and per-request query counters (alert at p99 > 20). The most leveraged prevention is a CI gate — Rails bullet or Django assertNumQueries — that fails PRs introducing lazy-load regressions before they reach production. Query count is the metric that matters; p99 latency is the result.
appears again in159
- The journey of a request: seven stops from socket to responsejunior
- Accept and parse: from kernel queue to a typed requestmiddle
- Routing and middleware: choosing what runs, and in what ordermiddle
- Handler and response: from business logic to bytes on the wiremiddle
- Streaming and backpressure: when the client reads slower than you writesenior
- Timeouts and tail latency: budgets, deadlines, and the fan-out trapsenior
- Middleware and DI: the two patterns that shape every backendjunior
- Writing middleware: signatures, next(), and the three framework modelsmiddle
- Inversion of control: how dependencies reach a classmiddle
- DI scopes and lifecycles: singleton, request, transientmiddle
- DI as a testing seam: fakes, mocks, and the boundary that matterssenior
- DI containers in production: resolution graphs, circular deps, and when not tosenior
- Blocking vs non-blocking I/O: two ways to waitjunior
- The event loop: one thread, ordered phasesmiddle
- What blocks the loop: CPU work and sync callsmiddle
- Offloading CPU work: worker threads and the libuv poolmiddle
- Backpressure and bounded concurrencysenior
- Throughput under load: tail latency and saturationsenior
- Why pool: the cost of creating a connectionjunior
- Pool sizing: why bigger is not fastermiddle
- Acquisition and timeouts: the wait queue is the real latency dialmiddle
- Retry strategies: backoff, jitter, and thundering herdmiddle
- Observability, production failures, and global-scale designsenior
- Tasks, microtasks, and scheduler.yield()middle
- Timer accuracy, throttling, and idle workmiddle
- Node.js event loop: phases, nextTick, and loop lagsenior
- Rendering strategies: SSG, SSR, ISR, streaming, and hydrationjunior
- SSG, SSR, ISR, streaming, and RSC — how each worksmiddle
- Hydration cost: selective, progressive, islands, resumabilitymiddle
- Core Web Vitals: what LCP, INP, and CLS measurejunior
- LCP: four phases, one dominant costmiddle
- INP: input delay, processing, presentationmiddle
- Lab vs field: why the two disagree and how to use eachmiddle
- Metric tradeoffs, RUM attribution, and the CI+field loopsenior
- The full picture: URL to LCP to INP as a relay racejunior
- Eight layers traced: from the service worker to the second navigationmiddle
- Five canonical breaks: where production reliably diessenior
- The three-track method: reading traces and building a monitored systemsenior
- What an index is and how it speeds up queriesjunior
- The leading-column rule and composite index designmiddle
- Partial, expression, and covering indexesmiddle
- Index types: GIN, GiST, BRIN, Hash, Bloom, and HOT updatesmiddle
- Index-only scans, the Visibility Map, and INCLUDEsenior
- Production failure modes and the index audit playbooksenior
- Index design exercise: full-text search strategysenior
- EXPLAIN and execution plans: what the planner decides and whyjunior
- Scan types: Seq, Index, Bitmap, Index-Onlymiddle
- Join algorithms and the row-estimate cascademiddle
- pg_statistic, ANALYZE, and production observabilitymiddle
- Extended statistics: fixing correlated-column estimate failuressenior
- Plan cache, cost-constant tuning, and planner internalssenior
- Production failure modes and plan stabilitysenior
- Connection pools: amortising the cost of a Postgres backendjunior
- PgBouncer session, transaction, and statement modesmiddle
- Pool sizing: the (cores × 2) + spindles formula and the two-layer stackmiddle
- Pool exhaustion and idle-in-transaction: the 3 AM failure modemiddle
- Migrating to transaction mode: rollout playbook and PgBouncer 1.21 prepared statementsmiddle
- The Postgres process model and why raising max_connections degrades throughputsenior
- Pooler landscape 2026, serverless connection storms, and the full failure-mode taxonomysenior
- ADD COLUMN: instant in PG 11+ vs rewrite in older Postgresjunior
- The lock-queue failure mode: why instant DDL can freeze the databasemiddle
- Safe DDL patterns: NOT VALID, CONCURRENTLY, and unsafe-op fixesmiddle
- Migration failure taxonomy and production disciplinesenior
- Shard-key selection: hash, range, list, and directory strategiesmiddle
- Co-location and Citus: the invariant that makes sharding usablemiddle
- The hot-shard failure mode: detection, isolation, and durable policymiddle
- Online resharding, 2PC, and the operational cost of shardingsenior
- The seven acts: from CREATE TABLE to Citusjunior
- Acts 1–3 in depth: schema, indexes, and planner statisticsmiddle
- Acts 4–6 in depth: MVCC bloat, connection pooling, and safe migrationsmiddle
- Act 7 in depth: sharding, co-location, and the seven-tier tradeoff cascademiddle
- Observability, anti-patterns, and production triagesenior
- Bits on the wirejunior
- Latency mathmiddle
- Bufferbloat and congestionsenior
- The physical frontiersenior
- Sequence numbers and connection statemiddle
- Flow control and congestion controlmiddle
- BBR, production observability, and beyond TCPsenior
- CDN: putting content next doorjunior
- Anycast and GeoDNS: routing to the nearest edgemiddle
- Tiered cache and Cache-Controlmiddle
- Vary header and cache keysmiddle
- Stale-while-revalidate and cache stampedesenior
- Edge workers and edge-side compositionsenior
- CDN operations and observabilitysenior
- WebSocket: the HTTP upgrade handshakejunior
- WebSocket vs SSE vs long-polling: choosing the right transportmiddle
- WebSocket backpressure: when clients can''''t keep upmiddle
- Reconnection: jittered backoff, thundering herd, message resumptionsenior
- WebSocket at scale: HTTP/2 multiplexing, permessage-deflate, C10Msenior
- WebSocket in production: proxies, security, and distributed architecturesenior
- What reverse proxies dojunior
- Balancing algorithms: round-robin to power-of-two-choicesmiddle
- L4 vs L7 load balancing and client-IP preservationmiddle
- Health checks, connection draining, and slow startmiddle
- Retry storms, circuit breakers, and load sheddingsenior
- Resilient LB architecture: anycast, zone-aware routing, and observabilitysenior
- Why QUIC and not TCP+TLSjunior
- QUIC streams and head-of-line blockingjunior
- Integrated handshake and 1-RTTmiddle
- Connection IDs and network migrationmiddle
- Loss detection and congestion controlmiddle
- 0-RTT resumption and packet encryptionsenior
- Deployment tradeoffs and CPU costsenior
- DDoS: what it is and why it worksjunior
- Amplification attacks and state exhaustionmiddle
- Rate limiting: algorithms and architecturemiddle
- WAFs, firewalls, mTLS, and HSTSmiddle
- DNS cache poisoning and BGP hijackingsenior
- Defense-in-depth architecture and attack economicssenior
- The twelve layers: one URL, seven actorsjunior
- DNS, TCP, TLS in sequence: where the milliseconds gomiddle
- Critical render path and Core Web Vitalsmiddle
- Proxy intercepts and security gates: rate limiters, WAF, mTLSmiddle
- Alternate paths: QUIC 0-RTT, WebSocket upgrade, connection migrationmiddle
- Observability: distributed traces, USE/RED, and samplingsenior
- Resilience: cascading retries, circuit breakers, and error budgetssenior
- What the three signals are: logs, metrics, and tracesjunior
- Metrics and cardinality: the cost model of a time-series databasemiddle
- Logs and volume: the cost model of structured loggingmiddle
- Traces and sampling: the cost model of distributed tracingmiddle
- Join keys and exemplars: making the three signals composemiddle
- Observability 2.0: wide events and the cost shiftsenior
- Failure modes and engineering practice: cardinality budgets, PII, and samplingsenior
- Why structured logs exist: the diary vs the spreadsheetjunior
- The production log schema: fields every line must carrymiddle
- Log levels and alert routingmiddle
- Sampling strategies and log costmiddle
- PII redaction and log injectionsenior
- Trace context propagation in logssenior
- OTel Logs Data Model and audit logs as a subsystemsenior
- OTel signals, Semantic Conventions, and the OTLP wire formatmiddle
- Auto-instrumentation and manual spans: the 80/20 of OTelmiddle
- The OTel Collector: receivers, processors, exporters, and deployment patternsmiddle
- Sampling strategies: head, tail, and parent-basedmiddle
- Vendor neutrality, eBPF instrumentation, the Operator, and browser/serverless OTelsenior
- Operating the OTel Collector: reliability, version skew, failure modes, and governancesenior
- RED and USE: two checklists, one triage disciplinejunior
- Instrumenting RED in Prometheus: counters, histograms, and cardinality disciplinemiddle
- USE on Linux: CPU, memory, disk, network, and PSImiddle
- Golden signals, dashboard layout, and service mesh auto-REDmiddle
- Cardinality as a cost driver: labels, PII, exemplars, and samplingmiddle
- Native histograms, SLO tie-in, and production failure patternsmiddle
- Choosing SLIs and SLO targets: ratios, not feelingsmiddle
- Multi-window multi-burn-rate alerting: why AND beats ORmiddle
- Error budget policy, latency SLOs, and composite journeysmiddle
- Iceberg SLIs, composite SLO math, and SLA vs SLOsenior
- Flame graphs: reading the picture that shows where time goesjunior
- Sampling vs instrumentation profiling: why 99 Hz wins in productionmiddle
- Profile types: CPU, memory, off-CPU, mutex — which one to reach formiddle
- Continuous profiling: always-on flame graphs with eBPF and trace-id correlationmiddle
- How flame graphs are built from samples, and the production workflows that use themmiddle
- Linux perf, eBPF internals, PGO, and the limits of samplingsenior
- Profiling in production: security, war stories, OTel profiles, and the infrastructure designsenior
- The debugging funnel: SLO → RED → trace → profilejunior
- OTel architecture: one SDK, four signals, one wire formatmiddle
- Cost discipline: keeping observability under 5% of infra spendmiddle
- Scale, security, and the ROI of observable systemssenior