Databases
Production failure modes and the index audit playbook
A DELETE FROM posts WHERE id = 42 with ON DELETE CASCADE takes 7 minutes and holds row-level locks the entire time. Root cause: comments(post_id) is not indexed. Postgres does a sequential scan of 200M comments to find the rows to delete. One missing index on a foreign key caused a production incident. This lesson catalogs the failure modes and the playbook to find them before they find you.
The seven failure modes
1. Missing index on FK column
Foreign-key columns are not automatically indexed. REFERENCES posts(id) requires posts(id) to already have a primary-key or unique constraint, so the referenced column is always indexed, but Postgres creates nothing on comments(post_id) (the referencing column). Without that index, every ON DELETE CASCADE from the parent table scans the child table sequentially to find the rows to cascade to.
-- This is the dangerous pattern:
ALTER TABLE comments ADD CONSTRAINT fk_post FOREIGN KEY (post_id) REFERENCES posts(id) ON DELETE CASCADE;
-- No index on comments(post_id) -- DELETE FROM posts is now O(n) on comments
-- Fix:
CREATE INDEX CONCURRENTLY idx_comments_post_id ON comments(post_id);
Rule: index every FK column unless you can prove the parent table is never deleted and no queries filter the child by the FK.
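To apply the rule across a whole schema, the catalogs can be queried for foreign keys whose referencing columns are not the leading columns of any index. The following is a sketch of one common approach, not an exhaustive check: it ignores partial indexes and treats any index whose first columns are exactly the FK columns (in any order) as sufficient.

```sql
-- Foreign keys with no usable index on the referencing (child) columns.
SELECT
  c.conrelid::regclass  AS child_table,
  c.conname             AS fk_name,
  c.confrelid::regclass AS parent_table,
  pg_size_pretty(pg_relation_size(c.conrelid)) AS child_size
FROM pg_constraint AS c
WHERE c.contype = 'f'                 -- foreign-key constraints only
  AND NOT EXISTS (
    SELECT 1
    FROM pg_index AS i
    WHERE i.indrelid = c.conrelid
      AND i.indisvalid
      AND i.indpred IS NULL           -- skip partial indexes
      -- indkey is 0-based: the first N index columns must cover the FK columns
      AND (i.indkey::int2[])[0:cardinality(c.conkey) - 1] @> c.conkey
  )
ORDER BY pg_relation_size(c.conrelid) DESC;
```

Ordering by child table size puts the most dangerous cascades first; small lookup tables at the bottom of the list may be acceptable without an index.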
2. Implicit type coercion defeats the index
-- orders.user_id is BIGINT; index on orders(user_id) exists
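SELECT * FROM orders WHERE user_id = 42::numeric; -- value arrives typed as NUMERIC, e.g. from a driver or ORM binding
There is no BIGINT = NUMERIC operator, so Postgres implicitly casts the column to NUMERIC and compares there. Because the cast is applied to the column, not the constant, the B-tree on the raw user_id values cannot be walked. Result: Seq Scan. A bare quoted literal such as user_id = '42' does not trigger this: it is untyped, and Postgres resolves it to BIGINT before planning. Diagnosis: EXPLAIN ANALYZE on the query, looking for a filter such as (user_id)::numeric. Fix: use typed parameters ($1::BIGINT or the correct type in the query builder).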
Other common coercion traps: created_at::date = '2024-01-01' defeats an index on created_at; use created_at >= '2024-01-01' AND created_at < '2024-01-02' instead.
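The two fixes can be compared side by side with EXPLAIN. This sketch assumes the orders table from the examples above, with B-tree indexes on orders(user_id) and orders(created_at); the prepared-statement name orders_by_user is made up for illustration.

```sql
-- Typed parameter: $1 is declared BIGINT, so the comparison stays on the
-- raw column and the index on user_id remains usable.
PREPARE orders_by_user (BIGINT) AS
  SELECT * FROM orders WHERE user_id = $1;
EXPLAIN EXECUTE orders_by_user (42);

-- Cast applied to the column: the index on created_at cannot be used.
EXPLAIN
SELECT * FROM orders WHERE created_at::date = '2024-01-01';

-- Half-open range on the raw column: the index on created_at is usable.
EXPLAIN
SELECT * FROM orders
WHERE created_at >= '2024-01-01'
  AND created_at <  '2024-01-02';

DEALLOCATE orders_by_user;
```

In the plans, a cast that landed on the column shows up inside the Filter line, while an index-friendly predicate shows up as an Index Cond.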
3. Stale statistics cause wrong plan choice
The planner’s cost model depends on the statistics in pg_statistic (value distributions, distinct counts, correlations) and the row counts in pg_class. After a bulk insert, large delete, or schema change, autovacuum schedules an ANALYZE, but by default only after about 10% of rows change (autovacuum_analyze_scale_factor = 0.1, plus a 50-row threshold). Until ANALYZE runs, the planner works from wrong estimates.
Symptom: EXPLAIN ANALYZE shows an estimate of rows=50 against an actual rows=500,000, a 10,000x miss. The planner picked a nested loop with an index scan, which is catastrophic at that row count.
Fix: run ANALYZE table_name after bulk data operations. For tables with highly skewed or correlated columns (e.g., 90% of orders are for 5 workspaces), use CREATE STATISTICS (ndistinct) ON workspace_id, status FROM orders to give the planner multi-column statistics, then run ANALYZE so they are populated. The dependencies and mcv kinds (mcv requires Postgres 12+) capture cross-column correlation and skew that ndistinct alone does not.
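A minimal sketch of the stale-statistics workflow, assuming the orders table used throughout this lesson. The statistics object name orders_ws_status_stats and the 0.01 scale factor are illustrative choices, not recommendations from the original text.

```sql
-- 1. Find tables with many modifications since their last ANALYZE.
SELECT relname,
       n_live_tup,
       n_mod_since_analyze,
       last_analyze,
       last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_mod_since_analyze DESC
LIMIT 20;

-- 2. Refresh statistics right after a bulk load or large delete.
ANALYZE orders;

-- 3. For large, hot tables, let autovacuum analyze after 1% change
--    instead of the default 10% (assumed value; tune per table).
ALTER TABLE orders SET (autovacuum_analyze_scale_factor = 0.01);

-- 4. Multi-column statistics; they are only populated by the next ANALYZE.
CREATE STATISTICS orders_ws_status_stats (ndistinct, dependencies)
  ON workspace_id, status FROM orders;
ANALYZE orders;
```

After step 4, re-running EXPLAIN ANALYZE on the problem query shows whether the estimated row count has moved closer to the actual one.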
4. Index bloat slows scans
Indexes accumulate dead entries from updates and deletes. Postgres 14+ does bottom-up index deletion (aggressive cleanup on hot leaf pages without waiting for vacuum), but heavy-update tables can still develop significant bloat over time.
Symptom: index size is 5-10x the expected size for the row count; queries are slow despite using the index.
-- Rebuild the index without blocking reads/writes
REINDEX INDEX CONCURRENTLY idx_orders_user_id;
REINDEX CONCURRENTLY (Postgres 12+) builds a new index in the background and then swaps it in. Duration is similar to the initial CREATE INDEX CONCURRENTLY build.
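Because a concurrent rebuild can run for a long time, it helps to watch its progress and to check for leftovers afterwards. A sketch using standard catalog views (Postgres 12+ for the progress view); an interrupted REINDEX CONCURRENTLY can leave an invalid index behind, typically with a _ccnew or _ccold suffix.

```sql
-- Progress of running CREATE INDEX / REINDEX commands.
SELECT pid,
       relid::regclass       AS table_name,
       index_relid::regclass AS index_name,
       command,
       phase,
       blocks_done,
       blocks_total
FROM pg_stat_progress_create_index;

-- Invalid indexes left behind by a failed or cancelled concurrent build.
SELECT indexrelid::regclass AS index_name,
       indrelid::regclass   AS table_name
FROM pg_index
WHERE NOT indisvalid;
```

Invalid indexes still cost write overhead without serving reads, so they are worth dropping (DROP INDEX CONCURRENTLY) before retrying the rebuild.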
5. Wrong composite column order
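A composite index (status, workspace_id) where most queries filter by workspace_id alone does little for those queries: a B-tree is ordered by its leading column, so rows for one workspace_id are scattered across the whole index. The planner usually falls back to Seq Scan.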
This is the leading-column rule violation discussed in lesson 02. In production, it often appears when an index was designed for one query but a second hot query was added later with a different filter pattern.
Fix: add a second index with the correct leading column, or redesign the composite if the original query shape is less common.
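The two options might look like this. The table (orders) and all index names are assumptions for illustration; the original text does not name them.

```sql
-- Existing index, designed for the first query (assumed definition):
--   CREATE INDEX idx_orders_status_workspace ON orders (status, workspace_id);

-- Option 1: add a second index led by the column the new hot query filters on.
CREATE INDEX CONCURRENTLY idx_orders_workspace_id
  ON orders (workspace_id);

-- Option 2: redesign. (workspace_id, status) serves filters on workspace_id
-- alone and on workspace_id + status, but no longer serves status alone.
CREATE INDEX CONCURRENTLY idx_orders_workspace_status
  ON orders (workspace_id, status);
DROP INDEX CONCURRENTLY idx_orders_status_workspace;
```

Option 1 costs extra write overhead for a second index; option 2 keeps one index but only makes sense when no hot query filters on status by itself.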
6. JSONB GIN cardinality bomb
A JSONB column where documents contain millions of unique keys creates a GIN index with millions of distinct posting-list entries. This index overwhelms shared_buffers (evicting hot heap pages) and has enormous write cost (updating the posting list for every unique key on each insert).
Symptom: GIN index size is 10-50x the underlying column size; shared_buffers cache-hit ratio drops; insert latency spikes after adding the GIN.
Fix: use expression B-tree indexes on known-hot fields instead of a GIN on the whole column:
-- Instead of: CREATE INDEX ON events USING GIN (payload)
-- Use:
CREATE INDEX ON events ((payload->>'event_type'));
CREATE INDEX ON events ((payload->>'user_id'));
This indexes only two known-hot fields with small B-tree indexes instead of a monolithic GIN.
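An expression index is only considered when the query uses the same expression as the index definition. A short sketch, assuming the events table above; the value 'signup' is a made-up example.

```sql
-- Matches the indexed expression payload->>'event_type': B-tree is usable.
EXPLAIN
SELECT * FROM events WHERE payload->>'event_type' = 'signup';

-- Containment is a GIN operator; the expression B-tree cannot serve it.
EXPLAIN
SELECT * FROM events WHERE payload @> '{"event_type": "signup"}';

-- Compare the size of every index on the table.
SELECT indexrelid::regclass AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE relname = 'events'
ORDER BY pg_relation_size(indexrelid) DESC;
```

Switching from a GIN to expression indexes therefore usually means rewriting @> predicates in the application to the ->> form.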
7. “Has an index” but still slow
The most insidious failure: the right index exists, but the plan is still slow because:
- The index is not covering (heap fetches dominate at scale).
- The ORDER BY column is not in the index (full sort after index scan).
- The index’s leading column does not match the most selective filter.
Diagnosis is always EXPLAIN (ANALYZE, BUFFERS) on the exact query with realistic parameters — not a contrived test query, not a staging environment with 100x fewer rows.
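As an illustration of the first two causes, consider a "latest orders for one user" query. The query shape, the column total_cents on this table, and the index name are assumptions for the sketch; INCLUDE requires Postgres 11+.

```sql
-- Diagnose on the exact query with realistic parameters.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, total_cents
FROM orders
WHERE user_id = 42
ORDER BY created_at DESC
LIMIT 20;
-- With only an index on (user_id), look for:
--   * a Sort node above the index scan (ORDER BY column not in the index)
--   * large "Buffers: shared read" counts from heap fetches

-- An index that supplies the filter, the ordering, and the selected columns.
CREATE INDEX CONCURRENTLY idx_orders_user_created_cover
  ON orders (user_id, created_at DESC)
  INCLUDE (id, total_cents);

-- Re-run the EXPLAIN above. The things to check for are an Index Only Scan,
-- no Sort node, and a low "Heap Fetches" count once the table has been vacuumed.
```

The Heap Fetches figure depends on the visibility map, so a covering index on a table with many dead tuples can still read the heap heavily.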
The quarterly index audit playbook
Run this playbook every quarter on every production Postgres database. It typically takes 1-2 hours and returns 10-30% storage savings plus measurable write-throughput improvements.
| Step | Query / tool | Action |
|---|---|---|
| 1. Unused indexes | pg_stat_user_indexes WHERE idx_scan = 0 | DROP INDEX CONCURRENTLY after verifying it is not a constraint or batch-job index |
| 2. Bloated indexes | pgstatindex() on hot B-tree indexes (pgstattuple extension) | REINDEX CONCURRENTLY on indexes with >30% bloat |
| 3. Missing indexes | pg_stat_statements top-N by total_exec_time + EXPLAIN | CREATE INDEX CONCURRENTLY for Seq Scans on large tables |
| 4. Redundant indexes | pg_index catalog: find prefix duplicates | DROP INDEX CONCURRENTLY the shorter prefix when a composite covers it |
| 5. Visibility map (VM) health | pg_stat_user_tables.n_dead_tup | VACUUM if n_dead_tup > 5% of live rows on tables that depend on index-only scans (IOS) |
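Step 5 in the table has no dedicated walkthrough in this lesson, so here is a sketch of what the check could look like. It combines the dead-tuple counters with the all-visible page count from pg_class; the 5% threshold comes from the table above.

```sql
-- Tables whose dead tuples exceed 5% of live rows, with visibility-map coverage.
SELECT s.relname,
       s.n_live_tup,
       s.n_dead_tup,
       round(100.0 * s.n_dead_tup / GREATEST(s.n_live_tup, 1), 1)    AS dead_pct,
       round(100.0 * c.relallvisible / GREATEST(c.relpages, 1), 1)   AS all_visible_pct,
       s.last_vacuum,
       s.last_autovacuum
FROM pg_stat_user_tables AS s
JOIN pg_class AS c ON c.oid = s.relid
WHERE s.n_dead_tup > 0.05 * GREATEST(s.n_live_tup, 1)
ORDER BY s.n_dead_tup DESC;
```

A low all_visible_pct on a table served by index-only scans means those scans still visit the heap; a manual VACUUM on that table restores the visibility map.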
Step 1: find unused indexes
SELECT
  s.schemaname,
  s.relname AS table_name,
  s.indexrelname AS index_name,
  s.idx_scan,
  pg_size_pretty(pg_relation_size(s.indexrelid)) AS index_size
FROM pg_stat_user_indexes AS s
JOIN pg_index AS i ON i.indexrelid = s.indexrelid
WHERE s.idx_scan = 0
  AND NOT i.indisunique     -- skips primary-key and unique-constraint indexes
  AND NOT i.indisexclusion  -- skips exclusion-constraint indexes
ORDER BY pg_relation_size(s.indexrelid) DESC;
Caveats: (a) the counters restart from zero after pg_stat_reset(), crash recovery, or a major-version upgrade (a clean restart preserves them), so a recent reset makes the scan count misleading; use a monitoring system that tracks these over time. (b) Some indexes are used only by periodic batch jobs; cross-reference pg_stat_statements over a longer window. (c) Indexes that back UNIQUE or exclusion constraints may show idx_scan = 0 while still enforcing the constraint on every insert, which is why the query filters them out.
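Before acting on the list, it is worth confirming how long the counters have been accumulating. A sketch; the index name idx_orders_legacy_lookup is hypothetical.

```sql
-- When were this database's statistics last reset?
SELECT datname, stats_reset
FROM pg_stat_database
WHERE datname = current_database();

-- Only once the observation window is long enough and the owners agree:
DROP INDEX CONCURRENTLY idx_orders_legacy_lookup;
```

DROP INDEX CONCURRENTLY cannot run inside a transaction block, so migration tools that wrap each step in a transaction need that wrapping disabled for this statement.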
Step 2: find bloated indexes
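-- Requires the pgstattuple extension.
-- pgstattuple_approx() accepts only tables, so B-tree indexes go through pgstatindex().
SELECT
  s.indexrelid::regclass AS index_name,
  pg_size_pretty(pg_relation_size(s.indexrelid)) AS total_size,
  round((100 - p.avg_leaf_density)::numeric, 2) AS bloat_pct
FROM pg_stat_user_indexes AS s
JOIN pg_class AS c ON c.oid = s.indexrelid
JOIN pg_am AS am ON am.oid = c.relam AND am.amname = 'btree'
CROSS JOIN LATERAL pgstatindex(s.indexrelid::regclass) AS p
ORDER BY bloat_pct DESC;
Indexes with bloat_pct over 30% are candidates for REINDEX CONCURRENTLY. A freshly built B-tree already shows about 10% here because the default fillfactor is 90. pgstatindex() reads the whole index, so schedule both the measurement and the rebuild during off-peak.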
Step 3: find missing indexes
Pull the top-20 queries from pg_stat_statements by total_exec_time (total_time before Postgres 13). For each, run EXPLAIN (ANALYZE, BUFFERS) with realistic parameters. Look for Seq Scan on tables over 10k rows where the filter is selective (rows returned are a small fraction of the total). Each such query is a missing-index candidate.
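The top-20 list can be pulled with a query along these lines. It assumes the pg_stat_statements extension is installed and uses the Postgres 13+ column names.

```sql
-- Top 20 statements by cumulative execution time.
SELECT queryid,
       calls,
       round(total_exec_time::numeric, 1) AS total_ms,
       round(mean_exec_time::numeric, 2)  AS mean_ms,
       rows,
       left(query, 120)                   AS query_start
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
```

The query text is normalized (constants replaced by $1, $2, ...), so realistic parameter values for the EXPLAIN step have to come from application logs or knowledge of the data.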
Step 4: find redundant indexes
Two indexes are redundant when one’s key columns are a prefix of another’s. Given idx_on_a and idx_on_a_b, idx_on_a is redundant if the planner can serve its queries from the idx_on_a_b composite. Drop the shorter one, unless it is unique or backs a constraint.
-- Find potential duplicates (manual inspection required)
SELECT
  i.indrelid::regclass AS table_name,
  i.indexrelid::regclass AS index_name,
  array_to_string(array_agg(a.attname ORDER BY x.n), ', ') AS columns
FROM pg_index i
CROSS JOIN LATERAL unnest(i.indkey) WITH ORDINALITY AS x(attnum, n)
JOIN pg_attribute a ON a.attrelid = i.indrelid AND a.attnum = x.attnum
GROUP BY i.indexrelid, i.indrelid
ORDER BY i.indrelid, columns;
Index strategy across environments
Different environments have different needs:
Development: minimal indexes — only PKs and essential unique constraints. The goal is fast schema iteration, not query performance. Adding all production indexes slows migrations and obscures schema design (you should not lean on indexes to compensate for bad schema choices).
Staging: mirror production indexes when running load tests or benchmarking; skip them for pure functional testing. The staging index set helps catch “works in dev, slow in staging” before production.
Production: the full deliberate index set, with all the controls described in this unit — covering composites, partial indexes, INCLUDE, regular audits. Production index additions go through migration review alongside the query that requires them.
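Replicas / analytics copies: may have additional OLAP-specific indexes (wide composites, BRIN on date ranges, GIN for full-text) that would be too expensive to maintain on the OLTP primary. This only works on a logically replicated copy: a physical streaming replica is a block-for-block copy of the primary and cannot carry its own indexes. If using logical replication to a dedicated analytics Postgres, add those indexes only on the subscriber.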
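Logical replication ships row changes, not index definitions, so an analytics-only index is simply created on the subscriber. A sketch, assuming the events table is replicated and has a created_at column; the index name is made up.

```sql
-- Run on the analytics subscriber only, never on the OLTP primary.
-- BRIN stays small on append-mostly tables where created_at follows insert order.
CREATE INDEX CONCURRENTLY idx_events_created_at_brin
  ON events USING BRIN (created_at);
```

Every extra index on the subscriber slows down how fast it can apply incoming changes, so replication lag is the metric to watch after adding one.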
Strategic posture
The single most common performance postmortem at production scale: “missing index on this hot query.” The second most common: “too many indexes slowing writes.” Both have the same root cause — indexes were not treated as part of the design.
Senior teams:
- Add the index to the migration that ships the feature (query + index in one PR).
- Review indexes in code review alongside the SQL.
- Own index strategy at the platform level, not per-team.
- Audit quarterly.
The cost of this discipline: one checklist item per PR touching SQL. The cost of not doing it: a 3am page when a customer dashboard times out because someone added a new FK last week without indexing the referencing column.
- Audit cadence: quarterly
- Typical storage savings from audit: 10-30%
- Typical write-rate improvement from dropping unused indexes: 2-5x on write-heavy tables
- REINDEX CONCURRENTLY duration: similar to initial CREATE INDEX CONCURRENTLY
- pgstatindex bloat threshold for reindex: >30%
- idx_scan = 0 window (use monitoring, not just reset): >30 days
- pg_stat_statements top-N for missing-index scan: top 20 by total_exec_time
- FK column index enforced by Postgres? No, a manual index is required
Query has an index but is still slow — diagnose
slow_query: SELECT id, total_cents FROM orders WHERE workspace_id = 42 AND status = 'pending' ORDER BY created_at DESC LIMIT 50;
execution_time: 4280 ms
rows_returned: 50
EXPLAIN ANALYZE:
Limit (cost=320..380 rows=50 width=24) (actual time=4271..4280 rows=50 loops=1)
  -> Sort (cost=320..18420 rows=180000 width=24) (actual time=4270..4275 rows=50 loops=1)
       Sort Key: created_at DESC
       Sort Method: top-N heapsort Memory: 32kB
       -> Index Scan using idx_orders_workspace_status on orders (cost=0.42..18000 rows=180000 width=24) (actual time=0.02..3800 rows=178240 loops=1)
            Index Cond: ((workspace_id = 42) AND (status = 'pending'::text))
Index definitions:
idx_orders_workspace_status: btree (workspace_id, status)
idx_orders_workspace_status size: 1.2 GB
table size: 18 GB, row count: 80M
Statistics:
n_dead_tup: 12.4M (15% of total)
last_autovacuum: 14 days ago
last_analyze: 30 days ago
Why is the query 4280 ms despite using the index? What is the complete fix?
Walk the full quarterly index audit on a 500 GB production Postgres database.
A query runs: WHERE LOWER(email) = 'alice@x.com'. The table has an index on (email). Why does the query do a Seq Scan?
A 200 MB index on a 2 GB table shows idx_scan = 0 for the last 45 days. What is the correct action?
Warning
Never drop an index just because idx_scan = 0. Always verify: (1) is it a unique/exclusion constraint index? (2) is it used by a periodic batch job that runs monthly or quarterly, outside the monitoring window? (3) were the counters recently reset by pg_stat_reset(), crash recovery, or a major-version upgrade? Cross-reference pg_stat_statements over 30+ days and ask the application team before dropping.
- 01 Name the seven production index failure modes and for each give the one-sentence diagnostic.
- 02 Walk the quarterly index audit in five steps.
- 03 What is the correct index strategy difference between development, staging, and production?
Seven failure modes cover the production index incidents: missing FK index (unindexed child column causes O(n) cascade scans); implicit type coercion (function on column disables index — use typed parameters); stale statistics (wrong plan from outdated row estimates — run ANALYZE after bulk ops); index bloat (REINDEX CONCURRENTLY on over-30% bloated indexes); wrong composite order (leading-column violation — redesign or add second index); JSONB GIN cardinality bomb (narrow to expression indexes on known-hot fields); and “has an index but still slow” (wrong structure — check EXPLAIN ANALYZE for Sort steps and Heap Fetches).
The quarterly audit: (1) DROP INDEX CONCURRENTLY unused indexes (idx_scan = 0 for 30+ days, not constraints). (2) REINDEX CONCURRENTLY bloated indexes. (3) CREATE INDEX CONCURRENTLY missing indexes found via pg_stat_statements. (4) DROP INDEX CONCURRENTLY redundant prefix indexes. (5) VACUUM tables that depend on index-only scans once dead tuples exceed 5% of live rows, then monitor for one week after the changes.
Index strategy by environment: minimal in dev (fast iteration), mirror-production in staging for benchmarks, full deliberate set in production. OLAP-specific indexes belong on logically replicated analytics copies, not the OLTP primary.