Performance
Profile first: run the loop on a misleading service
Reading about profile-first is not the same as catching intuition lying to you. Build a service whose obvious-looking bottleneck is a decoy, then run the full measurement loop until the profile — not your gut — names the real cost, and prove every step with numbers.
Turn the unit’s mental model into a reproducible engineering loop: quantify the complaint, reproduce under realistic load, read the flame graph, predict the win with Amdahl, fix the real hotspot, and verify with statistically valid before/after metrics. Along the way, confirm the measurement itself is trustworthy before you let it overrule your intuition.
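The "predict the win with Amdahl" step is plain arithmetic on the shares the profile reports. A minimal sketch, assuming the profile gives each frame's fraction of total request time; the function names and the example shares (a 5% decoy, a 60% real hotspot) are illustrative assumptions, not measurements from any real service.

```python
import math


def amdahl_speedup(fraction: float, local_speedup: float) -> float:
    """Overall speedup when `fraction` of total time runs `local_speedup` times faster."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    if local_speedup <= 0.0:
        raise ValueError("local_speedup must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


def amdahl_ceiling(fraction: float) -> float:
    """Upper bound on overall speedup if that fraction's cost dropped to zero."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    return math.inf if fraction == 1.0 else 1.0 / (1.0 - fraction)


# Hypothetical shares: the decoy is 5% of request time, the real hotspot 60%.
decoy_win = amdahl_speedup(0.05, 10.0)    # 10x faster decoy: about 1.05x overall
hotspot_win = amdahl_speedup(0.60, 4.0)   # 4x faster hotspot: about 1.82x overall
hotspot_cap = amdahl_ceiling(0.60)        # even a free hotspot caps at 2.5x
```

Writing the predicted number down before the fix is what lets the write-up compare predicted and actual speedup afterwards.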
Take an HTTP service with a deliberately misleading slow endpoint (your own or the starter described below) and run the complete profile-first loop — quantify, reproduce, profile, predict, fix, verify — driving p99 to target while proving with a profile that the obvious suspect was NOT the bottleneck.
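One way to picture a decoy endpoint is sketched below. This is a toy stand-in, not the starter service: every name here (validate_payload, dedupe, handle_request, profile_shares) is hypothetical. It assumes the "obvious" suspect is a per-item regex, while the real cost hides in a quadratic list-membership check. It uses Python's built-in cProfile, a deterministic instrumenting profiler with high overhead, rather than the sampling profiler and flame graph a real run would use.

```python
import cProfile
import pstats
import re


def validate_payload(items):
    """Decoy: a regex per item looks expensive but is linear and cheap."""
    pattern = re.compile(r"^item-\d+$")
    return [s for s in items if pattern.match(s)]


def dedupe(items):
    """Real hotspot: `in` on a list is a linear scan, so this is O(n^2)."""
    seen = []
    for s in items:
        if s not in seen:
            seen.append(s)
    return seen


def handle_request(n=3000):
    """Toy handler body: build a payload with duplicates, validate, dedupe."""
    items = [f"item-{i % (n // 2)}" for i in range(n)]
    return dedupe(validate_payload(items))


def profile_shares(func, names):
    """Run func under cProfile; return each named function's cumulative-time share."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    stats = pstats.Stats(profiler)
    total = stats.total_tt  # total time recorded across all profiled functions
    shares = {}
    for (_file, _line, name), (_cc, _nc, _tt, ct, _callers) in stats.stats.items():
        if name in names:
            shares[name] = ct / total
    return shares


if __name__ == "__main__":
    for name, share in profile_shares(handle_request, {"validate_payload", "dedupe"}).items():
        print(f"{name}: {share:.1%} of profiled time")
```

In this sketch the fix the profile points at would be replacing the list in dedupe with a set-backed check; optimizing the regex would barely move the total.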
- A before/after table: p50, p95, p99 endpoint latency across ≥5 runs each, with a 95% CI — measured under identical load, not estimated, and showing p99 met the stated target.
- A short write-up stating the pre-profile intuition, the profile that contradicted it, the real hotspot with its CPU (or off-CPU) share, and the Amdahl-predicted vs actual total speedup.
- The profile diff clearly shows the real hotspot frame shrinking after the fix (re-profiled, not assumed), and the decoy suspect confirmed as a small share throughout.
- Evidence of the observer-effect check: the headline metric measured with the profiler off and with it on, agreeing within ~5%.
- Add a one-page triage runbook: how to quantify a 'slow' complaint, pick the measurement scope (micro/macro/prod), read flame-graph shapes, and verify a win statistically — the loop a new on-call should follow.
- Wire a profile-diff CI gate: deploy the PR branch to a canary, run a 5-minute load test, capture CPU + off-CPU profiles, diff against a weekly-refreshed main baseline, and fail the build if any function's CPU share grows by ≥10 percentage points (absolute) or by ≥30% of its baseline share (relative).
- Add hardware-counter analysis (perf stat -e cycles,instructions,cache-misses) on the real hotspot; report its IPC and decide from the number whether the next fix should target data layout or the algorithm.
- Add a cold-start profile (first 30–60 s after launch) alongside the steady-state one and show that the hotspots differ — then name a cold-start-specific fix (eager warmup, connection pre-warming, or AOT) the steady-state profile would never have suggested.
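The comparison at the heart of the profile-diff gate can be sketched as a pure function over two share tables. This assumes each profile has already been reduced to a mapping from function name to CPU share in percent, and it reads the thresholds as 10 percentage points of absolute growth or 30% growth relative to the baseline share. The min_share noise floor is an addition of this sketch, not part of the stated rule, and all names are hypothetical.

```python
def share_regressions(baseline, candidate, abs_points=10.0, rel=0.30, min_share=1.0):
    """Return (name, old_share, new_share) for functions that trip the gate.

    Shares are percentages of total CPU time. `min_share` is an assumed noise
    floor: functions below it in the candidate profile are ignored so that
    tiny frames cannot trip the relative threshold.
    """
    failures = []
    for name in sorted(candidate):
        new = candidate[name]
        old = baseline.get(name, 0.0)  # a function absent from baseline counts as 0%
        growth = new - old
        if growth <= 0.0 or new < min_share:
            continue
        grew_abs = growth >= abs_points
        grew_rel = old == 0.0 or growth / old >= rel
        if grew_abs or grew_rel:
            failures.append((name, old, new))
    return failures


# Hypothetical shares for illustration only.
baseline = {"parse": 20.0, "render": 50.0, "log": 2.0}
candidate = {"parse": 27.0, "render": 51.0, "log": 2.2}
failed = share_regressions(baseline, candidate)  # parse grew 35% relative
```

A CI job would exit non-zero when the returned list is non-empty and print the offending rows so the author can see which frame grew.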
This is the loop you will run in every real performance investigation: quantify the complaint into a target, reproduce under realistic load, verify the profiler is not lying (observer effect), read the flame graph by shape before names, let the measurement contradict your intuition, predict the ceiling with Amdahl, fix the one real hotspot, and confirm with median/p95/p99 across multiple runs. Building it once on a service whose obvious bottleneck is a decoy is what makes “profile first, then change code” muscle memory instead of a slogan.
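The verification steps in that loop reduce to a few small calculations. The sketch below assumes each run yields a list of latency samples, takes one percentile per run, and then forms a 95% confidence interval across the per-run values with Student's t. It hard-codes the two-sided 95% critical values for 5 to 10 runs to stay within the standard library, and the helper names are hypothetical. A nearest-rank percentile is one of several valid definitions.

```python
import math
import statistics

# Two-sided 95% Student's t critical values, keyed by degrees of freedom (runs - 1).
T_95 = {4: 2.776, 5: 2.571, 6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262}


def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample with at least q% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]


def mean_ci95(per_run_values):
    """Mean and 95% CI of one metric (for example, per-run p99) across 5 to 10 runs."""
    n = len(per_run_values)
    if n - 1 not in T_95:
        raise ValueError("this sketch supports 5 to 10 runs")
    mean = statistics.mean(per_run_values)
    half_width = T_95[n - 1] * statistics.stdev(per_run_values) / math.sqrt(n)
    return mean, mean - half_width, mean + half_width


def within_tolerance(metric_profiler_off, metric_profiler_on, tolerance=0.05):
    """Observer-effect check: the profiler shifts the headline metric by at most `tolerance`."""
    return abs(metric_profiler_on - metric_profiler_off) / metric_profiler_off <= tolerance
```

If the before and after intervals for p99 overlap heavily, the fix has not been shown to work yet, whatever the single best run says.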