Browser & Frontend Runtime
Metric tradeoffs, RUM attribution, and the CI+field loop
A team fixes LCP by inlining the full critical CSS bundle: LCP render delay drops 400 ms, but TTFB rises 60 ms because the HTML is now 40 KB larger. A separate team lazy-loads every non-hero image to cut initial page weight. LCP should have been unaffected, but the hero was accidentally caught and LCP regresses 800 ms. Another team ships a full SPA and gets green vitals on the first load, but every client-side route change is invisible to the metrics. One budget, three metrics, three teams, and no single knob.
The metrics are not independent dials.
Fixing one Core Web Vital can hurt another, and the senior move is to see the system, not individual metrics. Common tradeoffs:
- Inlining a large critical CSS block helps LCP render delay (no render-blocking request) but bloats the HTML, hurting TTFB. The net effect depends on connection speed and HTML size.
- Lazy-loading everything to shrink the bundle helps INP (less JS to parse and execute) but, applied accidentally to the hero image, adds a large load delay and wrecks LCP.
- Reserving generous space for CLS (large min-height containers for ads) can push the LCP element below the fold so it is no longer the LCP candidate. Sometimes that is fine (the new LCP candidate is already fast). Sometimes it is not.
- Shipping a large interactive framework for a snappy feel adds hydration: one large long task that spikes INP for early interactions. The same feature simultaneously hurts INP and can hurt LCP (parser-blocking JS delays the first render).
There is no single knob. The discipline: measure all three before and after any change, in the field where possible, in a throttled lab at minimum. A change that improves one metric at the cost of another may or may not be worth it — but you cannot know unless you look at all three.
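As a minimal sketch of that discipline, the snippet below compares all three metrics before and after a change and flags any metric that got worse or crossed its budget. The `Vitals` shape, the `compareVitals` helper, and the sample numbers are hypothetical; the thresholds are the "good" targets quoted later in this section.

```typescript
// Hypothetical shape for one measurement of the three Core Web Vitals.
type Vitals = { lcpMs: number; inpMs: number; cls: number };

// "Good" thresholds as used in this section: LCP <= 2.5 s, INP <= 200 ms, CLS <= 0.1.
const GOOD: Vitals = { lcpMs: 2500, inpMs: 200, cls: 0.1 };

type Verdict = {
  metric: keyof Vitals;
  before: number;
  after: number;
  delta: number; // after - before; negative means the metric improved
  crossedBudget: boolean; // within budget before the change, over budget after
};

function compareVitals(before: Vitals, after: Vitals): Verdict[] {
  const metrics: (keyof Vitals)[] = ["lcpMs", "inpMs", "cls"];
  return metrics.map((metric) => ({
    metric,
    before: before[metric],
    after: after[metric],
    delta: after[metric] - before[metric],
    crossedBudget: before[metric] <= GOOD[metric] && after[metric] > GOOD[metric],
  }));
}

// Illustrative numbers only: lazy-loading fixed INP but accidentally caught the hero image.
const report = compareVitals(
  { lcpMs: 2300, inpMs: 240, cls: 0.05 },
  { lcpMs: 3100, inpMs: 150, cls: 0.05 },
);

// Metrics that got worse, regardless of which one the change was meant to fix.
const regressions = report.filter((v) => v.delta > 0).map((v) => v.metric);
```

Looking only at INP, this change is a win. Looking at the full report, it trades a 90 ms INP improvement for an 800 ms LCP regression that crosses the budget.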
INP attribution in the field: from metric to line of code.
A field INP number alone is not actionable — it is a latency with no cause attached. The attribution chain:
- A PerformanceObserver subscribed to event entries gives you, per interaction, the input delay / processing time / presentation delay split.
- Long Animation Frames (long-animation-frame entries via PerformanceObserver) give you, for the frame where a slow interaction landed, the array of scripts that ran, with source URLs, function names, and durations.
- Combine them: high input-delay INP → find the long task in the LoAF entry; high processing-time INP → the handler, attributed by LoAF; high presentation-delay INP → a heavy render or forced synchronous layout.
The whole point of wiring LoAF into RUM is to turn “p75 INP is 340 ms” into “the search filter function at search.js:88 is the dominant script in slow frames.” That is actionable.
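The attribution chain can be sketched as pure functions over the entry data. The interfaces below are structural subsets of what the browser reports for event and long-animation-frame entries, declared locally so the sketch runs outside a browser. The helper names, the URLs, and the sample numbers are hypothetical, and matching frames to an interaction by time overlap is a simplifying assumption.

```typescript
// Subset of an event timing entry (all times in ms).
interface EventTimingLike {
  startTime: number;
  processingStart: number;
  processingEnd: number;
  duration: number;
}

// Subset of a long-animation-frame entry and its per-script records.
interface ScriptTimingLike {
  sourceURL: string;
  sourceFunctionName: string;
  duration: number;
}
interface LoAFLike {
  startTime: number;
  duration: number;
  scripts: ScriptTimingLike[];
}

type InpPhase = "inputDelay" | "processingTime" | "presentationDelay";

// Split one interaction into the three INP phases.
function splitInteraction(e: EventTimingLike): Record<InpPhase, number> {
  return {
    inputDelay: e.processingStart - e.startTime,
    processingTime: e.processingEnd - e.processingStart,
    presentationDelay: Math.max(0, e.startTime + e.duration - e.processingEnd),
  };
}

function dominantPhase(split: Record<InpPhase, number>): InpPhase {
  const phases: InpPhase[] = ["inputDelay", "processingTime", "presentationDelay"];
  return phases.reduce((a, b) => (split[b] > split[a] ? b : a));
}

// Longest-running script among the frames that overlap the interaction in time.
function dominantScript(e: EventTimingLike, frames: LoAFLike[]): ScriptTimingLike | undefined {
  const end = e.startTime + e.duration;
  let worst: ScriptTimingLike | undefined;
  for (const f of frames) {
    const overlaps = f.startTime < end && f.startTime + f.duration > e.startTime;
    if (!overlaps) continue;
    for (const s of f.scripts) {
      if (worst === undefined || s.duration > worst.duration) worst = s;
    }
  }
  return worst;
}

// Illustrative data: one slow interaction and two frames, only one of which overlaps it.
const interaction: EventTimingLike = { startTime: 1000, processingStart: 1012, processingEnd: 1290, duration: 344 };
const frames: LoAFLike[] = [
  {
    startTime: 990,
    duration: 360,
    scripts: [
      { sourceURL: "https://example.com/search.js", sourceFunctionName: "applyFilter", duration: 262 },
      { sourceURL: "https://example.com/analytics.js", sourceFunctionName: "track", duration: 9 },
    ],
  },
  { startTime: 2000, duration: 60, scripts: [{ sourceURL: "https://example.com/ads.js", sourceFunctionName: "refresh", duration: 55 }] },
];

const split = splitInteraction(interaction);
const phase = dominantPhase(split);
const culprit = dominantScript(interaction, frames);
```

In a RUM beacon, sending `phase` together with the culprit's `sourceURL` and `sourceFunctionName` is what turns a bare latency into a pointer at code.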
LCP attribution in the field.
The web-vitals library’s onLCP callback, when imported from the library’s attribution build, delivers not just the LCP time but the element (so you know what the browser picked as LCP), the URL of the resource (if it is an image), and the four-phase breakdown: TTFB, load delay, load time, render delay. Logging these to telemetry means a regression such as “LCP went from 1.9 s to 3.8 s” immediately surfaces which phase grew: TTFB (a slow deploy), load delay (an accidental loading="lazy"), load time (an unoptimised image), or render delay (a new render-blocking script).
- LCP element: img.hero
- LCP phase (load time): 3510 ms (dominant; 4 MB JPEG)
- LCP phase (TTFB + delay + render): 610 ms combined
- INP: 38 ms (good)
- CLS: 0.02 (good)
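Reading a phase split is mechanical enough to automate in a telemetry pipeline. The sketch below picks the dominant phase and attaches the first suspect named above for it. The `diagnoseLcp` helper and the field names are hypothetical and do not mirror the web-vitals attribution object; the numbers are the ones from this section's example (load time 3510 ms, the other three phases 610 ms combined).

```typescript
type LcpPhase = "ttfb" | "loadDelay" | "loadTime" | "renderDelay";
type LcpPhases = Record<LcpPhase, number>;

// First thing to check per phase, taken from the causes listed in this section.
const FIRST_SUSPECT: Record<LcpPhase, string> = {
  ttfb: "a slow deploy",
  loadDelay: 'an accidental loading="lazy"',
  loadTime: "an unoptimised image",
  renderDelay: "a new render-blocking script",
};

interface LcpDiagnosis {
  totalMs: number;
  dominant: LcpPhase;
  share: number; // fraction of total LCP time spent in the dominant phase
  firstSuspect: string;
}

function diagnoseLcp(phases: LcpPhases): LcpDiagnosis {
  const names: LcpPhase[] = ["ttfb", "loadDelay", "loadTime", "renderDelay"];
  let totalMs = 0;
  let dominant: LcpPhase = names[0];
  for (const name of names) {
    totalMs += phases[name];
    if (phases[name] > phases[dominant]) dominant = name;
  }
  return {
    totalMs,
    dominant,
    share: totalMs > 0 ? phases[dominant] / totalMs : 0,
    firstSuspect: FIRST_SUSPECT[dominant],
  };
}

// Phase values from this section's example.
const diagnosis = diagnoseLcp({ ttfb: 280, loadDelay: 90, loadTime: 3510, renderDelay: 240 });
```

With load time taking roughly 85% of the total, work on TTFB or render-blocking resources cannot bring this page under budget; the image itself has to shrink.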
The RUM and CI loop — both halves required.
The complete production observability setup has two halves.
RUM: ship the web-vitals library (or equivalent), which uses PerformanceObserver to capture LCP, INP, and CLS exactly as Chrome scores them, plus attribution (element, phase split, interaction target, shifting nodes). Send them to telemetry tagged by route, device class, country, and release. This is the real verdict — it catches regressions a lab never will, especially device-class regressions and interactions that only real users trigger.
CI: a synthetic gate — Lighthouse CI or a Playwright trace — that runs on every PR, throttled to a realistic mid-tier device, asserting budgets on LCP, total blocking time (the lab proxy for INP), and CLS, and failing the build on a regression.
Neither alone is enough. Lab without RUM ships regressions that only real devices reveal. RUM without a lab gate means every regression reaches production before anyone sees it.
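On the RUM side, the tags matter as much as the values. The sketch below computes p75 per segment from tagged samples and shows how a blended p75 can hide a device-class problem. The `RumSample` shape, the helper names, and the data are hypothetical, and the nearest-rank percentile is one common definition, not necessarily the one a given RUM backend or CrUX uses.

```typescript
type MetricName = "LCP" | "INP" | "CLS";

// Hypothetical beacon shape: one metric value plus the tags it was sent with.
interface RumSample {
  metric: MetricName;
  value: number;
  route: string;
  deviceClass: string;
  release: string;
}

// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Group samples of one metric by an arbitrary tag and compute p75 per group.
function p75BySegment(
  samples: RumSample[],
  metric: MetricName,
  keyOf: (s: RumSample) => string,
): Map<string, number> {
  const groups = new Map<string, number[]>();
  for (const s of samples) {
    if (s.metric !== metric) continue;
    const key = keyOf(s);
    const bucket = groups.get(key);
    if (bucket) bucket.push(s.value);
    else groups.set(key, [s.value]);
  }
  const result = new Map<string, number>();
  groups.forEach((values, key) => result.set(key, percentile(values, 75)));
  return result;
}

// Illustrative data: mostly fast desktop sessions plus a few slow mobile ones.
const lcp = (value: number, deviceClass: string): RumSample => ({
  metric: "LCP",
  value,
  route: "/article",
  deviceClass,
  release: "example-release",
});
const samples: RumSample[] = [1500, 1600, 1700, 1800, 1900, 2000]
  .map((v) => lcp(v, "desktop"))
  .concat([3900, 4100].map((v) => lcp(v, "mobile")));

const blended = p75BySegment(samples, "LCP", () => "all");
const byDevice = p75BySegment(samples, "LCP", (s) => s.deviceClass);
```

Here the blended p75 sits inside the 2.5 s budget while the mobile segment is well outside it, which is the kind of regression a single untagged number would never show.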
The soft-navigation gap — why SPA vitals need explicit instrumentation.
Core Web Vitals were designed around full page loads. In a single-page app, the first load has a real LCP — but subsequent client-side route changes (“soft navigations”) historically had no LCP measurement at all, because no document load event fires. To a user a soft navigation feels exactly like a page load — they clicked a link and expect new content fast — but the metric did not see it.
Chrome has been shipping soft-navigation support to attribute LCP and other vitals to client-side route changes, but coverage is still maturing and not all frameworks emit the right hints. The consequence: for a SPA, do not assume your vitals story is complete just because the initial-load numbers are green. Snappy first load + sluggish route transitions is a real, common, and historically under-measured failure. Instrument soft navigations explicitly with the PerformanceObserver soft-navigation entries or your own RUM marks on router.beforeEach / route change events.
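For the "own RUM marks" option, a minimal sketch is a timer that starts on the route-change hook and stops once the new route's main content has rendered. The `SoftNavTimer` class is hypothetical, and where `start` and `end` get called depends on the router and framework in use. The clock is injectable so the sketch runs anywhere; in a browser one would likely pass `() => performance.now()`.

```typescript
interface SoftNavRecord {
  from: string;
  to: string;
  durationMs: number;
}

class SoftNavTimer {
  readonly records: SoftNavRecord[] = [];
  private pending: { from: string; to: string; start: number } | null = null;
  private readonly now: () => number;

  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  // Call from the router's "navigation started" hook (for example a beforeEach guard).
  start(from: string, to: string): void {
    // A new navigation replaces any unfinished one; the abandoned route is not recorded.
    this.pending = { from, to, start: this.now() };
  }

  // Call once the new route's main content has rendered.
  end(): SoftNavRecord | null {
    if (this.pending === null) return null;
    const { from, to, start } = this.pending;
    this.pending = null;
    const record: SoftNavRecord = { from, to, durationMs: this.now() - start };
    this.records.push(record);
    return record;
  }
}

// Deterministic example with a fake clock; routes and timings are illustrative.
let fakeNow = 1000;
const timer = new SoftNavTimer(() => fakeNow);
timer.start("/articles", "/articles/42");
fakeNow = 1420;
const navigation = timer.end();
```

Each record can then be sent to the same telemetry pipeline as the load-time vitals, tagged by route, so sluggish transitions show up next to green initial-load numbers.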
Why this works
The CLS session window (worst 5-second cluster rather than lifetime sum) is one concrete example of the spec evolving to match real user experience rather than pure engineering measurement. The original lifetime-sum CLS unfairly penalised long-lived pages and infinite scroll — a shift that happened four minutes into a session counted the same as one that happened at second 2. The session window focuses the metric on concentrated bad behavior: a burst of shifts during ad reload or a batch of unsized images loading. It is more representative of what a user actually notices in context, and it changed how you reason about CLS: spreading shifts apart across windows can reduce the score even without fixing the root cause, but the root cause (unreserved content) still creates a bad experience.
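The difference between the two scoring rules is easy to see in code. The sketch below computes both the lifetime sum and the session-window score from a list of layout shifts, using the windowing rule Chrome documents for CLS: a window grows while consecutive shifts are less than 1 second apart and the window is under 5 seconds long, and the score is the largest window. The `ShiftLike` type is a local subset of a layout-shift entry, and the sample shifts are illustrative.

```typescript
// Subset of a layout-shift entry (startTime in ms).
interface ShiftLike {
  startTime: number;
  value: number;
  hadRecentInput: boolean;
}

// Original rule: every unexpected shift over the page lifetime counts.
function lifetimeCls(shifts: ShiftLike[]): number {
  let sum = 0;
  for (const s of shifts) if (!s.hadRecentInput) sum += s.value;
  return sum;
}

// Session-window rule: score is the worst burst of shifts, not the running total.
function sessionWindowCls(shifts: ShiftLike[]): number {
  const sorted = shifts.slice().sort((a, b) => a.startTime - b.startTime);
  let worst = 0;
  let windowValue = 0;
  let windowStart = 0;
  let lastShift = 0;
  let open = false;
  for (const s of sorted) {
    if (s.hadRecentInput) continue; // shifts right after user input are excluded
    const continues = open && s.startTime - lastShift < 1000 && s.startTime - windowStart < 5000;
    if (continues) {
      windowValue += s.value;
    } else {
      windowValue = s.value;
      windowStart = s.startTime;
      open = true;
    }
    lastShift = s.startTime;
    worst = Math.max(worst, windowValue);
  }
  return worst;
}

// A burst around second 2, then a single shift four minutes into the session.
const shifts: ShiftLike[] = [
  { startTime: 2000, value: 0.04, hadRecentInput: false },
  { startTime: 2400, value: 0.03, hadRecentInput: false },
  { startTime: 240000, value: 0.05, hadRecentInput: false },
];

const lifetime = lifetimeCls(shifts); // about 0.12, over the 0.1 budget
const windowed = sessionWindowCls(shifts); // about 0.07, the early burst
```

The late shift still happened and still annoyed the user, but it no longer pushes a long-lived page over the budget on its own.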
[LCP] value: 4120 ms rating: poor
element: img.hero
url: /assets/hero-original.jpg (3.8 MB, JPEG, 4000x3000)
phase split: ttfb 280ms | loadDelay 90ms | loadTime 3510ms | renderDelay 240ms
[INP] value: 38 ms rating: good
[CLS] value: 0.02 rating: good
Read the phase split and the element details: which phase dominates, what is the root cause, and what is the fix? What should the team NOT do, and why?
Which browser API is the basis for measuring LCP, INP, layout shifts, and Long Animation Frames in real-user monitoring?
Design the Core Web Vitals strategy for a media site: article pages with a hero image, ads, embeds, and a comments section. Hit good LCP, INP, and CLS at p75 in the field, and keep them green over time.
- Hero image is the LCP element on every article.
- Ads and social embeds load late and into the article body.
- Comments section is interactive and below the fold.
- The site is server-rendered and hydrates.
- Targets: LCP ≤2.5 s, INP ≤200 ms, CLS ≤0.1 — all at field p75.
- Regressions must be caught before they reach production.
- Hero LCP: in the HTML, sized, fetchpriority='high', never lazy, modern format.
- CLS: reserve space for everything that loads late — images, ads, embeds, fonts.
- INP: minimise hydration — static/Server Components for the article, comments as a scroll-hydrated island.
- RUM with web-vitals + attribution is the field verdict; a throttled CI gate catches regressions before production.
- Measure all three before and after every change — fixing one can break another.
Why can a page show a great Lighthouse score but still be flagged for poor Core Web Vitals in Search Console?
- 01 A page has poor LCP. Walk through how you diagnose it using the phase split and why reading the phase split is the key step.
- 02 Explain the complete production observability setup for Core Web Vitals: what RUM provides, what CI provides, and why neither is sufficient alone.
- 03 Why are INP and the SSR/hydration material inseparable, and what does an early-only INP pattern tell you?
The three Core Web Vitals share a performance budget: inlining CSS helps LCP render delay but hurts TTFB; lazy-loading reduces bundle size but can add LCP load delay if accidentally applied to the hero; generous CLS space reservation can change the LCP candidate. Every change must be measured against all three metrics, not one in isolation. Complete production observability requires both RUM (web-vitals library + PerformanceObserver, sending attributed LCP/INP/CLS to telemetry tagged by route and device class) and a throttled CI gate (Lighthouse CI or Playwright, budgeting LCP/TBT/CLS per PR). RUM is the real verdict; CI catches regressions before they reach production. Vitals in SPAs are further complicated by soft navigations, which historically had no LCP measurement, so instrument route changes explicitly. The field p75 from CrUX is the only vitals number that feeds ranking; a lab fix that does not move it did not help real users.