Databases
Index types: GIN, GiST, BRIN, Hash, Bloom, and HOT updates
A team adds CREATE INDEX ON events(payload) on a JSONB column and watches query time drop from 4 seconds to 4 seconds. B-tree cannot index inside JSONB. GIN can. Knowing which index type matches which data shape is what separates a working index from a wasted hour and a wasted gigabyte.
Why B-tree is not enough
B-tree indexes one value per row and answers equality, range, and sort queries on that value. When a row contains multiple values (JSONB keys, array elements, text lexemes) or when the “less than” relation is not defined (geometric shapes, IP ranges), B-tree has no useful sort order to walk. The other index types fill those gaps.
| Index type | Best for | Trade |
|---|---|---|
| B-tree | Equality, range, sort on ordered types | ~10-30% of table size; 5-50 µs/write |
| GIN | JSONB containment, arrays, full-text | 2-5x size; slow on high-write JSONB |
| GiST | Geospatial (PostGIS), range types, nearest-neighbour | Extension-dependent size; complex update |
| BRIN | Huge insert-only tables with natural ordering | Tiny size; only works on physically-ordered data |
| Hash | Equality on very wide keys | No range; usually B-tree is better |
| Bloom | Multi-column equality, unpredictable filter combos | Approximate (false positives); heap re-check required |
GIN: inverted index for composite types
Generalized Inverted Index. Where B-tree stores one value per row, GIN stores many values per row — every key in a JSONB document, every element in an array, every lexeme in a tsvector. Internally, GIN is a B-tree of indexed values, each pointing to a posting list (or posting tree for large lists) of TIDs that contain that value.
Use cases:
- JSONB containment:
WHERE payload @> '{"event_type":"login"}'— does this document contain this sub-document? - Array overlap:
WHERE tags && ARRAY['premium','active'] - Full-text search:
WHERE to_tsvector('english', body) @@ to_tsquery('postgres')
GIN indexes are typically 2-5x the size of the indexed column data. On high-write JSONB columns, each row insert updates every indexed key — so fastupdate mode defers those updates to a background process, giving faster writes at the cost of slightly stale index entries until the buffer is flushed. Production discipline: measure GIN write cost after adding it to a write-heavy column; if insert latency spikes, consider fastupdate or narrow to an expression index on known-hot JSONB fields.
GiST: generic search tree
For data that lacks a global sort order: geospatial shapes, IP ranges, timestamp ranges. GiST is extensible — each non-leaf node stores a “predicate” (e.g., a bounding box) that summarises its subtree. Queries prune subtrees by testing the predicate before descending.
Most engineers encounter GiST via PostGIS:
CREATE INDEX ON locations USING GIST (geom);
-- enables: WHERE ST_DWithin(geom, ST_MakePoint(-74, 40.7), 1000)Range types (int4range, tsrange) also use GiST by default. Nearest-neighbour queries (ORDER BY geom <-> point LIMIT 10) require GiST.
BRIN: tiny index for huge ordered tables
Block Range INdex. Stores min and max values per block range (default: 128 pages, ~1 MB of heap). To answer a query, Postgres reads the BRIN metadata, identifies candidate ranges, and scans only those blocks. BRIN is approximate — it narrows the heap scan but cannot eliminate it.
BRIN is only useful when data is physically ordered by the indexed column. The classic case: a time-series table where rows are inserted in timestamp order.
-- 10 billion rows, 8 TB table
CREATE INDEX ON events USING BRIN (created_at);
-- Result: ~few MB index; range query reads only the relevant 128-page blocksFor randomly distributed data, BRIN provides no speedup — every block range’s min-max spans the full range, and Postgres must scan all blocks.
Hash indexes
Equality only; no range support; no ORDER BY. Prior to Postgres 10, Hash indexes were not WAL-logged and could not survive crashes — avoid them in older installations. Post-10 they are stable. The narrow use case: equality lookup on very wide keys (e.g., a 1000-character JSON string) where B-tree’s size would be excessive. In practice, B-tree handles this well enough that Hash indexes remain rare. Default to B-tree unless you have a specific, measured reason.
Bloom indexes for multi-column equality
The bloom extension provides a probabilistic filter index. A Bloom index on (a, b, c, d, e) accelerates any query that filters by any subset of those columns — without the leading-column rule. The trade: it is approximate. Every match is a candidate that must be re-verified against the heap (false positives exist). Use case: tables with many columns where queries combine different filter sets unpredictably, making it impractical to build 2^N composite B-tree indexes.
HOT updates and fillfactor
When a row is UPDATEd, Postgres creates a new tuple version (MVCC) and must update every index referencing that row — even indexes whose columns did not change. This is write amplification: one logical update fans out to N index updates.
HOT (Heap-Only Tuple) eliminates index updates when both conditions hold:
- No indexed column changed in this update.
- The new tuple fits on the same heap page as the old one (there is free space).
When HOT fires, only the heap page changes — index entries still point to the old TID, which chains to the new version. The result: dramatically lower write overhead on update-heavy workloads.
Fillfactor controls how full each page is at insert time (default: 100% for tables, 90% for B-tree indexes). Setting fillfactor to 70–80% on an update-heavy table leaves room on pages for HOT updates:
ALTER TABLE orders SET (fillfactor = 70);
-- Re-fill pages: CLUSTER or VACUUM FULL (offline) or gradual via natural insertsThe trade is initial storage: lower fillfactor means more pages, larger table. Production teams measure update frequency and tune fillfactor on hot tables — a frequently overlooked lever.
Write cost accounting
Every index adds ~5-50 µs per write per index touched. For a table with 10 indexes, each INSERT costs 50-500 µs in index overhead beyond the heap write. Under a write-heavy workload (10,000 inserts/second), that is 0.5-5 ms of overhead per second of clock time. GIN indexes on JSONB columns with many keys can cost 50-200 µs per write — higher than any B-tree.
Senior teams model index budget per table: given write-throughput target, how many indexes can the table afford? Tools: pg_stat_statements for baseline latency, pgbench for synthetic write benchmarks under new index candidates.
Index size vs shared_buffers
Postgres caches hot pages in shared_buffers (typically 25% of server RAM). Large indexes compete with heap pages for cache space. A 5 GB GIN index on a JSONB column can evict hot heap pages, causing queries that were fast from cache to become disk-bound. Symptom: pg_stat_database.blks_hit / blks_read ratio drops after adding a new index. Mitigation: use partial indexes, expression indexes on known-hot fields, or drop unused indexes to reclaim cache.
Why ANALYZE matters more than the index itself
The planner’s choice of which index to use — or whether to use one at all — depends on table statistics collected by ANALYZE (run automatically by autovacuum, manually when needed). Stale statistics cause the planner to under- or over-estimate how many rows a filter matches, leading to wrong plan choices.
Production discipline:
- Run
ANALYZEafter bulk inserts, large UPDATEs, or schema changes affecting distributions. - Tune
autovacuum_analyze_scale_factorlower on tables with skewed data (default 0.2 = 20% change triggers analyze; set 0.02 for hot OLTP tables). - Use
CREATE STATISTICSfor correlated columns (e.g.,country, city) so the planner does not assume independence.
A “wrong index choice” diagnosis is often actually “stale stats.”
- B-tree size, 8-byte key, 100M rows
- ~1 GB
- GIN size vs indexed column data
- 2-5x larger
- BRIN size, 10B-row time-series
- ~few MB
- Hash index use case in production
- rare (B-tree usually better)
- HOT update savings (no indexed column changed)
- eliminates N index updates
- GIN write overhead on high-write JSONB
- ~50-200 µs per row
- B-tree write overhead per index
- ~5-50 µs per write
- fillfactor default (table / B-tree)
- 100% / 90%
- fillfactor for update-heavy HOT-friendly tables
- 70-80%
- ANALYZE trigger (autovacuum_analyze_scale_factor default)
- 20% of rows changed
A table stores geolocation data and the hot query is: WHERE ST_DWithin(geom, $1, 500). Which index type is correct?
HOT updates skip index maintenance. What is the requirement for a HOT update to trigger?
Match the scenario to the correct index type — order from best match to worst match for the scenario: 'search JSONB documents by key presence'.
- 1 GIN — designed for multi-value types; answers @> and ? operators natively
- 2 Expression B-tree on (payload->>'key') — works for equality on a known field, but not general containment
- 3 B-tree on payload — indexes the whole document as opaque; useless for containment queries
- 4 BRIN on payload — only useful for min/max on ordered data; irrelevant here
Why this works
Why does Postgres ship six index types instead of one universal index? Because the fundamental data structures are incompatible. B-tree requires a total order (every value is comparable with less-than). JSONB documents have no total order. Geometric shapes require a spatial predicate, not a linear key. Time-series blocks need range summaries. Trying to fit all of these into B-tree would produce either a useless or astronomically expensive structure. The design choice — multiple specialized types — keeps each index tight and appropriate for its data shape.
- 01A colleague suggests adding a GIN index to a JSONB audit_log column that receives 5,000 inserts per second. What are the write-cost implications, and what alternative might work?
- 02Explain when BRIN is dramatically better than B-tree and when it is useless.
- 03What is write amplification in the context of indexes, and what mechanisms reduce it?
Postgres ships six index types. B-tree covers equality, range, and sort on ordered types — 95% of production use. GIN indexes multiple values per row (JSONB keys, array elements, tsvector lexemes) using an inverted structure; it is 2-5x larger and slower to write than B-tree. GiST handles geospatial and range types where no linear sort order exists. BRIN stores min-max per block range and is only useful when data is physically ordered (time-series inserts). Hash is equality-only and rarely preferred over B-tree. Bloom is a probabilistic multi-column equality index that avoids the leading-column rule at the cost of false positives.
HOT updates skip index maintenance when no indexed column changed and the heap page has space — tune fillfactor to 70-80% on update-heavy tables to keep HOT firing. GIN write cost on high-write JSONB columns can reach 50-200 µs per insert; narrow to expression indexes on known-hot fields when write overhead is the bottleneck. ANALYZE must be current for the planner to pick the right index; stale statistics are the most common cause of “unexpected Seq Scan despite existing index.”
appears again in174
- The journey of a request: seven stops from socket to responsejunior
- Accept and parse: from kernel queue to a typed requestmiddle
- Routing and middleware: choosing what runs, and in what ordermiddle
- Handler and response: from business logic to bytes on the wiremiddle
- Streaming and backpressure: when the client reads slower than you writesenior
- Timeouts and tail latency: budgets, deadlines, and the fan-out trapsenior
- Middleware and DI: the two patterns that shape every backendjunior
- Writing middleware: signatures, next(), and the three framework modelsmiddle
- Inversion of control: how dependencies reach a classmiddle
- DI scopes and lifecycles: singleton, request, transientmiddle
- DI as a testing seam: fakes, mocks, and the boundary that matterssenior
- DI containers in production: resolution graphs, circular deps, and when not tosenior
- Blocking vs non-blocking I/O: two ways to waitjunior
- The event loop: one thread, ordered phasesmiddle
- What blocks the loop: CPU work and sync callsmiddle
- Offloading CPU work: worker threads and the libuv poolmiddle
- Backpressure and bounded concurrencysenior
- Throughput under load: tail latency and saturationsenior
- Why pool: the cost of creating a connectionjunior
- Pool sizing: why bigger is not fastermiddle
- Acquisition and timeouts: the wait queue is the real latency dialmiddle
- Retry strategies: backoff, jitter, and thundering herdmiddle
- Observability, production failures, and global-scale designsenior
- Tasks, microtasks, and scheduler.yield()middle
- Timer accuracy, throttling, and idle workmiddle
- Node.js event loop: phases, nextTick, and loop lagsenior
- Rendering strategies: SSG, SSR, ISR, streaming, and hydrationjunior
- SSG, SSR, ISR, streaming, and RSC — how each worksmiddle
- Hydration cost: selective, progressive, islands, resumabilitymiddle
- Core Web Vitals: what LCP, INP, and CLS measurejunior
- LCP: four phases, one dominant costmiddle
- INP: input delay, processing, presentationmiddle
- Lab vs field: why the two disagree and how to use eachmiddle
- Metric tradeoffs, RUM attribution, and the CI+field loopsenior
- The full picture: URL to LCP to INP as a relay racejunior
- Eight layers traced: from the service worker to the second navigationmiddle
- Five canonical breaks: where production reliably diessenior
- The three-track method: reading traces and building a monitored systemsenior
- Bits on the wirejunior
- Latency mathmiddle
- Bufferbloat and congestionsenior
- The physical frontiersenior
- Sequence numbers and connection statemiddle
- Flow control and congestion controlmiddle
- BBR, production observability, and beyond TCPsenior
- CDN: putting content next doorjunior
- Anycast and GeoDNS: routing to the nearest edgemiddle
- Tiered cache and Cache-Controlmiddle
- Vary header and cache keysmiddle
- Stale-while-revalidate and cache stampedesenior
- Edge workers and edge-side compositionsenior
- CDN operations and observabilitysenior
- WebSocket: the HTTP upgrade handshakejunior
- WebSocket vs SSE vs long-polling: choosing the right transportmiddle
- WebSocket backpressure: when clients can''''t keep upmiddle
- Reconnection: jittered backoff, thundering herd, message resumptionsenior
- WebSocket at scale: HTTP/2 multiplexing, permessage-deflate, C10Msenior
- WebSocket in production: proxies, security, and distributed architecturesenior
- What reverse proxies dojunior
- Balancing algorithms: round-robin to power-of-two-choicesmiddle
- L4 vs L7 load balancing and client-IP preservationmiddle
- Health checks, connection draining, and slow startmiddle
- Retry storms, circuit breakers, and load sheddingsenior
- Resilient LB architecture: anycast, zone-aware routing, and observabilitysenior
- Why QUIC and not TCP+TLSjunior
- QUIC streams and head-of-line blockingjunior
- Integrated handshake and 1-RTTmiddle
- Connection IDs and network migrationmiddle
- Loss detection and congestion controlmiddle
- 0-RTT resumption and packet encryptionsenior
- Deployment tradeoffs and CPU costsenior
- DDoS: what it is and why it worksjunior
- Amplification attacks and state exhaustionmiddle
- Rate limiting: algorithms and architecturemiddle
- WAFs, firewalls, mTLS, and HSTSmiddle
- DNS cache poisoning and BGP hijackingsenior
- Defense-in-depth architecture and attack economicssenior
- The twelve layers: one URL, seven actorsjunior
- DNS, TCP, TLS in sequence: where the milliseconds gomiddle
- Critical render path and Core Web Vitalsmiddle
- Proxy intercepts and security gates: rate limiters, WAF, mTLSmiddle
- Alternate paths: QUIC 0-RTT, WebSocket upgrade, connection migrationmiddle
- Observability: distributed traces, USE/RED, and samplingsenior
- Resilience: cascading retries, circuit breakers, and error budgetssenior
- What the three signals are: logs, metrics, and tracesjunior
- Metrics and cardinality: the cost model of a time-series databasemiddle
- Logs and volume: the cost model of structured loggingmiddle
- Traces and sampling: the cost model of distributed tracingmiddle
- Join keys and exemplars: making the three signals composemiddle
- Observability 2.0: wide events and the cost shiftsenior
- Failure modes and engineering practice: cardinality budgets, PII, and samplingsenior
- Why structured logs exist: the diary vs the spreadsheetjunior
- The production log schema: fields every line must carrymiddle
- Log levels and alert routingmiddle
- Sampling strategies and log costmiddle
- PII redaction and log injectionsenior
- Trace context propagation in logssenior
- OTel Logs Data Model and audit logs as a subsystemsenior
- OTel signals, Semantic Conventions, and the OTLP wire formatmiddle
- Auto-instrumentation and manual spans: the 80/20 of OTelmiddle
- The OTel Collector: receivers, processors, exporters, and deployment patternsmiddle
- Sampling strategies: head, tail, and parent-basedmiddle
- Vendor neutrality, eBPF instrumentation, the Operator, and browser/serverless OTelsenior
- Operating the OTel Collector: reliability, version skew, failure modes, and governancesenior
- RED and USE: two checklists, one triage disciplinejunior
- Instrumenting RED in Prometheus: counters, histograms, and cardinality disciplinemiddle
- USE on Linux: CPU, memory, disk, network, and PSImiddle
- Golden signals, dashboard layout, and service mesh auto-REDmiddle
- Cardinality as a cost driver: labels, PII, exemplars, and samplingmiddle
- Native histograms, SLO tie-in, and production failure patternsmiddle
- Choosing SLIs and SLO targets: ratios, not feelingsmiddle
- Multi-window multi-burn-rate alerting: why AND beats ORmiddle
- Error budget policy, latency SLOs, and composite journeysmiddle
- Iceberg SLIs, composite SLO math, and SLA vs SLOsenior
- Flame graphs: reading the picture that shows where time goesjunior
- Sampling vs instrumentation profiling: why 99 Hz wins in productionmiddle
- Profile types: CPU, memory, off-CPU, mutex — which one to reach formiddle
- Continuous profiling: always-on flame graphs with eBPF and trace-id correlationmiddle
- How flame graphs are built from samples, and the production workflows that use themmiddle
- Linux perf, eBPF internals, PGO, and the limits of samplingsenior
- Profiling in production: security, war stories, OTel profiles, and the infrastructure designsenior
- The debugging funnel: SLO → RED → trace → profilejunior
- OTel architecture: one SDK, four signals, one wire formatmiddle
- Cost discipline: keeping observability under 5% of infra spendmiddle
- Scale, security, and the ROI of observable systemssenior
- Why profile first: measure where time actually goesjunior
- Amdahl''''s law and self-time: the ceiling on every speedup you can shipmiddle
- The measurement loop: microbench, macrobench, prod profile, observer effectmiddle
- Reading flame graphs: shapes, per-language profilers, and the 60-second scanmiddle
- Statistical baselines: why one run is not a measurementmiddle
- Profiler history and microbenchmark pitfalls: Knuth to GWPsenior
- Hardware counters, cold-start profiles, and profile securitysenior
- Continuous profiling at scale: costs, CI gates, trace correlation, and anti-patternssenior
- What makes a hot path: symptom vs causejunior
- Five shapes of hotspot: CPU, alloc, cache, lock, syscallmiddle
- Reading parent and child chains: where to apply the fixmiddle
- JIT deopt, the fix-and-verify loop, and PR-time profilingmiddle
- Hardware counters and Intel TMA: sub-category diagnosissenior
- False sharing and native-bridge hot pathssenior
- Hot paths in production: security, tail latency, and tooling lineagesenior
- Memory hierarchy: why the same O(N) loop can be 17x slowerjunior
- Row-major vs column-major: access order and the 9x gapjunior
- Branch prediction and branchless codemiddle
- Hardware prefetcher, TLB, and memory-level parallelismsenior
- GC basics: what the runtime taxes you forjunior
- GC algorithms: generational, concurrent, and per-runtimemiddle
- GC tradeoffs: pause, throughput, heap — and object poolingmiddle
- GC tuning: pacing, heap shape, and allocation observabilitymiddle
- GC internals: tri-color invariant, write barriers, and per-runtime deep-divessenior
- GC in production: observability, security, edge cases, and fleet governancesenior
- N+1: one logical operation, many round-tripsjunior
- Fix families: JOIN, IN, preload, and DataLoadermiddle
- Detecting N+1: query logs, APM traces, and CI gatesmiddle
- DataLoader: batching across resolver treesmiddle
- Cross-protocol N+1: HTTP fan-out and Redis MGETmiddle
- N+1 at scale: pool exhaustion, plan changes, and denormalisationsenior
- Batching: amortize fixed cost per operationjunior
- The batching window: size and wait timemiddle
- Batching in Kafka and Postgresmiddle
- io_uring and observability of batchingmiddle
- From Nagle to io_uring: evolution of batchingmiddle
- Backpressure, failure isolation, and batch security in productionsenior
- What a bundle actually costs: download, parse, compile, executejunior
- Core Web Vitals: LCP, INP, and CLSmiddle
- Code splitting: route-level, component-level, vendor splittingmiddle
- Tree shaking and compression: removing what you don''''t usemiddle
- Third-party scripts: the silent budget killermiddle
- CI enforcement and RUM: making budgets stickmiddle
- V8 JIT pipeline, HTTP priorities, and bundle securitysenior
- The performance loop: discipline, not a projectjunior
- Classify and fix: matching bottleneck families to remediesmiddle
- Observability stack and CI gates: catching regressions before they shipmiddle
- Incident to enforcement: SLO burn to verified fix in 35 minutesmiddle
- Culture, economics, and org-scale performancesenior