Data Engineering
Vector search: the recall–latency–memory triangle behind RAG
The RAG demo nailed every question. In prod, support says the bot “can’t find” docs it obviously has. Nothing errors — every query returns its ten chunks, ranked, fast. The catch: the team shipped an HNSW index but never raised hnsw.ef_search off the default of 40, and then bolted on a WHERE tenant_id = ? filter. Recall had quietly fallen to ~30%. The system was confidently returning the wrong ten rows, and no log, no metric, no exception ever said so.
Embeddings are points; search is distance
An embedding model turns text (or an image) into a fixed-length vector — 768 dims for many open models, 1536 for OpenAI’s text-embedding-3-small, 3072 for the large one. Semantically similar inputs land near each other in that space, so “find relevant docs” becomes “find the nearest vectors to the query vector.” Nearness is a metric: cosine similarity (angle), dot/inner product, or L2 (Euclidean) distance. For normalized embeddings cosine and dot rank identically, which is why people use inner product as the cheaper equivalent.
The naive answer is exact k-nearest-neighbor: compare the query to every stored vector and keep the top k. That is O(N·d) per query — at 10M vectors × 1536 dims that is ~15 billion multiply-adds for a single search. Fine for 50k rows, fatal for 50M. This is why a brute-force ORDER BY embedding <-> $1 LIMIT 10 with no index is correct but unshippable past a few hundred thousand rows.
ANN: trade recall for speed
Approximate nearest neighbor (ANN) indexes give up the guarantee of finding the true top-k in exchange for sub-linear search. The quality metric is recall@k: of the true top-10, how many did the index actually return? Recall 1.0 means perfect; 0.9 means one of every ten “best” results is missing. The senior insight is that recall is a tuning dial, not a fixed property — and crucially, a query at recall 0.3 looks identical to a query at recall 1.0: same row count, same latency profile, same lack of errors. You only catch it by measuring recall against an exact baseline.
The dominant index is HNSW (Hierarchical Navigable Small World). It builds a layered graph: sparse long-range links at the top for fast jumps across the space, dense short links at the bottom for precision. A query enters at the top, greedily walks toward the query vector, and descends. Three knobs govern the triangle:
M— links per node (pgvector default 16). Higher M → better recall and a bigger, denser graph. The HNSW graph itself runs 2–5× the size of the raw vectors in RAM, and M is the main lever on that.ef_construction— candidate pool size while building (default 64). Higher → better graph quality and recall, but build time scales roughly linearly; 400 vs 200 is about 2× slower to build.ef_search— candidate pool size at query time (default 40 in pgvector). This is the runtime recall/latency dial. Raising it from 100 to 500 might move recall from ~85% to ~98% at the cost of going from ~1ms to ~5ms per query.
| Index | Memory | Recall | Build / writes | Pick when |
|---|---|---|---|---|
HNSW | High (2–5× raw vectors) | Highest, easiest to hit | Slow build, cheap incremental inserts | Default for RAG; data changes; recall matters |
IVFFlat | ~1× raw vectors | Good, drifts as data changes | Fast build, needs data present first | Big static-ish set, RAM tight |
IVF-PQ | Lowest (compress ~7×) | Lower (~70%), recoverable | Build + train codebook | 100M+ vectors won’t fit in RAM |
IVF and IVF-PQ: cluster, then quantize
The other family is IVF (inverted file). It k-means-clusters all vectors into lists buckets; a query only scans the probes nearest buckets instead of everything. In pgvector that is ivfflat with lists at build time and ivfflat.probes at query time (default 1 — far too low for good recall). IVF’s memory is close to 1× the raw vectors, but it has a sharp edge: the clusters are fixed at build, so as you insert new data the centroids go stale and recall drifts until you reindex. HNSW has no such drift — graph inserts are constant-cost and keep quality.
IVF-PQ adds product quantization: split each vector into sub-vectors and replace each with the id of its nearest centroid from a small codebook, compressing a 1536-dim float32 vector (6 KB) down to a few dozen bytes. NVIDIA showed 360 GiB of FP32 vectors compressed to 54 GiB while holding >95% recall on a billion vectors. The cost is recall loss from quantization error — IVF-PQ alone often lands near 70% of HNSW’s recall and needs a rerank step (re-score the top candidates with full-precision distances) to recover. PQ is the tool when 100M+ vectors simply will not fit in RAM, not a default.
Why this works
The “curse of dimensionality” is why this is hard at all. In 1536 dimensions almost every pair of random points is roughly equidistant — distances concentrate, so the contrast between the nearest neighbor and the 100th gets thin. That is exactly what makes recall fragile: a slightly-too-greedy graph walk or a too-coarse quantizer can’t tell the real neighbor from a near-miss, and approximate methods give up the guarantee precisely because exact separation costs O(N·d).
Filtering and hybrid search: where prod breaks
Two production traps. First, metadata filtering interacts badly with ANN. pgvector does post-filtering: it walks the HNSW graph for ef_search candidates, then applies your WHERE tenant_id = ?. If the filter keeps 1% of rows and ef_search is 40, you might end up with almost nothing — the graph navigated a tiny, near-disconnected subgraph and recall collapses. pgvector 0.8.0 added hnsw.iterative_scan (strict_order / relaxed_order) to keep pulling candidates until the filter is satisfied, which trades latency back for the recall you lost.
Second, pure vector search misses exact matches. A query for an error code, a SKU, or a function name is a lexical need, and embeddings smear it into “semantically near” — so panic: nil pointer can rank below a vaguely-related paragraph. The fix is hybrid search: run BM25 (keyword) and vector in parallel and fuse with Reciprocal Rank Fusion (RRF = Σ 1/(k + rank), k≈60), which combines by rank position so the two incompatible score scales never have to be compared. Teams report jumping from ~65–78% to ~91% recall@10 by adding the keyword leg.
A RAG service over ~5M chunks gets continuous ingestion, needs high recall, and has RAM to spare. Pick the index strategy.
Your HNSW-backed search returns ten ranked rows in 2ms, no errors — but answers are subtly wrong. What's the first thing to check?
Users search for the exact error string 'ECONNREFUSED' and pure vector search buries the matching doc. Best fix?
Order how an HNSW query finds nearest neighbors and how you'd tune it:
- 1 Embed the query into the same vector space as the stored documents
- 2 Enter the HNSW graph at the top, sparse layer for fast long-range jumps
- 3 Greedily walk toward the query vector, descending into denser layers
- 4 Collect ef_search candidates at the bottom layer and return the top k by distance
- 5 Measure recall@k vs an exact baseline; raise ef_search if recall is too low
- 01Why is a low-recall ANN index so dangerous in production, and how do you actually detect it?
- 02When would you reach for IVF-PQ over HNSW, and what's the cost you accept?
Vector search turns “find relevant content” into “find the nearest embeddings,” ranked by cosine, dot, or L2 distance. Exact kNN is correct but O(N·d), so past a few hundred thousand rows you switch to approximate nearest neighbor and accept a recall-for-speed trade. HNSW is the default: a layered graph tuned by M and ef_construction at build and ef_search at query time, costing 2–5× the raw vectors in RAM but delivering the highest recall with the least drift as data changes. IVF clusters and IVF-PQ compresses — the latter is for corpora too big for RAM, paying recall (≈70% of HNSW) you recover with reranking. The senior discipline is the triangle of recall, latency, and memory, and one habit above all: recall drops are silent — identical row counts and latency — so you measure recall@k against an exact baseline, raise ef_search deliberately, watch out for selective metadata filters that post-filtering destroys (use iterative scan), and add hybrid BM25 + vector with rank fusion whenever exact matches matter.