Data Engineering DATA · 07 · 01

Vector search: the recall–latency–memory triangle behind RAG

Semantic search ranks embeddings by distance, but exact kNN is O(N·d) and dies at scale. ANN indexes like HNSW buy speed by trading recall — and the wrong ef param drops recall silently while every query still returns ten rows.

DATA Junior ◷ 17 min

Level

FoundationsJuniorMiddleSenior

The RAG demo nailed every question. In prod, support says the bot “can’t find” docs it obviously has. Nothing errors — every query returns its ten chunks, ranked, fast. The catch: the team shipped an HNSW index but never raised hnsw.ef_search off the default of 40, and then bolted on a WHERE tenant_id = ? filter. Recall had quietly fallen to ~30%. The system was confidently returning the wrong ten rows, and no log, no metric, no exception ever said so.

In the next fifteen minutes you will see exactly why recall drops are invisible, which index knobs move which triangle edge, and what to measure before your next RAG deploy.

Embeddings are points; search is distance

An embedding model turns text (or an image) into a fixed-length vector — 768 dims for many open models, 1536 for OpenAI’s text-embedding-3-small, 3072 for the large one. Semantically similar inputs land near each other in that space, so “find relevant docs” becomes “find the nearest vectors to the query vector.” Nearness is a metric: cosine similarity (angle), dot/inner product, or L2 (Euclidean) distance. For normalized embeddings cosine and dot rank identically, which is why people use inner product as the cheaper equivalent.

The naive answer is exact k-nearest-neighbor: compare the query to every stored vector and keep the top k. That is O(N·d) per query — at 10M vectors × 1536 dims that is ~15 billion multiply-adds for a single search. Fine for 50k rows, fatal for 50M. This is why a brute-force ORDER BY embedding <-> $1 LIMIT 10 with no index is correct but unshippable past a few hundred thousand rows.

Search = nearness. The query is embedded into the same space as the documents; the k closest vectors by cosine or dot distance are returned. Far vectors are irrelevant. ANN approximates this top-k to avoid scanning every point.

ANN: trade recall for speed

Why should you care about recall before touching any other knob? Because when recall falls, nothing in your monitoring tells you — no error, no alert, no slow query. You only notice when a user reports that the bot “can’t find” something it obviously has.

Approximate nearest neighbor (ANN) indexes give up the guarantee of finding the true top-k in exchange for sub-linear search. The quality metric is recall@k: of the true top-10, how many did the index actually return? Recall 1.0 means perfect; 0.9 means one of every ten “best” results is missing. The senior insight is that recall is a tuning dial, not a fixed property — and crucially, a query at recall 0.3 looks identical to a query at recall 1.0: same row count, same latency profile, same lack of errors. You only catch it by measuring recall against an exact baseline.

The dominant index is HNSW (Hierarchical Navigable Small World). It builds a layered graph: sparse long-range links at the top for fast jumps across the space, dense short links at the bottom for precision. A query enters at the top, greedily walks toward the query vector, and descends. Three knobs govern the triangle:

M — links per node (pgvector default 16). Higher M → better recall and a bigger, denser graph. The HNSW graph itself runs 2–5× the size of the raw vectors in RAM, and M is the main lever on that.
ef_construction — candidate pool size while building (default 64). Higher → better graph quality and recall, but build time scales roughly linearly; 400 vs 200 is about 2× slower to build.
ef_search — candidate pool size at query time (default 40 in pgvector). This is the runtime recall/latency dial. Raising it from 100 to 500 might move recall from ~85% to ~98% at the cost of going from ~1ms to ~5ms per query.

Together, these three knobs define the recall–latency–memory triangle: M sets the memory budget, ef_construction locks in graph quality at build time, and ef_search is the only dial you can turn in production without rebuilding. When you see unexplained answer quality regressions, ef_search is where to look first.

Index	Memory	Recall	Build / writes	Pick when
`HNSW`	High (2–5× raw vectors)	Highest, easiest to hit	Slow build, cheap incremental inserts	Default for RAG; data changes; recall matters
`IVFFlat`	~1× raw vectors	Good, drifts as data changes	Fast build, needs data present first	Big static-ish set, RAM tight
`IVF-PQ`	Lowest (compress ~7×)	Lower (~70%), recoverable	Build + train codebook	100M+ vectors won’t fit in RAM

IVF and IVF-PQ: cluster, then quantize

The other family is IVF (inverted file). It k-means-clusters all vectors into lists buckets; a query only scans the probes nearest buckets instead of everything. In pgvector that is ivfflat with lists at build time and ivfflat.probes at query time (default 1 — far too low for good recall). IVF’s memory is close to 1× the raw vectors, but it has a sharp edge: the clusters are fixed at build, so as you insert new data the centroids go stale and recall drifts until you reindex. HNSW has no such drift — graph inserts are constant-cost and keep quality.

IVF-PQ adds product quantization: split each vector into sub-vectors and replace each with the id of its nearest centroid from a small codebook, compressing a 1536-dim float32 vector (6 KB) down to a few dozen bytes. NVIDIA showed 360 GiB of FP32 vectors compressed to 54 GiB while holding >95% recall on a billion vectors. The cost is recall loss from quantization error — IVF-PQ alone often lands near 70% of HNSW’s recall and needs a rerank step (re-score the top candidates with full-precision distances) to recover. PQ is the tool when 100M+ vectors simply will not fit in RAM, not a default.

▸Why this works

The “curse of dimensionality” is why this is hard at all. In 1536 dimensions almost every pair of random points is roughly equidistant — distances concentrate, so the contrast between the nearest neighbor and the 100th gets thin. That is exactly what makes recall fragile: a slightly-too-greedy graph walk or a too-coarse quantizer can’t tell the real neighbor from a near-miss, and approximate methods give up the guarantee precisely because exact separation costs O(N·d).

Filtering and hybrid search: where prod breaks

Two production traps. First, metadata filtering interacts badly with ANN. pgvector does post-filtering: it walks the HNSW graph for ef_search candidates, then applies your WHERE tenant_id = ?. If the filter keeps 1% of rows and ef_search is 40, you might end up with almost nothing — the graph navigated a tiny, near-disconnected subgraph and recall collapses. pgvector 0.8.0 added hnsw.iterative_scan (strict_order / relaxed_order) to keep pulling candidates until the filter is satisfied, which trades latency back for the recall you lost.

Second, pure vector search misses exact matches. A query for an error code, a SKU, or a function name is a lexical need, and embeddings smear it into “semantically near” — so panic: nil pointer can rank below a vaguely-related paragraph. The fix is hybrid search: run BM25 (keyword) and vector in parallel and fuse with Reciprocal Rank Fusion (RRF = Σ 1/(k + rank), k≈60), which combines by rank position so the two incompatible score scales never have to be compared. Teams report jumping from ~65–78% to ~91% recall@10 by adding the keyword leg.

Pick the best fit

A RAG service over ~5M chunks gets continuous ingestion, needs high recall, and has RAM to spare. Pick the index strategy.

Quiz

Your HNSW-backed search returns ten ranked rows in 2ms, no errors — but answers are subtly wrong. What's the first thing to check?

Quiz

Users search for the exact error string 'ECONNREFUSED' and pure vector search buries the matching doc. Best fix?

Order the steps

Order how an HNSW query finds nearest neighbors and how you'd tune it:

1 Embed the query into the same vector space as the stored documents
2 Enter the HNSW graph at the top, sparse layer for fast long-range jumps
3 Greedily walk toward the query vector, descending into denser layers
4 Collect ef_search candidates at the bottom layer and return the top k by distance
5 Measure recall@k vs an exact baseline; raise ef_search if recall is too low

ef_search is the only recall dial you can turn in production without rebuilding the index — but its default of 40 is dangerously low under selective metadata filters. Raising it from 40 to 400 can move recall from ~30% to ~95% at the cost of ~10ms extra latency per query.

Recall before you leave

01
Why is a low-recall ANN index so dangerous in production, and how do you actually detect it?
02
When would you reach for IVF-PQ over HNSW, and what's the cost you accept?

Recap

Vector search turns “find relevant content” into “find the nearest embeddings,” ranked by cosine, dot, or L2 distance. Exact kNN is correct but O(N·d), so past a few hundred thousand rows you switch to approximate nearest neighbor and accept a recall-for-speed trade. HNSW is the default: a layered graph tuned by M and ef_construction at build and ef_search at query time, costing 2–5× the raw vectors in RAM but delivering the highest recall with the least drift as data changes. IVF clusters and IVF-PQ compresses — the latter is for corpora too big for RAM, paying recall (≈70% of HNSW) you recover with reranking. The senior discipline is the triangle of recall, latency, and memory, and one habit above all: recall drops are silent — identical row counts and latency — so you measure recall@k against an exact baseline, raise ef_search deliberately, watch out for selective metadata filters that post-filtering destroys (use iterative scan), and add hybrid BM25 + vector with rank fusion whenever exact matches matter. Now when you see a RAG system returning confident answers that feel wrong, the first question to ask is not “is the model good enough?” but “what is our recall@k against the exact baseline?” — because the index may be silently returning the wrong ten rows.

Practice

Start at the top. Tasks go easiest → hardest: recall a fact, apply it to a case, then a senior-level stretch. Open one, attempt it, then reveal.

recallapplystretch0 of 5 done

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.