Data Engineering DATA · 07 · 09

Vector search: code and query reading

Read real pgvector DDL and queries — predict the recall/latency behaviour and pick the highest-leverage fix for HNSW params, distance operators, and filtered search.

DATA Senior ◷ 14 min

Level

FoundationsJuniorMiddleSenior

Vector-search bugs live in the DDL and the query, not in an exception. Read the SQL, predict the recall and latency it produces, then choose the fix a senior engineer makes first.

Goal

Practise the loop you run on every recall incident: read the index definition and the query, predict where recall leaks or latency spikes, and reach for the highest-leverage fix.

Snippet 1 — the HNSW index and the default knob

CREATE INDEX ON docs USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

-- query path, run per request
SELECT id, body FROM docs
ORDER BY embedding <=> $1
LIMIT 10;

Quiz

The index builds fine and queries are fast, but recall@10 measures only ~80%. What is the highest-leverage fix?

Snippet 2 — the distance operator vs the opclass

-- embeddings stored from a model trained on cosine similarity
CREATE INDEX ON items USING hnsw (embedding vector_l2_ops);

SELECT id FROM items
ORDER BY embedding <=> $1   -- cosine operator
LIMIT 10;

Quiz

The query is slow and recall is poor. What is wrong here?

Snippet 3 — IVFFlat probes

CREATE INDEX ON docs USING ivfflat (embedding vector_cosine_ops)
  WITH (lists = 1000);

-- query
SELECT id FROM docs ORDER BY embedding <=> $1 LIMIT 10;
-- ivfflat.probes left at its default

Quiz

With lists = 1000 and probes left at the default, recall is far below expectation. Why, and what is the first move?

Snippet 4 — measuring recall against ground truth

-- exact ground truth (no index hint, brute force)
SELECT id FROM docs ORDER BY embedding <=> $1 LIMIT 10;  -- with index disabled

-- ANN result
SET hnsw.ef_search = 100;
SELECT id FROM docs ORDER BY embedding <=> $1 LIMIT 10;   -- with HNSW
-- recall@10 = |exact_ids ∩ ann_ids| / 10

Quiz

A teammate computes recall by checking the ANN result's average distance against a fixed threshold instead of the overlap with an exact top-10. Why is that wrong?

Recap

Every recall incident is read in the DDL and the query: ef_search defaults to 40 and is the runtime recall dial; the query’s distance operator must match the index opclass or the planner falls back to a full scan; IVFFlat’s probes defaults to 1 and scans a single cluster until you raise it; and recall@k is the ID-set overlap against an exact baseline, never a distance threshold. Read the SQL, find the leak, turn the cheapest dial first, then re-measure recall to confirm.

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.