Crux Read real index-build code, an analyzer pipeline, a BM25 scoring sketch, and a Postgres tsvector query, then predict the behaviour and pick the highest-leverage fix.
Search bugs live in the gap between what you think the tokens are and what the engine actually stored. Read each snippet, predict the tokens or the score, and choose the fix a senior would reach for first.
Goal
Practise the loop you run in every search incident: trace text through the analyzer, picture the posting lists it produces, reason about how BM25 will order the candidates, and read a Postgres tsvector query well enough to know whether it can even use the index.
Snippet 1 — building the inverted index
index = {}  # term -> sorted list of doc ids

def add(doc_id, text):
    for tok in analyze(text):  # tokenize + lowercase + stem
        index.setdefault(tok, [])
        index[tok].append(doc_id)  # append, no dedup, no sort

def search(q):
    lists = [index.get(t, []) for t in analyze(q)]
    return set.intersection(*map(set, lists))  # AND of terms
Quiz
A document that repeats a term (e.g. 'run run run') is added once. What is wrong with the resulting posting list, and what is the highest-leverage fix?
Heads-up The search-time set cast hides the duplicates for matching, but the stored list is still bloated, and the term frequency, which is exactly what a real ranker needs, survives only as an implicit count of repeated ids. Fix the structure, not just the read path.
Heads-up Lowercasing is correct and deliberate; it is what lets 'Run' and 'run' match. The defect is the duplicate-id append with no term-frequency count.
Heads-up AND vs OR is a product decision about match strictness, not a bug. The actual defect is the posting-list construction storing duplicates and dropping the tf count.
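One way to read "fix the structure" is to store each posting as a doc id with its term frequency, so a repeated term increments a count instead of appending another id. The sketch below is a minimal illustration under that assumption, not the lesson's reference answer. The `analyze` stand-in (lowercase plus whitespace split, no stemming) is hypothetical and exists only to keep the example self-contained.

```python
from collections import defaultdict

# term -> {doc_id: term frequency}; a dict keyed by doc id cannot hold duplicates
index = defaultdict(dict)

def analyze(text):
    # Hypothetical stand-in analyzer: lowercase + whitespace split, no stemming.
    return text.lower().split()

def add(doc_id, text):
    for tok in analyze(text):
        postings = index[tok]
        postings[doc_id] = postings.get(doc_id, 0) + 1  # count, do not append

def search(q):
    terms = analyze(q)
    if not terms:
        return set()  # set.intersection needs at least one set
    # index.get (not index[t]) so an unknown term does not create an empty entry
    doc_sets = [set(index.get(t, {})) for t in terms]
    return set.intersection(*doc_sets)  # AND of terms

add(1, "run run run")
add(2, "Run home")
print(index["run"])        # {1: 3, 2: 1}
print(search("run home"))  # {2}
```

The per-document count is what a BM25-style scorer would later read as tf. A production index would typically also keep postings ordered by doc id so that lists can be intersected by merging rather than by building sets.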
Snippet 2 — the analyzer at index time vs query time
# index time
def index_analyzer(text):
    return [stem(t) for t in lower(tokenize(text)) if t not in STOPWORDS]

# query time (a different service, written later)
def query_analyzer(text):
    return [t for t in lower(tokenize(text))]  # no stopword drop, no stem
Quiz
Documents indexed with index_analyzer; queries run through query_analyzer. A user searches 'running shoes'. What happens, and what is the fix?
Heads-up Shared tokenize/lowercase is not enough — the index stemmed and dropped stopwords while the query did neither, so the surviving tokens differ and never match. Both stages must be identical, not just overlapping.
Heads-up Keeping stopwords here does not over-match; the real outcome is under-match (zero results), because the content tokens were stemmed at index time and not at query time, so they never line up.
Heads-up b normalizes document length for scoring matched docs; it cannot rescue a tokenization mismatch that produces no match at all. The fix is analyzer parity, not a scoring knob.
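Analyzer parity can be sketched as a single `analyze` function that both the indexer and the query service import, so the two paths cannot drift apart. Everything below is illustrative: `toy_stem` is a hypothetical stand-in that handles only the suffixes in this example, and the stopword list is made up for the demo.

```python
STOPWORDS = {"the", "a", "an", "for", "of"}  # illustrative list, not from the lesson

def toy_stem(tok):
    # Hypothetical stand-in for a real stemmer: covers only the suffixes used below.
    if tok.endswith("ning"):
        return tok[:-4]  # "running" -> "run"
    if tok.endswith("s") and len(tok) > 3:
        return tok[:-1]  # "shoes" -> "shoe"
    return tok

def analyze(text):
    # The single shared analyzer: used at index time and at query time.
    return [toy_stem(t) for t in text.lower().split() if t not in STOPWORDS]

def raw_query_analyzer(text):
    # Mirrors the lesson's query_analyzer: no stopword drop, no stem.
    return text.lower().split()

indexed = set(analyze("Running shoes for the trail"))
print(indexed & set(raw_query_analyzer("running shoes")))  # set()
print(indexed & set(analyze("running shoes")))             # contains 'run' and 'shoe'
```

The first intersection is empty because the index holds 'run' and 'shoe' while the unstemmed query asks for 'running' and 'shoes'. Routing the query through the same function restores both matches.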
Snippet 3 — a BM25 term-frequency sketch
def bm25_tf(tf, doc_len, avg_len, k1=1.2, b=0.75):
    # saturating term frequency with length normalization
    denom = tf + k1 * (1 - b + b * doc_len / avg_len)
    return tf * (k1 + 1) / denom
Quiz
A spam doc repeats a term 200 times; a clean doc has it 3 times at the same length. Reading this formula, which statement is correct?
Heads-up tf is in the numerator and the denominator, so the ratio saturates toward k1+1 rather than growing linearly. That non-linear saturation is the entire point of BM25.
Heads-up b=0 removes length normalization entirely, so document length stops affecting the score — it neither boosts nor penalizes long docs. b=1 applies full length normalization.
Heads-up k1 controls how fast term frequency saturates; b controls length normalization. The two knobs are distinct, and conflating them is a classic tuning error.
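Plugging the quiz's numbers into the formula makes the saturation concrete. The sketch below reuses the snippet's function unchanged. The document lengths (100 tokens against an average of 100) are assumed values chosen so that the length term drops out.

```python
def bm25_tf(tf, doc_len, avg_len, k1=1.2, b=0.75):
    # Same formula as the lesson's sketch.
    denom = tf + k1 * (1 - b + b * doc_len / avg_len)
    return tf * (k1 + 1) / denom

# Same length for both docs, so only term frequency differs.
clean = bm25_tf(3, doc_len=100, avg_len=100)
spam = bm25_tf(200, doc_len=100, avg_len=100)
print(round(clean, 3), round(spam, 3))  # 1.571 2.187

# With b=0 the length term vanishes: short and long docs score the same.
short = bm25_tf(3, doc_len=10, avg_len=100, b=0)
long_ = bm25_tf(3, doc_len=1000, avg_len=100, b=0)
print(short == long_)  # True
```

Roughly 67 times more repetitions buy the spam doc a tf component only about 1.4 times larger, and it can never exceed k1 + 1 = 2.2.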
Snippet 4 — a Postgres full-text query
-- column: body text; index: CREATE INDEX ON docs USING gin(to_tsvector('english', body));
SELECT id, ts_rank(to_tsvector('english', body), q) AS rank
FROM docs, plainto_tsquery('english', 'running shoes') q
WHERE to_tsvector('english', body) @@ q
ORDER BY rank DESC LIMIT 10;
Quiz
This query is correct but does a sequential scan on a large table despite the GIN index existing. Why, and what is the fix?
Heads-up GIN fully supports the @@ match operator and is the faster choice for read-heavy text. The issue is the index not being matched/used here, not an operator incompatibility.
Heads-up ts_rank runs only on the matched rows during ordering; it does not dictate the access path. The scan comes from the WHERE expression not lining up with a stored, indexed tsvector.
Heads-up plainto_tsquery is a valid way to build the query tsquery and does not disable the index. The recompute-per-row to_tsvector in WHERE is what keeps the planner from using the GIN index efficiently.
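The stored-tsvector fix might look like the sketch below, written as Python helpers around the SQL so it can be driven from any DB-API cursor. The column name `body_tsv`, the index name `docs_body_tsv_idx`, and the helper names are hypothetical. The `%s` placeholder assumes a driver with psycopg-style parameters, and generated columns assume PostgreSQL 12 or later.

```python
# Store the tsvector once per row instead of recomputing it in every query.
ADD_TSV_COLUMN_SQL = """
ALTER TABLE docs
    ADD COLUMN body_tsv tsvector
    GENERATED ALWAYS AS (to_tsvector('english', body)) STORED
"""

# Index the stored column directly (hypothetical index name).
CREATE_TSV_INDEX_SQL = "CREATE INDEX docs_body_tsv_idx ON docs USING gin (body_tsv)"

# Both the match and the rank read the stored column.
SEARCH_SQL = """
SELECT id, ts_rank(body_tsv, q) AS rank
FROM docs, plainto_tsquery('english', %s) AS q
WHERE body_tsv @@ q
ORDER BY rank DESC
LIMIT 10
"""

def migrate(cursor):
    # cursor: any DB-API cursor connected to the database that holds docs
    cursor.execute(ADD_TSV_COLUMN_SQL)
    cursor.execute(CREATE_TSV_INDEX_SQL)

def search(cursor, text):
    # The search text is passed as a bound parameter, never interpolated.
    cursor.execute(SEARCH_SQL, (text,))
    return cursor.fetchall()
```

It is worth confirming the access path with EXPLAIN before and after the change, since the planner can also use an expression index when the WHERE expression matches the indexed expression exactly. The stored column additionally spares ts_rank from re-running to_tsvector on every matched row.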
Recap
Every search bug reads back to tokens and access paths: a posting list must dedup doc ids and carry a term-frequency count, not append blindly; index-time and query-time analyzers must be byte-for-byte the same or matches silently vanish; BM25’s tf saturates toward k1+1 while b normalizes length, so spam cannot dominate; and a Postgres FTS query only uses its GIN index when the query expression matches the indexed expression — store a generated tsvector column rather than recomputing per row. Trace the tokens, picture the posting lists, then fix the structure before you touch a scoring knob.