Data Engineering DATA · 06 · 09

Full-text search: code and query reading

Read real index-build code, an analyzer pipeline, a BM25 scoring sketch, and a Postgres tsvector query, then predict the behaviour and pick the highest-leverage fix.

DATA Senior ◷ 14 min

Level

FoundationsJuniorMiddleSenior

Search bugs live in the gap between what you think the tokens are and what the engine actually stored. Read each snippet, predict the tokens or the score, and choose the fix a senior would reach for first.

Goal

Practise the loop you run in every search incident: trace text through the analyzer, picture the posting lists it produces, reason about how BM25 will order the candidates, and read a Postgres tsvector query well enough to know whether it can even use the index.

Snippet 1 — building the inverted index

index = {}                      # term -> sorted list of doc ids
def add(doc_id, text):
    for tok in analyze(text):   # tokenize + lowercase + stem
        index.setdefault(tok, [])
        index[tok].append(doc_id)   # append, no dedup, no sort

def search(q):
    lists = [index.get(t, []) for t in analyze(q)]
    return set.intersection(*map(set, lists))   # AND of terms

Quiz

A document that repeats a term (e.g. 'run run run') is added once. What is wrong with the resulting posting list, and what is the highest-leverage fix?

Snippet 2 — the analyzer at index time vs query time

# index time
def index_analyzer(text):
    return [stem(t) for t in lower(tokenize(text)) if t not in STOPWORDS]

# query time (a different service, written later)
def query_analyzer(text):
    return [t for t in lower(tokenize(text))]   # no stopword drop, no stem

Quiz

Documents indexed with index_analyzer; queries run through query_analyzer. A user searches 'running shoes'. What happens, and what is the fix?

Snippet 3 — a BM25 term-frequency sketch

def bm25_tf(tf, doc_len, avg_len, k1=1.2, b=0.75):
    # saturating term frequency with length normalization
    denom = tf + k1 * (1 - b + b * doc_len / avg_len)
    return tf * (k1 + 1) / denom

Quiz

A spam doc repeats a term 200 times; a clean doc has it 3 times at the same length. Reading this formula, which statement is correct?

Snippet 4 — a Postgres full-text query

-- column: body text;  index: CREATE INDEX ON docs USING gin(to_tsvector('english', body));
SELECT id, ts_rank(to_tsvector('english', body), q) AS rank
FROM docs, plainto_tsquery('english', 'running shoes') q
WHERE to_tsvector('english', body) @@ q
ORDER BY rank DESC LIMIT 10;

Quiz

This query is correct but does a sequential scan on a large table despite the GIN index existing. Why, and what is the fix?

Recap

Every search bug reads back to tokens and access paths: a posting list must dedup doc ids and carry a term-frequency count, not append blindly; index-time and query-time analyzers must be byte-for-byte the same or matches silently vanish; BM25’s tf saturates toward k1+1 while b normalizes length, so spam cannot dominate; and a Postgres FTS query only uses its GIN index when the query expression matches the indexed expression — store a generated tsvector column rather than recomputing per row. Trace the tokens, picture the posting lists, then fix the structure before you touch a scoring knob.

Something unclear?

Ask a question about this lesson. Questions are anonymous and go straight to the author to make the lesson better.