What does an inverted index do in a search engine?
An inverted index is the core data structure of full-text search. Instead of scanning every document for a query term (slow), it precomputes a mapping from term → list of documents containing it (a "postings list"). To answer a query, the engine looks up each term and intersects/unions the postings lists. It is called "inverted" because it inverts the natural document→terms relationship into terms→documents. Lucene, Elasticsearch, and most search systems are built on this structure, often augmented with term frequencies and positions for ranking and phrase queries.
2 / 5
What does TF-IDF measure?
TF-IDF scores how important a term is to a specific document. Term Frequency rewards terms that appear often in the document. Inverse Document Frequency down-weights terms that appear in many documents (like "the" or "data") because they are not distinctive. Multiplying them gives high scores to terms that are frequent here but rare elsewhere — exactly the terms that characterize a document. It is a foundational relevance signal, though modern engines build on its successor BM25 and increasingly combine it with semantic vector signals.
3 / 5
What is BM25 and why is it preferred over plain TF-IDF?
BM25 (Best Matching 25) is the de facto standard lexical ranking function. It improves on TF-IDF in two key ways: saturation — additional occurrences of a term yield diminishing returns (the 50th occurrence does not matter as much as the 5th), and length normalization — it accounts for document length so long documents do not unfairly score higher just by containing more words. Tunable parameters (k1, b) control these effects. It is the default scoring in Elasticsearch/Lucene and the strong baseline that semantic search is measured against.
4 / 5
What is the difference between dense (vector) retrieval and sparse (lexical) retrieval?
Sparse (lexical) retrieval — BM25, keyword matching — represents text as high-dimensional sparse vectors of term weights and matches on shared terms. It is precise for exact terms but misses synonyms and paraphrases ("car" vs "automobile"). Dense retrieval embeds queries and documents into low-dimensional vectors via a neural model, then finds nearest neighbors by similarity (e.g. cosine). It captures semantic meaning, matching related concepts without shared words, but can miss exact terms/rare entities. Hybrid search combines both for the best of each.
5 / 5
What is a "two-stage retrieval" (retrieve-then-rerank) architecture?
Two-stage retrieval balances speed and quality. The first stage (recall) uses a cheap method — BM25 or approximate nearest-neighbor vector search — to quickly narrow millions of documents to a few hundred candidates. The second stage (rerank) applies a more expensive, more accurate model (often a cross-encoder that jointly reads query and document) to reorder just those candidates. This is efficient because the expensive model only scores a small set. It is the standard architecture for high-quality search and RAG retrieval pipelines.