Advanced 7 terms

Search Engineering

Elasticsearch vocabulary: inverted index, shards, query DSL, relevance scoring (BM25), analyzers, dense vector search, and operational concepts.

  • Inverted Index /ɪnˈvɜːtɪd ˈɪndɛks/

    A data structure mapping terms to the documents that contain them — the core of full-text search. For each unique term in the corpus, the index stores the list of document IDs (and optionally positions and frequencies). Enables fast lookup of “which documents contain this term.”

    "The query for ‘microservices resilience’ hits the inverted index: look up ‘microservices’ → {doc1, doc5, doc12}, look up ‘resilience’ → {doc2, doc5, doc8}. The intersection is {doc5}, then BM25 scores it. Building the inverted index takes time but makes queries millisecond-fast even over millions of documents."
  • Shard /ʃɑːd/

    A horizontal partition of an Elasticsearch index. Each shard is a self-contained Lucene index. Shards enable parallel indexing and querying across nodes. Primary shards store the data; replica shards are copies for fault tolerance.

    "We set the index to 5 primary shards. With 5 nodes, each node holds one primary shard and one replica of a different shard. Queries fan out to all shards in parallel, results are merged by the coordinating node. After initial creation, the number of primary shards cannot be changed — over-provision slightly to avoid reindexing later."
  • BM25 (Best Match 25) /biː ɛm ˈtwɛnti faɪv/

    The default relevance scoring algorithm in Elasticsearch (replacing TF-IDF). Scores documents based on term frequency (with diminishing returns for high repetition), inverse document frequency (rare terms score higher), and document length normalisation.

    "BM25 scores the results: the document mentioning ‘kubernetes operator pattern’ in the title scores higher than one mentioning ‘kubernetes’ 50 times in the body — BM25’s term saturation prevents keyword stuffing from gaming relevance. We boosted the title field ^2 to give title matches more weight."
  • Query DSL /ˈkwɪəri diː ɛs ɛl/

    Elasticsearch’s JSON-based query language for constructing complex searches. The bool query is the building block: must (every clause required, contributes to the score), should (optional, raises the score when it matches), must_not (exclusion, no scoring), filter (required yes/no match, no scoring, cacheable).

    "The search query uses a bool: must contains a multi_match across title and content, filter ensures status:published and date range, should adds a boost for premium content. Filters are cached — the status and date filter runs fast on every query. The must clause does the scoring."
  • Analyzer /ˈænəlaɪzər/

    A text processing pipeline in Elasticsearch applied at index and query time. Consists of zero or more character filters (clean raw text), exactly one tokenizer (split into tokens), and zero or more token filters (lowercase, stop words, stemming, synonyms). Determines how text is indexed and how queries are parsed.

    "The English analyzer lowercases, removes stop words (‘the’, ‘a’, ‘is’), and stems tokens (‘running’ → ‘run’). At query time, the same analyzer is applied to search terms. This means a query for ‘running applications’ matches documents containing ‘run application’. Mismatch between index and query analyzers is a common relevance bug."
  • Dense Vector Search / kNN /dɛns ˈvɛktər sɜːtʃ / keɪ ɛn ɛn/

    Searching by semantic similarity using embedding vectors. Each document is converted to a high-dimensional vector by an embedding model; queries are also vectorised. kNN (k-nearest neighbours) finds the k most similar documents by vector similarity (cosine, dot product) or distance (Euclidean).

    "We added semantic search using dense vector search. Each knowledge base article is embedded with a sentence transformer model and indexed as a dense_vector field. When a user searches, we embed the query and run a kNN search — documents about ‘pod eviction’ surface when someone searches ‘kubernetes deleted my pod,’ even without exact keyword overlap."
  • Index Lifecycle Management (ILM) /ˈɪndɛks ˈlaɪfsaɪkl ˈmænɪdʒmənt/

    Automated management of time-series indices through lifecycle phases: hot (active write/read), warm (read-only, optimised), cold (infrequent access, compressed), frozen (near-zero cost), delete. Used for logs, metrics, traces with time-based retention.

    "Our log index ILM policy: hot phase for 7 days (primary + 1 replica, frequent queries), warm after 7 days (read-only, merge to 1 segment per shard, reduce replicas), cold after 30 days (smaller allocation), delete after 90 days. Without ILM, old indices would consume expensive hot tier storage indefinitely."