Elasticsearch Vocabulary: 30 Terms for Search Engineers
Index, mapping, shards, relevance scoring, Query DSL, and other Elasticsearch vocabulary for search and data engineers.
Elasticsearch powers search at scale — from e-commerce product catalogues to log analytics pipelines. If you work with it daily, you already know the commands. But when you join an English-speaking team, stand-up calls and code reviews introduce a layer of jargon that can slow you down. This guide covers the 30 terms you will hear most often, with plain-English definitions and real developer dialogue so you can use them confidently in conversation.
Core Terms
Index — In Elasticsearch, an index is a collection of related documents, roughly analogous to a database in a relational system. You store, search, and manage data at the index level.
“We’re splitting the logs index by month so the hot tier doesn’t fill up so fast.”
“Before you run that query, make sure you’re targeting the right index — there’s a legacy one still sitting there from last year.”
Document — A document is the basic unit of data in Elasticsearch, stored as JSON. Every document belongs to an index and has a unique _id.
“The document structure changed in the last sprint: the user_id field is now nested under metadata.”
“We’re indexing around two million documents a day, so mapping efficiency really matters.”
Shard — Elasticsearch divides an index into smaller pieces called shards. Each shard is a self-contained Lucene index. Sharding lets Elasticsearch distribute data across multiple nodes and parallelise queries.
“We over-sharded that index at the start — twenty shards for five gigabytes is overkill.”
“The query is slow because it fans out to all twelve shards. We need to look at routing.”
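The routing mentioned in the last quote can be pictured as a hash-and-modulo step: a routing value (the document _id by default) is hashed and mapped onto one of the primary shards. The sketch below is a toy model only. The function name pick_shard is hypothetical, and zlib.crc32 is a stand-in hash chosen for determinism; Elasticsearch uses its own hash function and additional routing parameters, so these shard numbers will not match a real cluster.

```python
import zlib

def pick_shard(routing_value: str, number_of_primary_shards: int) -> int:
    """Toy routing: hash the routing value, then take it modulo the shard count."""
    # zlib.crc32 is a stand-in hash (an assumption), not what Elasticsearch uses.
    return zlib.crc32(routing_value.encode("utf-8")) % number_of_primary_shards

# Documents that share a routing value always map to the same shard,
# so a search that supplies that routing value only needs to visit one shard.
shard_a = pick_shard("customer-42", 12)
shard_b = pick_shard("customer-42", 12)
```

This is also why the number of primary shards is hard to change after the fact: a different modulus would send existing routing values to different shards.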
Replica shard — A replica shard is an exact copy of a primary shard. Replicas serve two purposes: high availability (if a node goes down, replicas take over) and read throughput (search requests can be served from any copy).
“Bump the replica count to two before we go live — I don’t want a single-node failure taking down search.”
“Replica shards aren’t helping with indexing speed; they only help with reads and fault tolerance.”
Mapping — A mapping defines the schema for documents in an index: field names, data types (keyword, text, date, integer, etc.), and how each field should be analysed. Think of it as the typed schema that Elasticsearch uses to index and query your data.
“The mapping for that field is text, but we need exact matches, so change it to keyword.”
“You can’t change an existing field’s mapping in place. You’ll need to reindex.”
Dynamic mapping vs explicit mapping — With dynamic mapping, Elasticsearch automatically detects and creates field types when you first index a document. With explicit mapping, you define the schema yourself before indexing. Dynamic mapping is convenient for prototyping but can create surprises in production; explicit mapping gives you precise control over field behaviour.
“We turned off dynamic mapping for this index. We had too many accidental float fields being created from string data.”
“Explicit mapping is a bit more upfront work, but it saves you from weird relevance scoring issues later.”
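As a rough illustration of explicit mapping, here is what an index-creation body might look like when written as a Python dictionary. The field names are hypothetical. The "dynamic": "strict" setting rejects documents that contain unmapped fields; setting it to false would instead accept them without indexing the extra fields.

```python
import json

# Hypothetical index body: the field names are illustrative, not from a real schema.
create_index_body = {
    "mappings": {
        "dynamic": "strict",  # reject documents that contain unmapped fields
        "properties": {
            "status": {"type": "keyword"},   # exact matches, sorting, aggregations
            "title": {"type": "text"},       # analysed for full-text search
            "created_at": {"type": "date"},
            "price": {"type": "float"},
        },
    }
}

# Serialise to the JSON that would be sent as the request body.
request_json = json.dumps(create_index_body, indent=2)
```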
Indexing and Storage Internals
Inverted index — The inverted index is the core data structure that makes full-text search fast. Instead of storing documents and scanning them, Elasticsearch builds a lookup table from every unique term to the list of documents that contain it. When you search for “optimise”, Elasticsearch looks up that term in the inverted index and retrieves matching document IDs instantly.
“The reason keyword search is so fast is the inverted index — it’s not scanning row by row like a database would.”
“Stop words are removed during analysis so they don’t bloat the inverted index unnecessarily.”
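The lookup-table idea can be sketched in a few lines of Python. This is a toy model under simplifying assumptions (lowercasing and whitespace splitting stand in for real analysis), not how Lucene stores its postings, but it shows why a term lookup avoids scanning every document.

```python
from collections import defaultdict

def build_inverted_index(docs: dict[int, str]) -> dict[str, list[int]]:
    """Map each lowercased, whitespace-separated term to the sorted IDs of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

docs = {
    1: "optimise the slow query",
    2: "the query cache",
    3: "optimise shard count",
}
index = build_inverted_index(docs)
matches = index.get("optimise", [])  # one dictionary lookup, no document scan
```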
_source field — The _source field stores the original JSON document as it was indexed. When you retrieve a document, Elasticsearch returns _source by default. You can disable or filter it to save storage, but doing so limits what you can fetch back without reindexing.
“We disabled _source on the metrics index to cut storage in half, but now we can’t use update-by-query.”
“You can use _source filtering in your request to return only the fields you actually need.”
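A minimal sketch of _source filtering, written as a Python dictionary that would be serialised as the search request body. The field names user_id and timestamp are hypothetical.

```python
# Hypothetical search body: "user_id" and "timestamp" are illustrative field names.
search_body = {
    "_source": ["user_id", "timestamp"],  # return only these fields from each hit
    "query": {"match_all": {}},
}
```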
Refresh interval — Elasticsearch writes new documents to an in-memory buffer and periodically makes them visible to search via a process called a refresh. The refresh interval (default: one second) controls how often this happens. For high-throughput ingestion, increasing the interval reduces overhead; for near-real-time search, keep it low.
“Set the refresh interval to thirty seconds during the bulk load, then drop it back to one second when you’re done.”
“The data appears in the index but isn’t searchable yet — it’s still waiting on the refresh.”
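The bulk-load pattern from the first quote might be expressed as two index-settings bodies, one applied before the load and one after. The helper name refresh_settings is hypothetical; the dictionaries are a sketch of what would be sent to the index settings endpoint.

```python
def refresh_settings(interval: str) -> dict:
    """Build an index-settings body that sets refresh_interval (for example "30s" or "1s")."""
    return {"index": {"refresh_interval": interval}}

during_bulk_load = refresh_settings("30s")  # fewer refreshes while ingesting heavily
after_bulk_load = refresh_settings("1s")    # back to near-real-time search
```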
Index Lifecycle Management (ILM) — ILM is Elasticsearch’s built-in policy engine for managing indices over time. You define phases — hot, warm, cold, frozen, delete — and Elasticsearch automatically moves indices between them, shrinking shards, force-merging segments, or deleting old data according to your rules.
“We’ve got an ILM policy that rolls over the index when it hits fifty gigabytes or thirty days, whichever comes first.”
“Without ILM, someone has to manually clean up old indices. It’s the kind of thing that gets forgotten until a disk fills up at 3 a.m.”
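A policy like the one in the first quote could look roughly like this as a Python dictionary. The rollover thresholds come from the quote; the ninety-day delete phase is an illustrative assumption, and a real policy would usually add warm or cold phases tuned to the cluster.

```python
# Sketch of an ILM policy body. The delete phase timing is an illustrative assumption.
ilm_policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    # Roll over when either condition is met, whichever comes first.
                    "rollover": {"max_size": "50gb", "max_age": "30d"}
                }
            },
            "delete": {
                "min_age": "90d",           # measured from rollover
                "actions": {"delete": {}},
            },
        }
    }
}
```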
Querying and Relevance
Relevance score (BM25) — When you run a full-text query, Elasticsearch ranks results by relevance score. The default scoring algorithm is BM25 (Best Match 25), which considers term frequency, inverse document frequency, and field length. Higher scores mean closer matches.
“The top results look wrong — BM25 is boosting short documents too heavily because of field length normalisation.”
“You can explain a query to see how BM25 calculated each document’s score. Really useful for debugging ranking issues.”
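To see why short fields tend to score higher, here is a sketch of the textbook BM25 formula for a single query term. The function name and the sample numbers are made up for illustration; k1 = 1.2 and b = 0.75 are the usual defaults. Lucene's implementation may differ in constant factors, so treat the absolute values as indicative only and compare them with real explain output.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, docs_with_term, k1=1.2, b=0.75):
    """Score one query term against one document using the textbook BM25 formula."""
    # Inverse document frequency: rarer terms contribute more.
    idf = math.log(1 + (n_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))
    # Field length normalisation: longer-than-average fields are penalised.
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    return idf * (tf * (k1 + 1)) / (tf + k1 * length_norm)

# Same term frequency, different field lengths (all numbers are made up).
short_doc = bm25_term_score(tf=1, doc_len=5, avg_doc_len=50, n_docs=1000, docs_with_term=10)
long_doc = bm25_term_score(tf=1, doc_len=200, avg_doc_len=50, n_docs=1000, docs_with_term=10)
```

With identical term frequency, the five-token field scores higher than the two-hundred-token field, which is the effect described in the first quote.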
Query DSL — Elasticsearch’s Query DSL (Domain-Specific Language) is a JSON-based language for expressing queries. Instead of writing SQL, you compose nested JSON objects describing what you want to find and how to rank results.
“The Query DSL looks verbose at first, but once you understand the leaf and compound query pattern, it clicks.”
“Don’t build Query DSL strings by concatenation — use a client library that constructs the JSON properly.”
term query — A term query matches documents that contain an exact, unanalysed value in a field. Use it for keyword fields, IDs, statuses, and other structured data where you need a precise match.
“Use a term query for the status field, not match. You want exact string equality, not full-text analysis.”
match query — A match query runs the search string through the field’s analyser before comparing. This makes it suitable for full-text fields where you want tokenisation, stemming, and stop-word removal to apply.
“The match query is forgiving: it analyses the input the same way the field was indexed, so with a stemming analyser ‘running’ matches ‘run’.”
bool query — A bool query lets you combine multiple queries using logical clauses: must (required, affects score), should (optional, boosts score), must_not (exclusion, no score contribution), and filter (required, no score contribution).
“Wrap your filters in the filter clause of a bool query. They can be cached and don’t affect scoring, so performance is much better.”
“The should clause is what gives you that ‘nice to have’ boosting without hard-requiring the term.”
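The four bool clauses might be combined as follows. This is a sketch written as a Python dictionary, with hypothetical field names (title, status, tags, archived); it also shows a match query and several term queries side by side.

```python
# Hypothetical field names ("title", "status", "tags", "archived") for illustration.
bool_query = {
    "query": {
        "bool": {
            "must": [{"match": {"title": "wireless headphones"}}],  # analysed, affects score
            "filter": [{"term": {"status": "published"}}],          # exact match, no score
            "should": [{"term": {"tags": "featured"}}],             # optional, boosts score
            "must_not": [{"term": {"archived": True}}],             # exclusion, no score
        }
    }
}
```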
range query — A range query matches documents where a field’s value falls between specified bounds. Common with dates, prices, and numeric metrics.
“Use a range query on the created_at field to scope results to the last thirty days.”
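The thirty-day scope from the quote could be written with date math, as sketched below. Placing the range inside a filter clause is a design choice here rather than a requirement: it keeps the date restriction out of the relevance score.

```python
# Sketch of a request body; "created_at" is the field named in the quote above.
range_filter = {
    "query": {
        "bool": {
            "filter": [
                # "now-30d/d" means thirty days ago, rounded down to the start of that day.
                {"range": {"created_at": {"gte": "now-30d/d", "lte": "now"}}}
            ]
        }
    }
}
```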
Analysis and Text Processing
Analyser — An analyser is the pipeline Elasticsearch uses to process text when indexing and searching. It consists of zero or more character filters, exactly one tokenizer, and zero or more token filters. The choice of analyser determines what ends up in the inverted index.
“The default standard analyser lowercases and tokenises on word boundaries, dropping most punctuation. It’s fine for most English content.”
“We’re using the english analyser for the blog content. It handles stemming and stop words automatically.”
Tokenizer — The tokenizer is the component within an analyser that breaks a string into individual tokens (terms). The standard tokenizer splits on word boundaries and drops most punctuation; the whitespace tokenizer splits only on whitespace; ngram tokenizers produce character-level fragments for partial matching.
“We switched to an ngram tokenizer for the search-as-you-type feature so partial strings match correctly.”
“The tokenizer is just one part of the analysis chain — filters run afterwards to modify the tokens.”
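To make the character-level fragments concrete, here is a toy n-gram generator in Python. It is a simplified stand-in for the ngram tokenizer (the function name and parameters are illustrative), but it shows why partial strings can match: each fragment becomes its own searchable term.

```python
def ngrams(text: str, min_gram: int, max_gram: int) -> list[str]:
    """Return every substring of text whose length is between min_gram and max_gram."""
    grams = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(text):
                grams.append(text[start:start + size])
    return grams

fragments = ngrams("kube", 2, 3)  # ['ku', 'kub', 'ub', 'ube', 'be']
```

The trade-off is index size: every extra fragment is another term to store, so wide min and max settings can inflate the index considerably.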
Filter (token filter) — A token filter modifies, removes, or adds tokens after the tokenizer has run. Common filters include lowercase, stop (removes stop words), stemmer (reduces words to their root form), and synonym.
“Add a synonym filter so that ‘k8s’ matches ‘kubernetes’ in the index.”
“The stop filter is removing ‘not’ from the query, which is flipping the meaning entirely. Turn it off for this field.”
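The order of the analysis chain (character filters, then the tokenizer, then token filters) can be imitated in plain Python. Everything here is a simplified stand-in: the stop-word list and synonym rule are tiny illustrative samples, and the synonym step replaces tokens, whereas a real synonym filter can be configured in other ways.

```python
import re

STOP_WORDS = {"the", "a", "is", "on"}  # tiny illustrative list
SYNONYMS = {"k8s": "kubernetes"}       # illustrative synonym rule

def analyse(text: str) -> list[str]:
    """Toy analysis chain: character filter, tokenizer, then token filters."""
    text = re.sub(r"<[^>]+>", " ", text)                 # character filter: strip HTML tags
    tokens = re.findall(r"\w+", text)                    # tokenizer: runs of word characters
    tokens = [t.lower() for t in tokens]                 # token filter: lowercase
    tokens = [t for t in tokens if t not in STOP_WORDS]  # token filter: stop words
    return [SYNONYMS.get(t, t) for t in tokens]          # token filter: synonym replacement

terms = analyse("<p>K8s is running on the cluster</p>")
```

Here terms ends up as kubernetes, running, cluster, which is what would be written to the inverted index in this toy model.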
Aggregations
Aggregation — Aggregations let you compute analytics over your search results. Instead of just retrieving matching documents, you can count them, sum values, find averages, build histograms, and more. They are defined in the aggs section of a query.
“We use aggregations to power the faceted navigation on the product listing page — category counts, price ranges, brand filters.”
“If you only need the aggregation result and not the raw hits, set size: 0 to avoid fetching documents unnecessarily.”
Bucket aggregation — A bucket aggregation groups documents into buckets based on a criterion — a field value, a date range, a geographic boundary. Each bucket can contain further sub-aggregations.
“The terms aggregation is a bucket aggregation: it creates one bucket per unique value of the field.”
“Nest a metric aggregation inside each bucket to get, say, average order value per country.”
Metric aggregation — A metric aggregation computes a single numeric value (or a set of values) from the documents in a bucket: sum, avg, min, max, cardinality, percentiles, and so on.
“The cardinality aggregation gives you an approximate distinct count. It’s not exact, but it’s fast enough for dashboards.”
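Bucket and metric aggregations nest naturally, as in this sketch of a request body. The field names country, order_value and customer_id are hypothetical, as are the aggregation names.

```python
# Hypothetical fields: "country", "order_value", "customer_id".
aggs_body = {
    "size": 0,  # skip the raw hits; only aggregation results are returned
    "aggs": {
        "by_country": {
            "terms": {"field": "country"},  # bucket aggregation: one bucket per value
            "aggs": {
                # Metric aggregations computed inside each bucket.
                "avg_order_value": {"avg": {"field": "order_value"}},
                "unique_customers": {"cardinality": {"field": "customer_id"}},
            },
        }
    },
}
```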
How to Use These in Conversation
Scenario 1 — Explaining a performance issue in a stand-up:
“The dashboard queries are slow because we’re running a terms aggregation across the full index with no filter. I’m going to add a date range filter in the bool query’s filter clause so Elasticsearch only has to aggregate over recent documents instead of scanning everything.”
Scenario 2 — Reviewing a mapping change in a pull request:
“This field should be keyword, not text. We’re only ever doing exact matches on it — using text means Elasticsearch will analyse it and create unnecessary entries in the inverted index. Also, dynamic mapping is still on for this index; let’s define an explicit mapping to prevent surprises.”
Scenario 3 — Debugging relevance ranking with a colleague:
“The BM25 scores look off for short documents. Can you run the _explain API on one of these results so we can see how the relevance score was calculated? I suspect the field length normalisation is over-penalising longer, more complete records.”
Scenario 4 — Discussing ILM during a capacity planning call:
“Right now we’re keeping everything on the hot tier indefinitely, which is why storage costs are climbing. If we set up an ILM policy to move indices older than fourteen days to warm and delete after ninety, we can cut the SSD footprint by about sixty per cent.”
Quick Reference
| Term | What it means |
|---|---|
| Index | A named collection of documents; the top-level data container |
| Shard | A horizontal slice of an index; enables distribution and parallelism |
| Replica shard | A copy of a primary shard for redundancy and read throughput |
| Mapping | The schema defining field names, types, and analysis settings |
| Inverted index | The internal lookup structure mapping terms to document IDs |
| BM25 | The default relevance scoring algorithm for full-text queries |
| bool query | A compound query combining must, should, filter, must_not clauses |
| Analyser | The pipeline (tokenizer + filters) that processes text for indexing and search |
| Bucket aggregation | Groups documents into named buckets (e.g. by category, date range) |
| ILM | Policy engine that automates index lifecycle phases (hot → warm → delete) |
Mastering this vocabulary will not only help you communicate more precisely in English — it will sharpen how you think about Elasticsearch itself. Each term reflects a design decision, and understanding the language is the first step to reasoning clearly about search architecture.