MongoDB & Document Database Vocabulary: 25 Terms Explained
Documents, collections, BSON, aggregation pipeline, indexes, and MongoDB vocabulary for backend developers.
If you work on a backend team that uses MongoDB, you have probably heard phrases like “shard the collection by user ID” or “run it through the aggregation pipeline” and nodded along while quietly wondering what was actually being said. MongoDB has its own vocabulary — borrowed partly from relational databases, partly from distributed systems, and partly from its own design decisions. This post breaks down 25 essential terms so you can follow technical discussions, read pull request comments, and write accurate documentation without second-guessing yourself.
Core Terms
Document — the fundamental unit of data in MongoDB. A document is a set of key-value pairs stored in a flexible, JSON-like structure. Unlike a row in a relational table, a document can contain nested objects and arrays, and no two documents in the same collection need to share an identical shape.
“This endpoint returns a single user document, so just call `findOne` and pass it straight to the serialiser.”
“The document schema changed last sprint — there are now two possible shapes in production, so the parser needs to handle both.”
Collection — a grouping of documents, roughly analogous to a table in SQL. A collection does not enforce a fixed schema by default, though you can add validation rules.
“We moved the audit logs into a separate collection to keep the main orders collection lean.”
BSON — Binary JSON. MongoDB stores documents on disk and transmits them over the wire in BSON format rather than plain text JSON. BSON supports additional data types (such as dates and binary data) that standard JSON does not. As a developer you rarely interact with BSON directly, but it is worth knowing the term when reading driver documentation.
“The driver serialises your Python dict to BSON before writing — that is why the date field arrives as an ISODate rather than a plain string.”
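ObjectId — the default type MongoDB uses for the _id field. An ObjectId is a 12-byte value made up of a 4-byte timestamp, a 5-byte random value generated once per process, and a 3-byte incrementing counter, making collisions extremely unlikely even across distributed nodes. (Older versions used a machine identifier and a process ID in place of the random value.)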
“If you are generating IDs on the client side, make sure you are using a proper ObjectId rather than a random string, otherwise the index won’t sort chronologically.”
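The byte layout is easier to see in code. The following is a minimal sketch that uses only the Python standard library and assumes the current layout of a 4-byte timestamp, a 5-byte per-process random value, and a 3-byte counter. The function names are made up for illustration; in a real application you would use the ObjectId type that ships with your driver.

```python
import os
import struct
import time
from datetime import datetime, timezone

# Illustrative per-process state: 5 random bytes and a randomly seeded counter.
_PROCESS_RANDOM = os.urandom(5)
_counter = int.from_bytes(os.urandom(3), "big")


def make_object_id_bytes(now=None):
    """Build 12 bytes laid out like an ObjectId (illustration only)."""
    global _counter
    seconds = int(time.time() if now is None else now)
    _counter = (_counter + 1) % (1 << 24)  # the counter wraps at 3 bytes
    return (
        struct.pack(">I", seconds)        # 4-byte big-endian timestamp
        + _PROCESS_RANDOM                 # 5-byte per-process random value
        + _counter.to_bytes(3, "big")     # 3-byte incrementing counter
    )


def creation_time(oid_bytes):
    """Recover the creation time from the first four bytes."""
    (seconds,) = struct.unpack(">I", oid_bytes[:4])
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


oid = make_object_id_bytes()
print(oid.hex())           # 24 hexadecimal characters
print(creation_time(oid))  # when the ID was generated, to the second
```

Because the timestamp comes first and is big-endian, IDs generated in different seconds compare in creation order, which is why an index on _id sorts roughly chronologically.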
Embedded document vs reference — two strategies for modelling relationships. An embedded document nests related data directly inside a parent document (good for data you always read together). A reference stores only the _id of a related document in another collection and requires a separate query or a $lookup to resolve it (good for data that is large, frequently updated independently, or shared across many documents).
“We debated embedding the address in the user document, but since multiple orders reference the same address we went with a reference instead.”
“For the product images we are embedding the URLs — they are small and we always need them when we fetch the product.”
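To make the two shapes concrete, here is a small sketch using plain Python dicts. All field names, IDs, and URLs are hypothetical, and `resolve_address` only stands in for the extra query (or $lookup) that a reference needs.

```python
# Hypothetical documents, written as plain Python dicts.
address = {"_id": "addr-1", "street": "1 Example Road", "city": "Leeds"}

# Embedded: the image URLs live inside the product and arrive with it.
product = {
    "_id": "prod-1",
    "name": "Desk lamp",
    "images": [{"url": "https://example.com/lamp-front.jpg"}],
}

# Referenced: each order stores only the _id of the shared address.
orders = [
    {"_id": "order-1", "addressId": "addr-1"},
    {"_id": "order-2", "addressId": "addr-1"},
]


def resolve_address(order, addresses_by_id):
    """Stand-in for the second query (or $lookup) that a reference requires."""
    return {**order, "address": addresses_by_id[order["addressId"]]}


resolved = resolve_address(orders[0], {address["_id"]: address})
print(resolved["address"]["city"])
```

Updating the address touches one document in the referenced design; with embedding, every copy would need the same update.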
Query & Aggregation
Aggregation pipeline — a sequence of processing stages that transform a set of documents into a result. Each stage receives the output of the previous stage, allowing you to filter, reshape, group, and join data in a single operation.
“The reporting endpoint is too slow because it is doing all the grouping in application code. Let’s move that logic into an aggregation pipeline.”
$match — a pipeline stage that filters documents using query conditions, similar to a SQL WHERE clause. Placing $match early in a pipeline reduces the number of documents passed to later stages.
“Add a `$match` at the top of the pipeline to filter by `status: 'active'`; right now we are pulling the entire collection into memory.”
$group — a pipeline stage that groups documents by a specified expression and can compute aggregate values such as sums, averages, and counts.
“The `$group` stage calculates total revenue per region, then `$sort` orders the results descending.”
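A pipeline like the one described in these examples might look as follows. This is a sketch: the field names `status`, `region`, and `amount` are assumptions, and the pipeline is just a list of dicts, which is the form PyMongo's `aggregate` method accepts.

```python
def revenue_by_region_pipeline(status="active"):
    """Filter early, group by region, then sort by revenue (highest first)."""
    return [
        # $match first, so later stages see fewer documents.
        {"$match": {"status": status}},
        # One output document per region, with the summed amount.
        {"$group": {"_id": "$region", "totalRevenue": {"$sum": "$amount"}}},
        # -1 means descending order.
        {"$sort": {"totalRevenue": -1}},
    ]


pipeline = revenue_by_region_pipeline()
# With PyMongo and a hypothetical `orders` collection:
#     results = list(db.orders.aggregate(pipeline))
```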
$project — a pipeline stage that reshapes each document, including or excluding fields, renaming them, or computing new values. It is roughly analogous to the column list of a SQL SELECT.
“Use `$project` to strip out the internal audit fields before the response leaves the API layer.”
$lookup — a pipeline stage that performs a left outer join between the current collection and another collection in the same database, adding matching documents as an array field.
“We replaced the two-query pattern with a `$lookup` so the whole thing runs server-side in one round trip.”
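As a rough illustration of $lookup and $project working together, the sketch below joins orders to a hypothetical `addresses` collection and then drops two made-up internal fields. The collection and field names are assumptions for the example.

```python
def orders_with_address_pipeline():
    """Join each order to its address, then hide internal fields."""
    return [
        {
            "$lookup": {
                "from": "addresses",         # the other collection
                "localField": "addressId",   # field on the order
                "foreignField": "_id",       # field on the address
                "as": "address",             # output field, always an array
            }
        },
        # A value of 0 excludes a field; everything else passes through.
        {"$project": {"auditTrail": 0, "internalNotes": 0}},
    ]


pipeline = orders_with_address_pipeline()
# With PyMongo: results = list(db.orders.aggregate(pipeline))
```

Note that `address` comes back as an array even when exactly one document matches, so callers usually take the first element or add an $unwind stage.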
Covered query — a query where all the fields requested in the filter and the projection are present in an index. MongoDB can satisfy a covered query entirely from the index without reading the underlying documents, which is significantly faster.
“I checked the explain plan and it is a covered query — no document fetches at all, just an index scan.”
Explain plan — the output of cursor.explain() or db.collection.explain(), which describes how MongoDB executed (or plans to execute) a query. The explain plan shows whether an index was used, how many documents were examined, and where time was spent.
“Before deploying that query to production, run an explain plan in staging and make sure it is not doing a COLLSCAN.”
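A check like this can be scripted. The sketch below walks the `queryPlanner.winningPlan` tree of an explain result and looks for stage names. It assumes the classic plan shape with `stage`, `inputStage`, and `inputStages` keys; newer server versions may nest the plan differently, so treat this as a heuristic rather than a complete parser. The sample dicts are hand-written and heavily trimmed.

```python
def plan_stages(plan):
    """Yield every stage name in a winningPlan-style tree."""
    yield plan.get("stage")
    children = list(plan.get("inputStages", []))
    if "inputStage" in plan:
        children.append(plan["inputStage"])
    for child in children:
        yield from plan_stages(child)


def has_collection_scan(explain_output):
    """True if any stage in the winning plan is a full collection scan."""
    winning = explain_output["queryPlanner"]["winningPlan"]
    return "COLLSCAN" in plan_stages(winning)


def looks_covered(explain_output):
    """Heuristic: an index scan with no FETCH or COLLSCAN stage."""
    stages = set(plan_stages(explain_output["queryPlanner"]["winningPlan"]))
    return "IXSCAN" in stages and not stages & {"FETCH", "COLLSCAN"}


# Hand-written, trimmed examples of what explain output can look like.
scan_plan = {"queryPlanner": {"winningPlan": {"stage": "COLLSCAN"}}}
covered_plan = {
    "queryPlanner": {
        "winningPlan": {
            "stage": "PROJECTION_COVERED",
            "inputStage": {"stage": "IXSCAN", "indexName": "gameId_1_score_-1"},
        }
    }
}

print(has_collection_scan(scan_plan), looks_covered(covered_plan))
```

When execution statistics are available, a `totalDocsExamined` of zero is another sign that a query was covered.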
Indexes
Index — a data structure that MongoDB maintains alongside a collection to speed up queries. Without an index, MongoDB must scan every document in a collection to find matches (a collection scan). Indexes come in several types.
Single-field index — an index on one field. The most common type, and a good starting point when you consistently query or sort by that field.
“There is no index on `createdAt`; add a single-field index and the date-range queries should drop from seconds to milliseconds.”
Compound index — an index on two or more fields. Field order matters: MongoDB can use a compound index to satisfy queries on a prefix of the indexed fields but not on a suffix alone.
“We have a compound index on `userId` and `createdAt`, so sorting by date per user is fast, but querying by `createdAt` alone won’t use it.”
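The prefix rule can be expressed as a small function. This is a deliberately simplified model: it considers only equality filters and ignores sorts, range conditions, and everything else the real query planner weighs. The function name is made up for illustration.

```python
def usable_prefix(index_fields, equality_fields):
    """Return the leading index fields that the query filters on.

    An empty result means the query cannot use the index via the prefix rule.
    """
    prefix = []
    for field in index_fields:
        if field not in equality_fields:
            break  # a gap ends the usable prefix
        prefix.append(field)
    return prefix


index = ["userId", "createdAt"]

print(usable_prefix(index, {"userId", "createdAt"}))  # both fields
print(usable_prefix(index, {"userId"}))               # the one-field prefix
print(usable_prefix(index, {"createdAt"}))            # nothing usable
```

With PyMongo, an index like this would typically be created with `create_index([("userId", 1), ("createdAt", -1)])`.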
Multikey index — an index on a field whose value is an array. MongoDB creates a separate index entry for each element of the array, enabling efficient queries on array contents.
“The `tags` field is an array, so the index we created on it is automatically a multikey index; you can query for any tag without scanning documents.”
Text index — a specialised index that tokenises string fields to support full-text search with the $text operator, including stemming and stop-word filtering.
“We added a text index on the `description` field so users can search by keyword without us spinning up Elasticsearch.”
Geospatial index — an index optimised for location data. The 2dsphere index supports queries on GeoJSON objects (points, lines, polygons) for operations like “find all venues within 5 km.”
“The store locator uses a `2dsphere` index on the `location` field; the `$near` query returns results sorted by distance automatically.”
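A "within 5 km" filter could be built like this. It is a sketch that assumes the documents store a GeoJSON point in a field called `location` and that the field has a 2dsphere index; the coordinates in the example are arbitrary.

```python
def venues_near(longitude, latitude, max_metres=5_000):
    """Build a $near filter around a GeoJSON point."""
    return {
        "location": {
            "$near": {
                # GeoJSON order is longitude first, then latitude.
                "$geometry": {"type": "Point", "coordinates": [longitude, latitude]},
                # Distance is in metres when querying GeoJSON points.
                "$maxDistance": max_metres,
            }
        }
    }


query = venues_near(-0.1276, 51.5072)
# With PyMongo and a hypothetical `venues` collection:
#     nearby = list(db.venues.find(query).limit(20))
```

Swapping longitude and latitude is a common mistake; GeoJSON always puts longitude first.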
Reliability & Scale
Write concern — a setting that controls how many members of a replica set must acknowledge a write operation before MongoDB reports it as successful. A higher write concern increases durability at the cost of latency.
“We set write concern to `majority` on the payments collection; we would rather take the extra milliseconds than risk an acknowledged write being rolled back after a failover.”
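Read preference — a setting that controls which member of a replica set your application reads from. primary (the default) reads from the primary only; secondary reads only from secondaries; secondaryPreferred reads from a secondary when one is available and falls back to the primary otherwise. Reading from secondaries can reduce load on the primary but may return slightly stale data, because replication is asynchronous.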
“The analytics queries are running on `secondaryPreferred` to keep load off the primary; a few seconds of staleness is fine for dashboards.”
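Both settings can be supplied as connection string options (`w` for write concern, `readPreference` for read preference). The sketch below builds two URIs with the standard library; the host names and the replica set name `rs0` are placeholders, and drivers also let you set these options per database, collection, or operation.

```python
from urllib.parse import urlencode

# Placeholder hosts for a three-member replica set.
HOSTS = ["db1.example.net:27017", "db2.example.net:27017", "db3.example.net:27017"]


def build_uri(hosts, replica_set, **options):
    """Assemble a mongodb:// URI with the given connection options."""
    query = urlencode({"replicaSet": replica_set, **options})
    return f"mongodb://{','.join(hosts)}/?{query}"


# Payments: wait for a majority of members to acknowledge each write.
payments_uri = build_uri(HOSTS, "rs0", w="majority")

# Dashboards: prefer secondaries and accept slightly stale reads.
analytics_uri = build_uri(HOSTS, "rs0", readPreference="secondaryPreferred")

print(payments_uri)
print(analytics_uri)
```

Using separate clients for the two workloads keeps the stricter write setting away from queries that do not need it.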
Replica set — a group of MongoDB instances (typically three or more) that maintain the same dataset. One member is the primary (accepts writes); the others are secondaries (replicate from the primary). If the primary becomes unavailable, the secondaries elect a new primary automatically.
“We had a brief outage last night when the primary went down, but the replica set elected a new primary in about ten seconds and everything recovered.”
Sharding — a method of distributing data across multiple servers (shards) so that no single machine holds the entire dataset. Sharding is MongoDB’s horizontal scaling strategy.
“Once the collection grows past a few hundred gigabytes we will need to think about sharding — the current single-shard setup won’t hold.”
Shard key — the field (or compound of fields) used to determine which shard a document belongs to. Choosing the right shard key is critical: a poor choice leads to uneven data distribution (hotspots).
“The team spent a week debating the shard key; we settled on a compound of `tenantId` and `createdAt` to get even distribution and locality.”
Chunk — in a sharded cluster, MongoDB divides the key space into contiguous ranges called chunks and assigns each chunk to a shard. The balancer moves chunks between shards to maintain even distribution.
“The balancer was running at peak traffic because too many chunks had migrated to one shard — we adjusted the chunk size to reduce the churn.”
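The relationship between shard key, chunks, and shards can be pictured with a toy routing table. This is not how MongoDB implements routing internally; it is a simplified model with made-up split points and shard names, where each chunk covers a key range that includes its lower bound and excludes its upper bound.

```python
from bisect import bisect_right

# Toy model: three split points divide the key space into four chunks.
#   chunk 0: keys below "g"      chunk 1: "g" up to (not including) "n"
#   chunk 2: "n" up to "t"       chunk 3: "t" and above
SPLIT_POINTS = ["g", "n", "t"]
CHUNK_OWNERS = ["shardA", "shardB", "shardA", "shardC"]


def chunk_for(shard_key_value):
    """Index of the chunk whose range contains the key."""
    return bisect_right(SPLIT_POINTS, shard_key_value)


def shard_for(shard_key_value):
    """Shard that currently owns the chunk containing the key."""
    return CHUNK_OWNERS[chunk_for(shard_key_value)]


for name in ["alice", "mike", "nina", "zoe"]:
    print(name, "->", shard_for(name))

# A balancer-style move changes ownership without touching the ranges.
CHUNK_OWNERS[2] = "shardC"
```

In this model a hotspot is simply a chunk (or shard) that receives a disproportionate share of the keys, which is why the choice of shard key matters so much.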
Change stream — a feature that allows applications to subscribe to a real-time stream of data changes (inserts, updates, deletes) on a collection, database, or entire cluster. Change streams are built on MongoDB’s oplog and support resumable consumption.
“We replaced the polling loop with a change stream — now the notification service reacts to new orders in under a second instead of on a 30-second interval.”
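A consumer for such a stream might be structured as below. The sketch assumes a PyMongo-style collection object whose `watch()` method returns an iterable context manager, and a deployment that supports change streams (a replica set or sharded cluster). The `notify` callback and the `orders` collection in the usage comment are hypothetical.

```python
def watch_inserts(collection, handle):
    """Call `handle` with each newly inserted document.

    Against a real deployment this loop blocks and runs until the
    stream is closed or an error occurs.
    """
    # Only pass insert events through; updates and deletes are ignored.
    pipeline = [{"$match": {"operationType": "insert"}}]
    with collection.watch(pipeline) as stream:
        for change in stream:
            # Insert events carry the new document in "fullDocument".
            handle(change["fullDocument"])


# Usage with PyMongo (not run here):
#     from pymongo import MongoClient
#     orders = MongoClient(uri)["shop"]["orders"]
#     watch_inserts(orders, notify)
```

Each change event also includes a resume token in its `_id` field, which a production consumer would persist so it can pick up where it left off after a restart.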
How to Use These in Conversation
Discussing a schema design decision:
“I am not sure whether to embed the line items or reference a separate `orderItems` collection. Embedding keeps it to one document fetch, but a large order could push the document past the 16 MB BSON limit. What do you think?”
Reviewing a slow query:
“I ran the explain plan on the leaderboard query and it is doing a COLLSCAN because there is no compound index on `gameId` and `score`. If we add that index, it should become a covered query and response time should drop dramatically.”
Planning for scale:
“Once we hit around 500 GB of user data we should start thinking about sharding. The tricky part is choosing a shard key that gives us even distribution: if we pick `userId` alone we might get hotspots for power users.”
Proposing a real-time feature:
“Instead of polling the database every minute for new job postings, we could open a change stream on the `jobs` collection. The UI would update in near real-time and we would eliminate a lot of unnecessary read load.”
Quick Reference
| Term | One-line definition |
|---|---|
| Document | A single BSON record — the basic unit of storage |
| Collection | A group of documents (like a table, but schema-flexible) |
| ObjectId | Default auto-generated _id; 12 bytes that begin with a creation timestamp |
| Aggregation pipeline | A chain of stages that filter, transform, and join documents |
| Covered query | A query resolved entirely from an index — no document reads |
| Explain plan | Query execution report; shows index usage and document scan counts |
| Write concern | How many replica members must confirm a write before it is “done” |
| Read preference | Which replica member to read from (primary vs secondary) |
| Replica set | A cluster of MongoDB nodes sharing the same data for high availability |
| Change stream | Real-time subscription to insert/update/delete events on a collection |