Vocabulary for NLP Engineers: 22 Terms Every Language Engineer Should Know

Learn the essential English vocabulary of natural language processing — tokenization, embeddings, named entity recognition, and more for NLP engineers.

Natural language processing engineering has a rich, precise vocabulary describing how text gets broken down, represented, and understood by models. Because this field deals directly with language itself, imprecise terminology is especially noticeable: using the wrong term while discussing language technology can undercut your credibility more than the same slip would in other engineering domains. This guide covers 22 essential NLP terms with clear definitions and usage examples.

Text Processing Fundamentals

1. Tokenization

The process of splitting raw text into smaller units — tokens — which can be words, subwords, or characters, depending on the tokenizer used.

Usage: “This tokenizer splits contractions like ‘don’t’ into two tokens, which is causing a mismatch with our downstream rule-based system.”
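
As a rough illustration of how tokenizer choice changes the output, here is a minimal sketch comparing a whitespace split with a regex tokenizer that separates contractions. The pattern and the `tokenize` helper are hypothetical, written for this example only, and handle straight apostrophes only.

```python
import re

# Hypothetical toy pattern. It tries, in order: the clitic "n't", other
# apostrophe suffixes ('s, 're), a word stem that precedes "n't",
# a plain word, or a single punctuation mark. Straight apostrophes only.
TOKEN_PATTERN = re.compile(r"n't|'\w+|\w+(?=n't)|\w+|[^\w\s]")

def tokenize(text):
    """Split text into word, contraction, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

print("I don't know.".split())    # ['I', "don't", 'know.']
print(tokenize("I don't know."))  # ['I', 'do', "n't", 'know', '.']
```

The same sentence yields three tokens under one scheme and five under the other, which is the kind of mismatch the usage example describes.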

2. Subword tokenization

A tokenization strategy that breaks rare or unknown words into smaller, more frequent subword units, allowing a model to handle vocabulary it never saw whole during training.

Usage: “The word ‘tokenization’ itself gets split into subwords like ‘token’ and ‘##ization’ by this particular tokenizer.”
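
The split in the usage example can be reproduced with a greedy longest-match-first procedure in the style of WordPiece. This is a minimal sketch, assuming a tiny hand-written vocabulary (`TOY_VOCAB` is hypothetical) rather than one learned from a corpus, and it is not any particular library's implementation.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedily match the longest vocabulary entry at each position.

    Pieces that do not start the word carry a "##" prefix.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: give up on the whole word.
            return [unk]
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary for illustration.
TOY_VOCAB = {"token", "##ization", "##ize", "##s"}

print(wordpiece_tokenize("tokenization", TOY_VOCAB))  # ['token', '##ization']
print(wordpiece_tokenize("tokens", TOY_VOCAB))        # ['token', '##s']
```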

3. Stemming vs. lemmatization

Stemming crudely chops word endings using rules (often producing non-words); lemmatization uses linguistic knowledge to reduce a word to its proper dictionary base form.

Usage: “Stemming reduced ‘running’ to ‘run’ but also reduced ‘university’ to the nonsensical ‘univers’ — we switched to lemmatization for cleaner results.”
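
The contrast can be sketched with a deliberately crude suffix stripper next to a dictionary lookup. Both the suffix list and the lemma table below are hypothetical and far smaller than what a real stemmer or lemmatizer uses; the point is only that rules chop blindly while a lookup returns real words.

```python
# Hypothetical rule list; real stemmers use many more rules and conditions.
SUFFIXES = ("ing", "ity", "es", "s")

def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undouble a trailing consonant: "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word

# Hypothetical lemma table standing in for real linguistic resources.
LEMMA_TABLE = {"running": "run", "ran": "run", "universities": "university"}

def lemmatize(word):
    """Look the word up; unknown words are returned unchanged."""
    return LEMMA_TABLE.get(word, word)

print(crude_stem("running"), crude_stem("university"))  # run univers
print(lemmatize("ran"), lemmatize("university"))        # run university
```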

4. Stop words

Common, typically low-information words (like “the,” “is,” “and”) that are sometimes filtered out during preprocessing in traditional NLP pipelines.

Usage: “We removed stop words for this keyword extraction task, but kept them for the sentiment model since negation words like ‘not’ matter there.”
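
A minimal sketch of task-dependent stop-word filtering, assuming a small hypothetical stop list and a `keep` set for words such as negations that a given task needs to preserve:

```python
# Hypothetical, deliberately tiny stop list.
DEFAULT_STOP_WORDS = frozenset({"the", "is", "and", "a", "of", "not"})
NEGATIONS = frozenset({"not", "no", "never"})

def remove_stop_words(tokens, stop_words=DEFAULT_STOP_WORDS, keep=frozenset()):
    """Drop stop words, except any that appear in `keep`."""
    return [
        t for t in tokens
        if t.lower() not in stop_words or t.lower() in keep
    ]

tokens = "the movie is not good".split()
print(remove_stop_words(tokens))                  # ['movie', 'good']
print(remove_stop_words(tokens, keep=NEGATIONS))  # ['movie', 'not', 'good']
```

Without the `keep` set, the filtered tokens read as praise, which is why the sentiment model in the usage example keeps negations.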

5. Named entity recognition (NER)

The task of identifying and classifying spans of text into predefined categories such as people, organizations, locations, and dates.

Usage: “The NER model correctly tagged ‘Amazon’ as an organization in this sentence, but misclassified it as a location in another.”
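
NER output is often represented as one tag per token and then merged into labeled spans. The sketch below assumes the common BIO tagging convention (B- begins an entity, I- continues it, O is outside), which this guide does not otherwise cover; the tags are written by hand here, not produced by a model.

```python
def bio_to_spans(tokens, tags):
    """Merge per-token BIO tags into (entity_text, label) spans."""
    spans = []
    current_tokens, current_label = [], None

    def flush():
        if current_tokens:
            spans.append((" ".join(current_tokens), current_label))

    for token, tag in zip(tokens, tags):
        starts_entity = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_label
        )
        if starts_entity:
            flush()
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-"):
            current_tokens.append(token)
        else:  # "O": outside any entity
            flush()
            current_tokens, current_label = [], None
    flush()
    return spans

tokens = ["Amazon", "opened", "an", "office", "in", "New", "York"]
tags = ["B-ORG", "O", "O", "O", "O", "B-LOC", "I-LOC"]
print(bio_to_spans(tokens, tags))  # [('Amazon', 'ORG'), ('New York', 'LOC')]
```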

Representation and Embeddings

6. Embedding

A dense, numerical vector representation of a word, sentence, or document, positioned in a continuous space such that semantically similar items end up close together.

Usage: “The embeddings for ‘king’ and ‘queen’ are close in vector space, but the embedding for ‘bank’ sits between its financial and riverbank senses because this static model isn’t context-aware.”

7. Contextual embedding

An embedding that varies depending on the surrounding context, so the same word gets a different vector representation depending on its meaning in a given sentence — as opposed to static embeddings.

Usage: “Unlike static embeddings, the contextual embedding for ‘bank’ differs between ‘river bank’ and ‘savings bank,’ because the model considers the surrounding words.”

8. Vector database

A specialized database optimized for storing and efficiently searching high-dimensional embedding vectors, typically used for semantic search and retrieval.

Usage: “We store document embeddings in a vector database so we can retrieve semantically similar passages, not just keyword matches.”
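
To make the store-then-search idea concrete, here is a minimal in-memory sketch. `TinyVectorStore` is a hypothetical class that scans every vector by Euclidean distance; production vector databases typically rely on approximate nearest-neighbor indexes instead, and the three-dimensional vectors are made-up stand-ins for real embeddings.

```python
import math

class TinyVectorStore:
    """Hypothetical brute-force store: exact search over all vectors."""

    def __init__(self):
        self._items = []  # list of (doc_id, vector) pairs

    def add(self, doc_id, vector):
        self._items.append((doc_id, list(vector)))

    def search(self, query, k=3):
        """Return the k (doc_id, distance) pairs closest to the query."""
        scored = [(math.dist(query, vec), doc_id) for doc_id, vec in self._items]
        scored.sort()  # smallest distance first
        return [(doc_id, distance) for distance, doc_id in scored[:k]]

store = TinyVectorStore()
store.add("refund", [0.9, 0.1, 0.0])
store.add("shipping", [0.1, 0.9, 0.0])
store.add("returns", [0.8, 0.2, 0.1])

print([doc_id for doc_id, _ in store.search([1.0, 0.0, 0.0], k=2)])
# ['refund', 'returns']
```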

9. Semantic similarity

A measure of how close two pieces of text are in meaning, typically computed using the distance or angle between their embeddings.

Usage: “Semantic similarity scored these two support tickets as nearly identical even though they used completely different wording.”

10. Cosine similarity

A specific metric commonly used to measure semantic similarity between two embedding vectors, based on the angle between them rather than their magnitude.

Usage: “We rank retrieved passages by cosine similarity to the query embedding before passing the top results to the language model.”
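
The metric itself is the dot product of the two vectors divided by the product of their lengths. A minimal sketch in plain Python, where returning 0.0 for a zero-length vector is a convention chosen for this example:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)

# Same direction, different magnitude: similarity is (approximately) 1.
print(math.isclose(cosine_similarity([1, 2, 3], [2, 4, 6]), 1.0))  # True
# Perpendicular vectors: similarity is 0.
print(cosine_similarity([1, 0], [0, 1]))  # 0.0
```

The first pair shows why the metric ignores magnitude: doubling every component leaves the angle unchanged.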

Language Understanding Tasks

11. Part-of-speech (POS) tagging

Labeling each word in a sentence with its grammatical category (its part of speech), such as noun, verb, or adjective.

Usage: “The POS tagger mislabeled ‘record’ as a noun in this sentence, but it’s actually being used as a verb.”

12. Coreference resolution

Determining which words or phrases in a text refer to the same entity, such as figuring out that “she” refers to “Maria” mentioned two sentences earlier.

Usage: “Coreference resolution linked ‘the company’ back to ‘Acme Corp’ three sentences prior, which let the summarizer keep the reference clear.”

13. Sentiment analysis

Classifying text according to the emotional tone or opinion it expresses, typically as positive, negative, or neutral.

Usage: “Sentiment analysis on these reviews flagged them as neutral, but a human reader would clearly recognize the sarcasm as negative.”

14. Intent classification

Determining the underlying goal or purpose behind a piece of text, commonly used in conversational systems to decide how to respond.

Usage: “The intent classifier correctly identified this message as a ‘cancel subscription’ request, even though the user never used the word ‘cancel.’”

15. Text summarization (extractive vs. abstractive)

Extractive summarization selects and stitches together existing sentences from the source text; abstractive summarization generates new sentences that convey the same meaning.

Usage: “The extractive summary just pulled the first sentence of each paragraph, which missed the key point buried in the middle — an abstractive model handled this much better.”
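
The weak extractive baseline described in the usage example, taking the first sentence of each paragraph, can be sketched as follows. The sentence splitter is a naive regex assumed for this example (it would break on abbreviations), and paragraphs are assumed to be separated by blank lines.

```python
import re

def lead_sentence_summary(document):
    """Extractive baseline: keep the first sentence of each paragraph."""
    summary = []
    for paragraph in document.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # Naive split: sentence-ending punctuation followed by whitespace.
        first_sentence = re.split(r"(?<=[.!?])\s+", paragraph, maxsplit=1)[0]
        summary.append(first_sentence)
    return " ".join(summary)

doc = (
    "Sales rose in Q1. The cause was a pricing bug.\n\n"
    "The team shipped a fix. Revenue then normalized."
)
print(lead_sentence_summary(doc))
# Sales rose in Q1. The team shipped a fix.
```

Every output sentence is copied verbatim from the source, and the key detail (the pricing bug) is lost because it was not a leading sentence.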

Retrieval and Generation

16. Retrieval-augmented generation (RAG)

A technique that retrieves relevant documents or passages from an external knowledge source and provides them to a language model as context before it generates a response, improving factual grounding.

Usage: “We added RAG so the model can cite our actual documentation instead of hallucinating plausible-sounding but incorrect API parameters.”
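
The retrieve-then-prompt flow can be sketched without any model at all. In this minimal example the retriever scores passages by shared words (a stand-in for embedding search), the passages and prompt wording are hypothetical, and the final call to a language model is left out.

```python
def retrieve(query, passages, k=2):
    """Toy retriever: rank passages by how many query words they share."""
    query_words = set(query.lower().split())

    def overlap(passage):
        return len(query_words & set(passage.lower().split()))

    return sorted(passages, key=overlap, reverse=True)[:k]

def build_prompt(query, passages):
    """Place the retrieved passages ahead of the question as context."""
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer using only the context below. Cite passages by number.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

# Hypothetical documentation snippets.
PASSAGES = [
    "The timeout parameter sets the request timeout in seconds.",
    "Use the retries parameter to control retry attempts.",
    "Our office is closed on public holidays.",
]

query = "what does the timeout parameter do"
print(build_prompt(query, retrieve(query, PASSAGES)))
```

The resulting prompt string is what would be sent to the model, so the answer can draw on the retrieved text instead of the model's memory alone.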

17. Hallucination

When a language model generates text that is fluent and plausible-sounding but factually incorrect or unsupported by any source.

Usage: “The model hallucinated a function name that doesn’t exist in our SDK — it sounded exactly like our real naming conventions, which is what made it dangerous.”

18. Chunking

Splitting long documents into smaller, manageable pieces (chunks) before embedding and indexing them, typically for use in retrieval systems.

Usage: “We’re chunking documents by section heading rather than a fixed token count, which keeps semantically related content together.”
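
The two strategies contrasted in the usage example can be sketched side by side. This assumes headings are lines starting with a "# " prefix, which is a hypothetical convention for the example; the `overlap` parameter for fixed-size chunks is also an addition not mentioned above.

```python
def chunk_by_token_count(tokens, size, overlap=0):
    """Fixed-size chunks; consecutive chunks share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    if not tokens:
        return []
    chunks, start, step = [], 0, size - overlap
    while True:
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
        start += step
    return chunks

def chunk_by_heading(text, heading_prefix="# "):
    """Start a new chunk at every line that begins with the heading prefix."""
    chunks, current = [], []
    for line in text.splitlines():
        if line.startswith(heading_prefix) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]

print(chunk_by_token_count(list(range(10)), size=4, overlap=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]

doc = "# Install\nRun the installer.\n# Usage\nCall the client.\nClose it when done."
print(chunk_by_heading(doc))
# ['# Install\nRun the installer.', '# Usage\nCall the client.\nClose it when done.']
```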

19. Reranking

A second-stage process that reorders an initial set of retrieved candidates using a more precise (and usually more expensive) model, to improve the final ranking quality.

Usage: “The initial retrieval returns 50 candidates quickly, and a reranker narrows that down to the top 5 most relevant passages.”
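
A minimal sketch of the two-stage pattern, with both scoring functions passed in as parameters. The toy scorers below are hypothetical: word overlap stands in for the fast first stage, and an exact-phrase bonus stands in for a more precise model such as a cross-encoder.

```python
def retrieve_then_rerank(query, candidates, cheap_score, precise_score,
                         first_k=50, final_k=5):
    """Shortlist with a cheap scorer, then reorder with a precise one."""
    shortlist = sorted(
        candidates, key=lambda c: cheap_score(query, c), reverse=True
    )[:first_k]
    reranked = sorted(
        shortlist, key=lambda c: precise_score(query, c), reverse=True
    )
    return reranked[:final_k]

def word_overlap(query, doc):
    """Cheap stand-in scorer: number of shared words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def phrase_aware(query, doc):
    """Stand-in for an expensive scorer: rewards an exact phrase match."""
    bonus = 10 if query.lower() in doc.lower() else 0
    return word_overlap(query, doc) + bonus

DOCS = [
    "reset your password from the login page",
    "password policy and reset rules for admins",
    "how to reset password quickly",
    "billing and invoices",
]

print(retrieve_then_rerank("reset password", DOCS, word_overlap, phrase_aware,
                           first_k=3, final_k=2))
# ['how to reset password quickly', 'reset your password from the login page']
```

The precise scorer only ever sees the shortlist, which is what keeps the expensive stage affordable.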

20. Perplexity (language modeling)

A metric measuring how well a language model predicts a given sequence of text, computed as the exponential of the average negative log-likelihood per token; lower perplexity indicates the model finds the sequence more predictable.

Usage: “Perplexity on this domain-specific corpus was much higher than on general text, which suggested the model needed further domain adaptation.”
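
Perplexity is the exponential of the average negative log-probability the model assigns to each token. A minimal sketch, assuming the per-token probabilities have already been obtained from a model (the values below are made up):

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability over the sequence."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# A model that spreads probability evenly over 4 choices at every step
# has a perplexity of (approximately) 4.
print(round(perplexity([0.25, 0.25, 0.25, 0.25]), 6))  # 4.0
# More confident predictions give lower perplexity.
print(perplexity([0.9, 0.8, 0.95]) < perplexity([0.25, 0.25, 0.25]))  # True
```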

21. BLEU / ROUGE score

Automated metrics for evaluating generated text (like translations or summaries) by comparing overlap with one or more reference texts. BLEU is precision-oriented and common for translation; ROUGE is recall-oriented and common for summarization.

Usage: “The ROUGE score improved after fine-tuning, but a manual review still found the summaries missed some nuance the metric didn’t capture.”
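
The overlap idea behind both metrics can be sketched at the single-word level. This is a simplified illustration, not full BLEU or ROUGE: full BLEU also combines several n-gram orders and applies a brevity penalty, and ROUGE has several variants. Unigram precision is in the spirit of BLEU, and unigram recall is in the spirit of ROUGE-1.

```python
from collections import Counter

def unigram_overlap(candidate, reference):
    """Return (precision, recall) of clipped unigram overlap."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    return precision, recall

print(unigram_overlap("the cat sat", "the cat sat on the mat"))  # (1.0, 0.5)
```

Every candidate word appears in the reference (precision 1.0), yet half the reference is missing (recall 0.5). Neither number says whether the meaning was preserved, which is the limitation the usage example points to.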

22. Domain adaptation

The process of adjusting a general-purpose NLP model to perform better on text from a specific domain (like legal, medical, or technical writing) that differs from its original training distribution.

Usage: “We did domain adaptation on medical transcripts, since the general model kept misinterpreting clinical abbreviations as common words.”

Key Takeaways

  • Tokenization choices (subword vs. word-level) affect everything downstream — know which your system uses before debugging unexpected model behavior.
  • Embeddings and semantic similarity are the foundation of modern search and retrieval; distinguish static embeddings from contextual ones precisely.
  • RAG, chunking, and reranking are the standard vocabulary for describing retrieval pipelines that ground language model outputs in real documents.
  • Hallucination is a specific, well-defined failure mode — use it precisely rather than as a catch-all for “the model was wrong.”
  • Automated metrics like BLEU, ROUGE, and perplexity are useful signals but don’t replace human judgment — mention their limitations when reporting results.