LLMOps in English: Vocabulary for Deploying and Monitoring Language Models

Expand your LLMOps vocabulary in English — prompt versioning, RAG, evaluation harnesses, hallucination monitoring, and cost-per-token language for AI engineers.

LLMOps: A New Domain, New English Vocabulary

LLMOps — the operational practice of deploying, monitoring, and maintaining Large Language Models in production — is one of the fastest-growing specialisations in software engineering. The vocabulary is evolving rapidly, and many terms have no direct equivalent in other languages. Mastering these terms in English is essential if you work in AI engineering or want to follow international research and tooling discussions.

Core LLMOps Vocabulary

Prompt Versioning

Prompts are not static — they change frequently as you tune model behaviour. Prompt versioning is the practice of tracking changes to prompts in a version control system, much like source code.

“We version all system prompts in our Git repository and use semantic versioning to track breaking changes.”

Related terms:

  • Prompt template — a reusable prompt structure with placeholder variables.
  • System prompt — the instructions given to the model before the user’s input.
  • Prompt registry — a centralised store of versioned prompts.
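
The three terms above can be illustrated with a minimal sketch. It assumes an in-memory registry; the class name `PromptRegistry`, the prompt name `support_system`, and the placeholder variables are hypothetical and chosen only for illustration. A real team would more likely keep the prompts in Git or in a dedicated tool.

```python
from string import Template


class PromptRegistry:
    """Hypothetical in-memory prompt registry: name -> {version -> template text}."""

    def __init__(self) -> None:
        self._store: dict[str, dict[str, str]] = {}

    def register(self, name: str, version: str, text: str) -> None:
        versions = self._store.setdefault(name, {})
        if version in versions:
            # Published versions are immutable; changes need a new version number.
            raise ValueError(f"{name}@{version} is already registered")
        versions[version] = text

    def get(self, name: str, version: str) -> Template:
        # A prompt template: reusable text with $placeholder variables.
        return Template(self._store[name][version])

    def latest(self, name: str) -> str:
        # Compare versions numerically so that "1.10.0" sorts after "1.9.0".
        return max(self._store[name], key=lambda v: tuple(int(p) for p in v.split(".")))


registry = PromptRegistry()
registry.register("support_system", "1.0.0", "You are a support agent for $product.")
# Adding a required variable is a breaking change, so the major version goes up.
registry.register(
    "support_system", "2.0.0",
    "You are a support agent for $product. Answer in $language.",
)

prompt_v1 = registry.get("support_system", "1.0.0").substitute(product="Acme CRM")
latest_version = registry.latest("support_system")
prompt_latest = registry.get("support_system", latest_version).substitute(
    product="Acme CRM", language="English"
)
print(latest_version)
print(prompt_latest)
```

Because old versions stay in the registry, rolling back means asking for "1.0.0" again rather than editing the prompt text.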

Retrieval-Augmented Generation (RAG)

RAG is an architecture in which the system retrieves relevant context from an external knowledge base and injects it into the prompt before the model generates a response. This tends to reduce hallucination and keeps the model’s answers current without retraining.

  • Vector store — a database that stores text as numerical embeddings for semantic similarity search.
  • Retrieval pipeline — the sequence of steps that fetches, ranks, and injects context into the prompt.
  • Chunking — splitting documents into smaller pieces for indexing and retrieval.
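
A toy sketch of chunking, a vector store, and a retrieval pipeline is shown below. It rests on loud assumptions: `embed` is a word-count stand-in for a real embedding model, the "vector store" is a plain Python list, and the sample document and function names are hypothetical.

```python
import math
import re
from collections import Counter


def chunk(text: str, max_words: int = 7) -> list[str]:
    """Chunking: split a document into pieces of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, store: list[tuple[str, Counter]], k: int = 1) -> list[str]:
    """Retrieval step: rank stored chunks by similarity to the query, keep the top k."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


document = (
    "Refunds are processed within five business days. "
    "Shipping to Europe takes about one week. "
    "Support is available by email on weekdays."
)

chunks = chunk(document, max_words=7)
store = [(c, embed(c)) for c in chunks]  # the "vector store" in this sketch

query = "How long do refunds take?"
top = retrieve(query, store, k=1)

# Inject the retrieved context into the prompt before calling the model.
prompt = "Context:\n" + "\n".join(top) + f"\n\nQuestion: {query}"
print(prompt)
```

Word counts only match exact words, which is why production systems use learned embeddings for semantic similarity instead.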

Evaluation Harness

An evaluation harness is a test framework that automatically scores model outputs against a set of reference answers or criteria. This is analogous to a unit test suite for traditional software.

  • Groundedness — whether the model’s answer is supported by the retrieved context.
  • Faithfulness — whether the response accurately reflects the source material without adding fabricated detail.
  • Latency budget — the maximum acceptable response time for a model call.
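
As a rough illustration, an evaluation harness can be as small as a loop over a golden set. The sketch below assumes a keyword-match scoring rule and a stubbed model; `fake_model`, `GOLDEN_SET`, and the 2-second budget are hypothetical, and real harnesses usually use richer scoring such as groundedness or faithfulness graders.

```python
import time


def fake_model(question: str) -> str:
    """Stand-in for a real model call (assumption: swap in your own client)."""
    canned = {
        "What is the capital of France?": "The capital of France is Paris.",
        "How many days are in a week?": "There are seven days in a week.",
        "What colour is the sky on a clear day?": "It is green.",
    }
    return canned.get(question, "")


# Golden set: each case pairs a question with a keyword the answer must contain.
GOLDEN_SET = [
    {"question": "What is the capital of France?", "must_contain": "paris"},
    {"question": "How many days are in a week?", "must_contain": "seven"},
    {"question": "What colour is the sky on a clear day?", "must_contain": "blue"},
]

LATENCY_BUDGET_S = 2.0  # hypothetical latency budget per model call


def run_harness(model, cases, latency_budget_s):
    """Score every case for correctness and check it against the latency budget."""
    results = []
    for case in cases:
        start = time.perf_counter()
        answer = model(case["question"])
        elapsed = time.perf_counter() - start
        results.append({
            "question": case["question"],
            "correct": case["must_contain"] in answer.lower(),
            "within_budget": elapsed <= latency_budget_s,
        })
    return results


results = run_harness(fake_model, GOLDEN_SET, LATENCY_BUDGET_S)
accuracy = sum(r["correct"] for r in results) / len(results)
print(f"accuracy: {accuracy:.2f}")
for r in results:
    if not r["correct"]:
        print("FAILED:", r["question"])
```

Running this on every pull request and comparing `accuracy` with the previous run is one simple way to catch a regression after a prompt change.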

Hallucination Monitoring

Hallucination in LLMs refers to the model generating confident but factually incorrect or entirely fabricated information. In production, teams implement monitoring to detect and alert on hallucinations.

  • Hallucination rate — the percentage of responses that contain factually incorrect or unsupported claims.
  • Guardrail (also written “guard rail”) — a rule or classifier applied to model output to catch unsafe or incorrect responses before they reach users.
  • Output validation — programmatic checks applied to model responses to verify format, completeness, or factual consistency.
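
The sketch below shows how a hallucination rate might be computed from logged responses. It assumes a deliberately crude validation rule (word overlap between the answer and the retrieved context); `is_supported`, the 0.6 overlap cut-off, the 5% alert threshold, and the logged examples are all hypothetical. Production systems generally rely on trained classifiers or model-based graders instead.

```python
import re

ALERT_THRESHOLD = 0.05  # hypothetical: alert when more than 5% of responses are flagged


def tokens(text: str) -> set[str]:
    """Lowercase word set used for the overlap check."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def is_supported(answer: str, context: str, min_overlap: float = 0.6) -> bool:
    """Crude output validation: share of answer words that also occur in the context."""
    answer_tokens = tokens(answer)
    if not answer_tokens:
        return False
    overlap = len(answer_tokens & tokens(context)) / len(answer_tokens)
    return overlap >= min_overlap


# Logged (retrieved context, model answer) pairs; invented for illustration.
logged = [
    ("Refunds are processed within five business days.",
     "Refunds are processed within five business days."),
    ("Support is available by email on weekdays.",
     "Support is available by phone at any time."),
    ("Shipping to Europe takes about one week.",
     "Shipping to Europe takes one week."),
    ("The warranty covers parts for two years.",
     "The warranty covers parts for two years."),
]

# A response is flagged when the check finds it unsupported by its context.
flags = [not is_supported(answer, context) for context, answer in logged]
hallucination_rate = sum(flags) / len(flags)
should_alert = hallucination_rate > ALERT_THRESHOLD

print(f"hallucination rate: {hallucination_rate:.0%}")
print("alert:", should_alert)
```

A word-overlap check misses answers that reuse the context's wording but change a key fact, so it is best read as a placeholder for a stronger guardrail.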

Cost Per Token

Running LLMs at scale is expensive. Cost per token is the basic unit of pricing for most hosted LLM APIs: you pay for both input tokens (the prompt) and output tokens (the generated response), and output tokens are typically priced higher than input tokens.

  • Token budget — the maximum number of tokens allocated to a single request or workflow.
  • Context window — the maximum number of tokens a model can process in a single call.
  • Caching — storing previous prompt-response pairs to avoid redundant API calls and reduce cost.
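
The arithmetic behind cost per token and a token budget can be sketched as follows. The prices, token counts, and request volume are invented for illustration and do not reflect any real provider's rates; `Pricing`, `request_cost`, and `within_budget` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical prices in USD per one million tokens."""
    input_per_million: float
    output_per_million: float


def request_cost(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Cost of one call: input and output tokens are billed at separate rates."""
    total = input_tokens * pricing.input_per_million + output_tokens * pricing.output_per_million
    return total / 1_000_000


def within_budget(input_tokens: int, max_output_tokens: int, token_budget: int) -> bool:
    """Token budget check: prompt plus the maximum response must fit the allocation."""
    return input_tokens + max_output_tokens <= token_budget


PRICING = Pricing(input_per_million=3.0, output_per_million=15.0)  # made-up numbers

cost = request_cost(input_tokens=2_000, output_tokens=500, pricing=PRICING)
monthly_cost = cost * 100_000  # assuming 100,000 similar requests per month

print(f"cost per request: ${cost:.4f}")
print(f"monthly cost:     ${monthly_cost:,.2f}")
print("fits budget:", within_budget(2_000, 500, token_budget=4_096))
```

With these made-up rates, a request with 2,000 input tokens and 500 output tokens costs (2,000 × 3 + 500 × 15) / 1,000,000 = 0.0135 USD, so the shorter output still accounts for more than half of the bill.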

Operational Language

Use these phrases in standups, postmortems, and architecture discussions:

  • “The evaluation harness flagged a regression in groundedness after the last prompt update.”
  • “We are tracking hallucination rate as a key reliability metric in our LLM dashboard.”
  • “The retrieval pipeline is the primary latency bottleneck — p99 is currently 1.8 seconds.”
  • “We need to optimise the prompt template to stay within the token budget for long documents.”

Five Example Sentences

  1. “After deploying the new RAG pipeline, the hallucination rate dropped from 12% to 3% on our internal benchmark.”
  2. “We store all prompt templates in a versioned registry so that any team member can roll back to a previous version if a model update degrades quality.”
  3. “The evaluation harness runs automatically on every pull request, scoring each prompt variant against a curated set of golden answers.”
  4. “Our cost-per-token analysis showed that switching to a smaller model for classification tasks reduced monthly spend by 40%.”
  5. “Guard rails are applied at the output layer to ensure the model does not return responses outside the permitted topic scope.”

Staying Current

LLMOps terminology is standardising quickly. Follow the documentation of tools like LangSmith, MLflow, and Weights & Biases to encounter these terms in authentic context. Reading engineering blogs from companies running LLMs at scale — such as Anthropic, OpenAI, and Cohere — is an excellent way to see how these concepts are described in professional English.