Vocabulary for ML Experiment Tracking and MLOps

Master the English vocabulary for ML experiment tracking and MLOps workflows. Learn key terms used in professional ML engineering teams.

Working in machine learning engineering means living in a world of rapid experimentation, version control for data and models, and complex deployment pipelines. If you collaborate with English-speaking teams or read documentation for tools like MLflow, Weights & Biases, or DVC, precise vocabulary is essential. This guide covers the core English terms used in ML experiment tracking and MLOps, with realistic examples of how professionals use them in conversations, code reviews, and design documents.

What Is MLOps and Why Does the Vocabulary Matter?

MLOps (Machine Learning Operations) refers to the set of practices that combine machine learning, DevOps, and data engineering to deploy and maintain ML systems reliably in production. The field has its own dense vocabulary that blends traditional software engineering terms with statistical and data-science terminology.

When a team member says “we need to instrument this run,” they are not talking about musical instruments — they mean adding tracking code to a training script. Understanding these terms helps you participate confidently in design reviews, stand-ups, and architecture discussions.

Key Vocabulary: Experiment Tracking

Run and Experiment

In most tracking frameworks, an experiment is a named collection of runs. A run is a single execution of a training script with a specific set of parameters.

“I kicked off three runs this morning with different learning rates. Check the experiment dashboard and compare the validation loss curves.”

A run ID uniquely identifies each execution, so anyone on the team can look up its logged parameters, metrics, and artifacts. When someone says “can you share the run ID for that baseline?” they want to inspect, and if necessary reproduce, exactly what you ran.

Parameters, Metrics, and Artifacts

These three concepts form the backbone of experiment tracking:

  • Parameters (here meaning hyperparameters and other configuration values, not learned model weights) are the inputs to a training run, such as learning rate, batch size, and number of epochs.
  • Metrics are the outputs you measure during training and evaluation, such as accuracy, F1 score, perplexity, and BLEU score.
  • Artifacts are files produced by the run, such as model checkpoints, evaluation plots, confusion matrices, and exported ONNX files.

“Log the precision and recall as metrics, and upload the serialised model as an artifact at the end of each epoch.”
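The three concepts map naturally onto a small data structure. The sketch below is a minimal in-memory tracker written only for illustration. The Run class and its log_param, log_metric, and log_artifact methods are hypothetical names, not the API of MLflow or any other tool, and the checkpoint paths and metric values are made up.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Run:
    """One execution of a training script (hypothetical structure)."""
    experiment: str
    params: dict = field(default_factory=dict)     # inputs, e.g. learning rate
    metrics: dict = field(default_factory=dict)    # metric name -> list of (step, value)
    artifacts: list = field(default_factory=list)  # paths of files the run produced
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex)

    def log_param(self, name, value):
        self.params[name] = value

    def log_metric(self, name, value, step):
        # Keep the full history so curves can be compared across runs.
        self.metrics.setdefault(name, []).append((step, value))

    def log_artifact(self, path):
        self.artifacts.append(path)


run = Run(experiment="fraud-detection-v2")
run.log_param("learning_rate", 0.001)
run.log_param("batch_size", 64)

# Made-up (precision, recall) values for two epochs.
for epoch, (precision, recall) in enumerate([(0.80, 0.70), (0.85, 0.74)]):
    run.log_metric("precision", precision, step=epoch)
    run.log_metric("recall", recall, step=epoch)
    run.log_artifact(f"checkpoints/epoch_{epoch}.pkl")  # illustrative path only
```

Real tracking tools persist this information to a server or file store. The point here is only the vocabulary: one run, identified by run_id, holding its parameters, metric history, and artifact list.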

Model Registry

A model registry is a centralised store where trained models are catalogued, versioned, and assigned a lifecycle stage.

Common stages include Staging, Production, and Archived. Teams promote a model from one stage to the next after validation.

“The QA team signed off on the challenger model. I’ll promote it to Production in the registry and deprecate the current champion.”

The terms champion and challenger describe the currently deployed model versus a candidate replacement — a widely used pattern in A/B testing for ML models.
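To make the promotion workflow concrete, here is a rough sketch of a registry that archives the current champion when a challenger is promoted. The ModelRegistry class, its method names, and the model name "fraud-detector" are assumptions made for this example. Real registries add persistence, permissions, and audit history.

```python
class ModelRegistry:
    """In-memory registry sketch: maps (name, version) to a lifecycle stage."""

    def __init__(self):
        self._stages = {}

    def register(self, name, version):
        # Newly registered versions start in Staging in this sketch.
        self._stages[(name, version)] = "Staging"

    def promote_to_production(self, name, version):
        # Archive the current champion (if any) before promoting the challenger.
        for (other_name, other_version), stage in list(self._stages.items()):
            if other_name == name and stage == "Production":
                self._stages[(other_name, other_version)] = "Archived"
        self._stages[(name, version)] = "Production"

    def stage_of(self, name, version):
        return self._stages[(name, version)]


registry = ModelRegistry()
registry.register("fraud-detector", 1)
registry.promote_to_production("fraud-detector", 1)  # version 1 becomes the champion

registry.register("fraud-detector", 2)               # version 2 is the challenger
registry.promote_to_production("fraud-detector", 2)  # challenger replaces champion
```

After the second promotion, version 1 is Archived and version 2 is in Production, which is the state the example sentence above describes.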

Key Vocabulary: Data and Pipeline Versioning

Data Version Control

DVC (Data Version Control) and similar tools treat datasets like source code — every version is tracked and reproducible. Key phrases include:

  • dvc push / pull — synchronise data with remote storage
  • data pipeline — the sequence of transformations applied to raw data before training
  • feature store — a centralised repository of computed features shared across models

“Before you start training, pull the latest version of the dataset from DVC. We updated the feature engineering pipeline last week and the schema changed.”
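One way to picture what "a version of the dataset" means is a fingerprint computed from the data's contents: if the contents change, the fingerprint changes. The sketch below shows that general idea using a SHA-256 hash. It is not a description of how DVC is implemented, and dataset_fingerprint is a hypothetical helper.

```python
import hashlib


def dataset_fingerprint(data: bytes) -> str:
    """Return a hex digest that changes whenever the dataset contents change."""
    return hashlib.sha256(data).hexdigest()


# Two tiny made-up CSV payloads; the second has an extra column (a schema change).
v1 = dataset_fingerprint(b"id,amount\n1,10.0\n2,25.5\n")
v2 = dataset_fingerprint(b"id,amount,currency\n1,10.0,EUR\n2,25.5,EUR\n")

print(v1 == v2)  # the fingerprints differ because the contents differ
```

Recording a fingerprint like this alongside a training run is one simple way to state exactly which data the model saw.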

Data Lineage and Provenance

Data lineage describes how data moves and is transformed on its way from source to final output. Data provenance is the documented record of a dataset’s origin and processing history. The two overlap and are often used interchangeably; both come up heavily in compliance and reproducibility discussions.

“The audit team is asking us to demonstrate data lineage for the fraud detection model. We need to trace every transformation from raw transaction logs to the final feature matrix.”
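Tracing lineage amounts to walking backwards from a final dataset through each recorded transformation. The following is a simplified sketch that assumes a strictly linear pipeline with no cycles. The LineageStep record, the trace function, and the step and dataset names are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineageStep:
    name: str    # what the transformation does
    source: str  # dataset it read
    output: str  # dataset it wrote


def trace(target, steps):
    """Walk backwards from `target` to the raw source; return step names in run order."""
    by_output = {step.output: step for step in steps}
    path = []
    while target in by_output:
        step = by_output[target]
        path.append(step.name)
        target = step.source
    return list(reversed(path))


steps = [
    LineageStep("deduplicate", "raw_transaction_logs", "clean_transactions"),
    LineageStep("aggregate_per_account", "clean_transactions", "account_features"),
    LineageStep("scale_and_encode", "account_features", "feature_matrix"),
]

print(trace("feature_matrix", steps))
# ['deduplicate', 'aggregate_per_account', 'scale_and_encode']
```

Production lineage systems handle branching graphs and record much more metadata, but the question they answer is the same: which steps produced this dataset, and from what?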

Key Vocabulary: Model Deployment and Serving

Model Serving and Inference

A deployed model is accessed through a serving endpoint or inference server. Key vocabulary includes:

  • online inference (also real-time inference) — predictions made on individual requests with low latency
  • batch inference — predictions run on a large dataset at scheduled intervals
  • model latency — the time from request to prediction response
  • throughput — the number of predictions per second the system can handle

“The product team wants sub-100ms latency for the recommendation API. We may need to quantise the model or switch to a lighter architecture.”
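Latency and throughput are easy to confuse, so it can help to see both measured in the same loop. This is a minimal sketch, assuming a sequential caller and a stand-in function named fake_model in place of a real serving endpoint. The measure helper is hypothetical, and real benchmarks would also account for concurrency, warm-up, and network time.

```python
import time


def measure(predict, requests):
    """Return (per-request latencies in seconds, throughput in predictions per second)."""
    latencies = []
    start = time.perf_counter()
    for request in requests:
        t0 = time.perf_counter()
        predict(request)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    return latencies, len(requests) / elapsed


def fake_model(x):
    # Stand-in for a real model call; not a real inference server.
    return x * 2


latencies, throughput = measure(fake_model, list(range(1000)))

# Nearest-rank 95th percentile: the ceil(0.95 * n)-th smallest latency.
rank = (95 * len(latencies) + 99) // 100
p95 = sorted(latencies)[rank - 1]

print(f"p95 latency: {p95 * 1000:.4f} ms, throughput: {throughput:.0f} predictions/s")
```

Teams usually quote latency as a percentile (p95, p99) rather than an average, because a requirement like "sub-100ms" is about the slow requests, not the typical one.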

Drift Detection

Data drift occurs when the statistical distribution of input data in production diverges from the training distribution. Concept drift is when the relationship between inputs and the correct output changes over time.

“We’re seeing feature drift on the age input. Production traffic skews younger than our training set. We should trigger a retraining run.”

Feature drift is data drift observed on a single input feature. Model degradation is the drop in model performance that results when drift is not addressed.

Key Vocabulary: CI/CD for ML

MLOps adapts traditional CI/CD (Continuous Integration / Continuous Delivery or Deployment) to machine learning workflows:

  • CT (Continuous Training) — automatically retraining models when new data arrives or performance drops below a threshold
  • model evaluation gate — an automated check that a new model must pass before it can be promoted to production
  • shadow mode — running a new model in parallel with the production model, logging its predictions without serving them to users

“We’ve set up a model evaluation gate in the pipeline. If the new model’s AUC drops more than two percentage points below the baseline, the deployment is blocked automatically.”
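The gate described in that sentence can be sketched as a single comparison. This example assumes AUC is reported on a 0 to 1 scale, so two percentage points corresponds to 0.02. The function name passes_evaluation_gate and the AUC values are hypothetical.

```python
def passes_evaluation_gate(candidate_auc, baseline_auc, max_drop=0.02):
    """Return True if the candidate's AUC is no more than `max_drop` below the baseline."""
    return (baseline_auc - candidate_auc) <= max_drop


baseline = 0.92

# Drop of 0.005: within tolerance, so the deployment may proceed.
small_drop_ok = passes_evaluation_gate(candidate_auc=0.915, baseline_auc=baseline)

# Drop of 0.04: exceeds the tolerance, so the deployment is blocked.
large_drop_ok = passes_evaluation_gate(candidate_auc=0.88, baseline_auc=baseline)

print(small_drop_ok, large_drop_ok)  # True False
```

In a real pipeline the result of a check like this would decide whether the promotion step runs at all, often alongside other checks such as latency or fairness metrics.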

Practical Phrases for MLOps Meetings

Use these phrases in daily stand-ups, sprint reviews, and design discussions:

  • “The experiment is tracked in MLflow under the fraud-detection-v2 experiment.”
  • “We need to pin the dataset version before we cut the release branch.”
  • “The model passed the evaluation gate but I want a human review before we promote it to Production.”
  • “We’re seeing significant concept drift — the model’s precision on new account fraud has dropped from 0.91 to 0.78 over the last two weeks.”
  • “Let’s set up a retraining trigger based on the rolling PSI score on the income feature.”

PSI (Population Stability Index) is a statistical measure of how much a variable’s distribution has shifted between two samples, typically the training data and recent production data. It is a common drift metric in financial ML.
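PSI is usually computed by binning both samples with the same bin edges and summing (actual − expected) × ln(actual / expected) over the bins, where expected and actual are the proportions of each sample falling in a bin. The sketch below follows that formula. The psi function, the bin edges, the small eps floor for empty bins, and the income values are all choices made for this example rather than a standard implementation.

```python
import math


def psi(expected, actual, edges, eps=1e-6):
    """Population Stability Index between two samples, using shared bin edges."""

    def proportions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            # Bin index = number of edges that v has reached or passed.
            i = sum(v >= edge for edge in edges)
            counts[i] += 1
        total = len(values)
        # eps avoids log(0) and division by zero for empty bins.
        return [max(c / total, eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Made-up income values (in thousands) for illustration.
training_income = [20, 30, 40, 50, 60, 70, 80, 90]
production_income = [20, 25, 30, 35, 40, 45, 50, 90]
edges = [40, 70]  # three bins: below 40, 40 to below 70, 70 and above

no_shift = psi(training_income, training_income, edges)
shift = psi(training_income, production_income, edges)

print(no_shift, round(shift, 3))  # 0.0 0.448
```

A commonly cited rule of thumb treats values below about 0.1 as little or no shift and values above about 0.25 as a significant shift, though teams typically tune such thresholds for their own features before wiring them to a retraining trigger.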

Summary

Mastering MLOps vocabulary means more than memorising terms. It means understanding the mental models behind them: that models are versioned software, that data has provenance, and that production ML systems require the same operational rigour as any other distributed service. The language professionals use reflects these ideas — and using it precisely signals that you think about machine learning the right way.