English Vocabulary for ML Experiment Tracking

Master the English vocabulary for ML experiment tracking with MLflow, Weights and Biases, and Comet ML used in professional machine learning teams.

Machine learning teams move fast, and experiment tracking is what keeps everyone aligned on what was tried, what worked, and why. If you’re joining an ML team or working with MLOps tooling, you need to understand not just the tools but the vocabulary used when discussing experiments in standups, design reviews, and model handoffs. This post covers the key English terms and phrases for ML experiment tracking conversations.

Key Vocabulary

Run A single execution of a training script with a specific set of parameters. The atomic unit of experiment tracking. Each run logs its own metrics, parameters, and artifacts. Example: “I kicked off ten runs last night with different learning rates — the results are in the experiment dashboard.”

Experiment A named group of related runs, typically organized around a hypothesis or model version. Experiments provide the context for comparing runs. Example: “Let’s create a new experiment for the transformer-based approach so we don’t mix the results with our LSTM runs.”

Artifact Any file produced during a run — a trained model, a dataset, a confusion matrix, a serialized tokenizer. Artifacts are versioned and stored alongside the run metadata. Example: “The best checkpoint from each run is saved as an artifact in the model registry.”

Metric logging The practice of recording performance metrics at each step or epoch during training so they can be visualized and compared across runs. Example: “Make sure you’re logging validation loss at every epoch, not just at the end — it helps us catch overfitting early.”

Hyperparameter sweep A systematic search over a defined space of hyperparameter values (learning rate, batch size, dropout, etc.) to find the combination that performs best. Example: “We ran a hyperparameter sweep over 50 combinations and found that a learning rate of 3e-4 with dropout 0.2 gives the best F1 score.”

Model registry A centralized store where trained models are versioned, tagged, and managed through lifecycle stages (staging, production, archived). Acts as the handoff point between experimentation and deployment. Example: “Once the model passes evaluation, promote it to the model registry and tag it as ‘staging’ for the deployment team.”

Model versioning Tracking distinct iterations of a model — including weights, training data version, and configuration — so previous versions can be compared or rolled back to. Example: “We’re on model version 4.2 in production. Version 4.3 is in staging pending A/B test results.”

Champion/challenger model The champion is the currently deployed production model; the challenger is a new candidate being evaluated against it. Standard terminology for production model comparison. Example: “The challenger model has a 3% higher precision on the holdout set — let’s promote it to challenger in the A/B test.”

Baseline comparison Comparing a new model’s performance against a simple reference model (or the current production model) to determine whether the improvement is meaningful. Example: “Before we go further, let’s establish a baseline comparison against the logistic regression model — if the neural network doesn’t beat it significantly, it may not be worth the complexity.”

Common Phrases and Collocations

“Log the run to [platform]” The standard action phrase for recording experiment results. Example: “Don’t forget to log the run to W&B so the team can review the training curves.”

“Promote to the model registry” The action of moving a successfully evaluated model from experimentation to the registry for deployment. Example: “I’m going to promote this run to the model registry with the tag ‘production-candidate’.”

“Compare runs side by side” Standard workflow in experiment tracking UIs — selecting multiple runs and viewing their metrics in a parallel comparison view. Example: “Can you compare the last three runs side by side in MLflow? I want to see if the data augmentation is actually helping.”

“The experiment didn’t converge” Describes a training run that failed to find a good solution — loss didn’t decrease, or the model didn’t learn. Example: “The experiment didn’t converge — the loss is oscillating. I think the learning rate is too high.”

“We need to reproduce this run” Request to recreate an experiment exactly, using the same data, code version, and hyperparameters — a common requirement before promoting a model. Example: “We need to reproduce this run before we promote it — I want to confirm the metrics aren’t an artifact of a lucky seed.”

Practical Sentences to Practice

  1. “I ran a hyperparameter sweep over 40 combinations using W&B Sweeps — the results are in the ‘bert-finetune-v2’ experiment.”
  2. “The champion model has 87% accuracy; the challenger is at 89% on the same holdout set.”
  3. “Each run logs training loss, validation loss, and F1 per class to Comet ML automatically.”
  4. “Once you’ve registered the model, update the model card with the training data version and evaluation results.”
  5. “The baseline comparison shows our new model beats the previous production model by 4 percentage points on precision.”

Common Mistakes to Avoid

Confusing “run” and “experiment” A run is a single training execution. An experiment is the container for multiple related runs. Saying “I ran three experiments” when you mean “three runs within one experiment” causes confusion in team discussions. Say: “I have three runs in the ‘image-classifier-v3’ experiment.”

Using “accuracy” as the only metric in discussions In professional ML teams, single-metric reporting is a red flag. Always specify which metrics matter for the problem. Instead of: “The model is 95% accurate.” Say: “The model achieves 95% accuracy, but recall on the minority class is only 62% — that’s what we need to improve.”

Neglecting to mention the dataset version A model’s performance is only meaningful relative to the data it was trained and evaluated on. Always tie metrics to a specific dataset version or split. Instead of: “This model performs better.” Say: “This model achieves higher F1 on dataset v3.1, our current evaluation benchmark.”

Summary

ML experiment tracking vocabulary — runs, experiments, artifacts, sweeps, champion/challenger models — is the shared language of professional ML teams. Whether you’re using MLflow, Weights and Biases, or Comet ML, the underlying concepts and the English terms for discussing them are consistent. Master this vocabulary and you’ll be able to participate fully in experiment reviews, model handoffs, and ML design discussions.