English for MLflow Experiment Tracking

Learn the English vocabulary for MLflow: experiments, runs, artifacts, the model registry, and reproducibility in machine learning workflows.

MLflow conversations often mix up “experiment” and “run” — the container and the individual attempt inside it — which makes it hard to tell whether a teammate is asking about one training attempt or a whole line of investigation. Getting this distinction right is the foundation for talking about MLflow clearly.

Key Vocabulary

Experiment — a named collection of related runs, typically grouping all attempts at solving one modeling problem (like “churn-prediction”) so they can be compared side by side. “We’re logging every hyperparameter sweep under the same experiment so we can compare all 40 runs on one leaderboard, instead of hunting through separate, unrelated experiments.”

Run — a single execution of training or evaluation code, logged with its own parameters, metrics, and artifacts, and belonging to exactly one experiment. “This run used a learning rate of 0.01 and reached 92% accuracy; compare that against the run listed right above it, which used 0.001 and did noticeably worse.”
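
The experiment/run relationship can be sketched in code. This is a minimal illustration, not a recommended project layout: it assumes MLflow is installed and a tracking location is configured, and the names sweep_configs, log_sweep, and train_fn are hypothetical helpers invented for this example.

```python
from typing import Callable, Dict, List


def sweep_configs(learning_rates: List[float]) -> List[Dict[str, float]]:
    """Build one parameter dict per planned run."""
    return [{"learning_rate": lr} for lr in learning_rates]


def log_sweep(
    experiment_name: str,
    configs: List[Dict[str, float]],
    train_fn: Callable[..., float],
) -> List[str]:
    """Log one MLflow run per config, all under a single experiment."""
    import mlflow  # imported lazily so sweep_configs works without MLflow

    # The experiment is the container; it is created if it does not exist.
    mlflow.set_experiment(experiment_name)

    run_ids = []
    for params in configs:
        # Each start_run() block is one run: a single attempt.
        with mlflow.start_run() as run:
            mlflow.log_params(params)
            # train_fn is a hypothetical stand-in for real training code
            # that returns an accuracy value.
            mlflow.log_metric("accuracy", train_fn(**params))
            run_ids.append(run.info.run_id)
    return run_ids
```

Calling log_sweep("churn-prediction", sweep_configs([0.01, 0.001]), train_fn=...) with your own training function would produce two runs inside one experiment, which is the situation the example sentence above describes.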

Artifact — any output file logged alongside a run, such as a trained model file, a confusion matrix plot, or a preprocessing pipeline, stored so it can be retrieved later without rerunning the code. “The model artifact from this run is still available, so we can load it directly for inference without retraining anything.”
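
As a rough sketch of what "logging an artifact" and "retrieving it later" mean in practice, the example below writes a small report file, attaches it to a run, and downloads it again by run ID. It assumes MLflow is installed and a tracking location is configured; the function names, the file name, and the "reports" folder are choices made for this illustration only.

```python
import json
from pathlib import Path


def write_report(directory: Path, confusion: list) -> Path:
    """Write a confusion matrix to a JSON file and return its path."""
    path = directory / "confusion_matrix.json"
    path.write_text(json.dumps({"confusion_matrix": confusion}))
    return path


def log_report(report_path: Path) -> str:
    """Attach the report file to a new run and return the run ID."""
    import mlflow  # imported lazily so write_report works without MLflow

    with mlflow.start_run() as run:
        # The file is copied into the run's artifact store under "reports/".
        mlflow.log_artifact(str(report_path), artifact_path="reports")
    return run.info.run_id


def fetch_report(run_id: str) -> dict:
    """Download the stored report for a run without rerunning any training."""
    from mlflow.artifacts import download_artifacts

    local_path = download_artifacts(
        run_id=run_id, artifact_path="reports/confusion_matrix.json"
    )
    return json.loads(Path(local_path).read_text())
```

The point of the sketch is the last function: once the file is an artifact, a teammate only needs the run ID to get it back.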

Model registry — MLflow’s system for versioning models, tracking which stage each version is in (staging, production, archived), and providing a single source of truth for which model version is currently live. “We promoted version 12 to ‘Production’ in the model registry, which is what the serving layer actually reads from — the notebook where it was trained is no longer the source of truth.”
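
The promotion described above might look roughly like the sketch below. It assumes MLflow is installed, a registry-capable tracking backend is configured, and a registered model with the given name and version already exists. The helper names are hypothetical. Note also that recent MLflow releases mark stage transitions as deprecated in favor of model version aliases, so treat this as an illustration of the vocabulary rather than a recommended workflow.

```python
from typing import Iterable, Optional, Tuple


def production_version(versions: Iterable[Tuple[str, str]]) -> Optional[str]:
    """Given (version, stage) pairs, return the newest Production version."""
    live = [version for version, stage in versions if stage == "Production"]
    return max(live, key=int) if live else None


def promote_to_production(name: str, version: str) -> Optional[str]:
    """Move one model version to Production and report what is now live."""
    from mlflow.tracking import MlflowClient  # lazy import, needs MLflow

    client = MlflowClient()
    # Archive whatever was in Production before, so only one version is live.
    client.transition_model_version_stage(
        name=name,
        version=version,
        stage="Production",
        archive_existing_versions=True,
    )
    found = client.search_model_versions(f"name='{name}'")
    return production_version((mv.version, mv.current_stage) for mv in found)
```

A call such as promote_to_production("churn-model", "12") (the model name is made up) would correspond to the sentence "We promoted version 12 to Production."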

Reproducibility — the ability to re-execute a run and get the same result, which MLflow supports by recording the parameters, code version, and environment details alongside metrics. “We couldn’t reproduce last month’s result until we noticed MLflow had logged a different library version for that run: the code hadn’t changed, but a dependency had been updated.”
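
One way to make "logging the environment" concrete is to record the random seed and the installed library versions with each run. The sketch below is one possible approach under stated assumptions, not MLflow's built-in mechanism: the helper names and the "env." tag prefix are invented for this example, and it assumes MLflow is installed.

```python
import platform
from importlib import metadata
from typing import Dict, Iterable


def environment_snapshot(packages: Iterable[str]) -> Dict[str, str]:
    """Collect the Python version and the installed version of each package."""
    snapshot = {"python": platform.python_version()}
    for name in packages:
        try:
            snapshot[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            snapshot[name] = "not installed"
    return snapshot


def log_reproducibility_info(seed: int, packages: Iterable[str]) -> None:
    """Record the seed and environment on the current run.

    Intended to be called inside a `with mlflow.start_run():` block.
    """
    import mlflow  # lazy import so environment_snapshot works without MLflow

    mlflow.log_param("seed", seed)
    tags = {f"env.{key}": value for key, value in environment_snapshot(packages).items()}
    mlflow.set_tags(tags)
```

With tags like these on every run, the dependency change in the example sentence above would show up as a difference between two runs' tags instead of requiring detective work.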

Common Phrases

  • “Is this comparison across different runs in the same experiment, or are we comparing across separate experiments?”
  • “Is the model artifact for this run still available, or do we need to retrain to get it back?”
  • “What stage is this model version in the registry — staging, or is it actually promoted to production?”
  • “Can we reproduce this run’s result from the logged parameters and environment, or is something not being tracked?”
  • “Are we logging this as a new run under the existing experiment, or does this warrant its own experiment entirely?”

Example Sentences

Discussing a reproducibility issue: “We couldn’t reproduce the reported accuracy because the run had logged the model parameters but not the exact library versions — we’ve since added environment logging to every run to prevent this.”

Explaining a deployment decision: “We’re promoting version 15 to production in the model registry since it beat the current production model on every tracked metric across three separate evaluation runs.”

Describing an experiment structure in a review: “All of the fine-tuning attempts for this task live under one experiment so we can sort by validation loss and immediately see which run actually won, rather than digging through scattered notebooks.”
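
"Sort by validation loss and see which run won" can also be done programmatically. The sketch below assumes MLflow is installed, the runs logged a metric named val_loss (a name chosen for this example), and the experiment name you pass in exists; order_clause and best_run_id are hypothetical helpers.

```python
from typing import Optional


def order_clause(metric: str, ascending: bool = True) -> str:
    """Build an order_by expression for a logged metric."""
    direction = "ASC" if ascending else "DESC"
    return f"metrics.{metric} {direction}"


def best_run_id(experiment_name: str, metric: str = "val_loss") -> Optional[str]:
    """Return the ID of the run with the lowest value of the given metric."""
    import mlflow  # lazy import so order_clause works without MLflow

    runs = mlflow.search_runs(
        experiment_names=[experiment_name],
        order_by=[order_clause(metric, ascending=True)],
        max_results=1,
        output_format="list",  # Run objects instead of a pandas DataFrame
    )
    return runs[0].info.run_id if runs else None
```

Because every fine-tuning attempt lives under one experiment, a single query is enough; with runs scattered across experiments, the same question would need several queries and a manual merge.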

Professional Tips

  • Distinguish experiment from run consistently — saying “check the experiment” when you mean “check this specific run” sends a teammate looking in the wrong place.
  • Reference the model registry stage explicitly when discussing what’s actually live — a model existing in MLflow doesn’t mean it’s the one currently serving production traffic.
  • Say “artifact” specifically when asking for a stored output; it tells teammates the file was logged with a run and can be downloaded from it, rather than regenerated by a re-run.
  • Raise reproducibility gaps as a concrete finding, naming what wasn’t logged (environment, data version, seed) rather than describing results as vaguely “not reproducible.”

Practice Exercise

  1. Explain the difference between an experiment and a run.
  2. Describe what the model registry is used for and why it matters for deployment.
  3. Write a sentence explaining what needs to be logged for a run to be reproducible.