Vocabulary for Machine Learning Engineers

The essential English vocabulary for machine learning engineers — model training, evaluation metrics, MLOps, and deployment terms explained with examples.

Machine learning has its own dense vocabulary — combining statistics, software engineering, and data science. Whether you are discussing a model’s performance in a meeting, writing a technical report, or preparing for a machine learning interview, knowing the right terms in English is essential.


Model Training Vocabulary

| Term | Definition | Example sentence |
| --- | --- | --- |
| training data | Data used to teach the model | “We trained the model on 500,000 labelled examples.” |
| validation data | Held-out data used to tune hyperparameters and monitor training | “We monitor loss on the validation set to detect overfitting.” |
| test data | Held-out data used only for final evaluation | “We evaluate on the test set only once, after training is complete.” |
| epoch | One full pass through the training data | “We trained for 50 epochs with early stopping.” |
| batch size | Number of samples processed in one training step | “We use a batch size of 32; larger batches tend to generalise worse on this dataset.” |
| learning rate | Step size that controls how much the weights change at each update | “We started with a learning rate of 0.001 and used decay after 20 epochs.” |
| overfitting | Model performs well on training data but poorly on new data | “The training accuracy was 98% but validation accuracy was 74%, a clear sign of overfitting.” |
| underfitting | Model is too simple to capture the patterns in the data | “The model underfits; we need a more expressive architecture.” |
| regularisation | Techniques to prevent overfitting | “We applied L2 regularisation and dropout to reduce overfitting.” |
| fine-tuning | Adapting a pre-trained model to a specific task | “We fine-tuned BERT on our domain-specific corpus.” |
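
To make the terms training data, validation data, test data, epoch, and batch size concrete, here is a minimal sketch in plain Python. The function names `split_dataset` and `iter_batches`, the 80/10/10 split, and the toy dataset of 1,000 integers are assumptions chosen for illustration, not a prescribed workflow.

```python
import random

def split_dataset(examples, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle once, then cut into training, validation, and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

def iter_batches(data, batch_size):
    """Yield consecutive mini-batches; the last one may be smaller."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

# Toy dataset (assumption): 1,000 examples represented by integers.
train, val, test = split_dataset(range(1000))

# One epoch is one full pass over the training set, i.e. this many steps.
steps_per_epoch = sum(1 for _ in iter_batches(train, batch_size=32))
```

With 800 training examples and a batch size of 32, one epoch takes 25 training steps. The test set is split off first and never touched during training or tuning.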

Evaluation Metrics Vocabulary

“Accuracy alone is misleading here because the dataset is highly imbalanced. We should prioritise precision and recall.”

| Metric | Definition | When to use it |
| --- | --- | --- |
| accuracy | Fraction of correct predictions | When classes are balanced |
| precision | Of all positive predictions, the fraction that were correct | When false positives are costly |
| recall | Of all actual positives, the fraction that were found | When false negatives are costly |
| F1 score | Harmonic mean of precision and recall | When you need a balance of both |
| AUC-ROC | Area under the ROC curve; measures the model’s ability to distinguish classes | Binary classification problems |
| loss | The error signal used during training | All supervised learning tasks |
| RMSE | Root Mean Square Error | Regression tasks, e.g. “The model achieved an RMSE of 4.2 on the test set.” |
| MAE | Mean Absolute Error | Regression; easier to interpret than RMSE |
| perplexity | Measure of how well a language model predicts text (lower is better) | NLP tasks |
| BLEU score | N-gram overlap between generated and reference text | Machine translation, summarisation |
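
The definitions in the table translate directly into a few lines of arithmetic. The sketch below is a plain-Python illustration, not a replacement for a metrics library. The helper names and the toy labels are assumptions, and returning 0.0 when a denominator is zero is one common convention among several.

```python
import math

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    # Convention (assumption): report 0.0 when a ratio is undefined.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

def rmse(y_true, y_pred):
    """Root mean square error for regression targets."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error for regression targets."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Imbalanced toy data: 95 negatives, 5 positives.
# The "model" predicts negative for every example.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100
metrics = classification_metrics(y_true, y_pred)
```

On this toy data the accuracy is 95% while recall is zero, which is the situation the quote at the top of this section describes: on an imbalanced dataset, accuracy alone hides the fact that no positive case was found.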

Model Architecture Vocabulary

| Term | Definition |
| --- | --- |
| neural network | A model inspired by the structure of the brain, made of layers of connected units |
| layer | A building block of a neural network (e.g., convolutional layer, attention layer) |
| parameter | A value learned during training, such as a weight or bias |
| hyperparameter | A setting configured before training (e.g., learning rate, depth, batch size) |
| transformer | The architecture underlying modern LLMs; introduced in “Attention Is All You Need” (2017) |
| attention mechanism | A component that allows the model to focus on relevant parts of the input |
| embedding | A dense vector representation of a piece of data (word, image patch, etc.) |
| gradient descent | The optimisation algorithm used to train most neural networks; it repeatedly adjusts the weights in the direction that reduces the loss |
| backpropagation | The algorithm that computes gradients of the loss by propagating them backward through the network, so the weights can be updated |
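
Gradient descent, parameter, and hyperparameter are easier to tell apart in a tiny example. The sketch below fits a straight line to five points using hand-written gradients of the mean squared error. The function name `fit_line`, the data, and the chosen learning rate are assumptions for illustration. A real neural network would compute the gradients with backpropagation rather than by hand.

```python
def fit_line(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y = w * x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0  # parameters: learned from the data
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Step against the gradient; the learning rate sets the step size.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Toy data generated from y = 2x + 1 (assumption).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

# learning_rate and epochs are hyperparameters: set before training.
w, b = fit_line(xs, ys, learning_rate=0.05, epochs=2000)
```

Here `w` and `b` are parameters, while `learning_rate` and `epochs` are hyperparameters. After training, `w` and `b` end up very close to 2 and 1.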

MLOps Vocabulary

“We use an MLflow tracking server to log experiments and compare runs. Successful models are promoted to the model registry and deployed via a blue-green deployment strategy.”

| Term | Definition |
| --- | --- |
| MLOps | DevOps practices applied to machine learning: training, deployment, and monitoring |
| experiment tracking | Recording parameters, metrics, and artefacts from training runs |
| model registry | A central store for versioned, validated models |
| feature store | A centralised repository of engineered features for training and inference |
| data drift | When the distribution of production data diverges from that of the training data |
| model degradation | Decline in model performance over time |
| pipeline | A sequence of automated steps from raw data to deployed model |
| inference | Using a trained model to make predictions on new data |
| latency | Time taken to return a prediction |
| throughput | Number of predictions the model can make per second |
| shadow mode | Running a new model alongside the current one on live traffic, without serving its predictions, to compare the two before switching |
| A/B testing | Splitting traffic between two models to compare performance in production |

Phrases for ML Discussions

Describing model performance:

“The model achieves 89% accuracy on the test set, which is a 12-point improvement over our baseline.”

“The precision is high, but recall is low — the model is conservative in making positive predictions.”
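
What “conservative” means here can be shown by moving the decision threshold. This is a minimal sketch with made-up scores and labels; the function name `precision_recall` and both threshold values are assumptions for illustration.

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall when scores at or above the threshold count as positive."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up labels and model scores (assumption).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.95, 0.80, 0.55, 0.40, 0.60, 0.30, 0.20, 0.10]

strict = precision_recall(y_true, scores, threshold=0.9)    # (1.0, 0.25)
lenient = precision_recall(y_true, scores, threshold=0.35)  # (0.8, 1.0)
```

With the strict threshold, the model makes a single positive prediction and it is correct, so precision is 100% but recall is only 25%. Lowering the threshold finds every positive at the cost of one false positive.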

Describing training decisions:

“We experimented with three architectures. The transformer-based model outperformed the others on validation, so we proceeded with that.”

“We stopped training at epoch 40 based on early stopping — validation loss had plateaued for five epochs.”
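
The early-stopping rule in that sentence can be sketched as a patience counter. The function name `early_stopping_epoch`, the `min_delta` parameter, and the loss values are assumptions made up for illustration. Training frameworks ship their own early-stopping utilities, and their exact rules may differ.

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=0.0):
    """Return the 1-based epoch at which training stops, or None if it never does."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            # New best validation loss: reset the patience counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Hypothetical validation losses: steady improvement for 35 epochs,
# then no further improvement.
val_losses = [1.0 - 0.02 * i for i in range(35)] + [0.35] * 10
stop_epoch = early_stopping_epoch(val_losses, patience=5)
```

In this example the best loss occurs at epoch 35, the next five epochs bring no improvement, and `stop_epoch` is 40.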

Describing deployment:

“The model is served via a REST API with a P99 latency of 45ms.”
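
P99 latency is the value that 99% of requests stay at or below. A minimal sketch of how it could be computed from recorded latencies follows. It uses the nearest-rank definition of a percentile; monitoring tools may use other interpolation rules, and the sample latencies are made up.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Made-up request latencies in milliseconds (assumption):
# 98 fast requests, one slower request, and one outlier.
latencies_ms = [20] * 98 + [45, 300]

p50 = percentile(latencies_ms, 50)  # median latency
p99 = percentile(latencies_ms, 99)  # tail latency
```

Here the median is 20 ms and the P99 is 45 ms. The single 300 ms outlier falls in the slowest 1% of requests, so it does not affect the P99.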

“We’re monitoring for data drift on a weekly cadence. If the drift score exceeds the threshold, we trigger a retraining pipeline.”
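
One way to turn “drift score exceeds the threshold” into code is the Population Stability Index (PSI), which compares binned distributions of a feature. The sentence above does not say which drift score the team uses, so PSI, the 0.2 threshold, the bin edges, and the sample data below are all assumptions for illustration.

```python
import math

def bin_fractions(values, edges, eps=1e-6):
    """Fraction of values per bin; empty bins get eps to avoid log(0).

    Values outside the edges are not counted in any bin.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            # Bins are half-open, except the last one, which includes its upper edge.
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    return [max(c / len(values), eps) for c in counts]

def psi(reference, production, edges):
    """Population Stability Index between two samples of one feature."""
    ref = bin_fractions(reference, edges)
    prod = bin_fractions(production, edges)
    return sum((p - r) * math.log(p / r) for r, p in zip(ref, prod))

DRIFT_THRESHOLD = 0.2  # assumed rule-of-thumb value, tune per feature

def should_retrain(reference, production, edges, threshold=DRIFT_THRESHOLD):
    """True when the drift score exceeds the threshold."""
    return psi(reference, production, edges) > threshold

# Made-up feature values (assumption): ages seen in training vs. production.
edges = [0, 25, 50, 75, 100]
training_ages = [10, 20, 30, 40, 50, 60, 70, 80] * 10
same_week = list(training_ages)
shifted_week = [60, 70, 80, 90] * 20
```

Calling `should_retrain(training_ages, same_week, edges)` returns `False`, because the two distributions are identical and the PSI is zero. For `shifted_week` it returns `True`, which in a real setup would be the signal to trigger the retraining pipeline.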


This vocabulary will help you participate confidently in ML team meetings, write clear model documentation, and communicate results to stakeholders who need to understand what the numbers actually mean.