Interview English for ML Engineers: Discussing Training, Evaluation and Deployment
Master the English vocabulary and phrases for ML engineering interviews: explaining model training, evaluation metrics, deployment pipelines, and trade-offs.
Machine learning engineering interviews test both technical depth and the ability to communicate complex ideas clearly. For non-native English speakers, the challenge is twofold: you need to know the concepts and explain them fluently under pressure. This guide gives you the language for both.
Talking About Model Training
Explaining the training process
- “I trained the model on a dataset of 2 million labelled examples using supervised learning.”
- “We fine-tuned a pre-trained BERT model on our domain-specific corpus.”
- “The model converged after 50 epochs, with validation loss stabilising around 0.23.”
- “We used early stopping to prevent overfitting — training halts when validation loss stops improving.”
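If an interviewer asks you to show what "training halts when validation loss stops improving" means, a minimal sketch could look like the one below. The function name, the `patience` and `min_delta` parameters, and the loss values are illustrative assumptions, not part of any particular framework.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the 1-based epoch at which training would halt, or None.

    Training halts once the validation loss has failed to improve on the
    best value seen so far for `patience` consecutive epochs.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            # New best validation loss: reset the counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Hypothetical validation losses, one value per epoch.
losses = [0.90, 0.60, 0.40, 0.30, 0.25, 0.23, 0.24, 0.23, 0.25]
stop_epoch = early_stopping_epoch(losses, patience=3)
print(stop_epoch)  # 9: the best loss (0.23) was reached at epoch 6
```

A phrase to go with it: “We used a patience of three epochs, so training stopped three epochs after the best validation loss.”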
Describing hyperparameter tuning
- “I ran a grid search over learning rate and batch size.”
- “We used Bayesian optimisation to find the optimal hyperparameter configuration.”
- “The learning rate was decayed using a cosine annealing schedule.”
- “We ablated each feature to understand its contribution to the final score.”
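To explain a cosine annealing schedule precisely, it helps to know the formula behind the phrase. The sketch below assumes a peak learning rate of 1e-4 and a floor of 0; both values and the function name are illustrative.

```python
import math


def cosine_annealing_lr(step, total_steps, lr_max=1e-4, lr_min=0.0):
    """Learning rate at `step`, decaying from lr_max to lr_min along a half cosine."""
    progress = min(step, total_steps) / total_steps  # goes from 0.0 to 1.0
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))


lr_start = cosine_annealing_lr(0, total_steps=1000)     # equals lr_max
lr_middle = cosine_annealing_lr(500, total_steps=1000)  # about half of lr_max
lr_end = cosine_annealing_lr(1000, total_steps=1000)    # equals lr_min
```

In words: “The learning rate starts at its maximum, falls slowly at first, then faster, and flattens out again as it approaches the minimum.”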
Handling interviewer follow-ups
Interviewer: “Why did you choose that learning rate?”
“That’s a good question. I started with 1e-4 as a baseline — that is a common starting point for fine-tuning transformers — and then used a warmup schedule for the first 10% of training to stabilise the loss before the full learning rate kicked in.”
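The warmup described in this answer can be sketched as a linear ramp over the first 10% of steps. This is one common form of warmup, assumed here for illustration; the function and parameter names are hypothetical.

```python
def warmup_lr(step, total_steps, base_lr=1e-4, warmup_fraction=0.1):
    """Linearly ramp the learning rate up to base_lr, then hold it constant."""
    warmup_steps = max(1, round(total_steps * warmup_fraction))
    if step < warmup_steps:
        # Ramp from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    return base_lr


lr_first_step = warmup_lr(0, total_steps=1000)    # a small fraction of base_lr
lr_end_warmup = warmup_lr(99, total_steps=1000)   # full base_lr is reached
lr_after = warmup_lr(500, total_steps=1000)       # stays at base_lr
```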
Discussing Evaluation
Choosing the right metric
A common interview question is: “How did you evaluate your model?”
“The choice of metric depends on the problem type. For this binary classification task with a class imbalance, accuracy would have been misleading — so we used F1 score, which balances precision and recall. We also tracked AUC-ROC to evaluate performance across all classification thresholds.”
Metric vocabulary
| Metric | When to use it | How to explain it |
|---|---|---|
| Accuracy | Balanced classes, general overview | “The fraction of correct predictions.” |
| Precision | Cost of false positives is high | “Of all predicted positives, how many are correct?” |
| Recall | Cost of false negatives is high | “Of all actual positives, how many did we catch?” |
| F1 score | Imbalanced classes | “Harmonic mean of precision and recall.” |
| AUC-ROC | Threshold-independent comparison | “Area under the ROC curve: 0.5 is random, 1.0 is perfect.” |
| RMSE | Regression tasks | “Root mean squared error, which penalises large errors.” |
| BLEU / ROUGE | Text generation | “N-gram overlap between generated and reference text.” |
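A short worked example makes the claim "accuracy would have been misleading" concrete. The sketch below computes the first four metrics in the table from confusion-matrix counts; the counts describe a hypothetical test set of 1,000 examples with only 20 positives.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# A model that predicts "negative" for every example.
always_negative = classification_metrics(tp=0, fp=0, fn=20, tn=980)
print(always_negative["accuracy"])  # 0.98
print(always_negative["f1"])        # 0.0

# A model that finds half of the positives, with as many false alarms.
modest_model = classification_metrics(tp=10, fp=10, fn=10, tn=970)
print(modest_model["accuracy"])     # 0.98
print(modest_model["f1"])           # 0.5
```

Both models score 98% accuracy, but only F1 shows that the first one never finds a positive. A useful sentence: “Accuracy could not distinguish the two models, so we reported F1.”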
Explaining overfitting and underfitting
- “The model overfit to the training data — it memorised the examples rather than generalising.”
- “We detected overfitting by observing a large gap between training and validation loss.”
- “The model underfit: it was too simple to capture the signal in the data.”
- “We regularised the model using dropout and L2 weight decay.”
Talking About Deployment
Describing the deployment pipeline
- “We containerised the model using Docker and deployed it behind a REST API.”
- “The model is served using TorchServe with autoscaling on Kubernetes.”
- “We use shadow mode to test the new model in production without affecting users.”
- “We roll out new models with a canary release, a gradual alternative to a full blue/green switch: we shift 10% of traffic, monitor, then ramp up.”
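The gradual traffic shift in the last phrase (10% first, then ramp up) is often called a canary rollout. One way to picture it is a deterministic split on a hash of the user ID, so that each user always sees the same model version. This is a simplified sketch; the function name, the `"candidate"` and `"stable"` labels, and the user ID format are assumptions, and real systems usually do this at the load balancer or service mesh.

```python
import hashlib


def route_request(user_id, canary_percent=10):
    """Return which model version should serve this user.

    The same user_id always lands in the same bucket, so a user does not
    flip between versions from one request to the next.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    return "candidate" if bucket < canary_percent else "stable"


# Roughly 10% of hypothetical users are routed to the new model.
routes = [route_request(f"user-{i}") for i in range(10_000)]
candidate_share = routes.count("candidate") / len(routes)
```

Ramping up means raising `canary_percent` step by step while the monitoring stays healthy.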
Model versioning and registry
- “We track all experiments in MLflow and register the best-performing model to the model registry.”
- “Each model version is tagged with the training dataset version and evaluation results.”
- “We promote a model from staging to production after it passes quality gates.”
Latency and throughput
- “The model’s p99 latency is 45 ms, which is within our SLA.”
- “We batch inference requests to improve GPU utilisation.”
- “We use quantisation to reduce the model size and improve inference speed.”
- “The model was distilled from a larger teacher model to reduce serving cost.”
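Interviewers sometimes ask what "p99" actually means. The sketch below uses the nearest-rank definition of a percentile, which is one of several conventions; the latency values are invented so that the p99 matches the 45 ms in the phrase above.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value that covers pct% of the samples."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in milliseconds for 100 requests.
latencies_ms = [20] * 98 + [45, 300]

p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(p50)  # 20
print(p99)  # 45
```

In words: “99% of requests finish within 45 milliseconds; only the slowest 1% take longer.”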
Answering “Tell Me About a Challenging ML Problem”
Structure your answer using the STAR method adapted for ML:
- Situation: “We were building a churn prediction model for a subscription product.”
- Task: “The challenge was severe class imbalance — only 2% of users churned in any given month.”
- Action: “I addressed the imbalance using stratified sampling and SMOTE for oversampling. I also switched from accuracy to AUC-ROC as the primary metric.”
- Result: “The final model achieved an AUC of 0.87, and the business was able to reduce churn by 15% through targeted interventions.”
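If the interviewer follows up on the Action step, be ready to explain stratified sampling concretely. The sketch below splits a dataset so that the 2% churn rate is preserved in both parts. It covers only the stratified split, not SMOTE, and the function name, seed and label values are illustrative assumptions.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split indices into train and test, keeping each class's share the same."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical dataset: 1,000 users, 2% of whom churned (label 1).
labels = [1] * 20 + [0] * 980
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)

churn_rate_test = sum(labels[i] for i in test_idx) / len(test_idx)
churn_rate_train = sum(labels[i] for i in train_idx) / len(train_idx)
print(churn_rate_test)   # 0.02
print(churn_rate_train)  # 0.02
```

A sentence to pair with it: “Without stratification, a random test set could contain almost no churned users, which would make the evaluation unreliable.”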
Common Interview Questions and Vocabulary Patterns
| Question | Key phrase to use |
|---|---|
| “How would you handle missing data?” | “I would impute missing values using [median/KNN/model-based] imputation, depending on the missingness pattern.” |
| “What is the bias-variance trade-off?” | “High bias means the model underfits; high variance means it overfits. We tune model complexity to balance both.” |
| “How do you prevent data leakage?” | “I split the data into train, validation, and test sets before any preprocessing, and I fit every preprocessing step on the training set only.” |
| “What is a confusion matrix?” | “A table showing true positives, false positives, true negatives, and false negatives.” |
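The data leakage answer is easier to defend if you can show what it means in practice. In the sketch below, the scaling statistics are computed from the training values only and then reused for the test values. The helper names and the numbers are hypothetical.

```python
import statistics


def fit_standardiser(train_values):
    """Compute the mean and standard deviation from the training data only."""
    mean = statistics.mean(train_values)
    std = statistics.pstdev(train_values)
    return mean, (std if std > 0 else 1.0)


def standardise(values, mean, std):
    """Apply previously fitted statistics to any split."""
    return [(value - mean) / std for value in values]


# Split first, then fit on the training split.
train = [10.0, 12.0, 14.0]
test = [100.0]  # an unusual value the model has never seen

mean, std = fit_standardiser(train)
train_scaled = standardise(train, mean, std)
test_scaled = standardise(test, mean, std)  # reuses the training statistics
```

If the mean had been computed over `train + test`, information about the test value would have influenced the training features. That is the leak the phrase in the table describes.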
Key Takeaways
- Training vocabulary: trained, fine-tuned, converged, ablated, decayed, warmup schedule.
- Evaluation vocabulary: choose metrics based on the problem — F1 for imbalance, AUC-ROC for threshold-free comparison.
- Deployment vocabulary: served, containerised, shadow mode, blue/green, model registry, quantised, distilled.
- Use the STAR method for behavioural ML questions: Situation, Task, Action, Result.
- Explain your metric choice — interviewers often test whether you know why you chose a metric, not just what it is.