5 exercises covering essential vocabulary for data scientists, ML engineers, and analysts: model evaluation metrics, pipeline terminology, and the concepts you need to discuss AI systems in English.
Core ML vocabulary clusters
Model quality: overfitting, underfitting, generalisation, bias-variance trade-off
Evaluation: accuracy, precision, recall, F1 score, AUC-ROC, confusion matrix
1 / 5
A data scientist explains their work to a colleague. Which sentence correctly describes overfitting?
Overfitting means the model has learned the training data too well — it has memorised the noise and specific examples rather than learning the underlying patterns, so it fails to generalise. Result: excellent training accuracy, poor performance on test/production data. The opposite is underfitting: the model is too simple to capture the patterns in the training data (option A describes this). Option C confuses overfitting with a data volume problem — overfitting is about model complexity relative to the available data, not data size per se. Option D is partially related (too many epochs can cause overfitting) but isn't the definition. Key vocabulary cluster: overfitting / underfitting → generalisation → train/validation/test split → cross-validation → regularisation (L1/L2) → dropout → early stopping. In practice: "The model achieved 98% accuracy on training data but only 72% on the test set — a clear sign of overfitting."
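In code (a minimal sketch; scikit-learn is assumed, as the exercise names no library): an unconstrained decision tree memorises the training set and produces exactly the train/test gap described above.

```python
# Minimal sketch (scikit-learn assumed): an unconstrained decision tree
# memorises the training data, so train accuracy far exceeds test accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier(random_state=0)   # no depth limit: free to memorise noise
model.fit(X_train, y_train)

print(f"train accuracy: {model.score(X_train, y_train):.2f}")  # typically ~1.00
print(f"test accuracy:  {model.score(X_test, y_test):.2f}")    # noticeably lower
```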
2 / 5
A team is evaluating a binary classifier for detecting fraudulent transactions. They note: "Our model flags very few legitimate transactions as fraud." Which metric does this statement describe?
Precision measures how many of the model's positive predictions are actually correct. "Few legitimate transactions flagged as fraud" = few false positives = high precision. Formula: Precision = TP / (TP + FP). Recall (option A) measures how many of the actual positives the model caught — "how many real fraud cases did we find?" Formula: Recall = TP / (TP + FN). The precision/recall trade-off is crucial in fraud detection: high precision means a low false-alarm rate; high recall means catching more real fraud, usually at the cost of more false alarms. F1 score (option C) is the harmonic mean of precision and recall — useful when both matter. Accuracy (option D) = (TP + TN) / total — misleading when classes are imbalanced (e.g., if 99% of transactions are legitimate, a model that always says "not fraud" gets 99% accuracy but 0% recall). Practical example: "We want high precision for the fraud alert system — false alarms cause customer frustration."
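In code (a minimal sketch, again assuming scikit-learn): all four metrics computed from a toy fraud example, where 1 = fraud and 0 = legitimate.

```python
# Minimal sketch (scikit-learn assumed): the metrics discussed above,
# computed from a small imbalanced example (3 fraud cases out of 10).
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             f1_score, precision_score, recall_score)

y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]  # actual labels: 3 fraud cases
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]  # model output: 1 false positive, 1 false negative

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision: {precision_score(y_true, y_pred):.2f}")  # TP / (TP + FP) = 2/3
print(f"recall:    {recall_score(y_true, y_pred):.2f}")     # TP / (TP + FN) = 2/3
print(f"F1:        {f1_score(y_true, y_pred):.2f}")         # harmonic mean of the two
print(f"accuracy:  {accuracy_score(y_true, y_pred):.2f}")   # 8/10: flattering under imbalance
```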
3 / 5
An ML engineer describes a system component: "This is the sequence of automated steps that takes raw data, transforms it, and produces model predictions at scale." What is the correct technical term?
An ML pipeline is an automated, end-to-end workflow that processes data and produces predictions. A typical ML pipeline includes: data ingestion → preprocessing → feature engineering → model training → evaluation → deployment → monitoring. "Pipeline" is used across many IT contexts, but in ML/data engineering it refers specifically to this sequential, automated processing chain. The other terms: Data warehouse (option A) — a structured, query-optimised storage system for historical business data (Snowflake, BigQuery, Redshift). Feature store (option B) — a centralised repository for ML features, enabling reuse and consistency across models (Feast, Tecton, Hopsworks). Data lake (option D) — raw, unstructured or semi-structured data storage at scale (S3, ADLS, GCS). Key related terms: batch pipeline (processes data in chunks), streaming pipeline (processes data in real time), ETL (Extract, Transform, Load), feature engineering (creating input variables from raw data).
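In code (a minimal sketch; scikit-learn is assumed): the preprocessing and training stages of a pipeline expressed as a single object, so the same transformations run automatically at prediction time. Ingestion, deployment, and monitoring sit outside this sketch.

```python
# Minimal sketch (scikit-learn assumed): preprocessing + model as one Pipeline,
# so fit() and predict() run every step in sequence with no manual glue code.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipeline = Pipeline([
    ("preprocess", StandardScaler()),      # transform the raw features
    ("model", LogisticRegression()),       # train on the transformed features
])
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)     # the same scaling is applied automatically
```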
4 / 5
A data scientist says: "We need to tune the learning rate, batch size, and regularisation strength before training." What are these parameters called?
Hyperparameters are settings configured before training begins — they control the learning process itself, not the learned patterns. Examples: learning rate, batch size, number of epochs, regularisation strength (λ), number of layers, dropout rate, kernel size. Model parameters (option A) are learned during training — the weights and biases in a neural network, the coefficients in linear regression. You don't set model parameters manually; the training algorithm finds them. Feature weights (option C) — a common term for model coefficients in linear models, but not the general term for pre-training settings. Training labels (option D) — the ground truth output values (y) in supervised learning. Hyperparameter tuning methods: grid search (try all combinations), random search (sample randomly), Bayesian optimisation, AutoML. Example: "After hyperparameter tuning with a learning rate of 0.001 and batch size 64, validation accuracy improved by 4%."
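In code (a minimal sketch; scikit-learn and SGDClassifier are assumptions, chosen because they expose a learning rate and a regularisation strength; batch size is a deep-learning-framework setting with no direct equivalent here):

```python
# Minimal sketch (scikit-learn assumed): grid search over two of the
# hyperparameters from the question. Values are illustrative only.
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, random_state=0)

param_grid = {
    "eta0": [0.001, 0.01, 0.1],   # learning rate
    "alpha": [1e-4, 1e-2],        # regularisation strength
}
search = GridSearchCV(
    SGDClassifier(learning_rate="constant", random_state=0),
    param_grid,
    cv=5,                          # cross-validate each combination
)
search.fit(X, y)

print(search.best_params_)                  # hyperparameters: set before training
print(search.best_estimator_.coef_[0, :3])  # model parameters: learned during training
```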
5 / 5
A data analyst reads this in a project requirement: "The model must explain which features contributed most to each prediction." Which concept does this requirement describe?
Model explainability (or interpretability) is the ability to understand and communicate why a model made a specific prediction. This is critical in high-stakes domains: healthcare, finance, hiring, and security — where decisions must be auditable and justifiable. Key explainability tools and methods: SHAP (SHapley Additive exPlanations) — assigns each feature a contribution to the prediction. LIME (Local Interpretable Model-agnostic Explanations) — approximates the model locally with a simpler interpretable model. Feature importance — which features most influence predictions globally (XGBoost, Random Forest built-in). Attention weights — in transformer models, which input tokens the model attended to. Feature scaling (option C): normalising/standardising input features (min-max, z-score) — not the same as explaining predictions. Cross-validation (option D): a technique to reliably estimate model performance using multiple train/test splits. In conversation: "The client needs explainability — they won't accept a black box for loan decisions."
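In code (a minimal sketch; scikit-learn is assumed): SHAP and LIME require their own libraries, so this uses permutation importance, a model-agnostic way to measure the global feature importance listed above.

```python
# Minimal sketch (scikit-learn assumed): permutation importance as a simple,
# model-agnostic explainability method. Shuffle one feature at a time; the
# bigger the drop in score, the more the model relied on that feature.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
for i, score in enumerate(result.importances_mean):
    print(f"feature {i}: importance {score:.3f}")
```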