Advanced · 16 terms

Machine Learning Engineering

Training loops, inference, feature engineering, gradient descent, overfitting, A/B testing, and other essential vocabulary for ML engineers.

  • Training Loop /ˈtreɪnɪŋ luːp/

    The iterative process of feeding mini-batches of data through a model, computing loss, backpropagating gradients, and updating weights — repeated for multiple epochs until convergence.

    "The training loop runs for 50 epochs with early stopping — if validation loss hasn't improved for 5 consecutive epochs, training halts to prevent overfitting. We log loss and accuracy to MLflow after each epoch."
  • Inference /ˈɪnfərəns/

    Using a trained model to generate predictions on new, unseen data — as opposed to training. Inference workloads have different constraints: low latency, high throughput, and cost efficiency.

    "Training runs overnight on 8 A100 GPUs. Inference is served by a quantised ONNX model on CPU instances — latency is 45ms p99, cost is 80% lower than GPU serving, and throughput meets the 200 req/s SLO."
  • Feature Engineering /ˈfiːtʃər ˌendʒɪˈnɪərɪŋ/

    The process of transforming raw data into input representations that improve model performance — including normalisation, encoding, aggregation, and domain-specific transformations.

    "Raw timestamps performed poorly as model inputs. Feature engineering transformed them into hour-of-day, day-of-week, is_weekend, and days-since-signup — these temporal features improved click-through-rate prediction by 12%."
  • Gradient Descent /ˈɡreɪdiənt dɪˈsent/

    An optimisation algorithm that iteratively adjusts model weights in the direction of the negative gradient of the loss function, minimising loss over training iterations. Variants: SGD, Adam, AdaGrad.

    "We switched from SGD to Adam — Adam's adaptive learning rates per parameter made training converge 3x faster on our sparse text features. The learning rate schedule decays by 0.1 every 10 epochs."
  • Overfitting /ˌəʊvəˈfɪtɪŋ/

    When a model learns the training data too well — memorising noise and edge cases — resulting in high training accuracy but poor generalisation to unseen data. Detected by a gap between training and validation loss.

    "The model achieved 98% training accuracy but only 72% on validation — a classic overfitting signature. We added L2 regularisation, increased dropout to 0.4, and used data augmentation. Validation accuracy improved to 89%."
  • Underfitting /ˌʌndəˈfɪtɪŋ/

    When a model is too simple to capture the underlying patterns in the data — producing high loss on both training and validation sets. Often caused by insufficient model capacity or too-aggressive regularisation.

    "A linear model underfitted the non-linear customer churn problem — training loss was high and validation loss matched. We switched to a gradient-boosted tree model with higher capacity and the training loss dropped significantly."
  • Cross-Validation /krɒs ˌvælɪˈdeɪʃən/

    A resampling technique for evaluating model generalisation — splitting data into k folds, training on k-1 folds, and evaluating on the held-out fold, then averaging metrics across all k runs.

    "5-fold cross-validation gave us a reliable accuracy estimate of 84.3% ± 1.2% — the ± indicates the model is stable across different data splits. This gave us confidence before committing to the final training run."
  • Hyperparameter Tuning /ˈhaɪpəˌpærəmɪtər ˈtjuːnɪŋ/

    The process of searching for optimal model configuration values (learning rate, batch size, number of layers, regularisation strength) that are set before training begins and not learned from data.

    "We ran 200 hyperparameter tuning trials using Optuna — Bayesian optimisation explored the search space efficiently. Optimal config: lr=3e-4, batch_size=256, dropout=0.3 — validation accuracy improved from 81% to 88%."
  • Precision / Recall / F1 /prɪˈsɪʒən / rɪˈkɔːl / ef wʌn/

    Precision: fraction of positive predictions that are correct. Recall: fraction of actual positives correctly identified. F1: harmonic mean of precision and recall, balancing both metrics.

    "The fraud detection model has precision 0.91 and recall 0.73 — it misses 27% of actual fraud (false negatives) but only 9% of its fraud alerts are false alarms. We tuned the threshold to improve recall at the cost of precision — missing fraud is more costly than investigating false alerts."
  • ROC-AUC /ɑːr əʊ siː ɔːk/

    Receiver Operating Characteristic, Area Under the Curve. The ROC curve plots true positive rate against false positive rate across all classification thresholds; AUC is the area under that curve. AUC = 1.0 is perfect; 0.5 is random. A threshold-independent model quality metric.

    "The model achieved ROC-AUC of 0.93 on the test set — strong discriminative ability across all decision thresholds. We use AUC for model selection because it's robust to class imbalance, unlike accuracy."
  • Feature Store /ˈfiːtʃər stɔː/

    A centralised repository for storing, versioning, and serving precomputed features — ensuring training and serving pipelines use the same feature transformations, preventing training-serving skew.

    "The feature store centralises user_30d_purchase_count and product_avg_rating — the training pipeline and real-time inference API both read from it. Before the feature store, a discrepancy between training and serving features caused a silent 4% accuracy drop."
  • Training-Serving Skew /ˈtreɪnɪŋ ˈsɜːvɪŋ skjuː/

    A discrepancy between how features are computed during training versus serving — a common cause of models that perform well offline but poorly in production.

    "The recommendation model had training AUC of 0.89 but production precision dropped to 0.71. Root cause: training used 7-day rolling average views, serving computed 30-day — training-serving skew. The feature store enforced a single definition for both."
  • A/B Testing (ML context) /eɪ biː ˈtestɪŋ/

    Controlled experiment comparing a baseline model (control) against a new model (treatment) by routing a percentage of live traffic to each — measuring the causal impact on business metrics with statistical significance.

    "Model v2 was rolled out to 20% of traffic in an A/B test. After 14 days we reached statistical significance: treatment group showed +8.3% click-through rate (p=0.001, 95% CI: +6.1% to +10.5%). We ramped to 100%."
  • Data Drift / Concept Drift /ˈdeɪtə drɪft / ˈkɒnsept drɪft/

    Data drift: the distribution of input features shifts over time (e.g., user behaviour changes). Concept drift: the relationship between inputs and outputs changes (e.g., what constitutes fraud evolves). Both degrade model performance silently.

    "The model's precision dropped from 0.87 to 0.74 over three months. PSI analysis detected data drift in the income feature (PSI > 0.25) — post-pandemic income distributions shifted significantly from the training data. We triggered a retraining pipeline."
  • SHAP Values /ʃæp ˈvæljuːz/

    SHapley Additive exPlanations — a game-theory-based method for explaining individual model predictions by computing each feature's contribution to the output. Produces consistent, locally accurate explanations.

    "The loan rejection model now generates SHAP explanations for each decision — the applicant sees that their debt-to-income ratio (+0.34) and recent missed payment (+0.28) were the primary factors. Required for GDPR Art. 22 compliance for automated decisions."
  • MLOps /em el ɒps/

    The practice of applying DevOps principles to ML workflows: automating training pipelines, model versioning, deployment, monitoring, and retraining — enabling reliable, reproducible, and auditable ML systems in production.

    "The MLOps platform automates the full lifecycle: data validation → training → evaluation → model registry → canary deployment → drift monitoring → retraining trigger. A model update that previously took 3 weeks of manual work now runs end-to-end in 6 hours."