Advanced · 6 topic areas · 25+ exercises

ML Infrastructure Engineer

ML Infrastructure Engineers build and maintain the systems that allow data scientists and ML engineers to train, evaluate, and serve models at scale. Their daily English covers writing infrastructure architecture documents, presenting GPU utilization reports, discussing training pipeline bottlenecks, and explaining serving infrastructure choices to ML and product teams. This path builds the vocabulary for discussing model training, serving, and observability infrastructure.

Topics covered

  • GPU & compute infrastructure
  • Training pipeline engineering
  • Model serving
  • Feature stores
  • ML platform reliability
  • Distributed training

Vocabulary spotlight

4 terms every ML Infrastructure Engineer should know in English:

GPU utilization n.

The proportion of a GPU's compute capacity that is actively being used — a key efficiency metric for training infrastructure, as idle GPU time is expensive

"We improved GPU utilization from 45% to 78% by overlapping data loading with forward passes using prefetching."
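To make the example sentence concrete, here is a minimal sketch of prefetching: a background thread loads the next batches while the main loop is busy with the current one. The names `prefetch`, `load_batches`, and `train_step` are hypothetical stand-ins, not part of any real framework, and error handling in the loader thread is left out.

```python
import queue
import threading


def prefetch(batches, buffer_size=2):
    """Yield items from `batches`, loading up to `buffer_size` ahead in a background thread."""
    buffer = queue.Queue(maxsize=buffer_size)
    sentinel = object()  # marks the end of the stream

    def producer():
        for batch in batches:
            buffer.put(batch)  # blocks while the buffer is full
        buffer.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()
        if item is sentinel:
            return
        yield item


def load_batches(n):
    # Hypothetical stand-in for a slow data loader (disk or network reads).
    for i in range(n):
        yield [i, i + 1]


def train_step(batch):
    # Hypothetical stand-in for a forward pass on the GPU.
    return sum(batch)


# Loading of the next batch overlaps with train_step on the current one.
losses = [train_step(batch) for batch in prefetch(load_batches(4))]
```

In a real training job the same idea is usually provided by the framework's data loader rather than written by hand; the sketch only shows why the GPU waits less when loading and compute overlap.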

feature store n.

A centralized repository for machine learning features — enables teams to share, discover, compute, and serve features consistently across training and inference

"By routing all feature computation through the feature store, we eliminated training/serving skew for the recommendation model."

model registry n.

A system that tracks model versions, metadata, evaluation metrics, and deployment history — the source of truth for which model version is in production

"The model registry enforces a sign-off workflow before any model can be tagged for production deployment."
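A sign-off workflow like the one in the example sentence could be sketched as follows. This is an illustration only: `ModelRegistry`, `ModelVersion`, and their methods are hypothetical names, not the API of MLflow or any other real registry.

```python
from dataclasses import dataclass, field


@dataclass
class ModelVersion:
    name: str
    version: int
    metrics: dict
    approved_by: list = field(default_factory=list)
    stage: str = "staging"


class ModelRegistry:
    """Toy in-memory registry that blocks production tagging until sign-off."""

    def __init__(self, required_approvals=1):
        self.required_approvals = required_approvals
        self._versions = {}

    def register(self, name, version, metrics):
        model = ModelVersion(name, version, metrics)
        self._versions[(name, version)] = model
        return model

    def sign_off(self, name, version, reviewer):
        self._versions[(name, version)].approved_by.append(reviewer)

    def promote(self, name, version):
        model = self._versions[(name, version)]
        if len(model.approved_by) < self.required_approvals:
            raise PermissionError("sign-off required before production tagging")
        model.stage = "production"
        return model


registry = ModelRegistry(required_approvals=1)
registry.register("recommender", 3, {"auc": 0.91})  # illustrative values
```

Calling `registry.promote("recommender", 3)` at this point raises `PermissionError`; after `registry.sign_off("recommender", 3, "reviewer-a")` the same call tags the version for production.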

training/serving skew n.

A class of model degradation where the features used during training differ from those computed at inference time, causing the model to underperform in production

"We traced the performance drop to a training/serving skew — the normalization logic differed between the training pipeline and the serving API."
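The kind of mismatch described in the example sentence can be shown with a small sketch. It assumes, purely for illustration, that the training pipeline uses z-score normalization while the serving API uses min-max scaling; the function names and values are hypothetical.

```python
import math


def normalize_training(values):
    # Training pipeline: z-score normalization (population standard deviation).
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


def normalize_serving(values):
    # Serving API: min-max scaling, which silently disagrees with training.
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def max_skew(train_features, serve_features):
    # Largest absolute difference between the two versions of a feature.
    return max(abs(t - s) for t, s in zip(train_features, serve_features))


raw = [10.0, 20.0, 30.0]
skew = max_skew(normalize_training(raw), normalize_serving(raw))
no_skew = max_skew(normalize_training(raw), normalize_training(raw))
```

Here `skew` is well above zero while `no_skew` is exactly zero, which is why sharing one normalization function between training and serving removes this class of problem.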

📚 Vocabulary Reference

Key terms organised by category for ML Infrastructure Engineers:

Training Infrastructure

GPU cluster, distributed training, data parallelism, model parallelism, gradient checkpointing, mixed precision, NCCL, InfiniBand, spot instance, preemption

Serving Infrastructure

inference server, Triton, TensorRT, ONNX, model quantization, batching, dynamic batching, latency SLO, throughput, GPU sharing

MLOps Platform

feature store, model registry, experiment tracking, MLflow, W&B, Kubeflow, Metaflow, pipeline orchestration, data versioning, DVC

Reliability

training/serving skew, data drift, concept drift, model degradation, shadow deployment, canary evaluation, champion/challenger, rollback, A/B evaluation, monitoring

Real-world scenarios you'll practise

  • Writing a GPU infrastructure capacity proposal: justifying a cluster expansion with utilization data, model training projections, and cost-per-experiment analysis
  • Presenting a training/serving skew incident postmortem: explaining root cause, impact, and the monitoring improvements that prevent recurrence
  • Designing a feature store architecture: explaining the trade-offs between online and offline stores, point-in-time correctness, and backfill strategies
  • Writing a model serving infrastructure runbook: documenting scaling policies, rollback procedures, and health check configurations for an inference fleet
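Point-in-time correctness, mentioned in the feature store scenario above, means that each training row only sees feature values that existed at the time of the event. A minimal sketch, assuming a hypothetical feature history stored as sorted `(timestamp, value)` pairs:

```python
from bisect import bisect_right


def point_in_time_lookup(feature_history, event_time):
    """Return the latest feature value recorded at or before `event_time`.

    `feature_history` is a list of (timestamp, value) pairs sorted by timestamp.
    Returns None when no value existed yet, so later values never leak backwards.
    """
    timestamps = [ts for ts, _ in feature_history]
    index = bisect_right(timestamps, event_time)
    if index == 0:
        return None
    return feature_history[index - 1][1]


# Hypothetical history of one feature for one entity.
history = [(100, 0.2), (200, 0.5), (300, 0.9)]

# Each training event is joined with the value that was valid at that moment.
training_rows = [(t, point_in_time_lookup(history, t)) for t in (50, 150, 200, 350)]
# training_rows == [(50, None), (150, 0.2), (200, 0.5), (350, 0.9)]
```

A naive join on the latest value would give every row 0.9, which leaks future information into training; the lookup above avoids that.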
