ML Infrastructure Engineer
ML Infrastructure Engineers build and maintain the systems that allow data scientists and ML engineers to train, evaluate, and serve models at scale. Their daily English covers writing infrastructure architecture documents, presenting GPU utilization reports, discussing training pipeline bottlenecks, and explaining serving infrastructure choices to ML and product teams. This path builds the vocabulary for discussing model training, serving, and observability infrastructure.
Topics covered
- GPU & compute infrastructure
- Training pipeline engineering
- Model serving
- Feature stores
- ML platform reliability
- Distributed training
Vocabulary spotlight
Four terms every ML Infrastructure Engineer should know in English:
GPU utilization: the proportion of a GPU's compute capacity that is actively in use. It is a key efficiency metric for training infrastructure, because idle GPU time is expensive.
"We improved GPU utilization from 45% to 78% by overlapping data loading with forward passes using prefetching."
Feature store: a centralized repository for machine learning features that lets teams share, discover, compute, and serve features consistently across training and inference.
"By routing all feature computation through the feature store, we eliminated training/serving skew for the recommendation model."
Model registry: a system that tracks model versions, metadata, evaluation metrics, and deployment history. It is the source of truth for which model version is in production.
"The model registry enforces a sign-off workflow before any model can be tagged for production deployment."
Training/serving skew: a class of model degradation where the features used during training differ from those computed at inference time, causing the model to underperform in production.
"We traced the performance drop to a training/serving skew — the normalization logic differed between the training pipeline and the serving API."
📚 Vocabulary Reference
Key terms organised by category for ML Infrastructure Engineers:
- Training Infrastructure
- Serving Infrastructure
- MLOps Platform
- Reliability
Real-world scenarios you'll practise
- Writing a GPU infrastructure capacity proposal: justifying a cluster expansion with utilization data, model training projections, and cost-per-experiment analysis
- Presenting a training/serving skew incident postmortem: explaining root cause, impact, and the monitoring improvements that prevent recurrence
- Designing a feature store architecture: explaining the trade-offs between online and offline stores, point-in-time correctness, and backfill strategies
- Writing a model serving infrastructure runbook: documenting scaling policies, rollback procedures, and health check configurations for an inference fleet
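Of the ideas in these scenarios, point-in-time correctness is the one that benefits most from a concrete example: when building a training row for an event at time `as_of`, the feature value must be the latest one known at or before that time, never a later one. The sketch below is a minimal illustration; `point_in_time_value` and the `(timestamp, value)` history layout are assumptions, not the API of any specific feature store.

```python
from bisect import bisect_right


def point_in_time_value(history, as_of):
    """Return the latest feature value with timestamp <= as_of, or None.

    `history` is a list of (timestamp, value) pairs sorted by timestamp.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of)  # number of entries at or before as_of
    return history[idx - 1][1] if idx else None
```

Using `bisect_right` means a value written exactly at `as_of` is included, while anything written later is ignored, which prevents future information from leaking into training data.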