5 exercises — choose the best-structured answer to common ML infrastructure interview questions. Focus on feature pipelines, GPU efficiency, model serving, and drift monitoring.
Structure for ML infrastructure design answers
Separate training from serving: pipelines and SLAs differ significantly
Name components precisely: feature store, model registry, serving runtime, monitoring
Address data and model drift: they are distinct and need separate detection strategies
1 / 5
The interviewer asks: "Design an online feature pipeline for a real-time personalisation system serving 50,000 requests per second." Choose the answer that covers the critical design dimensions.
Option B is strongest: it introduces a three-tier feature classification with a concrete latency budget per tier, names specific technologies (Flink, Spark Structured Streaming, Redis), addresses train-serve skew (the critical interview topic) by having the same job write both the online and offline stores, covers Redis hot-key scaling, and includes feature monitoring. Option D's pre-compute approach fails for online features that depend on the last 5 minutes of activity, because that data does not exist until the activity happens. Option C names the right architecture but lacks implementation depth. Feature pipeline design: three-tier classification → computation tech per tier → serving latency budget → train-serve consistency mechanism → hot-key handling → monitoring.
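The train-serve consistency point can be illustrated with a minimal sketch. It assumes a single feature (event count over the last 5 minutes) and uses in-memory stand-ins for the stores. The class name `ClickCountFeature` and the dict/list "stores" are hypothetical, not a real Flink or Redis API. The idea shown is only that one code path computes the value and writes it to both the serving store and the training log.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 300  # the "last 5 minutes of activity" window


class ClickCountFeature:
    """One feature definition feeding both stores (hypothetical sketch)."""

    def __init__(self):
        self.events = defaultdict(deque)  # user_id -> timestamps inside the window
        self.online_store = {}            # stand-in for Redis: latest value per user
        self.offline_store = []           # stand-in for a training log: (user, ts, value)

    def on_event(self, user_id, ts):
        window = self.events[user_id]
        window.append(ts)
        # Evict events that have fallen out of the 5-minute window.
        while window and window[0] <= ts - WINDOW_SECONDS:
            window.popleft()
        value = len(window)
        # The same computed value goes to both stores, so training data and
        # serving data cannot diverge through separate implementations.
        self.online_store[user_id] = value
        self.offline_store.append((user_id, ts, value))
        return value
```

In a real system the streaming job would play the role of `on_event`, and the offline rows would carry event timestamps so that training can do point-in-time lookups.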
2 / 5
The interviewer asks: "Our GPU cluster is at 45% utilisation. How would you investigate this and what would you do about it?" Choose the most systematic diagnostic answer.
Option D is strongest: it structures the investigation systematically (classify idle vs. compute first), names specific profiling tools (nvidia-smi dmon, PyTorch Profiler, Nsight Compute), identifies six distinct root causes with specific fixes for each, and includes multi-GPU and scheduling dimensions beyond single-GPU optimisation. Option B correctly identifies data loading as the most common cause but assumes the diagnosis without instrumenting first — premature optimisation. Option A also skips to a conclusion. GPU investigation: classify idle vs. compute → data pipeline profiling → memory bandwidth → batch size → multi-GPU overhead → scheduling gaps.
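The "classify idle vs. compute" first step can be sketched as a timing wrapper around a training loop. This is a rough illustration rather than a replacement for the profilers named above. `profile_loop`, `batches`, and `train_step` are hypothetical names, and on a GPU the `train_step` callable would need to block until the device finishes (for example by calling `torch.cuda.synchronize()`) for the split to be meaningful.

```python
import time


def profile_loop(batches, train_step):
    """Split wall-clock time into waiting-for-data and compute (sketch)."""
    wait_s = 0.0
    compute_s = 0.0
    it = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)  # time spent here is the data pipeline
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)     # assumed to block until the step is done
        t2 = time.perf_counter()
        wait_s += t1 - t0
        compute_s += t2 - t1
    total = wait_s + compute_s
    return {
        "wait_s": wait_s,
        "compute_s": compute_s,
        "wait_fraction": wait_s / total if total else 0.0,
    }
```

A high `wait_fraction` points toward the data pipeline; a low one suggests looking at the compute side (memory bandwidth, batch size, multi-GPU overhead) next.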
3 / 5
The interviewer asks: "Compare batch inference, online inference, and streaming inference — when would you use each?" Which answer covers the key design considerations for each pattern?
Option A is strongest: it defines each pattern with latency characteristics and concrete use cases, names specific infrastructure for each, and identifies the unique challenge per pattern — batch (parallelisation), online (cold start + p99), streaming (model versioning across rolling updates — the hardest challenge). Option D names the right technologies but doesn't explain use cases or the unique challenges of each pattern. Inference pattern comparison: latency × use case × infrastructure × unique challenge for each of the three modes.
4 / 5
The interviewer asks: "How do you detect and respond to model drift in production?" Choose the most complete monitoring and response strategy.
Option C is strongest: it distinguishes three drift types with separate detection and response strategies, names specific tests with threshold values (PSI > 0.2, KS p < 0.05), names real tooling (Evidently AI, WhyLogs, Arize), addresses the delayed label problem (concept drift is hard to detect without timely ground truth), defines a tiered response playbook, and specifies canary deployment for the retrained model. Option B is technically correct but shallow — no detection methods, tooling, or tiered response. Drift monitoring: three types → per-type detection tests and thresholds → tooling → concept drift proxy signals → delayed label strategy → tiered response → canary deployment.
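As an illustration of the PSI > 0.2 threshold mentioned above, here is a minimal standard-library sketch of the Population Stability Index for one numeric feature. The binning choice (reference quantiles, 10 bins) and the `eps` floor are assumptions made for this example. Tools such as Evidently AI provide their own implementations, which may differ in detail.

```python
import math


def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index between two samples of one feature (sketch)."""
    ref = sorted(reference)
    # Bin edges at reference quantiles, so each bin holds roughly 1/bins
    # of the reference sample.
    edges = [ref[int(len(ref) * i / bins)] for i in range(1, bins)]

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = sum(v > e for e in edges)  # edges strictly below v = bin index
            counts[idx] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    p_ref = proportions(reference)
    p_cur = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(p_ref, p_cur))


reference = list(range(1000))
shifted = [x + 500 for x in reference]
drifted = psi(reference, shifted) > 0.2  # threshold taken from the text above
```

Identical samples give a PSI of 0, while the shifted sample lands well above the 0.2 threshold, which in the tiered playbook would trigger investigation rather than automatic retraining.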
5 / 5
The interviewer asks: "How would you reduce the cost of a training job currently taking 8 hours on 32 A100 GPUs?" Which answer gives the most practical cost-reduction strategy?
Option B is strongest: it mandates profiling first (the universally correct first step), provides eight concrete techniques with estimated impact (mixed precision 1.5–2× throughput, spot instances 60–90% cheaper), lists the checkpoint-and-resume requirement for spot instances, and includes the short hyperparameter sweep strategy before a full run. Option D's distillation requires training a separate model and doesn't directly reduce the current 8-hour job; ZeRO may already be in use and isn't guaranteed to reduce runtime. Training cost reduction: profile first → mixed precision → checkpointing → gradient accumulation → data pipeline → spot instances + checkpointing → parallelism efficiency → short ablation sweeps.
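The estimated impacts quoted above can be combined into a back-of-the-envelope calculation. The sketch below assumes the mixed-precision speedup applies to the whole job and ignores spot interruption and restart overhead, so it is an optimistic bound rather than a forecast. `relative_cost` is a hypothetical helper, and no hourly price is assumed: results are fractions of the baseline cost.

```python
BASELINE_HOURS = 8
GPUS = 32


def relative_cost(speedup, spot_discount):
    """Cost as a fraction of baseline: fewer GPU-hours at a cheaper hourly rate."""
    return (1.0 / speedup) * (1.0 - spot_discount)


baseline_gpu_hours = BASELINE_HOURS * GPUS  # 256 GPU-hours

# Low end of both quoted ranges: 1.5x throughput, spot 60% cheaper.
conservative = relative_cost(speedup=1.5, spot_discount=0.60)  # about 0.27
# High end of both quoted ranges: 2x throughput, spot 90% cheaper.
optimistic = relative_cost(speedup=2.0, spot_discount=0.90)    # about 0.05
```

Even the conservative case lands at roughly a quarter of the baseline cost, which is why the answer pairs spot instances with checkpoint-and-resume: the saving only holds if an interrupted job does not restart from scratch.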