5 exercises covering feature pipeline design, training data quality, labeling pipelines, dataset versioning, and the difference between ML data engineering and traditional data engineering.
Structure for ML Data Engineer answers
Feature pipelines: training-serving skew is the most common failure mode — use the same code path for training and serving features
Data quality for ML: distribution drift matters as much as completeness — validate statistical properties, not just schema
Dataset versioning: version data and code together; a model is only reproducible if both are pinned
Labeling pipelines: inter-annotator agreement (for example, Cohen's kappa for a pair of annotators) is the key quality metric; measure it before scaling up
1 / 5
The interviewer asks: "What is training-serving skew and how do you prevent it?" Which answer is most technically precise?
Option B is strongest. It names two root causes (code skew and data skew), explains the mechanism of each, and gives a prevention strategy for each: shared feature computation via a feature store (the canonical solution for code skew), logging of serving features with distribution comparison against training data (for data skew), and canary deployments for pipeline changes. Option A correctly identifies distribution mismatch but only suggests better data and misses code skew entirely. Option C confuses data freshness with skew; you can have current data and still have severe code skew. Option D relies on a train-test split, which validates the training pipeline but says nothing about how the serving pipeline differs from it.
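To make the two prevention strategies concrete, here is a minimal Python sketch under stated assumptions. `compute_features` is a hypothetical shared feature function (the field names `amount` and `day_of_week` are invented for illustration) that both the training job and the serving path would import. `psi` computes the Population Stability Index, one common way to compare logged serving values against the training distribution. Neither is taken from an answer option, and the bin count and any alert threshold are choices a team would need to tune.

```python
import math
from collections import Counter


def compute_features(raw: dict) -> dict:
    """Single feature definition, imported by BOTH training and serving code.

    Hypothetical fields: 'amount' (float) and 'day_of_week' (0=Mon .. 6=Sun).
    """
    return {
        "amount_log": math.log1p(max(raw["amount"], 0.0)),
        "is_weekend": 1 if raw["day_of_week"] in (5, 6) else 0,
    }


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between training values and logged serving values.

    Bin edges come from the training sample; serving values outside that
    range are clamped into the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant features

    def bucket_fractions(values):
        counts = Counter(
            min(max(int((v - lo) / width), 0), bins - 1) for v in values
        )
        # Floor each fraction at eps so the logarithm below is always defined.
        return [max(counts.get(b, 0) / len(values), eps) for b in range(bins)]

    e = bucket_fractions(expected)
    a = bucket_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


train_values = [i / 100 for i in range(100)]
serving_values = [v + 0.5 for v in train_values]  # simulated drift
drift_score = psi(train_values, serving_values)
```

A score near zero means the serving distribution matches training; a large score, as in the simulated drift here, is a signal to investigate before the model degrades.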
2 / 5
The interviewer asks: "How do you validate data quality for ML training datasets? What is different compared to traditional data quality checks?" Which answer shows the most complete understanding?
Option C is strongest. It names five ML-specific quality dimensions beyond traditional checks (statistical distribution, label quality, feature importance drift, data leakage, slice analysis), names specific tools (Great Expectations, TFX), and explains why each dimension matters for model quality. Option A describes traditional data quality (nulls, duplicates, schema), which is necessary but insufficient. Option B asserts that ML data quality is the same as traditional data quality, which is incorrect: statistical distribution and label quality checks are ML-specific. Option D relies on profiling plus human review, which is appropriate for exploration but not for a scalable, automated training pipeline.
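As a rough illustration of how ML-specific checks sit on top of traditional ones, the sketch below combines a completeness check with two of the dimensions named above: a distribution check (mean shift against a reference sample) and a leakage check (the same entity appearing in both training and evaluation sets). The function name `check_batch`, its parameters, and the z-score threshold are assumptions made for this example. It uses only the standard library rather than Great Expectations or TFX, which provide far richer versions of these checks.

```python
from statistics import mean, stdev


def check_batch(reference, batch, train_ids, eval_ids, max_z=3.0):
    """Return a list of failure messages; an empty list means the batch passed."""
    failures = []

    # Traditional check: completeness.
    if any(v is None for v in batch):
        failures.append("nulls present")
    values = [v for v in batch if v is not None]
    if not values:
        failures.append("no non-null values")
        return failures

    # ML-specific check: has the batch mean moved relative to the reference?
    ref_mean, ref_sd = mean(reference), stdev(reference)
    standard_error = ref_sd / len(values) ** 0.5
    z = abs(mean(values) - ref_mean) / standard_error if standard_error > 0 else 0.0
    if z > max_z:
        failures.append(f"mean shifted (z={z:.1f})")

    # ML-specific check: leakage between training and evaluation entities.
    overlap = set(train_ids) & set(eval_ids)
    if overlap:
        failures.append(f"{len(overlap)} ids in both train and eval")

    return failures


reference = [float(i % 10) for i in range(1000)]
clean = check_batch(reference, reference[:100], [1, 2], [3, 4])
drifted = check_batch(reference, [v + 5 for v in reference[:100]], [1, 2], [2, 3])
```

The drifted batch would pass a schema-only validation, since every value is a well-formed float; it is the statistical and leakage checks that flag it.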
3 / 5
The interviewer asks: "How do you version datasets for machine learning?" Which answer is most operationally complete?
Option B is strongest. It names three requirements (reproducibility, lineage, efficiency), describes specific techniques for reproducibility (DVC, Delta Lake time travel, Git SHA, random seed tracking), describes a lineage graph with named tools (MLflow, Vertex ML Metadata), and addresses the efficiency problem (logical versioning via Delta Lake/Iceberg instead of physical copies). Option A uses timestamp folders: no lineage, no code version tracking, and a manual spreadsheet that breaks at scale. Option C names DVC but without the surrounding lineage and efficiency considerations. Option D dismisses versioning; "fresh data each time" makes model comparisons meaningless, because nobody knows what data each model was trained on.
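The reproducibility requirement can be sketched without any particular tool. The example below is a simplified stand-in for what DVC or a table format's snapshot ID provides: it derives a content hash for a dataset and records it next to the code commit and random seed in one manifest. The names `dataset_fingerprint` and `build_manifest`, the manifest keys, and the sample file contents are all assumptions made for illustration.

```python
import hashlib
import json


def dataset_fingerprint(files: dict) -> str:
    """Content hash over file names and bytes, independent of insertion order."""
    digest = hashlib.sha256()
    for name in sorted(files):
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")  # separator so name/content boundaries are unambiguous
        digest.update(hashlib.sha256(files[name]).digest())
    return digest.hexdigest()


def build_manifest(files: dict, code_sha: str, seed: int) -> dict:
    """Pin data, code, and seed together; a training run stores this with the model."""
    return {
        "data_sha256": dataset_fingerprint(files),
        "code_sha": code_sha,
        "seed": seed,
    }


snapshot = {"train.csv": b"x,y\n1,2\n", "eval.csv": b"x,y\n3,4\n"}
manifest = build_manifest(snapshot, code_sha="abc123", seed=42)
manifest_json = json.dumps(manifest, sort_keys=True)
```

Two models can then be compared meaningfully only when their manifests show which of the three pinned values differ.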
4 / 5
The interviewer asks: "You need to build a labeling pipeline for 500,000 images. How do you design it?" Which answer is most complete?
Option C is strongest. It describes five components: inter-annotator agreement (IAA) measurement before scale (with specific metrics and a threshold), guideline development tied to IAA, consensus mechanisms via platform features, an active learning loop that reduces labeling cost by 40-60%, and continuous quality monitoring via a golden set. Option A divides the work without IAA measurement or consensus mechanisms, so quality is unknown until the random sample at the end. Option B focuses on speed via crowdsourcing without any quality design. Option D outsources without describing the QA interface or IAA measurement; that is valid as an execution choice but not as an engineering design.
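For reference, Cohen's kappa for two annotators is the observed agreement corrected for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it from two label lists using only the standard library. The function name and the sample labels are illustrative, and the acceptance threshold a pilot round should meet is a project decision that this example does not set.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # p_o: fraction of items on which the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # p_e: chance agreement from each annotator's own label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:  # both annotators used one identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


annotator_1 = ["cat", "cat", "dog", "dog"]
annotator_2 = ["cat", "dog", "dog", "dog"]
kappa = cohens_kappa(annotator_1, annotator_2)  # p_o = 0.75, p_e = 0.5
```

Here the annotators agree on 75% of items, but half of that agreement is expected by chance, so kappa is 0.5. With more than two annotators, a multi-rater statistic such as Fleiss' kappa is the usual substitute.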
5 / 5
The interviewer asks: "How do ML data engineering responsibilities differ from traditional data engineering?" Which answer is most precise?
Option B is strongest. It identifies five ML-specific responsibilities that extend traditional data engineering (feature stores with training-serving consistency, dataset versioning for reproducibility, two pipeline types, ML-specific data quality, feedback loop pipelines), explains why each is specific to ML rather than general, and acknowledges the skill intersection. Option A focuses only on data volume, which is wrong: the requirements differ qualitatively. Option C focuses on languages and tools, but the technology choice is secondary to the difference in responsibilities. Option D says ML data engineering is a subset of data science, which is incorrect; it is a specialisation of data engineering that interfaces with data science but has distinct engineering responsibilities.