The interviewer asks: "What is a feature store and why does every mature ML platform eventually need one?" Which answer is most complete?
Option B is strongest. It structures the answer around three named problems that feature stores solve, which is the right framing for a system design interview because it shows you understand the motivation, not just the tool. The training-serving skew section is precise about the mechanism: when training and serving features are implemented in different languages (Python vs. Java/C++), subtle edge-case differences produce a mismatch between training and serving feature distributions that is hard to detect until model performance degrades. The feature reuse section quantifies the problem ("user 30-day purchase count computed 5 times") to make the waste concrete. The point-in-time correctness section correctly characterises it as a temporal join problem. The architecture section is exact: the offline store is S3/BigQuery/Parquet for batch workloads, the online store is Redis/DynamoDB for serving in under 10 ms, and the answer correctly names four production solutions.
Feature store vocabulary:
Training-serving skew: the difference between feature distributions at training time and serving time.
Feature registry: a searchable catalogue of feature definitions with ownership and documentation.
Point-in-time lookup: retrieving the feature value as it existed at a specific past timestamp.
Offline store: historical feature value storage for training.
Online store: current feature value storage for low-latency inference.
Options C and D are accurate but frame the answer as a list rather than a motivated problem-solution structure.
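To make the offline/online split concrete, here is a minimal sketch in which one feature definition feeds both stores, so training and serving cannot diverge in implementation. The function name `purchase_count_30d`, the in-memory dictionaries standing in for the stores, and the sample data are all illustrative assumptions, not any particular feature store's API.

```python
from datetime import datetime, timedelta

def purchase_count_30d(purchase_times, as_of):
    """Single feature definition: purchases in the 30 days ending at `as_of`."""
    window_start = as_of - timedelta(days=30)
    return sum(1 for t in purchase_times if window_start < t <= as_of)

# Hypothetical raw events for one user.
purchases = {
    "u1": [datetime(2024, 1, 5), datetime(2024, 1, 20), datetime(2024, 3, 1)],
}

# Offline path: historical rows at past timestamps, used to build training sets.
offline_store = [
    {"user_id": uid, "as_of": ts, "purchase_count_30d": purchase_count_30d(times, ts)}
    for uid, times in purchases.items()
    for ts in (datetime(2024, 1, 31), datetime(2024, 3, 2))
]

# Online path: only the latest value per entity, keyed for fast lookup.
now = datetime(2024, 3, 2)
online_store = {uid: purchase_count_30d(times, now) for uid, times in purchases.items()}
```

Because both paths call the same function, the latest offline row and the online value agree by construction; a real system would replace the dictionaries with a batch store and a key-value store.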
2 / 5
The interviewer asks: "What is point-in-time correctness in feature engineering and how do you implement it?" Which answer is most rigorous?
Option B is strongest. It opens with a precise definition that explicitly names the failure mode (the model leaks future information and overestimates production performance), which is the "why it matters" that many candidates skip. The naive join failure section uses a concrete, memorable example (a churn prediction model trained on post-churn feature values) that makes the abstraction tangible. The SQL snippet for the temporal join is the correct implementation, using a correlated subquery or as-of join pattern. The comment that feature stores implement this natively (as a sorted merge join) explains why a feature store is not just a convenience but a performance necessity: ad-hoc temporal joins on large feature tables are prohibitively expensive. The TTL section introduces a third dimension of point-in-time correctness that most candidates miss (stale features). The three leakage types at the end provide a taxonomy.
Point-in-time correctness vocabulary:
Temporal join / as-of join: a join that finds the latest record at or before a given timestamp.
Feature timestamp: the time at which a feature value was computed or observed.
TTL (Time-to-Live): the maximum age of a feature value before it is considered stale.
Data leakage: using information during training that is unavailable at serving time, inflating measured model performance.
Label leakage: when a feature is derived from the label itself or from information that only exists once the label outcome is known.
Options C and D are accurate but lack the SQL implementation and the concrete churn prediction example.
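The as-of lookup with a TTL can be sketched in a few lines of Python. This is an illustration of the technique only, not the SQL from Option B: the function name `as_of_lookup`, the integer day timestamps, and the sample feature history are assumptions made for the example.

```python
from bisect import bisect_right

def as_of_lookup(feature_rows, label_ts, ttl=None):
    """Return the latest feature value with timestamp <= label_ts.

    feature_rows: list of (timestamp, value) pairs sorted by timestamp.
    ttl: optional maximum age; older values are treated as missing.
    """
    timestamps = [ts for ts, _ in feature_rows]
    idx = bisect_right(timestamps, label_ts) - 1  # last row at or before label_ts
    if idx < 0:
        return None  # no feature value existed yet
    ts, value = feature_rows[idx]
    if ttl is not None and label_ts - ts > ttl:
        return None  # value exists but is too stale to use
    return value

# Hypothetical history of "support tickets in last 30 days" for one user,
# with timestamps expressed as day numbers.
rows = [(10, 1), (40, 2), (70, 9)]
churn_label_day = 50

naive_value = rows[-1][1]                         # latest value: leaks day-70 data
correct_value = as_of_lookup(rows, churn_label_day)
stale_value = as_of_lookup(rows, churn_label_day, ttl=5)
```

Here the naive join picks the day-70 value, which was recorded after the churn label, while the as-of lookup returns the day-40 value. With a 5-day TTL the day-40 value is rejected as stale and the lookup returns `None`.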
3 / 5
The interviewer asks: "How do you detect and prevent training-serving skew in a production ML system?" Which answer is most systematic?
Option B is strongest. The three-root-cause framework (computation / freshness / pipeline) is the correct taxonomy: most candidates know about distribution drift but miss computation skew (different code paths) and pipeline skew (for example, inconsistent null handling). The shadow serving technique for detecting computation skew is a production pattern used at companies like Uber and Airbnb for exactly this problem. The statistics section is precise: PSI > 0.2 (a widely used rule-of-thumb threshold for significant shift), KL divergence, and the KS test, each with its appropriate use case. The null rate comparison (a serving null rate more than 2× the training null rate triggers an alert) is a specific, operational heuristic. The unified monitoring framework (Kafka → S3 → nightly jobs → PSI dashboard) shows end-to-end system thinking.
Training-serving skew vocabulary:
Population Stability Index (PSI): a measure of how much a feature distribution has shifted; PSI > 0.2 is conventionally read as significant drift.
Shadow serving: computing the same feature through both the training and serving implementations at serving time and comparing the results to detect implementation divergence.
Feature transformation logging: recording the exact preprocessed feature vector passed to the model at serving time.
Kolmogorov-Smirnov test: a non-parametric test comparing two continuous distributions.
Pipeline skew: differences in preprocessing logic between training and serving pipelines.
Options C and D list the root causes correctly but lack the PSI threshold value and the shadow serving mechanism.
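The PSI and null-rate checks can be sketched as follows. This is a minimal illustration assuming equal-width bins derived from the training sample and a small `eps` floor to avoid division by zero; the function names, bin count, and sample data are assumptions for the example, and production systems often use quantile bins instead.

```python
import math

def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a training and a serving sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant feature

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1  # clip out-of-range values
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

def null_rate_alert(train_null_rate, serve_null_rate, factor=2.0):
    """Alert when the serving null rate exceeds `factor` times the training rate."""
    return serve_null_rate > factor * train_null_rate

training = [float(v) for v in range(100)]
serving_same = list(training)
serving_shifted = [v + 50.0 for v in training]

psi_same = psi(training, serving_same)
psi_shifted = psi(training, serving_shifted)
```

Identical samples give a PSI of zero, while the shifted sample lands far above the 0.2 rule of thumb. A nightly job would run this per feature over the logged serving vectors and the training snapshot.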
4 / 5
The interviewer asks: "What is feature lineage, why does it matter, and how do you implement it in a feature platform?" Which answer is most complete?
Option B is strongest. It defines lineage as a DAG (the correct data structure) and then motivates it with four specific operational reasons. This is the answer structure senior interviewers expect because it shows the candidate understands why lineage exists, not just what it is. The impact analysis section makes the failure mode concrete: silent feature corruption that surfaces as model degradation days later, a production failure familiar to anyone who has operated ML systems. The GDPR right to erasure section covers a compliance angle many candidates miss, and one that is operationally important for consumer-facing ML systems. The retraining automation section shows how lineage enables proactive rather than reactive ML operations. The implementation section correctly names four specific components, including column-level lineage (more precise than table-level) and the topological sort API.
Feature lineage vocabulary:
Feature lineage DAG: a directed acyclic graph tracing feature provenance from source to consumption.
Column-level lineage: tracking which specific source columns contribute to a derived feature.
DataHub / Apache Atlas / OpenLineage: data catalogue and lineage tracking tools (OpenLineage is an open standard for collecting lineage metadata rather than a catalogue).
Right to erasure: the GDPR requirement to delete personal data on request, which in practice means showing that the deletion propagated through all derived datasets.
Automated retraining trigger: a pipeline that initiates model retraining when upstream data dependencies change.
Options C and D are accurate but lack the concrete failure mode examples and the GDPR erasure mechanism.
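Impact analysis over a lineage DAG can be sketched with the standard library's `graphlib`. The node names and the `downstream_of` helper are hypothetical and chosen for illustration; a real platform would read the graph from a lineage service rather than a literal dictionary.

```python
from graphlib import TopologicalSorter

# Column-level lineage: each node maps to its direct upstream dependencies.
upstream = {
    "raw.orders.amount": set(),
    "raw.orders.user_id": set(),
    "feature.user_spend_30d": {"raw.orders.amount", "raw.orders.user_id"},
    "feature.user_order_count_30d": {"raw.orders.user_id"},
    "model.churn_v3": {"feature.user_spend_30d", "feature.user_order_count_30d"},
    "model.ltv_v1": {"feature.user_spend_30d"},
}

def downstream_of(node, upstream):
    """List everything affected by a change to `node`, upstream-first."""
    order = TopologicalSorter(upstream).static_order()  # dependencies come first
    affected = {node}
    result = []
    for n in order:
        # A node is affected if any of its direct inputs is affected.
        if upstream.get(n, set()) & affected:
            affected.add(n)
            result.append(n)
    return result

impacted = downstream_of("raw.orders.amount", upstream)
```

Because the nodes are visited in topological order, one pass is enough, and the result doubles as a safe recomputation order: features first, then the models that consume them. The same traversal answers the erasure question of which derived datasets contain data from a given source column.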
5 / 5
The interviewer asks: "When would you use streaming feature computation versus batch computation, and what are the engineering trade-offs?" Which answer is most nuanced?
Option B is strongest. The three-factor decision framework (freshness / complexity / cost) is the correct structured approach. The batch computation section adds a concrete freshness threshold ("1-hour staleness is acceptable") and the materialisation flow (batch → offline store → online store), which shows end-to-end pipeline thinking. The streaming section gives two concrete use cases with specific freshness requirements (fraud: 5-minute window; recommendation: 1-minute staleness hurts quality), making the decision criteria actionable rather than abstract. The trade-offs section introduces exactly-once semantics as a correctness requirement for aggregation features, a subtle but critical point: at-least-once processing can deliver the same event twice and so overcounts. The backfill complexity trade-off is the most practically important operational concern for teams adopting streaming features, and most candidates do not mention it. The recommendation (default to batch, add streaming only when features must be fresher than 5 minutes) gives a concrete decision rule.
Streaming vs. batch vocabulary:
Watermarking: a Flink/Spark Structured Streaming mechanism for handling late-arriving events.
Exactly-once semantics: the guarantee that each event affects the result exactly once, with no duplicates and no losses.
Backfill: retroactively computing feature values for historical timestamps to generate training data.
Unified computation framework: a system that runs the same feature definition logic on both streaming and batch inputs.
Late-arriving events: events that arrive after the expected processing window, potentially forcing recalculation.
Options C and D list the trade-offs correctly but lack the exactly-once semantics explanation and the backfill complexity rationale.
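The overcounting risk under at-least-once delivery can be shown with a small windowed count. This sketch gets an effectively-once result by deduplicating on an event ID, which is a simpler technique than the checkpoint-based exactly-once guarantees of engines such as Flink. The function name, the `(event_id, timestamp)` event shape, and the sample events are assumptions for illustration.

```python
def count_in_window(events, window_start, window_end, dedupe=True):
    """Count events with window_start <= ts < window_end.

    events: iterable of (event_id, timestamp_seconds) pairs, possibly
    containing redelivered duplicates.
    """
    seen = set()
    count = 0
    for event_id, ts in events:
        if not (window_start <= ts < window_end):
            continue  # outside the aggregation window
        if dedupe:
            if event_id in seen:
                continue  # redelivered event: already counted
            seen.add(event_id)
        count += 1
    return count

# "e2" was delivered twice after a hypothetical consumer retry.
events = [("e1", 10), ("e2", 20), ("e2", 20), ("e3", 400)]

at_least_once_count = count_in_window(events, 0, 300, dedupe=False)
effectively_once_count = count_in_window(events, 0, 300, dedupe=True)
```

For a 5-minute fraud window the duplicate inflates the transaction count from 2 to 3, which is exactly the kind of error that shifts a velocity feature past a decision threshold. Running the same function over historical events is also the simplest form of backfill, since it reuses the streaming definition on batch input.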