5 exercises — choose the best-structured answer to common Data Engineer interview questions. Focus on precise vocabulary, correct use of technical terms, and demonstrating real experience.
Structure for data engineering interview answers
Name the pattern: batch/streaming/lambda/kappa — explain the latency and complexity trade-off
Specify latency requirements: frame the choice in seconds, minutes, or hours depending on the use case
Address exactly-once semantics: mention Flink checkpointing or Kafka transactions for streaming correctness
Mention monitoring and alerting: freshness checks, volume anomaly detection, blocking vs non-blocking failures
1 / 5
The interviewer asks: "When would you choose a streaming pipeline over a batch pipeline, and what architecture would you use?" Which answer best demonstrates pipeline architecture thinking?
Option B is strongest. It frames the decision on latency requirements and data arrival patterns (not just "real-time vs batch") and gives concrete examples for each. It explains both Lambda and Kappa architectures with their trade-offs (Lambda: accuracy at the cost of complexity; Kappa: simplicity, but reprocessing means replaying the stream). It covers exactly-once semantics as a critical streaming concern with specific tools (Flink checkpointing, Kafka transactions), and it names the streaming-specific challenges (out-of-order events, windowing strategies).
Key structure: latency requirement → use case examples → Lambda vs Kappa comparison → exactly-once semantics → streaming-specific challenges (windowing, reprocessing).
Option C is accurate but does not explain exactly-once semantics or windowing. Option D is accurate but surface-level: it does not cover the Kappa simplification rationale or exactly-once semantics.
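The out-of-order and windowing challenges named above can be illustrated with a minimal sketch in plain Python. It is not the Flink or Kafka API: the window size, the allowed lateness, and the function names are all assumptions chosen for illustration, and the watermark here is simply "largest event time seen so far minus the allowed lateness".

```python
from collections import defaultdict

WINDOW_SECONDS = 60    # tumbling window size (assumed value)
ALLOWED_LATENESS = 30  # how far behind the newest event a record may arrive (assumed value)


def window_start(event_time: int) -> int:
    """Return the start of the tumbling window that contains event_time."""
    return event_time - (event_time % WINDOW_SECONDS)


def aggregate(events):
    """Count events per event-time window, dropping events that arrive too late.

    events: iterable of (event_time_seconds, key) in ARRIVAL order.
    Returns (counts_by_window_start, dropped_events).
    """
    counts = defaultdict(int)
    dropped = []
    watermark = float("-inf")
    for event_time, key in events:
        # Simplified watermark: newest event time seen, minus allowed lateness.
        watermark = max(watermark, event_time - ALLOWED_LATENESS)
        window_end = window_start(event_time) + WINDOW_SECONDS
        if window_end <= watermark:
            # The watermark has already passed this window, so the event is too late.
            dropped.append((event_time, key))
            continue
        counts[window_start(event_time)] += 1
    return dict(counts), dropped


# Event "c" arrives out of order but within the lateness bound; "e" arrives too late.
arrivals = [(5, "a"), (62, "b"), (58, "c"), (130, "d"), (20, "e")]
counts, dropped = aggregate(arrivals)
print(counts)   # {0: 2, 60: 1, 120: 1}
print(dropped)  # [(20, 'e')]
```

In an interview answer, the point to draw out is the trade-off in the lateness bound: a larger bound accepts more out-of-order events but delays when a window's result can be treated as final.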
2 / 5
The interviewer asks: "How do you structure dbt models, and how does dbt enforce data quality?" Which answer best demonstrates dbt knowledge?
Option B is strongest. It describes each layer with the specific transformations that belong there (staging: rename/cast; intermediate: joins/dedup; mart: denormalised for BI) and explains ref() with two benefits (environment-aware references and DAG lineage). It distinguishes generic from singular tests with specific test names, introduces freshness checks as a safeguard against stale data propagating downstream, and explains materialisation trade-offs with the decision logic.
Key structure: three layers with specific transformations → ref() for lineage DAG → generic vs singular tests → freshness checks → materialisation trade-offs → dbt docs from same YAML.
Option C is accurate but does not explain why each layer exists or the freshness check feature. Option D is too brief: it does not explain the layer separation rationale or materialisation.
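To make the "ref() gives you a lineage DAG" point concrete, here is a toy sketch in plain Python. It is not how dbt parses projects internally; the model names, the SQL, and the regular expression are hypothetical and only show how ref() calls imply a dependency graph, and how that graph implies a build order.

```python
import re
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical models in three layers; names and SQL are illustrative only.
MODELS = {
    "stg_orders": "select id as order_id, customer_id, cast(amount as numeric) as amount from raw.orders",
    "stg_customers": "select id as customer_id, name from raw.customers",
    "int_orders_joined": (
        "select * from {{ ref('stg_orders') }} "
        "join {{ ref('stg_customers') }} using (customer_id)"
    ),
    "mart_revenue": (
        "select customer_id, sum(amount) as revenue "
        "from {{ ref('int_orders_joined') }} group by 1"
    ),
}

# Simplified pattern: matches {{ ref('model_name') }} with single quotes only.
REF_PATTERN = re.compile(r"\{\{\s*ref\('([^']+)'\)\s*\}\}")


def dependencies(sql: str) -> set:
    """Return the set of model names referenced through ref() in the SQL."""
    return set(REF_PATTERN.findall(sql))


def build_order(models: dict) -> list:
    """Return model names ordered so that every model comes after its dependencies."""
    graph = {name: dependencies(sql) for name, sql in models.items()}
    return list(TopologicalSorter(graph).static_order())


order = build_order(MODELS)
print(order)  # staging models first, then the intermediate model, then the mart
```

The two staging models have no dependency on each other, so their relative order is not guaranteed; only the layer ordering is.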
3 / 5
The interviewer asks: "How do you implement data quality checks in a production data pipeline?" Which answer best covers a production-grade approach?
Option B is strongest. It names four distinct check layers with the reasoning for each and introduces data contracts as a producer-consumer governance mechanism (a senior-level concept). It specifies a concrete anomaly detection threshold (a 20% row-count drop against the 7-day average), distinguishes blocking from non-blocking failures (which prevents alert fatigue), and routes alerts to a dedicated channel.
Key structure: four layers (schema → row-level → freshness/volume anomaly → business rules) → data contracts → concrete threshold examples → blocking vs non-blocking classification → alert routing.
Option C is accurate but does not cover anomaly detection, data contracts, or the blocking/non-blocking distinction. Option D mentions Monte Carlo (a real tool) but does not explain the four-layer framework or data contracts.
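The threshold and the blocking/non-blocking split can be sketched in a few lines of plain Python. The 20% drop against a 7-day average comes from the explanation above; everything else is an assumption for illustration, including the class and function names and the choice to treat the volume check as non-blocking and the null check as blocking.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class CheckResult:
    name: str
    passed: bool
    blocking: bool  # True: halt the pipeline on failure; False: alert only


def volume_check(today_rows: int, last_7_days: list, max_drop: float = 0.20) -> CheckResult:
    """Fail when today's row count drops more than max_drop below the 7-day average."""
    baseline = mean(last_7_days)
    drop = (baseline - today_rows) / baseline if baseline else 0.0
    # Assumption: a volume anomaly alerts but does not stop the load.
    return CheckResult("row_count_drop", passed=drop <= max_drop, blocking=False)


def not_null_check(rows: list, column: str) -> CheckResult:
    """Fail when any row is missing a value in the given column."""
    missing = sum(1 for row in rows if row.get(column) is None)
    # Assumption: a null key would corrupt downstream joins, so this check blocks.
    return CheckResult(f"not_null:{column}", passed=missing == 0, blocking=True)


def should_halt(results: list) -> bool:
    """Halt only when at least one blocking check has failed."""
    return any(r.blocking and not r.passed for r in results)


history = [1000] * 7  # hypothetical daily row counts
results = [
    volume_check(700, history),  # 30% drop: fails, but non-blocking
    not_null_check([{"order_id": 1}, {"order_id": 2}], "order_id"),  # passes
]
halt = should_halt(results)
print(halt)  # False: the failed check only raises an alert
```

Classifying each check up front is what keeps the alert channel usable: only the failures that would corrupt downstream data stop the run.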
4 / 5
The interviewer asks: "How do you choose a partitioning strategy for a large table in a cloud data warehouse?" Which answer best explains partitioning trade-offs?
Option B is strongest. It frames the decision on query patterns and cardinality and explains partition pruning as the mechanism. It gives a specific reason for date partitioning beyond performance (retention via partition drop) and explains hash partitioning with the skew-prevention rationale. It warns about low-cardinality partitioning as a failure mode with a specific example (a 3-value column), distinguishes BigQuery clustering from Snowflake micro-partitioning, and names over-partitioning as the opposite pitfall.
Key structure: query patterns + cardinality → pruning goal → date partitioning (ingestion cadence + retention) → hash partitioning (skew prevention) → low-cardinality pitfall → BigQuery vs Snowflake specifics → over-partitioning overhead.
Option C is accurate but does not explain retention via partition drop or the low-cardinality pitfall. Option D mentions both pitfalls briefly but does not explain the reasoning behind hash partitioning or the BigQuery/Snowflake distinction.
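The low-cardinality pitfall and the skew-prevention rationale for hash partitioning can be shown with a small simulation. This is a sketch under stated assumptions: the rows, the three status values and their proportions, the 16-partition count, and the use of MD5 are all illustrative, and real warehouses use their own internal hashing and storage layouts.

```python
import hashlib
from collections import Counter


def hash_partition(key: str, num_partitions: int) -> int:
    """Map a key to a partition using a stable hash (MD5 chosen only for determinism)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def skew(partition_sizes: Counter) -> float:
    """Largest partition divided by the average partition size; 1.0 is perfectly even."""
    sizes = list(partition_sizes.values())
    return max(sizes) / (sum(sizes) / len(sizes))


def status_for(i: int) -> str:
    """Hypothetical 3-value column: 1% churned, 9% paused, 90% active."""
    if i % 100 == 0:
        return "churned"
    if i % 10 == 0:
        return "paused"
    return "active"


rows = [{"user_id": f"user-{i}", "status": status_for(i)} for i in range(10_000)]

# Partitioning on the 3-value column: three partitions, one holding 90% of the rows.
by_status = Counter(row["status"] for row in rows)

# Hash partitioning on the high-cardinality key: 16 partitions of similar size.
by_hash = Counter(hash_partition(row["user_id"], 16) for row in rows)

print(by_status)
print(round(skew(by_status), 2), round(skew(by_hash), 2))
```

With the status column, a filter on the dominant value still scans most of the table, so pruning buys little; the hashed key spreads rows evenly but only helps queries that filter or join on that key.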
5 / 5
The interviewer asks: "How do you design a reliable Airflow DAG for a production data pipeline?" Which answer best demonstrates Airflow orchestration expertise?
Option B is strongest. It leads with idempotency as the foundational principle (with a specific implementation: UPSERT or partition overwrite) and gives the rationale for task granularity (retry scope, not just modularity). It explains deferrable sensors with the version and the specific benefit (they release worker slots), and it states the XCom size constraint with the correct alternative (S3/GCS). It introduces dynamic task mapping, an Airflow 2.3 feature, by its method name (expand()), and mentions metrics export for monitoring.
Key structure: idempotency with implementation detail → task granularity for retry scope → deferrable sensors (worker slot efficiency) → XComs for small state, S3/GCS for large → dynamic task mapping with expand() → retries/SLA → Prometheus/Datadog monitoring.
Option C is accurate and covers deferrable operators but does not explain the XCom size pitfall or the idempotency implementation pattern. Option D does not explain deferrable sensors or dynamic task mapping.
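The partition-overwrite form of idempotency is easy to demonstrate without Airflow. The sketch below uses a plain dict as a stand-in for a partitioned table; the function names, the date, and the rows are hypothetical, and in a real DAG the same logic would sit inside a task keyed on the run's logical date.

```python
def overwrite_partition(table: dict, run_date: str, rows: list) -> None:
    """Idempotent load: replace the whole partition for run_date."""
    table[run_date] = list(rows)


def append_rows(table: dict, run_date: str, rows: list) -> None:
    """Non-idempotent load: every run adds the rows again."""
    table.setdefault(run_date, []).extend(rows)


batch = [{"order_id": 1}, {"order_id": 2}]  # hypothetical rows for one run date

idempotent_table = {}
naive_table = {}
for _attempt in range(2):  # simulate a task that runs once and is then retried
    overwrite_partition(idempotent_table, "2024-01-01", batch)
    append_rows(naive_table, "2024-01-01", batch)

print(len(idempotent_table["2024-01-01"]))  # 2: the retry left the data unchanged
print(len(naive_table["2024-01-01"]))       # 4: the retry duplicated every row
```

This is why idempotency comes first in the answer: retries, backfills, and manual reruns are only safe when a second run of the same task for the same date leaves the table in the same state.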