5 exercises — practice structuring strong English answers to data engineering interview questions: ETL vs ELT, data lineage, pipeline reliability, partitioning, and real-time event processing.
How to structure data engineering interview answers
ETL/ELT questions: define both → state the decision criterion (compute location) → mention re-transformation from raw data → name tools (dbt, Informatica)
Lineage questions: three values — trust, impact analysis, compliance → name column-level vs table-level → cite dbt and catalog tools
Reliability questions: organise by layer — infrastructure, quality, orchestration, observability → idempotency is always relevant
Partitioning questions: define partition pruning → quantify the benefit → name clustering → give the anti-pattern (high-cardinality)
Real-time questions: name all three layers → give tool decision criteria → address exactly-once vs at-least-once + watermarks
1 / 5
The interviewer asks: "What is the difference between ETL and ELT, and when would you choose one over the other?" Which answer demonstrates the clearest data engineering thinking?
Option B is the strongest: it gives precise definitions for both, explains the reasoning behind each choice (compute location, cost, auditability), names specific scenarios where ETL is still preferable (limited compute, sensitive data masking), and grounds the answer in real cloud warehouse tools and practices.

The ETL vs ELT decision framework:
ETL preferred when: the destination has limited compute (on-prem warehouse, legacy system); sensitive PII must be masked/anonymised before landing in the warehouse; strict schema enforcement is required at ingest.
ELT preferred when: the destination is a cloud warehouse with elastic compute (BigQuery, Snowflake, Redshift); you want to preserve raw data for re-transformation; you need fast iteration on business logic without re-extracting.

Key ELT vocabulary:
Raw layer / bronze layer — landing zone for unmodified source data.
Staging / silver layer — cleaned, deduplicated, typed.
Mart / gold layer — business-ready aggregations.
dbt (data build tool) — the standard tool for SQL-based transformations in ELT; it version-controls models and auto-generates lineage.
Medallion architecture — bronze/silver/gold = raw/cleaned/aggregated.

Options C and D are reasonable, but they lack the security/governance rationale for ETL and never connect the re-transformation argument to correcting business logic after the fact.
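To make the medallion layering concrete, here is a minimal sketch of the raw → staging → mart flow. It uses an in-memory SQLite connection as a stand-in for a warehouse, and the table names (raw_orders, stg_orders, mart_daily_revenue) are invented for illustration; in a real ELT setup each SELECT would live in its own dbt model.

```python
# Minimal ELT medallion sketch. SQLite stands in for a cloud warehouse;
# all table names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Bronze: land source data unmodified (seeded inline for the demo).
    CREATE TABLE raw_orders (order_id TEXT, amount TEXT, created_at TEXT);
    INSERT INTO raw_orders VALUES
        ('o1', '19.99', '2024-05-01'),
        ('o1', '19.99', '2024-05-01'),   -- duplicate from a retried extract
        ('o2', 'N/A',   '2024-05-02');   -- bad value from the source

    -- Silver: clean, deduplicate, and type the raw data.
    CREATE TABLE stg_orders AS
    SELECT DISTINCT order_id,
           CAST(amount AS REAL) AS amount,
           created_at
    FROM raw_orders
    WHERE amount != 'N/A';

    -- Gold: business-ready aggregation.
    CREATE TABLE mart_daily_revenue AS
    SELECT created_at AS day, SUM(amount) AS revenue
    FROM stg_orders
    GROUP BY created_at;
""")
print(conn.execute("SELECT * FROM mart_daily_revenue").fetchall())
# [('2024-05-01', 19.99)] -- the duplicate and the bad row never reach the mart
```

The sketch also illustrates the re-transformation argument: because raw_orders is preserved untouched, the staging and mart logic can be rewritten and re-run at any time without re-extracting from the source.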
2 / 5
The interviewer asks: "What is data lineage, and why does it matter in a data engineering context?" Which answer best demonstrates depth?
Option B is the strongest: it precisely defines lineage with both its flow and transformation dimensions, gives three distinct value propositions with specifics (trust, impact analysis, compliance), cites real regulatory drivers (GDPR, HIPAA), mentions column-level lineage (a key advanced feature), and names multiple tooling categories.

The three reasons lineage matters in interviews:
1. Data trust / discoverability — data consumers (analysts, data scientists) need to understand where a metric comes from before they trust it. Lineage makes this self-service.
2. Impact analysis (blast radius) — "If I change the definition of active_user in the source table, what breaks?" Without lineage, this is a scary manual investigation. With lineage, you see the dependency graph instantly.
3. Compliance / data governance — GDPR's right to erasure requires knowing every copy of personal data. HIPAA requires audit trails for patient data.

Column-level lineage (tracking individual columns through transformations) vs. table-level lineage: column-level is much more powerful but harder to maintain. dbt generates model-level lineage automatically for SQL models; column-level lineage typically comes from dbt Cloud or a catalog tool.

The tooling categories:
Transformation-level lineage: dbt.
Platform-wide lineage / data catalog: Apache Atlas, OpenMetadata, DataHub, Collibra, Alation.
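A hypothetical sketch of the impact-analysis point: once table-level lineage exists as an upstream-to-downstream edge map, the "blast radius" of a change is just a breadth-first walk over the graph. All table names below are invented for illustration.

```python
# Blast-radius sketch over a toy table-level lineage graph.
from collections import deque

# Edges point from a table to the tables built on top of it.
lineage = {
    "raw.users": ["stg.users"],
    "stg.users": ["mart.active_users", "mart.retention"],
    "mart.active_users": ["dashboard.kpis"],
}

def blast_radius(table: str) -> set:
    """Collect every table downstream of `table` via breadth-first search."""
    affected, queue = set(), deque([table])
    while queue:
        for child in lineage.get(queue.popleft(), []):
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected

print(sorted(blast_radius("raw.users")))
# ['dashboard.kpis', 'mart.active_users', 'mart.retention', 'stg.users']
```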
3 / 5
The interviewer asks: "How do you ensure reliability in a data pipeline?" Which answer best demonstrates production data engineering experience?
Option B is the strongest: it organises reliability into four distinct layers (infrastructure, data quality, orchestration, observability), names specific techniques under each with precise justifications (MERGE, not INSERT, and why), names multiple tools per layer, and adds the critical insight that "a pipeline that runs but produces stale data is as bad as a failed pipeline" — this shows production maturity.

Key data pipeline reliability concepts:
Idempotency — running the same task multiple times produces the same result. Critical for safe retries. Use partition-based overwrites or MERGE statements instead of appends, which would duplicate rows.
Schema evolution handling — sources change schemas without warning. Fail fast at ingest (schema validation) so you know immediately, rather than silently propagating corrupted data.
Data quality checks — dbt's built-in tests (not_null, unique, accepted_values, relationships); Great Expectations or Soda for more complex assertions.
SLAs on pipelines — "the warehouse should be refreshed by 8am for the business to use." Airflow and Prefect both support SLA-miss callbacks.
Data freshness monitoring — separate from pipeline success: track when data was last updated to detect stale-but-not-failed pipelines.
Dead-letter queue — a holding area for records that failed processing, so they don't block the pipeline and can be reprocessed or investigated.

Options C and D are solid but less clearly organised, and option D misses the "stale data" insight.
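A minimal sketch of the idempotency point, assuming an in-memory SQLite connection as a stand-in for a warehouse: the target day's partition is deleted and rewritten in one transaction, so a retry can never duplicate rows. Real warehouses would use MERGE or a partition-replace statement instead.

```python
# Idempotent daily load: overwrite one day's "partition" atomically.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (day TEXT, user_id TEXT, n INTEGER)")

def load_day(day: str, rows) -> None:
    # Transaction: the old partition is dropped and rewritten atomically,
    # so a retry after a crash cannot leave duplicate rows behind.
    with conn:
        conn.execute("DELETE FROM events WHERE day = ?", (day,))
        conn.executemany(
            "INSERT INTO events VALUES (?, ?, ?)",
            [(day, user, n) for user, n in rows],
        )

load_day("2024-05-01", [("u1", 3), ("u2", 5)])
load_day("2024-05-01", [("u1", 3), ("u2", 5)])  # safe retry, still 2 rows
print(conn.execute("SELECT COUNT(*) FROM events").fetchone())  # (2,)
```

Had the load appended instead of overwriting, the retry would have left four rows; that is exactly the MERGE-not-INSERT justification the answer calls for.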
4 / 5
The interviewer asks: "Explain the concept of data partitioning and its impact on query performance." Which answer best demonstrates SQL and warehouse optimization knowledge?
Option B is the strongest: it gives a quantified example (10 TB table, 7 of 365 partitions scanned), explains the write-efficiency improvement (not just read), differentiates partitioning from clustering, names both tools (BigQuery clustering vs. Snowflake micro-partitioning), and — critically — names an anti-pattern (over-partitioning on high-cardinality columns), which shows production experience.

Partitioning interview vocabulary:
Partition pruning — the query planner's ability to skip irrelevant partitions based on query filters. It only works when you filter on the partition column.
Partition column choices: date/timestamp columns are most common for event data (daily partitions); low-to-medium-cardinality categoricals (region, tenant_id) work well; high-cardinality columns (user_id, order_id) are anti-patterns — too many tiny partitions.
Clustering vs partitioning: partitioning = physical separation of files; clustering = data within each partition is sorted by the cluster column(s), so the engine can seek to relevant row groups without scanning the whole partition. The two can be combined.
Write patterns: partitioned tables enable partition-by-partition incremental loads — overwrite yesterday's partition without touching the rest of the table. This is the ELT incremental-load pattern.
Cost model relevance: in BigQuery and Redshift Spectrum (both billed by bytes scanned), partition pruning directly reduces cost; in Snowflake (credits per unit of virtual warehouse time), it reduces query duration.

Options C and D are correct but omit the write-efficiency point, the quantified example, and the anti-pattern.
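Working through the quantified example as back-of-envelope arithmetic, assuming daily partitions of roughly equal size:

```python
# Partition-pruning arithmetic: a 10 TB table with 365 daily partitions,
# filtered to a 7-day window on the partition column.
table_tb = 10.0
partitions = 365
days_queried = 7

full_scan_tb = table_tb                                # no pruning: read it all
pruned_scan_tb = table_tb * days_queried / partitions  # pruning skips the rest

print(f"full scan:   {full_scan_tb:.3f} TB")
print(f"pruned scan: {pruned_scan_tb:.3f} TB  (~{full_scan_tb / pruned_scan_tb:.0f}x less)")
# pruned scan: 0.192 TB, roughly a 52x reduction. Under bytes-scanned billing
# (BigQuery, Redshift Spectrum) the bill shrinks by the same factor.
```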
5 / 5
The interviewer asks: "Walk me through how you would design a data pipeline for real-time event processing." Which answer demonstrates the clearest system design thinking?
Option B is the strongest: it structures the answer into clearly labelled layers, gives decision criteria for choosing between tools (not just naming them), addresses three serving-layer scenarios with different latency/use-case profiles, and covers four advanced design considerations (exactly-once vs at-least-once, late events/watermarks, schema evolution, back-pressure) — each a real production concern.

Real-time pipeline vocabulary:
Message broker / event bus — Kafka, Kinesis, Pub/Sub. Decouples producers from consumers, enables replay, provides durability.
Consumer group — multiple consumers sharing partition load; used for horizontal scaling.
Consumer lag — how far behind a consumer is from the latest offset. High lag = processing can't keep up with ingestion (a back-pressure problem).

Stream processing semantics:
At-most-once — events may be lost (fire and forget).
At-least-once — events may be processed multiple times (requires idempotent consumers).
Exactly-once — the hardest; Flink achieves it via distributed snapshots (a variant of the Chandy-Lamport algorithm).
Watermarks — a time threshold after which the engine assumes all events for a given window have arrived. Events arriving after the watermark are "late" and handled separately.
Back-pressure — when downstream processing is slower than upstream ingestion; Flink handles this natively by slowing the source.

Options C and D are competent but lack the decision framework between tools and the back-pressure discussion.
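To make watermarks and late events tangible, here is a toy pure-Python sketch of tumbling windows with a watermark. It is an illustration of the semantics, not how Flink is implemented; the window size, lateness bound, and event times are all invented.

```python
# Toy tumbling windows with a watermark = max event time seen - LATENESS.
# A window is emitted once the watermark passes its end; events older than
# the watermark are flagged as late (a real engine would side-output them).
from collections import defaultdict

WINDOW = 10    # tumbling window size (seconds of event time)
LATENESS = 5   # allowed out-of-orderness (seconds)

windows = defaultdict(int)   # window start -> event count
watermark = float("-inf")
emitted = set()

def on_event(event_time: int) -> None:
    global watermark
    if event_time < watermark:
        print(f"late event at t={event_time} (watermark={watermark})")
        return
    windows[event_time // WINDOW * WINDOW] += 1
    watermark = max(watermark, event_time - LATENESS)
    # Emit every window whose end the watermark has passed.
    for start in sorted(windows):
        if start + WINDOW <= watermark and start not in emitted:
            emitted.add(start)
            print(f"window [{start}, {start + WINDOW}): {windows[start]} events")

# t=8 is out of order but inside the lateness bound and still counts;
# t=5 arrives after window [0, 10) has already been emitted and is late.
for t in [1, 4, 12, 8, 17, 5, 26]:
    on_event(t)
```

Tightening LATENESS emits windows sooner but drops more stragglers into the late path; loosening it does the reverse. That trade-off is the point to make in the interview.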