Question 1

What is the difference between ETL and ELT in data engineering?

Accepted Answer

ETL (Extract, Transform, Load) transforms data before loading it into the target system, while ELT (Extract, Load, Transform) loads raw data first and performs transformations inside the warehouse. Modern cloud data warehouses such as BigQuery and Snowflake favour ELT because their scalable compute makes in-warehouse transformation fast and, in many cases, cost-effective.
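
As a rough illustration of the ordering difference, the sketch below runs the same cleaning step both ways, with an in-memory SQLite database standing in for the warehouse. The table names, the sample rows, and the `is_number` helper are assumptions made up for this example.

```python
import sqlite3

# Hypothetical raw extract: amounts arrive as untidy strings.
raw_orders = [("a1", " 19.99 "), ("a2", "5.00"), ("a3", "bad")]


def is_number(text):
    """Return True if the string can be parsed as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def run_etl(rows):
    """ETL: clean in application code, then load only the finished rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id TEXT, amount REAL)")
    cleaned = [(oid, float(amt)) for oid, amt in rows if is_number(amt)]
    conn.executemany("INSERT INTO orders VALUES (?, ?)", cleaned)
    return conn.execute(
        "SELECT order_id, amount FROM orders ORDER BY order_id"
    ).fetchall()


def run_elt(rows):
    """ELT: load the raw strings untouched, then transform with SQL in the database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?)", rows)
    conn.execute(
        """
        CREATE TABLE orders AS
        SELECT order_id, CAST(TRIM(amount) AS REAL) AS amount
        FROM raw_orders
        WHERE TRIM(amount) GLOB '[0-9]*'
        """
    )
    return conn.execute(
        "SELECT order_id, amount FROM orders ORDER BY order_id"
    ).fetchall()


etl_result = run_etl(raw_orders)
elt_result = run_elt(raw_orders)
```

Both paths end with the same two clean rows. The difference is that the ELT path also keeps `raw_orders` in the database, so the transformation can be revised and re-run later without extracting again.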

Question 2

How do data engineers describe data lineage in technical discussions?

Accepted Answer

Data lineage refers to the ability to trace the origin, movement, and transformation of data throughout a pipeline. Engineers typically say a dataset has full lineage when every upstream dependency and transformation step is documented and visualised, often through tools like dbt's DAG or Apache Atlas.
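
One way to picture lineage is as a graph walk from a dataset back to its sources. The sketch below is only an illustration: the dataset names and the `upstream` mapping are hypothetical, and tools such as dbt build this graph for you rather than asking you to write it by hand.

```python
# dataset -> direct upstream datasets (hypothetical names)
upstream = {
    "revenue_report": ["fct_orders"],
    "fct_orders": ["stg_orders", "stg_customers"],
    "stg_orders": ["raw.orders"],
    "stg_customers": ["raw.customers"],
}


def trace_lineage(dataset, graph):
    """Return every transitive upstream dependency of `dataset`."""
    seen = set()
    stack = [dataset]
    while stack:
        node = stack.pop()
        for parent in graph.get(node, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


lineage = trace_lineage("revenue_report", upstream)
```

Here `lineage` contains all five upstream datasets, from `fct_orders` back to the two raw tables.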

Question 3

What vocabulary is used to discuss data quality dimensions?

Accepted Answer

Five commonly cited data quality dimensions are completeness, accuracy, consistency, timeliness, and validity; some frameworks add uniqueness as a sixth. In team discussions, engineers use phrases like "the dataset fails our completeness threshold" or "we have a timeliness SLA of two hours for this table" to communicate quality expectations precisely.
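
To make the two example phrases concrete, here is a minimal sketch of a completeness check and a timeliness check. The 95% threshold, the sample rows, and the function names are assumptions for illustration only; real teams usually express these checks in a testing framework or in SQL.

```python
from datetime import datetime, timedelta


def completeness(rows, column):
    """Share of rows where `column` is present and not None."""
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if row.get(column) is not None)
    return filled / len(rows)


def is_timely(last_loaded_at, now, sla):
    """True if the table was refreshed within the agreed SLA window."""
    return now - last_loaded_at <= sla


rows = [
    {"user_id": 1, "email": "a@example.com"},
    {"user_id": 2, "email": None},
    {"user_id": 3, "email": "c@example.com"},
    {"user_id": 4, "email": "d@example.com"},
]

COMPLETENESS_THRESHOLD = 0.95  # hypothetical team threshold
score = completeness(rows, "email")
passes_completeness = score >= COMPLETENESS_THRESHOLD

# Loaded three hours ago against a two-hour SLA, so this check fails.
meets_sla = is_timely(
    last_loaded_at=datetime(2024, 1, 1, 9, 0),
    now=datetime(2024, 1, 1, 12, 0),
    sla=timedelta(hours=2),
)
```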

Question 4

What does "idempotent pipeline" mean and why does it matter?

Accepted Answer

An idempotent pipeline produces the same result regardless of how many times it is run with the same input. This property is critical for safe re-runs after failures: engineers describe a pipeline as idempotent when reprocessing a day's data does not create duplicate records or corrupt aggregates.
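
A common way to get this property is to replace a whole partition rather than append to it. The sketch below shows the delete-then-insert pattern with an in-memory SQLite table; the `events` table, its columns, and the `load_day` function are hypothetical names chosen for the example.

```python
import sqlite3


def load_day(conn, day, rows):
    """Replace the whole partition for `day` so re-runs cannot duplicate rows."""
    with conn:  # commits on success, rolls back if an error is raised
        conn.execute("DELETE FROM events WHERE event_date = ?", (day,))
        conn.executemany(
            "INSERT INTO events (event_date, user_id, amount) VALUES (?, ?, ?)",
            [(day, user_id, amount) for user_id, amount in rows],
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_date TEXT, user_id TEXT, amount REAL)")

day_rows = [("u1", 10.0), ("u2", 2.5)]
load_day(conn, "2024-01-01", day_rows)
load_day(conn, "2024-01-01", day_rows)  # simulated re-run after a failure

row_count, total = conn.execute(
    "SELECT COUNT(*), SUM(amount) FROM events"
).fetchone()
```

After two runs the table still holds two rows and the total is unchanged. A plain `INSERT` without the preceding `DELETE` would have doubled both.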

Question 5

How do you explain the difference between a data warehouse and a data lakehouse?

Accepted Answer

A data warehouse stores structured, processed data optimised for analytical queries, while a data lake stores raw data in its native format at low cost. A lakehouse combines both approaches by adding a metadata and governance layer on top of object storage, enabling ACID transactions and schema enforcement without sacrificing flexibility.

Question 6

What language do engineers use to discuss batch vs. streaming data processing?

Accepted Answer

Batch processing refers to collecting data over a period and processing it as a group on a schedule, while streaming processes each event as it arrives, in near real time. Engineers describe a workload as latency-sensitive when results are needed within seconds or less, which usually calls for streaming, and as throughput-optimised when processing large volumes on a schedule is sufficient.
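
The contrast can be sketched with a toy running total. This is a deliberately simplified illustration with made-up function names, not how a streaming engine is implemented: the batch function sees the whole window at once, while the streaming function emits an updated result per event.

```python
events = [3, 5, 2]


def process_batch(collected):
    """Batch: wait for the whole window, then compute once."""
    return sum(collected)


def process_stream(incoming):
    """Streaming: update and emit a result as each event arrives."""
    running = 0
    for value in incoming:
        running += value
        yield running


batch_result = process_batch(events)
stream_results = list(process_stream(iter(events)))
```

The batch path yields one answer at the end of the window. The streaming path yields an intermediate answer after every event, which is what a latency-sensitive consumer needs.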

Question 7

What is a slowly changing dimension and how is it discussed in schema design?

Accepted Answer

A slowly changing dimension (SCD) is a dimension table in a data warehouse where attribute values change infrequently over time. Engineers distinguish between SCD Type 1 (overwrite), Type 2 (add a new row to preserve history), and Type 3 (add a column for the previous value) when designing dimension tables.
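
Type 2 is the variant that most often needs explaining, so here is a minimal in-memory sketch of it. The column names (`valid_from`, `valid_to`, `is_current`) follow a common convention but are assumptions here, as are the `apply_scd2` function and the sample customer.

```python
from datetime import date


def apply_scd2(dimension, customer_id, new_city, effective):
    """Close the current row if the attribute changed, then append a new current row."""
    current = next(
        (r for r in dimension if r["customer_id"] == customer_id and r["is_current"]),
        None,
    )
    if current is not None:
        if current["city"] == new_city:
            return dimension  # no change, nothing to record
        current["valid_to"] = effective
        current["is_current"] = False
    dimension.append(
        {
            "customer_id": customer_id,
            "city": new_city,
            "valid_from": effective,
            "valid_to": None,
            "is_current": True,
        }
    )
    return dimension


dim_customer = []
apply_scd2(dim_customer, 42, "Leeds", date(2023, 1, 1))
apply_scd2(dim_customer, 42, "York", date(2024, 6, 1))
apply_scd2(dim_customer, 42, "York", date(2024, 7, 1))  # unchanged, no new row
```

The dimension ends with two rows for customer 42: a closed "Leeds" row and a current "York" row. Type 1 would instead have overwritten the city in place and lost the history.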

Question 8

How do data teams communicate about pipeline orchestration tools?

Accepted Answer

Orchestration tools like Apache Airflow and Dagster schedule and monitor pipeline execution, managing dependencies between tasks as directed acyclic graphs (DAGs). Engineers use terms like DAG run, task instance, upstream dependency, and sensor when discussing pipeline scheduling and failure handling.
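
The dependency idea behind a DAG can be sketched with Python's standard-library `graphlib` (Python 3.9+). This is not Airflow or Dagster code, and the task names are hypothetical; it only shows how upstream dependencies constrain the order in which tasks may run.

```python
from graphlib import TopologicalSorter

# task -> set of upstream tasks it depends on (hypothetical names)
dag = {
    "extract_orders": set(),
    "extract_customers": set(),
    "join_orders_customers": {"extract_orders", "extract_customers"},
    "publish_report": {"join_orders_customers"},
}

# static_order() yields each task only after all of its upstream tasks.
run_order = list(TopologicalSorter(dag).static_order())
```

Whatever order the two extract tasks come out in, the join always follows both of them and the report always comes last. An orchestrator adds scheduling, retries, and monitoring on top of this ordering.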

Question 9

What does "data contract" mean in a modern data stack context?

Accepted Answer

A data contract is a formal agreement between data producers and consumers specifying the schema, semantics, and quality guarantees of a dataset. It typically defines field names, data types, freshness SLOs, and allowed null rates, giving downstream teams confidence that breaking changes will be communicated proactively.
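
As a rough sketch, a contract can be represented as data and checked against each batch. The `CONTRACT` structure, the field names, and the `check_contract` function below are invented for illustration; real contracts are often written in YAML or a schema language and enforced by dedicated tooling.

```python
# Hypothetical contract: expected type and allowed null rate per field.
CONTRACT = {
    "fields": {"order_id": str, "amount": float, "coupon": str},
    "max_null_rate": {"order_id": 0.0, "amount": 0.0, "coupon": 0.5},
}


def check_contract(rows, contract):
    """Return a list of human-readable violations (empty if the batch conforms)."""
    violations = []
    for name, expected_type in contract["fields"].items():
        values = [row.get(name) for row in rows]
        nulls = sum(v is None for v in values)
        null_rate = nulls / len(rows) if rows else 0.0
        if null_rate > contract["max_null_rate"][name]:
            violations.append(f"{name}: null rate {null_rate:.2f} exceeds limit")
        if any(v is not None and not isinstance(v, expected_type) for v in values):
            violations.append(f"{name}: expected {expected_type.__name__}")
    return violations


batch = [
    {"order_id": "a1", "amount": 19.99, "coupon": None},
    {"order_id": "a2", "amount": "5.00", "coupon": "SPRING"},
]
problems = check_contract(batch, CONTRACT)
```

Here the only violation is the string `"5.00"` in `amount`. The `coupon` null rate of 0.5 sits exactly at the allowed limit, so it passes.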

Question 10

How do engineers discuss Kafka consumer groups and partition assignment?

Accepted Answer

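In Apache Kafka, a consumer group is a set of consumers that collectively read from a topic, with each partition assigned to exactly one consumer in the group at a time. Having more partitions than consumers is normal, since one consumer can read several partitions. Engineers say a topic is over-partitioned when it has far more partitions than its throughput requires, which adds broker and rebalancing overhead, and under-partitioned when there are too few partitions for the group to scale horizontally, because any consumers beyond the partition count sit idle.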
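
The one-consumer-per-partition rule can be illustrated with a simplified round-robin assignment. This is a toy model, not Kafka's actual assignor implementation, and the `assign_round_robin` function and consumer names are hypothetical.

```python
def assign_round_robin(partitions, consumers):
    """Give each partition to exactly one consumer, cycling through the group."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


partitions = [0, 1, 2]

# Fewer consumers than partitions: some consumers read more than one partition.
small_group = assign_round_robin(partitions, ["c1", "c2"])

# More consumers than partitions: the extra consumer receives nothing.
large_group = assign_round_robin(partitions, ["c1", "c2", "c3", "c4"])
idle_consumers = [c for c, owned in large_group.items() if not owned]
```

With three partitions, a fourth consumer has nothing to read, which is why the partition count caps how far a consumer group can scale out.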
