📊 Data Engineering Language
6 exercise sets. Master the vocabulary for data pipelines, warehouses, streaming systems, and data quality discussions.
Data Pipeline Vocabulary
ETL vs. ELT, batch vs. streaming, pipeline stages, data lineage — core vocabulary for data pipeline discussions.
Data Warehouse Vocabulary
Fact table, dimension table, slowly changing dimension, star schema, OLAP vs. OLTP.
Streaming Data Language
Apache Kafka vocabulary: topic, partition, consumer group, offset, exactly-once semantics.
Data Quality Vocabulary
Data quality dimensions: completeness, accuracy, consistency, timeliness — vocabulary for data quality discussions.
dbt & Modern Data Stack Vocabulary
dbt model, ref(), source freshness, lineage, data contract vocabulary.
Data Incident Communication
Data pipeline failure communication, SLA breach, data freshness SLO, incident vocabulary for data teams.
Frequently Asked Questions
What is the difference between ETL and ELT in data engineering?
ETL (Extract, Transform, Load) transforms data before loading it into the target system, while ELT (Extract, Load, Transform) loads raw data first and performs transformations inside the warehouse. Modern cloud data warehouses like BigQuery and Snowflake favour ELT because their compute power makes in-warehouse transformation both fast and cost-effective.
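The ordering difference can be made concrete with a minimal sketch. It uses Python's built-in sqlite3 as a stand-in for a warehouse; the table names, column names, and sample rows are hypothetical and chosen only for illustration.

```python
import sqlite3

# Hypothetical raw rows: amounts arrive as untidy strings.
RAW_ORDERS = [("o1", " 19.50 "), ("o2", "5.00"), ("o3", " 12.25")]

def run_etl(conn):
    """ETL: clean the values in application code, then load the finished rows."""
    cleaned = [(order_id, float(amount.strip())) for order_id, amount in RAW_ORDERS]
    conn.execute("CREATE TABLE orders_etl (order_id TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders_etl VALUES (?, ?)", cleaned)

def run_elt(conn):
    """ELT: load the raw strings untouched, then transform with SQL in the database."""
    conn.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?)", RAW_ORDERS)
    conn.execute(
        "CREATE TABLE orders_elt AS "
        "SELECT order_id, CAST(TRIM(amount) AS REAL) AS amount FROM raw_orders"
    )

conn = sqlite3.connect(":memory:")  # in-memory database standing in for a warehouse
run_etl(conn)
run_elt(conn)
etl_rows = conn.execute("SELECT * FROM orders_etl ORDER BY order_id").fetchall()
elt_rows = conn.execute("SELECT * FROM orders_elt ORDER BY order_id").fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
```

Both paths end with the same cleaned table. Only the ELT path also keeps the raw rows in the database, so the transformation can be revised and re-run later without extracting again.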
How do data engineers describe data lineage in technical discussions?
Data lineage refers to the ability to trace the origin, movement, and transformation of data throughout a pipeline. Engineers typically say a dataset has "full lineage" when every upstream dependency and transformation step is documented and visualised, often through tools like dbt's DAG or Apache Atlas.
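As a rough illustration of "tracing upstream", the sketch below walks a hypothetical lineage map and collects every dataset a given table depends on. The dataset names are invented, and real tools such as dbt derive this graph from the project itself rather than from a hand-written dictionary.

```python
# Hypothetical lineage map: each dataset lists the datasets it is built from.
UPSTREAM = {
    "revenue_report": ["fct_orders", "dim_customers"],
    "fct_orders": ["stg_orders"],
    "dim_customers": ["stg_customers"],
    "stg_orders": ["raw_orders"],
    "stg_customers": ["raw_customers"],
}

def trace_lineage(dataset, upstream=UPSTREAM):
    """Return every dataset that `dataset` depends on, directly or indirectly."""
    seen = set()
    stack = list(upstream.get(dataset, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            # Keep walking: the parent's own sources are also part of the lineage.
            stack.extend(upstream.get(parent, []))
    return seen
```

Calling `trace_lineage("fct_orders")` returns the staging and raw tables behind the fact table, which is the kind of answer engineers expect when they ask for a dataset's lineage.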
What vocabulary is used to discuss data quality dimensions?
Commonly cited data quality dimensions include completeness, accuracy, consistency, timeliness, and validity; some frameworks add uniqueness as a sixth. In team discussions, engineers use phrases like "the dataset fails our completeness threshold" or "we have a timeliness SLA of two hours for this table" to communicate quality expectations precisely.
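Two of these dimensions are easy to express as checks. The following is a minimal sketch, assuming rows are plain dictionaries and that "complete" simply means "not None"; the function names are hypothetical, and the thresholds a team applies to the results are its own choice.

```python
from datetime import datetime, timedelta

def completeness(rows, field):
    """Share of rows in which `field` is present and not None."""
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if row.get(field) is not None)
    return filled / len(rows)

def is_timely(last_loaded_at, now, max_age):
    """True when the most recent load is no older than `max_age`."""
    return now - last_loaded_at <= max_age

# Hypothetical sample: one of four rows is missing an email address.
SAMPLE_ROWS = [
    {"email": "a@example.com"},
    {"email": None},
    {"email": "b@example.com"},
    {"email": "c@example.com"},
]
email_completeness = completeness(SAMPLE_ROWS, "email")
```

With a two-hour timeliness SLA, `is_timely(last_load, now, timedelta(hours=2))` is the check behind a statement like "this table is within its timeliness SLA".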
What does "idempotent pipeline" mean and why does it matter?
An idempotent pipeline produces the same result regardless of how many times it is run with the same input. This property is critical for safe re-runs after failures — engineers describe a pipeline as idempotent when reprocessing a day's data does not create duplicate records or corrupt aggregates.
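One common way to get this property is to replace a whole day's partition rather than append to it. The sketch below shows that pattern with Python's built-in sqlite3; the `daily_sales` table and the `load_day` helper are hypothetical, and other approaches (such as keyed upserts) can achieve the same guarantee.

```python
import sqlite3

def load_day(conn, day, amounts):
    """Replace all rows for `day` in one transaction so re-runs do not duplicate."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("DELETE FROM daily_sales WHERE sale_date = ?", (day,))
        conn.executemany(
            "INSERT INTO daily_sales (sale_date, amount) VALUES (?, ?)",
            [(day, amount) for amount in amounts],
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (sale_date TEXT, amount REAL)")

load_day(conn, "2024-01-01", [10.0, 20.0])
load_day(conn, "2024-01-01", [10.0, 20.0])  # re-run, e.g. after a failure

row_count, total = conn.execute(
    "SELECT COUNT(*), SUM(amount) FROM daily_sales"
).fetchone()
```

After the second run the table still holds two rows and the same total, which is what "safe to re-run" means in practice. A plain append would have left four rows and a doubled aggregate.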
How do you explain the difference between a data warehouse and a data lakehouse?
A data warehouse stores structured, processed data optimised for analytical queries, while a data lake stores raw data in its native format at low cost. A lakehouse combines both approaches by adding a metadata and governance layer (such as Delta Lake or Apache Iceberg) on top of object storage, enabling ACID transactions and schema enforcement without sacrificing flexibility.
What language do engineers use to discuss batch vs. streaming data processing?
Batch processing refers to collecting data over a period and processing it as a group on a schedule, while streaming processes each event as it arrives in near-real-time. Engineers say a pipeline is "latency-sensitive" when it requires streaming, and "throughput-optimised" when batch processing is sufficient.
What is a "slowly changing dimension" and how is it discussed in schema design?
A slowly changing dimension (SCD) is a dimension table in a data warehouse where attribute values change infrequently over time — for example, a customer's address or job title. Engineers distinguish between SCD Type 1 (overwrite the old value), Type 2 (add a new row to preserve history), and Type 3 (add a column for the previous value) when designing dimension tables.
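The three types differ only in what happens to the old value. Here is a minimal sketch over an in-memory list of rows; the column names (`valid_from`, `valid_to`, `is_current`, `previous_address`) are common conventions but are assumptions here, not a fixed standard.

```python
from datetime import date

def new_dimension():
    """Hypothetical customer dimension with a single current row."""
    return [{
        "customer_id": 1,
        "address": "12 Oak St",
        "valid_from": date(2023, 1, 1),
        "valid_to": None,
        "is_current": True,
    }]

def apply_type1(rows, customer_id, new_address):
    """Type 1: overwrite the value in place; no history is kept."""
    for row in rows:
        if row["customer_id"] == customer_id:
            row["address"] = new_address

def apply_type2(rows, customer_id, new_address, changed_on):
    """Type 2: close the current row, then append a new current row."""
    for row in rows:
        if row["customer_id"] == customer_id and row["is_current"]:
            row["valid_to"] = changed_on
            row["is_current"] = False
    rows.append({
        "customer_id": customer_id,
        "address": new_address,
        "valid_from": changed_on,
        "valid_to": None,
        "is_current": True,
    })

def apply_type3(rows, customer_id, new_address):
    """Type 3: keep only the immediately previous value in an extra column."""
    for row in rows:
        if row["customer_id"] == customer_id:
            row["previous_address"] = row["address"]
            row["address"] = new_address
```

Type 2 is the only variant that grows the table, which is why it is the one usually meant when engineers say a dimension "preserves history".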
How do data teams communicate about pipeline orchestration tools?
Orchestration tools like Apache Airflow and Dagster schedule and monitor pipeline execution, managing dependencies between tasks as directed acyclic graphs (DAGs). Engineers use terms like "DAG run", "task instance", "upstream dependency", and "sensor" when discussing pipeline scheduling and failure handling in Airflow.
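The idea of "managing dependencies as a DAG" can be sketched without any orchestrator installed. The example below uses `graphlib.TopologicalSorter` from the Python standard library (3.9+) to derive a valid run order; it is not the Airflow or Dagster API, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
PIPELINE = {
    "extract_orders": set(),
    "extract_customers": set(),
    "build_fact_orders": {"extract_orders", "extract_customers"},
    "publish_report": {"build_fact_orders"},
}

def execution_order(dag):
    """Return the tasks in an order where every upstream dependency runs first."""
    return list(TopologicalSorter(dag).static_order())

order = execution_order(PIPELINE)
```

In this sketch `publish_report` always comes last and both extract tasks come before `build_fact_orders`. A real orchestrator adds scheduling, retries, and state tracking on top of the same ordering rule.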
What does "data contract" mean in a modern data stack context?
A data contract is a formal agreement between data producers and consumers specifying the schema, semantics, and quality guarantees of a dataset. It typically defines field names, data types, freshness SLOs, and allowed null rates, giving downstream teams confidence that breaking changes will be communicated proactively rather than discovered in production.
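A contract of this kind can be written down as data and checked against each batch. The sketch below is one possible shape, assuming rows are dictionaries; `FieldSpec`, `check_contract`, and the example contract are hypothetical names, not part of any particular data contract tool.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FieldSpec:
    name: str
    dtype: type
    max_null_rate: float  # allowed share of None values, from 0.0 to 1.0

# Hypothetical contract for an orders dataset.
ORDERS_CONTRACT = [
    FieldSpec("order_id", str, 0.0),
    FieldSpec("amount", float, 0.5),
]

def check_contract(rows, contract):
    """Return a list of violations; an empty list means the batch passes."""
    violations = []
    for spec in contract:
        values = [row.get(spec.name) for row in rows]
        nulls = sum(1 for value in values if value is None)
        null_rate = nulls / len(rows) if rows else 0.0
        if null_rate > spec.max_null_rate:
            violations.append(
                f"{spec.name}: null rate {null_rate:.2f} exceeds {spec.max_null_rate:.2f}"
            )
        if any(v is not None and not isinstance(v, spec.dtype) for v in values):
            violations.append(f"{spec.name}: expected {spec.dtype.__name__}")
    return violations
```

A producer could run `check_contract(batch, ORDERS_CONTRACT)` before publishing, so that a violation blocks the release instead of surfacing downstream.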
How do engineers discuss Kafka consumer groups and partition assignment?
In Apache Kafka, a consumer group is a set of consumers that collectively read from a topic, with each partition assigned to exactly one consumer in the group at a time. Because a partition cannot be shared within a group, any consumers beyond the partition count sit idle. Engineers say a topic is "under-partitioned" when the partition count is too low for consumers to scale horizontally, and "over-partitioned" when it has far more partitions than its throughput requires, which adds broker and rebalancing overhead.
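The one-owner-per-partition rule can be illustrated with a simplified round-robin sketch. It is plain Python, not the Kafka client, and real Kafka assignors (range, round-robin, sticky, cooperative) are more involved; the function name and consumer labels are hypothetical.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions across consumers round-robin; each partition gets one owner."""
    if not consumers:
        return {}
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment

# Four partitions, two consumers: each consumer owns two partitions.
balanced = assign_partitions([0, 1, 2, 3], ["c1", "c2"])

# Two partitions, three consumers: the third consumer receives nothing.
too_few_partitions = assign_partitions([0, 1], ["c1", "c2", "c3"])
```

The second case shows why partition count caps parallelism within a group: adding a third consumer to a two-partition topic gives it no work.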