Vocabulary for Data Engineers
Essential data engineering vocabulary explained in plain English: ETL vs ELT, data lakehouse, dbt, orchestration, data lineage, data quality — with examples.
Data engineering has undergone a significant transformation in the last five years. The vocabulary has evolved alongside the tooling — from traditional ETL pipelines to modern lakehouse architectures, from hand-rolled transformations to framework-driven analytics engineering. For non-native English speakers working in data, this evolving vocabulary can be confusing.
This guide covers the most important data engineering terms, with clear definitions and examples of how they are used in professional conversations and documentation.
Data Movement Patterns
ETL (Extract, Transform, Load)
In the traditional ETL pattern, data is extracted from source systems, transformed (cleaned, joined, aggregated) outside the destination, and then loaded into a data warehouse.
“Our legacy pipeline follows an ETL pattern — data is transformed in a staging area before being loaded into the warehouse.”
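The order of the three ETL steps can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the source rows, the column names, and the in-memory SQLite database standing in for the warehouse are all assumptions made for the example.

```python
import sqlite3

# Hypothetical source rows, standing in for data pulled from a source system.
source_rows = [
    {"order_id": 1, "amount": "19.5", "country": "de"},
    {"order_id": 2, "amount": "5.00", "country": "FR"},
    {"order_id": 3, "amount": None, "country": "de"},  # incomplete record
]


def extract():
    """Extract: read rows from the source system (here, an in-memory list)."""
    return list(source_rows)


def transform(rows):
    """Transform: clean and standardise rows outside the destination."""
    cleaned = []
    for row in rows:
        if row["amount"] is None:
            continue  # drop incomplete records before loading
        cleaned.append(
            (row["order_id"], float(row["amount"]), row["country"].upper())
        )
    return cleaned


def load(conn, rows):
    """Load: write the already-transformed rows into the destination table."""
    conn.execute(
        "CREATE TABLE orders (order_id INTEGER, amount REAL, country TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)


warehouse = sqlite3.connect(":memory:")  # stand-in for a real warehouse
load(warehouse, transform(extract()))
loaded = warehouse.execute(
    "SELECT order_id, amount, country FROM orders ORDER BY order_id"
).fetchall()
```

Note that the warehouse only ever sees cleaned rows: the incomplete record is dropped before the load step.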
ELT (Extract, Load, Transform)
ELT swaps the last two steps: data is loaded into the destination first, then transformed using the compute power of the warehouse itself. This is the dominant pattern in modern cloud data stacks.
“We switched to an ELT pattern when we moved to BigQuery — raw data lands in the landing zone, and dbt handles all transformations inside the warehouse.”
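A rough sketch of the ELT order, under the assumption that an in-memory SQLite database stands in for the warehouse and that the table and column names are invented for illustration. Raw data is loaded untouched, and the transformation is plain SQL executed inside the database.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")  # stand-in for a real warehouse

# Load: raw data lands as-is, with every value kept as text.
warehouse.execute(
    "CREATE TABLE raw_orders (order_id TEXT, amount TEXT, country TEXT)"
)
warehouse.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [("1", "19.5", "de"), ("2", "5.00", "FR"), ("3", None, "de")],
)

# Transform: the cleaning logic runs as SQL inside the warehouse.
warehouse.execute(
    """
    CREATE TABLE orders AS
    SELECT CAST(order_id AS INTEGER) AS order_id,
           CAST(amount AS REAL)      AS amount,
           UPPER(country)            AS country
    FROM raw_orders
    WHERE amount IS NOT NULL
    """
)

orders = warehouse.execute(
    "SELECT order_id, amount, country FROM orders ORDER BY order_id"
).fetchall()
raw_count = warehouse.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
```

Because the raw table is kept, the transformation can be changed and re-run later without extracting the data again.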
Data Pipeline
A data pipeline is a sequence of steps that move and transform data from source to destination.
“The ingestion pipeline runs every hour, pulling events from Kafka and landing them in S3.”
Storage Architecture
Data Warehouse
A data warehouse is a structured, relational system optimised for analytical queries. Examples: Snowflake, BigQuery, Redshift.
“Business analysts query the data warehouse directly using SQL.”
Data Lake
A data lake is a large repository of raw data in its native format — structured, semi-structured, and unstructured — stored cheaply in object storage.
“We dump all application logs and clickstream data into the data lake in S3 before processing.”
Data Lakehouse
A data lakehouse combines the low-cost storage of a data lake with the querying and governance capabilities of a data warehouse. It is the architecture behind platforms like Databricks, and it is typically built on open table formats such as Apache Iceberg.
“We’re migrating to a lakehouse architecture using Apache Iceberg — it gives us ACID transactions on top of S3 storage.”
Transformation and Modelling
dbt (Data Build Tool)
dbt (pronounced “dee-bee-tee”) is an open-source framework that allows data analysts and engineers to write SQL-based transformations, apply version control, and manage the dependency graph between models.
“All our transformation logic lives in dbt models. When we run dbt build, it executes the entire DAG from raw to mart.”
“We write dbt tests for every mart table — row count checks, not-null assertions, and uniqueness tests.”
Data Model
A data model describes the structure and relationships of data. In analytics, it typically refers to a SQL transformation that shapes raw data into a useful form.
“The orders model joins the raw_orders table with the customers dimension to produce a denormalised analytical model.”
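The example sentence above could look roughly like this in practice. The sketch assumes hypothetical columns for raw_orders and customers and uses an in-memory SQLite view in place of a real dbt model. The join itself is ordinary SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE raw_orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    CREATE TABLE customers (id INTEGER, name TEXT, country TEXT);

    INSERT INTO raw_orders VALUES (1, 10, 25.0), (2, 11, 40.0);
    INSERT INTO customers VALUES (10, 'Ana', 'PT'), (11, 'Ben', 'UK');

    -- The "orders" model: one denormalised row per order,
    -- with the customer attributes copied onto it.
    CREATE VIEW orders AS
    SELECT o.order_id,
           o.amount,
           c.name    AS customer_name,
           c.country AS customer_country
    FROM raw_orders AS o
    JOIN customers AS c ON c.id = o.customer_id;
    """
)

orders = conn.execute("SELECT * FROM orders ORDER BY order_id").fetchall()
```

Analysts can then query orders directly without repeating the join.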
DAG (Directed Acyclic Graph)
A DAG is a graph whose edges have a direction and which contains no cycles. In data engineering, it represents the dependencies of a pipeline: which tasks must complete before others can start.
“The dbt DAG shows that the orders mart depends on four upstream models.”
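To make the idea concrete, here is a small sketch that uses Python's standard graphlib module to turn a dependency graph into a valid run order. The model names are hypothetical, and this is not how dbt itself is implemented.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical models: each key depends on the models in its set.
dependencies = {
    "stg_orders": {"raw_orders"},
    "stg_customers": {"raw_customers"},
    "orders_mart": {"stg_orders", "stg_customers"},
}

# A valid execution order: every model appears after its dependencies.
run_order = list(TopologicalSorter(dependencies).static_order())

# "Acyclic" matters: a graph with a cycle has no valid order at all.
cyclic = {"a": {"b"}, "b": {"a"}}
try:
    list(TopologicalSorter(cyclic).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
```

Several orders can be valid for the same graph. The only guarantee is that upstream models come first.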
Orchestration
Orchestration
Orchestration is the scheduling and coordination of pipeline tasks, managing dependencies and retry logic.
“We use Airflow for orchestration — it manages the scheduling of our 200+ DAG tasks and handles retries and alerting.”
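The two responsibilities named in the definition, dependency order and retries, can be illustrated with a toy runner. This is a simplified sketch and not the Airflow API: the run_pipeline function, the task names, and the retry policy are assumptions for the example.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, dependencies, max_retries=2):
    """Run tasks in dependency order, retrying each failed task.

    Returns a log of (task name, attempt number, outcome) tuples.
    """
    log = []
    for name in TopologicalSorter(dependencies).static_order():
        for attempt in range(1, max_retries + 2):
            try:
                tasks[name]()
                log.append((name, attempt, "success"))
                break
            except Exception:
                log.append((name, attempt, "failed"))
        else:
            # Every attempt failed: stop so that downstream tasks do not run.
            raise RuntimeError(
                f"task {name!r} failed after {max_retries + 1} attempts"
            )
    return log


# A hypothetical task that fails once, then succeeds on the retry.
attempts = {"count": 0}


def flaky_transform():
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise ConnectionError("transient failure")


tasks = {
    "ingest": lambda: None,
    "transform": flaky_transform,
    "export": lambda: None,
}
dependencies = {"transform": {"ingest"}, "export": {"transform"}}

log = run_pipeline(tasks, dependencies)
```

A real orchestrator adds scheduling, alerting, and persistent state on top of this basic loop.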
Apache Airflow
Apache Airflow is one of the most widely used open-source orchestration platforms for data pipelines. Pipelines are defined in Python as DAGs of tasks.
“The Airflow DAG runs at 02:00 UTC each morning, triggering the ingestion, transformation, and export jobs in sequence.”
Data Quality and Governance
Data Lineage
Data lineage tracks where data came from, how it was transformed, and where it flows to. It is essential for debugging and compliance.
“The lineage graph shows that the revenue metric in the dashboard is derived from the orders table in the warehouse, which was last updated six hours ago.”
“We use dbt’s built-in lineage visualisation to trace data from raw source to BI tool.”
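At its core, a lineage graph is a mapping from each asset to the assets it is derived from. The sketch below walks such a mapping to find everything upstream of a dashboard. The asset names and the trace_upstream helper are hypothetical and exist only to illustrate the idea.

```python
# Hypothetical lineage edges: each asset maps to the assets it is derived from.
upstream = {
    "revenue_dashboard": ["orders_mart"],
    "orders_mart": ["stg_orders", "stg_customers"],
    "stg_orders": ["raw_orders"],
    "stg_customers": ["raw_customers"],
}


def trace_upstream(asset, graph):
    """Return every asset that `asset` depends on, directly or indirectly."""
    seen = set()
    stack = [asset]
    while stack:
        current = stack.pop()
        for parent in graph.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


sources = trace_upstream("revenue_dashboard", upstream)
```

Reversing the edges answers the opposite question, which is the one that matters during an incident: which downstream assets are affected when a given table is stale.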
Data Quality
Data quality refers to the accuracy, completeness, consistency, and timeliness of data.
“We run data quality checks after every pipeline run — null counts, duplicate checks, and referential integrity tests.”
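The three checks mentioned in the example sentence can be sketched in plain Python. The sample rows and helper names below are assumptions for illustration. Real pipelines usually express the same checks as SQL or as dbt tests.

```python
from collections import Counter

# Hypothetical sample data with one problem of each kind.
customers = [{"id": 1}, {"id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": None},  # null foreign key
    {"order_id": 11, "customer_id": 2},     # duplicate order_id
    {"order_id": 12, "customer_id": 99},    # no matching customer
]


def null_count(rows, column):
    """Completeness: how many rows have no value in `column`?"""
    return sum(1 for row in rows if row[column] is None)


def duplicate_values(rows, column):
    """Uniqueness: which values of `column` appear more than once?"""
    counts = Counter(row[column] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


def orphaned_keys(child_rows, child_col, parent_rows, parent_col):
    """Referential integrity: which child keys have no matching parent row?"""
    parent_keys = {row[parent_col] for row in parent_rows}
    return sorted(
        {
            row[child_col]
            for row in child_rows
            if row[child_col] is not None and row[child_col] not in parent_keys
        }
    )


report = {
    "null_customer_ids": null_count(orders, "customer_id"),
    "duplicate_order_ids": duplicate_values(orders, "order_id"),
    "orphaned_customer_ids": orphaned_keys(orders, "customer_id", customers, "id"),
}
```

A pipeline would typically fail the run, or raise an alert, when any of these values is non-zero or non-empty.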
Data Catalogue
A data catalogue is a searchable inventory of data assets — tables, columns, owners, and descriptions — that helps analysts find and understand data.
“The data catalogue shows that the customer_id column in the orders table matches the id in the customers table.”
Schema Evolution
Schema evolution is the management of changes to the structure of data over time — adding columns, renaming fields, changing data types.
“We use Apache Iceberg because it handles schema evolution gracefully — we can add columns without rewriting the entire table.”
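The simplest kind of schema evolution, adding a column, can be demonstrated with SQLite. This is only an analogy, since SQLite is not Apache Iceberg and the table here is invented for the example, but the visible behaviour is the same: rows written before the change stay readable and show no value for the new column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_id INTEGER, name TEXT)")
conn.execute("INSERT INTO events VALUES (1, 'signup')")

# Schema change: add a column. The existing row is not reloaded by us;
# it simply reads back with NULL in the new column.
conn.execute("ALTER TABLE events ADD COLUMN channel TEXT")
conn.execute("INSERT INTO events VALUES (2, 'purchase', 'web')")

rows = conn.execute(
    "SELECT event_id, name, channel FROM events ORDER BY event_id"
).fetchall()
columns = [info[1] for info in conn.execute("PRAGMA table_info(events)")]
```

Renaming fields and changing data types are harder cases, and support for them varies between storage formats.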
Useful Phrases for Data Engineering Discussions
Discussing Architecture
- “We’re evaluating a move to a lakehouse pattern to reduce the duplication between our data lake and warehouse.”
- “The ELT approach lets us load raw data quickly and iterate on transformations without re-ingesting.”
- “We use dbt for all our transformation logic — it gives us version control, testing, and lineage out of the box.”
Discussing Data Quality
- “The pipeline has a data quality gate — if any test fails, the downstream models don’t run.”
- “We track freshness SLOs for every mart table — the orders mart must be updated within two hours of source data landing.”
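A freshness SLO like the two-hour example above comes down to comparing timestamps. The following sketch assumes a hypothetical breaches_freshness_slo helper and made-up timestamps. How the landing and update times are recorded depends on the platform.

```python
from datetime import datetime, timedelta, timezone


def breaches_freshness_slo(source_landed_at, mart_updated_at, now,
                           slo=timedelta(hours=2)):
    """Return True if the mart missed its freshness SLO for this source data."""
    if mart_updated_at >= source_landed_at:
        # The mart already reflects the data: did the refresh take too long?
        return mart_updated_at - source_landed_at > slo
    # The mart has not been refreshed yet: has the allowed window passed?
    return now - source_landed_at > slo


# Hypothetical timestamps for one run.
landed = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
updated = landed + timedelta(hours=1, minutes=30)
checked = landed + timedelta(hours=5)

on_time = not breaches_freshness_slo(landed, updated, checked)
```

Here the mart was refreshed 90 minutes after the source data landed, so the SLO is met.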
Discussing Incidents
- “The dashboard was showing stale data because the Airflow task failed silently at 03:00. The lineage helped us trace which models were affected.”
Data engineering vocabulary is increasingly standardised across the industry, largely driven by the modern data stack (dbt, Snowflake/BigQuery, Airflow, Fivetran). Learning these terms will help you read documentation, contribute to design discussions, and communicate more precisely with data analysts, analytics engineers, and data scientists.