Data Engineering Vocabulary: 70 Terms Every Data Engineer Must Know
Essential data engineering vocabulary: ETL, ELT, DAGs, data pipelines, Spark, Kafka, dbt, data contracts, lakehouse architecture, and 60 more terms with examples.
Data engineers build the pipelines that move, transform, and store data at scale. The vocabulary spans distributed systems, SQL, orchestration, cloud storage, and data modelling — with heavy overlap with DevOps and architecture. This guide covers the 70 terms you’ll encounter daily as a data engineer.
Data Movement Patterns
ETL (Extract, Transform, Load)
ETL is the traditional data integration pattern: extract data from source systems, transform it (clean, join, aggregate), then load it into a data warehouse.
“Our ETL pipeline runs nightly — it extracts from the transactional database, normalises the schema, then loads into Redshift.”
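A minimal sketch of the pattern in Python, using SQLite files as stand-ins for the transactional source and the warehouse (the table and file names are illustrative):

```python
# Minimal ETL sketch: extract from a source system, transform in memory, load to a target.
import sqlite3
import pandas as pd

# Extract: pull raw orders from the transactional database
source = sqlite3.connect("transactions.db")
orders = pd.read_sql("SELECT order_id, customer_id, amount, created_at FROM orders", source)

# Transform: clean and aggregate before anything touches the warehouse
orders["created_at"] = pd.to_datetime(orders["created_at"])
orders = orders.dropna(subset=["customer_id"])
daily = orders.groupby(orders["created_at"].dt.date)["amount"].sum().reset_index(name="revenue")

# Load: write the cleaned, aggregated result into the analytical store
warehouse = sqlite3.connect("warehouse.db")
daily.to_sql("daily_revenue", warehouse, if_exists="replace", index=False)
```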
ELT (Extract, Load, Transform)
ELT reverses the order: extract raw data, load it directly into a data warehouse or lake, then transform using SQL inside the warehouse. Enabled by modern cloud warehouses (BigQuery, Snowflake).
“We switched to ELT — raw data lands in BigQuery first, then dbt handles all transformations.”
CDC (Change Data Capture)
CDC captures row-level changes (INSERT, UPDATE, DELETE) in a source database and streams them downstream. Avoids full table scans. Common tools: Debezium, AWS DMS.
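What "streaming changes downstream" looks like in practice: each change arrives as an event describing the operation and the row state. A simplified, Debezium-style event shape applied to a local replica (the field names mirror the common op / before / after envelope, but are illustrative here):

```python
# Apply simplified CDC change events to an in-memory replica keyed by primary key.
replica = {}  # primary key -> current row

def apply_change(event: dict) -> None:
    op = event["op"]  # "c" = insert, "u" = update, "d" = delete
    if op in ("c", "u"):
        row = event["after"]
        replica[row["id"]] = row
    elif op == "d":
        replica.pop(event["before"]["id"], None)

apply_change({"op": "c", "before": None, "after": {"id": 1, "email": "a@example.com"}})
apply_change({"op": "u", "before": {"id": 1}, "after": {"id": 1, "email": "b@example.com"}})
apply_change({"op": "d", "before": {"id": 1}, "after": None})
```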
Incremental Load
An incremental load processes only new or changed records since the last run — faster and cheaper than a full load. Requires a reliable watermark (updated-at timestamp, sequence ID).
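A sketch of the watermark pattern, again using SQLite and pandas as stand-ins (a real pipeline would usually keep the watermark in a metadata table rather than a text file):

```python
# Incremental load sketch: pull only rows changed since the last successful run.
import pathlib
import sqlite3
import pandas as pd

WATERMARK_FILE = pathlib.Path("orders_watermark.txt")

def incremental_load(source: sqlite3.Connection, target: sqlite3.Connection) -> None:
    # Read the high-water mark left by the previous run (epoch start on the first run)
    last = WATERMARK_FILE.read_text() if WATERMARK_FILE.exists() else "1970-01-01T00:00:00"

    # Extract only rows updated after the watermark -- no full table scan
    changed = pd.read_sql(
        "SELECT * FROM orders WHERE updated_at > ?",
        source,
        params=(last,),
    )
    if changed.empty:
        return

    changed.to_sql("orders", target, if_exists="append", index=False)

    # Advance the watermark only after the load succeeds, so a failed run is safely retried
    WATERMARK_FILE.write_text(str(changed["updated_at"].max()))
```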
Full Load / Full Refresh
A full load replaces the entire target table with a fresh extract from the source. Simpler but expensive for large tables.
Pipelines & Orchestration
Pipeline
A data pipeline is a sequence of steps that move and transform data from source to destination. Steps can include extraction, cleaning, enrichment, aggregation, and loading.
DAG (Directed Acyclic Graph)
A DAG is a pipeline structure where tasks are nodes and dependencies are directed edges — with no cycles (a task cannot depend on a downstream task). Apache Airflow and Prefect model pipelines as DAGs.
“The quarterly report DAG has 12 tasks — three extract jobs, eight transforms, and one load to the warehouse.”
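A smaller version of that structure in Airflow (assuming a recent Airflow 2.x release; the task bodies are placeholder no-ops, since the dependency graph is the point here):

```python
# A small Airflow DAG: two extracts feeding one transform, then a load.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def noop(**_):
    pass

with DAG(
    dag_id="quarterly_report",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_orders = PythonOperator(task_id="extract_orders", python_callable=noop)
    extract_customers = PythonOperator(task_id="extract_customers", python_callable=noop)
    transform = PythonOperator(task_id="transform", python_callable=noop)
    load = PythonOperator(task_id="load_warehouse", python_callable=noop)

    # Directed edges, no cycles: extracts -> transform -> load
    [extract_orders, extract_customers] >> transform >> load
```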
Orchestrator
An orchestrator manages DAG scheduling, retries, alerting, and backfills. Examples: Apache Airflow, Prefect, Dagster, Mage. The orchestrator does not execute data processing — it triggers workers.

Backfill
A backfill is re-running pipeline tasks for historical time periods. Common when a pipeline is first deployed or when a bug is fixed.
“After fixing the currency conversion bug, we backfilled the last 90 days of transaction data.”
Idempotency
An idempotent pipeline produces the same result regardless of how many times it runs for the same input. Critical for safe retries and backfills.
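One common way to get there is delete-then-insert (or overwrite) of the partition being processed, sketched below with illustrative table and column names:

```python
# Idempotent load sketch: delete-then-insert the day being processed,
# so re-running the same input never produces duplicate rows.
import sqlite3
import pandas as pd

def load_day(conn: sqlite3.Connection, day: str, rows: pd.DataFrame) -> None:
    # Remove anything a previous (possibly failed) run already wrote for this day
    conn.execute("DELETE FROM daily_sales WHERE sale_date = ?", (day,))
    conn.commit()
    # Re-insert the full partition; running this twice leaves the same end state
    rows.to_sql("daily_sales", conn, if_exists="append", index=False)
```

In production the delete and insert would usually share a single transaction, or be replaced by a MERGE or partition overwrite, but the shape is what makes retries and backfills safe.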
SLA (Pipeline SLA)
A pipeline SLA defines the maximum acceptable time from data availability at the source to its availability in the warehouse. A breach triggers alerts and escalation.
Data Storage Architecture
Data Warehouse
A data warehouse is a centralised analytical store optimised for querying structured, cleaned, historical data. Typically columnar. Examples: Snowflake, BigQuery, Redshift, Databricks SQL.
Data Lake
A data lake stores raw, unprocessed data in its native format (CSV, JSON, Parquet, Avro) at low cost. Supports structured, semi-structured, and unstructured data.
Lakehouse
A lakehouse combines the low-cost storage of a data lake with the query performance and ACID transactions of a data warehouse. Examples: Delta Lake (Databricks), Apache Iceberg, Apache Hudi.
Data Mart
A data mart is a subject-specific subset of a warehouse optimised for a particular team’s queries — e.g., the marketing data mart or the finance data mart.
OLTP vs. OLAP
- OLTP (Online Transaction Processing) — optimised for high-volume, low-latency read/write transactions. PostgreSQL, MySQL.
- OLAP (Online Analytical Processing) — optimised for complex aggregations over large datasets. Snowflake, BigQuery.
Columnar Storage
Columnar storage stores data column-by-column rather than row-by-row. Dramatically faster for analytical queries that aggregate specific columns across millions of rows.
Parquet
Apache Parquet is the dominant open-source columnar file format for data engineering. Compressed, schema-embedded, and supported by all major processing engines.
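A quick illustration of column pruning with pandas (writing Parquet requires an engine such as pyarrow): only the columns a query asks for are read from disk.

```python
# Columnar access with Parquet: write once, then read only the columns a query needs.
import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 2, 3],
    "event_type": ["click", "view", "click"],
    "payload": ["...", "...", "..."],  # wide column most queries never touch
})
events.to_parquet("events.parquet")  # requires pyarrow or fastparquet

# Column pruning: only the two requested columns are read from the file
clicks = pd.read_parquet("events.parquet", columns=["user_id", "event_type"])
```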
Data Modelling
Schema
A schema defines the structure of data — table names, column names, data types, and constraints. In data warehousing, schema also refers to a logical namespace grouping tables.
Star Schema
A star schema organises a data warehouse around a central fact table (measurements) surrounded by dimension tables (context: date, product, customer). Optimised for analytical queries.
Fact Table
A fact table records business events and measurements — sales transactions, page views, clicks. Each row is an event with numeric measures (amount, quantity) and foreign keys to dimension tables.
Dimension Table
A dimension table provides descriptive attributes for facts — customer name, product category, date hierarchy. Dimensions are used for filtering, grouping, and labelling in queries.
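The query shape over a star schema is always the same: join the fact table to its dimensions, then filter and aggregate. A tiny pandas sketch of that shape (in the warehouse this would be SQL; the tables here are illustrative):

```python
# Star-schema query shape: join facts to dimensions, then aggregate a measure.
import pandas as pd

fact_sales = pd.DataFrame({
    "product_id": [1, 1, 2],
    "date_id": [20240101, 20240102, 20240101],
    "amount": [9.99, 9.99, 24.00],       # numeric measures
})
dim_product = pd.DataFrame({
    "product_id": [1, 2],
    "category": ["books", "games"],      # descriptive attributes
})

# "Revenue by category": group on the dimension attribute, sum the fact's measure
revenue = (
    fact_sales.merge(dim_product, on="product_id")
              .groupby("category")["amount"].sum()
)
```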
Slowly Changing Dimension (SCD)
An SCD is a dimension whose attributes change over time. Type 1 overwrites the old value. Type 2 adds a new row with an effective date (preserves history). Type 3 adds a new column.
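A Type 2 change in miniature, with plain Python dicts standing in for dimension rows (the column names are illustrative):

```python
# SCD Type 2 sketch: an attribute change closes the current row and adds a new one,
# so history is preserved rather than overwritten.
import datetime as dt

dim_customer = [
    {"customer_id": 42, "city": "Leeds",
     "valid_from": dt.date(2022, 3, 1), "valid_to": None, "is_current": True},
]

def update_city(rows: list[dict], customer_id: int, new_city: str, today: dt.date) -> None:
    for row in rows:
        if row["customer_id"] == customer_id and row["is_current"]:
            # Type 2: close the old version instead of overwriting it
            row["valid_to"] = today
            row["is_current"] = False
    rows.append({"customer_id": customer_id, "city": new_city,
                 "valid_from": today, "valid_to": None, "is_current": True})

update_city(dim_customer, 42, "Bristol", dt.date(2024, 6, 1))
# dim_customer now holds both the historical Leeds row and the current Bristol row
```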
Normalisation vs. Denormalisation
Normalisation eliminates redundancy (good for OLTP). Denormalisation accepts redundancy for query performance (good for OLAP). Data warehouses are typically denormalised.
dbt (Data Build Tool)
dbt is the dominant SQL-based transformation tool for the modern data stack. Engineers write SQL SELECT statements; dbt compiles them into the CREATE TABLE AS or CREATE VIEW AS statements that materialise each model in the warehouse.
Key dbt concepts:
- Model — a SQL file representing a transformed table or view
- Ref — {{ ref('model_name') }} references another model, building the dependency graph automatically
- Test — built-in data quality tests (not null, unique, accepted values, relationships)
- Source — a raw table declared in YAML, used as the starting point for transformations
- Lineage — dbt auto-generates the DAG of model dependencies, visualised in the docs
“We have a three-layer dbt project: staging (light cleaning), intermediate (joins), and marts (business-ready tables).”
Medallion Architecture
The medallion architecture organises a data lake into three layers:
- Bronze — raw ingested data
- Silver — cleaned and validated
- Gold — aggregated, business-ready tables
Distributed Processing
Apache Spark
Apache Spark is the dominant large-scale data processing engine. It uses an in-memory, distributed computation model — far faster than Hadoop MapReduce for iterative workloads. It supports Python (PySpark), Scala, and SQL.
DataFrame
A DataFrame is Spark’s primary data abstraction — a distributed, tabular dataset with named columns and schema. Familiar to pandas users.
Partition
A partition is a chunk of a DataFrame distributed across worker nodes. Partition count determines parallelism. Too few → bottleneck; too many → coordination overhead.
Shuffle
A shuffle is the most expensive Spark operation — redistributing data across partitions during joins or aggregations. Minimising shuffles is a key Spark optimisation technique.
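A PySpark sketch tying DataFrames, partitions, and shuffles together (the input path is illustrative):

```python
# PySpark sketch: a groupBy triggers a shuffle; partition count controls parallelism.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("shuffle_demo").getOrCreate()

events = spark.read.parquet("s3://bucket/events/")   # illustrative path
print(events.rdd.getNumPartitions())                 # how the data is currently split

# Aggregation by key: rows with the same user_id must land on the same worker,
# so Spark redistributes data across partitions -- the shuffle
per_user = events.groupBy("user_id").agg(F.count("*").alias("event_count"))

# Explicitly controlling partition count (too few -> bottleneck, too many -> overhead)
rebalanced = events.repartition(200, "user_id")
```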
Streaming
Apache Kafka
Apache Kafka is the dominant event streaming platform. Producers publish messages to topics; consumers read from topics. Messages are retained for a configurable period, enabling replay.
Topic / Partition (Kafka)
A Kafka topic is a named log of messages. Topics are divided into partitions for parallelism. Each partition has an ordered, immutable sequence of messages with an offset.
Consumer Group
A consumer group allows multiple consumers to read from a topic in parallel, each consuming different partitions. Enables horizontal scaling of stream processing.
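How topics, partitions, and consumer groups fit together in code, assuming the kafka-python client and a broker on localhost (topic and group names are illustrative):

```python
# Kafka sketch: one producer publishing JSON events, one consumer in a group reading them.
import json
from kafka import KafkaConsumer, KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
# Publish an event to the "orders" topic; Kafka assigns it to one of the topic's partitions
producer.send("orders", {"order_id": 123, "amount": 9.99})
producer.flush()

# Consumers sharing a group_id split the topic's partitions between them,
# so adding consumers scales reading horizontally
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    group_id="billing-service",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:  # blocks and polls forever in this sketch
    print(message.partition, message.offset, message.value)
```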
Stream Processing
Stream processing continuously processes data as it arrives — rather than in batches. Tools: Apache Flink, Kafka Streams, Spark Structured Streaming.
Event-Driven Architecture
Event-driven architecture treats state changes as events published to a stream. Downstream services subscribe to relevant events and react asynchronously.
Data Quality & Governance
Data Contract
A data contract is a formal, versioned agreement between data producers and consumers defining schema, semantics, SLAs, and ownership.
“The payments team broke our pipeline when they renamed a column — we’re implementing data contracts to prevent silent schema changes.”
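A very small version of the idea: the expected schema is pinned in code next to the pipeline and checked on every load. In practice contracts are usually YAML specifications enforced by tooling; the column names and types below are illustrative.

```python
# Minimal data-contract check: validate incoming data against a versioned expectation.
import pandas as pd

PAYMENTS_CONTRACT = {          # column -> expected pandas dtype
    "payment_id": "int64",
    "amount": "float64",
    "currency": "object",
    "paid_at": "datetime64[ns]",
}

def enforce_contract(df: pd.DataFrame, contract: dict[str, str]) -> None:
    missing = set(contract) - set(df.columns)
    if missing:
        raise ValueError(f"Contract violation: missing columns {sorted(missing)}")
    wrong = {c: str(df[c].dtype) for c in contract if str(df[c].dtype) != contract[c]}
    if wrong:
        raise ValueError(f"Contract violation: unexpected types {wrong}")
```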
Data Lineage
Data lineage tracks the origin and transformation history of data — where it came from, what happened to it, and where it went. Essential for debugging, compliance, and impact analysis.
Data Catalogue
A data catalogue is a searchable inventory of all datasets — tables, columns, owners, descriptions, and lineage. Examples: DataHub, Apache Atlas, Alation.
Schema Registry
A schema registry stores message schemas (typically Avro, Protobuf, or JSON Schema) and enforces compatibility rules on them, so Kafka consumers don't break when producers change schemas.
Data Quality Checks
Data quality checks validate data against rules: not null, uniqueness, referential integrity, value ranges. Run in dbt tests, Great Expectations, or Soda.
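The same kinds of checks a dbt test or Great Expectations suite would run, written out by hand against a pandas DataFrame (the column names are illustrative):

```python
# Hand-rolled quality checks: not null, uniqueness, and value range,
# applied before the data is published downstream.
import pandas as pd

def run_quality_checks(orders: pd.DataFrame) -> list[str]:
    failures = []
    if orders["order_id"].isna().any():
        failures.append("order_id contains nulls")
    if orders["order_id"].duplicated().any():
        failures.append("order_id is not unique")
    if (orders["amount"] < 0).any():
        failures.append("amount has values below 0")
    return failures
```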
Data Mesh
Data mesh is a decentralised architecture where domain teams own and publish their data as products. Federated governance ensures discoverability and quality standards.
Useful Phrases
In pipeline reviews:
- “The ELT job is not idempotent — if it fails mid-run and retries, we get duplicate rows.”
- “This partition strategy causes data skew — the largest partition is 10× the average.”
In data governance discussions:
- “We need a data contract for the events topic — the schema has broken twice this quarter without notice.”
- “The lineage graph shows that deleting this table would break 14 downstream models.”
In stakeholder updates:
- “The pipeline met its SLA — data was available in the warehouse by 06:00 UTC.”
- “We’re seeing a 20% increase in late-arriving data from the mobile events stream — we’re adjusting our watermark window.”
Practice
Test your data engineering vocabulary with the Data Engineering exercise set — 5 exercises covering pipeline, warehouse, and orchestration terminology.
Explore the full Data Engineer learning path for exercises, interview prep, and writing scenarios.