Data Engineer

Complete English Guide for Data Engineers

ETL/ELT pipeline communication, data modeling discussions, stakeholder reporting, SQL review vocabulary, data quality language, and the precise English of modern data platforms — dbt, Spark, Kafka, and the data lakehouse.

8 sections · 25+ internal practice links · Intermediate – Advanced

Why English Matters for Data Engineers

Data engineering is a discipline that sits at the intersection of software engineering, analytics, and the business domain. A data engineer builds the infrastructure that every other team depends on for insights — and that means communicating constantly across two very different audiences: the technical colleagues who will use and maintain the pipelines, and the business stakeholders who need to trust the data that flows through them.

The vocabulary of modern data engineering is dense with specialist terminology: ETL and ELT, data lakes and data lakehouses, warehouse schemas and fact tables, dbt models and materialisation strategies, Kafka topics and consumer groups, Spark jobs and partitioning strategies. When you are in an English-language meeting discussing schema design, troubleshooting a pipeline failure, or presenting a data quality report to a business analyst, you need to be able to use these terms precisely and fluently.

English is the dominant language of the data engineering ecosystem. dbt documentation, Databricks tutorials, Kafka design docs, Apache Spark guides, and the vast majority of community resources and conference talks (the Data Engineering Podcast, Locally Optimistic, Data Council) are all in English. Being able to read and contribute to discussions in these communities without a language barrier is a significant professional advantage.

The sections below cover the specific English vocabulary and communication patterns that data engineers need: the language of pipeline design and ETL, data modeling discussions, presenting data to stakeholders, reviewing SQL, managing data quality, and communicating in agile sprints. The final section covers the vocabulary most likely to arise in technical interviews at data-engineering-focused companies.

Section 1: ETL/ELT Pipeline Communication

The Extract-Transform-Load (ETL) and Extract-Load-Transform (ELT) patterns are the vocabulary framework for almost every data pipeline discussion. Being precise about which pattern you are using — and why — is an important engineering communication skill.

Describing the ETL/ELT Pattern

ETL pipelines process data before loading it: "We extract raw clickstream events from the Kafka topic, apply transformations to normalise the timestamp format and filter out bot traffic, and load the clean records into the Redshift fact table." ELT reverses the order: "We load raw API responses into the data lake first, and run the transformations later using dbt models inside the warehouse itself. This gives us the flexibility to rerun transformations without re-ingesting the source data."

Key ETL/ELT vocabulary: ingest (to bring raw data into a system), transform (to reshape or clean data), load (to write data to a destination), backfill (to reprocess historical data), orchestrate (to schedule and coordinate pipeline steps), idempotent (safe to rerun — produces the same result). Example: "The pipeline is idempotent — if it fails and we rerun it, the output is identical. We achieve this by upserting on the event ID rather than appending."
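The idea of "upserting on the event ID rather than appending" can be sketched in a few lines. This is a minimal illustration only, assuming an in-memory SQLite database and a hypothetical fact_events table; a real warehouse would typically use its own MERGE or upsert syntax.

```python
import sqlite3

def load_events(conn, events):
    """Upsert events keyed on event_id, so reruns do not create duplicates."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_events ("
        "event_id TEXT PRIMARY KEY, user_id TEXT, amount REAL)"
    )
    # INSERT OR REPLACE overwrites any existing row with the same primary key.
    conn.executemany(
        "INSERT OR REPLACE INTO fact_events (event_id, user_id, amount) "
        "VALUES (?, ?, ?)",
        events,
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
batch = [("e1", "u1", 10.0), ("e2", "u2", 25.5)]  # hypothetical sample events

load_events(conn, batch)
load_events(conn, batch)  # simulated rerun after a failure

row_count = conn.execute("SELECT COUNT(*) FROM fact_events").fetchone()[0]
print(row_count)
```

Because the load is keyed on event_id, the second run leaves the row count at two instead of doubling it, which is what "idempotent" means in practice.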

Discussing Pipeline Architecture

When designing or reviewing a pipeline architecture, the vocabulary becomes more architectural. "We need to decide between a batch pipeline and a streaming pipeline. Batch gives us simpler operations and easier backfilling, but introduces latency. Streaming with Kafka gives us near-real-time data but significantly increases operational complexity." Common architecture discussion phrases: "This pipeline has a single point of failure at the ingestion layer." / "We need to add a dead-letter queue for records that fail validation — right now they are silently dropped." / "The transformation step is tightly coupled to the source schema — any change to the upstream API will break it."
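To make the dead-letter queue phrase concrete, here is a rough sketch of routing invalid records aside instead of dropping them silently. The required fields and the in-memory lists are assumptions for illustration; in production the dead-letter destination would usually be a separate topic or table.

```python
from typing import Any

REQUIRED_FIELDS = ("event_id", "timestamp")  # hypothetical validation rule

def route_records(records: list[dict[str, Any]]):
    """Split records into valid rows and a dead-letter list with failure reasons."""
    valid, dead_letter = [], []
    for record in records:
        missing = [f for f in REQUIRED_FIELDS if record.get(f) is None]
        if missing:
            # Keep the original record and the reason so it can be inspected later.
            dead_letter.append({"record": record, "reason": f"missing fields: {missing}"})
        else:
            valid.append(record)
    return valid, dead_letter

records = [
    {"event_id": "e1", "timestamp": "2024-01-01T00:00:00Z"},
    {"event_id": None, "timestamp": "2024-01-01T00:05:00Z"},
    {"event_id": "e3"},
]
valid, dead_letter = route_records(records)
print(len(valid), len(dead_letter))
```

Storing the failure reason next to each rejected record is what makes the queue useful for manual inspection.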

Talking about pipeline monitoring: "We monitor the pipeline using Airflow's task duration metrics and set an SLA on the daily load job — if it hasn't completed by 6 a.m., we get paged." / "The data freshness SLA for this table is 4 hours — stakeholders expect the data to be current within 4 hours of the source system." / "We alert on data volume anomalies — if the row count drops more than 20% compared to the 7-day average, we assume the upstream source has an issue."
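The two monitoring rules quoted above (a 4-hour freshness SLA and a 20% volume drop against the 7-day average) reduce to simple comparisons. The sketch below uses hypothetical function names and sample numbers; it does not represent Airflow's or any other tool's API.

```python
from datetime import datetime, timedelta, timezone

def is_stale(last_loaded_at, now, sla=timedelta(hours=4)):
    """True when the table breaches its freshness SLA."""
    return now - last_loaded_at > sla

def volume_dropped(todays_rows, previous_counts, threshold=0.20):
    """True when today's row count is more than `threshold` below the trailing average."""
    average = sum(previous_counts) / len(previous_counts)
    return todays_rows < average * (1 - threshold)

now = datetime(2024, 1, 2, 6, 0, tzinfo=timezone.utc)
last_load = datetime(2024, 1, 2, 1, 0, tzinfo=timezone.utc)
last_7_days = [1000, 1020, 980, 1010, 990, 1000, 1000]  # hypothetical daily counts

stale = is_stale(last_load, now)            # last load was 5 hours ago
dropped = volume_dropped(750, last_7_days)  # 750 rows against an average of 1000
print(stale, dropped)
```

Both checks fire here: the data is five hours old against a four-hour SLA, and 750 rows is 25% below the trailing average.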

Section 2: Data Modeling Discussion Vocabulary

Data modeling discussions require a precise vocabulary for describing the structure, relationships, and trade-offs of a data design. Whether you are designing a star schema in a traditional warehouse, a medallion architecture in a lakehouse, or a dbt project structure, you need to be able to articulate your decisions clearly.

Warehouse Schema Language

The traditional data warehouse vocabulary centres on dimensional modeling: "We're building a star schema. The central fact table holds order events — each row is one line item with a quantity and a revenue amount. The fact table is surrounded by dimension tables for customers, products, dates, and geographic regions." Key terms: fact table (a table of measurements or events), dimension table (a table of descriptive attributes), grain (the level of detail of a row in a fact table), slowly changing dimension (a dimension where attributes change over time), surrogate key (a system-generated primary key, separate from the business key).

When discussing grain: "Before we design this fact table, we need to agree on the grain. Are we storing one row per order, or one row per order line item? The difference has a huge impact on how business users can aggregate the data." When discussing slowly changing dimensions: "The customer dimension needs Type 2 SCD treatment — we need to track historical addresses because we calculate regional revenue attribution using the address at the time of the order, not the current address."
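A Type 2 slowly changing dimension can be illustrated with plain Python dictionaries. This is a simplified sketch with hypothetical column names (valid_from, valid_to, surrogate_key); real implementations normally run as SQL MERGE statements or dbt snapshots.

```python
from datetime import date

def apply_scd2(dimension, customer_id, new_address, change_date):
    """Close the current row for the customer and append a new current row."""
    for row in dimension:
        if row["customer_id"] == customer_id and row["valid_to"] is None:
            if row["address"] == new_address:
                return dimension  # nothing changed, keep the current row open
            row["valid_to"] = change_date
    dimension.append({
        "surrogate_key": len(dimension) + 1,  # system-generated key
        "customer_id": customer_id,           # business key
        "address": new_address,
        "valid_from": change_date,
        "valid_to": None,                     # None marks the current row
    })
    return dimension

def address_on(dimension, customer_id, as_of):
    """Return the address that was valid on a given date."""
    for row in dimension:
        if (row["customer_id"] == customer_id and row["valid_from"] <= as_of
                and (row["valid_to"] is None or as_of < row["valid_to"])):
            return row["address"]
    return None

customer_dim = []
apply_scd2(customer_dim, "C42", "Berlin", date(2023, 1, 1))
apply_scd2(customer_dim, "C42", "Munich", date(2024, 3, 1))
print(address_on(customer_dim, "C42", date(2023, 6, 15)))
```

An order placed in June 2023 resolves to the Berlin row even after the customer has moved, which is the behaviour the regional revenue attribution example depends on.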

Lakehouse and dbt Vocabulary

Modern data platforms have introduced new vocabulary. The medallion architecture (bronze / silver / gold layers) describes data refinement: "The bronze layer holds raw, unprocessed data exactly as it arrived from the source. Silver applies basic cleaning and standardisation. Gold contains business-ready aggregates designed for specific analytical use cases."
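The three layers can be shown as a small chain of functions. The sample rows and cleaning rules below are assumptions chosen for illustration; in a real lakehouse each layer would be a stored table rather than a Python list.

```python
bronze = [  # raw rows exactly as received (hypothetical sample data)
    {"order_id": "1", "country": " de ", "revenue": "10.50"},
    {"order_id": "2", "country": "FR", "revenue": "20.00"},
    {"order_id": "2", "country": "FR", "revenue": "20.00"},  # duplicate delivery
    {"order_id": "3", "country": "DE", "revenue": "not-a-number"},
]

def to_silver(rows):
    """Clean and standardise: cast revenue, normalise country, drop bad rows and duplicates."""
    seen, silver = set(), []
    for row in rows:
        try:
            revenue = float(row["revenue"])
        except ValueError:
            continue  # a real pipeline might send this row to a dead-letter queue
        if row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        silver.append({
            "order_id": row["order_id"],
            "country": row["country"].strip().upper(),
            "revenue": revenue,
        })
    return silver

def to_gold(rows):
    """Business-ready aggregate: total revenue per country."""
    totals = {}
    for row in rows:
        totals[row["country"]] = totals.get(row["country"], 0.0) + row["revenue"]
    return totals

silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

Keeping bronze untouched means the silver and gold steps can be rerun with new rules without going back to the source system.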

dbt (data build tool) introduces its own terminology: "We're building the customer lifetime value metric as a dbt model. It sits in the silver layer and joins the orders model, the payments model, and the customer dimension. We'll use incremental materialisation: the first run builds the full table, and subsequent runs process only the new orders from the last 3 days." Key dbt terms: model (a SQL file that defines a transformation), materialisation (how a model is persisted: view, table, incremental, or ephemeral), ref() (a dbt function that references another model, building the DAG), test (a dbt assertion on the data, e.g., uniqueness or not-null), lineage (the dependency graph of models).
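How references between models turn into a DAG and a run order can be sketched with Python's standard graphlib module. This is not dbt's implementation; the model names are hypothetical, and the dictionary simply records which models each one would reference.

```python
from graphlib import TopologicalSorter

# Hypothetical models: each maps to the set of models it references.
model_refs = {
    "stg_orders": set(),
    "stg_payments": set(),
    "dim_customers": set(),
    "customer_ltv": {"stg_orders", "stg_payments", "dim_customers"},
    "ltv_dashboard": {"customer_ltv"},
}

# static_order() yields every model after all of its dependencies.
run_order = list(TopologicalSorter(model_refs).static_order())

def upstream(model, refs):
    """All models that feed into `model`, directly or indirectly (its lineage)."""
    found = set()
    for parent in refs[model]:
        found |= {parent} | upstream(parent, refs)
    return found

print(run_order)
print(sorted(upstream("ltv_dashboard", model_refs)))
```

The run order guarantees that customer_ltv is built only after the three models it depends on, and the upstream set is the lineage you would describe in a review.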

Section 3: Stakeholder Reporting Vocabulary

Data engineers frequently present work to business stakeholders — analysts, product managers, finance teams, and executives — who need to understand the impact of data infrastructure decisions without needing to understand the technical implementation. This requires translating technical vocabulary into business language.

Presenting Data Freshness and Reliability

Stakeholders care about whether they can trust the data and whether it is recent enough to make decisions. Communicating this requires specific vocabulary: "The orders table is updated on a 4-hour refresh cycle — the data in the dashboard is always at most 4 hours old. If you need real-time figures for intra-day decisions, we would need to build a streaming pipeline, which would be a 3-week engineering effort." / "We have a data quality monitoring framework that runs validation checks on every pipeline load. If any check fails — for example, if a primary key is duplicated, or if the record count drops more than 20% — the pipeline stops and alerts the on-call engineer. This means that when data is present in the dashboard, it has passed our quality gates."

When a pipeline fails and stakeholders notice stale data: "The daily load job failed at 2 a.m. due to a schema change in the upstream CRM system. We identified the issue at 6 a.m. and the fix has been deployed. The pipeline is currently backfilling the missed period — we expect the data to be fully up to date by 10 a.m. We will send a postmortem note by end of day." This kind of clear, structured communication — acknowledging the problem, explaining the cause, and giving a recovery timeline — is essential for maintaining stakeholder trust.

Communicating Migration and Infrastructure Changes

When data engineers are proposing a migration (e.g., moving from a Hadoop cluster to Databricks, or from a data warehouse to a lakehouse), stakeholders need to understand the business impact: "We are proposing migrating the analytics platform from Redshift to Databricks Delta Lake. The migration will take approximately six weeks. During the transition, the existing dashboards will continue to be served from Redshift. At the cutover point, all historical data will have been migrated, and new queries will run against Delta Lake. The primary benefit for the business is a 3x reduction in query time for the largest ad-hoc queries, and a 40% reduction in infrastructure cost." Key vocabulary: cutover (the moment of transition from old to new system), migration (moving data and workloads from one platform to another), parallel run (running old and new systems simultaneously to validate the output), decommission (to retire the old system once the new one has been validated).

Section 4: SQL Review and Code Review Language

SQL code review is a critical skill for data engineers. You need to be able to explain what a query does, identify inefficiencies, suggest improvements, and communicate these findings clearly in code review comments or pair-programming sessions.

Describing SQL Queries

When explaining a SQL query to a colleague: "This query joins the orders table with the customer dimension on the customer ID, filters to orders from the last 30 days using the event timestamp, groups by customer segment, and aggregates the total revenue and average order value per segment. The HAVING clause then filters to segments with at least 100 orders in the period." The key SQL description verbs: join, filter, aggregate, group by, sort by, partition by (window function), rank, deduplicate.

When reviewing a query for performance: "This query will be very slow on the orders table — it's doing a full table scan because it applies a function to the event_timestamp column in the WHERE clause, which prevents the partition pruning from working. We should rewrite it to filter using the date partition column directly." / "The subquery in the SELECT clause is a correlated subquery — it runs once for every row in the outer query. With 10 million rows, that will be catastrophic. We should refactor this as a LEFT JOIN or a window function." Common SQL review phrases: "This will be expensive because...", "We should add an index on...", "This query is not partition-aware — it scans all partitions.", "We can simplify this with a CTE."
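The suggested refactor from a correlated subquery to a LEFT JOIN can be checked on a toy dataset. The sketch below uses SQLite through Python's sqlite3 module with hypothetical customers and orders tables; it only shows that the two forms return the same rows, not how a particular warehouse would plan them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, revenue REAL);
INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');
INSERT INTO orders VALUES (10, 1, 50.0), (11, 1, 25.0), (12, 2, 40.0);
""")

# Correlated subquery: the inner SELECT is logically evaluated per customer row.
correlated = conn.execute("""
    SELECT c.name,
           (SELECT SUM(o.revenue) FROM orders o
            WHERE o.customer_id = c.customer_id) AS total_revenue
    FROM customers c
    ORDER BY c.customer_id
""").fetchall()

# Rewrite: aggregate the orders once, then LEFT JOIN the result.
joined = conn.execute("""
    SELECT c.name, t.total_revenue
    FROM customers c
    LEFT JOIN (SELECT customer_id, SUM(revenue) AS total_revenue
               FROM orders GROUP BY customer_id) t
           ON t.customer_id = c.customer_id
    ORDER BY c.customer_id
""").fetchall()

print(correlated == joined)
```

In a review comment, being able to say "the rewrite returns identical rows, including customers with no orders" makes the suggestion much easier to accept.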

Code Review Comments for dbt Models

Reviewing dbt models requires additional vocabulary: "The ref() call here creates a dependency on the orders model — make sure that model is tested before this one runs." / "This model is materialised as a view, but it's referenced by 12 downstream models and is quite expensive to compute. I'd suggest materialising it as a table or using incremental materialisation." / "The test coverage for this model is incomplete — there's no uniqueness test on the order_id column. Please add a dbt test." / "The model documentation is missing. Please add a schema.yml entry with a description for each column." Review vocabulary: materialisation, ref(), lineage, test coverage, schema.yml, incremental.

Section 5: Data Quality Discussion Vocabulary

Data quality is one of the most important and most underestimated responsibilities of a data engineer. Being able to discuss data quality clearly — with both technical colleagues and business stakeholders — requires a specific vocabulary.

Describing Data Quality Dimensions

Data quality is typically described along six dimensions: completeness (are all expected records present? Are required fields populated?), accuracy (do the values reflect reality?), consistency (does the data agree across systems?), timeliness (is the data fresh enough?), validity (do values conform to expected formats and ranges?), and uniqueness (are there duplicate records?). Example usage: "We have a completeness issue on the product description column — 23% of rows have a NULL value there." / "There's a consistency issue between the orders table and the fulfilment system — the order counts don't reconcile." / "The customer email field has a validity issue — 1.2% of rows contain values that are not valid email addresses."
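Statements such as "23% of rows have a NULL value" come from simple ratios. The following sketch computes three of the dimensions on hypothetical sample rows; the email pattern is deliberately simplistic and is an assumption, not a full address validator.

```python
import re

rows = [  # hypothetical customer rows
    {"customer_id": 1, "email": "ada@example.com", "description": "VIP"},
    {"customer_id": 2, "email": "not-an-email", "description": None},
    {"customer_id": 2, "email": "grace@example.com", "description": "New"},
    {"customer_id": 3, "email": "linus@example.com", "description": None},
]

EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # deliberately simple check

def completeness(rows, column):
    """Share of rows where the column is populated."""
    return sum(r[column] is not None for r in rows) / len(rows)

def validity(rows, column, pattern):
    """Share of populated values that match the expected format."""
    values = [r[column] for r in rows if r[column] is not None]
    return sum(bool(pattern.match(v)) for v in values) / len(values)

def duplicate_keys(rows, key):
    """Key values that appear more than once (a uniqueness problem)."""
    seen, dupes = set(), set()
    for r in rows:
        (dupes if r[key] in seen else seen).add(r[key])
    return dupes

print(completeness(rows, "description"))
print(validity(rows, "email", EMAIL_PATTERN))
print(duplicate_keys(rows, "customer_id"))
```

Each result maps directly onto one of the example sentences: a completeness percentage, a validity percentage, and a list of duplicated keys.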

Data quality vocabulary in practice: validate (check data against rules), reconcile (compare two datasets to confirm they agree), deduplicate (remove duplicate records), enforce a schema (require data to conform to a defined structure), anomaly detection (automatically identify unusual data patterns), data contract (a formal agreement between producer and consumer on the data schema and SLAs). Example: "We have implemented a data contract with the mobile app team. They commit to maintaining the event schema for 90 days before deprecating any field, and we commit to building our pipeline on the schema version they declare. This prevents the silent breaking changes that caused two pipeline outages last quarter."
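To reconcile two systems is, at its simplest, to compare totals key by key and list the differences. This is a minimal sketch with hypothetical daily order counts; the function name and tolerance parameter are assumptions for illustration.

```python
def reconcile(source_totals, target_totals, tolerance=0.01):
    """Return keys whose totals differ by more than `tolerance` or exist on one side only."""
    mismatches = {}
    for key in sorted(source_totals.keys() | target_totals.keys()):
        source = source_totals.get(key)
        target = target_totals.get(key)
        if source is None or target is None or abs(source - target) > tolerance:
            mismatches[key] = (source, target)
    return mismatches

# Hypothetical daily order counts from the two systems.
orders_system = {"2024-01-01": 120, "2024-01-02": 98, "2024-01-03": 143}
fulfilment_system = {"2024-01-01": 120, "2024-01-02": 95}

mismatches = reconcile(orders_system, fulfilment_system)
print(mismatches)
```

The output gives you the precise language for the report: which days do not reconcile, and by how much.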

Data Quality Monitoring Language

When setting up data quality monitoring: "We're using Great Expectations to define a suite of expectations for the orders table. The suite checks that the order_id is unique and not null, that the revenue amount is positive, that the status column is in the set of valid values, and that the daily row count is within 20% of the 7-day average. If any expectation fails, the pipeline fails and an alert fires." / "We've added a data freshness check to the Grafana dashboard — it shows the time since the last successful pipeline run for each critical table. The SLA for the orders table is 4 hours."

Section 6: Sprint Planning for Data Teams

Data engineering teams typically work in agile sprints, but the nature of data work creates some specific communication challenges around estimation, scope, and stakeholder expectations that differ from pure software engineering teams.

Estimating Data Engineering Work

Estimation is particularly challenging for data engineering because data quality issues, schema changes in source systems, and infrastructure limitations can significantly increase the effort required. Communicating this uncertainty clearly is essential: "I'm estimating this pipeline at 5 days of engineering effort, but there are two significant unknowns. First, I don't know the quality of the source data — we'll only discover issues when we start ingesting, and fixing data quality problems can add 2-3 days. Second, the source system's schema is not documented. I've built in a spike to spend 1 day understanding it before committing to the full estimate."

Sprint planning vocabulary specific to data teams: spike (a time-boxed investigation), data discovery (understanding a new source system's schema and data quality), proof of concept (PoC) (a minimal implementation to validate technical feasibility), backfill (filling historical data — often a separate work item), scope creep (the pipeline scope expanding beyond original requirements). Example: "I want to flag potential scope creep on this ticket. The original requirement was to build a pipeline for the EU orders. After talking to the analytics team, they now also want historical data going back 3 years. The backfill alone is a separate week of work — should we create a separate ticket for it?"

Communicating Blockers and Dependencies

Data pipelines are highly dependent on upstream systems and teams. Communicating blockers clearly and escalating appropriately is a critical skill: "I'm blocked on the Salesforce pipeline: I need read access to the Opportunities object in the production Salesforce instance. I've raised a request with the Salesforce admin team but haven't heard back in 3 days. Can you help escalate this?" / "This task depends on the mobile app team completing their event tracking implementation. They are scheduled to deploy it next Tuesday, but I want to flag a risk: if they slip, our sprint commitment is in jeopardy. I'll let you know as soon as I hear anything."

Section 7: Interview Vocabulary for Data Engineers

Data engineering interviews test both technical depth and communication ability. You need to articulate complex architectural decisions, discuss trade-offs, and explain your past work clearly to interviewers who may have a different technical background.

Core Technical Vocabulary

The interview vocabulary for data engineering centres on a set of key technical concepts. Data warehouse vs data lake vs data lakehouse: "A data warehouse stores structured, processed data in a schema-on-write model. A data lake stores raw data in its native format — schema-on-read — giving flexibility but requiring more transformation work downstream. A data lakehouse combines the structured query capabilities and ACID transactions of a warehouse with the low-cost raw storage of a lake, using formats like Delta Lake or Apache Iceberg."
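The difference between schema-on-write and schema-on-read is easier to explain with a small contrast. The sketch below is a simplification with a hypothetical two-column schema: the first function rejects bad records at load time, the second accepts raw JSON lines and applies the schema only when reading.

```python
import json

SCHEMA = {"order_id": int, "revenue": float}  # hypothetical table schema

def write_with_schema(table, record):
    """Schema-on-write: reject records that do not match the schema at load time."""
    for column, expected_type in SCHEMA.items():
        if not isinstance(record.get(column), expected_type):
            raise ValueError(f"{column} must be {expected_type.__name__}")
    table.append(record)

def read_with_schema(raw_lines):
    """Schema-on-read: store anything, apply the schema only when querying."""
    for line in raw_lines:
        payload = json.loads(line)
        try:
            yield {"order_id": int(payload["order_id"]),
                   "revenue": float(payload["revenue"])}
        except (KeyError, TypeError, ValueError):
            continue  # malformed rows only surface at read time

warehouse_table = []
write_with_schema(warehouse_table, {"order_id": 1, "revenue": 9.99})

lake_files = ['{"order_id": "2", "revenue": "19.5"}', '{"order_id": "x"}']
lake_rows = list(read_with_schema(lake_files))
print(lake_rows)
```

The lake accepted both raw lines, and the malformed one was only discovered when the data was read, which is the "more transformation work downstream" mentioned in the interview answer.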

Batch vs streaming: "Batch processing runs on a schedule (daily or hourly) and is simpler to operate but introduces latency. Stream processing with Kafka and Flink or Spark Streaming allows near-real-time data processing but is operationally more complex and makes reprocessing historical data harder. The choice depends on the latency requirements of the use case." Spark: "Apache Spark is a distributed computing framework for large-scale data processing. It executes computations in-memory across a cluster, making it much faster than MapReduce for iterative algorithms. I use it for large transformation jobs that exceed the capacity of a single warehouse." Kafka: "Apache Kafka is a distributed event streaming platform. It acts as a durable, high-throughput message queue. We use it as a reliable event bus between services, allowing us to ingest event streams into the data platform asynchronously without coupling the source systems to the pipeline."

Discussing dbt and Modern Data Stack

The "modern data stack" is a common interview topic: "I work with a modern data stack: Fivetran for data ingestion from SaaS sources, Snowflake as the cloud data warehouse, dbt for transformations (which we commit to git and deploy via GitHub Actions), and Looker for business intelligence. dbt is the transformation layer that replaced our ad-hoc SQL scripts. Every model is version-controlled, tested, and documented. We have full data lineage from source to dashboard."

When discussing trade-offs in design decisions: "We chose to use an ELT pattern rather than ETL because our warehouse — Snowflake — has enough compute power to handle the transformations cheaply and at scale. The benefit is that we always have the raw data in the warehouse, so if we need to add a new transformation or fix a historical issue, we don't need to reingest from the source. The trade-off is that raw data in the warehouse can include PII, which requires careful access controls." This kind of trade-off discussion — stating the decision, the reasoning, and the trade-offs — is what interviewers at senior levels are looking for.

Most Useful Vocabulary & Phrases for Data Engineers

ingest data
'The ingestion layer pulls raw events from Kafka and writes them to the bronze table in the data lake.'
backfill the pipeline
'After fixing the timezone bug, we need to backfill three months of historical data through the corrected transformation.'
data lineage
'Our dbt project provides full data lineage from the raw source tables to the gold-layer revenue models.'
idempotent pipeline
'We designed the pipeline to be idempotent — rerunning it after a failure produces the same output without duplicates.'
schema evolution
'We need a strategy for schema evolution — when the upstream API adds new fields, our pipeline should handle them gracefully.'
data contract
'The mobile app team and our pipeline team have agreed on a data contract: they will not remove or rename event fields without 30 days' notice.'
reconcile the totals
'The finance team noticed the sales totals don't match between the warehouse and the source CRM — we need to reconcile the two systems.'
partition pruning
'The query is slow because it's not using partition pruning — it scans all 365 date partitions instead of just the ones in the filter.'
slowly changing dimension
'The customer address is a Type 2 slowly changing dimension — we track historical values so that old orders retain the address they were placed with.'
materialise a view
'This dbt model is too expensive to compute on every query — we should materialise it as a table and refresh it nightly.'
data freshness SLA
'Our data freshness SLA for the orders table is 4 hours — stakeholders can trust the data is never more than 4 hours old.'
dead-letter queue
'Records that fail schema validation are sent to a dead-letter queue for manual inspection rather than silently dropped.'
fan-out pattern
'We use a fan-out pattern from the Kafka topic — one consumer group writes to the data lake, another feeds the real-time dashboard.'
medallion architecture
'We follow the medallion architecture: bronze for raw data, silver for cleaned data, gold for business-ready aggregates.'
upsert on the primary key
'To ensure idempotency, the load job upserts on the order ID — new records are inserted, existing ones are updated.'
watermark
'The Flink streaming job uses a watermark of 2 minutes to handle late-arriving events before triggering the aggregation window.'
validate data
'Great Expectations runs a suite of validation checks on every pipeline run — if any check fails, the job fails and alerts fire.'
event-driven ingestion
'We moved from scheduled batch ingestion to event-driven ingestion — the pipeline triggers as soon as a new file lands in S3.'
cluster autoscaling
'The Spark cluster autoscaling is configured to add workers during the morning ETL window and scale down afterwards to save cost.'
grain of the fact table
'We agreed the grain of the orders fact table is one row per order line item, not one row per order — this allows line-level revenue analysis.'
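Of the phrases in the list above, "watermark" is often the hardest to explain, so a small simulation may help. This is a plain-Python sketch of the idea, not Flink's API: the window size is a hypothetical value, event times are in seconds, and the watermark trails the highest event time seen by 2 minutes.

```python
from collections import defaultdict

WINDOW_SECONDS = 300    # 5-minute tumbling windows (hypothetical)
ALLOWED_LATENESS = 120  # watermark trails the max event time by 2 minutes

def process(event_times):
    """Count events per window; emit a window once the watermark passes its end."""
    open_windows = defaultdict(int)
    emitted, dropped = {}, []
    max_event_time = 0
    for event_time in event_times:  # events in the order they arrive
        window_start = event_time - event_time % WINDOW_SECONDS
        if window_start in emitted:
            dropped.append(event_time)  # too late: the window has already fired
            continue
        open_windows[window_start] += 1
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - ALLOWED_LATENESS
        for start in sorted(open_windows):
            if start + WINDOW_SECONDS <= watermark:
                emitted[start] = open_windows.pop(start)
    return emitted, dropped, dict(open_windows)

# The event at 290 arrives after 310 but is still counted; 100 arrives too late.
emitted, dropped, still_open = process([10, 200, 310, 290, 430, 100])
print(emitted, dropped, still_open)
```

The event at 290 arrives out of order but is still counted because the watermark has not yet passed the end of its window, whereas the event at 100 arrives after the window has fired and is discarded.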

Recommended Learning Path for Data Engineers

Stage 1: Foundation — Core Data Vocabulary

  1. Data Engineering Vocabulary

    Build the core vocabulary of data engineering: pipeline patterns, warehouse concepts, streaming terms, and the standard English used in data platform discussions.

  2. Data Engineering Collocations

    Practise the verb-noun collocations that data engineers use daily: ingest data, transform records, load the warehouse, validate quality, reconcile totals.

  3. Database Collocations

    Master the SQL and database vocabulary used in code reviews and architecture discussions: query, index, normalise, migrate, cache.

Stage 2: Intermediate — Pipeline and Quality Communication

  4. Data Pipeline Collocations

    The verbs of ETL/ELT pipelines: ingest, process, transform, validate, load, practised in realistic data engineering sentences.

  5. Architecture Design Collocations

    Design, scale, decouple, abstract: the vocabulary of discussing data platform architecture decisions with technical and non-technical colleagues.

  6. Code Review Collocations

    The English of reviewing SQL and dbt models: requesting changes, addressing comments, discussing query performance, and leaving actionable feedback.

Stage 3: Advanced — Stakeholder Communication and Interviews

  7. Stakeholder Management Language

    Translate technical pipeline failures, data quality issues, and migration plans into business language that non-technical stakeholders can act on.

  8. Technical Interview Language

    Articulate data warehouse vs lakehouse trade-offs, explain ETL/ELT design decisions, and discuss Spark, Kafka, and dbt architecture in interview settings.

  9. Presentations Language

    Present data infrastructure proposals, migration plans, and data quality reports to stakeholder audiences using clear, structured professional English.

Exercise Sets for Data Engineers

Practise the vocabulary and communication patterns covered in this guide with these focused exercise sets:

Vocabulary exercises

Collocations & interview preparation