English for Data Pipeline Engineers: Vocabulary and Phrases

Data pipeline vocabulary, ETL/ELT language, orchestration, data quality, and communication patterns for data engineers.

Data engineering has its own dialect. Whether you are joining a standup, writing a post-mortem, or reviewing a colleague’s PR, you will encounter a dense cluster of terms that do not appear in general English dictionaries. This post breaks down the most important vocabulary for data pipeline engineers — with plain-English definitions and real conversation examples so you can use each term confidently.


Core Terms: Pipelines and Graphs

Pipeline — a sequence of automated steps that move or transform data from a source to a destination. The word is borrowed from Unix shell scripting but is now used broadly across all data stacks.

“The ingestion pipeline is failing silently — it writes zero rows but exits with code 0. Can you add a row-count assertion at the end?”

“We have three separate pipelines for clickstream, CRM, and payments. Eventually we want to unify them, but for now each team owns their own.”
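
The row-count assertion from the first example can be sketched in a few lines. This is a minimal illustration, not a production pattern: `assert_rows_written` and `run_ingestion` are hypothetical names, and the "load" here only counts records instead of writing anywhere.

```python
def assert_rows_written(rows_written: int, minimum: int = 1) -> int:
    """Fail loudly instead of exiting with code 0 when too few rows were written."""
    if rows_written < minimum:
        raise RuntimeError(
            f"row-count assertion failed: wrote {rows_written}, "
            f"expected at least {minimum}"
        )
    return rows_written


def run_ingestion(records: list[dict]) -> int:
    # Hypothetical load step: a real pipeline would write to a warehouse here.
    rows_written = len(records)
    return assert_rows_written(rows_written)
```

With this in place, an empty batch raises an error that the scheduler can see, rather than finishing "successfully" with zero rows.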

DAG (directed acyclic graph) — a way of representing task dependencies as a graph where edges point in one direction and no task can depend on itself (directly or indirectly). Most modern orchestrators model workflows as DAGs.

“Your DAG has a cycle — Task C depends on Task B, which depends on Task A, which depends on Task C. The scheduler will refuse to run it.”

“I’ll add the new transformation as a node in the DAG, downstream of the normalisation step.”
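
To make the "no cycles" rule concrete, here is a small sketch using Python's standard-library `graphlib` module (Python 3.9+). The task names are made up for illustration, and real orchestrators run their own validation; this only shows the idea.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return a valid run order, or raise ValueError if the graph has a cycle.

    `dependencies` maps each task to the set of tasks it depends on.
    """
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        # The second element of exc.args is the list of nodes forming the cycle.
        raise ValueError(f"DAG has a cycle: {exc.args[1]}") from exc


# A valid chain: extract -> normalise -> transform
valid = {"normalise": {"extract"}, "transform": {"normalise"}}

# The cycle from the example: C depends on B, B on A, A on C
cyclic = {"task_c": {"task_b"}, "task_b": {"task_a"}, "task_a": {"task_c"}}
```

Calling `execution_order(valid)` yields the tasks in dependency order, while `execution_order(cyclic)` raises an error, which mirrors a scheduler refusing to run the DAG.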

Orchestration — the automated scheduling and coordination of pipeline tasks. Airflow, Prefect, and Dagster are among the most widely used orchestrators.

“We migrated from Airflow to Prefect last quarter. The dynamic task mapping in Prefect is much cleaner for our use case.”

“Orchestration handles retries, alerting, and dependency resolution so we do not have to manage that ourselves with cron jobs.”

Task dependency — a rule that says one task must complete (successfully) before another task can start.

“There is a task dependency between the raw load and the dedupe step — if the load fails, the dedupe never runs, which is exactly what we want.”
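
The behaviour in that example (a failed load prevents the dedupe from running) can be sketched with a toy runner. This is an assumption-heavy simplification: `run_in_order` is a hypothetical helper, and it expects the tasks to be listed in dependency order already.

```python
from typing import Callable

Task = tuple[str, Callable[[], None], list[str]]  # (name, function, upstream names)


def run_in_order(tasks: list[Task]) -> dict[str, str]:
    """Run tasks in the given order; skip any task whose upstream did not succeed."""
    status: dict[str, str] = {}
    for name, func, upstream in tasks:
        if any(status.get(dep) != "success" for dep in upstream):
            status[name] = "skipped"
            continue
        try:
            func()
            status[name] = "success"
        except Exception:
            status[name] = "failed"
    return status


def failing_load() -> None:
    raise IOError("source unavailable")  # simulated failure


def dedupe() -> None:
    pass  # placeholder for the real dedupe logic


result = run_in_order([
    ("raw_load", failing_load, []),
    ("dedupe", dedupe, ["raw_load"]),
])
```

Here `result` records `raw_load` as failed and `dedupe` as skipped, which is the outcome the speaker describes as "exactly what we want".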


ETL/ELT Language and Load Patterns

Incremental load vs full refresh — two fundamental strategies for updating a dataset. An incremental load appends or merges only new or changed records since the last run. A full refresh drops and replaces the entire dataset on every run.

“For the transactions table we use incremental load keyed on updated_at. For the product catalogue we do a full refresh nightly because it’s small and the SCD logic isn’t worth the complexity.”

“Watch out — full refresh on a 200-million-row table is going to cause lock contention. Let’s switch to incremental.”
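
An incremental load keyed on `updated_at` might look roughly like this. The sketch uses an in-memory SQLite database so it runs anywhere; the table names are invented, and timestamps are assumed to be ISO-8601 strings so that string comparison matches time order.

```python
import sqlite3


def incremental_load(conn: sqlite3.Connection, last_loaded_at: str) -> str:
    """Copy rows changed since last_loaded_at; return the new high-water mark."""
    rows = conn.execute(
        "SELECT id, amount, updated_at FROM source_transactions "
        "WHERE updated_at > ?",
        (last_loaded_at,),
    ).fetchall()
    # INSERT OR REPLACE overwrites an existing row with the same primary key.
    conn.executemany(
        "INSERT OR REPLACE INTO target_transactions (id, amount, updated_at) "
        "VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()
    return max((row[2] for row in rows), default=last_loaded_at)


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_transactions (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT);
    CREATE TABLE target_transactions (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT);
    INSERT INTO source_transactions VALUES (1, 10.0, '2024-01-01'), (2, 20.0, '2024-01-02');
""")
mark = incremental_load(conn, "1970-01-01")  # first run loads everything
```

One caveat worth knowing: a strict `>` comparison can miss rows that share the exact timestamp of the previous high-water mark, so real pipelines often add an overlap window or a tie-breaking key.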

Backfill — running a pipeline for a historical time range that was missed, often because the pipeline did not exist yet or failed during that period.

“We added event tracking last month. We need to backfill six months of historical data before we can do any meaningful trend analysis.”

“The backfill job is idempotent, so you can re-run it safely if it times out halfway through.”
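
A backfill is often just "run the daily job once per missing day". The sketch below assumes a daily partitioning scheme; `run_for_day` stands in for whatever function actually processes one day.

```python
from datetime import date, timedelta
from typing import Callable, Iterator


def daily_partitions(start: date, end: date) -> Iterator[date]:
    """Yield each day from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def backfill(run_for_day: Callable[[date], None], start: date, end: date) -> list[date]:
    """Run the pipeline once per day and return the days that failed."""
    failed: list[date] = []
    for day in daily_partitions(start, end):
        try:
            run_for_day(day)
        except Exception:
            # Retrying these days later is only safe if run_for_day is idempotent.
            failed.append(day)
    return failed
```

Returning the failed days, rather than stopping at the first error, lets you re-run only the gaps.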

Idempotent pipeline — a pipeline that produces the same result no matter how many times it is run for the same input or time window. This is a critical design property for safe retries and backfills.

“Always design your pipelines to be idempotent. If an insert-overwrite job runs twice, the table should look identical — not have duplicate rows.”

“The pipeline isn’t idempotent right now because it appends unconditionally. We need to add a MERGE or a delete-then-insert pattern.”
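
The delete-then-insert pattern mentioned above can be sketched with SQLite. The table and column names are hypothetical; the point is that the delete and the insert happen in one transaction, scoped to the same time window.

```python
import sqlite3


def load_day(conn: sqlite3.Connection, day: str, rows: list[tuple[int, float]]) -> None:
    """Replace one day's data atomically so reruns leave the same state."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("DELETE FROM daily_orders WHERE order_date = ?", (day,))
        conn.executemany(
            "INSERT INTO daily_orders (order_date, order_id, total) VALUES (?, ?, ?)",
            [(day, order_id, total) for order_id, total in rows],
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_orders (order_date TEXT, order_id INTEGER, total REAL)")

rows = [(1, 9.99), (2, 25.00)]
load_day(conn, "2024-03-01", rows)
load_day(conn, "2024-03-01", rows)  # rerun for the same window
```

After both runs the table still holds two rows for that day. An unconditional append would have left four.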


Data Quality and Schema Management

Data quality check — a validation step that tests whether data meets defined rules before it is passed downstream. Checks can cover completeness, uniqueness, referential integrity, value ranges, and more.

“We have data quality checks running after every load. If the null rate on customer_id exceeds 2%, the downstream tasks are blocked automatically.”
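
A null-rate check like the one in that example could be written as follows. This is a plain-Python sketch with hypothetical function names; the 2% threshold comes from the example sentence, and rows are assumed to be dictionaries.

```python
def null_rate(rows: list[dict], column: str) -> float:
    """Fraction of rows where the column is missing or None."""
    if not rows:
        # An empty batch has no nulls; pair this with a row-count check.
        return 0.0
    nulls = sum(1 for row in rows if row.get(column) is None)
    return nulls / len(rows)


def check_null_rate(rows: list[dict], column: str, max_rate: float = 0.02) -> None:
    """Raise ValueError so the caller can block downstream tasks."""
    rate = null_rate(rows, column)
    if rate > max_rate:
        raise ValueError(f"{column} null rate {rate:.1%} exceeds {max_rate:.1%}")
```

Raising an exception is one simple way to "block" downstream work, since most orchestrators treat an uncaught error as a failed task.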

Expectation (Great Expectations) — in the context of the Great Expectations framework, an expectation is a declarative assertion about your data (for example, “this column should never be null”, “values should be between 0 and 100”). The term has become common shorthand across the industry.

“I added an expectation that order_total is always positive. It caught a data entry bug in the upstream CRM within the first day.”

“The expectation suite is stored in version control alongside the pipeline code. That way schema and quality rules evolve together.”
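
To show what "declarative" means here, the sketch below models an expectation as a named rule plus a suite that applies every rule to every row. It deliberately does not use the Great Expectations API; the `Expectation` class and `validate` function are illustrative stand-ins.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Expectation:
    name: str
    column: str
    predicate: Callable[[object], bool]  # returns True when the value is acceptable


def validate(rows: list[dict], suite: list[Expectation]) -> dict[str, int]:
    """Return the number of failing rows per expectation name."""
    return {
        exp.name: sum(1 for row in rows if not exp.predicate(row.get(exp.column)))
        for exp in suite
    }


suite = [
    Expectation("order_total_positive", "order_total", lambda v: v is not None and v > 0),
    Expectation("customer_id_not_null", "customer_id", lambda v: v is not None),
]

rows = [
    {"order_total": 12.5, "customer_id": 7},
    {"order_total": -3.0, "customer_id": None},
]
failures = validate(rows, suite)
```

Because the suite is just data, it can live in version control next to the pipeline code, as the second example suggests.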

Schema evolution — the process of handling changes to the structure of a dataset (adding columns, renaming fields, changing types) without breaking downstream consumers.

“The upstream team added three new columns without notice. Our pipeline broke because we were doing SELECT * and the Avro schema didn’t match. We need a proper schema evolution strategy.”

“We use Delta Lake’s automatic schema evolution for additive changes, but breaking changes — like renaming a column — still require a coordinated migration.”
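
The difference between additive and breaking changes can be checked mechanically. The following sketch compares two schemas represented as `{column: type}` dictionaries; that representation and the function names are assumptions for illustration, not any particular tool's format.

```python
def diff_schema(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Classify column changes between two {column: type} schemas."""
    shared = set(old) & set(new)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "retyped": sorted(col for col in shared if old[col] != new[col]),
    }


def is_additive(changes: dict[str, list[str]]) -> bool:
    """True when the change only adds columns, so existing readers keep working."""
    return not changes["removed"] and not changes["retyped"]


old = {"id": "int", "email": "string"}
added_column = {"id": "int", "email": "string", "signup_source": "string"}
renamed_column = {"id": "int", "email_address": "string"}
```

Note that a rename shows up as one removed column plus one added column, which is why it counts as a breaking change.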

Data lineage — a record of where data came from, how it was transformed, and where it went. Lineage is essential for debugging, compliance, and impact analysis.

“We can’t just delete that source table — the data lineage graph shows it’s referenced by fourteen downstream models. We need to deprecate it properly.”

“With OpenLineage integrated into Dagster, we get lineage metadata automatically. It’s a huge help when tracing a quality issue back to its source.”
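
Impact analysis on a lineage graph is essentially a graph traversal. In this sketch the lineage is a plain dictionary mapping each dataset to its direct consumers; the dataset names are invented, and real lineage tools store far richer metadata.

```python
from collections import deque


def downstream_of(edges: dict[str, list[str]], source: str) -> set[str]:
    """Return every dataset reachable from source by following lineage edges."""
    seen: set[str] = set()
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for child in edges.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


lineage = {
    "raw_orders": ["stg_orders"],
    "stg_orders": ["fct_orders", "dim_customers"],
    "fct_orders": ["revenue_dashboard"],
}
affected = downstream_of(lineage, "raw_orders")
```

Before deleting `raw_orders`, you would review everything in `affected`, which is the same question the "fourteen downstream models" example answers.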


Reliability, Latency, and Failure Patterns

Late-arriving data — records that arrive at the processing system after the time window they logically belong to has already been processed. Common in event-driven and IoT systems.

“We see late-arriving data from mobile clients all the time — events generated offline get synced hours later. Our aggregations need to account for that.”

Watermark — a marker of how far a streaming or micro-batch system believes event time has progressed. In practice it determines how long the system waits for late-arriving data before closing a time window. Once the watermark passes the end of a window, late records for that window may be dropped or routed elsewhere.

“Set the watermark to 30 minutes. Anything arriving later than that goes to the dead letter queue for manual review.”

“The watermark is too aggressive — we’re dropping events that are only 10 minutes late. Let’s extend it to an hour and re-evaluate.”
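
Here is a simplified model of how a watermark decides which late events still count. It assumes fixed-size (tumbling) windows, integer timestamps in minutes, and a watermark computed as "latest event time seen minus allowed lateness". Real streaming engines are more sophisticated, so treat this only as a teaching sketch.

```python
def assign_events(
    event_times: list[int], window_size: int, allowed_lateness: int
) -> tuple[dict[int, int], list[int]]:
    """Count events per window; divert events whose window has already closed.

    `event_times` is in arrival order, which may differ from event-time order.
    """
    counts: dict[int, int] = {}
    too_late: list[int] = []
    max_seen = 0
    for t in event_times:
        max_seen = max(max_seen, t)
        watermark = max_seen - allowed_lateness
        window_start = (t // window_size) * window_size
        if window_start + window_size <= watermark:
            too_late.append(t)  # window closed: route to a dead letter queue
        else:
            counts[window_start] = counts.get(window_start, 0) + 1
    return counts, too_late


# 60-minute windows, 30 minutes of allowed lateness
counts, too_late = assign_events([5, 50, 70, 55, 130, 58], 60, 30)
```

In this run the event at minute 55 arrives out of order but is still accepted, while the event at minute 58 arrives after the watermark has passed the end of its window and is set aside.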

Dead letter queue — a separate storage location for records that failed processing (due to schema errors, quality violations, or system faults). Rather than failing the whole pipeline, bad records are isolated so the rest can continue.

“We send malformed JSON straight to the dead letter queue. The ops team reviews it weekly and either fixes the upstream producer or discards the records.”

“Make sure the dead letter queue has alerting set up — it’s a silent failure mode if nobody notices it filling up.”
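
The core idea of a dead letter queue is "catch the failure per record, not per batch". The sketch below uses a plain list as the queue and treats a missing `user_id` as a quality violation; both choices are assumptions made for the example.

```python
import json


def process_batch(raw_messages: list[str]) -> tuple[list[dict], list[dict]]:
    """Parse each message; isolate failures instead of failing the whole batch."""
    parsed: list[dict] = []
    dead_letters: list[dict] = []
    for raw in raw_messages:
        try:
            record = json.loads(raw)
            if not isinstance(record, dict) or record.get("user_id") is None:
                raise ValueError("missing user_id")
            parsed.append(record)
        except ValueError as exc:  # json.JSONDecodeError is a subclass of ValueError
            # Keep the original payload and the reason so someone can review it.
            dead_letters.append({"raw": raw, "error": str(exc)})
    return parsed, dead_letters


parsed, dead_letters = process_batch(
    ['{"user_id": 1}', '{"user_id": null}', "{not json"]
)
```

Storing the error message alongside the raw payload makes the weekly review much faster, and the length of `dead_letters` is a natural metric to alert on.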

SLA breach (in data contexts) — a situation where a pipeline or dataset fails to meet its service-level agreement (SLA), typically defined as a deadline by which fresh data must be available downstream.

“The reporting pipeline has an SLA of 06:00 UTC. We breached it this morning — the finance team’s dashboard showed yesterday’s numbers until 08:30. We need a post-mortem.”

“I’ll add an SLA check in Airflow so we get a Slack alert if the DAG hasn’t completed by 05:45. That gives us fifteen minutes to intervene before the breach.”
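
The logic behind an early-warning SLA check is simple date arithmetic. This sketch is independent of any orchestrator and does not use Airflow's API; `sla_status` is a hypothetical helper, and all datetimes are assumed to be timezone-aware UTC values.

```python
from datetime import datetime, time, timezone
from typing import Optional


def sla_status(
    now: datetime, completed_at: Optional[datetime], warn_at: time, deadline: time
) -> str:
    """Classify today's run as 'ok', 'at_risk', or 'breached' (all times UTC)."""
    deadline_dt = datetime.combine(now.date(), deadline, tzinfo=timezone.utc)
    warn_dt = datetime.combine(now.date(), warn_at, tzinfo=timezone.utc)
    if completed_at is not None:
        return "ok" if completed_at <= deadline_dt else "breached"
    if now >= deadline_dt:
        return "breached"
    return "at_risk" if now >= warn_dt else "ok"


# Still running at 05:50 UTC with a 05:45 warning and a 06:00 deadline
status = sla_status(
    now=datetime(2024, 5, 1, 5, 50, tzinfo=timezone.utc),
    completed_at=None,
    warn_at=time(5, 45),
    deadline=time(6, 0),
)
```

An "at_risk" result is the moment to send the Slack alert, while there is still time to intervene before the deadline.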


How to Use These in Conversation

Data engineers regularly communicate across three contexts: standups/syncs, incident reviews, and architecture discussions. Here is how the vocabulary fits in practice.

During a standup:

“Yesterday I finished the idempotent backfill for the orders table. Today I’m working on adding expectations for schema evolution in the product feed pipeline. No blockers, but I want to flag that late-arriving data from the mobile app is causing watermark issues in the streaming job.”

During an incident review:

“The SLA breach at 06:20 was caused by a task dependency failure upstream. The data quality check failed on the raw load, which correctly blocked the downstream DAG. Records with null user_id went to the dead letter queue. Root cause: the producer deployed a schema change without updating the expectation suite.”

When proposing a design:

“I’d like to move from a full refresh to an incremental load pattern, and make the pipeline idempotent using MERGE. We should also add data lineage tracking from the start — it’ll save a lot of time when we need to debug orchestration issues later.”

Using precise terminology signals competence and keeps conversations efficient. When in doubt, prefer the specific term over a vague description — “the DAG has a cycle” is clearer and faster than “the tasks are depending on each other in a loop”.


Quick Reference

| Term | Plain-English Meaning |
| --- | --- |
| Pipeline | Automated sequence of steps moving or transforming data |
| DAG | Graph of tasks with directional, non-circular dependencies |
| Orchestration | Automated scheduling and coordination of pipeline tasks |
| Backfill | Running a pipeline over a missed historical time range |
| Idempotent pipeline | Pipeline that produces the same result on repeated runs |
| Incremental load | Loading only new/changed records since the last run |
| Expectation | Declarative data quality assertion (Great Expectations term) |
| Schema evolution | Handling structural changes to datasets without breaking consumers |
| Watermark | Lateness threshold for closing a streaming time window |
| Dead letter queue | Isolated storage for records that failed processing |
| SLA breach | Failure to deliver fresh data by an agreed deadline |
| Data lineage | Record of data’s origin, transformations, and destinations |