English Vocabulary for Data Engineering Discussions
Master essential data engineering vocabulary: medallion architecture, data lineage, schema evolution, CDC, watermarks, exactly-once semantics, and data contracts.
Data engineering has its own dense vocabulary that can make conversations feel impenetrable if you are not familiar with the terms. Whether you are joining a discussion about pipeline reliability, attending a data platform review, or writing documentation, understanding these concepts — and being able to talk about them naturally in English — will help you contribute confidently to data engineering teams.
Key Vocabulary
Medallion architecture — A data organisation pattern with three layers: bronze (raw, unprocessed data), silver (cleaned and validated data), and gold (aggregated, business-ready data). Example: “We store all raw event logs in the bronze layer and run transformation jobs to promote them to silver.”
Data lineage — The ability to trace where data came from, how it was transformed, and where it went. Example: “Our data lineage tooling lets us see exactly which upstream source introduced that null value.”
Schema evolution — The process of changing a data schema over time without breaking downstream consumers. Example: “We use Avro with schema evolution support so adding new fields does not break existing pipelines.”
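To make the idea concrete, here is a minimal sketch in plain Python, loosely modelled on how default values let a newer reader schema accept older records. The schema, the field names, and the `read_with_schema` helper are hypothetical and do not use the Avro library.

```python
REQUIRED = object()  # sentinel marking a field that has no default

# Hypothetical version 2 of a schema: "country" was added with a default,
# so records written before the change can still be read.
SCHEMA_V2 = {"user_id": REQUIRED, "email": REQUIRED, "country": "unknown"}

def read_with_schema(record, schema):
    """Project a stored record onto a schema, filling defaults for missing fields."""
    out = {}
    for field, default in schema.items():
        if field in record:
            out[field] = record[field]
        elif default is REQUIRED:
            raise KeyError(f"missing required field: {field}")
        else:
            out[field] = default
    return out

old_record = {"user_id": 1, "email": "a@example.com"}  # written with version 1
new_view = read_with_schema(old_record, SCHEMA_V2)
```

Adding a field with a default keeps old data readable. Adding a required field without a default would make `read_with_schema` raise on every old record, which is the kind of breaking change schema evolution rules are meant to prevent.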
Partition pruning — A query optimisation technique where the database skips scanning irrelevant partitions based on query filters. Example: “Querying by date activates partition pruning and reduces the scan from terabytes to gigabytes.”
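The sketch below illustrates the principle with a toy in-memory table, assuming one partition per event date. `PARTITIONS` and `scan` are illustrative names, not the API of any real query engine.

```python
from datetime import date

# Hypothetical table: one partition (a list of rows) per event date.
PARTITIONS = {
    date(2024, 1, 1): [{"user": "a", "amount": 10}],
    date(2024, 1, 2): [{"user": "b", "amount": 20}],
    date(2024, 1, 3): [{"user": "c", "amount": 30}],
}

def scan(start, end):
    """Return matching rows and the number of partitions actually read."""
    rows, partitions_read = [], 0
    for day, partition in PARTITIONS.items():
        # Pruning: skip the whole partition when its key is outside the filter.
        if not (start <= day <= end):
            continue
        partitions_read += 1
        rows.extend(partition)
    return rows, partitions_read

rows, partitions_read = scan(date(2024, 1, 2), date(2024, 1, 2))
```

Because the filter is on the partition key, only one of the three partitions is read. A filter on `user` alone would give the engine nothing to prune on, and every partition would be scanned.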
Late-arriving data — Data that arrives in a system after the time window it belongs to has already been processed. Example: “We need to handle late-arriving data because mobile app events can be delayed by up to 48 hours.”
Watermark — In stream processing, a threshold that tells the system how far behind the latest observed event time data may still arrive before a window is considered complete. Example: “We set a watermark of 10 minutes to allow for late-arriving data before closing the aggregation window.”
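A simplified sketch of how a watermark decides when a window closes and which events count as too late. It assumes event times are plain integers in minutes, one-hour tumbling windows, and a watermark defined as the latest event time seen minus 10 minutes. Real stream processors differ in the details, and `aggregate` is a hypothetical helper.

```python
from collections import defaultdict

WINDOW = 60            # window length in minutes (illustrative)
ALLOWED_LATENESS = 10  # watermark delay in minutes, as in the example above

def aggregate(event_times):
    """Count events per tumbling window; return (closed_counts, dropped_events)."""
    open_windows = defaultdict(int)
    closed, dropped = {}, []
    max_event_time = 0
    for event_time in event_times:  # events in arrival order
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - ALLOWED_LATENESS
        window_start = (event_time // WINDOW) * WINDOW
        if window_start + WINDOW <= watermark:
            dropped.append(event_time)  # its window is already behind the watermark
        else:
            open_windows[window_start] += 1
        # Close every open window whose end is at or before the watermark.
        for start in sorted(open_windows):
            if start + WINDOW <= watermark:
                closed[start] = open_windows.pop(start)
    return closed, dropped

closed, dropped = aggregate([5, 30, 65, 58, 71, 59])
```

In this run the event at minute 58 arrives out of order but is still counted, because the watermark (55) has not yet passed the end of its window. The event at minute 59 arrives after the watermark has reached 61, so the first window is already closed and the event is dropped.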
Exactly-once semantics — A processing guarantee ensuring that each piece of data is processed exactly one time — not zero times, not twice. Example: “Kafka Streams supports exactly-once semantics for critical financial transaction pipelines.”
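Production systems usually achieve this guarantee with transactions or idempotent writes inside the framework. The sketch below shows a different and simpler technique, an idempotent consumer that deduplicates by event ID, so that redelivered events have no additional effect. The event shape and the `process` function are hypothetical.

```python
def process(events, balances, seen_ids):
    """Apply each event at most once, even if it is delivered several times."""
    for event in events:
        if event["id"] in seen_ids:
            continue  # duplicate delivery: already applied
        account = event["account"]
        balances[account] = balances.get(account, 0) + event["amount"]
        seen_ids.add(event["id"])
    return balances

# The second delivery of "e1" simulates a retry after a timeout.
deliveries = [
    {"id": "e1", "account": "acc-1", "amount": 100},
    {"id": "e1", "account": "acc-1", "amount": 100},
    {"id": "e2", "account": "acc-1", "amount": 50},
]
balances = process(deliveries, {}, set())
```

Without the `seen_ids` check, the retried event would be counted twice and the balance would be 250 instead of 150.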
Data contract — A formal agreement between data producers and consumers defining the schema, format, SLA, and quality expectations of a dataset. Example: “The payments team publishes a data contract that guarantees the event schema will not break without a 30-day deprecation notice.”
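Part of a data contract can be checked automatically. This minimal sketch validates only field presence and type. The `CONTRACT` dictionary and the `violations` helper are hypothetical, and a real contract would also cover SLAs and quality rules.

```python
# Hypothetical contract for a payments event: field name -> expected type.
CONTRACT = {"event_id": str, "amount_cents": int, "currency": str}

def violations(record):
    """Return a list of human-readable contract violations for one record."""
    problems = []
    for field, expected in CONTRACT.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            actual = type(record[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {actual}")
    return problems

good = {"event_id": "e1", "amount_cents": 1250, "currency": "EUR"}
bad = {"event_id": "e2", "amount_cents": "12.50"}
```

A producer could run a check like this before publishing, so that a breaking change is caught on the producer's side rather than in a consumer's pipeline.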
SLA vs SLO for pipelines — An SLA (Service Level Agreement) is a contractual commitment, usually with penalties for breach. An SLO (Service Level Objective) is an internal target. Example: “Our internal SLO is to process events within five minutes, but the SLA we signed with the client guarantees one-hour freshness.”
CDC (Change Data Capture) — A technique for capturing row-level changes in a database and streaming them to downstream systems. Example: “We use CDC to replicate changes from our production Postgres database into the data warehouse in near-real time.”
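On the consuming side, CDC boils down to replaying a stream of row-level changes in order. The sketch below applies such changes to an in-memory replica. The event shape (`op`, `id`, `row`) is an assumption made for illustration and is not the format of any particular CDC tool.

```python
def apply_change(replica, change):
    """Apply one row-level change event to a replica keyed by primary key."""
    op, key = change["op"], change["id"]
    if op in ("insert", "update"):
        replica[key] = change["row"]
    elif op == "delete":
        replica.pop(key, None)
    else:
        raise ValueError(f"unknown operation: {op}")
    return replica

# Hypothetical change stream captured from a source table.
changes = [
    {"op": "insert", "id": 1, "row": {"name": "Ana"}},
    {"op": "insert", "id": 2, "row": {"name": "Bo"}},
    {"op": "update", "id": 2, "row": {"name": "Bob"}},
    {"op": "delete", "id": 1},
]

replica = {}
for change in changes:
    apply_change(replica, change)
```

After the four changes are replayed, the replica holds only row 2 with its updated name, matching the state of the source table.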
How to Use This in Practice
In data engineering discussions, these terms often appear in combinations. Practise hearing them together:
- “We need to handle late-arriving data in the silver layer by extending the watermark.”
- “The data contract specifies the schema evolution rules — you cannot remove fields without a deprecation period.”
- “Exactly-once semantics are critical for this pipeline because we are processing financial events.”
When discussing pipeline reliability, distinguish between SLAs (external, contractual) and SLOs (internal targets). Teams often talk about “missing SLOs” as an early warning before they risk “breaching SLAs.”
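The early-warning relationship can be sketched as a simple freshness check. The thresholds reuse the figures from the SLA vs SLO example in the vocabulary list (a five-minute SLO and a one-hour SLA), and `freshness_status` is a hypothetical helper.

```python
from datetime import datetime, timedelta

SLO = timedelta(minutes=5)  # internal target
SLA = timedelta(hours=1)    # contractual commitment

def freshness_status(last_event_time, now):
    """Classify pipeline freshness against the SLO and the SLA."""
    lag = now - last_event_time
    if lag > SLA:
        return "SLA breached"
    if lag > SLO:
        return "SLO missed"
    return "ok"

now = datetime(2024, 1, 1, 12, 0)
status = freshness_status(datetime(2024, 1, 1, 11, 40), now)
```

A 20-minute lag misses the SLO but is still well inside the SLA, which is exactly the window in which a team wants to be alerted.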
Data lineage discussions often use verbs like “trace”, “track”, “audit”, and “surface”: “Can we surface the lineage of this column in the dashboard?” or “We need to audit the lineage before decommissioning that source.”
Example Conversation
Data Engineer (Anastasia): “We’re seeing stale data in the gold layer. The numbers are lagging by about 12 hours.”
Pipeline Lead: “Is this a late-arriving data issue or a pipeline failure?”
Anastasia: “It’s late-arriving data. Mobile events from offline users are arriving outside our current watermark. I’d suggest extending the watermark from 10 minutes to 4 hours for this specific source.”
Pipeline Lead: “That will delay window closure. Will we still meet our SLO of one-hour freshness for the business dashboard?”
Anastasia: “Not for mobile events. We should update the data contract for that source to reflect a 4-hour freshness guarantee and notify the analytics team.”
Practice Tips
- Map a pipeline in English: Draw a simple data pipeline for a project you know, then describe each stage aloud in English using the vocabulary from this post. For example: “Raw events land in the bronze layer via CDC from our Postgres database. A Spark job validates and cleans them before promoting to silver…”
- Explain medallion layers to a non-engineer: Try explaining bronze, silver, and gold layers to someone who is not a data engineer, using an analogy (for example: raw ore, refined metal, finished product). This forces you to connect the technical vocabulary to plain English.
- Read a streaming architecture blog post: The Confluent blog, the Databricks engineering blog, and the Airflow documentation use all of these terms naturally. Read one article and highlight every term from this post that you find. Try to infer the meaning from context before checking the definition.