Data Lineage Vocabulary: Provenance, Dependencies, and Impact Analysis
Learn the English vocabulary for data lineage — upstream and downstream dependencies, column-level lineage, DAGs, provenance, and how to discuss lineage in data reviews.
Why Data Lineage Vocabulary Matters
As data pipelines grow more complex, understanding where data comes from and what depends on it becomes critical for data quality, compliance, and incident response. Data lineage is the discipline of tracking this information. If you work in data engineering, analytics, or data governance, you will use lineage vocabulary constantly in reviews, documentation, and conversations with business stakeholders.
Core Lineage Concepts
Upstream and Downstream
These are the two most fundamental directional terms in data lineage, borrowed from the metaphor of a river.
Upstream — a dataset, table, or process that feeds data into the dataset you are looking at. If Table B reads from Table A, then Table A is upstream of Table B.
Downstream — a dataset, table, or process that consumes data from the dataset you are looking at. Table B, in the example above, is downstream of Table A.
“We need to notify the downstream consumers of the orders_daily table before we change the schema, because any downstream model that references the old column will break.”
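A minimal sketch of the two directions, assuming lineage is stored as a simple mapping from each table to the tables it reads from. The names table_a and table_b mirror the Table A / Table B example above; orders_report and both helper functions are hypothetical and exist only for illustration.

```python
# Each key reads from the tables in its list, i.e. its direct upstream tables.
# orders_report is a hypothetical consumer of orders_daily.
READS_FROM = {
    "table_b": ["table_a"],
    "orders_report": ["orders_daily"],
}


def upstream_of(table):
    """Return the tables that feed data directly into `table`."""
    return sorted(READS_FROM.get(table, []))


def downstream_of(table):
    """Return the tables that read directly from `table`."""
    return sorted(t for t, sources in READS_FROM.items() if table in sources)


print(upstream_of("table_b"))         # tables that table_b reads from
print(downstream_of("orders_daily"))  # tables to notify before a schema change
```

The same mapping answers both questions: looking up a key gives its upstream tables, and scanning the values gives its downstream tables.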
Column-Level Lineage
Column-level lineage tracks dependencies between individual columns, rather than only between tables or datasets. It answers questions like: “Which source column did this derived column come from?” and “If I rename this field, which reports will be affected?”
“Column-level lineage showed that the revenue_usd column in the finance dashboard is derived from three source columns across two different raw tables — a fact that wasn’t documented anywhere.”
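One way to picture column-level lineage is as a mapping from each derived column to the source columns it is computed from. The sketch below is illustrative only: the table and column names other than revenue_usd are invented, and real lineage tools typically extract this mapping by parsing SQL rather than from a hand-written dictionary.

```python
# Maps (table, column) of a derived column to its source (table, column) pairs.
# All names except revenue_usd are hypothetical.
COLUMN_SOURCES = {
    ("finance_dashboard", "revenue_usd"): [
        ("raw_orders", "amount"),
        ("raw_orders", "currency"),
        ("raw_fx_rates", "rate_to_usd"),
    ],
}


def source_columns(table, column):
    """Answer: which source columns did this derived column come from?"""
    return COLUMN_SOURCES.get((table, column), [])


def columns_affected_by(table, column):
    """Answer: if this source column is renamed, which derived columns are affected?"""
    return [
        derived
        for derived, sources in COLUMN_SOURCES.items()
        if (table, column) in sources
    ]


sources = source_columns("finance_dashboard", "revenue_usd")
print(len(sources), "source columns across", len({t for t, _ in sources}), "tables")
print(columns_affected_by("raw_orders", "currency"))
```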
DAG (Directed Acyclic Graph)
A DAG is a graph of nodes (datasets or tasks) connected by directed edges (dependencies), with no cycles — meaning you cannot follow the edges and return to a node you have already visited. In data engineering, the DAG represents the dependency graph of your pipeline.
“The dbt DAG visualisation showed that the executive dashboard model has 14 upstream dependencies, which explains why a single schema change in the raw layer can have such wide impact.”
Tools like Apache Airflow and dbt expose DAGs as first-class concepts in their UI, making this vocabulary part of day-to-day conversation.
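The "no cycles" property is what makes it possible to run a pipeline in dependency order. As a rough sketch, Python's standard-library graphlib module (Python 3.9+) can compute such an order and will raise CycleError if the graph is not acyclic. The node names and the two helper functions are hypothetical; this is not how Airflow or dbt implement their schedulers.

```python
from graphlib import CycleError, TopologicalSorter


def run_order(depends_on):
    """Return an order in which every node appears after its upstream nodes.

    `depends_on` maps each node to the nodes it depends on.
    Raises graphlib.CycleError if the graph contains a cycle.
    """
    return list(TopologicalSorter(depends_on).static_order())


def is_dag(depends_on):
    """Return True if the dependency graph has no cycles."""
    try:
        run_order(depends_on)
        return True
    except CycleError:
        return False


# Hypothetical pipeline: raw table -> staging -> daily model -> dashboard.
pipeline = {
    "stg_orders": ["raw_orders"],
    "orders_daily": ["stg_orders"],
    "exec_dashboard": ["orders_daily"],
}

print(run_order(pipeline))
print(is_dag({"a": ["b"], "b": ["a"]}))  # two nodes that depend on each other
```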
Provenance
Data provenance describes the origin and history of a dataset — where it came from, how it was transformed, who modified it, and when. It answers the audit question: “Can we trace this number back to its source?”
“The regulator asked us to provide data provenance for the risk exposure figures in the quarterly report. We used our lineage tool to generate a full trace from the dashboard metric back to the raw transaction records.”
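To make the "where, how, who, when" aspects concrete, here is a small sketch of a provenance trace. It assumes each dataset is derived from at most one parent, which is a simplification, and every dataset name, team name, and date below is invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProvenanceRecord:
    dataset: str
    derived_from: Optional[str]  # None marks a raw source (where it came from)
    transformation: str          # how it was transformed
    modified_by: str             # who modified it
    modified_on: str             # when, as an ISO 8601 date


# Hypothetical records; a lineage tool would normally capture these automatically.
RECORDS = {
    "risk_exposure_metric": ProvenanceRecord(
        "risk_exposure_metric", "risk_positions",
        "sum by counterparty", "analytics-team", "2024-01-15"),
    "risk_positions": ProvenanceRecord(
        "risk_positions", "raw_transactions",
        "filter and join", "data-eng", "2024-01-14"),
    "raw_transactions": ProvenanceRecord(
        "raw_transactions", None,
        "ingested from source system", "ingest-job", "2024-01-14"),
}


def trace(dataset):
    """Follow derived_from links back to the raw source, collecting each record."""
    path = []
    current = dataset
    while current is not None:
        record = RECORDS[current]
        path.append(record)
        current = record.derived_from
    return path


for step in trace("risk_exposure_metric"):
    print(step.dataset, "|", step.transformation, "|", step.modified_by, "|", step.modified_on)
```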
Impact Analysis
Impact analysis is the process of identifying all downstream assets that would be affected by a proposed change to an upstream dataset. It answers the question: “If I change this, what breaks?”
“Before deprecating the legacy_customer_id column, we ran an impact analysis and discovered it was still referenced in 23 downstream models — six of which fed into production dashboards.”
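In graph terms, impact analysis amounts to collecting everything reachable downstream of the changed asset, including indirect dependents. The sketch below does this with a breadth-first search over a hypothetical lineage mapping; the function name and all asset names are assumptions for illustration, not any particular tool's API.

```python
from collections import deque


def impact_analysis(reads_from, changed):
    """Return every asset that depends, directly or transitively, on `changed`.

    `reads_from` maps each asset to the assets it reads from.
    """
    # Invert the mapping so we can walk from a source to its consumers.
    consumers = {}
    for asset, sources in reads_from.items():
        for source in sources:
            consumers.setdefault(source, []).append(asset)

    affected = set()
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for consumer in consumers.get(node, []):
            if consumer not in affected:
                affected.add(consumer)
                queue.append(consumer)
    return sorted(affected)


# Hypothetical lineage for illustration.
lineage = {
    "stg_customers": ["raw_customers"],
    "stg_orders": ["raw_orders"],
    "customer_orders": ["stg_customers", "stg_orders"],
    "churn_dashboard": ["customer_orders"],
}

print(impact_analysis(lineage, "raw_customers"))
```

Note that churn_dashboard appears in the result even though it never reads raw_customers directly, which is the kind of indirect dependency that impact analysis is meant to surface.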
Language for Data Reviews
Proposing a Change
- “I’d like to walk through the lineage impact of the proposed schema change to the events table.”
- “The upstream source is changing its date format from ISO 8601 to Unix timestamps; I’ve mapped all the downstream models that will require updates.”
- “The impact analysis shows three critical downstream models; I’m proposing we update these before the source change goes live.”
Flagging a Dependency Issue
- “This model has a fragile upstream dependency on a manually uploaded CSV — we should replace it with a managed source.”
- “The lineage graph shows a cross-domain dependency that the data contract doesn’t document — we need to formalise this.”
Discussing Provenance in Compliance Contexts
- “We can trace the reported figure back through four transformation steps to the raw transactional data in the source system.”
- “Full column-level provenance is available in the lineage tool, and I can export it for the auditor’s review.”
Five Example Sentences
- “Before the upstream schema migration, we ran an impact analysis and tagged all 17 downstream models that reference the affected columns.”
- “Column-level lineage revealed that the gross_margin figure in the executive report is calculated differently in two separate models, which explains the discrepancy we observed last quarter.”
- “The DAG for the nightly revenue pipeline has a fan-out of 34 downstream consumers from a single intermediate model, making it a critical point of failure.”
- “Data provenance documentation is now a mandatory output of our data quality review process for any metric that feeds into regulatory reporting.”
- “The downstream team was not notified of the upstream schema change, which is exactly the kind of incident our lineage tooling is designed to prevent.”
Tools Reference
dbt exposes lineage natively through its documentation site and the dbt ls command. OpenLineage is an open standard for capturing lineage events across different tools. DataHub, Atlan, and Monte Carlo are popular data catalogue and observability platforms that visualise lineage. Familiarity with these tool names is part of the working vocabulary of modern data engineering.