Practise vocabulary for data lineage: its definition, upstream vs. downstream datasets, lineage graphs, column-level lineage, and impact analysis.
1 / 5
Data lineage is best described as:
Data lineage is the record of where data originates, how it moves, and how it is transformed along the way. It answers "Where did this number come from?" and "How was it transformed?" It is foundational to debugging data quality issues, understanding the impact of changes, and meeting regulatory audit requirements (e.g. BCBS 239, GDPR). Tools include OpenLineage, Marquez, Apache Atlas, and modern data catalogs.
2 / 5
In a data pipeline, an "upstream" dataset is:
Upstream = earlier in the data flow (closer to the source). Downstream = later (closer to the consumer). If the orders table is upstream of the monthly_revenue model, any schema change to orders can break monthly_revenue. Lineage graphs visualise these dependency chains.
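As a rough illustration of the upstream/downstream relationship, a lineage graph can be sketched as a plain adjacency mapping. The `raw.orders` source and the helper names `upstream_of` and `downstream_of` are assumptions made for this example, not part of any particular lineage tool.

```python
# Hypothetical lineage edges: each key feeds the datasets listed in its value.
DOWNSTREAM = {
    "raw.orders": ["orders"],
    "orders": ["monthly_revenue"],
    "monthly_revenue": [],
}


def upstream_of(dataset, edges):
    """Return the datasets that feed directly into `dataset`."""
    return sorted(src for src, targets in edges.items() if dataset in targets)


def downstream_of(dataset, edges):
    """Return the datasets that read directly from `dataset`."""
    return sorted(edges.get(dataset, []))


print(upstream_of("monthly_revenue", DOWNSTREAM))  # ['orders']
print(downstream_of("orders", DOWNSTREAM))         # ['monthly_revenue']
```

Storing edges in one direction and deriving the other keeps the sketch small; a real catalog would typically index both directions.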
3 / 5
Column-level lineage tracks:
Column-level lineage is far more precise than table-level lineage. Example: "revenue_usd in the weekly_summary table is derived from amount_cents in raw.orders divided by 100, filtered where status = 'completed'". This enables targeted impact analysis: if amount_cents changes, you know exactly which output columns are affected.
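One way to picture this is a small record per output column that lists its input columns and the transformation applied. The sketch below is only illustrative: the `ColumnLineage` class, the `affected_outputs` helper, and the second entry (`order_count` from `order_id`) are hypothetical and were added to show that an unrelated column is not flagged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnLineage:
    output: str          # fully qualified output column
    inputs: tuple        # fully qualified input columns it depends on
    transformation: str  # human-readable description of the logic


LINEAGE = [
    ColumnLineage(
        output="weekly_summary.revenue_usd",
        inputs=("raw.orders.amount_cents", "raw.orders.status"),
        transformation="amount_cents / 100 WHERE status = 'completed'",
    ),
    # Hypothetical second column, included only for contrast.
    ColumnLineage(
        output="weekly_summary.order_count",
        inputs=("raw.orders.order_id",),
        transformation="COUNT(order_id)",
    ),
]


def affected_outputs(input_column, lineage):
    """Return the output columns that depend on `input_column`."""
    return sorted(c.output for c in lineage if input_column in c.inputs)


print(affected_outputs("raw.orders.amount_cents", LINEAGE))
# ['weekly_summary.revenue_usd']
```

With table-level lineage alone, both `weekly_summary` columns would appear affected by any change to `raw.orders`; the column-level records narrow that to the one column that actually reads `amount_cents`.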
4 / 5
Impact analysis in the context of data lineage means:
"Which reports are affected if I change this table?" is the classic impact analysis question. Without lineage, engineers must grep code repositories and ask colleagues. With lineage, they can navigate the graph, see all downstream consumers, and notify owners proactively. This helps prevent silent data breakages.
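Navigating the graph for impact analysis amounts to a traversal of everything reachable downstream of the changed dataset. The following is a minimal sketch using breadth-first search; the dataset names beyond `orders` and `monthly_revenue`, the `OWNERS` mapping, and both function names are assumptions for illustration.

```python
from collections import deque

# Hypothetical lineage edges: each key feeds the datasets listed in its value.
DOWNSTREAM = {
    "orders": ["monthly_revenue", "customer_ltv"],
    "monthly_revenue": ["finance_dashboard"],
    "customer_ltv": [],
    "finance_dashboard": [],
}

# Hypothetical ownership metadata, as a data catalog might record it.
OWNERS = {
    "monthly_revenue": "finance-team",
    "customer_ltv": "growth-team",
    "finance_dashboard": "finance-team",
}


def impacted(dataset, edges):
    """Return every dataset reachable downstream of `dataset`."""
    seen = set()
    queue = deque([dataset])
    while queue:
        current = queue.popleft()
        for child in edges.get(current, []):
            if child not in seen:  # guard against revisiting shared nodes
                seen.add(child)
                queue.append(child)
    return sorted(seen)


def owners_to_notify(dataset, edges, owners):
    """Return the distinct owners of all impacted datasets."""
    return sorted({owners[d] for d in impacted(dataset, edges) if d in owners})


print(impacted("orders", DOWNSTREAM))
# ['customer_ltv', 'finance_dashboard', 'monthly_revenue']
print(owners_to_notify("orders", DOWNSTREAM, OWNERS))
# ['finance-team', 'growth-team']
```

The traversal includes indirect consumers such as `finance_dashboard`, which a search for direct references to `orders` would likely miss.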
5 / 5
The OpenLineage specification is:
OpenLineage (openlineage.io) is an open standard for lineage event emission: Spark jobs, Airflow DAGs, and dbt runs can all emit lineage in the same format through their integrations. Marquez is the reference backend. This addresses the fragmentation problem where each tool had its own lineage format, which made cross-tool lineage difficult to assemble.
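To give a feel for what "the same format" means, here is a rough sketch of a lineage run event built with only the Python standard library. The field names (`eventType`, `eventTime`, `producer`, `run`, `job`, `inputs`, `outputs`) follow my reading of the OpenLineage specification and should be checked against the published schema, which defines further fields (such as `schemaURL`) that are omitted here. The function name, namespace, job name, and producer URI are all hypothetical.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(job_name, inputs, outputs, namespace="example_namespace"):
    """Build a simplified, OpenLineage-style run event as a plain dict."""
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": "https://example.com/lineage-sketch",  # hypothetical URI
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [{"namespace": namespace, "name": n} for n in inputs],
        "outputs": [{"namespace": namespace, "name": n} for n in outputs],
    }


event = build_run_event(
    job_name="build_monthly_revenue",  # hypothetical job name
    inputs=["orders"],
    outputs=["monthly_revenue"],
)
print(json.dumps(event, indent=2))
```

Because every producer describes a run in terms of a job, its input datasets, and its output datasets, a backend can stitch events from different tools into one graph by matching dataset namespaces and names.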