Intermediate 12 terms

Data Lineage & Governance

Vocabulary for tracking data origins, managing data quality, and implementing governance frameworks in modern data platforms.

  • Data Lineage /ˈdeɪtə ˈlɪniɪdʒ/

    The end-to-end record of a data asset's journey — where it originated, how it has been transformed, where it has moved, and what systems consume it. Lineage enables impact analysis (what breaks if I change this?), root cause investigation (where did this bad value come from?), and compliance (can I demonstrate data handling to regulators?).

    "When the BI team noticed that monthly revenue figures differed between two dashboards, data lineage let us trace the discrepancy to a transformation step in the warehouse pipeline that was filtering out refunds differently in each branch. Without lineage, that investigation would have taken days."
  • Column-Level Lineage /ˈkɒləm ˈlevəl ˈlɪniɪdʒ/

    Data lineage tracked at the level of individual columns or fields: not just which tables flow into which, but which specific source fields feed which output fields. Column-level lineage enables precise impact analysis and is typically needed to trace PII through pipelines for regulatory compliance.

    "Our column-level lineage implementation shows exactly which upstream source fields contribute to the 'customer_ltv' column — it traces through 4 transformation steps across 3 pipelines. When the sales team changed the revenue recognition rule, the lineage graph showed 23 downstream columns that would be affected, preventing silent data corruption."
  • Data Catalog /ˈdeɪtə ˈkætəlɒɡ/

    A centralised inventory of an organisation's data assets (tables, dashboards, pipelines, ML features) enriched with metadata, lineage, ownership, quality metrics, and usage statistics. It enables data discovery: analysts can find and understand datasets without asking engineers. Common tools: DataHub, Alation, Collibra, Atlan.

    "Before the data catalog, analysts spent 40% of their time asking engineers "what does this column mean?" and "is this dataset up to date?". After cataloging 3,000 tables with owners, descriptions, freshness SLAs, and sample data, self-service data discovery improved analyst productivity by an estimated 30%."
  • Data Steward /ˈdeɪtə ˈstjuːərd/

    A person responsible for the accuracy, integrity, and availability of a specific data domain or dataset — typically a domain expert who acts as the bridge between business and engineering. Data stewards own the business glossary definitions, approve access requests, and are accountable for data quality in their domain.

    "The finance data steward owns the definition of 'recognised revenue' in the business glossary — they are the final authority on how that term is calculated and any changes to the definition must go through them. When there is a discrepancy between reporting systems, the data steward investigates and resolves it."
  • Business Glossary /ˈbɪznɪs ˈɡlɒsəri/

    A centralised dictionary of business terms with agreed definitions — what does "active customer" mean? Is a trial user "active"? Is a churned user who re-engaged "active"? A business glossary eliminates the ambiguity that causes different teams to report different numbers from the same underlying data.

    "We had five different definitions of 'churn rate' across the business — finance, product, and customer success each calculated it differently. The business glossary project took 8 weeks but produced 180 agreed definitions. Now when the board asks for churn rate, everyone reports the same number."
  • Metadata /ˈmetəˌdeɪtə/

    Data that describes other data — table schemas, column descriptions, data types, record counts, freshness timestamps, ownership, sensitivity classification (PII, financial), lineage relationships, and quality metrics. Rich metadata enables automated governance, self-service discovery, and reliable data products.

    "Our metadata platform automatically captures technical metadata (schema, row count, last update, query frequency) and enriches it with business metadata (owner, description, sensitivity, SLA) maintained by stewards. This combination lets consumers assess dataset fitness without consulting the owning team."
  • Data Quality /ˈdeɪtə ˈkwɒlɪti/

    The degree to which data meets the requirements of the systems and people that use it, measured across dimensions including completeness (required values are present), accuracy (values reflect reality), timeliness (fresh enough for its use), consistency (the same fact agrees across systems), and validity (values conform to expected rules).

    "Our data quality framework defines quality contracts for 150 critical datasets: completeness >99%, null rate <0.5% for key columns, freshness within 2 hours of source update, and referential integrity checks. Breaches trigger alerts to the data steward and block downstream pipeline execution until resolved."
  • Data Observability /ˈdeɪtə əbˌzɜːvəˈbɪlɪti/

    The ability to understand the health and state of data in a system at any point in time — detecting anomalies in volume, freshness, schema, and distribution automatically. Analogous to infrastructure observability (monitoring, alerting, tracing) applied to data pipelines. Tools: Monte Carlo, Soda, Great Expectations.

    "Data observability caught a schema change in an upstream source table before any dashboards broke — the new 'order_status' column had different value encoding than expected, which would have silently zeroed out 8% of order records in the aggregation. The observability platform alerted within 15 minutes of the schema change."
  • Data Governance /ˈdeɪtə ˈɡʌvənəns/

    The framework of policies, processes, roles, and responsibilities that ensure data is managed as a valuable, compliant, and trustworthy organisational asset. Encompasses data quality standards, access control, privacy compliance, retention policies, and data classification. Effective governance is an enabler of data use, not just a constraint.

    "Our data governance framework is structured around three layers: policies (what must be true — PII must be classified, all datasets must have an owner), processes (how to comply — access request workflow, classification process), and metrics (are we complying — tagging coverage, quality scores, access review completion rate)."
  • Certified Dataset /ˈsɜːtɪfaɪd ˈdeɪtəˌset/

    A data asset that has been formally validated by data stewards as meeting quality, documentation, and governance standards — recommended as the trusted source for a specific domain. Certification signals to consumers that this dataset is safe to use for reporting and decision-making.

    "Our data catalog has a Certified badge for 42 datasets — these have met the quality criteria: documented business glossary alignment, owner-verified definitions, quality SLA above 99%, and freshness within defined windows. Analysts are guided to use certified datasets first; uncertified tables require additional due diligence."
  • Data Product /ˈdeɪtə ˈprɒdʌkt/

    A curated, reliable, well-documented data asset built and maintained to serve specific consumer needs — treated as a product with an owner, SLA, versioning, and a consumer-facing interface. The data product paradigm (from data mesh) shifts responsibility from centralised data teams to domain teams who both produce and own their data.

    "The orders domain team owns the 'orders-events' data product: a Kafka stream and a warehouse table with a 99.9% freshness SLA, versioned schema, documented fields, and a Slack channel for consumers. Any breaking change goes through a deprecation notice to all consuming teams."
  • OpenLineage /ˈəʊpən ˈlɪniɪdʒ/

    An open-source standard and API specification for collecting, processing, and storing data lineage metadata across heterogeneous data tools. Supported by Apache Airflow, Spark, dbt, and other pipeline tools — enables vendor-neutral lineage capture from multiple systems into a single lineage graph.

    "We standardised on OpenLineage to capture lineage across our Airflow pipelines, dbt transformations, and Spark jobs — all emitting lineage events to Marquez. This gave us end-to-end lineage from source databases to production dashboards without requiring a proprietary lineage vendor."