English for Data Governance Teams: Quality, Lineage, and Catalog Discussions

Master the English vocabulary and communication patterns for data governance — data quality, lineage, catalog management, and data stewardship.

Data governance is one of the fastest-growing disciplines in the data industry — and one where the vocabulary is especially dense. Data quality frameworks, lineage graphs, catalog metadata, and stewardship processes all have precise terminology that differs significantly from general software engineering language.

For non-native English speakers working in data engineering, analytics engineering, or data platform roles, this vocabulary is essential for participating in governance reviews, writing data quality reports, and managing relationships with data consumers and producers.

This guide covers the key concepts and the specific English communication patterns used in data governance contexts.

What Data Governance Actually Covers

Data governance is the framework of processes, policies, roles, and standards that ensure data is accurate, consistent, discoverable, secure, and used responsibly. It answers questions like:

  • “Where did this number come from?”
  • “Who is responsible for this dataset?”
  • “Is this data fit for the purpose we’re using it for?”
  • “Who has access to this sensitive field, and why?”

The governance team typically includes data stewards, data owners, data engineers, and compliance officers — all of whom need to communicate precisely about data.

Data Quality Vocabulary

Data quality is measured across multiple dimensions. Knowing the vocabulary of each dimension lets you write accurate quality reports and participate in quality reviews.

1. Accuracy

The degree to which data correctly reflects the real-world entity it represents.

Usage: “The customer revenue field has an accuracy issue — it includes refunded transactions that were not properly excluded, so the figures are overstated by approximately 7%.”

2. Completeness

The degree to which all required data is present and not missing.

Usage: “The completeness score for the email field is 73% — 27% of records are missing a valid email address, which is preventing the email campaign from running.”

Common completeness vocabulary:

  • “Null values” — missing data in a field
  • “Mandatory field” — a field that must not be null
  • “Fill rate” — the percentage of non-null values: “The fill rate for the phone_number field is 61%.”
  • “Missing at random (MAR)” — when the probability that a value is missing may depend on other observed fields, but not on the missing value itself
  • “Missing not at random (MNAR)” — when the missingness is related to the value itself (e.g., high earners not reporting income)
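As a rough illustration of the fill rate term, here is a minimal Python sketch. The function name fill_rate and the sample values are hypothetical, and treating blank strings as missing is an assumption that your own quality rules may not share.

```python
from typing import Optional, Sequence


def fill_rate(values: Sequence[Optional[str]]) -> float:
    """Return the percentage of values that are present.

    Assumption: None and blank strings both count as missing.
    """
    if not values:
        return 0.0
    present = sum(1 for v in values if v is not None and v.strip() != "")
    return 100.0 * present / len(values)


# Hypothetical sample of a phone_number column: 2 of 5 values are present.
phone_numbers = ["+44 20 7946 0000", None, "", "+1 202 555 0100", None]
print(f"Fill rate: {fill_rate(phone_numbers):.0f}%")  # Fill rate: 40%
```

In a report you would then say: “The fill rate for the phone_number field is 40%, below the mandatory-field target.”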

3. Consistency

The degree to which data is consistent across systems and datasets — the same entity described the same way in multiple places.

Usage: “There’s a consistency issue between the CRM and the data warehouse — the customer record for ‘Acme Corp’ uses different currency codes in each system. The CRM shows GBP, but the warehouse shows USD.”
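A consistency check like the one described above can be sketched as a comparison of the same attribute in two systems. The dictionaries below are hypothetical stand-ins for a CRM export and a warehouse query result, and find_mismatches is an illustrative name, not a function from any particular tool.

```python
def find_mismatches(crm: dict[str, str], warehouse: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Return {customer: (crm_value, warehouse_value)} where the two systems disagree.

    Only customers present in both systems are compared.
    """
    return {
        key: (crm[key], warehouse[key])
        for key in crm.keys() & warehouse.keys()
        if crm[key] != warehouse[key]
    }


# Hypothetical currency codes per customer in each system.
crm_currency = {"Acme Corp": "GBP", "Globex": "USD"}
warehouse_currency = {"Acme Corp": "USD", "Globex": "USD"}

mismatches = find_mismatches(crm_currency, warehouse_currency)
print(mismatches)  # {'Acme Corp': ('GBP', 'USD')}
```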

4. Timeliness / Freshness

The degree to which data is up to date for its intended use. Often called freshness, especially in modern data stack contexts.

Usage: “The product catalogue table has a freshness SLA of 4 hours. The current lag is 11 hours — the pipeline failed at 03:00 UTC and the on-call has been paged.”

Common timeliness vocabulary:

  • “Data lag” — the delay between when something happens and when it is reflected in the data
  • “Freshness SLA” — the maximum acceptable lag
  • “Stale data” — data that exceeds its freshness threshold
  • “Backfill” — the process of reprocessing historical data to fix or add data
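The relationship between data lag, freshness SLA, and stale data can be sketched in a few lines of Python. The timestamps below are hypothetical and were chosen to reproduce the 11-hour lag against a 4-hour SLA from the usage example; is_stale is an illustrative name.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_loaded_at: datetime, sla: timedelta, now: datetime) -> bool:
    """Data is stale when its lag exceeds the freshness SLA."""
    return now - last_loaded_at > sla


# Hypothetical timestamps: the last successful load was at 03:00 UTC.
last_loaded_at = datetime(2025, 6, 3, 3, 0, tzinfo=timezone.utc)
now = datetime(2025, 6, 3, 14, 0, tzinfo=timezone.utc)
freshness_sla = timedelta(hours=4)

lag = now - last_loaded_at
print(f"Lag: {lag}, stale: {is_stale(last_loaded_at, freshness_sla, now)}")
# Lag: 11:00:00, stale: True
```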

5. Uniqueness

The degree to which records are not duplicated. Duplicate records are one of the most common and damaging data quality issues.

Usage: “The uniqueness check on the order_id field failed — we have 1,247 duplicate order IDs in last month’s data. This is causing double-counting in the revenue dashboard.”
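A uniqueness check of this kind can be sketched with the standard library's Counter. The order IDs below are made up for illustration, and find_duplicates is a hypothetical helper name.

```python
from collections import Counter


def find_duplicates(ids: list[str]) -> dict[str, int]:
    """Return each ID that appears more than once, with its occurrence count."""
    counts = Counter(ids)
    return {value: n for value, n in counts.items() if n > 1}


# Hypothetical order_id values.
order_ids = ["A-100", "A-101", "A-100", "A-102", "A-101", "A-100"]
duplicates = find_duplicates(order_ids)
print(duplicates)  # {'A-100': 3, 'A-101': 2}
```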

6. Validity

The degree to which data values conform to expected formats, ranges, or reference lists.

Usage: “We found 340 records where the country_code field contains values not in the ISO 3166 standard list. These are causing failures in the downstream tax calculation service.”
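A validity check against a reference list might look like the sketch below. The ALLOWED_COUNTRY_CODES set is a tiny illustrative subset, not the full ISO 3166 list, and the sample values are hypothetical.

```python
# Illustrative subset only; a real check would load the full reference list.
ALLOWED_COUNTRY_CODES = {"GB", "US", "DE", "FR"}


def invalid_values(values: list[str], allowed: set[str]) -> list[str]:
    """Return the values that are not in the reference list, in original order."""
    return [v for v in values if v not in allowed]


# Hypothetical country_code column. "UK" and "usa" look plausible to a human,
# but neither matches the reference list ("GB" is the ISO 3166-1 alpha-2 code).
country_codes = ["GB", "UK", "US", "usa", "DE"]
print(invalid_values(country_codes, ALLOWED_COUNTRY_CODES))  # ['UK', 'usa']
```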

Data Quality Report Language

When writing a data quality report, use structured, precise language:

“The orders table in the production data warehouse was assessed for data quality across six dimensions. Accuracy, consistency, uniqueness, and validity are within acceptable thresholds. Two issues were identified:

1. Completeness: The shipping_address field has a fill rate of 64%, below the 95% target. Root cause: the new mobile checkout flow does not capture shipping address for digital products. Recommended action: add conditional field capture logic in the mobile checkout (ticket: DATA-891).

2. Timeliness: The table’s freshness SLA is 2 hours. Current P95 lag is 3.4 hours, driven by a bottleneck in the transformation pipeline during peak load. Recommended action: investigate partition pruning optimisation (ticket: DATA-892).”

Data Lineage Vocabulary

Data lineage describes where data comes from, how it has been transformed, and where it flows to. It is the audit trail of your data pipeline.

7. Lineage

The full traceable history of a data asset — from its source system through all transformations to its current location.

Usage: “The lineage for the monthly_revenue metric traces back through three transformation layers to the raw stripe_events table ingested from the Stripe API.”

8. Upstream / Downstream

  • Upstream — data or systems that feed into the current dataset
  • Downstream — data or systems that consume the current dataset

Usage: “The user_attributes table has 14 downstream dependencies — before we modify the schema, we need to assess impact on all 14.”
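Impact analysis of this kind amounts to walking the lineage graph downstream. The sketch below assumes lineage is available as a simple mapping from each asset to its direct consumers; the table names and the downstream_of helper are hypothetical, and real catalogs expose lineage through their own APIs.

```python
from collections import deque


def downstream_of(asset: str, edges: dict[str, list[str]]) -> set[str]:
    """Return every asset that depends on `asset`, directly or indirectly.

    `edges` maps each asset to the assets that read from it directly.
    """
    seen: set[str] = set()
    queue = deque([asset])
    while queue:
        current = queue.popleft()
        for child in edges.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Hypothetical lineage edges: upstream asset -> direct downstream consumers.
lineage = {
    "raw_users": ["user_attributes"],
    "user_attributes": ["churn_model", "marketing_segments"],
    "marketing_segments": ["campaign_dashboard"],
}

impacted = downstream_of("user_attributes", lineage)
print(sorted(impacted))  # ['campaign_dashboard', 'churn_model', 'marketing_segments']
```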

9. Source System (Source of Record / System of Record)

The authoritative system from which data originates. The source of record is the definitive, trusted source for a specific data entity.

Usage: “For customer identity data, the CRM is the system of record. Any discrepancies between the data warehouse and the CRM should resolve in favour of the CRM.”

10. Data Pipeline

The sequence of steps — ingestion, transformation, loading — that move data from source to destination.

Common pipeline vocabulary:

  • “Ingestion” — the first step: bringing raw data into the platform
  • “Transformation” — cleaning, shaping, and enriching data
  • “Loading” — writing transformed data to its destination
  • “Orchestration” — scheduling and managing pipeline execution
  • “DAG” (Directed Acyclic Graph) — the structure used by orchestration tools (Airflow, Dagster) to represent pipeline dependencies
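To make the DAG idea concrete, the sketch below uses Python's standard graphlib module (Python 3.9+) to derive a valid execution order from declared dependencies. It is only an illustration of the concept, not how Airflow or Dagster are configured, and the step names are hypothetical.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on (its upstream steps).
pipeline = {
    "transformation": {"ingestion"},
    "loading": {"transformation"},
}

# static_order() yields steps so that every step comes after its dependencies.
# It raises graphlib.CycleError if the dependencies contain a cycle,
# which is why the graph must be acyclic.
run_order = list(TopologicalSorter(pipeline).static_order())
print(run_order)  # ['ingestion', 'transformation', 'loading']
```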

11. Breaking Change

In the context of data lineage, a change to a dataset (schema change, logic change, rename) that breaks downstream consumers.

Usage: “Renaming the customer_id column to account_id is a breaking change — all 14 downstream dbt models reference the old column name and will fail.”
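One way to spot a breaking change before it ships is to compare the old and new schemas. The sketch below assumes schemas are plain mappings of column name to type; breaking_changes is a hypothetical helper. It treats removed or renamed columns and type changes as breaking, and newly added columns as non-breaking, which is a common convention but not a universal rule.

```python
def breaking_changes(old_schema: dict[str, str], new_schema: dict[str, str]) -> list[str]:
    """List changes that would break consumers reading the old schema."""
    problems = []
    for column, old_type in old_schema.items():
        if column not in new_schema:
            # A rename looks like a removal to any consumer using the old name.
            problems.append(f"column removed or renamed: {column}")
        elif new_schema[column] != old_type:
            problems.append(f"type changed: {column} {old_type} -> {new_schema[column]}")
    return problems


# Hypothetical schemas: customer_id is renamed to account_id.
old = {"customer_id": "string", "amount": "decimal"}
new = {"account_id": "string", "amount": "decimal"}
print(breaking_changes(old, new))  # ['column removed or renamed: customer_id']
```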

How to communicate a breaking change:

“This is a breaking change to the schema of the orders table. We are removing the deprecated legacy_status_code field and renaming customer_ref to customer_id. Downstream teams should plan for migration by Friday 20 June. A migration guide is available at [link]. Please reach out if your pipeline depends on these fields so we can coordinate.”

Data Catalog Vocabulary

A data catalog is a centralised inventory of data assets — tables, fields, dashboards, metrics — with metadata that makes them discoverable and understandable.

12. Data Asset

Any data entity managed in the catalog — a table, a view, a dashboard, a metric definition, a data model.

13. Metadata

Information about data rather than the data itself. Types of metadata:

  • Technical metadata — schema, data types, row counts, last updated timestamp
  • Business metadata — business description, owner, classification, SLA
  • Operational metadata — pipeline run history, quality check results
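The three metadata types can be pictured as fields on a single catalog entry. The CatalogEntry class below is a hypothetical sketch for illustration; real catalog tools define their own models.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    name: str
    # Technical metadata
    columns: dict[str, str]
    row_count: int
    # Business metadata
    description: str
    owner: str
    classification: str
    # Operational metadata: quality check name -> whether it passed
    quality_checks: dict[str, bool] = field(default_factory=dict)

    def failing_checks(self) -> list[str]:
        """Return the names of quality checks that did not pass, sorted."""
        return sorted(name for name, passed in self.quality_checks.items() if not passed)


# Hypothetical entry for an orders table.
orders = CatalogEntry(
    name="orders",
    columns={"order_id": "string", "amount": "decimal"},
    row_count=1_250_000,
    description="One row per customer order.",
    owner="Finance",
    classification="Internal",
    quality_checks={"uniqueness": True, "completeness": False},
)
print(orders.failing_checks())  # ['completeness']
```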

14. Data Owner

The person or team accountable for the quality, accuracy, and appropriate use of a dataset. This is usually not the engineer who built the pipeline, but the business stakeholder who is responsible for the data’s meaning.

Usage: “For the revenue metric, the data owner is the Finance team. Any changes to the metric definition require their sign-off.”

15. Data Steward

The person responsible for the day-to-day management and documentation of a dataset — maintaining metadata, resolving quality issues, and being the first point of contact for data consumers.

16. Data Dictionary

A document or catalog feature that defines each field in a dataset: name, description, data type, allowed values, and any business rules.

Usage: “Before consuming this table, please review the data dictionary — particularly the notes on the adjusted_revenue field, which excludes specific revenue categories that are documented there.”

17. Business Glossary

A centralised list of business terms and their official definitions — ensuring that “active customer”, “monthly recurring revenue”, and “churn” mean the same thing across all teams.

Usage: “We had a discrepancy in the executive dashboard because Marketing and Finance were using different definitions of ‘active customer’. We resolved it by adding the agreed definition to the business glossary and linking it from both dashboards.”

18. PII — Personally Identifiable Information

Data that can be used to identify a specific individual — names, email addresses, phone numbers, IP addresses, biometric data. Datasets containing PII require special handling, access controls, and documentation.

Usage: “This table contains PII — email address and IP address. Access is restricted to the Analytics team under the data access policy. Do not copy PII to staging environments.”
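A rule such as “do not copy PII to staging environments” is often enforced by redacting PII fields before data leaves production. The sketch below is a minimal illustration under that assumption; the field names and the redact_pii helper are hypothetical, and real policies may require stronger techniques than simple redaction.

```python
# Hypothetical list of fields classified as PII for this table.
PII_FIELDS = {"email", "ip_address"}


def redact_pii(record: dict[str, str], pii_fields: set[str] = PII_FIELDS) -> dict[str, str]:
    """Return a copy of the record with PII field values replaced by a placeholder."""
    return {
        key: "[REDACTED]" if key in pii_fields else value
        for key, value in record.items()
    }


# Hypothetical production record.
record = {"user_id": "u-42", "email": "jane@example.com", "ip_address": "203.0.113.7"}
print(redact_pii(record))
# {'user_id': 'u-42', 'email': '[REDACTED]', 'ip_address': '[REDACTED]'}
```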

19. Data Classification

Categorising data assets by sensitivity level — often: Public, Internal, Confidential, Restricted. Classification determines access controls, retention policies, and handling requirements.
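If the four levels are treated as a strict hierarchy, the link between classification and access can be sketched with an ordered enum. This is a simplifying assumption: real access policies often add conditions such as team membership or purpose of use. The can_access function is hypothetical.

```python
from enum import IntEnum


class Classification(IntEnum):
    PUBLIC = 1
    INTERNAL = 2
    CONFIDENTIAL = 3
    RESTRICTED = 4


def can_access(clearance: Classification, asset_level: Classification) -> bool:
    """Allow access when the user's clearance is at least the asset's level."""
    return clearance >= asset_level


print(can_access(Classification.INTERNAL, Classification.PUBLIC))      # True
print(can_access(Classification.INTERNAL, Classification.RESTRICTED))  # False
```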

20. Retention Policy

The rule governing how long data is kept before it is deleted or anonymised. Driven by legal requirements (GDPR, CCPA) and business needs.

Usage: “Under our GDPR retention policy, customer personal data is deleted 24 months after the account is closed. The automated deletion job runs on the 1st of each month.”
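The date logic behind a rule like “deleted 24 months after the account is closed” can be sketched as follows. The helper names are hypothetical, the dates are made up, and clamping to the last day of the month (for example, 29 February to 28 February) is an assumption about how the policy handles month ends.

```python
import calendar
from datetime import date


def add_months(d: date, months: int) -> date:
    """Add calendar months, clamping the day to the end of the target month."""
    month_index = d.year * 12 + (d.month - 1) + months
    year, month_zero_based = divmod(month_index, 12)
    month = month_zero_based + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def is_due_for_deletion(closed_on: date, today: date, retention_months: int = 24) -> bool:
    """True once the retention period since account closure has fully elapsed."""
    return today >= add_months(closed_on, retention_months)


# Hypothetical run of the monthly deletion job on 1 June 2025.
job_date = date(2025, 6, 1)
print(is_due_for_deletion(date(2023, 5, 15), job_date))  # True
print(is_due_for_deletion(date(2023, 7, 1), job_date))   # False
```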

Communication Patterns for Data Governance Discussions

Raising a Data Quality Issue

“I want to flag a data quality concern with the orders table. We’ve detected approximately 1,200 duplicate records in the past 30 days, introduced by a change to the ingestion pipeline on 3 June. The duplicates are causing revenue to be overstated by approximately £47,000 in the June report. I’ve created a ticket (DATA-901) and tagged the pipeline owner. We recommend pausing the June revenue report until the duplicates are resolved and historical data is corrected.”

Requesting Lineage Documentation

“Before we build the new attribution model on top of this dataset, could we review the lineage for the campaign_events table? Specifically, I’d like to understand how the attribution_weight field is calculated — the business description in the catalog is vague, and I want to make sure our model is built on the correct understanding.”

Responding to a Data Consumer’s Question

“Good question about the monthly_active_users metric. According to our business glossary, a ‘monthly active user’ is defined as a user who has completed at least one session of 60 seconds or more in the calendar month. This definition was agreed with the Product team in January 2025 and is noted in the metric definition in the data catalog. If you have a different use case that requires a different definition, we can create a variant metric — let’s set up a quick call.”

Key Takeaways

  • Data quality has six key dimensions: accuracy, completeness, consistency, timeliness, uniqueness, and validity. Use the right term for the right issue.
  • Lineage vocabulary — upstream, downstream, source of record — is essential for impact analysis and incident communication.
  • Breaking changes must be communicated proactively: what is changing, who is affected, when, and how to migrate.
  • The data catalog gives every asset discoverable metadata — always link to it in communications rather than restating definitions inline.
  • Data owner (accountable) and data steward (operational) are distinct roles — use the right term.
  • PII and data classification are not purely technical concepts — they are governance responsibilities that require precise language.

Data governance work is fundamentally about communication — between teams, across systems, and over time. Clear English makes governance policies enforceable and data assets trustworthy.