Data Governance Vocabulary for Engineers
60 essential data governance terms for data engineers: data stewardship, lineage, cataloguing, access control, data quality, compliance, and metadata management.
Data governance is the framework of policies, roles, standards, and processes that ensures data is trusted, traceable, and used responsibly across an organisation. As data engineering matures, engineers are expected to implement governance — not just build pipelines. This guide covers the 60 terms you need to participate fluently in data governance conversations.
Core Governance Concepts
Data Governance
Data governance is the overall framework for managing data as a corporate asset — ensuring data is accurate, available, consistent, secure, and used responsibly.
“Data governance isn’t just a compliance checkbox — it’s how we ensure our data is trustworthy enough to make decisions with.”
Data Management
Data management is the full lifecycle of data practices: collection, storage, processing, security, and disposal. Governance is the policies; management is the practices.
Data Strategy
A data strategy is an organisation-wide plan for how data will be collected, stored, managed, and used to achieve business objectives.
Roles and Responsibilities
Data Owner
A data owner is a senior business stakeholder accountable for a dataset or data domain — responsible for defining access policies and quality standards.
“The data owner for the customer domain is the VP of CX. She approves all new access requests for customer PII.”
Data Steward
A data steward is the operational practitioner responsible for the day-to-day quality, documentation, and management of a specific dataset or domain.
“Our data steward for the financial data maintains the data dictionary and reviews quality alerts every morning.”
Data Custodian / Data Engineer
The data custodian (often the data engineer) implements and enforces the technical controls mandated by the owner and steward — access controls, encryption, retention policies.
Chief Data Officer (CDO)
The CDO is the executive responsible for enterprise data strategy, governance, and data quality across the organisation.
Data Consumer
A data consumer is anyone who uses data for analysis, reporting, or building products — analysts, data scientists, product teams.
Data Quality
Data Quality
Data quality refers to the suitability of data for its intended purpose, measured across dimensions like accuracy, completeness, consistency, and timeliness.
Accuracy
Data accuracy is the degree to which data correctly reflects the real-world entity it represents. An incorrect customer address is an accuracy problem.
Completeness
Completeness is the proportion of records with non-null values in required fields. “Order completeness is 99.7% — 0.3% of orders lack a customer ID.”
Consistency
Consistency means the same data attribute has the same value across all systems. “Customer status is inconsistent between the CRM and the warehouse — they show ‘inactive’ and ‘churned’ for the same customer.”
Timeliness / Freshness
Timeliness (also freshness) is whether data is up-to-date enough for its intended use. “Financial reporting requires T+1 data freshness — yesterday’s transactions must be available by 6 a.m.”
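A freshness check like the T+1 example above can be sketched as a simple comparison against a staleness budget. This is a minimal illustration; the function name and the one-day window are assumptions, not a standard API.

```python
from datetime import datetime, timedelta, timezone

def is_fresh(last_loaded, max_age, now=None):
    """Return True if the dataset was loaded within the allowed staleness window."""
    now = now or datetime.now(timezone.utc)
    return now - last_loaded <= max_age

# T+1 freshness: at the 6 a.m. check, yesterday's load must be at most a day old
loaded = datetime(2024, 3, 2, 5, 30, tzinfo=timezone.utc)
check = datetime(2024, 3, 2, 6, 0, tzinfo=timezone.utc)
print(is_fresh(loaded, timedelta(days=1), now=check))  # True
```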
Uniqueness
Uniqueness means a record (or combination of fields) appears exactly once — no duplicates. “The order_id must be unique across the entire orders table.”
Validity
Validity is whether data values conform to the defined rules and formats. An email field containing ‘N/A’ is a validity problem.
Data Quality Score / DQ Score
A DQ score is a composite metric summarising data quality across dimensions — typically expressed as a percentage.
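The dimensions above can be combined into a composite score. The sketch below measures completeness, uniqueness, and validity over a small batch of records and averages them; the unweighted average, the sample records, and the field names are illustrative assumptions — real scoring schemes usually weight dimensions by business impact.

```python
# Illustrative records: one missing customer_id (completeness),
# one duplicated order_id (uniqueness), one bad status (validity).
records = [
    {"order_id": "o1", "customer_id": "c1", "status": "shipped"},
    {"order_id": "o2", "customer_id": None, "status": "pending"},
    {"order_id": "o2", "customer_id": "c3", "status": "???"},
]

VALID_STATUSES = {"pending", "confirmed", "shipped", "delivered", "cancelled"}

def completeness(rows, field):
    """Proportion of rows with a non-null value in `field`."""
    return sum(r[field] is not None for r in rows) / len(rows)

def uniqueness(rows, field):
    """Proportion of values in `field` that are distinct."""
    values = [r[field] for r in rows]
    return len(set(values)) / len(values)

def validity(rows, field, allowed):
    """Proportion of rows whose `field` value is in the allowed set."""
    return sum(r[field] in allowed for r in rows) / len(rows)

def dq_score(rows):
    """Unweighted average across the three measured dimensions."""
    dims = [
        completeness(rows, "customer_id"),
        uniqueness(rows, "order_id"),
        validity(rows, "status", VALID_STATUSES),
    ]
    return sum(dims) / len(dims)
```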
Metadata and Cataloguing
Metadata
Metadata is data about data — descriptive information that explains what a dataset is, where it came from, who owns it, and how it should be used.
Types:
- Technical metadata — schema, data types, row counts, last updated
- Business metadata — descriptions, ownership, business definitions
- Operational metadata — pipeline run times, job status, SLA compliance
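The three metadata types might be represented together for a single table as in the sketch below. The field names and values are illustrative, not a standard format.

```python
# One possible shape for a table's metadata record, grouping the
# technical, business, and operational types described above.
table_metadata = {
    "technical": {
        "schema": {"order_id": "STRING", "amount": "DECIMAL(10,2)"},
        "row_count": 1_204_311,
        "last_updated": "2024-03-01T06:00:00Z",
    },
    "business": {
        "description": "One row per customer order.",
        "owner": "VP of CX",
        "steward": "finance-data-steward",
    },
    "operational": {
        "pipeline": "orders_daily_load",
        "last_run_status": "success",
        "sla_met": True,
    },
}
```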
Data Catalogue (US: Data Catalog)
A data catalogue is a searchable inventory of all data assets in an organisation — with metadata, lineage, ownership, and quality information. Examples: Apache Atlas, Collibra, Alation, DataHub.
“Before building your pipeline, check the data catalogue — the customer table might already be documented with ownership and quality SLOs.”
Data Dictionary
A data dictionary is a formal reference document defining the meaning, type, format, and constraints of every field in a dataset or database.
“The data dictionary says status can only be one of: pending, confirmed, shipped, delivered, cancelled — any other value is a data quality issue.”
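Dictionary constraints like this can be enforced at load time. The sketch below assumes a simple dictionary format mapping each field to its allowed values; real data dictionaries also carry types, formats, and nullability.

```python
# Assumed data-dictionary format: field name -> set of allowed values.
DATA_DICTIONARY = {
    "status": {"pending", "confirmed", "shipped", "delivered", "cancelled"},
}

def dictionary_violations(record):
    """Return the fields whose values break a data-dictionary constraint."""
    return [
        field
        for field, allowed in DATA_DICTIONARY.items()
        if field in record and record[field] not in allowed
    ]
```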
Business Glossary
A business glossary defines business terms in plain language — what does “active customer” mean? What is “revenue” by our definition?
“The finance and product teams have different definitions of ‘monthly active user’ — the business glossary should resolve this.”
Schema Registry
A schema registry stores and versions schemas for streaming data (Kafka, Pulsar) — ensuring producers and consumers agree on the data format. Examples: Confluent Schema Registry, AWS Glue Schema Registry.
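To show the idea, here is a minimal in-memory sketch: versioned schemas per subject, with a naive backward-compatibility rule (a new schema may add fields but not drop existing ones). Real registries such as Confluent's offer several configurable compatibility modes; this class and its rule are assumptions for illustration only.

```python
class SchemaRegistry:
    """Toy registry: subject -> ordered list of schema versions."""

    def __init__(self):
        self._subjects = {}

    def register(self, subject, schema):
        """Append a new version if it keeps every field of the latest one."""
        versions = self._subjects.setdefault(subject, [])
        if versions and not set(versions[-1]) <= set(schema):
            raise ValueError("incompatible: new schema drops existing fields")
        versions.append(schema)
        return len(versions)  # 1-based version number

    def latest(self, subject):
        return self._subjects[subject][-1]

registry = SchemaRegistry()
registry.register("orders-value", {"order_id": "string"})
registry.register("orders-value", {"order_id": "string", "amount": "double"})
```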
Data Lineage
Data Lineage
Data lineage is the record of where data comes from, how it flows through systems, and what transformations it undergoes — from source to destination.
“The auditor asked for data lineage on the revenue figure — I traced it from the orders table through three dbt models to the finance report.”
Upstream / Downstream
Upstream — systems or datasets that feed data into the current dataset. Downstream — systems or datasets that consume data from the current dataset.
“If we change the schema of the raw orders table, we need to assess the impact on all downstream pipelines.”
Column-Level Lineage
Column-level lineage tracks not just which tables flow where, but specifically which columns produce which output columns through transformations.
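Column-level lineage is naturally a graph, and downstream impact analysis (as in the schema-change example above) is a traversal of it. The table and column names below are illustrative assumptions.

```python
# Edge list: each column maps to the columns directly derived from it.
LINEAGE = {
    "raw.orders.amount": ["staging.orders.amount_usd"],
    "staging.orders.amount_usd": ["marts.finance.revenue"],
    "raw.orders.order_id": ["staging.orders.order_id"],
}

def downstream(column):
    """All columns that directly or transitively depend on `column`."""
    impacted, stack = set(), [column]
    while stack:
        for child in LINEAGE.get(stack.pop(), []):
            if child not in impacted:
                impacted.add(child)
                stack.append(child)
    return impacted
```

Running `downstream("raw.orders.amount")` walks the chain to the finance report's revenue column — the same trace the auditor example describes, in reverse.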
Access Control and Security
Role-Based Access Control (RBAC)
RBAC grants data access based on a user’s organisational role. “Analysts have read access to the silver layer; only the data platform team has access to raw.”
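A minimal sketch of the RBAC decision, using the roles and layers from the quote above (the grant table format is an assumption):

```python
# Role -> set of (resource, action) grants. Contents mirror the example:
# analysts read silver; the data platform team also touches raw.
ROLE_GRANTS = {
    "analyst": {("silver", "read")},
    "data_platform": {("raw", "read"), ("raw", "write"),
                      ("silver", "read"), ("silver", "write")},
}

def rbac_allows(roles, layer, action):
    """Permit the action if any of the user's roles grants it."""
    return any((layer, action) in ROLE_GRANTS.get(role, set()) for role in roles)
```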
Attribute-Based Access Control (ABAC)
ABAC grants access based on attributes of the user, resource, and context — more granular than RBAC. “Employees in APAC can only see customer records where region=‘APAC’.”
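The contrast with RBAC is that the ABAC decision inspects attributes of both the user and the record, not just role membership. A sketch of the APAC rule above (attribute names are assumptions):

```python
def abac_allows(user, record):
    """Permit access only if the user holds the role AND regions match."""
    return "analyst" in user["roles"] and user["region"] == record["region"]
```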
Data Masking
Data masking replaces sensitive values with realistic but fictitious alternatives for non-production use. “In the dev environment, customer emails are masked: john@company.com becomes j.xxx@company.com.”
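A masking rule matching the example above might look like this: keep the first character of the local part, replace the rest, and preserve the domain. This is one illustrative scheme, not a standard; production masking usually also handles edge cases like empty local parts.

```python
def mask_email(email):
    """Mask the local part of an email, keeping its first character."""
    local, _, domain = email.partition("@")
    return f"{local[:1]}.xxx@{domain}"

print(mask_email("john@company.com"))  # j.xxx@company.com
```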
Tokenisation
Tokenisation replaces a sensitive value (e.g. credit card number) with a token — a non-sensitive substitute that maps back to the original through a secure token vault.
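An in-memory sketch of the token vault idea (a real vault is a hardened external service; the `tok_` prefix and random UUID tokens are assumptions):

```python
import uuid

class TokenVault:
    """Toy vault: tokens map back to originals only through this object."""

    def __init__(self):
        self._vault = {}

    def tokenise(self, value):
        """Replace a sensitive value with a random, non-reversible token."""
        token = f"tok_{uuid.uuid4().hex}"
        self._vault[token] = value
        return token

    def detokenise(self, token):
        """Recover the original value; only the vault can do this."""
        return self._vault[token]
```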
Column-Level Security
Column-level security restricts access to specific columns within a table based on user role — e.g., hiding the salary column from non-HR users.
Row-Level Security (RLS)
Row-level security restricts which rows a user can see — e.g., a regional manager sees only records for their region.
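In application terms, RLS behaves like a filter applied before any rows reach the user; warehouses typically enforce this with row access policies rather than application code. A minimal sketch of the regional-manager example (field names assumed):

```python
def apply_rls(rows, user):
    """Return only the rows belonging to the user's region."""
    return [row for row in rows if row["region"] == user["region"]]

orders = [
    {"order_id": "o1", "region": "EMEA"},
    {"order_id": "o2", "region": "APAC"},
]
print(apply_rls(orders, {"region": "APAC"}))  # [{'order_id': 'o2', 'region': 'APAC'}]
```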
Compliance and Regulation
GDPR (General Data Protection Regulation)
GDPR is the EU regulation governing personal data — requiring consent, the right to deletion, data portability, and breach notification.
“This field contains EU citizen emails — it falls under GDPR. We need a retention policy and a documented lawful basis for processing.”
PII (Personally Identifiable Information)
PII is any data that can identify an individual — names, emails, phone numbers, IP addresses, biometrics.
Data Retention Policy
A data retention policy defines how long data is kept and when it must be deleted.
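Implementing a retention policy usually reduces to comparing record age against the policy window. A sketch, assuming a seven-year window (a common but illustrative figure for financial records):

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=365 * 7)  # assumed 7-year retention window

def expired(created_at, now):
    """True if the record has outlived the retention window and must be deleted."""
    return now - created_at > RETENTION
```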
Right to Erasure (Right to be Forgotten)
Under GDPR, users have the right to request their personal data be deleted. Data systems must be able to implement this — including removing data from analytics pipelines and backups.
Data Residency
Data residency requirements mandate that data must be stored and processed in specific geographic regions. “Our German customers’ data must remain in EU regions — we can’t route it through US-based services.”
Useful Phrases
In governance reviews:
- “This dataset doesn’t have a documented owner — we need to assign a data steward before we can add it to the catalogue.”
- “The quality score for this table is 87% — it’s below our 95% threshold for tier-1 datasets. We need to remediate the completeness issues.”
When explaining lineage:
- “I can trace this revenue number back to the raw transactions table through three transformations — all documented in the catalogue.”
In access control discussions:
- “We apply column-level security to the salary field — only HR and Finance roles can see it.”
Practice
Deepen your data governance vocabulary with the Data Engineering exercise set and the Data Engineer learning path.