How to Discuss Data Quality in English
Learn the vocabulary and phrases data engineers and analysts use to discuss, measure, and improve data quality in English-speaking technical teams.
Poor data quality is one of the most expensive and frustrating problems in software organisations. When analysts and engineers cannot trust the data, decisions slow down, trust erodes, and teams spend their time debugging pipelines instead of generating insights. Knowing how to discuss data quality clearly in English — how to name the problems, describe their impact, and propose solutions — is a valuable skill for any data professional.
Key Vocabulary
Data lineage Data lineage describes the origin of data and the journey it takes through a pipeline — every transformation, join, and aggregation it passes through before reaching its destination. Understanding lineage helps engineers trace the source of quality issues. Example: “We traced the data lineage and found that the null values in the revenue column were introduced by a join condition in the transformation step.”
Data freshness Data freshness refers to how up-to-date a dataset is relative to the source system. Stale data can be as damaging as inaccurate data, particularly for time-sensitive decisions. Example: “The dashboard shows data that is 18 hours old — the freshness expectation was six hours. We need to investigate why the pipeline is running late.”
Schema drift Schema drift occurs when the structure of an incoming dataset changes unexpectedly — for example, when a new column is added, an existing column is renamed, or a data type changes. Schema drift is a common source of pipeline failures. Example: “The pipeline failed this morning because of schema drift — the upstream team added a required column to the events table without notifying us.”
Data contract A data contract is a formal agreement between a data producer and a data consumer that specifies the expected schema, semantics, and quality guarantees of a dataset. Data contracts are increasingly used to prevent schema drift and data quality issues. Example: “We’ve introduced data contracts for all our critical upstream sources — any changes to the schema must be agreed upon before deployment.”
Completeness
Completeness measures the extent to which all expected data is present. A dataset is complete when no required records or fields are missing.
Example: “The completeness check flagged that 3% of records are missing the customer_email field, which is required for the marketing segment.”
Common Scenarios Where This Language Is Used
In a data quality incident review:
“We identified a data quality issue in the orders dataset. The total_amount field was returning negative values for approximately 2% of records, which was caused by a bug in the refund calculation logic introduced in last week’s release.”
In a planning meeting for data quality improvements: “I propose we implement automated data quality checks at the ingestion layer. We should monitor for completeness, schema drift, and freshness on all our critical datasets. I can set this up in Great Expectations by the end of the sprint.”
When reporting data quality metrics to stakeholders: “Last month, our data quality score across all monitored datasets averaged 97.3%. The main areas where we fell below our targets were freshness on the customer profile dataset and completeness on the transaction log.”
Useful Phrases for Data Quality Discussions
- “We’ve detected a data quality anomaly in the sales pipeline — the record count is 40% lower than yesterday.”
- “The freshness SLA for this dataset is four hours, but it hasn’t been updated in 12 hours.”
- “Schema drift caused the pipeline to fail — the upstream team added a non-nullable column without warning.”
- “We have automated checks for completeness, uniqueness, and referential integrity on all critical datasets.”
- “The data lineage shows that the issue originated in the source system, not in our pipeline.”
- “We need to establish a data contract with the upstream team to prevent this kind of breaking change.”
- “The duplicate records in this table suggest a deduplication step is missing from the ingestion process.”
- “Our data quality dashboard shows a confidence score for each dataset — anything below 95% triggers an alert.”
- “Before using this dataset in a model, we should validate that the distribution matches our expectations.”
- “The data quality issue has been resolved. The root cause was an incorrect timezone conversion in the ETL script.”
Dimensions of Data Quality
When discussing data quality formally, it helps to reference the standard dimensions. These are widely recognised and give your conversations more structure:
Accuracy: Does the data correctly represent the real-world entity it describes? Completeness: Are all expected records and fields present? Consistency: Is the data consistent across different systems or representations? Timeliness: Is the data available when it is needed? Uniqueness: Are there duplicate records where there should not be? Validity: Does the data conform to defined formats, ranges, and rules?
Using these dimensions helps you be precise: “The issue is a validity problem — the postal codes in this column don’t match the expected UK format” is more actionable than “the data looks wrong.”
Practice Suggestion
Think of a data quality issue you have encountered at work or in a personal project. Write a short incident description (200 words) in English that covers: what the issue was, how it was discovered, which dimension of data quality it affected, what the root cause was, and what was done to fix and prevent it. Use at least four of the vocabulary terms from this post. This format mirrors what you would write in a real data quality incident report.