English for Data Quality Discussions: Talking About Validation and Trust

Vocabulary and phrases for data quality discussions in English: completeness, accuracy, freshness, anomalies, and diplomatic ways to flag bad data to stakeholders.

When a dashboard shows a number nobody believes, or a pipeline silently drops half its rows, the conversation that follows is a data quality discussion. Data engineers, analysts, and stakeholders need a shared, precise English to describe what is wrong, how bad it is, and whether to trust the numbers. This guide gives you the vocabulary and the diplomatic phrasing.


The dimensions of data quality

Data quality isn’t one thing — it has named dimensions. Using the right one makes you precise.

Dimension | Question it answers | Example problem
Completeness | Is all the data there? | “20% of rows have a null country.”
Accuracy | Is the data correct? | “The totals don’t match the source.”
Consistency | Does it agree across systems? | “CRM and warehouse disagree.”
Freshness / timeliness | Is it up to date? | “The feed is six hours stale.”
Uniqueness | Are there duplicates? | “Each order appears twice.”
Validity | Does it match the expected format? | “Emails without an @.”

“The issue isn’t accuracy — the numbers are right — it’s freshness. The dashboard is showing stale data from this morning.”

Naming the exact dimension turns a vague “the data is bad” into an actionable statement.
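To make the dimensions concrete, here is a minimal Python sketch of four of them as checks on a handful of records. The rows, the field names (order_id, country, email), and the timestamps are invented for illustration, and the "@" rule is a deliberately crude stand-in for real email validation.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical sample rows; field names are illustrative only.
rows = [
    {"order_id": 1, "country": "DE", "email": "a@example.com"},
    {"order_id": 2, "country": None, "email": "b@example.com"},
    {"order_id": 2, "country": "FR", "email": "c.example.com"},
    {"order_id": 3, "country": "ES", "email": "d@example.com"},
    {"order_id": 4, "country": "IT", "email": "e@example.com"},
]


def completeness(rows, field):
    """Completeness: share of rows where the field is present (not None)."""
    return sum(r[field] is not None for r in rows) / len(rows)


def duplicate_count(rows, key):
    """Uniqueness: number of rows whose key value was already seen."""
    seen, dupes = set(), 0
    for r in rows:
        if r[key] in seen:
            dupes += 1
        seen.add(r[key])
    return dupes


def invalid_emails(rows):
    """Validity: emails that fail the crude rule 'must contain @'."""
    return [r["email"] for r in rows if "@" not in r["email"]]


def staleness(last_loaded, now):
    """Freshness: age of the most recent load."""
    return now - last_loaded


# Fixed timestamps so the example is reproducible.
now = datetime(2024, 1, 1, 18, 0, tzinfo=timezone.utc)
last_loaded = now - timedelta(hours=6)

print(f"country completeness: {completeness(rows, 'country'):.0%}")
print(f"duplicate order_ids: {duplicate_count(rows, 'order_id')}")
print(f"invalid emails: {invalid_emails(rows)}")
print(f"feed age: {staleness(last_loaded, now)}")
```

Each function maps to one sentence you can say in a meeting: "country is 80% complete," "one duplicate order," "one invalid email," "the feed is six hours stale."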


Core vocabulary

Term | Meaning
Anomaly | A value outside the expected range
Drift | A gradual change in the data’s distribution over time
Null / missing | Absent values
Duplicate | A repeated record
Outlier | An extreme value
Schema | The structure/shape of the data
Reconciliation | Checking that two datasets match
Ground truth | The trusted reference source

“We spotted an anomaly — a 10x spike in signups — but reconciliation against the source shows it’s a duplicate issue, not real growth.”
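The reconciliation step in that example can be sketched as a key-by-key comparison between the source (treated as ground truth) and the warehouse. This is a minimal illustration, not a production approach; the signup IDs and the reconcile helper are hypothetical.

```python
from collections import Counter

# Hypothetical signup IDs; the source system is treated as ground truth.
source_ids = ["s1", "s2", "s3"]
warehouse_ids = ["s1", "s1", "s2", "s2", "s3", "s3"]


def reconcile(source, warehouse):
    """Compare a warehouse load against the source, key by key."""
    src, wh = Counter(source), Counter(warehouse)
    return {
        # Keys in the source that never reached the warehouse.
        "missing": sorted(k for k in src if k not in wh),
        # Keys in the warehouse that the source does not know about.
        "unexpected": sorted(k for k in wh if k not in src),
        # Keys that appear more often in the warehouse than in the source.
        "duplicated": sorted(k for k in wh if k in src and wh[k] > src[k]),
    }


report = reconcile(source_ids, warehouse_ids)
print(report)
```

Here the warehouse count is double the source count, yet nothing is missing or unexpected: every key is simply duplicated. That is the evidence behind saying "it’s a duplicate issue, not real growth."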


Describing how bad it is

Stakeholders need to know severity. Be specific and proportionate.

Vague | Precise
“The data is wrong.” | “About 3% of orders are missing a region; it skews the regional breakdown but not the total.”
“It’s broken.” | “The pipeline dropped one partition, so yesterday is incomplete.”
“Numbers look off.” | “Revenue is overstated by ~5% due to double-counted refunds.”

“To be clear on scope: this affects only the EU region table, roughly 3% of rows, and it doesn’t impact the global totals. Impact is limited to the regional drill-down.”

The phrases “scope,” “impact is limited to,” and “doesn’t impact” help stakeholders calibrate their worry.
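A scope statement like the one above usually rests on two simple numbers: the share of rows affected and whether the total changes. The sketch below computes both for made-up order data; the field names and amounts are assumptions chosen only to mirror the "3% of orders are missing a region" example.

```python
# Hypothetical orders: 97 with a region, 3 without.
orders = [{"order_id": i, "region": "EU", "amount": 50} for i in range(97)]
orders += [{"order_id": i, "region": None, "amount": 50} for i in range(97, 100)]


def scope_summary(orders):
    """Quantify how many rows a missing-region problem touches."""
    affected = [o for o in orders if o["region"] is None]
    return {
        "affected_rows": len(affected),
        "affected_share": len(affected) / len(orders),
        # The grand total does not depend on region, so it is unaffected.
        "total_amount": sum(o["amount"] for o in orders),
        # This amount cannot be placed in any regional breakdown.
        "unattributed_amount": sum(o["amount"] for o in affected),
    }


summary = scope_summary(orders)
print(
    f"{summary['affected_share']:.0%} of orders are missing a region; "
    f"the total ({summary['total_amount']}) is unaffected, but "
    f"{summary['unattributed_amount']} cannot be assigned to a region."
)
```

Reporting the affected share next to the unaffected total is what lets you say "impact is limited to the regional drill-down" with confidence.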


Talking about whether to trust the data

The real question in these meetings is often “can I use this number in my report?” Answer it directly.

  • “I’d hold off on quoting that figure until we reconcile.”
  • “The headline number is solid; it’s the breakdown I don’t trust yet.”
  • “Treat today’s data as provisional.”
  • “I’d caveat that chart — it’s based on a partial load.”
  • “This is safe to use; the anomaly was cosmetic.”

“The total is trustworthy. I’d caveat the regional split as provisional until tomorrow’s reload confirms it.”

“Provisional” (temporary, subject to change) and “caveat” (a warning/qualification) are essential data-quality vocabulary.


Flagging bad data diplomatically

Often the bad data comes from another team’s pipeline. Raise it without blame.

Blunt | Diplomatic
“Your pipeline is broken.” | “I’m seeing something odd in the upstream feed. Could we check it together?”
“You gave us wrong data.” | “There seems to be a mismatch between the source and what we’re receiving.”
“This is your fault.” | “Looks like the schema changed upstream and we didn’t catch it.”

“Heads up — we’re seeing a mismatch between the upstream feed and the warehouse. It might be a schema change on your side that slipped through. Could we trace it together?”

Upstream (earlier in the data flow) and downstream (later) are core directional vocabulary. Problems “originate upstream” and “propagate downstream.”


Before and after: a full rewrite

Before (alarming, vague, blamey):

“the dashboard is totally wrong and the numbers are crazy, someone broke the data and we can’t trust anything. probably the upstream team.”

After (precise, calm, scoped):

“Quick flag on data quality: the EU revenue figure looks anomalous — about 10x normal. Reconciliation against the source shows it’s a duplicate issue, not real. Scope is limited to the EU regional table; the global total is unaffected and safe to use. The likely cause is a schema change upstream that broke our dedup step — I’ll confirm with the source team. Until we reload, please treat the EU breakdown as provisional.”


Common mistakes

  1. Saying “the data are/is” inconsistently. “Data” can be singular or plural; pick one and be consistent. In tech, singular (“the data is stale”) is now common and accepted.
  2. Confusing “accuracy” and “completeness.” Missing rows = completeness; wrong values = accuracy. Different fixes.
  3. Using “anomaly” for any problem. An anomaly is specifically an unexpected value, not a pipeline failure.
  4. Mixing up “upstream” and “downstream.” Upstream = source/earlier; downstream = consumers/later. Reversing these confuses everyone.
  5. Saying “duplicated data” when you mean “duplicate records.” “Duplicates” (noun) is cleaner.

Mini-glossary

  • Data contract — agreed schema/quality between producer and consumer
  • SLA on freshness — a promise on how current data will be
  • Backfill — reprocessing historical data
  • Dedup (deduplication) — removing duplicates
  • Quarantine — isolating bad records instead of failing
  • Lineage — the path data took from source to dashboard
  • Sanity check — a quick plausibility test

“Let’s add a sanity check to the pipeline — if row counts drop more than 20% day-over-day, quarantine the load and alert, rather than silently publishing.”
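The sanity check described in that quote might look like the following sketch. The function names, the print-based alert, and the choice to quarantine when there is no baseline are all assumptions for illustration; only the 20% day-over-day threshold comes from the example.

```python
def check_row_count(today_rows, yesterday_rows, max_drop=0.20):
    """Decide whether to publish or quarantine a load from its row count.

    max_drop is the largest tolerated day-over-day decrease (0.20 = 20%).
    """
    if yesterday_rows <= 0:
        # Assumption: with no usable baseline, hold the load for review.
        return "quarantine"
    drop = (yesterday_rows - today_rows) / yesterday_rows
    return "quarantine" if drop > max_drop else "publish"


def handle_load(today_rows, yesterday_rows):
    """Apply the check and raise a (placeholder) alert on quarantine."""
    decision = check_row_count(today_rows, yesterday_rows)
    if decision == "quarantine":
        # Placeholder alert; a real pipeline would notify a channel or pager.
        print(
            f"ALERT: row count fell from {yesterday_rows} to {today_rows}; "
            "load quarantined"
        )
    return decision


handle_load(1000, 1000)  # same row count as yesterday
handle_load(500, 1000)   # half of yesterday's rows
```

The point of returning a decision rather than raising an error is that the pipeline keeps running while the suspicious load is held back, which matches "quarantine the load and alert, rather than silently publishing."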


Key takeaways

  • Name the exact dimension: completeness, accuracy, consistency, freshness, uniqueness, validity.
  • Quantify scope and impact: “3% of rows, doesn’t affect totals.”
  • Tell stakeholders whether data is safe to use, provisional, or to be caveated.
  • Flag upstream issues with “mismatch” and “let’s trace it together,” not blame.

Data quality conversations are about trust. Speak precisely about what’s wrong and how much it matters, and people will trust both the data and you.