Synthetic Data Vocabulary: Generation, Privacy, and Testing

Understand the English vocabulary used when discussing synthetic data generation, differential privacy, GAN-based approaches, and utility-privacy trade-offs in engineering teams.

Synthetic data is one of the fastest-growing areas in data engineering and ML — and it comes with a dense vocabulary borrowed from statistics, machine learning, and privacy engineering. If your team is building a data platform, working on ML training pipelines, or navigating GDPR constraints, you will encounter this terminology regularly.

What Is Synthetic Data?

Synthetic data is artificially generated data that mimics the statistical properties of real data without containing actual records about real individuals. It is used for model training when real data is scarce, for software testing when production data cannot be shared, and for privacy compliance when personal data must not leave a regulated environment.

There are several approaches to synthetic data generation:

Rule-based synthesis generates data according to explicit business rules, for example generating realistic UK postcodes, phone numbers, or National Insurance numbers. It is fast and, with a fixed random seed, reproducible, but it cannot capture complex statistical relationships between fields.
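A minimal sketch of rule-based synthesis in Python, using only the standard library. The function names and the simplified "two letters, digit, space, digit, two letters" postcode pattern are assumptions made for illustration; real UK postcode rules are stricter, and a generated value may or may not correspond to an allocated postcode.

```python
import random
import string


def fake_postcode(rng: random.Random) -> str:
    """Return a string shaped like a UK postcode, e.g. 'AB1 2CD'.

    Assumption: a simplified pattern, not the full official format.
    """
    letters = string.ascii_uppercase
    area = "".join(rng.choice(letters) for _ in range(2))
    district = rng.randint(1, 9)
    sector = rng.randint(0, 9)
    unit = "".join(rng.choice(letters) for _ in range(2))
    return f"{area}{district} {sector}{unit}"


def fake_customers(n: int, seed: int = 0) -> list[dict]:
    """Generate n rule-based customer rows (hypothetical schema)."""
    rng = random.Random(seed)  # a fixed seed makes the output reproducible
    return [
        {"customer_id": i, "postcode": fake_postcode(rng)}
        for i in range(1, n + 1)
    ]


rows = fake_customers(3, seed=42)
for row in rows:
    print(row)
```

Each field is drawn independently, which is exactly why this approach cannot reproduce relationships such as "customers in this region spend more".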

GAN-based synthesis uses a Generative Adversarial Network (GAN) — two neural networks (a generator and a discriminator) trained against each other. The generator learns to produce synthetic samples the discriminator cannot distinguish from real data. Engineers say: “We trained a tabular GAN on the customer transaction data — it captures the spending pattern correlations without exposing any real transactions.”

A Variational Autoencoder (VAE) is another deep learning approach: it learns a compressed representation (latent space) of the data, then generates synthetic records by sampling points from that latent space and decoding them.

Privacy Mechanisms

Differential privacy (DP) is a mathematical framework that bounds how much any single individual’s data can change the output of an analysis or algorithm, typically by adding calibrated random noise, so that an individual’s presence in the dataset cannot be reliably inferred from the output. The key parameters are:

  • Epsilon (ε): the privacy loss parameter; lower epsilon means stronger privacy guarantees but more noise (less utility)
  • Privacy budget: the total amount of epsilon “spent” across all queries or operations on a dataset; under basic composition, the epsilons of successive operations add up
  • Noise: the random perturbation added to achieve the differential privacy guarantee

Engineers debate: “At epsilon 1.0, the synthetic data utility is too degraded for the model to converge — can we justify epsilon 3.0 given the threat model?”
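To make the three terms concrete, here is a small sketch of the Laplace mechanism applied to a counting query, plus a budget tracker that uses basic sequential composition (epsilons simply add up). The names `dp_count` and `PrivacyBudget` are hypothetical, and this is an illustration of the vocabulary rather than a production-grade implementation: naive floating-point noise sampling like this is not considered safe for real deployments.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    # The difference of two i.i.d. exponential draws is Laplace(0, scale).
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def dp_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Laplace mechanism for a counting query."""
    sensitivity = 1.0  # adding or removing one person changes a count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)


class PrivacyBudget:
    """Hypothetical tracker: total epsilon spent across operations."""

    def __init__(self, total_epsilon: float) -> None:
        self.total = total_epsilon
        self.spent = 0.0

    def spend(self, epsilon: float) -> None:
        if self.spent + epsilon > self.total + 1e-12:
            raise ValueError("privacy budget exhausted")
        self.spent += epsilon


budget = PrivacyBudget(total_epsilon=3.0)
rng = random.Random(7)
for eps in (1.0, 2.0):
    budget.spend(eps)
    # Lower epsilon -> larger noise scale -> noisier answer.
    print(f"epsilon={eps}: noisy count = {dp_count(1000, eps, rng):.1f}")
```

The noise scale is sensitivity divided by epsilon, which is the arithmetic behind "lower epsilon means more noise".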

Data masking is a simpler technique that replaces sensitive values with realistic-looking substitutes (e.g., replacing a real name with a randomly generated name). Unlike synthetic data, masked data preserves the row-level structure of the original dataset. The distinction matters: “Masking is fine for QA environments, but for model training we need synthetic data that preserves cross-column correlations.”
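A short sketch of masking, assuming a hypothetical two-row table and a made-up pool of replacement names. It shows what "preserves the row-level structure" means in practice: the row count and every non-sensitive column are unchanged, and only the sensitive value is swapped.

```python
import random

# Hypothetical pool of substitute values; any realistic-looking names would do.
FAKE_NAMES = ["Alex Gray", "Sam Patel", "Jo Evans", "Chris Wood"]


def mask_names(rows: list[dict], seed: int = 0) -> list[dict]:
    """Replace the 'name' field in each row; leave all other fields as-is."""
    rng = random.Random(seed)
    return [{**row, "name": rng.choice(FAKE_NAMES)} for row in rows]


real = [
    {"name": "Real Person A", "city": "Leeds", "spend": 120.5},
    {"name": "Real Person B", "city": "Bristol", "spend": 80.0},
]
masked = mask_names(real)

for before, after in zip(real, masked):
    print(before["name"], "->", after["name"], "| city kept:", after["city"])
```

Because each masked row still corresponds to one original record, the unmasked columns remain real data.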

Utility, Fidelity, and Testing

The utility-privacy trade-off is the central tension in synthetic data: more privacy protection (more noise, stronger anonymisation) typically means less useful data. Fidelity measures how closely the synthetic data matches the real data’s statistical properties — distributions, correlations, and cardinality. Utility measures whether the synthetic data is good enough for its intended purpose — training a model, running an integration test, or demonstrating a product.

Evaluation metrics include statistical similarity measures (KL divergence, Jensen-Shannon distance, Wasserstein distance) and downstream task performance (train on synthetic, test on real — TSTR).
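As an illustration of the statistical similarity measures, the sketch below computes KL divergence and Jensen-Shannon distance for two discrete distributions, for instance the category frequencies of one column in the real and synthetic tables. The example frequencies are made up, and logarithms are taken in base 2 so that the Jensen-Shannon distance falls between 0 (identical) and 1 (no overlap).

```python
import math


def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_distance(p: list[float], q: list[float]) -> float:
    """Jensen-Shannon distance: square root of the JS divergence."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture of p and q
    jsd = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(jsd, 0.0))  # guard against tiny negative rounding


# Hypothetical category frequencies for one column.
real_dist = [0.5, 0.3, 0.2]
synth_dist = [0.45, 0.35, 0.2]

print("KL(real || synth):", kl_divergence(real_dist, synth_dist))
print("JS distance:", js_distance(real_dist, synth_dist))
```

Note that KL divergence is asymmetric, while Jensen-Shannon distance gives the same value whichever distribution comes first, which is one reason teams often report the latter.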

In synthetic test data management, teams maintain libraries of synthetic datasets that mirror production schemas, used in CI pipelines and integration tests. A key challenge is referential integrity — ensuring foreign key relationships in synthetic data remain valid. A synthetic orders table must reference synthetic customers, not orphaned IDs.
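A referential integrity check of this kind can be sketched in a few lines. The table layout and the `customer_id` key are assumptions for the example; a real pipeline would read them from the production schema.

```python
def orphaned_orders(orders: list[dict], customers: list[dict]) -> list[dict]:
    """Return orders whose customer_id has no matching customer row."""
    customer_ids = {c["customer_id"] for c in customers}
    return [o for o in orders if o["customer_id"] not in customer_ids]


# Hypothetical synthetic tables.
customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": 100, "customer_id": 1},
    {"order_id": 101, "customer_id": 2},
    {"order_id": 102, "customer_id": 9},  # orphan: no customer with id 9
]

orphans = orphaned_orders(orders, customers)
print("orphaned orders:", orphans)
```

Running a check like this in CI catches a generator that produced child rows before, or independently of, their parent rows.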

Data augmentation is the related practice of expanding a real dataset with synthetic additions — common in computer vision (rotating and flipping images) and NLP (paraphrasing sentences).
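A minimal sketch of the image case, assuming an image is represented as a nested list of pixel values (real projects would normally use an image or array library). Each transformed copy is added to the training set alongside the original.

```python
def flip_horizontal(image: list[list[int]]) -> list[list[int]]:
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def rotate_90_clockwise(image: list[list[int]]) -> list[list[int]]:
    """Rotate the image a quarter turn clockwise."""
    return [list(col) for col in zip(*image[::-1])]


# A tiny 2x2 "image" of hypothetical pixel values.
image = [
    [1, 2],
    [3, 4],
]

augmented = [image, flip_horizontal(image), rotate_90_clockwise(image)]
print(len(augmented), "training samples from 1 original")
```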

Next Steps

If your team uses any synthetic data tooling — Faker, Gretel, Mostly AI, SDV — spend 30 minutes reading their documentation specifically for the privacy-related vocabulary. Then write a one-paragraph English description of your team’s current approach to test data, using at least five terms from this article. Precision in this vocabulary signals maturity in data engineering practice.