Intermediate–Advanced 12 terms

Synthetic Data

Vocabulary for generating, evaluating, and safely deploying synthetic datasets in machine learning and privacy-preserving data pipelines.

  • Synthetic Data /sɪnˈθetɪk ˈdeɪtə/

    Artificially generated data that mimics the statistical properties and structure of real-world data without containing actual records — used to augment training sets, preserve privacy, and test systems safely.

    "We replaced the real patient records in our ML training pipeline with synthetic data generated to match the original distribution — reducing GDPR compliance risk while preserving model accuracy within 1.2% of the baseline."
  • GAN (Generative Adversarial Network) /dʒiː eɪ en/

    A deep learning architecture comprising two neural networks — a generator that creates synthetic samples and a discriminator that tries to distinguish them from real data — trained in competition until the generator produces convincingly realistic outputs.

    "We trained a GAN on 50,000 medical images to generate synthetic training examples for rare pathologies — the discriminator could no longer distinguish real from synthetic at 94% accuracy, indicating the GAN had learned the underlying data distribution well enough for augmentation."
  • VAE (Variational Autoencoder) /viː eɪ iː/

    A generative model that encodes input data into a probability distribution over a compressed latent space, then decodes samples drawn from that distribution to produce new synthetic examples. VAEs typically offer smoother interpolation and more controllable generation than GANs.

    "We used a VAE to generate synthetic transaction records by sampling from the learned latent space of normal purchase behaviour — interpolating between known patterns to create plausible but non-existent transactions for fraud detection model training."
  • Data Augmentation /ˈdeɪtə ˌɔːɡmenˈteɪʃən/

    Techniques that artificially expand a training dataset by applying label-preserving transformations to existing examples — such as image flipping, cropping, noise injection, or text paraphrasing — reducing overfitting without collecting new real data.

    "Our image classifier was overfitting with 2,000 training samples — after applying data augmentation (horizontal flip, ±15° rotation, brightness jitter) we effectively expanded the dataset to 14,000 effective samples and reduced validation error by 18%."
  • Differential Privacy /ˌdɪfəˈrenʃəl ˈpraɪvəsi/

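    A mathematical framework that adds calibrated statistical noise to query results or model outputs so that the output distribution changes only by a bounded amount (controlled by the privacy parameter epsilon) whether or not any single individual's record is included. This limits how confidently an observer can infer that individual's presence or absence from the released data.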

    "We applied differential privacy with epsilon 1.0 when releasing the aggregate query results from our health survey — the added noise was sufficient to prevent re-identification while keeping 95% confidence intervals narrow enough to be statistically meaningful."
  • Privacy Budget (epsilon) /ˈpraɪvəsi ˈbʌdʒɪt ˈepsɪlɒn/

    In differential privacy, epsilon (ε) is the privacy loss parameter: smaller epsilon means stronger privacy protection but more noise and reduced utility. The privacy budget is the cap on cumulative epsilon across all releases; under basic composition, the epsilons of successive queries add up. Once the budget is spent, no further information can be released without exceeding the privacy guarantee.

    "We allocate a privacy budget of epsilon 2.0 per quarter for our aggregate analytics queries — each query consumes a portion of that budget, and once exhausted, additional queries are blocked until the next quarter's budget resets."
  • Fidelity (synthetic data) /fɪˈdelɪti/

    The degree to which synthetic data accurately reproduces the statistical properties of the original real data — including univariate distributions, correlations, temporal patterns, and rare event frequencies.

    "We measured fidelity using the Jensen-Shannon divergence between real and synthetic column distributions — our GAN-generated dataset achieved a fidelity score of 0.97 on the critical fraud indicator columns, confirming the synthetic data preserved the rare fraud signal patterns."
  • Utility (synthetic data) /juːˈtɪlɪti/

    The practical usefulness of synthetic data for its intended purpose — typically measured by how closely a model trained on synthetic data performs compared to one trained on real data. High utility is achieved when fidelity is high and the synthetic data preserves predictive signal.

    "Utility testing showed our synthetic dataset achieved 98.3% of the AUC of the real-data model on the holdout test set — we consider this sufficient utility for model pre-training, with a small fine-tuning step on a limited real sample before production deployment."
  • Membership Inference Attack /ˈmembəʃɪp ˈɪnfərəns əˈtæk/

    A privacy attack that attempts to determine whether a specific individual's record was part of a model's training dataset — by querying the model and exploiting the higher confidence it typically assigns to training examples.

    "Before releasing the synthetic dataset externally, we ran a membership inference attack simulation — the attacker's advantage was only 2.1% above random chance, indicating the generative model had not memorised individual training records."
  • SMOTE /sməʊt/

    Synthetic Minority Over-sampling Technique — an algorithm that generates synthetic examples of the minority class in an imbalanced dataset by interpolating between existing minority-class samples in feature space rather than simply duplicating them.

    "Our churn model was trained on data where only 3% of customers churned — SMOTE generated synthetic churn examples by interpolating between real churned-customer feature vectors, balancing the training set to 20% minority class and improving recall from 41% to 68%."
  • Data Masking /ˈdeɪtə ˈmɑːskɪŋ/

    A data protection technique that replaces sensitive real values (names, card numbers, SSNs) with realistic but fictional substitutes — preserving data format and referential integrity so masked data remains usable for development, testing, and analytics without exposing PII.

    "Our CI pipeline uses data masking to create test database snapshots — real customer names are replaced with plausible fictional names, credit card numbers with Luhn-valid fake numbers, and emails with @example.com addresses, so developers never touch real PII in non-production environments."
  • TSTR (Train on Synthetic, Test on Real) /tiː es tiː ɑːr/

    An evaluation protocol for synthetic data quality: train a model exclusively on synthetic data, then evaluate it on held-out real data. The gap between TSTR performance and a model trained on real data (TRTR) quantifies the practical utility of the synthetic dataset.

    "TSTR evaluation showed our synthetic tabular data achieved 96.1% of the TRTR AUC on the real test set — a gap of less than 4 percentage points indicates the synthetic data is suitable as a drop-in replacement for model development in privacy-restricted contexts."
