Practice generative model vocabulary for synthetic data: GANs for tabular data, VAE latent space, conditional generation, Wasserstein GAN, and statistical similarity testing.
1 / 5
'The GAN generates realistic tabular data.' How does a GAN (Generative Adversarial Network) work?
A GAN consists of two competing neural networks: the generator, which creates synthetic data, and the discriminator, which tries to tell real from fake. Training is adversarial: the generator improves to fool the discriminator, while the discriminator improves to catch fakes. If training converges, the generator produces data that closely matches the real distribution. CTGAN is a popular GAN variant for tabular data; TVAE, often mentioned alongside it, is a VAE-based model rather than a GAN.
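The adversarial objective can be made concrete with a small sketch. It assumes the discriminator outputs a probability that a row is real and uses the common non-saturating generator loss. The score lists are hypothetical values chosen for illustration, not output from a trained model.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: real rows should score near 1, fake rows near 0."""
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake):
    """Non-saturating generator loss: small when fakes are scored near 1."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

# Hypothetical discriminator scores (probability that a row is real).
real_scores = [0.90, 0.95, 0.85]
early_fake_scores = [0.05, 0.10, 0.08]  # fakes are easy to spot
late_fake_scores = [0.45, 0.55, 0.50]   # fakes are close to indistinguishable

print(generator_loss(early_fake_scores), generator_loss(late_fake_scores))
print(discriminator_loss(real_scores, early_fake_scores),
      discriminator_loss(real_scores, late_fake_scores))
```

As the fakes get harder to spot, the generator loss falls and the discriminator loss rises, which is the tug-of-war the card describes.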
2 / 5
'The VAE's latent space captures the data distribution.' What is a Variational Autoencoder (VAE)?
A VAE is a generative model with two parts: an encoder that maps input data to a distribution in a lower-dimensional latent space, and a decoder that maps points sampled from that space back to data space, either to reconstruct the input or to generate new data. The latent space is continuous and structured, which makes it useful for generating diverse but realistic synthetic samples, including for tabular data.
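Two pieces of the VAE are easy to show in isolation. The sketch below assumes the usual diagonal-Gaussian encoder output (a mean and a log-variance per latent dimension). It shows the reparameterised sampling step and the KL term that pulls the latent distribution towards a standard normal. The function names and example values are illustrative only.

```python
import math
import random

def sample_latent(mu, log_var, rng):
    """Reparameterisation: z = mu + sigma * eps, with eps drawn from N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))

rng = random.Random(0)
mu, log_var = [0.5, -1.0], [0.0, -2.0]   # hypothetical encoder output for one row
z = sample_latent(mu, log_var, rng)      # this point would be passed to the decoder
print(z, kl_to_standard_normal(mu, log_var))
```

To generate new data after training, one would typically skip the encoder, draw z from N(0, 1) directly, and pass it to the decoder.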
3 / 5
What does 'conditional generation (generate samples for class X)' mean?
Conditional generation allows you to control what the generative model produces. Instead of sampling from the full data distribution, you condition on a label or attribute (e.g., class=fraud, age_group=18-25) and generate samples that have those specific characteristics. This is especially useful for augmenting minority classes or generating targeted test cases.
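One common way to condition a generator is to append an encoding of the requested label to the random noise vector. The sketch below shows only that input-building step. The label set, `noise_dim`, and the function names are hypothetical, and the generator network that would consume the vector is left out.

```python
import random

CLASSES = ["legit", "fraud"]  # hypothetical label set

def one_hot(label, classes):
    """Encode the requested class as a 0/1 vector."""
    if label not in classes:
        raise ValueError(f"unknown label: {label!r}")
    return [1.0 if c == label else 0.0 for c in classes]

def generator_input(label, noise_dim, rng):
    """Noise supplies variety; the appended condition says which class to produce."""
    noise = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    return noise + one_hot(label, CLASSES)

rng = random.Random(42)
# Request five minority-class samples: same condition, different noise each time.
batch = [generator_input("fraud", noise_dim=4, rng=rng) for _ in range(5)]
print(batch[0])
```

Every vector in the batch ends with the same condition but starts with different noise, so a trained generator would return five distinct rows of the requested class.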
4 / 5
'The Wasserstein GAN training is more stable.' What problem does Wasserstein GAN (WGAN) solve?
Standard GAN training is notoriously unstable: the discriminator can become too good, giving the generator useless gradients (the vanishing gradient problem), or the generator can collapse to producing only a few kinds of samples (mode collapse). WGAN replaces the discriminator with a critic whose loss approximates the Wasserstein (Earth Mover's) distance between the real and generated distributions. That distance gives a smoother, more informative gradient signal even when the distributions are far apart, which leads to more stable training.
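To see why the Wasserstein distance stays informative when distributions are far apart, it helps to compute it in the simplest case. For two one-dimensional samples of equal size, the Wasserstein-1 distance reduces to the mean absolute difference between the sorted values. This sketch illustrates the distance itself, not WGAN training: a WGAN estimates the distance with a constrained critic network rather than computing it in closed form.

```python
def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size 1-D samples."""
    if len(xs) != len(ys):
        raise ValueError("this shortcut assumes samples of equal size")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

real = [0.0, 1.0, 2.0]
near = [5.0, 6.0, 7.0]      # generated sample, far from the real data
far = [10.0, 11.0, 12.0]    # generated sample, even further away

print(wasserstein_1d(real, near))  # 5.0
print(wasserstein_1d(real, far))   # 10.0
```

Neither generated sample overlaps the real one, yet the distance still reports that `near` is closer than `far`, so a generator would still receive a useful signal about which direction to move.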
5 / 5
'The generated data passes the statistical similarity test.' What does this test verify?
Statistical similarity tests compare the synthetic data to real data across multiple dimensions: marginal distributions of each feature (e.g., Kolmogorov-Smirnov test), pairwise correlations, higher-order statistics, and sometimes downstream ML utility (train on synthetic, test on real). Passing these tests gives confidence, though not a guarantee, that models trained on synthetic data will generalise to real data.
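The marginal-distribution check can be sketched with a hand-rolled two-sample Kolmogorov-Smirnov statistic: the largest gap between the two empirical cumulative distribution functions for one feature. The column values below are made up for illustration. In practice a library routine such as `scipy.stats.ks_2samp` would normally be used, since it also returns a p-value.

```python
from bisect import bisect_right

def ks_statistic(real, synth):
    """Largest gap between the empirical CDFs of two samples (0 = identical, 1 = disjoint)."""
    a, b = sorted(real), sorted(synth)
    # The gap can only change at observed values, so checking those is enough.
    return max(abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
               for x in a + b)

real_age = [23, 35, 41, 52, 67]    # hypothetical real column
good_synth = [24, 33, 43, 50, 66]  # similar distribution
bad_synth = [80, 82, 85, 90, 95]   # clearly shifted

print(ks_statistic(real_age, good_synth))
print(ks_statistic(real_age, bad_synth))
```

A full similarity report would repeat this per feature and add the correlation and downstream-utility checks listed on the card.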