Explain the inability to separate Transformer-generated synthetic data from real data in bottleneck embeddings

Determine why the bottleneck representations of both the multilayer perceptron (MLP)-based and convolutional neural network (CNN)-based diffusion models, when visualized with t-SNE, fail to separate Transformer-generated synthetic human genotype or haplotype embeddings from real training and test embeddings, even though quantitative evaluations (Nearest Neighbour Adversarial Accuracy and classifier recovery rates) indicate lower quality for the Transformer-generated samples. Identify the factors behind this discrepancy between embedding-based separability and the established quality metrics.

Background

The paper evaluates four diffusion-based generators (Unet MLP, Unet CNN, Unet MLP+CNN, and Transformer) for producing synthetic whole-genome genotypes/haplotypes and assesses them using recovery rates for downstream classifiers and Nearest Neighbour Adversarial Accuracy (NNAA) for realism and privacy.
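For readers unfamiliar with NNAA, the following is a minimal sketch of the metric as it is commonly defined in the synthetic-data literature, not necessarily the exact variant used in the paper. It assumes Euclidean distance and dense NumPy arrays; the function names `nearest_distances` and `nnaa` are illustrative. For each real sample it asks whether the nearest synthetic sample is farther away than the nearest other real sample, does the same from the synthetic side, and averages the two rates.

```python
import numpy as np


def nearest_distances(a, b, exclude_self=False):
    """Distance from each row of `a` to its nearest row in `b` (Euclidean)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    if exclude_self:
        # When a and b are the same set, ignore each point's zero distance to itself.
        np.fill_diagonal(d, np.inf)
    return d.min(axis=1)


def nnaa(real, synth):
    """Nearest Neighbour Adversarial Accuracy (common definition, assumed here).

    Around 0.5: real and synthetic samples are hard to tell apart.
    Near 1.0:   synthetic samples are easy to tell apart from real ones.
    Near 0.0:   synthetic samples sit on top of real ones (possible copying).
    """
    d_rs = nearest_distances(real, synth)
    d_sr = nearest_distances(synth, real)
    d_rr = nearest_distances(real, real, exclude_self=True)
    d_ss = nearest_distances(synth, synth, exclude_self=True)
    return 0.5 * ((d_rs > d_rr).mean() + (d_sr > d_ss).mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = rng.normal(size=(20, 5))      # toy stand-in for real genotype vectors
    print(nnaa(real, real.copy()))       # exact copies of the real data
    print(nnaa(real, real + 100.0))      # synthetic data far from the real data
```

Because the score depends on nearest-neighbour distances in whatever space it is computed in, it can disagree with a visual impression taken from a two-dimensional projection of a different representation.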

To further probe model behavior, the authors visualize bottleneck embeddings of different data sources using t-SNE. They observe that both MLP- and CNN-based diffusion models successfully capture structure distinguishing several synthetic sources from real data but unexpectedly struggle to separate Transformer-generated synthetic data from real training and test data in these embeddings.

This observation appears inconsistent with the quantitative assessments (NNAA and recovery rates), which suggest that Transformer-generated data is of lower quality than data from the other generators. The authors explicitly state that the cause of this inconsistency is unclear, which leaves open how embedding-space separability relates to quantitative data quality metrics.

References

However, it is unclear why both models have trouble separating the Transformer based data from the train and test data. This would generally indicate that the Transformer based data is of high quality, which other metrics (NNAA, Recovery Rates) in this paper disagree with.

Generating Synthetic Genotypes using Diffusion Models (2412.03278 - Kenneweg et al., 4 Dec 2024) in Appendix, Analyzing the Reconstruction Error of Different Models (following Figure “TSNE of the MLP model embeddings” and “TSNE of the CNN model embeddings”)