Explain the inability to separate Transformer-generated synthetic data from real data in bottleneck embeddings
Determine why the bottleneck representations of both the multilayer perceptron (MLP)–based and convolutional neural network (CNN)–based diffusion models, when visualized via t-SNE, fail to separate Transformer-generated synthetic human genotype or haplotype embeddings from the embeddings of real training and test data. Overlap of this kind would normally suggest that the synthetic samples are of high quality, yet quantitative evaluations such as Nearest Neighbour Adversarial Accuracy (NNAA) and classifier recovery rates indicate lower data quality for the Transformer-generated samples. Ascertain the underlying factors causing this discrepancy between embedding-based separability and established quality metrics.
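For concreteness, the following is a minimal sketch of how NNAA is commonly computed, assuming the standard nearest-neighbour formulation in which a score near 0.5 means synthetic and real samples are indistinguishable, values near 1 mean the two sets are easy to tell apart, and values near 0 mean the synthetic samples sit on top of the real ones. The function names `min_dist` and `nnaa`, the use of Euclidean distance, and the toy arrays are illustrative assumptions, not the paper's implementation; genotype data may call for a different distance.

```python
import numpy as np


def min_dist(a, b, exclude_self=False):
    """Distance from each row of `a` to its nearest row in `b` (Euclidean, assumed)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    if exclude_self:
        # When a and b are the same set, ignore each point's zero distance to itself.
        np.fill_diagonal(d, np.inf)
    return d.min(axis=1)


def nnaa(real, synth):
    """Nearest Neighbour Adversarial Accuracy between a real and a synthetic set."""
    d_rs = min_dist(real, synth)                     # real  -> nearest synthetic
    d_rr = min_dist(real, real, exclude_self=True)   # real  -> nearest other real
    d_sr = min_dist(synth, real)                     # synth -> nearest real
    d_ss = min_dist(synth, synth, exclude_self=True) # synth -> nearest other synth
    # Fraction of points whose nearest neighbour lies in their own set,
    # averaged over the two directions.
    return 0.5 * (np.mean(d_rs > d_rr) + np.mean(d_sr > d_ss))


# Toy data (hypothetical): three points standing in for real samples.
real = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
far_synth = real + 100.0   # synthetic set far from the real set
copy_synth = real.copy()   # synthetic set that duplicates the real set

score_far = nnaa(real, far_synth)
score_copy = nnaa(real, copy_synth)
```

Because NNAA is computed from nearest-neighbour distances in whichever space the samples are given, applying the same function to raw genotypes and to bottleneck embeddings is one way to check whether the discrepancy arises in the embedding step or only in the t-SNE projection.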
References
However, it is unclear why both models have trouble separating the Transformer-based data from the train and test data. This would generally indicate that the Transformer-based data is of high quality, which other metrics (NNAA, Recovery Rates) in this paper disagree with.