
Comprehensive Exploration of Synthetic Data Generation: A Survey (2401.02524v2)

Published 4 Jan 2024 in cs.LG, cs.AI, and cs.CV
Abstract: Recent years have witnessed a surge in the popularity of Machine Learning (ML), applied across diverse domains. However, progress is impeded by the scarcity of training data due to expensive acquisition and privacy legislation. Synthetic data emerges as a solution, but the abundance of released models and limited overview literature pose challenges for decision-making. This work surveys 417 Synthetic Data Generation (SDG) models over the last decade, providing a comprehensive overview of model types, functionality, and improvements. Common attributes are identified, leading to a classification and trend analysis. The findings reveal increased model performance and complexity, with neural network-based approaches prevailing, except for privacy-preserving data generation. Computer vision dominates, with GANs as primary generative models, while diffusion models, transformers, and RNNs compete. Implications from our performance evaluation highlight the scarcity of common metrics and datasets, making comparisons challenging. Additionally, the neglect of training and computational costs in literature necessitates attention in future research. This work serves as a guide for SDG model selection and identifies crucial areas for future exploration.

Overview of Synthetic Data Generation: A Survey

The paper provides an extensive survey of Synthetic Data Generation (SDG) techniques, addressing a significant need in ML and deep learning (DL) fields where the scarcity of high-quality training data often hampers progress. It rigorously examines 417 SDG models developed over the last decade, highlighting model types, capabilities, and developments.

SDG Landscape

The survey identifies key trends in SDG, noting a clear evolution from simpler probabilistic models to more sophisticated neural network-based approaches. Among these, Generative Adversarial Networks (GANs) dominate, especially in computer vision, with diffusion models and transformers also gaining traction.

Key Findings

  1. Model Complexity and Performance: There is an evident increase in complexity and capability, especially among neural network models, which surpass traditional probabilistic methods except in privacy-centric applications.
  2. Dominance of Neural Networks: GANs have become the primary choice for image generation, while transformers and Recurrent Neural Networks (RNNs) are commonly applied to sequential data such as text and music.
  3. Challenges in Evaluation: The field lacks standardized performance metrics and common datasets, complicating direct model comparisons; benchmarks and evaluation criteria vary widely across studies (a hedged sketch of one commonly used metric appears after this list).

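To make the evaluation problem concrete, the sketch below computes the Fréchet Inception Distance (FID), one of the few metrics that recurs across image-generation papers. This is an illustrative implementation only: in practice the feature vectors come from an Inception network, whereas here random arrays stand in for them, and none of the numbers relate to the survey's results.

```python
# Hedged sketch of the Fréchet Inception Distance (FID): compares Gaussian fits
# of real vs. generated feature statistics. Random vectors stand in for Inception
# features here, purely for illustration.
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean = sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):  # numerical noise can yield tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))

rng = np.random.default_rng(0)
print(fid(rng.normal(size=(500, 64)), rng.normal(0.1, 1.0, size=(500, 64))))
```
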
Technical Insights

  • GANs: Widely adopted across SDG, GANs have proven effective at generating realistic image data; their adversarial training lets them learn complex, high-dimensional data distributions that older models struggle to capture (see the training sketch after this list).
  • Diffusion Models: These models generate data through an iterative denoising process, refining samples over many steps and potentially matching or exceeding the sample quality offered by GANs (a toy forward-noising example follows below).
  • Transformers in SDG: Although best known for advances in NLP, transformers are increasingly adapted for SDG thanks to their ability to model complex dependencies in both spatial and sequential data.
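
As a concrete illustration of the adversarial setup behind the first bullet, the following is a minimal PyTorch sketch of a GAN training loop on toy 2-D data. The layer sizes, learning rates, and synthetic "real" distribution are arbitrary assumptions chosen for illustration; this is not code from the surveyed paper.

```python
# Minimal, illustrative GAN training loop (PyTorch); generator and discriminator
# are tiny MLPs and the "real" data is a shifted Gaussian stand-in.
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 2

# Generator: maps latent noise z to a synthetic sample.
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
# Discriminator: outputs a "realness" logit for a sample.
D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    real = torch.randn(128, data_dim) * 0.5 + 2.0  # stand-in "real" distribution
    fake = G(torch.randn(128, latent_dim))

    # Discriminator update: push real samples toward 1, generated samples toward 0.
    d_loss = bce(D(real), torch.ones(128, 1)) + bce(D(fake.detach()), torch.zeros(128, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator update: try to make the discriminator label fakes as real.
    g_loss = bce(D(fake), torch.ones(128, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```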

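For the diffusion bullet above, the short sketch below illustrates the standard DDPM-style forward noising step that the learned reverse (denoising) process inverts; the linear noise schedule and toy data are assumptions for illustration, not details taken from the survey.

```python
# Toy forward (noising) process of a diffusion model, using the closed form
# q(x_t | x_0) = N(sqrt(abar_t) * x_0, (1 - abar_t) * I).
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)       # linear noise schedule (assumed)
alphas_bar = np.cumprod(1.0 - betas)     # cumulative signal-retention factors

def noise(x0: np.ndarray, t: int, rng=np.random.default_rng(0)) -> np.ndarray:
    """Sample x_t directly from x_0 via the closed-form forward process."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

x0 = np.ones(4)            # stand-in clean sample
print(noise(x0, 10))       # mostly signal at small t
print(noise(x0, 999))      # nearly pure noise at large t
# A network is trained to predict eps from (x_t, t); generation then runs the
# process in reverse, iteratively denoising from pure noise.
```
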
Implications for Future Research

The survey indicates several directions for future research, emphasizing the need for:

  • Standardization: Developing universal evaluation metrics and datasets to enable fair comparison across models.
  • Efficiency: Addressing computational and training costs, which are often neglected in the literature even as model sizes keep growing.
  • Privacy: Developing methodologies with privacy in mind, since data generation in sensitive domains such as healthcare must adhere to stringent privacy standards.

Practical Applications

The implications of this research are vast, spanning multiple domains:

  • Healthcare: Synthetic data generation enables innovations like privacy-preserving health record sharing.
  • Autonomous Systems: Creating robust training data for autonomous vehicle systems.
  • Commerce and Finance: Facilitating secure and diverse data generation for financial models and simulations.

Conclusion

This comprehensive survey serves as an essential resource for both novice and seasoned researchers in SDG. It not only summarizes developments over the past decade but also guides model selection and illuminates paths for future exploration in AI and ML.

Authors (8)
  1. André Bauer (11 papers)
  2. Simon Trapp (1 paper)
  3. Michael Stenger (4 papers)
  4. Robert Leppich (5 papers)
  5. Samuel Kounev (21 papers)
  6. Mark Leznik (2 papers)
  7. Kyle Chard (87 papers)
  8. Ian Foster (138 papers)
Citations (17)