
On the Stability of Iterative Retraining of Generative Models on their own Data (2310.00429v5)

Published 30 Sep 2023 in cs.LG and stat.ML

Abstract: Deep generative models have made tremendous progress in modeling complex data, often exhibiting generation quality that surpasses a typical human's ability to discern the authenticity of samples. Undeniably, a key driver of this success is enabled by the massive amounts of web-scale data consumed by these models. Due to these models' striking performance and ease of availability, the web will inevitably be increasingly populated with synthetic content. Such a fact directly implies that future iterations of generative models will be trained on both clean and artificially generated data from past models. In this paper, we develop a framework to rigorously study the impact of training generative models on mixed datasets -- from classical training on real data to self-consuming generative models trained on purely synthetic data. We first prove the stability of iterative training under the condition that the initial generative models approximate the data distribution well enough and the proportion of clean training data (w.r.t. synthetic data) is large enough. We empirically validate our theory on both synthetic and natural images by iteratively training normalizing flows and state-of-the-art diffusion models on CIFAR10 and FFHQ.

Essay: On the Stability of Iterative Retraining of Generative Models on Their Own Data

The paper "On the Stability of Iterative Retraining of Generative Models on their Own Data" explores a critical issue that arises with the escalating proliferation of synthetic data generated by sophisticated deep generative models. As these models continue to fill the web with synthetic content, they inevitably face training datasets that contain both real and synthetic data. This paper constructs a theoretical and empirical framework to examine the implications of such mixed datasets on the performance and stability of generative models.

Overview

The paper begins by noting the substantial progress achieved by deep generative models in producing high-quality data that convincingly simulates real data distributions. Crucial to these advancements are the massive datasets sourced from the internet, which will increasingly include data generated by previous iterations of such models. This feedback loop raises the question: how does the retraining of generative models on datasets augmented with synthetic data affect model performance?

To address this question, the authors propose a structured approach examining the iterative retraining process. They analyze the characteristics of generative models retrained in various conditions—ranging from datasets composed solely of real data to those with purely synthetic data—and develop a theoretical framework to demonstrate model stability in such contexts.
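The iterative retraining setup can be sketched on a toy model. The snippet below is an illustrative construction, not the paper's code: the "generative model" is a 1-D Gaussian fit by maximum likelihood, and at each generation it is refit on a mixture of a fraction `lam` of real data and `1 - lam` of samples from the previous generation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a generative model: a 1-D Gaussian fit by
# maximum likelihood (sample mean and standard deviation).
# The setup and names are illustrative assumptions, not the paper's.
real_data = rng.normal(loc=0.0, scale=1.0, size=10_000)

def fit(data):
    """Maximum-likelihood fit of a 1-D Gaussian."""
    return data.mean(), data.std()

def sample(model, n):
    mu, sigma = model
    return rng.normal(mu, sigma, size=n)

def iterative_retrain(lam, generations=20, n=10_000):
    """lam = fraction of real data mixed into every generation's dataset."""
    model = fit(real_data)
    for _ in range(generations):
        synthetic = sample(model, n)
        n_real = int(lam * n)
        mixed = np.concatenate([
            rng.choice(real_data, size=n_real, replace=False),
            synthetic[: n - n_real],
        ])
        model = fit(mixed)
    return model

# With lam = 0.5 the fit stays anchored near the real (0, 1);
# with lam = 0.0 the loop is purely self-consuming and the
# fitted statistics drift over generations.
print(iterative_retrain(lam=0.5))
print(iterative_retrain(lam=0.0))
```

Varying `lam` in this sketch mirrors the paper's spectrum from classical training on real data (`lam = 1`) to fully self-consuming training (`lam = 0`).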

Theoretical Contributions

Central to the paper's contributions is the development of a model stability theorem. The paper proves that iterative retraining is stable if two conditions are met: the initial model must closely approximate the real data distribution, and the proportion of real data in subsequent training datasets must be sufficiently high. This theoretical finding is supported by constructing a stability framework based on maximum likelihood estimation objectives.

The researchers leverage mathematical techniques to delve into the behavior of these models under iterative retraining. Additionally, they prove the existence of fixed points that prevent the models from diverging or collapsing into suboptimal parameter configurations. The analysis accommodates various model architectures like VAEs, normalizing flows, and diffusion models, ensuring a comprehensive examination across different generative paradigms.
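The flavor of the fixed-point argument can be seen in an infinite-sample caricature (again an illustrative construction, not the paper's proof): if each retraining step fits the mean of a mixture of a fraction `lam` of real data (mean 0) and `1 - lam` of synthetic data (mean `mu`), the update is a contraction for any `lam > 0`, so the iterates converge to the real mean rather than drifting.

```python
def mean_update(mu, lam):
    # Infinite-sample limit of refitting a Gaussian mean on a mixture of
    # lam real data (mean 0) and (1 - lam) synthetic data (mean mu):
    # mu_next = lam * 0 + (1 - lam) * mu, a contraction whenever lam > 0.
    return (1.0 - lam) * mu

mu = 5.0
for _ in range(50):
    mu = mean_update(mu, lam=0.3)
print(mu)  # contracts geometrically toward the fixed point 0
```

At `lam = 0` the map is the identity, so nothing anchors the iterates and estimation noise accumulates across generations, which is the self-consuming failure mode the paper guards against.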

Empirical Validation

The paper extends its theoretical insights by validating them empirically on synthetic and natural image datasets, including CIFAR10 and FFHQ. The experiments reveal how iteratively retrained diffusion models and normalizing flows behave when exposed to a blend of real and synthetic data.

These experiments substantiate the theoretical predictions, demonstrating that iterative retraining stabilizes when the proportion of real data surpasses a certain threshold. This balance ensures that models do not collapse under self-referential training, supporting the practical relevance of these findings for large-scale models trained on web data increasingly contaminated with synthetic content.

Implications and Future Directions

From a practical standpoint, the results suggest strategies for practitioners whose datasets include synthetic content: maintaining a sufficiently large proportion of real data is essential. The paper implies that the generative modeling community must be cautious about quality degradation accumulating across iterations of training on synthetic data.

Theoretically, the work lays the groundwork for further exploration of training dynamics in generative models. Future research could extend to exploring the boundedness conditions more deeply and addressing the complexities associated with real-world datasets that naturally blend diverse data types. Understanding the ethical and quality implications of synthetic data in AI systems remains an important consideration.

In conclusion, this paper provides a rigorous examination of a fundamental issue for deep generative models, setting a key direction for future explorations in iterative model refinement within mixed data contexts. The blend of theoretical guarantees and empirical insights offers a robust framework for ensuring the sustained efficacy of generative models in increasingly synthetic data-rich environments.

Authors (5)
  1. Quentin Bertrand
  2. Avishek Joey Bose
  3. Alexandre Duplessis
  4. Marco Jiralerspong
  5. Gauthier Gidel