Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data (2404.01413v2)

Published 1 Apr 2024 in cs.LG, cs.AI, cs.CL, cs.ET, and stat.ML

Abstract: The proliferation of generative models, combined with pretraining on web-scale data, raises a timely question: what happens when these models are trained on their own generated outputs? Recent investigations into model-data feedback loops proposed that such loops would lead to a phenomenon termed model collapse, under which performance progressively degrades with each model-data feedback iteration until fitted models become useless. However, those studies largely assumed that new data replace old data over time, where an arguably more realistic assumption is that data accumulate over time. In this paper, we ask: what effect does accumulating data have on model collapse? We empirically study this question by pretraining sequences of LLMs on text corpora. We confirm that replacing the original real data by each generation's synthetic data does indeed tend towards model collapse, then demonstrate that accumulating the successive generations of synthetic data alongside the original real data avoids model collapse; these results hold across a range of model sizes, architectures, and hyperparameters. We obtain similar results for deep generative models on other types of real data: diffusion models for molecule conformation generation and variational autoencoders for image generation. To understand why accumulating data can avoid model collapse, we use an analytically tractable framework introduced by prior work in which a sequence of linear models are fit to the previous models' outputs. Previous work used this framework to show that if data are replaced, the test error increases with the number of model-fitting iterations; we extend this argument to prove that if data instead accumulate, the test error has a finite upper bound independent of the number of iterations, meaning model collapse no longer occurs.

Accumulating Data: A Strategy to Prevent Model Collapse in Generative Models

Introduction

The advancement of generative models has introduced a paradigm where models are often trained on a composite of real and synthetic data. This practice raises the question of whether model collapse occurs when models are trained iteratively on their own outputs. Model collapse, characterized by the progressive degradation of model performance, poses a significant challenge to the sustainability of model training practices. Recent literature predominantly considers scenarios where new data replace previous iterations' data, neglecting the more realistic setting in which data accumulate over time. In this paper, we examine the effects of data accumulation on model collapse, presenting theoretical proofs and empirical findings across different model types and data modalities. Our results establish that, unlike the replacement strategy, accumulating data significantly mitigates the risk of model collapse.

Theoretical Foundations: Linear Regression Models

Our exploration begins with a theoretically tractable scenario involving a sequence of linear regression models, each fit to the outputs of its predecessor. Previous studies indicated that model collapse is inevitable when new data replace the old, with test error growing linearly in the number of iterations. Contrary to these findings, our theoretical analysis demonstrates that allowing data to accumulate, so that each iteration's data contribute to a growing dataset, yields a finite upper bound on the test error irrespective of the iteration count. Specifically, for isotropic features, we show that the test error is bounded above by a quantity independent of the number of iterations. This finding suggests that data accumulation can serve as a potent mechanism to curb model collapse.
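
As a concrete illustration of this setup, the sketch below simulates the accumulate-versus-replace comparison in the linear-regression framework. It is not the paper's code, and the dimensions, sample sizes, and noise level are arbitrary choices; it only shows the qualitative behavior the analysis predicts: test error that keeps growing under replacement but plateaus under accumulation.

```python
# Illustrative simulation (not the paper's code) of the linear-regression framework:
# compare "replace" vs. "accumulate" across model-fitting iterations. The parameter
# values below (d, T, sigma, n_iters) are arbitrary choices for this sketch.
import numpy as np

rng = np.random.default_rng(0)
d, T, sigma, n_iters = 20, 200, 1.0, 30   # feature dim, samples per generation, noise scale, iterations
w_star = rng.normal(size=d)               # ground-truth weights

def fit(X, y):
    # Ordinary least squares via the pseudoinverse.
    return np.linalg.pinv(X) @ y

def test_error(w_hat):
    # Excess risk for isotropic features: squared distance to the true weights.
    return float(np.sum((w_hat - w_star) ** 2))

# Generation 0 is trained on real data.
X0 = rng.normal(size=(T, d))
y0 = X0 @ w_star + sigma * rng.normal(size=T)

for strategy in ("replace", "accumulate"):
    X_all, y_all = X0.copy(), y0.copy()
    w_hat = fit(X_all, y_all)
    errors = [test_error(w_hat)]
    for _ in range(n_iters):
        # Fresh covariates; labels come from the previous fitted model plus noise.
        X_new = rng.normal(size=(T, d))
        y_new = X_new @ w_hat + sigma * rng.normal(size=T)
        if strategy == "replace":
            X_all, y_all = X_new, y_new                     # discard all earlier data
        else:
            X_all = np.vstack([X_all, X_new])               # keep real + all synthetic data
            y_all = np.concatenate([y_all, y_new])
        w_hat = fit(X_all, y_all)
        errors.append(test_error(w_hat))
    print(f"{strategy:>10}: gen 0 error {errors[0]:.3f}, final error {errors[-1]:.3f}")
```

With settings like these, the replace run's error typically grows many-fold over the iterations, while the accumulate run stays close to its generation-0 value, mirroring the bounded-versus-linear contrast described above.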

Empirical Validation Across Model Types

To validate our theoretical insights, we conducted extensive experiments across various generative models and data types (a generic sketch of the shared retraining loop follows this list):

  • LLMs: We trained successive generations of transformer-based LLMs on text data, observing that replacing data across iterations precipitated model collapse, as evidenced by steadily increasing test cross-entropy. Conversely, accumulating data not only halted this degradation but in some instances improved model performance.
  • Diffusion Models on Molecular Data: In molecular conformation generation, we observed the same pattern with diffusion models: data accumulation consistently outperformed the replacement strategy in maintaining model quality.
  • Image Generation with Variational Autoencoders (VAEs): In image generation, training VAEs with accumulated data substantially slowed the growth of test error, underscoring the broad applicability of our findings.
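
All three settings share the same outer loop: train a model, sample a synthetic dataset from it, and retrain, either on that synthetic dataset alone (replace) or on everything generated so far together with the original real data (accumulate). The sketch below captures this loop generically; train_model and generate_synthetic are hypothetical placeholders rather than functions from the paper's codebase, and the per-generation dataset size is an assumption of the sketch.

```python
# Schematic retraining loop (a sketch, not the authors' training code).
# `train_model` and `generate_synthetic` are hypothetical placeholders for the
# training and sampling routines of a given model family (LLM, diffusion model, VAE).
from typing import Callable, List, Sequence

def iterative_retraining(
    real_data: Sequence,
    train_model: Callable[[Sequence], object],
    generate_synthetic: Callable[[object, int], Sequence],
    n_generations: int,
    accumulate: bool = True,
) -> List[object]:
    """Retrain a model for several generations on its own outputs.

    accumulate=True  -> each generation trains on the real data plus all prior synthetic data.
    accumulate=False -> each generation trains only on the newest synthetic data ("replace").
    """
    dataset = list(real_data)
    model = train_model(dataset)
    models = [model]
    for _ in range(n_generations):
        synthetic = list(generate_synthetic(model, len(real_data)))
        dataset = dataset + synthetic if accumulate else synthetic
        model = train_model(dataset)
        models.append(model)
    return models
```

Note that under accumulation the training set grows with the number of generations; this growing data volume is the cost traded for keeping test error bounded.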

Implications and Future Directions

The implications of this research are twofold. Practically, our findings endorse the accumulation of data as a strategy to ensure the longevity and reliability of generative models trained on web-scale data, mitigating the risks of model collapse. Theoretically, this work extends our understanding of model-data feedback loops, challenging prior assumptions and spotlighting the resilience imparted by data accumulation strategies.

Looking ahead, this research opens avenues for further exploration into optimized data accumulation strategies, the dynamics of model bias in accumulated datasets, and the extension of these principles to other model architectures and training paradigms. As the boundary between real and synthetic data continues to blur, ensuring the robustness of generative models becomes paramount, with data accumulation emerging as a key strategy in this endeavor.

Authors (14)
  1. Matthias Gerstgrasser (11 papers)
  2. Rylan Schaeffer (33 papers)
  3. Apratim Dey (8 papers)
  4. Rafael Rafailov (37 papers)
  5. Henry Sleight (10 papers)
  6. John Hughes (32 papers)
  7. Tomasz Korbak (24 papers)
  8. Rajashree Agrawal (6 papers)
  9. Dhruv Pai (6 papers)
  10. Andrey Gromov (49 papers)
  11. Daniel A. Roberts (22 papers)
  12. Diyi Yang (151 papers)
  13. David L. Donoho (25 papers)
  14. Sanmi Koyejo (110 papers)