Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data (2404.01413v2)

Published 1 Apr 2024 in cs.LG, cs.AI, cs.CL, cs.ET, and stat.ML

Abstract: The proliferation of generative models, combined with pretraining on web-scale data, raises a timely question: what happens when these models are trained on their own generated outputs? Recent investigations into model-data feedback loops proposed that such loops would lead to a phenomenon termed model collapse, under which performance progressively degrades with each model-data feedback iteration until fitted models become useless. However, those studies largely assumed that new data replace old data over time, where an arguably more realistic assumption is that data accumulate over time. In this paper, we ask: what effect does accumulating data have on model collapse? We empirically study this question by pretraining sequences of LLMs on text corpora. We confirm that replacing the original real data by each generation's synthetic data does indeed tend towards model collapse, then demonstrate that accumulating the successive generations of synthetic data alongside the original real data avoids model collapse; these results hold across a range of model sizes, architectures, and hyperparameters. We obtain similar results for deep generative models on other types of real data: diffusion models for molecule conformation generation and variational autoencoders for image generation. To understand why accumulating data can avoid model collapse, we use an analytically tractable framework introduced by prior work in which a sequence of linear models are fit to the previous models' outputs. Previous work used this framework to show that if data are replaced, the test error increases with the number of model-fitting iterations; we extend this argument to prove that if data instead accumulate, the test error has a finite upper bound independent of the number of iterations, meaning model collapse no longer occurs.

Accumulating Data: A Strategy to Prevent Model Collapse in Generative Models

Introduction

The advancement of generative models has introduced a paradigm where models are often trained on a composite of real and synthetic data. This practice raises the question of whether model collapse occurs when models are trained iteratively on their own outputs. Model collapse, characterized by the progressive degradation of model performance, poses a significant challenge to the sustainability of model training practices. Recent literature predominantly considers scenarios where new data replace previous iterations' data, neglecting the more realistic setting in which data accumulate over time. In this paper, we examine the effects of data accumulation on model collapse, presenting theoretical proofs and empirical findings across different model types and data modalities. Our results establish that, unlike the replacement strategy, accumulating data significantly mitigates the risk of model collapse.

Theoretical Foundations: Linear Regression Models

Our exploration begins with a theoretically tractable scenario involving a sequence of linear regression models, each fit to the outputs of its predecessor. Previous studies indicated that model collapse is inevitable when new data replace the old, with test error growing linearly in the number of iterations. Contrary to these findings, our theoretical analysis demonstrates that allowing data to accumulate, so that each iteration's data contribute to a growing dataset, yields a finite upper bound on the test error irrespective of the iteration count. Specifically, for isotropic features, we show that the test error is bounded above by a quantity independent of the number of iterations. This finding suggests that data accumulation can serve as a potent mechanism to curb model collapse.
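
As a concrete illustration of this setup, the sketch below simulates the accumulate-versus-replace comparison in the linear-regression framework. It is not the paper's code, and the dimensions, sample sizes, and noise level are arbitrary choices; it only shows the qualitative behavior the analysis predicts: test error that keeps growing under replacement but plateaus under accumulation.

```python
# Illustrative simulation (not the paper's code) of the linear-regression framework:
# compare "replace" vs. "accumulate" across model-fitting iterations. The parameter
# values below (d, T, sigma, n_iters) are arbitrary choices for this sketch.
import numpy as np

rng = np.random.default_rng(0)
d, T, sigma, n_iters = 20, 200, 1.0, 30   # feature dim, samples per generation, noise scale, iterations
w_star = rng.normal(size=d)               # ground-truth weights

def fit(X, y):
    # Ordinary least squares via the pseudoinverse.
    return np.linalg.pinv(X) @ y

def test_error(w_hat):
    # Excess risk for isotropic features: squared distance to the true weights.
    return float(np.sum((w_hat - w_star) ** 2))

# Generation 0 is trained on real data.
X0 = rng.normal(size=(T, d))
y0 = X0 @ w_star + sigma * rng.normal(size=T)

for strategy in ("replace", "accumulate"):
    X_all, y_all = X0.copy(), y0.copy()
    w_hat = fit(X_all, y_all)
    errors = [test_error(w_hat)]
    for _ in range(n_iters):
        # Fresh covariates; labels come from the previous fitted model plus noise.
        X_new = rng.normal(size=(T, d))
        y_new = X_new @ w_hat + sigma * rng.normal(size=T)
        if strategy == "replace":
            X_all, y_all = X_new, y_new                     # discard all earlier data
        else:
            X_all = np.vstack([X_all, X_new])               # keep real + all synthetic data
            y_all = np.concatenate([y_all, y_new])
        w_hat = fit(X_all, y_all)
        errors.append(test_error(w_hat))
    print(f"{strategy:>10}: gen 0 error {errors[0]:.3f}, final error {errors[-1]:.3f}")
```

With settings like these, the replace run's error typically grows many-fold over the iterations, while the accumulate run stays close to its generation-0 value, mirroring the bounded-versus-linear contrast described above.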

Empirical Validation Across Model Types

To validate our theoretical insights, we conducted extensive experiments across various generative models and data types (a generic sketch of the shared retraining loop follows this list):

  • LLMs: We trained successive generations of transformer-based LLMs on text data, observing that replacing data across iterations precipitated model collapse, as evidenced by steadily increasing test cross-entropy. Conversely, accumulating data not only halted this degradation but in some instances improved model performance.
  • Diffusion Models on Molecular Data: In molecular conformation generation, we observed the same pattern with diffusion models: data accumulation consistently outperformed the replacement strategy in maintaining model quality.
  • Image Generation with Variational Autoencoders (VAEs): In image generation, training VAEs with accumulated data substantially slowed the growth of test error, underscoring the broad applicability of our findings.
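
All three settings share the same outer loop: train a model, sample a synthetic dataset from it, and retrain, either on that synthetic dataset alone (replace) or on everything generated so far together with the original real data (accumulate). The sketch below captures this loop generically; train_model and generate_synthetic are hypothetical placeholders rather than functions from the paper's codebase, and the per-generation dataset size is an assumption of the sketch.

```python
# Schematic retraining loop (a sketch, not the authors' training code).
# `train_model` and `generate_synthetic` are hypothetical placeholders for the
# training and sampling routines of a given model family (LLM, diffusion model, VAE).
from typing import Callable, List, Sequence

def iterative_retraining(
    real_data: Sequence,
    train_model: Callable[[Sequence], object],
    generate_synthetic: Callable[[object, int], Sequence],
    n_generations: int,
    accumulate: bool = True,
) -> List[object]:
    """Retrain a model for several generations on its own outputs.

    accumulate=True  -> each generation trains on the real data plus all prior synthetic data.
    accumulate=False -> each generation trains only on the newest synthetic data ("replace").
    """
    dataset = list(real_data)
    model = train_model(dataset)
    models = [model]
    for _ in range(n_generations):
        synthetic = list(generate_synthetic(model, len(real_data)))
        dataset = dataset + synthetic if accumulate else synthetic
        model = train_model(dataset)
        models.append(model)
    return models
```

Note that under accumulation the training set grows with the number of generations; this growing data volume is the cost traded for keeping test error bounded.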

Implications and Future Directions

The implications of this research are twofold. Practically, our findings endorse the accumulation of data as a strategy to ensure the longevity and reliability of generative models trained on web-scale data, mitigating the risks of model collapse. Theoretically, this work extends our understanding of model-data feedback loops, challenging prior assumptions and spotlighting the resilience imparted by data accumulation strategies.

Looking ahead, this research opens avenues for further exploration into optimized data accumulation strategies, the dynamics of model bias in accumulated datasets, and the extension of these principles to other model architectures and training paradigms. As the boundary between real and synthetic data continues to blur, ensuring the robustness of generative models becomes paramount, with data accumulation emerging as a key strategy in this endeavor.

Authors (14)
  1. Matthias Gerstgrasser (11 papers)
  2. Rylan Schaeffer (33 papers)
  3. Apratim Dey (8 papers)
  4. Rafael Rafailov (37 papers)
  5. Henry Sleight (10 papers)
  6. John Hughes (32 papers)
  7. Tomasz Korbak (24 papers)
  8. Rajashree Agrawal (6 papers)
  9. Dhruv Pai (6 papers)
  10. Andrey Gromov (49 papers)
  11. Daniel A. Roberts (22 papers)
  12. Diyi Yang (151 papers)
  13. David L. Donoho (25 papers)
  14. Sanmi Koyejo (110 papers)