
How Bad is Training on Synthetic Data? A Statistical Analysis of Language Model Collapse (2404.05090v1)

Published 7 Apr 2024 in cs.LG, cs.AI, and cs.CL

Abstract: The phenomenon of model collapse, introduced in (Shumailov et al., 2023), refers to the deterioration in performance that occurs when new models are trained on synthetic data generated from previously trained models. This recursive training loop makes the tails of the original distribution disappear, thereby making future-generation models forget about the initial (real) distribution. With the aim of rigorously understanding model collapse in LLMs, we consider in this paper a statistical model that allows us to characterize the impact of various recursive training scenarios. Specifically, we demonstrate that model collapse cannot be avoided when training solely on synthetic data. However, when mixing both real and synthetic data, we provide an estimate of a maximal amount of synthetic data below which model collapse can eventually be avoided. Our theoretical conclusions are further supported by empirical validations.

Insights into Model Collapse During Synthetic Data Training

The paper by Seddik et al. undertakes a rigorous theoretical examination of the performance degradation known as "model collapse" that can occur when LLMs are recursively trained on synthetic data generated by previous model iterations. The work is motivated by the growing use of LLMs to generate synthetic text, which can contaminate subsequent rounds of model training once it is incorporated into training datasets. The authors examine both fully synthetic training scenarios and settings where real and synthetic data are combined, providing theoretical results supported by empirical validation.

Key Contributions and Methodology

At the heart of this research is a statistical model that permits a precise characterization of model collapse. The authors investigate the detrimental effect of training on purely synthetic data (the Fully Synthetic case), showing that model collapse is inevitable in this regime. When the training loop relies solely on synthetic data, successive models lose informational diversity, a phenomenon the authors formalize as total collapse: convergence to a Dirac measure, i.e., a distribution that assigns all probability to a single token.
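
To make the total-collapse dynamics concrete, here is a minimal simulation sketch, assuming a toy multinomial next-token model rather than a full LLM (all parameters are illustrative, not taken from the paper). Each generation refits by maximum likelihood on a finite sample drawn from the previous generation, so tokens that happen to receive zero counts vanish permanently and the support shrinks toward a single token:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 50    # toy vocabulary size (illustrative)
n_samples = 100    # finite sample size per generation
generations = 2000

# A non-trivial "real" initial distribution with a long tail.
p = rng.dirichlet(np.ones(vocab_size))

for g in range(generations):
    # Each generation is trained only on samples from the previous model;
    # for a multinomial, the maximum-likelihood fit is the empirical frequency.
    counts = rng.multinomial(n_samples, p)
    p = counts / n_samples
    if np.count_nonzero(p) == 1:
        print(f"total collapse to a single token at generation {g}")
        break
else:
    print(f"support after {generations} generations: {np.count_nonzero(p)} tokens")
```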

The paper identifies two pivotal components of model collapse: statistical approximation error and functional approximation error. The statistical approximation error arises because a model trained on a finite dataset captures only limited information; rare tokens may never appear in the sample and are therefore lost. The functional approximation error stems from the limited expressiveness of the model class itself. The paper analyzes these errors within theoretical frameworks, including next-token-prediction models that simulate LLM behavior.
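
The statistical approximation error can be seen in isolation in a one-generation version of the same toy multinomial model: any token whose true probability is far below 1/n is likely to receive zero counts among n samples, so the fitted model silently truncates the tail. The numbers below are illustrative, not drawn from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

# A true next-token distribution with one rare tail token (values illustrative).
p_true = np.array([0.5, 0.3, 0.15, 0.049, 0.001])
n = 100  # finite training-set size

p_hat = rng.multinomial(n, p_true) / n
print("true     :", p_true)
print("estimated:", p_hat)  # the 0.001 token is usually estimated as exactly 0
```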

Results and Implications

  1. Fully Synthetic Training: The analysis establishes mathematically that recursively training models using only synthetic data generated from prior models (a "self-consuming" loop) inexorably leads to total collapse. The expected time to complete degradation is shown to depend strongly on the sample size and on the richness of the initial distribution.
  2. Partially Synthetic Training: The authors study the more pragmatic scenario in which models are trained on a mixture of real and synthetic data. They derive an upper threshold on the amount of synthetic data below which collapse can eventually be avoided, showing that the synthetic share must remain significantly smaller than the real share to preserve model integrity. Theoretical bounds on the distributional deviation between generations offer practical guidance for practitioners who wish to incorporate synthetic data into training sets (see the sketch after this list).
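
The following sketch illustrates the partially synthetic regime under the same toy multinomial model as above: each generation is refit on a mixture of fresh real samples and samples from the previous generation. The mixture fractions are illustrative and do not correspond to the paper's proven threshold.

```python
import numpy as np

rng = np.random.default_rng(2)

vocab_size = 50
n = 500           # total training samples per generation
generations = 200
p_real = rng.dirichlet(np.ones(vocab_size))  # fixed "real" distribution

def drift(synthetic_fraction):
    """Refit a multinomial for several generations on a real/synthetic mix,
    then return the total-variation distance from the real distribution."""
    p = p_real.copy()
    n_syn = int(synthetic_fraction * n)
    for _ in range(generations):
        counts = (rng.multinomial(n - n_syn, p_real)  # fresh real samples
                  + rng.multinomial(n_syn, p))        # previous model's output
        p = counts / n
    return 0.5 * np.abs(p - p_real).sum()

for frac in (0.0, 0.2, 0.5, 0.9, 1.0):
    print(f"synthetic fraction {frac:.1f}: TV distance {drift(frac):.3f}")
```

Under these toy dynamics, the gap from the real distribution grows with the synthetic fraction, in line with the paper's conclusion that synthetic data must remain a minority share; the exact admissible threshold derived in the paper depends on quantities this sketch does not model.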

Concluding Observations

The paper offers robust theoretical insights into the rate of collapse and provides a concrete statistical basis for evaluating the integration of synthetic data in model training pipelines. By pioneering this methodical approach, the authors offer a roadmap for mitigating collapse risks—a consideration that will become increasingly relevant as synthetic text generation tools proliferate.

Future Research Directions: This paper lays the groundwork for numerous extensions. Incorporating high-dimensional embeddings to address functional approximation issues and analyzing the impact of in-context learning on model collapse are promising avenues for further exploration. These extensions could yield more sophisticated frameworks, enabling researchers to devise models that are resilient to the effects of synthetic training loops.

This paper is crucial for researchers focusing on the dynamics of recursive training in LLMs, particularly as the field grapples with maintaining model efficacy amid the burgeoning presence of synthetic data.

Authors (5)
  1. Mohamed El Amine Seddik
  2. Suei-Wen Chen
  3. Soufiane Hayou
  4. Pierre Youssef
  5. Merouane Debbah