The paper investigates the implications of training generative models, specifically large language models (LLMs), on data generated by previous iterations of similar models. The central finding is the discovery of "model collapse," a degenerative process in which models progressively lose the ability to represent the true underlying data distribution, with the tails of the distribution disappearing over time. This phenomenon is shown to occur in Gaussian Mixture Models (GMMs), Variational Autoencoders (VAEs), and LLMs, suggesting it is ubiquitous across learned generative models.
The authors identify two primary causes of model collapse:
- Statistical approximation error, which arises due to the finite number of samples used during training.
- Functional approximation error, stemming from limitations in the expressiveness of the function approximators used in the models.
The paper argues that access to the original data distribution is crucial for sustaining the benefits of training on large-scale data, especially for capturing low-probability events that are often relevant to marginalized groups and to understanding complex systems. The authors argue that data about genuine human interactions with systems will become increasingly valuable as LLM-generated content proliferates in data crawled from the Internet.
The paper presents a theoretical analysis of model collapse, using simplified mathematical models to provide analytical expressions for quantities of interest. The analysis focuses on quantifying how different sources of error affect the overall approximation of the original distribution. The authors consider two cases: a discrete distribution in the absence of functional approximation error, and a single-dimensional Gaussian case that portrays how functional approximation error can compound with statistical error.
Key theoretical results include:
- Demonstration that, for discrete distributions with exact approximation, model collapse arises solely from statistical errors in the sampling step, leading to eventual convergence to a delta function (a minimal simulation of this effect is sketched after this list).
- Derivation of a lower bound on the risk, defined in terms of the Wasserstein distance from the true distribution, for a single-dimensional Gaussian. The risk diverges linearly with the number of generations, indicating that the number of samples drawn per generation must grow superlinearly to maintain an accurate approximation of the original distribution.
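The discrete-case mechanism is easy to reproduce in a minimal simulation (a sketch with toy parameters of my own choosing, not the paper's experiment): if the model refits the empirical histogram perfectly each generation, the only error source is sampling noise, and low-probability states are progressively lost until a single state remains.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy discrete distribution over 10 states with a heavy head and a light tail.
p0 = np.array([0.35, 0.25, 0.15, 0.10, 0.06, 0.04, 0.02, 0.015, 0.01, 0.005])
M = 100            # samples drawn per generation (finite, hence statistical error)
max_generations = 10_000

p = p0.copy()
for gen in range(max_generations):
    samples = rng.choice(len(p), size=M, p=p)        # sample from the current model
    p = np.bincount(samples, minlength=len(p)) / M   # "perfect" refit: the empirical histogram
    if np.count_nonzero(p) == 1:
        print(f"collapsed to a delta function after {gen + 1} generations")
        break

print("surviving states:", np.flatnonzero(p))
```

Because the refit is exact, this is a finite-population resampling chain whose only absorbing states are delta functions, which is the mechanism behind the discrete-case result.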
The paper also presents empirical results that support the theoretical analysis. Specifically, the authors demonstrate model collapse in GMMs and VAEs trained from scratch, showing that the models progressively lose information about the tails of the distribution and converge to a distribution with very small variance.
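The logic of the GMM experiment can be sketched as a fit-and-resample loop (the two-cluster toy data and settings below are illustrative assumptions, not the paper's exact setup):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Assumed toy data: two well-separated 2-D Gaussian clusters.
X = np.vstack([
    rng.normal(loc=(-3.0, 0.0), scale=1.0, size=(500, 2)),
    rng.normal(loc=(+3.0, 0.0), scale=1.0, size=(500, 2)),
])

for gen in range(100):
    gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
    X, _ = gmm.sample(n_samples=200)   # the next generation is trained only on generated data
    if gen % 10 == 0:
        print(f"gen {gen}: total variance of generated data = {X.var(axis=0).sum():.3f}")
```

Tracking the variance of each generation's samples exposes the drift and progressive loss of spread; the VAE experiment follows the same generational scheme with a learned encoder and decoder in place of the mixture fit.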
In the context of LLMs, the paper investigates the effects of fine-tuning OPT-125m on data generated by previous iterations of the model. The results show that models trained on generated data exhibit degraded performance compared to models trained on the original data. The generated data also develop longer tails over successive generations, with models producing samples that the original model would consider improbable, suggesting that the models start misperceiving reality based on errors introduced by their ancestors.
The authors conduct experiments with different training regimes, including training for 5 epochs with no original training data and training for 10 epochs with 10% of the original training data preserved. Both regimes lead to degraded performance, but the preservation of original data allows for better model fine-tuning and leads to only minor degradation of performance.
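The generational fine-tuning loop can be sketched as follows (a hedged sketch: the training loop, prompt construction, and data-mixing details are simplifying assumptions for illustration, not the paper's released code):

```python
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "facebook/opt-125m"

def finetune(model, tokenizer, texts, epochs=5, lr=2e-5):
    """Plain causal-LM fine-tuning: the labels are the inputs themselves."""
    optimizer = AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for text in texts:
            batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
            loss = model(**batch, labels=batch["input_ids"]).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

def generate_corpus(model, tokenizer, prompts, max_new_tokens=64):
    """Sample the next generation's training data from the current model."""
    model.eval()
    texts = []
    for prompt in prompts:
        ids = tokenizer(prompt, return_tensors="pt").input_ids
        out = model.generate(ids, do_sample=True, top_p=0.9, max_new_tokens=max_new_tokens)
        texts.append(tokenizer.decode(out[0], skip_special_tokens=True))
    return texts

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
original_texts = ["..."]                      # stand-in for the real fine-tuning corpus
prompts = [t[:64] for t in original_texts]    # condition generation on short prefixes

train_texts = original_texts
for generation in range(5):
    # Assumption: each generation starts from the base checkpoint and is fine-tuned
    # on the previous generation's output, optionally mixed with original data.
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
    finetune(model, tokenizer, train_texts)
    generated = generate_corpus(model, tokenizer, prompts)
    keep = len(original_texts) // 10          # "10% preserved" regime; set to 0 for the fully synthetic regime
    train_texts = original_texts[:keep] + generated[keep:]
```

Evaluating each generation on held-out original text (for example, by perplexity) is what exposes the degradation described above.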
The paper also addresses the issue of repeating phrases in generated text, showing that explicitly encouraging models to produce non-repeating sequences does not curb the effects of model collapse.
The paper concludes by discussing the implications of model collapse for the long-term sustainability of LLM training. The authors emphasize the importance of preserving access to the original data source and of distinguishing LLM-generated data from other data. They suggest that community-wide coordination on the provenance of content crawled from the Internet may be necessary; otherwise, it may become increasingly difficult to train newer versions of LLMs without access to pre-LLM data or to data generated directly by humans at scale.
In the theoretical analysis, the authors model the learning process with generational data as a stochastic process. At generation $i$, the dataset $\mathcal{D}_i$ consists of i.i.d. random variables $X^i_j$, where $j \in \{1, \dots, M_i\}$ and $M_i$ is the size of the dataset. The distribution of $X^i_j$ is denoted $p_i$, with $p_0$ representing the original distribution. The transition from generation $i$ to generation $i+1$ involves estimating the distribution of samples in $\mathcal{D}_i$ with an approximation $p_{\theta_{i+1}} = \mathcal{F}_\theta(p_i)$, where $\mathcal{F}_\theta$ represents the functional approximation. The dataset $\mathcal{D}_{i+1}$ is then resampled from the distribution $p_{i+1} = \alpha_i p_{\theta_{i+1}} + \beta_i p_i + \gamma_i p_0$, with non-negative parameters $\alpha_i, \beta_i, \gamma_i$ summing up to $1$.
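A minimal sketch of one resampling step under this model (discrete distributions and the specific weights are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)

def resample_next_generation(p_theta, p_prev, p_orig, alpha, beta, gamma, M):
    """Draw the generation-(i+1) dataset from p_{i+1} = alpha*p_theta + beta*p_i + gamma*p_0."""
    mixture = alpha * p_theta + beta * p_prev + gamma * p_orig  # weights are non-negative and sum to 1
    return rng.choice(len(mixture), size=M, p=mixture)

p_orig = np.array([0.7, 0.2, 0.1])    # original distribution p_0
p_theta = np.array([0.8, 0.2, 0.0])   # fitted model that has already dropped the rare state
# Fully synthetic regime: alpha=1; here, 10% of the original distribution is retained instead.
data_next = resample_next_generation(p_theta, p_orig, p_orig, alpha=0.9, beta=0.0, gamma=0.1, M=1000)
```

Setting $\gamma_i > 0$ keeps a fraction of original data in every generation, which corresponds to the regime described earlier in which 10% of the original training data is preserved.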
For the single-dimensional Gaussian case, the authors consider $X^0_j \sim \mathcal{N}(\mu, \sigma^2)$ and estimate the sample mean and variance using
$\mu_{i+1} = \frac{1}{M_i}\sum_{j=1}^{M_i} X^i_j, \qquad \sigma^2_{i+1} = \frac{1}{M_i - 1}\sum_{j=1}^{M_i}\left(X^i_j - \mu_{i+1}\right)^2$
where:
- $\mu_{i+1}$ is the estimated sample mean at generation $i+1$
- $M_i$ is the sample size at generation $i$
- $X^i_j$ represents the samples at generation $i$
- $\sigma^2_{i+1}$ is the estimated sample variance at generation $i+1$

Samples at generation $i+1$ are then drawn as $X^{i+1}_j \sim \mathcal{N}(\mu_{i+1}, \sigma^2_{i+1})$.
They then derive the following expression for $X^n_j$:
$X^n_j = \mu + \frac{\sigma}{\sqrt{M_0}} Z^1 + \frac{\sigma}{\sqrt{M_1}}\sqrt{S^1} Z^2 + \dots + \frac{\sigma}{\sqrt{M_{n-1}}}\sqrt{S^1 \times \dots \times S^{n-1}}\, Z^n + \sigma\sqrt{S^1 \times \dots \times S^n}\, Z^n_j$
where:
- $Z^1, \dots, Z^n$ and $Z^n_j$ are random variables distributed as $\mathcal{N}(0, 1)$
- $S^1, \dots, S^n$ are random variables, with $S^i$ distributed as $\frac{1}{M_{i-1} - 1}\Gamma\!\left(\frac{M_{i-1} - 1}{2},\, 2\right)$
From this expression, they derive the following approximation for the mean and variance of $X^n_j$ (the $Z$ terms have zero mean and each $S^i$ has unit expectation, so the variance contributions simply add):
$\mathbb{E}\left[X^n_j\right] = \mu, \qquad \mathrm{Var}\left(X^n_j\right) = \sigma^2\left(\frac{1}{M_0} + \frac{1}{M_1} + \dots + \frac{1}{M_{n-1}} + 1\right)$
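A quick Monte-Carlo check of this formula (parameters are my own choices) runs the mean/variance re-estimation chain many times and compares the empirical spread of generation-$n$ samples with the predicted $\sigma^2\left(1 + n/M\right)$ for constant $M_i = M$:

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma = 0.0, 1.0
M, n_gen, n_runs = 100, 50, 5000

final_samples = []
for _ in range(n_runs):
    m, s = mu, sigma
    for _ in range(n_gen):
        x = rng.normal(m, s, size=M)        # generation-i data, drawn from N(m, s^2)
        m, s = x.mean(), x.std(ddof=1)      # re-estimated mean and (unbiased-variance) std
    final_samples.append(rng.normal(m, s))  # one sample X^n_j from the generation-n model

print("empirical Var(X^n_j):", np.var(final_samples))
print("predicted sigma^2 * (1 + n/M):", sigma**2 * (1 + n_gen / M))
```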
The authors then use the Wasserstein-2 distance to measure the distance between the true distribution and the approximated distribution at step $n+1$:
$R^2_{W_2} := W_2^2\left(\mathcal{N}(\mu, \sigma^2),\, \mathcal{N}(\mu_{n+1}, \sigma^2_{n+1})\right) = \left(\mu_{n+1} - \mu\right)^2 + \left(\sigma_{n+1} - \sigma\right)^2$
Finally, they calculate the risk, obtaining in expectation the lower bound
$\mathbb{E}\left[R^2_{W_2}\right] \ge \frac{3}{2}\sigma^2\left(\frac{1}{M_0} + \frac{1}{M_1} + \dots + \frac{1}{M_n}\right)$
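As a worked consequence of this bound, assume a constant per-generation sample size $M_i = M$; then
$\mathbb{E}\left[R^2_{W_2}\right] \ge \frac{3}{2}\sigma^2\,\frac{n+1}{M}$,
which diverges linearly in the number of generations. Keeping the risk bounded for every $n$ requires $\sum_i 1/M_i < \infty$, i.e. a superlinearly growing sample size (for example $M_i \propto i^{1+\epsilon}$ with $\epsilon > 0$), which is the superlinear-sampling requirement noted among the key theoretical results.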