Addressing Model Collapse in the Synthetic Data Age through Scaling Laws Analysis
Introduction
Generative AI models, particularly large language models (LLMs), increasingly synthesize data that ends up in the training corpora of newer models. This growing reliance on synthetic data raises pressing questions about how it affects the scaling laws that govern LLM improvement. This paper combines a theoretical framework with empirical evidence to examine the phenomenon known as "model collapse" that arises when training AI on synthetic data, its implications, and possible mitigation strategies.
Model Collapse and Scaling Laws
Scaling laws predict how LLM performance metrics, such as test error or downstream capabilities, improve with increases in model size, dataset size, or compute budget. However, the inclusion of AI-generated data in training sets can lead to "model collapse," in which a model's predictive or generative quality sharply deteriorates. The paper studies several manifestations of this collapse, illustrated by the toy sketch after this list:
- Truncation or narrowing of the distribution of synthesized data
- Loss of scaling-law behavior
- Un-learning of previously acquired skills
- Degradation of performance over successive model generations
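To make the narrowing effect concrete, here is a minimal, self-contained sketch (our illustration, not an experiment from the paper): a Gaussian is repeatedly refit by maximum likelihood to samples drawn from the previous generation's fit. Because the MLE variance estimate is biased low, the fitted distribution narrows generation over generation, a toy analogue of distributional collapse.

```python
# Toy illustration (not the paper's experiment): repeatedly fit a Gaussian
# to samples drawn from the previous generation's fitted model. The MLE
# variance estimate is biased low by a factor (n - 1) / n, so the fitted
# distribution narrows over generations.
import numpy as np

rng = np.random.default_rng(0)
n_samples = 100          # samples drawn per generation
mu, sigma = 0.0, 1.0     # generation 0: the "real" data distribution

for generation in range(1, 11):
    data = rng.normal(mu, sigma, size=n_samples)  # sample from current model
    mu, sigma = data.mean(), data.std()           # refit by maximum likelihood
    print(f"gen {generation:2d}: mu={mu:+.3f}  sigma={sigma:.3f}")
# sigma drifts downward on average: each generation re-learns a slightly
# narrower distribution than the one that produced its training data.
```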
The paper develops a theoretical framework to analyze these phenomena and substantiates it with experiments on LLMs and transformer models, using tasks such as arithmetic and text generation.
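For readers unfamiliar with the baseline being disrupted, the sketch below fits the conventional single power-law ansatz L(T) = E + A·T^(-c) to loss measurements. The data points and constants are invented placeholders, and the functional form is the standard one from the scaling-laws literature rather than a formula taken from this paper.

```python
# A minimal sketch of fitting the standard power-law scaling ansatz
# L(T) = E + A * T**(-c) to measured test losses. The measurements below
# are synthetic placeholders, not results from the paper.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(T, E, A, c):
    """Irreducible error E plus a power-law term decaying in data size T."""
    return E + A * T ** (-c)

T = np.array([1e4, 3e4, 1e5, 3e5, 1e6])           # training-set sizes (tokens)
loss = scaling_law(T, E=1.5, A=50.0, c=0.4)        # idealized measurements
loss += np.random.default_rng(1).normal(0, 0.01, T.shape)  # observation noise

(E, A, c), _ = curve_fit(scaling_law, T, loss, p0=[1.0, 10.0, 0.5])
print(f"fitted: E={E:.3f}, A={A:.2f}, c={c:.3f}")
```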
Empirical Investigations
Experiments demonstrate a range of decay phenomena tied to synthetically generated training data, validating the proposed theoretical framework. For instance, training on synthetic data measurably alters the scaling laws themselves, not merely absolute performance. Particularly notable is the discovery of a "Double Scaling Law" and a "Triplet Scaling Law" for models trained on mixtures of human and AI-generated data. These laws describe how scaling behavior changes, including performance plateaus and declines across model generations; a schematic comparison follows.
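The sketch below contrasts a clean single power law with a schematic additive "double" law, in which a second term tied to the finite amount T0 of original human data behind the synthetic generator floors the achievable error. The notation and all constants here are our own illustrative choices, not formulas copied from the paper.

```python
# Schematic illustration (our notation, invented constants): under a
# double-scaling-law picture, test error decays in training-set size T but
# is floored by a second term governed by the finite size T0 of the
# original human data that produced the synthetic generator.
import numpy as np

def clean_law(T, A=50.0, c=0.4):
    return A * T ** (-c)                     # classic single power law

def double_law(T, T0, A=50.0, c=0.4, B=50.0, c2=0.4):
    return A * T ** (-c) + B * T0 ** (-c2)   # extra floor from synthetic data

for T in [1e4, 1e6, 1e8]:
    print(f"T={T:.0e}  clean={clean_law(T):.4f}  "
          f"synthetic(T0=1e5)={double_law(T, 1e5):.4f}")
# As T grows, the clean law keeps improving while the synthetic-data law
# plateaus near B * T0**(-c2): scaling is effectively lost beyond that point.
```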
Mitigating Model Collapse
The paper not only outlines problems but also proposes strategies to mitigate model collapse. One intriguing remedy is adding a small proportion of real data to a training set that otherwise consists of synthetic data. This produces a "grokking" phenomenon: performance improvements initially plateau, then resume along a trajectory akin to traditional scaling laws once the proportion of real data crosses a threshold. The discussion also considers how to select the real data for maximum effect and emphasizes the delicate balance required to optimize model performance in a mixed-data training regime; a minimal data-mixing sketch follows.
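As a concrete, hypothetical sketch of such mixing, the helper below builds a training stream that draws each example from the real corpus with a small probability p_real and from the synthetic corpus otherwise. The function name, corpora, and mixing ratio are our own assumptions; the paper's point is only that keeping the real-data fraction above some small, task-dependent threshold restores scaling-law-like progress.

```python
# Hypothetical data-mixing helper (names and ratios are our own): build a
# training stream that draws a small fraction p_real of examples from real,
# human-generated data and the rest from a synthetic corpus.
import random
from typing import Iterator, Sequence

def mixed_stream(real: Sequence[str], synthetic: Sequence[str],
                 p_real: float, seed: int = 0) -> Iterator[str]:
    """Yield training examples, choosing a real one with probability p_real."""
    rng = random.Random(seed)
    while True:
        source = real if rng.random() < p_real else synthetic
        yield rng.choice(source)

# Usage: 2% real data mixed into an otherwise synthetic stream.
stream = mixed_stream(real=["real_doc_1", "real_doc_2"],
                      synthetic=[f"synth_doc_{i}" for i in range(100)],
                      p_real=0.02)
print([next(stream) for _ in range(8)])
```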
Conclusion and Future Directions
The exploration of how synthetic data affects LLM scaling laws opens a crucial dialogue on the sustainable development of AI models. The detailed theoretical analysis, combined with empirical validation, offers a nuanced understanding of model collapse, its drivers, and its potential remedies. Looking forward, the insights from this paper can guide the development of more resilient AI models that leverage the benefits of synthetic data without succumbing to collapse. The paper also highlights the increasing value of real, human-generated data and underscores the importance of strategic data curation and mixing to preserve the benefits of scaling laws in future AI training.
Acknowledgements
The research presented in this paper is supported by research grants and made possible through the collaboration of experts across multiple institutions, reflecting the collective effort needed to tackle complex challenges in AI research.