
Strong Model Collapse

Published 7 Oct 2024 in cs.LG and stat.ML (arXiv:2410.04840v2)

Abstract: Within the scaling laws paradigm, which underpins the training of large neural networks like ChatGPT and Llama, we consider a supervised regression setting and establish the existence of a strong form of the model collapse phenomenon, a critical performance degradation due to synthetic data in the training corpus. Our results show that even the smallest fraction of synthetic data (e.g., as little as 1% of the total training dataset) can still lead to model collapse: larger and larger training sets do not enhance performance. We further investigate whether increasing model size, an approach aligned with current trends in training LLMs, exacerbates or mitigates model collapse. In a simplified regime where neural networks are approximated via random projections of tunable size, we both theoretically and empirically show that larger models can amplify model collapse. Interestingly, our theory also indicates that, beyond the interpolation threshold (which can be extremely high for very large datasets), larger models may mitigate the collapse, although they do not entirely prevent it. Our theoretical findings are empirically verified through experiments on LLMs and feed-forward neural networks for images.

Summary

  • The paper demonstrates that even a minimal 1% inclusion of synthetic data can trigger strong model collapse, significantly degrading performance.
  • The study reveals that larger models can both worsen and partially alleviate collapse, stressing the nuanced impact of model size on performance.
  • The research employs operator-valued free probability alongside empirical tests on language and image models to uncover the mechanics of model collapse.

Strong Model Collapse: An Analytical Perspective

The paper "Strong Model Collapse" explores a critical issue within the domain of large neural networks, particularly concerning model performance degradation due to synthetic data involvement in training datasets. This study offers a rigorous exploration of the model collapse phenomenon within the scaling laws paradigm, providing both theoretical insights and empirical validations.

Key Findings

  1. Existence of Model Collapse: The researchers confirm the occurrence of a severe form of model collapse, where even a minimal inclusion of synthetic data (as low as 1%) leads to a significant deterioration in performance. This phenomenon contravenes the expectation that larger datasets inherently enhance model performance; a minimal simulation sketch of this effect appears after this list.
  2. Impact of Model Size: An intriguing finding of this work is the characterization of model size as a double-edged sword. In scenarios where neural networks are approximated via random projections, larger models exacerbate model collapse. Beyond the interpolation threshold, however, larger models may alleviate some of the collapse effects, though they do not completely avert them; the second sketch after this list probes this width effect.
  3. Methodology: The paper employs a combination of theoretical analysis and empirical verification, focusing on linear regression models and neural networks. Techniques from operator-valued free probability theory are adeptly used to derive deterministic equivalents of error terms, revealing the underlying mechanisms of model collapse.
  4. Experimental Insights: The theoretical propositions are substantiated through experiments on LLMs and on feed-forward neural networks handling image data, including datasets such as MNIST.
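
The following minimal sketch, which is not taken from the paper, illustrates the flavour of Key Finding 1 in a supervised regression setting: a small fraction of training labels is replaced by the output of a slightly biased synthetic generator, and the excess test risk of a least-squares fit stops improving as the training set grows. The dimension, noise level, generator bias, and regulariser are all illustrative assumptions.

```python
# Illustrative sketch only (not the paper's exact construction).
# A small fraction of training labels comes from a biased synthetic
# generator; the excess test risk of least squares then plateaus as n grows.
import numpy as np

rng = np.random.default_rng(0)
d = 50                                                # input dimension (assumption)
noise = 0.1                                           # label noise level (assumption)
w_real = rng.normal(size=d) / np.sqrt(d)              # "real" labelling function
w_synth = w_real + rng.normal(size=d) / np.sqrt(d)    # biased synthetic generator

def excess_risk(w_hat):
    # For isotropic Gaussian inputs, excess test risk = ||w_hat - w_real||^2.
    return float(np.sum((w_hat - w_real) ** 2))

for frac_synth in (0.0, 0.01):
    for n in (1_000, 10_000, 100_000):
        X = rng.normal(size=(n, d))
        y = X @ w_real + noise * rng.normal(size=n)
        n_s = int(frac_synth * n)
        # Overwrite a small slice of labels with synthetic ones.
        y[:n_s] = X[:n_s] @ w_synth + noise * rng.normal(size=n_s)
        # Ridge with a tiny regulariser, purely for numerical stability.
        w_hat = np.linalg.solve(X.T @ X + 1e-6 * np.eye(d), X.T @ y)
        print(f"synthetic={frac_synth:.0%}  n={n:>7,}  excess risk={excess_risk(w_hat):.2e}")
```

With no synthetic data the excess risk keeps shrinking roughly like d/n, whereas with 1% synthetic labels it flattens at a floor set by the generator's bias, which is the qualitative signature of strong model collapse in this toy setting.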
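
Key Finding 2 concerns a regime in which the network is approximated by a fixed random projection (a random-features model) and only the readout is trained. The sketch below, again an illustration under assumed settings rather than the paper's experiment, shows how one could probe the interaction between model width and a small synthetic fraction by sweeping the number of random features across the interpolation threshold (roughly m ≈ n).

```python
# Sketch of a random-projection (random-features) probe of Key Finding 2.
# All settings are assumptions; this explores how width interacts with a
# small synthetic fraction, it does not reproduce the paper's exact curves.
import numpy as np

rng = np.random.default_rng(1)
d, n, noise = 50, 2_000, 0.1
w_real = rng.normal(size=d) / np.sqrt(d)
w_synth = w_real + rng.normal(size=d) / np.sqrt(d)

X = rng.normal(size=(n, d))
y_clean = X @ w_real + noise * rng.normal(size=n)
X_test = rng.normal(size=(2_000, d))
y_test = X_test @ w_real                         # noiseless targets for test error

for m in (200, 1_000, 4_000):                    # number of random features ("model size")
    P = rng.normal(size=(d, m)) / np.sqrt(d)     # fixed random projection
    phi = np.maximum(X @ P, 0.0)                 # ReLU features; only the readout is trained
    phi_test = np.maximum(X_test @ P, 0.0)
    for frac_synth in (0.0, 0.01):
        y = y_clean.copy()
        n_s = int(frac_synth * n)
        y[:n_s] = X[:n_s] @ w_synth + noise * rng.normal(size=n_s)
        a = np.linalg.lstsq(phi, y, rcond=None)[0]   # least-squares / min-norm readout
        mse = float(np.mean((phi_test @ a - y_test) ** 2))
        print(f"width m={m:>5,}  synthetic={frac_synth:.0%}  test MSE={mse:.4f}")
```

The width sweep crosses the interpolation threshold at m ≈ n; the paper's theory predicts that below this threshold larger models can amplify the collapse, while beyond it they partially mitigate it without preventing it entirely.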

Implications

  • Theoretical Contribution: The paper advances the understanding of neural scaling laws by highlighting the nuanced effects of synthetic data usage. It cautions against simplistic data augmentation strategies, emphasizing that mere dataset size expansion does not guarantee improved performance when synthetic data is involved.
  • Practical Applications: For practitioners working with large-scale models, this research underscores the necessity for strategic data curation. Ensuring a high ratio of real to synthetic data becomes crucial, especially in systems where performance and accuracy are paramount.
  • Future Directions: This study opens several avenues for future research, including exploring mitigation strategies for model collapse and investigating other factors such as different model architectures, activation functions, and training regimes. The use of Gaussian equivalents to extend findings to fully-trained networks signifies a promising research trajectory.

Conclusion

"Strong Model Collapse" is a significant contribution that systematically documents a critical vulnerability in the training of large neural models. By bridging empirical observations with theoretical insights, the paper not only confirms prior anecdotal evidence but also sets a foundation for developing robust AI systems resilient to model collapse.
