
Theoretical Proof that Auto-regressive Language Models Collapse when Real-world Data is a Finite Set (2412.14872v3)

Published 19 Dec 2024 in cs.CL

Abstract: Auto-regressive LMs have been widely used to generate data in data-scarce domains to train new LMs, compensating for the scarcity of real-world data. Previous work experimentally found that LMs collapse when trained on recursively generated data. This paper presents a theoretical proof: once a corpus (such as a subset of the World Wide Web) begins to incorporate generated data and no new real-world data is added to the corpus, then no matter how small the amount of data each LM generates and contributes to the corpus, LM collapse is inevitable after sufficient time. This finding suggests that attempts to mitigate collapse by limiting the quantity of synthetic data in the corpus are fundamentally insufficient. Instead, avoiding collapse hinges on ensuring the quality of synthetic data.
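The collapse phenomenon the abstract refers to ("LMs collapse when trained on recursively generated data") can be illustrated with a minimal toy simulation, which is not the paper's proof and not its formal setting: here a "model" is just a categorical distribution fit by frequency counting, and each generation is retrained only on a finite sample drawn from its predecessor. The vocabulary size, sample size, and generation count are illustrative assumptions, not values from the paper.

```python
# Toy sketch (assumed setup, not the paper's construction): a "language model"
# is a categorical distribution over a small vocabulary, re-estimated each
# generation by maximum likelihood from a finite sample of the previous
# generation's output. Tokens that are never sampled get probability zero and
# can never reappear, so support shrinks and entropy drifts toward zero.
import numpy as np

rng = np.random.default_rng(0)

VOCAB = 50          # toy vocabulary size (hypothetical)
SAMPLE_SIZE = 200   # tokens each new "model" is trained on (hypothetical)
GENERATIONS = 2000

def support_and_entropy(p: np.ndarray) -> tuple[int, float]:
    """Number of tokens with nonzero probability and Shannon entropy in bits."""
    nz = p[p > 0]
    return int(nz.size), float(-(nz * np.log2(nz)).sum())

# Generation 0: the "real-world" distribution, uniform over the vocabulary.
p = np.full(VOCAB, 1.0 / VOCAB)

for gen in range(1, GENERATIONS + 1):
    # Draw a finite synthetic corpus from the current model, then fit the next
    # model by plain frequency counts (maximum likelihood).
    sample = rng.choice(VOCAB, size=SAMPLE_SIZE, p=p)
    counts = np.bincount(sample, minlength=VOCAB)
    p = counts / counts.sum()

    if gen % 200 == 0:
        support, h = support_and_entropy(p)
        print(f"gen {gen:4d}: support={support:2d}/{VOCAB}, entropy={h:.3f} bits")
```

Running this, the number of surviving tokens and the entropy decay over generations until the distribution concentrates on a handful of tokens. Note this toy covers only the purely recursive special case; the paper's claim is stronger: even when the corpus retains all real-world data and each LM adds only a small amount of generated data, collapse is still inevitable once no new real-world data enters the corpus.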
