
Beyond Model Collapse: Scaling Up with Synthesized Data Requires Verification (2406.07515v2)

Published 11 Jun 2024 in cs.LG, cs.AI, and stat.ML

Abstract: LLMs are increasingly trained on data generated by other LLMs, either because generated text and images become part of the pre-training corpus, or because synthesized data is used as a replacement for expensive human annotation. This raises concerns about *model collapse*, a drop in model performance when training sets include generated data. Considering that it is easier, for both humans and machines, to tell apart good and bad examples than to generate high-quality samples, we investigate the use of verification on synthesized data to prevent model collapse. We provide a theoretical characterization using Gaussian mixtures, linear classifiers, and linear verifiers to derive conditions with measurable proxies to assess whether the verifier can effectively select synthesized data that leads to optimal performance. We experiment with two practical tasks -- computing matrix eigenvalues with transformers and news summarization with LLMs -- which both exhibit model collapse when trained on generated data, and show that verifiers, even imperfect ones, can indeed be harnessed to prevent model collapse and that our proposed proxy measure strongly correlates with performance.

A Critical Analysis of Synthesized Data and Model Collapse in LLMs

Synthesized data generated by LLMs is increasingly considered a cost-effective alternative to human-annotated data for fine-tuning and downstream applications. Nonetheless, concerns have emerged about "model collapse," a phenomenon where models fine-tuned on such synthetic data exhibit performance degradation. This paper addresses the critical question of whether feedback from human or machine verifiers can mitigate this risk and sustain model performance.

Theoretical Foundations

Theoretical insights anchor the paper's analysis, exploring the dynamics between synthesized data quality and subsequent model performance. The authors employ a Gaussian mixture model (GMM) in a high-dimensional regime to derive conditions for optimal classification performance. The primary focus lies in introducing and analyzing a reinforcement process, modeled as a pruning strategy applied to synthesized data. Two fundamental parameters, ϕ (the rate at which correctly labeled samples are kept) and ψ (the rate at which incorrectly labeled samples are kept), are instrumental in this theoretical framework.
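The pruning step described above can be sketched as a simple stochastic filter. This is an illustrative toy model under the stated keep-probabilities, not the paper's code; the function name `prune` and the sample format are hypothetical:

```python
import random

def prune(samples, phi, psi, seed=0):
    """Verifier-style pruning of synthesized data: keep a sample with
    probability phi if its label is correct, psi if it is incorrect."""
    rng = random.Random(seed)
    kept = []
    for x, label, is_correct in samples:
        keep_prob = phi if is_correct else psi
        if rng.random() < keep_prob:
            kept.append((x, label))
    return kept

# A perfect verifier (phi=1, psi=0) keeps exactly the correctly labeled samples.
samples = [("x1", 1, True), ("x2", 0, False), ("x3", 1, True)]
print(prune(samples, phi=1.0, psi=0.0))  # [('x1', 1), ('x3', 1)]
```

An imperfect verifier corresponds to ϕ < 1 and ψ > 0, which is exactly the regime the theory analyzes.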

In this context, the paper identifies a critical threshold p⋆ = 1/(1 + ψ/ϕ), across which the probability of model success or failure shifts abruptly. When the error rate p of the synthesized data generator is below p⋆, the model achieves asymptotically optimal performance. Conversely, once p exceeds p⋆, the model undergoes performance collapse.
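One back-of-the-envelope way to read this threshold (a sanity check under the keep-probabilities ϕ and ψ, not the paper's full derivation): among samples that survive pruning, the fraction with incorrect labels is pψ / (pψ + (1−p)ϕ), which crosses 1/2 exactly at p = p⋆. A minimal sketch, with illustrative function names:

```python
def p_star(phi, psi):
    """Critical generator error rate p* = 1 / (1 + psi/phi)."""
    return 1.0 / (1.0 + psi / phi)

def post_prune_error(p, phi, psi):
    """Fraction of incorrectly labeled samples among those that survive
    pruning, assuming correct labels are kept with probability phi and
    incorrect labels with probability psi."""
    return p * psi / (p * psi + (1.0 - p) * phi)

phi, psi = 0.8, 0.2
pc = p_star(phi, psi)
print(round(pc, 6))                               # 0.8
print(round(post_prune_error(pc, phi, psi), 6))   # 0.5
```

Below the threshold, verified data is majority-correct; above it, even the filtered data is majority-wrong, which matches the abrupt success/failure transition described above.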

Empirical Validation

The paper transitions from theoretical analysis to empirical validation through simulations and real-world experiments. These practical assessments underscore the pivotal role of effective reinforcement. The experiments span two primary tasks: matrix eigenvalue prediction and news summarization with LLaMA-2.

Matrix Eigenvalue Prediction

In the context of mathematical tasks, the paper examines transformers trained to predict the eigenvalues of matrices. Using a naive beam search strategy, the authors show that even when the model generates accurate solutions, it cannot reliably assess their quality itself. As beam sizes increase, external oracle verification significantly enhances the model's accuracy. The results reveal that, while model-generated solutions can be of high quality, oracle supervision is essential for effective data selection.
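Oracle-verified beam selection can be sketched as follows. The oracle here simply checks each candidate against the ground truth, and the names (`oracle_verify`, `select_with_oracle`) and tolerance are illustrative, not taken from the paper:

```python
def oracle_verify(candidate, true_eigs, tol=1e-3):
    """Hypothetical oracle: accept a predicted spectrum if it matches
    the ground-truth eigenvalues within a tolerance."""
    if len(candidate) != len(true_eigs):
        return False
    return all(abs(a - b) <= tol
               for a, b in zip(sorted(candidate), sorted(true_eigs)))

def select_with_oracle(beam_candidates, true_eigs):
    """Return the first beam candidate the oracle accepts, or None.
    A larger beam gives the oracle more chances to find a correct answer."""
    for cand in beam_candidates:
        if oracle_verify(cand, true_eigs):
            return cand
    return None

beam = [[5.0, 6.0], [1.0002, 1.9998], [0.0, 0.0]]
print(select_with_oracle(beam, [1.0, 2.0]))  # [1.0002, 1.9998]
```

The point the experiments make is that this selection must come from an external verifier: the model's own ranking of its beam candidates does not play the same role.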

News Summarization with LLaMA-2

Turning to natural language processing, the paper studies the impact of synthesized data on news summarization tasks using LLaMA-2-7B. Synthesized news summaries curated based on their ROUGE scores (an oracle-based verification) exhibit substantial performance improvements. Conversely, self-selection or weak supervision (using LLaMA-3 for verification) fails to provide comparable benefits, further substantiating the importance of using effective external verifiers for data reinforcement.
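ROUGE-gated curation of synthesized summaries can be sketched with a toy unigram-recall scorer. This is not the official ROUGE implementation (which also handles stemming, bigrams, and longest common subsequences), and the function names are illustrative:

```python
def rouge1_recall(candidate, reference):
    """Toy ROUGE-1 recall: fraction of reference unigrams that appear
    in the candidate summary."""
    ref_tokens = reference.lower().split()
    cand_tokens = set(candidate.lower().split())
    if not ref_tokens:
        return 0.0
    return sum(tok in cand_tokens for tok in ref_tokens) / len(ref_tokens)

def select_best_summary(synthesized, reference):
    """Oracle-style curation: keep the synthesized summary that scores
    highest against the human-written reference."""
    return max(synthesized, key=lambda s: rouge1_recall(s, reference))

reference = "the cat sat on the mat"
synthesized = ["a dog barked", "the cat sat on a mat", "cat"]
print(select_best_summary(synthesized, reference))  # the cat sat on a mat
```

Scoring against a held-out human reference is what makes this an oracle: the weak-supervision variant in the paper instead asks another model to judge, which the experiments find much less effective.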

Implications and Future Prospects

This paper provides several crucial insights:

  1. Model Collapse Mitigation: The theoretical and empirical results underscore that reinforcement-based data selection can effectively mitigate model collapse, provided the verifier is sufficiently competent.
  2. Verifier Quality: The quality and relevance of the verifier relative to the generator are critical. While a more accurate verifier generally improves performance, it must also align well with the generator (i.e., a smaller angle θ between their decision directions in the paper's linear model), emphasizing the need for correlation in their assessment capabilities.
  3. Scalability of Synthesized Data: Reinforced by empirical evidence, the paper suggests that synthesized data with oracle supervision can not only prevent model degradation but can also outperform models trained solely on original human-annotated data.

These findings carry significant implications for scaling AI systems. Reinforcement learning with human feedback (RLHF) and its machine-based analogs (RLAIF) can offer robust frameworks for leveraging synthesized data, extending the lifespan and efficacy of LLMs without succumbing to model collapse.

Conclusion

In summary, this paper advances our understanding of using synthesized data for training LLMs and highlights critical strategies to avoid model collapse. By employing theoretical models, reinforced simulations, and practical applications, it demonstrates that effective feedback mechanisms are essential for sustaining and enhancing model performance. These insights pave the way for more reliable and scalable AI systems, leveraging synthesized data without compromising on quality and accuracy.

This paper sets a significant precedent in the ongoing exploration of synthesized data and model performance, providing robust methodologies and practical guidelines for future developments.

Authors (5)
  1. Yunzhen Feng
  2. Elvis Dohmatob
  3. Pu Yang
  4. Julia Kempe
  5. Francois Charton