
What Matters in LLM-generated Data: Diversity and Its Effect on Model Fine-Tuning (2506.19262v1)

Published 24 Jun 2025 in cs.CL and cs.LG

Abstract: With the remarkable generative capabilities of LLMs, using LLM-generated data to train downstream models has emerged as a promising approach to mitigate data scarcity in specific domains and reduce time-consuming annotations. However, recent studies have highlighted a critical issue: iterative training on self-generated data results in model collapse, where model performance degrades over time. Despite extensive research on the implications of LLM-generated data, these works often neglect the importance of data diversity, a key factor in data quality. In this work, we aim to understand the implications of the diversity of LLM-generated data on downstream model performance. Specifically, we explore how varying levels of diversity in LLM-generated data affect downstream model performance. Additionally, we investigate the performance of models trained on data that mixes different proportions of LLM-generated data, which we refer to as synthetic data. Our experimental results show that, with minimal distribution shift, moderately diverse LLM-generated data can enhance model performance in scenarios with insufficient labeled data, whereas highly diverse generated data has a negative impact. We hope our empirical findings will offer valuable guidance for future studies on LLMs as data generators.
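Below is a minimal illustrative sketch of the kind of setup the abstract describes: mixing LLM-generated ("synthetic") examples with real labeled examples at a chosen proportion, plus a simple distinct-n lexical diversity proxy. The function names, the fixed-total-size mixing scheme, and the distinct-n metric are assumptions for illustration; the paper's actual diversity measure and mixing procedure may differ.

```python
# Illustrative sketch, not the authors' code.
import random

def mix_datasets(real, synthetic, synthetic_ratio, seed=0):
    """Build a training set in which `synthetic_ratio` of examples are LLM-generated.

    Total size is kept equal to len(real) -- an assumption made here so that
    only the real/synthetic proportion varies, not the dataset size.
    """
    rng = random.Random(seed)
    n_total = len(real)
    n_syn = int(n_total * synthetic_ratio)
    mixed = rng.sample(real, n_total - n_syn) + rng.sample(synthetic, n_syn)
    rng.shuffle(mixed)
    return mixed

def distinct_n(texts, n=2):
    """Fraction of unique n-grams over all n-grams: a common lexical diversity proxy."""
    total, unique = 0, set()
    for t in texts:
        tokens = t.split()
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0
```

With helpers like these, one could sweep `synthetic_ratio` (e.g., 0.0 to 1.0) and compare downstream fine-tuning performance against the diversity score of each mixed set, which is the style of analysis the abstract outlines.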

Authors (9)
  1. Yuchang Zhu (12 papers)
  2. Zhonghua Zhen (1 paper)
  3. Qunshu Lin (11 papers)
  4. Haotong Wei (3 papers)
  5. Xiaolong Sun (5 papers)
  6. Zixuan Yu (2 papers)
  7. Minghao Liu (44 papers)
  8. Zibin Zheng (194 papers)
  9. Liang Chen (360 papers)