Regurgitative Training: The Value of Real Data in Training Large Language Models (2407.12835v2)
Abstract: What happens if we train a new LLM on data that are at least partially generated by other LLMs? The explosive success of LLMs means that a substantial amount of online content will be generated by LLMs rather than humans, and this content will inevitably enter the training datasets of next-generation LLMs. We evaluate the implications of such "regurgitative training" on LLM performance. By fine-tuning GPT-3.5 on data generated either by itself or by other LLMs in a machine translation task, we find strong evidence that regurgitative training clearly handicaps LLM performance. The same performance loss appears in transformer models that we train from scratch. We find suggestive evidence that the disadvantage of regurgitative training can be attributed to at least two mechanisms: (1) higher error rates and (2) lower lexical diversity in LLM-generated data compared to real data. Based on these mechanisms, we propose and evaluate three strategies to mitigate the performance loss of regurgitative training. First, we devise data-driven metrics to gauge the quality of each LLM-generated data instance, then carry out an ordered training process in which high-quality data are added before low-quality ones. Second, we combine data generated by multiple different LLMs in an attempt to increase lexical diversity. Third, we train an AI-detection classifier to differentiate between LLM- and human-generated data, and include LLM-generated data in order of its resemblance to human-generated data. All three strategies improve the performance of regurgitative training to some extent, but they cannot always fully close the gap relative to training with real data. Our results highlight the value of real, human-generated data in training LLMs, which cannot easily be substituted by synthetic, LLM-generated data.
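The abstract attributes part of the performance loss to lower lexical diversity in LLM-generated data and proposes ordering training data by quality. The paper does not specify its exact metrics, but a minimal sketch of both ideas, assuming a standard distinct-n diversity measure and a generic per-instance quality score, could look like this:

```python
# Hypothetical sketch (not the paper's exact method): distinct-n lexical
# diversity, plus the quality-ordered schedule from mitigation strategy 1.
from collections import Counter

def distinct_n(sentences, n=2):
    """Fraction of n-grams that are unique across a corpus.

    Higher values indicate greater lexical diversity; the paper reports
    that LLM-generated data score lower than real data on such measures.
    """
    ngrams = Counter()
    total = 0
    for s in sentences:
        tokens = s.split()
        for i in range(len(tokens) - n + 1):
            ngrams[tuple(tokens[i:i + n])] += 1
            total += 1
    return len(ngrams) / total if total else 0.0

def ordered_schedule(examples, quality_fn):
    """Strategy 1: add high-quality synthetic instances before low-quality
    ones, given some data-driven per-instance quality score."""
    return sorted(examples, key=quality_fn, reverse=True)

# Toy illustration: a repetitive "synthetic" corpus scores lower.
human = ["the cat sat on the mat", "a dog ran across the yard"]
synthetic = ["the cat sat on the mat", "the cat sat on the rug"]
print(distinct_n(human), distinct_n(synthetic))  # → 1.0 0.6
```

Here `quality_fn` stands in for any of the paper's data-driven quality metrics (e.g., a confidence or translation-quality estimate); it is an assumption of this sketch, not a function the paper defines.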
Authors: Jinghui Zhang, Dandan Qiao, Mochen Yang, Qiang Wei