Language Models as Continuous Self-Evolving Data Engineers (2412.15151v3)
Abstract: LLMs have demonstrated remarkable capabilities on various tasks, but their further evolution is limited by the scarcity of high-quality training data. In addition, traditional training approaches rely heavily on expert-labeled data, which imposes a ceiling on LLM performance. To address this issue, we propose a novel paradigm named LANCE (LLMs as Continuous self-Evolving data engineers) that enables LLMs to train themselves by autonomously generating, cleaning, reviewing, and annotating data with preference information. Our approach demonstrates that LLMs can serve as continuous self-evolving data engineers, significantly reducing the time and cost of post-training data construction. Through iterative fine-tuning of Qwen2-series models, we validate the effectiveness of LANCE across various tasks, showing that it can maintain high-quality data generation and continuously improve model performance. Across multiple benchmark dimensions, LANCE yields an average score improvement of 3.64 points for Qwen2-7B and 1.75 points for Qwen2-7B-Instruct. This training paradigm with autonomous data construction not only reduces reliance on human experts and external models but also ensures that the data aligns with human preferences, paving the way for future superintelligent systems that can exceed human capabilities. Code is available at: https://github.com/Control-derek/LANCE.
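The abstract outlines an iterative loop in which the model generates, cleans, reviews, and preference-annotates its own training data before fine-tuning on it. Below is a minimal Python sketch of one such round. The function names (`generate`, `answer`, `review`, `finetune`) and the `quality_floor` threshold are hypothetical stand-ins for the paper's actual components, not the LANCE implementation itself:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # response the model's own review scored highest
    rejected: str  # response scored lowest

def lance_round(
    generate: Callable[[int], List[str]],   # model proposes new prompts (stub)
    answer: Callable[[str], List[str]],     # model samples candidate responses (stub)
    review: Callable[[str, str], float],    # model scores a (prompt, response) pair (stub)
    finetune: Callable[[List[PreferencePair]], None],  # preference fine-tuning step (stub)
    quality_floor: float = 0.5,             # assumed cleaning threshold
) -> List[PreferencePair]:
    """One self-evolution round: generate -> clean -> review -> annotate -> train."""
    pairs: List[PreferencePair] = []
    for prompt in generate(8):              # 1. generate candidate prompts
        candidates = answer(prompt)         # 2. sample several responses per prompt
        scored = sorted(((review(prompt, r), r) for r in candidates), reverse=True)
        best_score, best = scored[0]
        worst_score, worst = scored[-1]
        if best_score < quality_floor:      # 3. clean: discard low-quality items
            continue
        if best_score > worst_score:        # 4. annotate with preference information
            pairs.append(PreferencePair(prompt, best, worst))
    finetune(pairs)                         # 5. fine-tune on self-annotated pairs
    return pairs

if __name__ == "__main__":
    # Toy stubs so the sketch runs end to end; real model calls replace these.
    demo_pairs = lance_round(
        generate=lambda n: [f"question {i}" for i in range(n)],
        answer=lambda p: [p + " -> draft A", p + " -> draft B"],
        review=lambda p, r: float(len(r) % 3),
        finetune=lambda pairs: print(f"fine-tuning on {len(pairs)} pairs"),
    )
```

Under these assumptions, each round feeds the model's own highest- and lowest-scored responses back as chosen/rejected pairs, so data quality and model quality can improve together without external annotators.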