Efficient Pretraining Data Selection for Language Models via Multi-Actor Collaboration (2410.08102v3)
Abstract: Efficient data selection is crucial for accelerating the pretraining of large language models (LMs). While various methods have been proposed to improve data efficiency, little research has addressed the inherent conflicts among these approaches when seeking optimal data selection for LM pretraining. To tackle this problem, we propose a multi-actor collaborative data selection mechanism: each data selection method independently prioritizes data according to its own criterion and updates its prioritization rules based on the current state of the model, functioning as an independent actor; a console then adjusts the influence of the different actors at various stages and dynamically integrates information from all actors throughout the LM pretraining process. We conduct extensive empirical studies to evaluate this multi-actor framework. The experimental results demonstrate that our approach significantly improves data efficiency, accelerates convergence in LM pretraining, and achieves an average relative performance gain of up to $10.5\%$ across multiple language model benchmarks compared with state-of-the-art methods.
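To make the actor/console interplay concrete, here is a minimal Python sketch of the idea described in the abstract: several scoring methods each rank candidate data, and a console combines their rankings with dynamically updated weights. All names (`Actor`, `Console`, `score_fn`, the feedback-based reweighting) are hypothetical illustrations; the paper's actual scoring criteria, console design, and update equations are not specified in the abstract.

```python
import numpy as np

class Actor:
    """One data selection method: scores examples by its own criterion."""
    def __init__(self, name, score_fn):
        self.name = name
        self.score_fn = score_fn  # maps (example, model_state) -> priority

    def prioritize(self, batch, model_state):
        # Each actor independently assigns a priority to every candidate.
        return np.array([self.score_fn(x, model_state) for x in batch])

class Console:
    """Combines actor priorities using stage-dependent weights."""
    def __init__(self, actors):
        self.actors = actors
        self.weights = np.ones(len(actors)) / len(actors)

    def update_weights(self, feedback):
        # Placeholder reweighting: favor actors whose recent selections
        # received higher feedback (the real update rule is unspecified).
        self.weights = np.maximum(np.asarray(feedback, dtype=float), 1e-6)
        self.weights /= self.weights.sum()

    def select(self, batch, model_state, k):
        # Stack per-actor scores into a (n_actors, batch_size) matrix,
        # form a weighted consensus score, and keep the k best examples.
        scores = np.stack([a.prioritize(batch, model_state) for a in self.actors])
        combined = self.weights @ scores
        top = np.argsort(combined)[-k:]
        return [batch[i] for i in top]

# Example usage inside a pretraining loop (model_state is whatever the
# actors need to inspect, e.g. functions computing per-example statistics):
actors = [
    Actor("quality",   lambda x, s: s["quality"](x)),
    Actor("diversity", lambda x, s: s["diversity"](x)),
]
console = Console(actors)
# selected = console.select(candidate_batch, model_state, k=1024)
# console.update_weights(per_actor_feedback)
```

The key design choice this sketch illustrates is the separation of concerns: actors never coordinate with each other directly, and conflicts between their criteria are resolved only by the console's time-varying weights.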