Enhancing Mathematical Reasoning in AI through Autonomous Data Selection: Insights from AutoMathText
Introduction to AutoMathText
In the evolving landscape of language modeling, the capability to infuse domain-specific knowledge into AI systems represents a crucial frontier. This is particularly salient in fields such as mathematics, where precise and accurate reasoning is essential. The AutoMathText initiative offers an innovative methodology for autonomously curating high-quality mathematical texts for training purposes. By leveraging meta-prompted language models as zero-shot verifiers, this approach sidesteps the need for supervised fine-tuning or for trained classifiers that depend on human-annotated data for content evaluation.
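To make the mechanism concrete, a meta-prompt of this kind might look like the sketch below. The wording is illustrative rather than the paper's exact prompt: the base model is cast as a zero-shot judge and steered toward a single affirmative or negative answer token, whose probability can later be read off as a quality signal.

```python
# Illustrative meta-prompt (paraphrased; not the paper's exact wording).
# The {text} placeholder is filled with the candidate document, and the
# model is nudged to commit to a single "YES" or "NO" answer token.
META_PROMPT = """<system>
You are an assistant with expertise in mathematics, tasked with judging
whether texts are suitable for pretraining a mathematical language model.
</system>
<user>
Does the following text exhibit genuine mathematical reasoning and offer
educational value for learning mathematics? Answer with "YES" or "NO".

Text: {text}
</user>
<assistant>
Answer: """
```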
Methodological Underpinnings
A primary contribution of the AutoMathText project is its pioneering use of base LLMs equipped with meta-prompts for autonomous data evaluation, challenging the traditional reliance on binary classifiers for data curation. The strategy is embodied in a score function computed from the softmax output over specific answer tokens, enabling a granular assessment of the mathematical quality and educational value of diverse content. This supports a data curation strategy that transcends rudimentary binary filtering and offers a potent framework for enhancing the mathematical reasoning capabilities of AI models without extensive human intervention.
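Concretely, if z_YES and z_NO denote the model's next-token logits for the affirmative and negative answer tokens at the end of the meta-prompt, the score is the softmax restricted to those two tokens, exp(z_YES) / (exp(z_YES) + exp(z_NO)), yielding a continuous value in (0, 1) rather than a hard label. A minimal sketch of this computation, assuming a Hugging Face causal LM and the illustrative META_PROMPT above:

```python
# Minimal scoring sketch, assuming a Hugging Face causal LM; the model id
# here is an assumption for illustration, not the paper's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "mistralai/Mistral-7B-v0.1"  # assumed; any base LM works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()

def lm_score(text: str) -> float:
    """Return exp(z_YES) / (exp(z_YES) + exp(z_NO)) at the answer position."""
    prompt = META_PROMPT.format(text=text)
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token logits
    # First subword token of each answer word (a simplification).
    yes_id = tokenizer.encode("YES", add_special_tokens=False)[0]
    no_id = tokenizer.encode("NO", add_special_tokens=False)[0]
    pair = torch.stack([logits[yes_id], logits[no_id]])
    return torch.softmax(pair, dim=0)[0].item()  # softmax over the two tokens
```

Because the output is a real-valued score, curation can rank documents or apply any desired threshold instead of being locked into a fixed binary filter.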
Empirical Validation
The efficacy of the AutoMathText approach is substantiated through comprehensive experimentation with the 7-billion-parameter Mistral language model. Significantly, when continually pretrained on the AutoMathText dataset, the model exhibited notable improvements on downstream tasks, specifically the MATH benchmark. This was achieved with a token count orders of magnitude smaller than in previous pretraining efforts, underscoring the method's efficiency and the quality of the autonomously selected data.
Theoretical and Practical Implications
From a theoretical standpoint, the use of meta-prompted zero-shot verifiers for autonomous data selection represents a paradigm shift in the pretraining of LLMs for specialized tasks. This method not only enhances the pretraining process by focusing on the most informative data points but also introduces a scalable mechanism for data evaluation that reduces dependence on human annotation, along with the biases annotators can introduce into content selection. Practically, the development and open-sourcing of the AutoMathText dataset catalyze further advances in AI models' ability to comprehend and solve complex mathematical tasks, delineating a path toward more intelligent and autonomous learning systems.
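For practitioners, the released data can then be consumed like any other Hugging Face dataset. The snippet below is a hypothetical loading sketch: the repository id, configuration name, and field name are assumptions for illustration, so consult the published dataset card for the actual identifiers.

```python
# Hypothetical loading sketch; repository id, config name, and field name
# are assumptions -- check the AutoMathText dataset card for actual values.
from datasets import load_dataset

ds = load_dataset(
    "math-ai/AutoMathText",   # assumed Hub repository id
    "web-0.80-to-1.00",       # assumed config: web texts in the top score band
    split="train",
    streaming=True,           # stream rather than downloading the full corpus
)
for example in ds:
    print(example["text"][:200])  # assumed field name
    break
```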
Future Trajectories
While AutoMathText's current implementation focuses on mathematical reasoning, its underlying methodology holds promise for broader applications across various specialized domains. Future explorations could extend this autonomous data selection framework to fields beyond STEM, potentially encompassing literature, history, or even nuanced interdisciplinary studies. Such endeavors would not only broaden the implications of this work but also contribute to the evolution of AI systems capable of engaging with and contributing knowledge across a wide spectrum of human intellectual pursuits.
In sum, AutoMathText marks a significant step forward in the quest to enhance the domain-specific reasoning capabilities of AI systems. By marrying the intrinsic capabilities of LLMs with sophisticated autonomous content evaluation strategies, this approach sets a new benchmark for the development of intelligent systems adept in specialized fields such as mathematics. Moving forward, the broader implications of this methodology for autonomous data curation and model training in varied domains beckon further inquiry and exploration.