
Big-Math: A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models (2502.17387v1)

Published 24 Feb 2025 in cs.LG, cs.AI, and cs.CL

Abstract: Increasing interest in reasoning models has led math to become a prominent testing ground for algorithmic and methodological improvements. However, existing open math datasets either contain a small collection of high-quality, human-written problems or a large corpus of machine-generated problems of uncertain quality, forcing researchers to choose between quality and quantity. In this work, we present Big-Math, a dataset of over 250,000 high-quality math questions with verifiable answers, purposefully made for reinforcement learning (RL). To create Big-Math, we rigorously filter, clean, and curate openly available datasets, extracting questions that satisfy our three desiderata: (1) problems with uniquely verifiable solutions, (2) problems that are open-ended, and (3) problems with a closed-form solution. To ensure the quality of Big-Math, we manually verify each step in our filtering process. Based on the findings from our filtering process, we introduce 47,000 new questions with verified answers, Big-Math-Reformulated: closed-ended questions (i.e. multiple choice questions) that have been reformulated as open-ended questions through a systematic reformulation algorithm. Compared to the most commonly used existing open-source datasets for math reasoning, GSM8k and MATH, Big-Math is an order of magnitude larger, while our rigorous filtering ensures that we maintain the questions most suitable for RL. We also provide a rigorous analysis of the dataset, finding that Big-Math contains a high degree of diversity across problem domains, and incorporates a wide range of problem difficulties, enabling a wide range of downstream uses for models of varying capabilities and training requirements. By bridging the gap between data quality and quantity, Big-Math establishes a robust foundation for advancing reasoning in LLMs.

Summary

Overview of Big-Math: A Large-Scale Math Dataset for Reinforcement Learning in LLMs

The paper "Big-Math: A Large-Scale, High-Quality Math Dataset for Reinforcement Learning in Language Models" addresses the scarcity of datasets that are both high-quality and large enough for reinforcement learning (RL), specifically in the context of enhancing the reasoning capabilities of LLMs. By methodically curating a dataset of over 250,000 math problems, the paper seeks to bridge the gap between data quality and quantity in existing datasets, a gap that limits the advancement of RL-based training methodologies.

Background

The importance of math as a testing ground for evaluating and developing sophisticated reasoning tactics in LLMs has been underscored by various research endeavors. The paper notes that most existing datasets sacrifice either the quality or the quantity of data, which poses a significant bottleneck in RL research. Datasets like GSM8k and MATH, while of high quality, provide a limited number of problems, subsequently constraining the potential for comprehensive RL training. Conversely, larger datasets often suffer from quality control issues.

Contributions

  1. Big-Math Dataset: The authors introduce Big-Math, a dataset encompassing over 250,000 carefully selected math problems with uniquely verifiable solutions, open-ended formulations, and closed-form answers, specifically curated for RL applications.
  2. Methodical Curation: By employing a detailed curation strategy consisting of rigorous filtering and cleaning—including human-in-the-loop iterations to refine their automated filters—the authors maintain a high standard in terms of both data quantity and quality. The dataset comes from nine distinct sources, analyzed to ensure they meet the desired criteria for RL usage.
  3. Big-Math-Reformulated: The paper also provides a novel subset of 47,000 reformulated questions. These originally multiple-choice questions have been transformed into open-ended ones, following a systematic and validated reformulation process, thereby aligning them with RL training needs.
  4. Diversity and Difficulty Analysis: The paper offers a thorough analysis of the dataset’s diversity across different mathematical domains and its problem-solving difficulty—using solve rates of a baseline model—thus making it adaptable for models with different capabilities and training requirements.
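The three desiderata behind the filtering pipeline can be sketched as simple predicates over (problem, answer) pairs. The function names and heuristics below are illustrative assumptions for exposition, not the authors' actual implementation, which combines many automated filters refined with human review.

```python
import re

# Crude cue for multiple-choice formatting, e.g. "(A) 4 (B) 7".
MCQ_PATTERN = re.compile(r"\([A-E]\)")
PROOF_WORDS = ("prove", "show that", "demonstrate that")

def is_open_ended(problem: str) -> bool:
    """Desideratum 2: reject multiple-choice problems, which admit guessing."""
    return MCQ_PATTERN.search(problem) is None

def has_closed_form_answer(answer: str) -> bool:
    """Desideratum 3: keep short, single-expression answers, not proofs."""
    return len(answer.split()) <= 5 and not any(
        w in answer.lower() for w in PROOF_WORDS
    )

def is_uniquely_verifiable(answer: str) -> bool:
    """Desideratum 1: reject answers listing multiple alternatives,
    which would give an ambiguous reward signal."""
    return ";" not in answer and " or " not in answer.lower()

def keep(problem: str, answer: str) -> bool:
    """A problem survives only if it satisfies all three desiderata."""
    return (is_open_ended(problem)
            and has_closed_form_answer(answer)
            and is_uniquely_verifiable(answer))

print(keep("What is 2 + 3?", "5"))               # True
print(keep("Which is prime? (A) 4 (B) 7", "B"))  # False
```

In practice each predicate would be far more robust (LaTeX-aware parsing, model-assisted checks), but the structure, a conjunction of independently verifiable filters, mirrors the curation strategy described above.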

Implications and Future Directions

  • Theoretical and Practical Implications:
    • On a theoretical front, the presence of a robust dataset like Big-Math allows for a nuanced analysis of RL training effects on different problem domains and difficulty levels, potentially leading to better curriculum learning strategies.
    • Practically, the dataset serves as a scalable solution for researchers aiming to push the boundaries of reasoning in LLMs, providing a benchmark for evaluating the generalization capabilities of models trained on mixed-difficulty data.
  • Potential for Scaling Laws Investigation: Big-Math can facilitate investigations into scaling laws, observing how RL training efficacy scales with increased data diversity and complexity.
  • Filter Enhancement: The paper acknowledges that some filters are overly stringent, discarding valid problems, and sets the stage for future improvements. Refined filters could reintroduce valuable data previously discarded, enriching the dataset further.
  • Broader Application: Although primarily designed for mathematical reasoning, the strategies employed in filtering and curating Big-Math could be adapted for other domains requiring robust question datasets with verifiable solutions.
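The difficulty analysis via baseline solve rates naturally supports the curriculum-learning use case mentioned above. The following is a minimal sketch of bucketing problems by solve rate into easy-to-hard tiers; the thresholds and record fields are assumptions for illustration, not values from the paper.

```python
from typing import Dict, List

def solve_rate(successes: int, attempts: int) -> float:
    """Fraction of sampled attempts the baseline model answered correctly."""
    return successes / attempts if attempts else 0.0

def bucket_by_difficulty(problems: List[dict]) -> Dict[str, List[dict]]:
    """Split problems into easy/medium/hard tiers by baseline solve rate.

    Thresholds (0.7, 0.2) are illustrative; a curriculum would train on
    the easy tier first and progressively mix in harder tiers."""
    buckets: Dict[str, List[dict]] = {"easy": [], "medium": [], "hard": []}
    for p in problems:
        rate = solve_rate(p["successes"], p["attempts"])
        if rate >= 0.7:
            buckets["easy"].append(p)
        elif rate >= 0.2:
            buckets["medium"].append(p)
        else:
            buckets["hard"].append(p)
    return buckets

problems = [
    {"id": 1, "successes": 9, "attempts": 10},
    {"id": 2, "successes": 4, "attempts": 10},
    {"id": 3, "successes": 0, "attempts": 10},
]
tiers = bucket_by_difficulty(problems)
print([p["id"] for p in tiers["easy"]])  # [1]
print([p["id"] for p in tiers["hard"]])  # [3]
```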

In conclusion, the Big-Math dataset not only promises to catalyze advancements in reasoning capabilities via RL in LLMs but also exemplifies a careful approach to dataset curation, balancing quality, quantity, and diversity in a field where all three are essential. Through the systematic reformulation of multiple-choice data and a comprehensive filtering process, this work sets a precedent for creating datasets equipped to explore the potential of RL in enhancing AI reasoning capabilities.