Advancing Mathematical Reasoning in AI: DeepMath-103K Dataset
The paper presents DeepMath-103K, a large-scale dataset aimed at enhancing mathematical reasoning in AI systems. It addresses key limitations of existing mathematical datasets, which often lack sufficient challenge, lack verifiable answer formats suitable for reinforcement learning (RL), or suffer from contamination with evaluation benchmarks. DeepMath-103K is meticulously curated to support the training of advanced reasoning models, offering approximately 103,000 mathematical problems with verified final answers and multiple solution paths.
Core Contributions
- Dataset Scale and Difficulty: DeepMath-103K consists of roughly 103,000 problems, concentrated at difficulty levels 5-9. In both volume and complexity this places it well beyond conventional training datasets, pushing models toward more sophisticated problem-solving strategies.
- Data Decontamination: A defining feature of the dataset is its rigorous decontamination process, which excludes problems that overlap with established evaluation benchmarks and thereby preserves the integrity of post-training evaluations (a sketch of one such check appears after this list).
- Multi-Solution Format: Each problem includes three distinct solutions generated by DeepSeek-R1. This format supports diverse training paradigms, including supervised fine-tuning, reward modeling, and model distillation, and provides rich comparative data that encourages diversified reasoning (see the record sketch below).
- Reinforcement Learning Compatibility: The inclusion of verifiable final answers enables RL techniques, such as RL-Zero, that rely on rule-based reward schemes (a minimal reward function is sketched below). This is crucial for developing models that not only solve mathematical problems but also refine their reasoning through accuracy feedback.
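To make the decontamination idea concrete, here is a minimal sketch of an n-gram overlap check against benchmark problems. The paper's exact procedure may differ (it could, for instance, use embedding-based similarity); the function names and the 0.5 threshold below are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of benchmark decontamination via word n-gram overlap.
# Not the paper's exact method; names and threshold are assumptions.

def ngrams(text: str, n: int = 8) -> set:
    """Return the set of word-level n-grams in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(problem: str, benchmark_problems: list, n: int = 8,
                    threshold: float = 0.5) -> bool:
    """True if `problem` shares a large fraction of its n-grams with any benchmark item."""
    grams = ngrams(problem, n)
    if not grams:
        return False
    for bench in benchmark_problems:
        overlap = len(grams & ngrams(bench, n)) / len(grams)
        if overlap >= threshold:  # illustrative cutoff
            return True
    return False

# Usage: keep only problems that pass the check.
# clean = [p for p in candidates if not is_contaminated(p, benchmark_problems)]
```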
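The multi-solution format is easiest to picture as a single record. The field names below are a plausible reconstruction for illustration, not the dataset's actual schema:

```python
# Hypothetical DeepMath-103K record layout (field names are assumptions).
example_record = {
    "question": "Find all real x such that x^2 - 5x + 6 = 0.",
    "final_answer": "x = 2 or x = 3",   # verifiable answer for rule-based RL rewards
    "difficulty": 5,                    # levels 5-9 dominate the dataset
    "r1_solutions": [                   # three solutions generated by DeepSeek-R1
        "Solution path 1 ...",
        "Solution path 2 ...",
        "Solution path 3 ...",
    ],
}
```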
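A rule-based reward for RL-Zero-style training can be as simple as an exact-match verifier on the final answer. The sketch below is a minimal illustration assuming the model emits a \boxed{...} answer; real verifiers typically add symbolic equivalence checking (e.g., via a computer algebra system) rather than the naive string normalization shown here:

```python
import re

def rule_based_reward(model_output: str, ground_truth: str) -> float:
    """Return 1.0 if the model's final boxed answer matches the ground truth, else 0.0.

    Assumes the answer appears as \\boxed{...}; normalization is deliberately
    naive (whitespace/case only) for illustration.
    """
    match = re.search(r"\\boxed\{([^{}]*)\}", model_output)
    if match is None:
        return 0.0
    predicted = match.group(1).strip().lower()
    return 1.0 if predicted == ground_truth.strip().lower() else 0.0

# Usage: reward = rule_based_reward(completion, record["final_answer"])
```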
Empirical Validation
The authors validate the dataset by training models under several paradigms, including RL-Zero. Models trained on DeepMath-103K show consistent improvements across challenging benchmarks, namely MATH500, AIME, AMC, Minerva Math, and OlympiadBench, with notably higher pass@1 accuracy in RL-Zero settings, underscoring the dataset's potential to advance mathematical reasoning.
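For reference, pass@1 is the probability that a single sampled solution is correct; estimated from n samples per problem, it reduces to the mean per-sample accuracy. The estimator below reflects the standard metric definition, not code from the paper:

```python
def pass_at_1(results: list) -> float:
    """Estimate pass@1 as the average per-sample correctness over all problems.

    `results[i]` holds correctness booleans for the samples drawn on problem i.
    """
    per_problem = [sum(samples) / len(samples) for samples in results]
    return sum(per_problem) / len(per_problem)

# Example: two problems, four samples each -> (0.75 + 0.25) / 2 = 0.5
print(pass_at_1([[True, False, True, True], [False, False, True, False]]))
```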
Implications and Future Directions
Practically, DeepMath-103K serves as a stepping stone toward training AI models capable of tackling complex mathematical reasoning tasks. Theoretically, it offers insight into designing training data that bridges RL and advanced problem-solving. As AI systems evolve, datasets like DeepMath-103K, built around verifiable and challenging problems, will likely become invaluable for refining AI reasoning abilities.
Future developments could extend the dataset to broader problem categories, addressing not only mathematics but cross-disciplinary STEM challenges, thus widening the applicability of AI across contexts. Exploring RL strategies tailored to such datasets could also uncover novel techniques for incremental learning and adaptability in AI systems.
In summary, DeepMath-103K fills a critical gap in AI training resources by combining challenging problems with verifiable answers, paving the way for more proficient and reliable AI reasoning systems.