NuminaMath Dataset Overview
- NuminaMath is a scalable, contest-grade dataset offering detailed, step-by-step annotated math solutions in LaTeX, sourced from competitions, exams, and forums.
- Its derivative corpora—FIM, Bridge, and CFT—enhance model training through methods like fill-in-the-middle reconstruction, gap bridging, and critique-based fine-tuning.
- Empirical results show that training on these variants significantly improves reasoning accuracy on benchmarks such as AMC23, AIME24, and GSM8K.
NuminaMath Dataset provides a scalable, contest-grade corpus for training and evaluating mathematical reasoning in LLMs. Developed to address the lack of high-quality, step-by-step annotated math problem solutions at scale, NuminaMath and its major derivatives—including NuminaMath-CoT, NuminaMath-FIM, NuminaMath-Bridge, and NuminaMath-CFT—enable fine-grained investigation and systematic enhancement of machine mathematical abilities through Chain-of-Thought (CoT) paradigms, intermediate-step expansion, and automated critique protocols.
1. Origins and Underlying Structure
NuminaMath originated as a large-scale, open-source repository of mathematical problems and solutions curated from national and international competitions, high-school and university examinations, and online mathematics forums. The foundational release, NuminaMath-CoT, comprises approximately 853,000 to 860,000 human-verified question–solution pairs spanning a broad spectrum of topics: algebra, geometry, combinatorics, number theory, probability, and elementary arithmetic. Each example is formatted in LaTeX and articulated using a step-by-step Chain-of-Thought style. Step granularity varies naturally due to aggregation from diverse expert sources, ranging from routine calculations to Olympiad-level reasoning (Yan et al., 17 Feb 2025, Xu et al., 20 May 2025, Wang et al., 29 Jan 2025).
The primary value proposition of NuminaMath lies in its scale, verification, and coherence, facilitating precise instruction-tuning and benchmarking of mathematical reasoning in LLMs. NuminaMath-CoT serves as the foundation for experimentation with advanced data augmentation and error-feedback methods.
2. Major Derivative Corpora and Expansion Protocols
NuminaMath has spawned several specialized datasets optimized for distinct modeling strategies, each designed to address specific limitations of raw CoT resources.
2.1 NuminaMath-FIM
NuminaMath-FIM [Fill-in-the-Middle] is constructed via an automated fill-in-the-middle transformation of NuminaMath-CoT. Each CoT solution is decomposed into ordered reasoning steps; for each solution, three random step indices are sampled, each yielding one sample in which the chosen step becomes the middle segment m, the preceding steps the prefix p, and the following steps the suffix s. Special tokens <|fim_prefix|>, <|fim_suffix|>, <|fim_middle|> (in the PSM order of Bavarian et al. 2022) delineate the input: for each problem q, the model is tasked with reconstructing m given (q, p, s). This transformation yields approximately 2.56 million FIM-style samples from 853,000 unique problems, with each sample containing one held-out step (Yan et al., 17 Feb 2025).
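The transformation above can be sketched as follows. This is a minimal illustration, assuming each solution is already split into a list of step strings; the function name and signature are illustrative, not the released pipeline's API.

```python
import random

# Special tokens in PSM (prefix-suffix-middle) order, as described above.
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"

def make_fim_samples(question, steps, n_samples=3, rng=random):
    """Turn one CoT solution (ordered step strings) into FIM samples.

    For each sampled index i, step i is held out as the middle segment;
    earlier steps form the prefix and later steps form the suffix.
    """
    samples = []
    indices = rng.sample(range(len(steps)), min(n_samples, len(steps)))
    for i in indices:
        prefix = "\n".join(steps[:i])
        middle = steps[i]
        suffix = "\n".join(steps[i + 1:])
        samples.append(
            f"{FIM_PREFIX} Q: {question}\n{prefix}\n"
            f"{FIM_SUFFIX} {suffix}\n{FIM_MIDDLE} {middle}"
        )
    return samples
```

With three sampled indices per solution, 853K problems yield roughly 2.56M samples, matching the figure quoted above.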
2.2 NuminaMath-Bridge
NuminaMath-Bridge is generated by detecting “thought leaps” within NuminaMath-CoT solution chains—steps where intermediate logical details are omitted. Using an external bridging model trained on a synthetic, structured corpus (ScaleQM+), missing steps are inserted to restore logical completeness. The process systematically augments each solution chain, retaining the training/validation/test partitions of the original dataset (Xu et al., 20 May 2025).
2.3 NuminaMath-CFT
NuminaMath-CFT defines a subset tailored for Critique Fine-Tuning. From the full pool of roughly 860,000 examples, 50,000 are randomly sampled; for each, the possibly noisy released solution is used as the candidate for critique. GPT-4o-1120 (“teacher critique model”) is prompted to generate detailed critiques—diagnosing correctness, logical errors, or omissions—followed by a terminal correctness label. Each record thus comprises the problem, noisy solution, and corresponding critique. All 50,000 examples are used for one epoch of model fine-tuning without further splitting (Wang et al., 29 Jan 2025).
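The record-assembly step can be sketched as follows. This is a hedged illustration: `critique_fn` stands in for a call to the GPT-4o-1120 teacher, and the prompt wording is hypothetical, not the paper's exact template.

```python
import random
from typing import Callable, List

def build_cft_record(problem: str, noisy_solution: str,
                     critique_fn: Callable[[str], str]) -> dict:
    """Assemble one Critique Fine-Tuning record: problem, noisy
    solution, and a teacher-generated critique ending in a verdict."""
    prompt = (
        "Critique the following solution step by step, then state "
        "whether it is correct.\n"
        f"Problem: {problem}\nSolution: {noisy_solution}"
    )
    return {
        "problem": problem,
        "solution": noisy_solution,
        "critique": critique_fn(prompt),  # detailed critique + verdict
    }

def sample_cft_subset(pool: List[dict], k: int = 50_000, seed: int = 0) -> List[dict]:
    """Randomly sample the 50K critique subset from the full pool."""
    return random.Random(seed).sample(pool, min(k, len(pool)))
```

The sampled records are then used directly for one epoch of fine-tuning, with no further train/validation split.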
3. Data Schema and Format
Records across NuminaMath and its major derivatives conform to clear, structured schemas:
| Version | Input Fields | Target / Output |
|---|---|---|
| NuminaMath-CoT | question | full stepwise solution |
| NuminaMath-FIM | question, prefix, suffix (wrapped in FIM special tokens) | held-out middle step |
| NuminaMath-Bridge | question, step sequence with detected gaps | bridged solution chain with inserted steps |
| NuminaMath-CFT | question, noisy solution | GPT-4o critique + correctness conclusion |
All problem statements and solutions use LaTeX encoding for mathematical notation. In NuminaMath-FIM and -Bridge, logical and mathematical coherence is inherited from the original human-verified chains or is restored by bridging models referencing structured reasoning templates (Yan et al., 17 Feb 2025, Xu et al., 20 May 2025).
A representative LaTeX-formatted NuminaMath-FIM example:
```
<|fim_prefix|> Q: What is the area of a circle of radius 3?
1. The area of a circle is $A = \pi r^2$.
<|fim_suffix|> 3. Therefore, the area is $9\pi$.
<|fim_middle|>
```

Target (held-out middle step):

```
2. Substitute $r = 3$: $A = \pi \cdot 3^2 = 9\pi$.
```
4. Construction Workflow and Quality Control
Each dataset derivative leverages both deterministic procedures and probabilistic sampling to maximize coverage while maintaining data integrity.
- NuminaMath-FIM: Ensures human-verified context by exclusively using original, validated solutions. The random sampling of held-out indices diversifies learning signals. Loss during fine-tuning is computed only on the tokens following the <|fim_middle|> marker, tightly focusing the learning objective. During inference, generated intermediate steps whose string similarity to an adjacent existing step exceeds 0.8 are discarded as trivial “echoes,” ensuring that inserted steps possess genuine elaborative value (Yan et al., 17 Feb 2025).
- NuminaMath-Bridge: Applies algorithmic gap detection and insertion via CoT-Bridge, augmenting the original data with explicit steps while maintaining logical flow. Validity is assessed through repeated sampling, model accuracy gains, and qualitative inspection (Xu et al., 20 May 2025).
- NuminaMath-CFT: Critiques are generated directly by GPT-4o in response to the released (unfiltered) solutions, reflecting a minimally biased testbed for model critique abilities (Wang et al., 29 Jan 2025).
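Two of the FIM quality-control mechanisms can be sketched concretely. This is a framework-agnostic illustration: `IGNORE_INDEX = -100` follows the common cross-entropy ignore-label convention, and `SequenceMatcher` is a stand-in for the unspecified string-similarity measure; only the 0.8 threshold comes from the source.

```python
from difflib import SequenceMatcher
from typing import List

IGNORE_INDEX = -100  # conventional "ignore" label for cross-entropy loss

def mask_before_middle(token_ids: List[int], labels: List[int],
                       middle_token_id: int) -> List[int]:
    """Compute loss only on tokens after the <|fim_middle|> marker by
    masking all earlier label positions."""
    cut = token_ids.index(middle_token_id) + 1
    return [IGNORE_INDEX] * cut + labels[cut:]

def is_echo(generated: str, context_steps: List[str],
            threshold: float = 0.8) -> bool:
    """Flag a generated middle step as a trivial 'echo' when it is too
    similar to any surrounding step."""
    return any(
        SequenceMatcher(None, generated.lower(), s.lower()).ratio() > threshold
        for s in context_steps
    )
```

In practice the mask keeps the prefix, suffix, and special tokens as conditioning context only, while the echo filter rejects candidate steps that merely restate an existing one.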
5. Coverage, Statistics, and Benchmarking Utility
NuminaMath and its derivatives provide extensive coverage across mathematical domains and difficulty levels, with the following high-level properties:
| Property | NuminaMath-CoT | NuminaMath-FIM | NuminaMath-Bridge | NuminaMath-CFT |
|---|---|---|---|---|
| Problems | 853K–860K | 853K | 859K | 50K |
| Data Points / Samples | 853K | 2.56M | 859K chains | 50K |
| Style | CoT | FIM (PSM) | CoT bridged | [Q, noisy A, critique] |
| Step Granularity | 1–20 / prob. | 1 held-out/prob | variable, bridged | — |
| Primary Use | SFT | FIM SFT | SFT / validation | CFT |
| Key Benefit | Chain veracity | Step insertion | Gap restoration | Critique data |
NuminaMath datasets are routinely used for benchmarking advances in mathematical LLMs (e.g., Meta-Llama 3.1-8B, Qwen2.5-Math-1.5B, DeepSeek-Math families), and for evaluating model generalization to external benchmarks such as GSM8K, MATH, MathOdyssey, OlympiadBench, AMC23, and AIME24 (Yan et al., 17 Feb 2025, Xu et al., 20 May 2025, Wang et al., 29 Jan 2025).
6. Impact on Model Training and Downstream Performance
Systematic experiments demonstrate that dataset variants targeting intermediate step recovery or reasoning completeness produce measurable accuracy improvements across standard benchmarks:
- Bridged Chains: Replacing NuminaMath-CoT with NuminaMath-Bridge yields substantial absolute accuracy gains. For Meta-Llama 3.1-8B, mean benchmark accuracy increases from 43.87% (CoT) to 49.74% (bridged, +5.87 pp), with larger gains on individual benchmarks (e.g., 20%→35.63% on AMC23, +15.63 pp) (Xu et al., 20 May 2025).
- Fill-in-the-Middle Expansion: Training models to recover omitted steps using NuminaMath-FIM improves intermediate-step reasoning without recourse to powerful external models and enables automated dataset expansion through plug-and-play application of the MathFimer model (Yan et al., 17 Feb 2025).
- Critique Fine-Tuning: Critique-based fine-tuning on a 50K slice of NuminaMath-CFT delivers a +5.5 percentage point gain in mean accuracy over conventional supervised fine-tuning (e.g., 47.3%→52.8% on Qwen2.5-Math-7B), with notable uplifts on MATH, AMC23, and AIME24 (Wang et al., 29 Jan 2025).
A plausible implication is that augmentation protocols focusing on step completeness and error diagnosis are critical for enabling robust, granular mathematical reasoning in both small and large LLMs.
7. Role in the Research Ecosystem and Future Directions
NuminaMath and its FIM, Bridge, and CFT derivatives constitute a central infrastructure for both algorithmic innovation and careful evaluation in math-focused LLM research. Their open, LaTeX-native, competition-grade formulation allows direct interoperability with other datasets (MetaMathQA, WebInstruct) and facilitates both direct supervised fine-tuning and more advanced learning paradigms such as critique feedback and CoT-bridging.
Ongoing trends include the development of generalized augmentation pipelines (FIM, gap-bridging) with minimal external dependencies and the integration of critique signals in reinforcement learning or knowledge distillation frameworks. A plausible implication is that continued refinement of such corpora—by targeted expert re-annotation, domain coverage extension, and integration with out-of-domain reasoning tasks—will compound gains in mathematical and general logical model performance.
References:
- MathFimer: Enhancing Mathematical Reasoning by Expanding Reasoning Steps through Fill-in-the-Middle Task (Yan et al., 17 Feb 2025)
- Mind the Gap: Bridging Thought Leap for Improved Chain-of-Thought Tuning (Xu et al., 20 May 2025)
- Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate (Wang et al., 29 Jan 2025)