OpenMathReasoning Dataset
- OpenMathReasoning is a comprehensive dataset featuring 540,000 high-school and Olympiad problems with multi-step, LaTeX-supported solution traces for rigorous mathematical reasoning.
- It integrates chain-of-thought, tool-integrated reasoning, and generative solution selection to provide diverse, high-quality problem-solving paths validated by automated and manual filtering.
- The dataset underpins competitive benchmarking and model development, demonstrating notable accuracy gains on AIME, HMMT, and AIMO-2 tasks under an open Apache 2.0 license.
The OpenMathReasoning Dataset is the largest publicly available corpus of long-form mathematical reasoning. It comprises 540,000 unique high-school and olympiad-level problems sourced from the Art of Problem Solving (AoPS) forums, together with 3.2 million chain-of-thought (CoT) solutions, 1.7 million tool-integrated reasoning (TIR) solutions that interleave code execution with natural-language deduction, and 566,000 generative solution selection (GenSelect) cases. Its design supports the development and benchmarking of neural models for mathematical problem solving, including models that combine automated reasoning with programmatic verification (Moshkov et al., 23 Apr 2025).
1. Dataset Composition and Sources
OpenMathReasoning draws exclusively from AoPS forum posts, excluding the “Middle School Math” category to focus on high-school and olympiad-level content. The problem set ranges from routine contest practice to national-level olympiad questions. Binary (yes/no) and multiple-choice problems are removed by an automatic classifier, leaving a challenging collection whose difficulty is consistent with contest problem distributions.
Of the 540,000 problems:
- 260,000 originated as proof-based questions and were converted to numeric-answer form.
- 190,000 have answers extracted directly from forum discussions.
- 90,000 lack explicit answers; for these, the dataset designates as ground truth the most common model prediction across multiple candidates.
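The majority-vote labelling used for the 90,000 answerless problems can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `majority_answer` and the whitespace-only normalisation are assumptions, and a real pipeline would likely need a stronger notion of answer equivalence.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_answer(predictions: Sequence[Optional[str]]) -> Optional[str]:
    """Return the most frequent non-empty prediction, or None if there is none.

    Hypothetical helper: answers are compared as stripped strings only.
    """
    # Drop missing or blank predictions and trim surrounding whitespace.
    cleaned = [p.strip() for p in predictions if p is not None and p.strip()]
    if not cleaned:
        return None
    # most_common(1) returns the top (answer, count) pair; ties resolve to
    # the answer that was encountered first.
    answer, _count = Counter(cleaned).most_common(1)[0]
    return answer
```

For example, the candidate list `["45", "0", "0", " 0 "]` would be labelled `"0"`.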
Table 1: Problem and Solution Counts
| Category | Count | Notes |
|---|---|---|
| Unique math problems | 540,000 | AoPS forums excluding “Middle School Math”; high-school level and above |
| Chain-of-Thought solutions | 3,200,000 | 8–12 steps, LaTeX-supported, filtered |
| Tool-Integrated Reasoning | 1,700,000 | Python code + NL, up to 8 tool-calls |
| GenSelect cases | 566,000 | Solution selection among 2–16 candidates |
CoT reasoning traces on average consist of 8–12 explicit steps, frequently employing LaTeX-style mathematical notation. TIR solutions interleave at most 8 Python code execution blocks in a single 16,384-token trace, with blocks supporting novel computation or verification of CoT steps. GenSelect examples feature candidate solution summaries with a correct index label.
2. Data Format and Schema
The corpus is stored as a single JSONL file, “problems.jsonl,” with one record per problem. Each record encapsulates all data relevant to a single math problem, according to the following schema:
```
{
  "id": <string>,                    // unique identifier
  "source": "AoPS",
  "problem_text": <string>,          // may include LaTeX
  "answer": <string> or null,
  "cot_solutions": [Solution],       // CoT traces
  "tir_solutions": [Solution],       // TIR traces, optional
  "genselect": {
    "candidates": [<string>, ...],   // solution summaries
    "selected": <integer>            // index of correct summary
  }
}
```
Each Solution object conforms to:
```
{
  "solution_id": <string>,
  "steps": [<string>, ...],   // LaTeX/NL steps
  "summary": <string>,        // short summary ending in the final answer, e.g. "\boxed{0}"
  "full_text": <string>
}
```
Mathematical notation is supported throughout via inline LaTeX. TIR traces incorporate code either as Markdown blocks or explicit <tool_call></tool_call> tags.
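As an illustration of the tagged form, the sketch below pulls the code out of `<tool_call></tool_call>` spans in a TIR trace and checks the 8-call limit quoted earlier. The helper names are hypothetical, the Markdown-fenced variant is not handled, and the exact trace layout is an assumption based only on the description above.

```python
import re
from typing import List

# Non-greedy match so that consecutive tool calls are captured separately;
# DOTALL lets the code span multiple lines.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

MAX_TOOL_CALLS = 8  # limit stated in the dataset description


def extract_tool_calls(trace: str) -> List[str]:
    """Return the code inside each <tool_call> span, in order of appearance."""
    return [code.strip() for code in TOOL_CALL_RE.findall(trace)]


def within_tool_budget(trace: str) -> bool:
    """True if the trace uses no more than MAX_TOOL_CALLS tool calls."""
    return len(extract_tool_calls(trace)) <= MAX_TOOL_CALLS
```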
An abridged sample record for a combinatorics problem:
```
{
  "id": "cassowary-0001",
  "problem_text": "Call a 9-digit number a *cassowary* if it uses each digit 1–9 exactly once. Compute the number of prime cassowaries.",
  "answer": "0",
  "cot_solutions": [
    {
      "solution_id": "cot-QwQ32B-1",
      "steps": [
        "Compute the digit sum: $1 + 2 + \\cdots + 9 = 45$.",
        "Since $9 \\mid 45$, any permutation of 1–9 is divisible by 9 and composite.",
        "Hence there are no prime cassowaries."
      ],
      "summary": "All 9-digit pandigitals sum to 45, divisible by 9 ⇒ composite ⇒ \\boxed{0}",
      "full_text": "<think>…</think> \\boxed{0}"
    }
  ],
  "tir_solutions": [...],
  "genselect": {
    "candidates": [
      "... summary of solution A ...",
      "... of solution B ..."
    ],
    "selected": 0
  }
}
```
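A record stream in this layout could be read along the following lines. This is a sketch under the schema described above: the helper names and the choice of required fields are assumptions, and the function takes any iterable of lines so it works equally on an open file or an in-memory list.

```python
import json
from typing import Any, Dict, Iterable, Iterator, Optional

# Fields assumed to be present in every record; "tir_solutions" is optional
# per the schema, and "genselect" is treated as optional here as well.
REQUIRED_FIELDS = ("id", "problem_text", "answer", "cot_solutions")


def iter_records(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Parse JSONL lines into dicts, skipping blanks and checking required keys."""
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        missing = [name for name in REQUIRED_FIELDS if name not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing fields {missing}")
        yield record


def selected_summary(record: Dict[str, Any]) -> Optional[str]:
    """Return the GenSelect candidate marked as correct, if the record has one."""
    genselect = record.get("genselect")
    if not genselect:
        return None
    return genselect["candidates"][genselect["selected"]]


# Typical use, assuming the file name given in the text:
#     with open("problems.jsonl", encoding="utf-8") as handle:
#         for record in iter_records(handle):
#             print(record["id"], len(record["cot_solutions"]))
```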
3. Annotation, Curation, and Quality Assurance
High-quality reasoning is ensured through multi-stage candidate generation and rigorous filtering:
- CoT filtering: Up to 32 candidates are generated per problem from QwQ-32B and DeepSeek-R1 with temperature 0.7, top-p 0.95, max 16,384 tokens. Qwen2.5-32B-Instruct is prompted to check answer equivalence; only chains that match the expected answer are retained. From 5.2M raw CoT chains, 3.2M survive this filtration (500K of 1M from QwQ-32B, 2.7M of 4.2M from DeepSeek-R1).
- TIR solution generation: In the stage 0 pipeline, LIMO-Qwen-32B generates 1.2M TIR candidates (each with up to 8 code blocks). An LLM classifies each code block along two axes: novel calculation versus verification, and significant, moderate, or trivial impact. A solution is retained only if it contains at least one block that is both novel and significant, or if more than half of its blocks are novel and moderate. Only traces yielding the correct answer and with at most 2 code blocks are retained; stage 0 yields 15,000 examples. Subsequent rounds of filtering and fine-tuning (first with QwQ-32B, later with a 14B model) increase this figure to 1.7M TIR solutions.
- GenSelect data construction: For each problem, 2–16 solution summaries (with at least one correct and one incorrect) are sampled repeatedly, creating up to 8 comparisons per problem. QwQ-32B selects the preferred candidate among these; examples in which it selects an incorrect candidate are removed. This yields 566,000 high-precision GenSelect entries from an initial 1M raw set.
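The TIR retention rule on classified code blocks can be written as a small predicate. The sketch below is illustrative only: the `CodeBlockLabel` type and the label strings are hypothetical, the LLM classifier that would produce the labels is out of scope, and rejecting traces with no code blocks is an added assumption.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class CodeBlockLabel:
    """Hypothetical container for the two classification axes of one code block."""
    kind: str    # "novel" or "verification"
    impact: str  # "significant", "moderate" or "trivial"


def keep_tir_solution(blocks: Sequence[CodeBlockLabel]) -> bool:
    """Apply the retention rule: one novel+significant block, or a
    strict majority of novel+moderate blocks."""
    if not blocks:
        # Assumption: a trace without code is not a useful TIR example.
        return False
    has_novel_significant = any(
        b.kind == "novel" and b.impact == "significant" for b in blocks
    )
    novel_moderate = sum(
        b.kind == "novel" and b.impact == "moderate" for b in blocks
    )
    # "More than half" is a strict inequality, checked without division.
    return has_novel_significant or novel_moderate * 2 > len(blocks)
```

Under this reading, a trace with one novel and moderate block and one trivial verification block is dropped, because exactly half is not more than half.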
4. Dataset Splits, Licensing, and Access
OpenMathReasoning provides data splits designed for both competition benchmarking and general cross-validation:
- The “Comp-Math-24-25” validation set contains 256 problems from AIME 2024/25 and HMMT 2024/25.
- Approximately 540,000 AoPS problems form the training pool.
- An additional standard 80/10/10 train/dev/test split is provided for general use.
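The text does not say how the 80/10/10 partition is produced. One common way to get a reproducible split is to hash each problem `id` into a bucket, as in the sketch below; the function name and the hashing scheme are assumptions, not the dataset's documented procedure.

```python
import hashlib


def assign_split(problem_id: str) -> str:
    """Map a problem id to "train", "dev" or "test" in roughly 80/10/10 proportion.

    Hypothetical scheme: the assignment depends only on the id, so it is
    stable across runs and machines.
    """
    digest = hashlib.sha256(problem_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer and reduce to 10 buckets.
    bucket = int.from_bytes(digest[:8], "big") % 10
    if bucket < 8:
        return "train"
    return "dev" if bucket == 8 else "test"
```

Hashing the id rather than shuffling avoids having to store the assignment, and all solutions attached to a problem stay in the same split because the split is decided per problem.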
All data and code are licensed under the Apache 2.0 license, which permits commercial use, modification, and redistribution by researchers and industry practitioners, subject to the license's attribution and notice requirements.
5. Benchmarking and Model Development
The dataset is used to fine-tune Qwen2.5-Base models at 1.5B, 7B, 14B, and 32B parameters, released as “OpenMath-Nemotron,” on a mixture of CoT, TIR, and GenSelect tasks totaling 5.5 million examples. Notable results include:
- On “Comp-Math-24-25” (majority@64 accuracy, AIME 2024):
  - Qwen2.5-Base-1.5B: 26.8% → OpenMath-Nemotron-1.5B CoT: 61.6%
  - Qwen2.5-Base-14B: 65.8% → OpenMath-Nemotron-14B CoT: 76.3%
- The TIR prompt yields an additional 10 percentage points over CoT alone.
- GenSelect with 16 candidates provides a further 10 percentage points, reaching approximately 90% accuracy.
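For reference, majority@k accuracy of the kind quoted above can be computed as in the following sketch. It assumes final answers have already been extracted as strings and compares them by exact match after stripping whitespace; the function name and that comparison rule are assumptions rather than the evaluation code used for the reported numbers.

```python
from collections import Counter
from typing import Sequence


def majority_at_k(
    samples_per_problem: Sequence[Sequence[str]],
    references: Sequence[str],
    k: int,
) -> float:
    """Fraction of problems whose most frequent answer among the first k
    samples equals the reference answer."""
    if len(samples_per_problem) != len(references):
        raise ValueError("need exactly one reference per problem")
    if not references:
        raise ValueError("no problems to score")
    correct = 0
    for samples, reference in zip(samples_per_problem, references):
        votes = Counter(s.strip() for s in samples[:k])
        # A problem with no samples counts as incorrect.
        if votes and votes.most_common(1)[0][0] == reference.strip():
            correct += 1
    return correct / len(references)
```

With `k=64`, each problem's 64 sampled answers are collapsed into a single voted answer before scoring.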
The 14B model, employing TensorRT-LLM optimizations and methodical merging of CoT and TIR checkpoints, solved 34 out of 50 previously unpublished AIMO-2 competition tasks within 5 hours on four L4 GPUs, which secured the competition victory (Moshkov et al., 23 Apr 2025).
6. Comparisons with Prior Work and Prospects
Relative to prior AoPS-derived datasets such as AoPS-Instruct (650,000 problems, lacking long CoT or TIR) or NuminaMath-1.5 (68,000 problems), OpenMathReasoning offers far more detailed solution traces. Compared with widely cited math benchmarks such as GSM8K (about 8,500 grade-school word problems) and MATH (12,500 competition-level problems), OpenMathReasoning is substantially larger and additionally includes solutions with multi-step code execution.
Possible future extensions include:
- Augmentation with backward-reasoning (BackMATH-style) examples,
- Synthetic expansion of under-represented topics via GPT-4 bootstrapping,
- Joint use with datasets like Omni-Math or MAmmoTH to enable richer instruction-tuning routines.
7. Impact and Research Applications
OpenMathReasoning has been used to train the open-weight OpenMath-Nemotron models, which improve substantially over their base models on AIME and HMMT problems and achieved competitive success in the AIMO-2 contest. Its release under an open license is likely to catalyze further progress in tool-augmented mathematical reasoning and in models capable of robust, multi-step problem solving in symbolic domains (Moshkov et al., 23 Apr 2025).