Math-Reasoning Data Synthesis

Updated 13 June 2026

Math-Reasoning Data Synthesis is the process of generating diverse, high-quality datasets tailored for training LLMs on complex mathematical problem solving and symbolic reasoning tasks.
Methodologies employ solution-guided augmentation, graph and key-point expansion, and programmatic verification to ensure data correctness and curriculum coverage.
Advanced techniques such as RL-based optimization and code-integrated pipelines boost model performance, achieving precision rates above 90% while significantly reducing synthesis costs.

Mathematical reasoning data synthesis is the construction of large, diverse, and high-quality datasets tailored for training and evaluating LLMs on automated mathematical problem solving and symbolic reasoning tasks. This research area is motivated by the observation that LLMs exhibit significant performance increases when exposed to synthetic data engineered to provide curriculum coverage, reasoning chain supervision, verifiability, and complexity scaling. Multiple synthesis methodologies have been developed in recent literature, spanning solution-guided augmentation, graph and knowledge-point expansion, programmatic reasoning pipeline generation, code integration, evolution-driven trajectory sampling, and preference-optimized question generation. These pipelines are designed for efficiency, correctness, and maximal diversity in mathematical skills, supporting the rapid advancement of open-source and proprietary math-focused models.

1. General Principles and Motivations

Modern math-reasoning data synthesis is driven by the need for:

Diverse and scalable corpora far surpassing the size and variety of human-annotated benchmarks (e.g., GSM8K, MATH) (Lu et al., 2024, Tang et al., 2024, Wang et al., 2024, Wang et al., 29 Apr 2025).
Precise control over topic coverage, knowledge depth, and difficulty (Wang et al., 2024, Huang et al., 2024, Zhan et al., 7 Aug 2025).
Verifiability and noise reduction: filtering out logically inconsistent or unsolvable problems, and employing code-based or reward-model-based correctness checks (Lu et al., 2024, Wang et al., 29 Apr 2025, Chen et al., 26 Aug 2025, Zhao et al., 8 Oct 2025).
Direct modeling of reasoning structure, including intermediate steps and chain-of-thought (CoT) supervision (Ying et al., 2024, Xu et al., 9 Jun 2025, Wang et al., 16 Apr 2026).

The primary insight is that well-designed synthetic data can, when generated and curated with domain knowledge and automatic validation, close the performance gap between open models and closed-source systems such as GPT-4, while also enabling generalization, robustness, and deeper symbolic reasoning capabilities (Lu et al., 2024, Zhao et al., 8 Oct 2025).

2. Major Synthesis Methodologies

A diverse set of pipelines has been proposed and empirically validated in recent work:

Solution-Guided and Back-Translation Pipelines

MathGenie (Lu et al., 2024) demonstrates a staged approach:

Solution Augmentation: Iterative model-driven paraphrasing and numerical variation of seed solutions.
Question Back-Translation: Mapping augmented solutions back to valid questions using a model, preserving arithmetic/legal constraints.
Verification: Code-integrated solution generation and rationale-based automated filtering, ensuring correctness (empirical precision >90%).

This paradigm improves problem coverage and label reliability, boosting LLM accuracy on benchmarks over open-source baselines.

Knowledge-Graph and Key-Point Expansion

Pipelines like MathScale (Tang et al., 2024), GSDP (Wang et al., 2024), and KPDDS (Huang et al., 2024) proceed as follows:

Topic/Key-Point Extraction: Seed questions are annotated for fine-grained concepts and solution steps.
Graph Construction: Nodes are mathematical topics or key points; edges reflect co-occurrence or inferred relationships.
Expansion: By systematically sampling paths, walks, or cliques in the graph, new compositions of concepts serve as prompts for LLM-driven question generation.

Both explicit (co-occurrence) and implicit (multi-hop, community) graph relationships are exploited, supporting combinatorial dataset expansion (255× in GSDP), significant diversity, and low seed overlap. Deduplication and consensus-based answer filtering ensure only high-quality pairs are retained.
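The extraction, graph construction, and expansion steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used by MathScale, GSDP, or KPDDS: the seed tags are invented, and the helper names `build_cooccurrence_graph`, `sample_walk`, and `make_prompt` are hypothetical. In a real pipeline the key points would come from an LLM annotator and the prompt would be sent to a generator model.

```python
import random
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(tagged_seeds):
    """Build {key_point: {neighbor: weight}} from per-question key-point lists."""
    graph = defaultdict(lambda: defaultdict(int))
    for key_points in tagged_seeds:
        # Every pair of key points tagged on the same seed question co-occurs once.
        for a, b in combinations(sorted(set(key_points)), 2):
            graph[a][b] += 1
            graph[b][a] += 1
    return graph


def sample_walk(graph, start, length, rng):
    """Weighted random walk that never revisits a node; may stop early."""
    walk = [start]
    while len(walk) < length:
        candidates = {n: w for n, w in graph[walk[-1]].items() if n not in walk}
        if not candidates:
            break
        nodes, weights = zip(*candidates.items())
        walk.append(rng.choices(nodes, weights=weights, k=1)[0])
    return walk


def make_prompt(walk):
    """Turn a sampled concept combination into a question-generation prompt."""
    return "Write one math problem that combines: " + ", ".join(walk) + "."


# Toy seed annotations (assumed for illustration only).
seeds = [
    ["linear equations", "ratios"],
    ["ratios", "percentages"],
    ["percentages", "compound interest"],
    ["linear equations", "inequalities"],
]
graph = build_cooccurrence_graph(seeds)
rng = random.Random(0)
walk = sample_walk(graph, "linear equations", 3, rng)
prompt = make_prompt(walk)
print(prompt)
```

Walks that reach concepts never tagged together on a single seed (for example, "linear equations" and "percentages") correspond to the implicit, multi-hop relationships mentioned above.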

Programmatic and Code-Guided Synthesis

Program-assisted and rationally verifiable frameworks such as RV-Syn (Wang et al., 29 Apr 2025) and AMD (Chen et al., 26 Aug 2025) blend code execution with LLM generation:

Structured Function Libraries: Problem solutions are decomposed into executable Python functions.
Function-Graph Sampling: New computational graphs are generated by random or topic-guided subgraph assembly.
Back-Translation: Executed code’s output informs natural-language problem statements, maintaining one-to-one mapping between graph logic and task.
Verification: Bilateral checking—comparing code execution output and LLM-inferred solution—yields correctness rates up to 99%.

This ensures problems are logic-aligned, repeatable, and stable against small input variations.
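A toy version of the four steps above may make the flow concrete. It is a sketch under strong simplifying assumptions and does not reproduce RV-Syn or AMD: the three-entry `LIBRARY`, the linear chain standing in for a computational graph, and the template-based `back_translate` (which replaces the LLM that would normally write the problem statement) are all hypothetical.

```python
import random

# Hypothetical function library: name -> (executable step, phrase template).
LIBRARY = {
    "add": (lambda x, k: x + k, "increase it by {k}"),
    "mul": (lambda x, k: x * k, "multiply it by {k}"),
    "sub": (lambda x, k: x - k, "decrease it by {k}"),
}


def sample_chain(rng, steps):
    """Sample a linear computational graph as a list of (op_name, constant)."""
    return [(rng.choice(sorted(LIBRARY)), rng.randint(2, 9)) for _ in range(steps)]


def execute(chain, start):
    """Run the chain step by step; the result is the reference answer."""
    value = start
    for name, k in chain:
        value = LIBRARY[name][0](value, k)
    return value


def back_translate(chain, start):
    """Render the chain as a word problem (template stand-in for an LLM)."""
    parts = [LIBRARY[name][1].format(k=k) for name, k in chain]
    return f"Start with {start}, then " + ", then ".join(parts) + ". What is the result?"


def bilateral_check(chain, start, model_answer):
    """Keep an item only if an independently produced answer matches execution."""
    return execute(chain, start) == model_answer


chain = [("add", 3), ("mul", 4)]
question = back_translate(chain, 5)
reference = execute(chain, 5)
print(question, "->", reference)
```

Because the problem text is derived from the executed graph, each question maps to exactly one chain, and `bilateral_check` rejects any item where a solver's answer to the text disagrees with the code.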

Evolution, RL, and Preference-Based Construction

Recent works exploit optimization principles for hard or diverse problem generation:

ScaleQuest (Ding et al., 2024): Small math-capable models are fine-tuned for question generation, with preference optimization (Direct Preference Optimization) and reward-based filtering, and without reliance on high-powered proprietary LLMs.
MathSmith (Zhan et al., 7 Aug 2025): Uses nine soft-constraint strategies (multi-step, cross-topic, distractors, extreme conditions, etc.), with reinforcement learning maximizing complexity, structural validity, and consistency. Chain-of-thought token length is used as a proxy for reasoning complexity.
CoTEvol (Wang et al., 16 Apr 2026): Casts CoT synthesis as a genetic algorithm, using global crossover operators and uncertainty-driven step mutation, with step-level entropy and correctness-based reward.
rStar-Math (Guan et al., 8 Jan 2025): Employs Monte Carlo Tree Search (MCTS) over code-augmented CoT trajectories, with step-wise Q-value self-supervision and a process-preference model for reward feedback.

These frameworks not only generate harder and more diverse problems, but also directly optimize for multi-step logical correctness and robustness.
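Preference-based methods of this kind need (chosen, rejected) pairs as training signal. The sketch below shows one generic way to build such pairs from scored candidate questions. It is an assumption-laden illustration, not the procedure of ScaleQuest or MathSmith: `build_preference_pairs` is a hypothetical helper, and the word-count scorer is only a crude stand-in for a reward model or a complexity proxy such as chain-of-thought token length.

```python
def build_preference_pairs(candidates, score_fn, margin=0.0):
    """Build DPO-style records from candidate questions grouped by prompt.

    candidates: {prompt: [question, ...]}
    score_fn:   callable mapping a question to a numeric score
    margin:     minimum score gap required to emit a pair
    """
    pairs = []
    for prompt, questions in candidates.items():
        if len(questions) < 2:
            continue  # a preference needs at least two candidates
        ranked = sorted(questions, key=score_fn, reverse=True)
        best, worst = ranked[0], ranked[-1]
        if score_fn(best) - score_fn(worst) > margin:
            pairs.append({"prompt": prompt, "chosen": best, "rejected": worst})
    return pairs


def toy_score(question):
    """Stand-in scorer (assumption): whitespace token count as complexity."""
    return len(question.split())


candidates = {
    "algebra": [
        "Solve x + 1 = 2.",
        "Solve the system x + y = 5 and x - y = 1 for x and y.",
    ],
    "geometry": ["Find the area."],
}
pairs = build_preference_pairs(candidates, toy_score)
print(pairs)
```

The `margin` parameter drops pairs whose scores are nearly tied, which is one simple way to reduce noisy preferences.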

Cross-Problem and Multi-Task Fusion

MathFusion (Pei et al., 20 Mar 2025) and similar frameworks extend synthesis to the composition of instructions:

Sequential Fusion: The solution to one problem serves as a subtask in another.
Parallel Fusion: Two related problems are synthesized as a composite.
Conditional Fusion: Logical comparison and selection are required across subproblems.

Through embedding-based similarity search and prompt engineering, these strategies systematically expand the reasoning space and enhance the depth of instruction-following training.
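As a rough sketch of how similarity search and prompting might fit together, the example below pairs each problem with its nearest neighbour and wraps the pair in one of three fusion instructions. Everything here is illustrative: bag-of-words cosine similarity stands in for a learned embedding model, and the wording in `FUSION_TEMPLATES` is hypothetical rather than taken from MathFusion.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words vector; a stand-in (assumption) for a learned embedding."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def nearest_partner(index, problems):
    """Index of the problem most similar to problems[index]."""
    query = bow(problems[index])
    scored = [(cosine(query, bow(p)), j) for j, p in enumerate(problems) if j != index]
    return max(scored)[1]


# Hypothetical instruction wording for the three fusion strategies.
FUSION_TEMPLATES = {
    "sequential": "Rewrite so that the answer to Problem A is an input to Problem B.",
    "parallel": "Merge Problem A and Problem B into one problem covering both.",
    "conditional": "Pose one problem that compares the results of A and B and selects one.",
}


def fusion_prompt(problem_a, problem_b, strategy):
    """Assemble the prompt that would be sent to a generator model."""
    return f"{FUSION_TEMPLATES[strategy]}\nProblem A: {problem_a}\nProblem B: {problem_b}"


problems = [
    "A train travels 60 km in 2 hours. What is its speed?",
    "A car travels 150 km in 3 hours. What is its speed?",
    "Find the area of a circle with radius 3.",
]
partner = nearest_partner(0, problems)
prompt = fusion_prompt(problems[0], problems[partner], "conditional")
print(prompt)
```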

3. Quality Control, Verification, and Filtering

Quality assurance is a cornerstone in modern synthesis pipelines. Methods include:

Answer Consistency: Requiring agreement across multiple sampled solutions (e.g., majority voting, answer consensus) (Lu et al., 2024, Huang et al., 2024, Wang et al., 2024, Wang et al., 16 Apr 2026).
Code-Based Verification: Executing code (Python, SymPy, Lean) to validate each step and final answer (Wang et al., 29 Apr 2025, Ying et al., 2024, Chen et al., 26 Aug 2025).
Joint LLM Scoring: Ensemble voting among multiple LLMs or joint scoring with weights adjusted for maximum agreement with a trusted reference (e.g., GPT-4) (Wang et al., 2024).
Difficulty and Diversity Metrics: Explicit tracking of topic coverage, reasoning step counts, and CoT token-length distributions (Tang et al., 2024, Zhan et al., 7 Aug 2025, Wang et al., 29 Apr 2025).
Sanity and Domain Constraints: Filtering out non-solvable, logically invalid, or statistically anomalous examples (e.g., operations outside domain, division by zero) (Lai et al., 6 Oct 2025, Xu et al., 9 Jun 2025).

Many pipelines report a post-verification retention rate of 35–55%, ensuring only high-confidence data enter the final corpus.
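The answer-consistency check and the resulting retention rate can be illustrated with a short majority-vote filter. This is a minimal sketch: the function names and the `min_agreement` threshold of 0.6 are assumptions chosen for the example, and the sampled answers are invented.

```python
from collections import Counter


def consensus_answer(sampled_answers, min_agreement=0.6):
    """Return the majority answer if its vote share meets the threshold, else None."""
    if not sampled_answers:
        return None
    answer, votes = Counter(sampled_answers).most_common(1)[0]
    return answer if votes / len(sampled_answers) >= min_agreement else None


def filter_corpus(items, min_agreement=0.6):
    """Keep (question, label) pairs whose sampled solutions reach consensus."""
    kept = []
    for question, answers in items:
        label = consensus_answer(answers, min_agreement)
        if label is not None:
            kept.append((question, label))
    return kept


# Each synthetic question comes with final answers from several sampled solutions.
items = [
    ("What is 3 * 4?", ["12", "12", "12", "13", "12"]),
    ("Ambiguous question", ["7", "8", "9", "7", "10"]),
]
kept = filter_corpus(items)
retention = len(kept) / len(items)
print(kept, retention)
```

Raising `min_agreement` trades retention for label confidence, which is the tension behind the retention figures quoted above.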

4. Dataset Scaling, Economics, and Empirical Impact

Synthesized math-reasoning datasets now reach into the multi-million scale (GSDP-MATH: 1.91 M (Wang et al., 2024), AMD: 12.3 M (Chen et al., 26 Aug 2025), MathScaleQA: 2 M (Tang et al., 2024)), with efficient corpora constructed at costs orders of magnitude below those using proprietary models. For example, GSDP is roughly 180× cheaper than GPT-4-based synthesis (unit cost ≈$0.00012), while matching GPT-4-evaluated quality (94% precision vs. GPT-4 on a 5000-sample audit) (Wang et al., 2024).
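The quoted figures can be combined into a back-of-the-envelope cost comparison. The calculation below is only indicative: it assumes the unit cost is per synthesized sample and that it applies uniformly to the full 1.91 M GSDP-MATH corpus, neither of which is stated explicitly above.

```python
# Figures quoted in the text.
unit_cost_gsdp = 0.00012   # USD; assumed here to be per synthesized sample
cost_ratio = 180           # GSDP is roughly 180x cheaper than GPT-4-based synthesis
corpus_size = 1_910_000    # GSDP-MATH examples

# Derived quantities (valid only under the assumptions in the lead-in).
implied_gpt4_unit_cost = unit_cost_gsdp * cost_ratio
gsdp_total = corpus_size * unit_cost_gsdp
gpt4_total = corpus_size * implied_gpt4_unit_cost

print(f"implied GPT-4 unit cost: ${implied_gpt4_unit_cost:.4f}")
print(f"GSDP corpus cost:        ${gsdp_total:,.0f}")
print(f"GPT-4-based equivalent:  ${gpt4_total:,.0f}")
```

Under these assumptions the full corpus costs on the order of a few hundred dollars, against tens of thousands for the GPT-4-based equivalent.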

Empirical fine-tuning and pre-training experiments consistently show that:

Synthetic data matching or surpassing the diversity and difficulty of the human-annotated corpora leads to 2–4× raw accuracy improvements on MATH, GSM8K, CollegeMath, OlympiadBench, etc. (Wang et al., 2024, Tang et al., 2024, Chen et al., 26 Aug 2025).
Scaling synthetic data volume yields logarithmic but persistent gains, with diminishing returns beyond the 1 M scale but heavier-tailed improvement for difficult (Olympiad, AIME, TheoremQA) tasks (Ding et al., 2024, Zhan et al., 7 Aug 2025).
Current top-performing open-source math LLMs were fine-tuned or pre-trained on one or more of these synthetic data sources, with state-of-the-art results now reaching or exceeding GPT-4 on specific math reasoning leaderboards (Lu et al., 2024, Huang et al., 2024, Wang et al., 2024, Wang et al., 29 Apr 2025).

5. Comparative Analysis and Method Selection

The choice of data synthesis methodology affects learning dynamics, transfer, and data efficiency:

Methodology	Strengths	Limitations
Key-Point/Graph	Systematic topic coverage, scalability	Dependence on KP extraction
Code/Programmatic	Verifiability, direct logic mapping	Potential coverage gaps; cost
RL/Preference	Hard instance discovery, reward control	High resource, instability
Cross-problem Fusion	Instruction-following, composite logic	Requires existing problem base
Knowledge-system	Semantic curriculum alignment	Toolkit engineering

A plausible implication is that hybrid pipelines—combining graph expansion, code-level synthesis, RL-hardness control, and cross-problem fusion—will dominate future high-performance math-reasoning model training.

In industrial and research practice, structured, interpretable data with deep post-generation filtering and model-based scoring "almost always" outperforms naive web-scraping or rule-based template expansion (Zhao et al., 8 Oct 2025).

6. Emerging Trends and Extensions

Recent and anticipated directions in math-reasoning data synthesis include:

Multimodal and Vision-Language Pipelines: Large-scale multimodal datasets (e.g., MathV360K (Shi et al., 2024)) exploit image mining, complexity-stratified sampling, and question augmentation to train visual math reasoning models, approaching GPT-4V accuracy.
Formal Proof and Lean Integration: Pipelines that employ Lean or other theorem-prover languages for step-wise verification and generation (e.g., InternLM-Math (Ying et al., 2024)) enable progress in automated theorem proving and formal mathematics.
Adaptive and RL-Based Logic Supervision: Techniques such as AdaR (Lai et al., 6 Oct 2025) or RLVR enforce sound adaptation to variable perturbations, directly penalizing spurious or superficial reasoning.
Self-Evolving and Evolutionary Pipelines: Genetic and tree-search frameworks (CoTEvol (Wang et al., 16 Apr 2026), rStar-Math (Guan et al., 8 Jan 2025)) iterate population-based or tree-based trajectory refinement to robustly synthesize accurate multi-step CoTs.
Practical Cost and Data Value Optimization: Open-source efforts increasingly report fine-grained budget, compute, and "information value" methods for selection and filtering (gradient-based influence estimation (Zhou et al., 2024), joint LLM scoring (Wang et al., 2024)).
Instruction and Error-Aware Data Fusion: Approaches such as tutorship amplification or error-correction amplification have been found empirically superior to simple diversification or query expansion, especially in pre-training stages (Chen et al., 23 Jan 2025).

The field shows sustained movement toward scalable, composite, and rigorously validated synthesis, not only for mathematical reasoning but as a template for other symbolic and logic-driven domains.

7. Data Synthesis Best Practices and Limitations

Best practices for practitioners include:

Explicit topic/knowledge-point tagging and graph construction to guide generation ("graph-aware routines").
Programmatic or code-integrated representations, enabling verification via execution and symbolic toolkits.
Multi-stage validation with both LLM and programmatic checks, and retention of only high-confidence or consensus entries.
Balanced difficulty and diversity sampling—do not over-prune for answer correctness if it reduces curriculum coverage (Lu et al., 2024).
Gradual scaling and mixture-integration with validated human-crafted or benchmark datasets (e.g., mixture sampling at α=0.2–0.3) (Zhao et al., 8 Oct 2025).
Continual data value scoring, redundancy filtering, and curriculum-aligned expansion.
Where feasible, fusion and composition of existing problem structures for instruction-following and task transfer (Pei et al., 20 Mar 2025).
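The mixture-sampling practice listed above can be sketched as follows. The text does not define α precisely; this example assumes α is the probability that a training example is drawn from the human-crafted pool, with the remainder drawn from synthetic data. The function name `mix_datasets` and the toy pools are hypothetical.

```python
import random


def mix_datasets(human, synthetic, alpha, size, rng):
    """Draw `size` examples; each comes from `human` with probability alpha.

    Assumption: alpha is the human-data share; sampling is with replacement.
    """
    mixed = []
    for _ in range(size):
        pool = human if rng.random() < alpha else synthetic
        mixed.append(rng.choice(pool))
    return mixed


human = [("human", i) for i in range(100)]
synthetic = [("synthetic", i) for i in range(10_000)]

rng = random.Random(0)
mixed = mix_datasets(human, synthetic, alpha=0.25, size=10_000, rng=rng)
human_share = sum(1 for source, _ in mixed if source == "human") / len(mixed)
print(f"human share in mixture: {human_share:.3f}")
```

Sampling with replacement means a small human-crafted pool is repeated many times at this scale, so in practice the mixture size and α would need to be chosen with the pool sizes in mind.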

Notable limitations and open problems:

Some pipelines can "overfit" to the logic of seed data or exhibit gaps for topics omitted from the initial function/toolkit set (Wang et al., 29 Apr 2025, Tang et al., 2024).
Program synthesis and code-based pipelines require substantial compute and have practical interface constraints.
Generalization to highly abstract, proof-heavy, or real-world quantitative tasks remains challenging despite breadth of topics and depth of correctness (Lai et al., 6 Oct 2025, Ying et al., 2024).

Continued research is required to optimize scaling, composition, abstraction, and true multi-domain transfer for next-generation math-reasoning LLMs.