MetaMathQA: Benchmark for Math Reasoning

Updated 26 March 2026

MetaMathQA is a comprehensive dataset designed to enhance mathematical reasoning by combining diverse multi-stream data augmentations from GSM8K and MATH.
It integrates forward and backward reasoning techniques, generating multiple solution paths and question rephrasings to validate correct answers.
Empirical studies show that models fine-tuned on MetaMathQA achieve significantly higher accuracy, establishing it as a key training and evaluation resource.

MetaMathQA is a large-scale, systematically augmented mathematical question-answering dataset explicitly constructed to advance the mathematical reasoning capabilities of LLMs. Combining breadth, diversity, and structural sophistication, MetaMathQA has emerged as a critical benchmark and training resource for evaluating and improving LLMs' step-by-step problem-solving, particularly in elementary and high school mathematics.

1. Construction Methodology and Data Augmentation

MetaMathQA originates from two core math reasoning benchmarks: GSM8K (grade school word problems) and MATH (competition-level mathematics problems covering seven subjects). Its design is distinguished by a multi-stream augmentation pipeline intended to maximize question and reasoning diversity, counteract model saturation, and “activate” latent mathematical structure in pretrained LMs (Yu et al., 2023).

MetaMathQA is built via four complementary data transformation streams applied to each $(q_i, r_i, a^*_i)$ (question, chain-of-thought, answer) tuple from the original benchmarks:

Answer Augmentation (AnsAug): For each question, GPT-3.5-Turbo is prompted to generate $K_1$ distinct solution paths and answers using few-shot CoT and sampling; only those matching the ground-truth answer $a^*_i$ are retained.
Question Rephrasing: Multiple semantic rephrasings of $q_i$ are sampled ( $K_2$ variants) and validated by running generated solutions; only correct answer-yielding variants are added.
Backward Reasoning (Self-Verification, SV): The question is rewritten as a declarative statement with a masked variable, paired with the answer and “What is the value of $x$ ?”; only correct variants are included.
Backward Reasoning (FOBAR): The original question with variable $x$ is appended with “If the answer is $a^*_i$ , what is $x$ ?”; correct instances are kept.

The union of all validated variants forms MetaMathQA. The dataset respects the original GSM8K and MATH train/test splits, so test questions do not leak into the training data either directly or through paraphrase.
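The backward templates and the answer-matching filter described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names (`normalize`, `keep_correct`, `fobar_question`, `sv_question`) are hypothetical, the LLM sampling step is replaced by a hand-written list of candidate (reasoning, answer) pairs, the declarative rewrite needed for SV is assumed to be supplied already, and answer matching is simplified to a normalized string comparison.

```python
from typing import List, Tuple


def normalize(answer: str) -> str:
    """Simplified normalization (assumption): trim, drop commas and a trailing period."""
    return answer.strip().replace(",", "").rstrip(".")


def keep_correct(candidates: List[Tuple[str, str]], gold: str) -> List[Tuple[str, str]]:
    """Keep only (reasoning, answer) pairs whose answer matches the ground truth."""
    target = normalize(gold)
    return [(r, a) for r, a in candidates if normalize(a) == target]


def fobar_question(masked_question: str, gold: str) -> str:
    """FOBAR: append the known answer and ask for the masked variable x."""
    return f"{masked_question} If the answer is {gold}, what is x?"


def sv_question(declarative_statement: str) -> str:
    """SV: a declarative statement (with the answer and a masked x) plus the query."""
    return f"{declarative_statement} What is the value of x?"


# Toy inputs, made up for illustration (not taken from the dataset).
candidates = [
    ("3 boxes * 4 pens = 12 pens.", "12"),
    ("3 + 4 = 7 pens.", "7"),       # wrong path, filtered out
    ("4 + 4 + 4 = 12.", "12."),     # correct after normalization
]
kept = keep_correct(candidates, gold="12")
backward = fobar_question(
    "Tom buys x boxes with 4 pens each. How many pens does he have?", "12"
)
```

Here `kept` retains the two paths that reach 12, and `backward` is the original question followed by the FOBAR suffix. A real pipeline would need a more careful answer comparison, particularly for symbolic MATH answers.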

2. Dataset Scale, Structure, and Content

MetaMathQA comprises approximately 395,000 unique instruction–response QA pairs:

Dataset	AnsAug	Rephrase	SV	FOBAR	Total
MetaMathQA-GSM8K	80 K	80 K	40 K	40 K	240 K
MetaMathQA-MATH	75 K	50 K	15 K	15 K	155 K
MetaMathQA (merged)	155 K	130 K	55 K	55 K	395 K
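As a quick consistency check, the row and column totals in the table can be recomputed from the per-stream counts. The snippet below only re-adds the figures shown above (in thousands); the variable names are illustrative.

```python
# Per-stream sample counts in thousands, copied from the table above.
counts_k = {
    "MetaMathQA-GSM8K": {"AnsAug": 80, "Rephrase": 80, "SV": 40, "FOBAR": 40},
    "MetaMathQA-MATH": {"AnsAug": 75, "Rephrase": 50, "SV": 15, "FOBAR": 15},
}
streams = ["AnsAug", "Rephrase", "SV", "FOBAR"]

# Row totals: one per source benchmark.
row_totals = {name: sum(per_stream.values()) for name, per_stream in counts_k.items()}

# Column totals: the merged dataset, per augmentation stream.
merged = {s: sum(counts_k[name][s] for name in counts_k) for s in streams}

grand_total = sum(merged.values())

# Fraction of the merged data that comes from the two backward-reasoning streams.
backward_share = (merged["SV"] + merged["FOBAR"]) / grand_total
```

The totals reproduce the 240 K, 155 K, and 395 K entries, and the backward streams (SV and FOBAR) make up 110 K of 395 K, a little over a quarter of the merged data.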

Each sample provides:

An instruction (word problem or its rewritten/deconstructed variant)
A chain of thought (fully worked solution)
A single, numerically quantified final answer

All data is formatted for instruction–response training and evaluation, with explicit CoT to enforce stepwise reasoning.
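One way to picture a record and its use in instruction–response training is sketched below. The field names, the `stream` tag, and the prompt template are illustrative assumptions and may not match the released schema or the prompt used by the authors; the example record is made up.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Sample:
    instruction: str  # the question, possibly rephrased or turned into a backward variant
    response: str     # chain of thought ending with the final answer
    stream: str       # hypothetical tag: "AnsAug", "Rephrase", "SV", or "FOBAR"


# Hypothetical prompt template for supervised fine-tuning.
PROMPT_TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Response: Let's think step by step."
)


def to_training_pair(sample: Sample) -> Tuple[str, str]:
    """Return (prompt, target) strings for one supervised fine-tuning example."""
    prompt = PROMPT_TEMPLATE.format(instruction=sample.instruction)
    return prompt, sample.response


# Toy record, made up for illustration.
example = Sample(
    instruction="What is 2 + 3?",
    response="2 + 3 = 5. The answer is: 5",
    stream="AnsAug",
)
prompt, target = to_training_pair(example)
```

During fine-tuning, the loss would normally be computed on `target` only, with `prompt` serving as context.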

MetaMathQA spans a diversity of question phrasings, solution chains, and variable-masked, backward questions. For instance, a simple arithmetic problem may appear as a direct computation, a rephrased version, a backward instance (solving for a missing variable given the answer), or as a self-verified declarative statement.

3. Empirical Effects and Ablation Studies

The impact of MetaMathQA on model capabilities has been substantiated by extensive quantitative studies (Yu et al., 2023). Open-source LLaMA-2 models fine-tuned on MetaMathQA, across sizes from 7B to 70B parameters, show strong improvements over both base and state-of-the-art specialized models such as WizardMath.

Model	#Params	GSM8K (%)	MATH (%)
LLaMA-2-7B (base)	7B	14.6	2.5
WizardMath-7B	7B	54.9	10.7
MetaMath-7B	7B	66.5	19.8
LLaMA-2-70B (base)	70B	56.8	13.5
WizardMath-70B	70B	81.6	22.7
MetaMath-70B	70B	82.3	26.6
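The absolute gains implied by the table can be computed directly. The sketch below uses only the accuracies listed above; the dictionary layout and the `gain` helper are illustrative.

```python
# (GSM8K, MATH) accuracies in percent, copied from the table above.
results = {
    "7B": {
        "LLaMA-2 (base)": (14.6, 2.5),
        "WizardMath": (54.9, 10.7),
        "MetaMath": (66.5, 19.8),
    },
    "70B": {
        "LLaMA-2 (base)": (56.8, 13.5),
        "WizardMath": (81.6, 22.7),
        "MetaMath": (82.3, 26.6),
    },
}


def gain(size: str, baseline: str) -> tuple:
    """Absolute gain (percentage points) of MetaMath over a baseline, as (GSM8K, MATH)."""
    ours = results[size]["MetaMath"]
    ref = results[size][baseline]
    return tuple(round(a - b, 1) for a, b in zip(ours, ref))


gain_7b_vs_wizard = gain("7B", "WizardMath")     # (11.6, 9.1)
gain_70b_vs_wizard = gain("70B", "WizardMath")   # (0.7, 3.9)
gain_7b_vs_base = gain("7B", "LLaMA-2 (base)")   # (51.9, 17.3)
```

Read this way, the margin over WizardMath is largest at 7B and narrows at 70B, most visibly on GSM8K.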

The augmentation streams are complementary. Ablation studies on GSM8K report the following accuracies:

SFT only: 41.6%
With AnsAug alone or Rephrase alone: $\sim$ 59.6%–59.7%
Combining all four streams: 64.4%
Within that combined result, adding SV and FOBAR to the forward streams contributes roughly 4 percentage points of absolute improvement.

There is a strong positive correlation ( $\rho = 0.97$ ) between dataset diversity (distinct solution/reasoning types) and model accuracy gains.

4. Role in Training and Evaluation of Advanced Architectures

MetaMathQA’s structured diversity and coverage are leveraged as a primary fine-tuning and evaluation resource for multiple state-of-the-art reasoning architectures:

Blockwise SFT for Diffusion LMs: When combined with the GSAI-ML/LLaDA-8B-Instruct model, MetaMathQA allows direct assessment of blockwise training-inference alignment (Sun et al., 27 Aug 2025). Under both equal-compute and equal-token budgets, models fine-tuned on MetaMathQA with Blockwise SFT demonstrate superior Pass@1 (exact match) on GSM8K and MATH compared to classical SFT. For example, Blockwise SFT achieves 76.0% on GSM8K and 34.2% on MATH versus 67.7% and 29.6% for classical SFT, respectively.
Verifier-Guided DPO Post-Training: In post-SFT preference learning pipelines, MetaMathQA serves as the chain-of-thought SFT base for small LMs. The structured error landscape of MetaMathQA samples enables targeted hard-negative mining: a compact MathVerifier for decomposed error scoring is combined with importance-weighted Direct Preference Optimization (DPO). For 1.5B-parameter models, this yields gains of 2–4 points on GSM8K and MATH over both vanilla SFT and unweighted DPO (Lu et al., 17 Dec 2025).
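To make the idea of importance-weighted DPO concrete, the sketch below applies a per-pair weight to the standard DPO objective. The exact weighting scheme of the cited work is not specified here, so the way weights enter the loss, their normalization by the weight sum, and the assumption that they come from verifier scores are all illustrative choices. Inputs are sequence-level log-probabilities of the chosen and rejected solutions under the policy and a frozen reference model.

```python
import math
from typing import Sequence


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def weighted_dpo_loss(
    policy_chosen: Sequence[float],
    policy_rejected: Sequence[float],
    ref_chosen: Sequence[float],
    ref_rejected: Sequence[float],
    weights: Sequence[float],  # assumed to come from a verifier; higher = more informative pair
    beta: float = 0.1,
) -> float:
    """Weighted average of per-pair DPO losses (weighting scheme is an assumption)."""
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    total = 0.0
    for pc, pr, rc, rr, w in zip(
        policy_chosen, policy_rejected, ref_chosen, ref_rejected, weights
    ):
        # Standard DPO margin: difference of policy/reference log-ratios.
        margin = beta * ((pc - rc) - (pr - rr))
        total += -w * log_sigmoid(margin)
    return total / total_weight
```

With all weights equal, this reduces to the unweighted DPO loss; when the policy and the reference agree on both solutions, every pair contributes $\log 2$.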

5. Evaluation Protocols and Metrics

Models trained on MetaMathQA are typically evaluated with instruction–response or zero-shot "Let's think step by step" prompting, and results are predominantly reported as Pass@1:

GSM8K: Pass@1 is determined by exact string match of the model's final numeric answer.
MATH: String match is required over the complete generated solution output.

Evaluation is typically performed on the held-out standard test splits of GSM8K and MATH, free of overlapping or paraphrased train data.
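For GSM8K-style answers, Pass@1 with a single generation per problem reduces to plain accuracy over extracted final answers. The sketch below is one simple way to compute it; the extraction rule (take the last number in the output) and the helper names are assumptions, not the official evaluation script, and the toy outputs are made up.

```python
import re
from typing import List, Optional

# Integer or decimal, optionally negative, optionally with thousands separators.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def final_number(text: str) -> Optional[str]:
    """Return the last number in the text with commas removed, or None if there is none."""
    matches = NUMBER.findall(text)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def pass_at_1(outputs: List[str], golds: List[str]) -> float:
    """Fraction of problems whose extracted final answer exactly matches the gold string."""
    if len(outputs) != len(golds):
        raise ValueError("outputs and golds must have the same length")
    correct = sum(
        final_number(out) == gold.replace(",", "").strip()
        for out, gold in zip(outputs, golds)
    )
    return correct / len(golds)


# Toy model outputs, made up for illustration.
outputs = [
    "She sells 9 eggs at 2 dollars each, so 9 * 2 = 18. The answer is 18",
    "So she has 1,200 dollars.",
    "I am not sure.",
]
golds = ["18", "1200", "7"]
score = pass_at_1(outputs, golds)
```

Because the comparison is an exact string match, an output of `18.0` would not match a gold answer of `18`; evaluation scripts often add numeric normalization for that reason.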

6. Known Limitations and Directions for Enhancement

Despite demonstrated gains, certain limitations persist:

Diminished model performance on very long or highly compositional questions.
Diminishing returns from adding further external public math datasets; for example, mixing RFT data into MetaMathQA has been shown to decrease performance.
Remaining accuracy gap on advanced math topics and extended backward-reasoning benchmarks.

Prospective enhancements include scaling to more advanced domains, automatic filtering of low-quality rephrasings, and adaptation to other structured multi-step reasoning tasks (e.g., combinatorial games).

7. Significance for Mathematical Reasoning and LLM Development

MetaMathQA establishes several methodological precedents:

It demonstrates that targeted augmentation, specifically mixing forward (AnsAug, Rephrase) and backward (SV, FOBAR) transformations, expands both reasoning diversity and generalization, leading to marked accuracy improvements (Yu et al., 2023).
The dataset’s design enables explicit control over syntactic and structural variety for rigorous ablation and error analysis, which helps distinguish genuine mathematical reasoning in LLMs from superficial pattern matching.
MetaMathQA has proven instrumental in diagnosing and correcting training–inference mismatches in diffusion-based LLMs by exposing the significance of aligning fine-tuning data structures to inference modalities (Sun et al., 27 Aug 2025).
By supporting the mining of “hard negatives” in post-SFT preference learning, MetaMathQA provides a foundation for structured error discovery and targeted correction, as required by verifier-guided methods (Lu et al., 17 Dec 2025).

In sum, MetaMathQA is a cornerstone resource in the open-source mathematical reasoning ecosystem, encompassing diverse, high-precision, and structurally varied QA data. Its construction, scale, and empirical results continue to shape best practices in mathematical LLM fine-tuning, evaluation, and post-training.