Arithmetic Reasoning Datasets Overview

Updated 6 May 2026

Arithmetic Reasoning Datasets are structured collections designed to evaluate numerical manipulation, compositional reasoning, and multi-modal arithmetic tasks.
These datasets leverage varied modalities—text, visual, code, and symbolic—to assess performance on both elementary and multi-step computation problems.
They are constructed using diverse methods including manual curation, programmatic generation, and LLM-assisted verification, enhancing model evaluation and training.

Arithmetic reasoning datasets are a central class of benchmarks and training corpora in contemporary natural language processing and cognitive AI. They support the systematic evaluation, pre-training, and error analysis of models on tasks that require manipulating numerical quantities, applying basic operations, and composing multi-step computations. They span a wide range of modalities (textual, visual, code-driven, and symbolic) and cover elementary word problems, abstract expressions, visual scene-based arithmetic, multilingual and multi-format collections, and code- or tool-augmented settings. The following sections describe the construction, composition, methodological validation, and practical impact of recent large-scale and fine-grained arithmetic reasoning resources.

1. Dataset Taxonomy and Representative Corpora

Arithmetic reasoning datasets are organized along multiple axes: modality (text, visual, structured), complexity (single-operation vs. multi-step), domain (elementary, domain-knowledge, cross-lingual), and operational coverage (addition, subtraction, multiplication, division, fractions, percentages, and formulas).

Canonical Examples:

AM-DeepSeek-R1-Distilled-1.4M: The largest open-source reasoning dataset to date, with 1.4M prompt–response pairs, including ≈246k pure arithmetic tasks. Sub-categories: integer operations (≈145k), fraction/decimal (≈82k), percentage (≈41k), and word-based arithmetic (≈142k) (Zhao et al., 25 Mar 2025).
NumGLUE: Eight diverse tasks ranging from single-step commonsense reasoning, scientific QA, and reading-comprehension (RC) based inference to arithmetic word problems. Over 100k instances, unifying various reasoning settings (Mishra et al., 2022).
MATH 401: 401 expressions, methodically partitioned across operator types (addition, multiplication, large and small integer/float, irrational, exponentiation, trigonometric, logarithmic), with fine-grained error metrics (Yuan et al., 2023).
MsAT: Synthetic 85k problem set, systematically covering 1, 2, or 3 step operations using the four basic arithmetic operators in code-executable format (Wang et al., 2023).
GSM-Ranges: Systematically perturbs GSM8K word problems over six orders of magnitude, yielding 30k perturbed instances for stress-testing LLM performance as number scale increases (Shrestha et al., 12 Feb 2025).
CALC-X: 300k+ arithmetic problems with CoT, integrating GSM8K, AQuA-RAT, MAWPS, MathQA, SVAMP, Ape210K, ASDiv-A; all chains structured with tool-call markup for symbolic calculators (Kadlčík et al., 2023).
CMATH: 1.7k Chinese elementary word problems, graded (1–6), annotated for number of steps and maximum digit-length, serving as interpretable human-level benchmarks (Wei et al., 2023).
MNS: 70k visual arithmetic tasks (train/val/test), leveraging And-Or Graph templates for abstract relational reasoning—direct pixel-to-integer mapping (Zhang et al., 2020).
FormulaReasoning: 5.4k formula-based physics word problems, predominantly in Chinese, with canonical formula annotation, parameter tables, and a 271-formula knowledge base (Li et al., 2024).
MathMist: 21k multilingual QAs (seven languages), with 2,142 pure arithmetic problems per language, supporting LaTeX-based reasoning and error-perturbation analysis (Sobhani et al., 16 Oct 2025).
CLEVR-Math: 683k multi-modal image+text state-change arithmetic problems, testing compositional generalization (addition/subtraction), with ground-truth functional programs (Lindström et al., 2022).
Archer: 2,084 bilingual text-to-SQL instances, 100% arithmetic (addition, subtraction, multiplication, division) with deep nesting/subquery SQL behavior (Zheng et al., 2024).

2. Problem Structure, Complexity, and Format

Arithmetic reasoning datasets exhibit characteristic input/output structures, complexity gradations, and operator coverage.

Textual Word Problems: Prominent in GSM8K, MAWPS, CMATH, NumGLUE, SIMPL-ARITH, MsAT, with formats such as “X apples, Y given away, how many remain?” These problems require linguistic parsing and operation induction.
Expression-Only Tasks: MATH 401, MsAT, and the Calc-X components (Ape210K, MathQA, SVAMP, etc.) feature arithmetic expressions in raw or linearized form, free of distracting prose, supporting direct calculation.
Multi-Modal and Visual: CLEVR-Math and MNS pose image- or scene-based reasoning tasks that require visual grounding, object counting, and mapping of symbolic operators to pixel layouts.
Code or Tool-Augmented Chains: Calc-X formalizes all intermediate arithmetic steps via explicit <gadget> tags for calculator interaction, enabling offloading to symbolic engines.
Formula-Based Tasks: FormulaReasoning (FormulaQA) provides tasks explicitly annotated with normalized formulas, parameter assignments, and variable-level units/symbols, targeting physics and engineering word problems.
SQL/Database Arithmetic: Archer includes queries requiring arithmetic expressions (aggregate, HAVING, nested subqueries, GROUP BY).
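
To make the SQL setting concrete, the following minimal sketch runs a query that combines arithmetic inside an aggregate with GROUP BY, HAVING, and a subquery, then compares two queries by their execution results. The table, columns, rows, and the `execution_match` helper are all invented for illustration; none of them are taken from Archer.

```python
import sqlite3

# Hypothetical toy schema and rows; not drawn from Archer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, units INTEGER, unit_price REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("north", 10, 2.5), ("north", 4, 5.0), ("south", 3, 4.0), ("south", 1, 8.0)],
)

# Arithmetic inside an aggregate, filtered by HAVING against a subquery value.
gold_sql = """
    SELECT region, SUM(units * unit_price) AS revenue
    FROM sales
    GROUP BY region
    HAVING SUM(units * unit_price) > (SELECT AVG(units * unit_price) FROM sales) * 2
    ORDER BY region
"""

def execution_match(connection, predicted_sql, gold_query):
    """Return True when both queries produce the same rows, ignoring row order."""
    predicted = sorted(connection.execute(predicted_sql).fetchall(), key=repr)
    gold = sorted(connection.execute(gold_query).fetchall(), key=repr)
    return predicted == gold

rows = conn.execute(gold_sql).fetchall()
print(rows)  # [('north', 45.0)]
```

Comparing result sets rather than query strings means that two differently written queries count as equivalent whenever they return the same rows on the given database.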

Problem complexity is determined by the number of steps (single vs. multi-step), numerical range (small integers up to 10^7), syntactic form (free-form, code, LaTeX), and required mathematical knowledge (domain formulae, external commonsense).
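
Two of these complexity axes, step count and numerical range, can be estimated directly from an expression. The sketch below is one possible way to do so and assumes the problem is already available as a Python-parsable arithmetic expression; the function name `annotate_complexity` and the choice to count digits without the decimal point are assumptions, not the annotation scheme of any dataset named here.

```python
import ast
import re

def annotate_complexity(expression: str) -> dict:
    """Count binary operations and the longest numeric literal in an expression."""
    tree = ast.parse(expression, mode="eval")
    # Each binary operator node corresponds to one computation step.
    steps = sum(isinstance(node, ast.BinOp) for node in ast.walk(tree))
    # Digit length counts digits only, ignoring the decimal point.
    literals = re.findall(r"\d+(?:\.\d+)?", expression)
    max_digits = max((len(lit.replace(".", "")) for lit in literals), default=0)
    return {"steps": steps, "max_digits": max_digits}

print(annotate_complexity("(12 + 345) * 6 - 7"))  # {'steps': 3, 'max_digits': 3}
```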

3. Data Construction, Cleansing, and Validation

Recent arithmetic datasets draw on manual curation, crowdsourcing, programmatic template instantiation, grammar-based sampling, dataset unification, and complex post-processing to ensure diversity, non-redundancy, and rigor.

Processes:

Deduplication and Diversity: Semantic embedding clustering and explicit Jaccard-based n-gram filtering ensure high intra-dataset diversity and no evaluation leakage, e.g., AM-DeepSeek-R1-Distilled, CALC-X (Zhao et al., 25 Mar 2025, Kadlčík et al., 2023).
Programmatic Generation: MsAT samples all operator templates up to depth 3 and explicitly solves for consistent integer parameters (Wang et al., 2023). Fine-tuning studies use synthetic data with operands drawn from a log-uniform distribution over [1, 10^7] to balance magnitudes (Gangwar et al., 18 Feb 2025).
Verification and Filtering: Mathematical tasks are validated by symbolic execution (SymPy, Math-Verify), tolerance-checked for floating-point answers (≤10^-6), and further rescored via LLM-based correctness classifiers or rule-based schema compliance (Zhao et al., 25 Mar 2025, Kadlčík et al., 2023).
Format-Dependent Checks: SQL benchmarks (Archer) are validated by actual execution on the ground-truth database, with execution accuracy (EX) as the primary metric (Zheng et al., 2024).
Manual + LLM-Assisted Annotation: FormulaReasoning leverages LLM-prompted draft annotation, manual review, formula verification scripts, and parameter assignment for each domain variable (Li et al., 2024).
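
The generation and deduplication steps above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the pipeline of any cited dataset: the function names, the left-nested expression shape, the omission of division (to keep answers integral without solving for parameters), the greedy filter, and the 0.8 similarity threshold are all hypothetical choices.

```python
import math
import random

def sample_log_uniform_int(rng, low=1, high=10**7):
    """Draw an integer whose logarithm is uniform on [log(low), log(high)]."""
    value = math.exp(rng.uniform(math.log(low), math.log(high)))
    return min(high, max(low, round(value)))

def generate_problem(rng, max_steps=3):
    """Build a left-nested expression with 1 to max_steps operations and solve it."""
    steps = rng.randint(1, max_steps)
    expression = str(sample_log_uniform_int(rng))
    for _ in range(steps):
        op = rng.choice(["+", "-", "*"])  # division omitted so answers stay integers
        expression = f"({expression} {op} {sample_log_uniform_int(rng)})"
    # eval is acceptable here only because the string is built from our own tokens.
    return {"expression": expression, "answer": eval(expression), "steps": steps}

def ngrams(text, n=3):
    """Return the set of word n-grams (a single shorter tuple for short texts)."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(max(1, len(tokens) - n + 1))}

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def deduplicate(texts, threshold=0.8, n=3):
    """Greedy filter: keep a text only if it is not too similar to any kept text."""
    kept, kept_grams = [], []
    for text in texts:
        grams = ngrams(text, n)
        if all(jaccard(grams, g) < threshold for g in kept_grams):
            kept.append(text)
            kept_grams.append(grams)
    return kept

rng = random.Random(0)
problems = [generate_problem(rng) for _ in range(5)]
unique = deduplicate([p["expression"] for p in problems])
```

The greedy filter compares every candidate with every kept item, which is quadratic; at the scale of the corpora discussed here, an approximate method such as MinHash would likely be needed instead.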

4. Evaluation Regimes and Benchmarking

Datasets for arithmetic reasoning underpin model evaluation using metrics tuned to their structural properties:

Final-Answer Accuracy: Exact match on the final integer or scalar value (e.g., CMATH, MATH 401, MsAT, GSM8K, CLEVR-Math, MAWPS, SVAMP).
F1 and Macro-Averaged Precision/Recall: Used for multi-class or span-based tasks (NumGLUE, NUMBERGAME) (Mishra et al., 2022, Mishra et al., 2020).
Step-Level and Sub-Chain Accuracy: Assess intermediate calculation correctness across CoT traces, especially in fine-grained error analysis (GSM8K arithmetic accuracy, MsAT, CALC-X) (Kadlčík et al., 2023).
Error Decomposition: Datasets like GSM-Ranges introduce logical vs. arithmetic error metrics, distinguishing slips in calculation from erroneous reasoning sequence or step omission (Shrestha et al., 12 Feb 2025).
Pass@N, LLM-as-a-Judge, and Code-Switching: MathMist supports Pass@3, multilingual answering, and chain-of-thought judgment with symbolic/numeric equivalence (Sobhani et al., 16 Oct 2025).
Code/Tool-Augmented Execution: In CALC-X, correctness is determined by symbolic calculator output embedded within the CoT (Kadlčík et al., 2023).
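
The idea of offloading arithmetic inside a reasoning chain can be illustrated with a small stand-in calculator. The sketch below assumes a simplified markup in which each call is wrapped in a bare <gadget> tag and the result is written into an <output> tag immediately after it; the exact tag layout of the released data may differ, and `safe_eval` and `fill_gadget_outputs` are hypothetical helpers rather than part of any released tooling.

```python
import ast
import operator
import re

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def safe_eval(expression: str):
    """Evaluate +, -, *, / over numeric literals without calling eval."""
    def visit(node):
        if isinstance(node, ast.Expression):
            return visit(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](visit(node.left), visit(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -visit(node.operand)
        raise ValueError(f"unsupported expression: {expression!r}")
    return visit(ast.parse(expression, mode="eval"))

# Assumed simplified markup: <gadget>EXPRESSION</gadget>
GADGET = re.compile(r"<gadget>(.*?)</gadget>")

def fill_gadget_outputs(chain: str) -> str:
    """Append an <output> tag with the computed value after every <gadget> call."""
    return GADGET.sub(
        lambda m: f"{m.group(0)}<output>{safe_eval(m.group(1))}</output>", chain
    )

chain = ("3 boxes of 4 eggs give <gadget>3 * 4</gadget> eggs; "
         "after selling 5, <gadget>12 - 5</gadget> remain.")
print(fill_gadget_outputs(chain))
```

Walking the syntax tree instead of calling `eval` restricts the calculator to the four basic operators, so text produced by a model cannot trigger arbitrary code.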

Empirical studies establish human-level upper bounds (EM ≈ 95% on NUMBERGAME, F1 = 95.2% on NumGLUE) and highlight that leading LLMs consistently fall below them (e.g., GPT-3-13B multi-task at 32.7% F1 on NumGLUE) (Mishra et al., 2022).
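
As a minimal sketch of how final-answer accuracy and Pass@N might be computed, the code below scores string answers with an exact-match check that falls back to a numeric comparison. The absolute tolerance of 1e-6 mirrors the figure mentioned in Section 3, but treating it as an absolute (rather than relative) tolerance is an assumption, as is defining Pass@k as "any of the first k samples is correct"; individual benchmarks may define both differently.

```python
import math

def is_correct(predicted: str, gold: str, tol: float = 1e-6) -> bool:
    """Exact string match, falling back to numeric comparison within tol."""
    if predicted.strip() == gold.strip():
        return True
    try:
        return math.isclose(float(predicted), float(gold), rel_tol=0.0, abs_tol=tol)
    except ValueError:
        return False  # at least one side is not a parsable number

def final_answer_accuracy(predictions, golds):
    """Fraction of problems whose single prediction is correct."""
    return sum(is_correct(p, g) for p, g in zip(predictions, golds)) / len(golds)

def pass_at_k(samples_per_problem, golds, k=3):
    """Fraction of problems where any of the first k samples is correct."""
    hits = sum(
        any(is_correct(s, g) for s in samples[:k])
        for samples, g in zip(samples_per_problem, golds)
    )
    return hits / len(golds)

accuracy = final_answer_accuracy(["42", "3.0000001", "7"], ["42", "3", "8"])
coverage = pass_at_k([["1", "2", "3"], ["9", "9", "9"]], ["3", "4"], k=3)
print(accuracy, coverage)
```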

5. Impact on Model Development and Downstream Performance

Arithmetic reasoning datasets have become the canonical pre-training, fine-tuning, and evaluation grounds for numerical and multi-step reasoning in LLMs, a capability that remains a critical bottleneck for robust problem-solving in real-world settings.

Effects and Observations:

Inclusion of large-scale arithmetic pre-training corpora (e.g., AM-DeepSeek-R1-Distilled, MsAT, programmatically generated multi-million example sets) yields measurable improvements:
- Models fine-tuned on AM-DeepSeek-R1-Distilled achieve state-of-the-art on AIME2024 (pass@1: 72.7–76.5%) and MATH-500 (accuracy: 96.2–97.0%) (Zhao et al., 25 Mar 2025).
- Intermediate fine-tuning on synthetic arithmetic data increases reasoning accuracy by 3–10% absolute on challenging datasets (GSM8K, MultiArith, ASDiv, SVAMP), with gains saturating at 1–2 epochs (Gangwar et al., 18 Feb 2025).
- Tool-augmented architectures (Calcformers) trained on CALC-X at least double final-answer accuracy versus vanilla LMs (e.g., T5-XL: 19.2 → 39.6% on GSM8K; 20.8 → 53.8% on Ape210K) (Kadlčík et al., 2023).
Transfer is not format-agnostic: models trained on single-format corpora display marked drops when tested on other types (RC → NLI; SQL → arithmetic); NUMBERGAME and NumGLUE highlight the absence of cross-format generalization (Mishra et al., 2022, Mishra et al., 2020).
The distinction between logical and non-logical errors becomes visible at scale: larger magnitudes and out-of-distribution numeric values amplify both classes of errors (e.g., GSM-Ranges: logical error rate rising by up to 14 pp at perturbation level ℓ=6) (Shrestha et al., 12 Feb 2025).
Multilingual and cross-resource evaluation surfaces significant degradation for low-resource languages, even in arithmetic: MathMist shows a 20–30% absolute gap between high- and low-resource language performance, with error-identification tasks exhibiting high variance (Sobhani et al., 16 Oct 2025).
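
The scale-perturbation idea behind the GSM-Ranges result above can be illustrated with a toy template whose operands grow with a perturbation level while the gold answer is recomputed. This is only a sketch of the general idea: the template, the `make_instance` helper, and the mapping from level to digit count are assumptions and do not reproduce the benchmark's actual procedure.

```python
import random

# Hypothetical word-problem template; the gold answer is always a + b.
TEMPLATE = "Ana has {a} apples and buys {b} more. How many apples does she have now?"

def make_instance(level: int, rng: random.Random) -> dict:
    """Instantiate the template with operands that have exactly `level` digits."""
    low, high = 10 ** (level - 1), 10 ** level - 1
    a, b = rng.randint(low, high), rng.randint(low, high)
    return {
        "level": level,
        "operands": (a, b),
        "question": TEMPLATE.format(a=a, b=b),
        "answer": a + b,  # recomputed so the label stays consistent with the text
    }

rng = random.Random(0)
suite = [make_instance(level, rng) for level in range(1, 7)]
for instance in suite:
    print(instance["level"], instance["question"], instance["answer"])
```

Because the reasoning required is identical at every level, any drop in accuracy as the level rises can be attributed to number scale rather than to problem structure.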

6. Limitations, Open Challenges, and Future Directions

Despite the proliferation and scale of available datasets, major open problems persist:

Robustness to Numeral Range and Perturbation: LLMs calibrated to training distributions (numbers <1,000) fail with large-input generalization—necessitating construction of tougher, scale-variant corpora (GSM-Ranges) (Shrestha et al., 12 Feb 2025).
Multi-Modal and Cross-Format Generalization: Visual-to-symbolic mapping (MNS, CLEVR-Math) and multi-format fusion (NUMBERGAME, NumGLUE) remain unsolved. No current model matches human performance in multi-modal, multi-lingual, and format-agnostic settings (Zhang et al., 2020, Sobhani et al., 16 Oct 2025).
External Knowledge Integration: Datasets now require explicit retrieval and application of domain formulae, commonsense facts, or dynamic tool calls (FormulaReasoning, NUMBERGAME Type 2, CALC-X). Models lag notably without hybrid neuro-symbolic strategies (Li et al., 2024, Kadlčík et al., 2023).
Catastrophic Forgetting and Stability: Online self-training using preference optimization outperforms supervised regimes for continual improvement (avoiding catastrophic out-of-domain format forgetting), but stability and fast adaptation under streaming data remain obstacles (Kadlčík et al., 2024).
Evaluation Leakage, Overlap, and Diversity: The field recognizes the necessity for nightly, cross-source leakage checks (Jaccard clustering, semantic embeddings) to avoid inflated results from test-train overlap (Zhao et al., 25 Mar 2025, Kadlčík et al., 2023).
Benchmark Scarcity for Non-English and Higher-Order Arithmetic: Most benchmarks remain English-centric, with exceptions like CMATH, MathMist, and FormulaReasoning; further expansion into high-school and college curricula, high-magnitude problems, and formula-based problems is an active research area.

7. Data Accessibility, Licensing, and Community Usage

Major datasets are released under research-focused, non-commercial licenses (e.g., AM-DeepSeek-R1-Distilled, CMATH: CC BY-NC-SA 4.0) and are accessible through Hugging Face or GitHub repositories. Standardized JSONL formats with schema-compliant metadata and train/val/test splits (where applicable) facilitate integration and continuous benchmarking by the community (Zhao et al., 25 Mar 2025, Wei et al., 2023, Kadlčík et al., 2023, Sobhani et al., 16 Oct 2025, Li et al., 2024).
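
A JSONL release of this kind can be read with the standard library alone. The sketch below assumes a hypothetical schema with `question`, `answer`, and `split` fields; actual field names vary by dataset and should be taken from each dataset card. An in-memory buffer stands in for a downloaded file.

```python
import io
import json

REQUIRED_FIELDS = {"question", "answer", "split"}  # hypothetical schema

def read_jsonl(handle):
    """Yield one record per non-empty line, checking the assumed required fields."""
    for line_number, line in enumerate(handle, start=1):
        if not line.strip():
            continue
        record = json.loads(line)
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            raise ValueError(f"line {line_number}: missing fields {sorted(missing)}")
        yield record

# Invented sample records, used in place of a real dataset file.
sample = io.StringIO(
    '{"question": "2 + 3 = ?", "answer": "5", "split": "train"}\n'
    '{"question": "7 * 6 = ?", "answer": "42", "split": "test"}\n'
)
by_split = {}
for record in read_jsonl(sample):
    by_split.setdefault(record["split"], []).append(record)
counts = {split: len(items) for split, items in by_split.items()}
print(counts)  # {'train': 1, 'test': 1}
```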

In summary, arithmetic reasoning datasets underpin model evaluation and targeted skill acquisition for machine reasoning. They expose fundamental weaknesses in current models, spur advances in hybrid neural-symbolic architectures, and provide a rigorous, evolving testbed for arithmetic and general reasoning across monolingual, multilingual, multi-modal, and cross-domain AI systems.