
MathQA-Python Dataset Benchmark

Updated 8 July 2025
  • MathQA-Python is a large-scale dataset that translates natural language math problems into executable Python programs.
  • It is constructed by converting MathQA’s DSL-based operation programs into Python, with filtering to ensure each problem can be evaluated automatically through test cases.
  • Benchmark results show notable gains through fine-tuning and scaling, while highlighting challenges in semantic grounding and multi-step reasoning.

The MathQA-Python dataset is a large-scale benchmark designed to evaluate the ability of models to synthesize correct Python programs from natural language mathematical word problems. Derived as a Python variant of the original MathQA dataset, MathQA-Python represents a complex synthesis task that tests not only program generation capabilities but also mathematical reasoning and semantic understanding. The dataset is widely utilized for research in program synthesis, natural language to code translation, and the assessment of LLMs’ mathematical competencies.

1. Dataset Composition and Structure

MathQA-Python consists of 23,914 total problems, with 19,209 training examples, 2,822 validation examples, and 1,883 test examples after a filtration process that ensures consistent evaluation (2108.07732). Each entry in the dataset originates from a complex mathematical word problem, mirroring those found in MathQA, but with the formal operation programs translated from a domain-specific language (DSL) into executable Python code.

In comparison to datasets such as MBPP (Mostly Basic Programming Problems), MathQA-Python is characterized by “straight-line” programs—that is, the generated Python typically does not involve control flow structures (loops or conditionals), but focuses on implementing multi-step arithmetic or algebraic solutions directly reflecting the multi-stage reasoning found in the problem text. The word problems draw from diverse mathematical domains, including algebra, geometry, physics, and general quantitative reasoning.
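
For illustration, an entry in this style pairs a word problem with a straight-line solution script. The problem text, variable names, and values below are illustrative rather than an actual dataset record:

```python
# Hypothetical MathQA-Python-style entry (illustrative, not a real record).
# Problem: "A train travels 120 km in 2 hours and then 180 km in 3 hours.
#           What is its average speed over the whole journey in km/h?"

n0 = 120.0   # first distance (km)
n1 = 2.0     # first duration (h)
n2 = 180.0   # second distance (km)
n3 = 3.0     # second duration (h)

t0 = n0 + n2        # total distance
t1 = n1 + n3        # total time
answer = t0 / t1    # average speed: 60.0 km/h
print(answer)
```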

2. Construction Methodology

MathQA-Python is created by systematically translating the operation-based DSL programs from the original MathQA dataset into Python syntax (2108.07732). The original MathQA dataset, in turn, builds upon, de-noises, and augments the AQuA dataset by removing unsolvable or incomplete questions, resulting in cleanly annotated math word problems with step-by-step operation programs (1905.13319). Each MathQA problem includes a natural language question, several multiple-choice answers, and a fully specified operation program that details intermediate solution steps. This operation program is then converted into Python, yielding a ground-truth solution script for the MathQA-Python dataset.

Problems are filtered during construction based on their ability to be consistently evaluated with test cases, and the final code outputs are expected to match the correct answer present in the original multiple-choice options. This translation and filtering process ensures that only those problems which can be expressed and automatically validated using Python code are included in MathQA-Python.
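
A minimal sketch of this translation step is given below. The operation names, the '#k' back-reference convention, and the translator itself are assumptions chosen for illustration; the actual MathQA DSL and conversion pipeline may differ in detail.

```python
# Sketch of translating DSL operation steps into straight-line Python.
# Operation names and formats are illustrative assumptions.

OPS = {
    "add":      lambda a, b: f"({a} + {b})",
    "subtract": lambda a, b: f"({a} - {b})",
    "multiply": lambda a, b: f"({a} * {b})",
    "divide":   lambda a, b: f"({a} / {b})",
}

def translate_to_python(operations):
    """Turn a list of (op, arg1, arg2) DSL steps into a straight-line script.

    Arguments reference problem numbers ('n0', 'n1', ...) or the results of
    earlier steps ('#0' for step 0, which becomes intermediate variable 't0').
    """
    lines = []
    for i, (op, a, b) in enumerate(operations):
        a, b = a.replace("#", "t"), b.replace("#", "t")
        lines.append(f"t{i} = {OPS[op](a, b)}")
    lines.append(f"answer = t{len(operations) - 1}")
    return "\n".join(lines)

print(translate_to_python([
    ("add", "n0", "n2"),      # total distance
    ("add", "n1", "n3"),      # total time
    ("divide", "#0", "#1"),   # average speed
]))
```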

3. Evaluation Framework and Results

LLMs are evaluated on MathQA-Python using both few-shot prompting and fine-tuning paradigms (2108.07732). In few-shot scenarios, model performance is measured by providing a handful of example input–output pairs alongside each prompt. The largest model tested (137B parameters) achieves 33.4% accuracy in this setup, meaning that a generated solution passes all provided test cases for approximately one third of the problems.
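
As a rough sketch of how such a few-shot prompt can be assembled, the template and delimiters below are illustrative assumptions and do not reproduce the paper’s exact prompt format:

```python
# Illustrative few-shot prompt construction (template is an assumption).

def build_few_shot_prompt(solved_examples, target_question, k=4):
    """Concatenate k solved (question, program) pairs, then the target question."""
    parts = []
    for question, program in solved_examples[:k]:
        parts.append(f"# Question: {question}\n{program}\n")
    parts.append(f"# Question: {target_question}\n")  # the model completes the program
    return "\n".join(parts)
```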

Fine-tuning on the training split of MathQA-Python results in a marked increase in performance. When the same 137B model is fine-tuned on ground-truth programs, accuracy rises to 81.2% on the Python-formatted test set and 83.8% on the DSL-formatted variant. This substantial increase demonstrates the efficacy of data-driven adaptation for complex mathematical translation tasks.

Performance is evaluated as the fraction of test problems for which at least one sampled program passes all associated test cases, formalized as:

\[
\text{Accuracy} = \frac{\text{Number of problems solved by any sample}}{\text{Total number of test problems}}
\]
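
A minimal sketch of this metric is shown below; `run_and_check` stands in for an execution harness that runs one sampled program against a problem’s test cases and reports success, and is an assumed helper rather than part of any released evaluation script.

```python
# Sketch of the "solved by any sample" accuracy metric defined above.

def benchmark_accuracy(problems, samples_per_problem, run_and_check):
    """Fraction of problems for which at least one sampled program passes all tests."""
    solved = sum(
        any(run_and_check(program, problem["test_cases"]) for program in samples)
        for problem, samples in zip(problems, samples_per_problem)
    )
    return solved / len(problems)
```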

The experiments indicate a log-linear scaling of synthesis performance with model size:

\[
\text{Performance} \approx a \cdot \log(\text{Model Size}) + b
\]

where $a$ and $b$ are constants determined from empirical fits across different model capacities.
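
The constants can be recovered with an ordinary least-squares fit on logged model sizes, as in the sketch below; the data points are placeholders (only the 137B value mirrors the 33.4% few-shot result cited above), not a reproduction of the paper’s measurements.

```python
import numpy as np

# Illustrative fit of accuracy ≈ a·log(model size) + b; placeholder data points.
model_sizes = np.array([0.244e9, 2e9, 8e9, 68e9, 137e9])  # parameter counts
accuracies = np.array([0.05, 0.11, 0.17, 0.28, 0.334])    # placeholder accuracies

a, b = np.polyfit(np.log(model_sizes), accuracies, deg=1)
print(f"a = {a:.4f}, b = {b:.4f}")
```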

4. Model Behavior, Semantic Grounding, and Feedback

Despite the high pass rates in the fine-tuned regime, error analysis reveals that model competence on MathQA-Python is sensitive to problem complexity and linguistic subtlety. Many model errors originate from omissions in intermediate calculation steps or misinterpretations of nuanced mathematical or verbal constraints. Tasks involving multi-step reasoning or intricate arithmetic operations present the largest challenge.

A distinctive aspect of the evaluation is the assessment of semantic grounding—whether models can predict the output of a given program on test inputs, thereby exhibiting an internal “understanding” beyond surface-level code generation. Results indicate that even models capable of synthesizing correct programs for many test cases generally fail at simulation: they predict concrete program outputs far less accurately than they generate passing code.
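
The sketch below conveys the flavor of this execution-simulation probe; `model_predict_output` is a hypothetical stand-in for querying the model, programs are assumed to be straight-line scripts that set `answer`, and the probe format used in the underlying study differs in its details.

```python
# Sketch of a semantic-grounding probe: can the model predict a program's result?

def grounding_accuracy(programs, model_predict_output):
    """Fraction of programs whose concrete result the model predicts correctly."""
    correct = 0
    for program in programs:
        namespace = {}
        exec(program, namespace)                    # ground truth: actually run it
        actual = namespace["answer"]
        predicted = model_predict_output(program)   # model "simulates" the program
        correct += str(predicted).strip() == str(actual).strip()
    return correct / len(programs)
```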

Human-in-the-loop feedback further impacts performance. Interactive dialog experiments—wherein a human inspects generated code and offers succinct natural language guidance—demonstrate that error rates can be halved with minimal prompting. This suggests that while models may initially miss solution details, targeted correction can lead to both accurate code and more faithful stepwise explanations.

5. Research Significance and Challenges

MathQA-Python occupies a unique position among code synthesis benchmarks due to its focus on complex, mathematically framed natural language problems requiring precise arithmetic logic rather than general programming constructs. Model performance on the dataset highlights both the progress in code synthesis—evidenced by substantial gains from fine-tuning and increased model scale—and current deficits, particularly in semantic understanding and multi-hop reasoning.

Many errors remain on tasks requiring extended logical chains or domain-specific knowledge that cannot be trivially inferred from natural language. This suggests that future models must address robustness in parsing nuanced verbal statements, representing and executing multi-stage arithmetic reasoning, and developing deeper semantic grounding.

A plausible implication is that MathQA-Python will continue to serve as a challenging benchmark for the development of models that bridge the gap between linguistic comprehension and programmatic reasoning.

6. Comparison with Related Datasets

In the context of natural language to code tasks, MathQA-Python stands apart from datasets like MBPP, which focuses on simple, mostly procedural programming tasks suitable for novice programmers (2108.07732). While MBPP’s problems often involve loops, conditionals, and typical introductory logic, MathQA-Python centers on “straight-line” mathematical code derived directly from word problem decompositions.

Compared to source code comprehension datasets such as CodeQA (2109.08365), which frames tasks as open-domain QA over code snippets with answers generated via dependency parsing and semantic role labeling, MathQA-Python is purpose-built for evaluating precise implementation of mathematically motivated verbal descriptions. The presence of ground-truth executable programs distinguishes MathQA-Python from free-form QA datasets, rendering it especially applicable to quantitative evaluation of synthesis and reasoning in code models.

7. Access and Utilization

The MathQA-Python dataset is available at https://math-qa.github.io/math-QA/ (2108.07732). The accompanying resources include detailed annotation methodologies, guidelines for natural language to program alignment, and tools for evaluating generated code. The dataset’s availability and detailed structure make it a valuable resource for benchmarking advances in program synthesis, mathematical reasoning, and linguistically informed machine learning systems.