MathGAP Framework for LLM Math Reasoning
- MathGAP is a framework that formalizes arithmetic word problems into logical models with configurable proof complexity (depth, width, nonlinearity).
- It employs automated proof tree synthesis to generate chain-of-thought solutions, exposing LLM limitations in handling multi-step reasoning tasks.
- The framework rigorously evaluates LLM performance by varying factors like sentence ordering and inference structure to reveal patterns of degradation.
The MathGAP Framework provides a systematic, programmatic approach for generating math word problems with automatically constructed chain-of-thought (CoT) solutions whose underlying arithmetic proofs exhibit user-specified structural complexity. Designed explicitly to probe the reasoning generalization of LLMs, MathGAP enables precise, data-driven exploration of how models cope with increases in proof tree depth, width, nonlinearity, and variation in sentence ordering. By sampling from a controlled space of logical world models and inference rules, MathGAP creates rigorous evaluation environments that expose limitations and sensitivities in LLM mathematical reasoning, highlighting their patterns of degradation and the "noisy generalization" that arises outside familiar problem distributions (Opedal et al., 17 Oct 2024).
1. Problem Formalization and Logical World Model
At the core of MathGAP is the formalization of arithmetic word problems as logical world models. Each problem is represented as a sequence of logical forms:
- Each sentence is encoded as a predicate with argument-value lists; e.g., a container predicate, schematically `cont(Alice, 5, apples)`, encodes "Alice has 5 apples".
- Additional predicates capture comparative, transfer, part-whole, or complex relationships; e.g., a comparison predicate, schematically `comp(Bob, Alice, more)`, describes "Bob has more than Alice".
This encoding ensures that each statement is machine-interpretable and provides a deterministic substrate for proof synthesis. The logical forms serve both as "axioms" and as the foundation for constructing formal proofs.
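To make the encoding concrete, the following is a minimal sketch in Python; the class names, fields, and predicate vocabulary are illustrative assumptions for exposition, not MathGAP's actual API.

```python
# Illustrative logical forms for arithmetic word problems.
# NOTE: names and fields are hypothetical, for exposition only.
from dataclasses import dataclass

@dataclass(frozen=True)
class Container:
    """Encodes sentences like 'Alice has 5 apples'."""
    agent: str
    quantity: int
    entity: str

@dataclass(frozen=True)
class Comparison:
    """Encodes sentences like 'Bob has 3 more apples than Alice'."""
    agent: str       # the agent being described
    other: str       # the agent compared against
    difference: int  # how many more the agent has
    entity: str

# "Alice has 5 apples. Bob has 3 more apples than Alice."
# becomes a sequence of axioms:
axioms = [
    Container("Alice", 5, "apples"),
    Comparison("Bob", "Alice", 3, "apples"),
]
```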
2. Automated Proof Tree Synthesis and Reasoning Traces
A distinctive feature of MathGAP is its procedural generation of deductive proof traces—specifically, proof trees constructed by recursively applying a fixed set of inference rules. Proofs are built with the following semantics:
- Leaf nodes: The initial problem sentences, encoded as logical forms.
- Internal nodes: Applications of inference rules that derive new facts from premises.
- Root node: The logical form corresponding to the answer, matching the question's target.
A standard inference rule is represented in proof-theoretic notation, with premises above the line and the derived conclusion below it; schematically, a binary rule has the form $\frac{P_1 \quad P_2}{C}$, where $P_1$ and $P_2$ are premise logical forms and $C$ is the fact they entail. This mechanism supports a diversity of arithmetic reasoning types (accumulation, comparison, transfer, composition), each specified by a corresponding inference template. The chain-of-thought solution is generated algorithmically by traversing the proof tree in post-order, providing both the numeric answer and an explicit, multi-step rationale.
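The sketch below illustrates this mechanism end to end, reusing the hypothetical `Container` and `Comparison` classes from the earlier sketch; `apply_comparison` stands in for one inference template and is an assumption for exposition, not the framework's implementation.

```python
# Proof-tree construction and post-order CoT rendering (illustrative).
# Assumes the Container and Comparison dataclasses from the earlier sketch.
from dataclasses import dataclass, field

@dataclass
class ProofNode:
    fact: object                                  # a logical form
    premises: list = field(default_factory=list)  # empty => leaf (axiom)

def apply_comparison(cont: Container, comp: Comparison) -> ProofNode:
    """Binary rule: cont(a, q, e) + comp(b, a, d, e)  =>  cont(b, q + d, e)."""
    assert comp.other == cont.agent and comp.entity == cont.entity
    derived = Container(comp.agent, cont.quantity + comp.difference, cont.entity)
    return ProofNode(derived, premises=[ProofNode(cont), ProofNode(comp)])

def render_cot(node: ProofNode) -> list[str]:
    """Post-order traversal: every premise is verbalized before its conclusion."""
    steps = []
    for premise in node.premises:
        steps.extend(render_cot(premise))
    label = "Therefore" if node.premises else "Given"
    steps.append(f"{label}: {node.fact}")
    return steps

root = apply_comparison(Container("Alice", 5, "apples"),
                        Comparison("Bob", "Alice", 3, "apples"))
print("\n".join(render_cot(root)))  # root fact: Bob has 8 apples
```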
3. Configurable Proof Complexity Dimensions
MathGAP's data generation pipeline allows systematic control over multiple orthogonal axes of proof complexity:
| Proof Dimension | Description | Example Control |
|---|---|---|
| Depth | Height of the proof tree (inference steps from axioms to answer) | 1 to ~10+ |
| Width | Number of axioms/declarative sentences (breadth of the proof's base) | 2 to larger |
| Linearity | Whether proofs are strictly sequential (linear) or combine subproofs (nonlinear) | Enforce rules |
| Sentence ordering | Sequence in which axioms appear (canonical or permuted) | Swap positions |
This configurability supports targeted out-of-distribution (OOD) evaluation:
- Model performance can be measured by training on, or prompting with, examples of limited proof depth or width, then testing on instances with greater depth, more axioms, or altered sentence order.
- Nonlinearity arises when inference rules operate on previously inferred (non-axiom) facts, simulating the compound reasoning required in multi-step arithmetic. A toy generator exercising these controls is sketched below.
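The following sketch shows how such complexity controls might be exposed programmatically; the `GenerationConfig` fields and the linear-chain template are assumptions about what a pipeline like this could look like, not MathGAP's actual interface. It covers depth and sentence ordering; width and nonlinearity would additionally require rules that consume multiple (possibly derived) premises.

```python
# Toy problem generator with configurable proof complexity (illustrative).
import random
from dataclasses import dataclass

@dataclass
class GenerationConfig:
    depth: int           # inference steps from the axioms to the answer
    permute_order: bool  # present sentences in noncanonical order
    seed: int = 0

def sample_linear_problem(cfg: GenerationConfig) -> str:
    """Linear-chain template: each comparison sentence adds exactly one
    inference step, so the resulting proof tree has height cfg.depth."""
    rng = random.Random(cfg.seed)
    sentences = [f"agent0 has {rng.randint(1, 9)} apples."]
    for i in range(1, cfg.depth + 1):
        sentences.append(f"agent{i} has {rng.randint(1, 9)} more apples "
                         f"than agent{i - 1}.")
    if cfg.permute_order:
        rng.shuffle(sentences)  # same proof, permuted surface order
    question = f"How many apples does agent{cfg.depth} have?"
    return " ".join(sentences + [question])

print(sample_linear_problem(GenerationConfig(depth=4, permute_order=True)))
```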
4. Evaluation of LLM Reasoning
Using MathGAP, an extensive set of experiments has been conducted to characterize LLM performance as a function of arithmetic proof complexity (Opedal et al., 17 Oct 2024):
- Depth and width: There is a significant decrease in solution accuracy as the proof tree's depth or width increases, particularly in nonlinear settings. Even top-tier LLMs (e.g., GPT-4) encounter steep degradation for higher-order or compositional reasoning.
- Effect of in-context examples: Exposure to a diverse set of proofs spanning a range of complexities provides some marginal benefit, yet does not fully mitigate OOD failures. In some cases, zero-shot performance is comparable to few-shot configurations with in-distribution examples.
- Sentence ordering: Accuracy is sensitive to the reordering of problem sentences. Problems where a single sentence is moved from the middle to the front are particularly challenging, suggesting that model reliability erodes under noncanonical presentations.
The framework quantifies reasoning challenges via reproducible, interpretable metrics—answer accuracy stratified by proof dimension and reasoning type.
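Concretely, such stratified metrics reduce to simple bookkeeping over evaluation records; the record layout below is a hypothetical illustration, not the framework's reporting format.

```python
# Answer accuracy stratified by proof dimension (illustrative bookkeeping).
from collections import defaultdict

# Each record: (depth, width, is_linear, reasoning_type, answered_correctly)
records = [
    (2, 3, True,  "transfer",   True),
    (6, 3, True,  "transfer",   False),
    (2, 5, False, "comparison", True),
    (6, 5, False, "comparison", False),
]

def accuracy_by(records, key_index: int) -> dict:
    """Group records by one complexity dimension and average correctness."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[rec[key_index]].append(rec[-1])
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}

print(accuracy_by(records, 0))  # e.g., {2: 1.0, 6: 0.0} -- accuracy vs. depth
```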
5. Diagnostic Insights and Implications
Analysis facilitated by MathGAP reveals several diagnostic insights:
- LLMs tend to overfit to prototypical, shallow, or linear proof structures, displaying "noisy generalization" when confronted with deeper, broader, or less orderly proofs.
- The results suggest that LLMs' apparent robustness on standard benchmarks does not reflect an ability to systematically perform higher-order reasoning or proof generalization.
- Sensitivity to presentation order and logical structure indicates reliance on shallow pattern matching or local coherence rather than abstract deductive planning.
These findings underline the need for both more rigorous evaluation suites (of which MathGAP is an instantiation) and for model development strategies that address multi-step, compositional reasoning.
6. Future Directions and Research Opportunities
The systematicity of the MathGAP approach provides several directions for future research:
- As LLMs improve, MathGAP can continually spawn new benchmarks by escalating proof complexity or introducing novel compositions of inference rules.
- The observed vulnerabilities inspire investigation into alternative prompting strategies, curriculum learning, and hybrid neuro-symbolic models that directly encode proof-theoretic structure.
- The fine-grained breakdown of error types can inform targeted model interventions or the development of auxiliary mechanisms for handling complex reasoning tasks.
By formalizing arithmetic word problem evaluation through logical forms and programmatic proof trees, MathGAP constitutes a robust paradigm for both benchmarking and advancing LLM mathematical reasoning.
7. Significance within Mathematical and AI Research
MathGAP bridges formal logic, programmatic data generation, and empirical LLM evaluation:
- It provides a future-proof benchmark that scales with the field's progress, ensuring that reasoning capabilities—not mere pattern matching—are assessed.
- By quantifying LLM generalization across controlled complexity gradients, MathGAP enables principled diagnosis of current model architectures.
- Its methodology can be extended to broader domains of mathematical and formal inference, serving as a foundation for comprehensive evaluation in symbolic reasoning, automated theorem proving, and scientific question answering.
The MathGAP Framework thus represents a pivotal resource for both the theoretical study and practical improvement of machine reasoning.