T-Math Reasoning Benchmark

Updated 12 December 2025
  • T-Math Reasoning Benchmark is a framework that evaluates LLMs' mathematical reasoning by applying minimal yet semantically critical perturbations.
  • It distinguishes between simple perturbations that preserve solution structure and hard perturbations that invalidate original solution approaches.
  • Empirical results highlight significant performance drops on hard perturbed problems, exposing model brittleness and reliance on template reuse.

The T-Math Reasoning Benchmark is a comprehensive framework for evaluating mathematical reasoning capabilities of LLMs under conditions that probe true generalization and robustness. The benchmark is most notably instantiated in the MATH-Perturb suite, which leverages systematically constructed problem perturbations to reveal whether high LLM performance reflects genuine reasoning or surface-level memorization and template application.

1. Benchmark Construction: MATH-P-Simple and MATH-P-Hard

The T-Math Reasoning Benchmark centers on a curated set of level-5 problems, denoted as $\mathcal{M} = \{P_1, \dots, P_{279}\}$, derived from the original MATH dataset. It introduces two classes of perturbation:

  • Simple Perturbations (MATH-P-Simple): Each problem $P \in \mathcal{M}$ is transformed by a function $f_s: \mathcal{M} \to \mathcal{M}_\text{simple}$, which modifies non-essential characteristics (e.g., numerical constants, cosmetic wording). Crucially, the canonical solution trace $S(P)$ is preserved up to trivial arithmetic:

$$S(f_s(P)) \approx S(P)$$

  • Hard Perturbations (MATH-P-Hard): The hard perturbation operator $f_h: \mathcal{M} \to \mathcal{M}_\text{hard}$ minimally edits each problem such that the original solution approach is no longer valid:

$$S(f_h(P)) \not\equiv S(P)$$

All perturbations maintain minimal edit distance from the original, while ensuring a changed numerical or structural answer, thus precluding trivial answer copying (Huang et al., 10 Feb 2025).

Illustrative Example:

  • Original: Find the area of the circle $(x-3)^2 + (y-7)^2 = 144$. Solution: $144\pi$, obtained by computing $\pi \times 12^2$.
  • Simple: Only the radius changes, e.g., $(x-3)^2 + (y-7)^2 = 169$, so the area is $169\pi$.
  • Hard: The task changes to, e.g., finding the circumference, necessitating an entirely different solution trace; a short code sketch contrasting the three variants follows below.
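
To make the distinction concrete, the snippet below encodes the three variants as simple records and checks the two defining properties: the simple perturbation keeps the original solution template while changing the answer, whereas the hard perturbation changes both. This is a minimal illustrative sketch; the `PerturbedProblem` structure and its field names are assumptions made here for exposition and are not part of the benchmark's released tooling.

```python
import sympy as sp
from dataclasses import dataclass

@dataclass
class PerturbedProblem:
    statement: str   # problem text
    answer: sp.Expr  # ground-truth answer as a SymPy expression
    template: str    # label for the solution template that applies

original = PerturbedProblem(
    "Find the area of the circle (x-3)^2 + (y-7)^2 = 144.",
    sp.pi * 12**2,          # area = pi * r^2 with r = 12
    template="area = pi*r^2",
)
simple = PerturbedProblem(
    "Find the area of the circle (x-3)^2 + (y-7)^2 = 169.",
    sp.pi * 13**2,          # same template, new radius r = 13
    template="area = pi*r^2",
)
hard = PerturbedProblem(
    "Find the circumference of the circle (x-3)^2 + (y-7)^2 = 144.",
    2 * sp.pi * 12,         # different template: circumference = 2*pi*r
    template="circumference = 2*pi*r",
)

# MATH-P-Simple: solution template preserved, final answer changed.
assert simple.template == original.template
assert sp.simplify(simple.answer - original.answer) != 0

# MATH-P-Hard: original template no longer applies, final answer changed.
assert hard.template != original.template
assert sp.simplify(hard.answer - original.answer) != 0
```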

2. Evaluation Protocol

The benchmark employs multiple evaluation modalities to assess robustness:

  • Zero-Shot Chain-of-Thought (CoT) Prompting: Problems are presented without examples. The model is prompted to "Think step by step…," with the answer expected in a boxed format (e.g., $\boxed{\dots}$).
  • Equivalence Checking: Model outputs are normalized and verified for mathematical equivalence using the SymPy library.
  • Metrics: The primary metric is overall accuracy, i.e., the percentage of correctly solved problems. Further distinctions are made between train-split and test-split items.
  • In-Context Learning (ICL): One-shot demonstrations using the original (unperturbed) version of the problem are prepended. Performance is then measured on both $\mathcal{M}_\text{simple}$ and $\mathcal{M}_\text{hard}$ to assess sensitivity to demonstration type.

This protocol is designed to prevent overestimation of model capability due to answer-form bias or exploitation of explicit demonstration patterns (Huang et al., 10 Feb 2025).
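
A minimal sketch of how such a pipeline can be assembled is shown below. The prompt wording, the helper names (`extract_boxed`, `is_equivalent`, `accuracy`, `icl_prompt`), and the simplified boxed-answer regex are illustrative assumptions, not the benchmark's released harness; the equivalence check follows the stated approach of normalizing answers and comparing them with SymPy.

```python
import re
import sympy as sp
from sympy.parsing.latex import parse_latex  # needs antlr4-python3-runtime installed

# Illustrative zero-shot CoT prompt template (assumed wording).
PROMPT = "Think step by step, and put your final answer in \\boxed{{}}.\n\n{problem}"

def icl_prompt(demo_problem: str, demo_solution: str, target_problem: str) -> str:
    """One-shot ICL: prepend the original (unperturbed) problem and its solution."""
    return f"{demo_problem}\n{demo_solution}\n\n" + PROMPT.format(problem=target_problem)

def extract_boxed(text: str) -> str | None:
    """Return the contents of the last \\boxed{...} in the model output.
    The regex handles only non-nested braces, enough for simple answers."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1] if matches else None

def is_equivalent(pred_latex: str, gold_latex: str) -> bool:
    """Check mathematical equivalence of two LaTeX answers via SymPy."""
    try:
        diff = sp.simplify(parse_latex(pred_latex) - parse_latex(gold_latex))
        return diff == 0
    except Exception:
        # Fall back to normalized string comparison if parsing fails.
        return pred_latex.strip() == gold_latex.strip()

def accuracy(model_outputs: list[str], gold_answers: list[str]) -> float:
    """Overall accuracy: fraction of outputs whose boxed answer matches the gold answer."""
    correct = sum(
        1 for out, gold in zip(model_outputs, gold_answers)
        if (ans := extract_boxed(out)) is not None and is_equivalent(ans, gold)
    )
    return correct / len(gold_answers)
```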

3. Quantitative Outcomes and Failure Modes

Extensive empirical results across 18 LLMs reveal pronounced out-of-distribution brittleness:

Model                       Original   MATH-P-Simple   MATH-P-Hard
o1-mini                     94.27%     94.98%          78.49% (↓15.78 pp)
Gemini-2.0-flash-thinking   92.47%     91.04%          78.14% (↓14.33 pp)

All models exhibit drops of roughly 10–25 percentage points on MATH-P-Hard, while performance remains nearly unchanged (≤5 pp drop) on MATH-P-Simple. This substantial gap demonstrates that simple resampling or cosmetic changes do not approximate the hardness of "real" distributional shifts.
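
For reference, the drops in the table are plain percentage-point differences between the Original and MATH-P-Hard columns; the short sketch below reproduces them (the dictionary layout is hypothetical, the numbers come from the table).

```python
# Accuracies in percent, taken from the table above.
results = {
    "o1-mini": {"original": 94.27, "hard": 78.49},
    "Gemini-2.0-flash-thinking": {"original": 92.47, "hard": 78.14},
}
for model, acc in results.items():
    drop_pp = acc["original"] - acc["hard"]  # percentage-point drop on MATH-P-Hard
    print(f"{model}: -{drop_pp:.2f} pp")
# o1-mini: -15.78 pp
# Gemini-2.0-flash-thinking: -14.33 pp
```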

4. Memorization and Template Reuse: Diagnostic Insights

The T-Math Reasoning Benchmark exposes a novel "blind template reuse" failure mode:

  • Phenomenon: LLMs frequently recognize and apply compact solution templates learned during pretraining, such as mapping counting tasks to $\binom{n}{k}$ formulae, even when critical problem conditions have shifted.
  • Observable Diagnostics: If the solution trace for a hard-perturbed problem $f_h(P)$ exactly mirrors $S(P)$ despite semantic misalignment, this indicates erroneous template deployment; a detection sketch is given after this list.
  • Prevalence: Approximately 40% of o1-mini's errors and 25% of Claude-3.5-Sonnet's errors on MATH-P-Hard are attributed to this pathology.
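
One way to operationalize this diagnostic is to score the similarity between the model's trace on $f_h(P)$ and the canonical solution $S(P)$ of the unperturbed problem, and to flag wrong answers whose traces remain suspiciously close to the original. The sketch below uses a simple token-level similarity and a hypothetical 0.8 threshold; it is an illustrative assumption, not the authors' detection procedure.

```python
from difflib import SequenceMatcher

def trace_similarity(model_trace: str, canonical_trace: str) -> float:
    """Rough token-level similarity between two solution traces, in [0, 1]."""
    return SequenceMatcher(None, model_trace.split(), canonical_trace.split()).ratio()

def flags_template_reuse(trace_on_hard: str,
                         canonical_trace_original: str,
                         answer_is_wrong: bool,
                         threshold: float = 0.8) -> bool:
    """Flag a likely 'blind template reuse' error: the model's trace on the
    hard-perturbed problem f_h(P) closely mirrors the original solution S(P)
    even though the perturbation invalidated that approach."""
    return answer_is_wrong and (
        trace_similarity(trace_on_hard, canonical_trace_original) >= threshold
    )
```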

This evidence underscores the limitations of "verbatim" recall-based memorization checks and motivates finer-grained, process-level inspection (Huang et al., 10 Feb 2025).

5. Benchmark Design Recommendations and Comparative Perspective

MATH-Perturb demonstrates that benchmarks relying solely on numerical resampling or cosmetic modifications severely underestimate the real-world brittleness of LLM mathematical reasoning. To address these deficits:

  • Benchmark Construction: Future benchmarks should employ minimal but semantically critical hard perturbations to firmly invalidate solution trace transfer.
  • Automation Tooling: Automated detection of template collapse, via trace-level comparison, is recommended for flagging suspiciously invariant solution patterns.
  • Model Development: Embedding meta-reasoning modules capable of diagnosing template applicability, alongside proof-verification components (e.g., sample-based subproblem checks, as sketched below), is encouraged to mitigate overgeneralization.
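
As an illustration of what a sample-based subproblem check might look like, the sketch below brute-forces small instances of a counting task to test whether the bare $\binom{n}{k}$ template (the failure mode discussed in Section 4) actually applies; the function name and the non-adjacency constraint are hypothetical examples, not part of the benchmark.

```python
from itertools import combinations
from math import comb

def template_applies(predicate, n: int, k: int) -> bool:
    """Sample-based check: does the count of valid k-subsets of range(n)
    equal C(n, k), i.e. does the unconstrained binomial template apply?"""
    actual = sum(1 for s in combinations(range(n), k) if predicate(s))
    return actual == comb(n, k)

# Choosing 2 of 5 items with no constraint: the C(n, k) template applies.
assert template_applies(lambda s: True, 5, 2)

# With an added constraint (chosen items must be non-adjacent), the bare
# C(n, k) template no longer applies and should be rejected.
assert not template_applies(lambda s: all(b - a > 1 for a, b in zip(s, s[1:])), 5, 2)
```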

Against other benchmarks—such as ConceptMath, which focuses on concept-wise diagnostic granularity (Wu et al., 22 Feb 2024), or HARP, which emphasizes solution diversity and competition-calibrated challenge (Yue et al., 11 Dec 2024)—MATH-Perturb's distinctive contribution is its systematic probing of abstraction transfer and robustness under adversarial task structure variation.

6. Significance and Future Directions

The T-Math Reasoning Benchmark probes the limits of current LLM mathematical generalization by requiring models not only to reproduce familiar solution skeletons but also to detect when such skeletons are inadequate. This reframes the evaluation goal from resampled pattern matching to adaptive problem analysis, an essential prerequisite for models deployed in open-domain or adversarial settings.

This suggests that the T-Math Reasoning Benchmark, typified by MATH-Perturb, is a critical instrument for quantifying foundational limits and for guiding the next stage of robust mathematical model development (Huang et al., 10 Feb 2025).
