AgentMath: Tool-Augmented Mathematical Reasoning

Updated 4 July 2026

AgentMath is a framework that integrates LLM reasoning with tool execution to systematically decompose and solve complex mathematical problems.
It employs structured workflows—such as planner–reasoner–executor and condition mining—to separate subfunctions and enhance stateful, iterative solution strategies.
The system combines tool-augmented supervised fine-tuning with reinforcement learning; AgentMath-30B-A3B is reported to reach 90.6% on AIME24, 86.4% on AIME25, and 73.8% on HMMT25 under avg@32 evaluation.

AgentMath, in the narrow sense, is the tool-augmented framework that trains LLMs to interleave natural-language reasoning with real-time code execution for mathematical problem solving (Luo et al., 23 Dec 2025). In a broader sense, used here as an editorial label, it denotes the literature that reconceives mathematics as an agentic workflow: problems are parsed, decomposed, grounded in memory or literature, executed with tools, checked by specialized verifiers, and iteratively revised rather than answered in a single pass (Liao et al., 2023, Liu et al., 20 May 2025, Zhao et al., 20 May 2026). This line of work spans competition mathematics, process evaluation, multimodal educational diagnosis, real-world mathematical modeling, benchmark construction, and research-level proof search.

1. Core premises of agentic mathematical reasoning

A common premise across the literature is that difficult mathematics is not a monolithic next-token prediction problem. One formulation states that LLMs face challenges in solving complex mathematical problems that require capacities to parse statements, associate domain knowledge, perform compound logical reasoning, and integrate intermediate rationales, and that tackling all of these at once can lead to confusion in generation (Liao et al., 2023). Another formulation argues that mathematical modeling differs from ordinary mathematical reasoning because the problem is not already formalized: the agent must analyze a real-world scenario, choose assumptions, define variables, construct a formulation, and only then solve it (Liu et al., 20 May 2025).

Within this perspective, the defining operation of AgentMath is decomposition. Different systems instantiate decomposition differently—planner–reasoner–executor pipelines, conversational proxy loops, retrieval modules, verifier loops, or literature-grounded proof workflows—but they share the assumption that mathematical competence improves when subfunctions are explicitly separated. Tool use then becomes a natural extension rather than an auxiliary add-on: symbolic manipulation, arithmetic, search, and simulation are delegated to external executors while the LLM manages strategy, coordination, and repair (Luo et al., 23 Dec 2025).

A second recurring premise is that mathematical work is stateful. Agentic systems therefore preserve evolving condition sets, retrieved examples, proof drafts, code artifacts, or feedback histories, rather than treating each prompt as independent. This statefulness is especially salient in long-horizon settings such as mathematical modeling and research-level proof search, where intermediate failures, partial hypotheses, and dependency structure materially affect downstream reasoning (Lei et al., 2024, Zhao et al., 20 May 2026).

2. Early architectures and canonical workflows

Early agentic math systems established several architectural motifs that remained influential. MathChat introduced a conversational framework with an LLM assistant and a user proxy agent that extracts code, executes it sequentially, returns results or errors, retains previously valid code, and terminates when a final answer is emitted in \boxed{} format (Wu et al., 2023). On all level-5 problems from the MATH test set, excluding Geometry, MathChat reached 44.71% total accuracy, compared with 39.60% for Program Synthesis and 37.67% for Program of Thoughts, which the paper summarized as a roughly 6% improvement over previous tool-using prompting methods (Wu et al., 2023).
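The conversational loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not MathChat's actual implementation: the names `proxy_loop`, `run_code`, and `scripted_assistant` are hypothetical, the assistant is a scripted stand-in for an LLM, and `exec` is used in place of a real sandbox. Sharing one namespace across turns approximates the idea of retaining previously valid code.

```python
import contextlib
import io
import re

FENCE = "`" * 3  # built at runtime so this example can itself sit inside a fence
CODE_RE = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)
BOXED_RE = re.compile(r"\\boxed\{([^{}]*)\}")


def run_code(code: str, namespace: dict) -> str:
    """Execute code in a shared namespace and return stdout or the error text."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)  # illustration only; not a real sandbox
    except Exception as exc:
        return f"Error: {type(exc).__name__}: {exc}"
    return buffer.getvalue().strip()


def proxy_loop(assistant, problem: str, max_turns: int = 8):
    """Relay messages until the assistant emits a \\boxed{...} answer."""
    namespace: dict = {}  # names defined by earlier code stay available
    message = problem
    for _ in range(max_turns):
        reply = assistant(message)
        boxed = BOXED_RE.search(reply)
        if boxed:
            return boxed.group(1)
        blocks = CODE_RE.findall(reply)
        if blocks:
            # Run the blocks in order and send results (or errors) back.
            message = "\n".join(run_code(block, namespace) for block in blocks)
        else:
            message = "Continue. Put the final answer in \\boxed{}."
    return None


def scripted_assistant():
    """Hypothetical stand-in for an LLM: replays three fixed replies."""
    replies = iter([
        FENCE + "python\ntotal = sum(range(1, 101))\nprint(total)\n" + FENCE,
        FENCE + "python\nprint(total // 2)\n" + FENCE,
        "So the answer is \\boxed{2525}.",
    ])
    return lambda message: next(replies)


answer = proxy_loop(scripted_assistant(), "What is half of 1 + 2 + ... + 100?")
```

The second scripted reply reuses `total` from the first, which only works because the proxy keeps execution state between turns.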

The PRER framework formalized mathematical solving as a zero-shot agent workflow, Planner–Reasoner–Executor–Reflector, and instantiated two MathAgents with different action granularities: MathAgent-M adapts its actions to the LLM, while MathAgent-H aligns its actions with human problem-solving practice (Liao et al., 2023). Reported gains relative to GPT-4 were substantial: from 53.9% to 66.2% on MiniF2F, from 49.8% to 59.0% on MATH, and from 23.2% to 35.4% on level-5 MATH (Liao et al., 2023). This design made explicit that planning, execution, and reflection can be treated as distinct computational roles rather than latent behaviors inside one completion.

MACM sharpened the decomposition further by shifting from thought search to “condition mining.” Its three roles—Thinker, Judge, and Executor—first extract known conditions and the objective, then iteratively enlarge a validated condition set until the Judge determines that the objective is reachable, with a maximum of five iterations (Lei et al., 2024). On the hardest level-5 MATH problems, GPT-4 Turbo improved from 54.68% to 76.73%; on the overall MATH benchmark, category-wise accuracies included 96.07% in Algebra, 97.95% in Counting and Probability, 62.74% in Geometry, and 98.04% in Number Theory, with the biggest improvement reported on Number Theory at +23.53% (Lei et al., 2024). The paper’s central contrast with Tree of Thought and Graph of Thought was that MACM is condition-centric and prompt-reusable across domains rather than tied to task-specific search structures.
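A minimal sketch of the condition-mining loop follows, assuming the three roles can be modeled as plain callables. The function `condition_mining` and the toy Thinker, Judge, and Executor stubs are hypothetical; in MACM these roles are LLM prompts, and only the iteration cap of five comes from the description above.

```python
from typing import Callable, List, Optional


def condition_mining(
    problem: str,
    initial_conditions: List[str],
    thinker: Callable[[str, List[str]], str],
    judge_valid: Callable[[str, List[str], str], bool],
    judge_reachable: Callable[[str, List[str]], bool],
    executor: Callable[[str, List[str]], str],
    max_iterations: int = 5,
) -> Optional[str]:
    """Grow a validated condition set until the objective is judged reachable."""
    conditions = list(initial_conditions)
    for _ in range(max_iterations):
        if judge_reachable(problem, conditions):
            return executor(problem, conditions)
        candidate = thinker(problem, conditions)
        # Only conditions the Judge accepts enter the validated set.
        if candidate not in conditions and judge_valid(problem, conditions, candidate):
            conditions.append(candidate)
    if judge_reachable(problem, conditions):
        return executor(problem, conditions)
    return None  # budget exhausted without reaching the objective


# Toy stubs (assumptions for illustration): solve x + y = 10, x - y = 2, find x * y.
DERIVABLE = ["2x = 12", "x = 6", "y = 4"]


def toy_thinker(problem, conditions):
    return next(c for c in DERIVABLE if c not in conditions)


def toy_judge_valid(problem, conditions, candidate):
    return candidate in DERIVABLE


def toy_judge_reachable(problem, conditions):
    return "x = 6" in conditions and "y = 4" in conditions


def toy_executor(problem, conditions):
    return str(6 * 4)


result = condition_mining(
    "x + y = 10 and x - y = 2; find x * y",
    ["x + y = 10", "x - y = 2"],
    toy_thinker, toy_judge_valid, toy_judge_reachable, toy_executor,
)
```

The loop never searches over whole solution paths; it only asks whether the current condition set suffices, which is the contrast with tree-structured search drawn in the paragraph above.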

MathLearner introduced a different axis of agenticity by emulating inductive learning from solved examples. It separates a learning module from an application module, stores modified program-like solutions and feature vectors in a vector database, retrieves by structural features rather than keywords, and reuses retrieved solution ideas when solving new problems (Xie et al., 2024). On 150 precalculus questions from MATH, MathLearner reported 50% global accuracy versus 41.33% for the Chain-of-Thought baseline, a 20.96% profitability gain, 51.55% precision accuracy, and a target achievement rate of 17.54% (Xie et al., 2024). Here the agentic gain comes less from dialogue or planning and more from explicit externalized memory.
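Retrieval by structural features rather than keywords can be illustrated with a small cosine-similarity lookup. This is a sketch under assumptions: the three-dimensional feature vectors, the `SolvedExample` record, and the `retrieve` helper are hypothetical, and a real system would use learned embeddings and a vector database rather than an in-memory list.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class SolvedExample:
    features: Sequence[float]   # hypothetical structural features of the problem
    solution_program: str       # stored program-like solution


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(memory: List[SolvedExample], query: Sequence[float], k: int = 1):
    """Return the k stored examples whose features are closest to the query."""
    ranked = sorted(memory, key=lambda ex: cosine(ex.features, query), reverse=True)
    return ranked[:k]


# Assumed feature layout: [trigonometric, polynomial, complex-number].
memory = [
    SolvedExample([1.0, 0.0, 0.0], "apply angle-addition identity, then simplify"),
    SolvedExample([0.0, 1.0, 0.0], "factor polynomial, then apply Vieta's formulas"),
    SolvedExample([0.0, 0.2, 1.0], "convert to polar form, then use De Moivre"),
]

best = retrieve(memory, [0.9, 0.1, 0.0], k=1)[0]
```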

A more general-purpose agent framework, Infant Agent, showed that similar design principles can transfer from software tasks to olympiad mathematics. Using a brain-level agent for reasoning and a hand-level agent for execution, with the loop Input $\rightarrow$ Reasoning $\rightarrow$ Task $\rightarrow$ Execution $\rightarrow$ Evaluation $\rightarrow$ Summary $\rightarrow$ Stop, it raised GPT-4o’s AIME-2024 accuracy from 13.3% to 37%, matching o1-preview at lower reported cost (Lei et al., 2024). This suggests that the advantages of AgentMath are not restricted to math-specialized architectures; hierarchical delegation and memory compression can also act as a general reasoning scaffold.

3. AgentMath as a tool-augmented training and inference system

The 2025 AgentMath framework places tool use at the center of both training and inference. Its basic protocol alternates <think> ... </think> spans for natural-language reasoning, <code> ... </code> spans for executable code, and <interpreter> ... </interpreter> spans for execution results or errors (Luo et al., 23 Dec 2025). The formalization is agent-environment interaction: for a problem $P$, the state is $s_t = (P, \tau_{t-1})$, where $\tau_{t-1}$ is the trajectory generated so far; the policy samples an action $z_t \sim \pi_\theta(\cdot \mid s_t)$; and if that action is code $c_t$, the environment returns the observation $o_t = \mathcal{E}(c_t)$ (Luo et al., 23 Dec 2025). This representation converts mathematical reasoning from static text generation into a trajectory of thoughts, code actions, and observations.
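The interaction protocol can be made concrete with a short rollout loop. The sketch below is an illustration, not the paper's code: `rollout`, `environment`, and `scripted_policy` are hypothetical names, the explicit "answer" action is an assumed termination signal, and `exec` stands in for a sandboxed interpreter.

```python
import contextlib
import io
from typing import Callable, List, Tuple

# (kind, text); kind is "think", "code", "interpreter", or "answer" (assumed).
Segment = Tuple[str, str]


def environment(code: str) -> str:
    """E(c_t): execute code and return its stdout, or the error message."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})  # illustration only; a real system would use a sandbox
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return buffer.getvalue().strip()


def rollout(problem: str,
            policy: Callable[[str, List[Segment]], Segment],
            max_steps: int = 16) -> List[Segment]:
    """Build a trajectory of thoughts, code actions, and observations."""
    trajectory: List[Segment] = []
    for _ in range(max_steps):
        # The state s_t is the pair (problem, trajectory so far).
        kind, text = policy(problem, trajectory)
        trajectory.append((kind, text))
        if kind == "code":
            trajectory.append(("interpreter", environment(text)))
        elif kind == "answer":
            break
    return trajectory


def scripted_policy(problem: str, trajectory: List[Segment]) -> Segment:
    """Hypothetical stand-in for the learned policy."""
    if not trajectory:
        return ("think", "Counting primes below 100 is easiest by brute force.")
    last_kind, last_text = trajectory[-1]
    if last_kind == "think":
        code = ("print(sum(all(n % d for d in range(2, int(n ** 0.5) + 1)) "
                "for n in range(2, 100)))")
        return ("code", code)
    return ("answer", last_text)  # answer with the interpreter's output


trajectory = rollout("How many primes are below 100?", scripted_policy)
```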

A central innovation is automated synthesis of tool-augmented supervised data from ordinary chain-of-thought corpora. AgentMath begins from pure-text reasoning data, filters for contamination and difficulty, then applies an injection function that replaces computationally intensive reasoning steps with executable code and simulated interpreter output (Luo et al., 23 Dec 2025). The pipeline subsequently performs format consistency correction, code executability verification, environmental feedback alignment using real interpreter outputs, and tool-usage rationality assessment. Failed executions are not discarded; instead, they are transformed into self-correction trajectories in which the teacher diagnoses the error, repairs the code, re-executes it, and resumes reasoning (Luo et al., 23 Dec 2025). The final synthetic SFT corpus contains 316k samples, with an average of 8.3 tool calls per sample and an average length of 16.9K tokens (Luo et al., 23 Dec 2025).

The reinforcement-learning stage treats tool use as a learnable policy rather than a fixed prompt convention. AgentMath uses Group Relative Policy Optimization and combines answer correctness with a tool-use reward that is only applied when the final answer is correct (Luo et al., 23 Dec 2025). During both SFT and RL, interpreter tokens are masked so that the model is trained on its own thoughts and code rather than on deterministic environment outputs (Luo et al., 23 Dec 2025). The paper reports emergent code self-correction during RL, together with increases in trajectory length, tool calls, and code ratio as training progresses, indicating that the policy learns not merely to call tools, but to incorporate execution feedback into later decisions (Luo et al., 23 Dec 2025).
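The three ingredients named above (a correctness-gated tool reward, group-relative normalization, and interpreter-token masking) can be sketched as follows. The linear, capped bonus and its constants are assumptions made for illustration, since the exact reward shape is not given here; the normalization is the usual group mean and standard deviation form of GRPO, and all function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def trajectory_reward(is_correct: bool, tool_calls: int,
                      tool_bonus: float = 0.1,
                      max_rewarded_calls: int = 4) -> float:
    """Correctness reward plus a tool-use bonus that is gated on correctness.

    The bonus shape (linear, capped) is an assumption, not the paper's formula.
    """
    if not is_correct:
        return 0.0  # wrong answers earn no tool-use credit
    capped = min(tool_calls, max_rewarded_calls)
    return 1.0 + tool_bonus * capped / max_rewarded_calls


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of rollouts for the same problem."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def loss_mask(kinds: Sequence[str], lengths: Sequence[int]) -> List[int]:
    """Per-token mask: 0 for interpreter output, 1 for model-written tokens."""
    mask: List[int] = []
    for kind, length in zip(kinds, lengths):
        mask.extend([0 if kind == "interpreter" else 1] * length)
    return mask


group = [trajectory_reward(True, 2), trajectory_reward(True, 0),
         trajectory_reward(False, 3), trajectory_reward(False, 0)]
advantages = group_relative_advantages(group)
mask = loss_mask(["think", "code", "interpreter", "think"], [3, 2, 2, 1])
```

Gating the bonus on correctness means a rollout cannot raise its reward by calling tools gratuitously while still answering wrongly.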

The system contribution is also unusually prominent. Tool-heavy mathematical RL produces average training trajectories around 24k tokens, about 27 tool invocations per problem at temperature 1.0, and in some cases up to 96k tokens and 96 tool calls (Luo et al., 23 Dec 2025). To manage this, AgentMath introduces request-level asynchronous rollout scheduling, agentic partial rollout, and prefix-aware weighted load balancing. The paper reports that distributed sandboxing reduced tool-call latency from 175 s to 1.2 s, that partial rollout alone gave about 2.2–2.5× speedup, and that the full system achieved 4–5× speedup, making RL on ultra-long sequences feasible (Luo et al., 23 Dec 2025).
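The motivation for request-level asynchronous scheduling can be shown with a toy `asyncio` example: each request advances through its own tool calls independently, so a short request is not held back by a long one in the same batch. This sketch illustrates only the general idea; it does not reproduce the paper's scheduler, partial rollout, or load balancer, and `run_request`, `run_all`, and the latencies are hypothetical.

```python
import asyncio
from typing import Dict, List, Tuple

completion_order: List[int] = []  # records which request finished first


async def run_request(request_id: int,
                      tool_latencies: List[float]) -> Tuple[int, int]:
    """Simulate one rollout whose tool calls each take some wall-clock time."""
    calls = 0
    for latency in tool_latencies:
        await asyncio.sleep(latency)  # stand-in for a remote sandbox call
        calls += 1
    completion_order.append(request_id)
    return request_id, calls


async def run_all(requests: Dict[int, List[float]]) -> List[Tuple[int, int]]:
    """Run all requests concurrently; results come back in submission order."""
    tasks = (run_request(rid, latencies) for rid, latencies in requests.items())
    return await asyncio.gather(*tasks)


results = asyncio.run(run_all({
    0: [0.05, 0.05, 0.05],  # long request: three slow tool calls
    1: [0.01],              # short request: one fast tool call
    2: [0.02, 0.02],
}))
```

Total wall-clock time here is governed by the slowest single request rather than by the sum of all tool latencies, which is the property that matters when trajectories differ widely in length.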

Empirically, the framework is positioned as state of the art on competition-style benchmarks under avg@32 evaluation. AgentMath-30B-A3B attains 90.6% on AIME24, 86.4% on AIME25, and 73.8% on HMMT25; AgentMath-8B attains 89.8%, 84.7%, and 71.3%; and the SFT-only AgentMath-235B-A22B reaches 93.4%, 90.8%, and 81.7% (Luo et al., 23 Dec 2025). The same paper reports that tool-augmented RL is markedly more training-efficient than a text-only RL baseline: after roughly 400 steps, the tool-augmented model reached 76.2% on AIME24 and 67.5% on AIME25, whereas the text-only baseline required about 1600 steps to reach 68.7% and 57.5% (Luo et al., 23 Dec 2025). In this formulation, AgentMath is not simply “LLM plus calculator”; it is an end-to-end training recipe for when, how, and why to externalize computation.
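For readers unfamiliar with the metric, avg@k is sketched below under the common reading that each problem is sampled k times, per-problem accuracy is the fraction of correct samples, and the benchmark score is the mean over problems. That reading is an assumption about the evaluation protocol, and `avg_at_k` is a hypothetical helper.

```python
from typing import List, Sequence


def avg_at_k(samples_correct: Sequence[Sequence[bool]]) -> float:
    """Mean over problems of the fraction of correct samples per problem."""
    per_problem: List[float] = [sum(s) / len(s) for s in samples_correct]
    return sum(per_problem) / len(per_problem)


# Two toy problems with 32 samples each: 24/32 correct and 32/32 correct.
toy_results = [
    [True] * 24 + [False] * 8,
    [True] * 32,
]
score = avg_at_k(toy_results)
```

Unlike pass@k, which credits a problem if any of the k samples is correct, this average penalizes inconsistency across samples.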

4. Expansion beyond answer generation

As the literature matured, AgentMath techniques were extended from answer production to process evaluation. StepMathAgent was proposed specifically to address the inadequacy of answer-only grading on proof and open-ended tasks (Yang et al., 13 Mar 2025). It decomposes a solution into logical steps, classifies each step as correct, incorrect, or correct-but-meaningless, aggregates the step scores into a final score, and then constructs a Tree-of-Error that traces causal error chains (Yang et al., 13 Mar 2025). StepMathBench contains 1,000 step-divided process evaluation instances derived from 200 high-quality math problems, and inter-annotator agreement reaches 95% (Yang et al., 13 Mar 2025). This line of work shifts the unit of analysis from final answers to reasoning trajectories themselves.
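The step-level pipeline can be sketched with a small data structure. Everything specific here is an assumption for illustration: the `Step` record, the scoring rule (fraction of steps labeled correct), and the way error chains are derived from declared step dependencies are placeholders, not StepMathAgent's actual scoring or tree construction.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Step:
    index: int
    label: str  # "correct", "incorrect", or "meaningless" (correct but unhelpful)
    depends_on: List[int] = field(default_factory=list)


def final_score(steps: List[Step]) -> float:
    """Placeholder aggregation: fraction of steps labeled correct."""
    return sum(s.label == "correct" for s in steps) / len(steps)


def error_tree(steps: List[Step]) -> Dict[int, List[int]]:
    """Map each incorrect step to the incorrect steps that build on it."""
    incorrect = {s.index for s in steps if s.label == "incorrect"}
    tree: Dict[int, List[int]] = {}
    for s in steps:
        if s.index not in incorrect:
            continue
        tree.setdefault(s.index, [])
        for parent in s.depends_on:
            if parent in incorrect:
                tree.setdefault(parent, []).append(s.index)
    return tree


def root_causes(steps: List[Step]) -> List[int]:
    """Incorrect steps that do not depend on any other incorrect step."""
    incorrect = {s.index for s in steps if s.label == "incorrect"}
    return [s.index for s in steps
            if s.index in incorrect and not any(p in incorrect for p in s.depends_on)]


solution = [
    Step(1, "correct"),
    Step(2, "incorrect", depends_on=[1]),
    Step(3, "meaningless", depends_on=[1]),
    Step(4, "incorrect", depends_on=[2]),
    Step(5, "correct", depends_on=[1]),
]
score = final_score(solution)
tree = error_tree(solution)
roots = root_causes(solution)
```

In this toy solution, step 4 is wrong only because it builds on step 2, so a grader that reports step 2 as the root cause gives more actionable feedback than one that lists both errors independently.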

A second expansion concerns multimodal educational diagnosis. The multimodal MathAgent framework addresses the problem of identifying the first wrong step in a student solution and classifying the error as VIS, CAL, REAS, KNOW, or MIS (Yan et al., 23 Mar 2025). Its three sequential phases—Image-Text Consistency Validator, Visual Semantic Interpreter, and Integrative Error Analyzer—separate cross-modal grounding from diagnostic reasoning (Yan et al., 23 Mar 2025). Evaluated on 2,500 real multimodal math questions from an educational platform, the framework improved STEP accuracy by about 5.2% and CATE accuracy by about 3.2% on average across tested MLLMs, and it was reported as deployed in a platform serving over one million K-12 students (Yan et al., 23 Mar 2025). This shows that agentic decomposition can target not only solving problems, but also localizing human mistakes.

Agentic methods also broadened the target task from formalized reasoning to open-ended mathematical modeling. MM-Agent formalizes real-world mathematical modeling as a four-stage workflow—open-ended problem analysis, structured model formulation, computational problem solving, and report generation—and evaluates it on MM-Bench, a benchmark of 111 MCM/ICM problems from 2000 to 2025 spanning ten domains (Liu et al., 20 May 2025). The framework introduces the Hierarchical Mathematical Modeling Library and a hierarchical actor-critic modeling optimization loop, and it reports an 11.88% improvement over human expert solutions while requiring only 15 minutes and \$0.88 per task using GPT-4o (Liu et al., 20 May 2025). Under official contest protocols, MM-Agent also assisted two undergraduate teams in winning the Finalist Award, defined as the top 2.0% among 27,456 teams in MCM/ICM 2025 (Liu et al., 20 May 2025). Here AgentMath becomes a copilot for abstraction, not just symbolic derivation.

At the research frontier, two systems push agentic mathematics into open-ended theorem work. The AI co-mathematician provides an asynchronous, stateful workspace with a project coordinator, workstream coordinators, specialized sub-agents, progressive disclosure in the interface, explicit tracking of failed hypotheses, and outputs in native mathematical artifacts such as LaTeX working papers (Zheng et al., 7 May 2026). It reportedly scored 48% on FrontierMath Tier 4, with 23 correct out of 48, which the paper described as a new high score among evaluated AI systems (Zheng et al., 7 May 2026). RMA, by contrast, focuses specifically on research-level proof solving through specialized modules for problem analysis, literature search and understanding, fair comparison, knowledge-bank construction, and proof verification, coordinated by initializer, proposer, and verifier agents through shared structured memory (Zhao et al., 20 May 2026). On the First Proof benchmark of ten expert-contributed research problems, RMA solved eight out of ten and outperformed GPT-5.2R and Aletheia in expert evaluation (Zhao et al., 20 May 2026). These systems indicate that, at the high end, AgentMath is increasingly about iterative collaboration, provenance, and literature grounding rather than benchmark-style answer extraction.

5. Data generation, benchmarks, and the empirical landscape

A major branch of the field treats AgentMath not as a solver but as a generator of training data and benchmarks. AgenticMath proposes a four-stage pipeline—Seed Question Filter, Agentic Question Rephrase, Answer Augment, and Question and Answer Evaluation—to build higher-quality mathematical question–answer pairs for supervised fine-tuning (Liu et al., 22 Oct 2025). The paper argues that question quality is itself a bottleneck and reports that fine-tuning 3B–8B models on only 30K–60K AgenticMath samples can achieve competitive or superior results relative to baselines trained on 400K or 2.3M samples (Liu et al., 22 Oct 2025). A related but structurally different synthesis framework formulates problem generation as optimization over a constraint graph: a Legislator composed of Proposer, Critic, and Moderator evolves structured blueprints, and an Executor instantiates them as natural-language problems (Yu et al., 13 Apr 2026). With only 1K synthesized samples, models fine-tuned on this data reportedly outperformed LIMO and s1K across eight benchmarks, with especially strong gains on harder settings (Yu et al., 13 Apr 2026).

Code2Math extends this generative perspective by asking whether code agents can autonomously evolve seed math problems into new ones that are both solvable and harder (Guo et al., 3 Mar 2026). Its three-agent pipeline—Evolution Agent, Solvability Verification Agent, and Difficulty Verification Agent—uses exploratory code execution as a mathematical environment and measures both certified solvability and increased burden of discovery (Guo et al., 3 Mar 2026). The paper reports that evolved problems are structurally distinct, typically require higher token consumption, and are harder even for strong solvers such as GPT-5.2-High (Guo et al., 3 Mar 2026). In this subliterature, AgentMath acts as a benchmark constructor and difficulty amplifier.

Benchmarks have also become diagnostic instruments for understanding failure modes. AgentCoMa constructs compositional tasks requiring one commonsense step and one math step in realistic scenarios, and finds that models that solve both steps in isolation still suffer an average performance drop of about 29–30% when the two are combined (Alazraki et al., 27 Aug 2025). Non-expert human annotators, by contrast, solve the compositional questions and their substeps with similarly high accuracy (Alazraki et al., 27 Aug 2025). This result is significant because it isolates a form of mixed-type compositional brittleness that is largely invisible on pure-math or pure-commonsense benchmarks.
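The kind of drop reported above can be computed from per-question records. The sketch assumes one plausible definition (the share of questions whose two steps are both solved in isolation, minus the share solved in composed form); the benchmark's exact metric may differ, and `compositionality_gap` and the toy records are hypothetical.

```python
from typing import List, Tuple

# (commonsense step correct, math step correct, composed question correct)
Record = Tuple[bool, bool, bool]


def compositionality_gap(records: List[Record]) -> float:
    """Isolated accuracy (both steps right) minus composed accuracy."""
    both_isolated = sum(1 for a, b, _ in records if a and b) / len(records)
    composed = sum(1 for _, _, c in records if c) / len(records)
    return both_isolated - composed


# Toy data: 8 of 10 questions have both steps solved in isolation,
# but only 5 of 10 composed questions are solved.
toy_records: List[Record] = (
    [(True, True, True)] * 5
    + [(True, True, False)] * 3
    + [(True, False, False)] * 2
)
gap = compositionality_gap(toy_records)
```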

The following table summarizes representative systems and reported results across the AgentMath landscape.

System	Core mechanism	Representative reported result
MathChat (Wu et al., 2023)	Conversational LLM + user proxy with sequential code execution	44.71% on level-5 MATH (excluding Geometry)
MACM (Lei et al., 2024)	Thinker–Judge–Executor condition mining	54.68% $\rightarrow$ 76.73% on level-5 MATH
AgentMath (Luo et al., 23 Dec 2025)	Tool-augmented SFT + agentic RL with code execution	90.6% / 86.4% / 73.8% on AIME24 / AIME25 / HMMT25
AI co-mathematician (Zheng et al., 7 May 2026)	Asynchronous, stateful mathematical workspace	48% on FrontierMath Tier 4
RMA (Zhao et al., 20 May 2026)	Initializer–proposer–verifier workflow with shared structured memory	8/10 on First Proof

Taken together, these results show that the field has diversified along three empirical axes: stronger solvers, richer evaluators, and more controlled data-generation pipelines. A plausible implication is that “AgentMath” now names a research program as much as a single method: mathematical performance gains increasingly depend on workflow design, data curation, and evaluation protocols, not only on raw model scale.

6. Limitations, failure modes, and open problems

Despite the reported gains, the literature is explicit about several limitations. Inference cost and latency remain persistent concerns. MACM requires multiple LLM invocations and is therefore slower and more expensive than single-pass prompting (Lei et al., 2024). AgentMath requires ultra-long trajectories, many tool calls, and distributed execution infrastructure to sustain training efficiency (Luo et al., 23 Dec 2025). MM-Agent likewise notes the computational cost of large agentic workflows in real-world modeling (Liu et al., 20 May 2025). This suggests that current AgentMath systems often buy controllability at the price of simplicity and compute cost.

Geometry, multimodal grounding, and heterogeneous reasoning composition remain difficult. MACM reports that geometry is still a weak point because GPT-4 Turbo struggles to understand relations among figures and to design code accordingly (Lei et al., 2024). The multimodal MathAgent paper similarly shows that generic captions are inadequate for mathematical diagrams and that specialized visual interpretation is the most important module in ablation (Yan et al., 23 Mar 2025). AgentCoMa further shows that models can possess both the commonsense and arithmetic subskills yet fail when the two must be integrated, with attention and neuron analyses indicating ineffective mixed-type circuit activation (Alazraki et al., 27 Aug 2025). A plausible implication is that future AgentMath systems will need finer-grained routing across reasoning modalities, not just more tool use.

Evaluation is another unresolved problem. StepMathAgent was introduced precisely because final-answer metrics are inaccurate and uninterpretable for proof or open-ended tasks (Yang et al., 13 Mar 2025). Code2Math acknowledges that its evolved problems were not manually audited by human experts and rely on internal verification plus GPT-5.2-High as an external judge (Guo et al., 3 Mar 2026). RMA requires expert mathematician assessment because automatic checking is insufficient for research-level proofs, yet its benchmark contains only ten problems and therefore cannot provide broad statistical coverage (Zhao et al., 20 May 2026). The field is moving toward richer process-based and expert-based evaluation, but no uniform standard has emerged.

Fairness, leakage, and provenance also matter more as tasks become research-like. RMA explicitly enforces literature filtering, context isolation, sandboxed execution, and temporal control to prevent leakage from known benchmark solutions (Zhao et al., 20 May 2026). The AI co-mathematician emphasizes version history, provenance annotations, review cycles, and preservation of failed hypotheses as first-class artifacts (Zheng et al., 7 May 2026). These choices indicate that, at research scale, a mathematical agent is judged not only by whether it is right, but by whether its route to the answer is inspectable and epistemically disciplined.

The current trajectory of the field points toward longer-horizon, more stateful systems with tighter integration of tools, retrieval, verification, and human collaboration. The available evidence suggests that the most robust gains do not arise from any single module—planner, code interpreter, retriever, or verifier—but from their coordinated interaction within a structured workflow (Zhao et al., 20 May 2026, Luo et al., 23 Dec 2025). In that sense, AgentMath is best understood not as a solitary architecture but as a general design principle for turning mathematical reasoning into a managed computational process.