Agentic mathematical benchmarking for instruction-tuned models

Develop and validate an agentic mathematical benchmark that evaluates the agentic capabilities of instruction-tuned large language models in mathematics, covering planning, acting, and feedback-oriented reasoning.

Background

In assessing the agentic capabilities of instruction-tuned models, the paper evaluates Deep Research, code, and tool-use tasks. For mathematics, however, the authors note the absence of an appropriate agentic benchmark that measures capabilities beyond final-answer accuracy, such as planning and trajectory-level verification.

The lack of a standardized agentic mathematics benchmark for instruction-tuned models is identified as a field-wide gap that hinders rigorous evaluation of native agentic behaviors in mathematical reasoning settings.
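To make the gap concrete, the sketch below shows one possible shape for a trajectory-level benchmark item and scorer that looks beyond final-answer accuracy. It is purely illustrative and not taken from the cited paper: the item fields (problem, plan_checkpoints, final_answer), the step kinds ("plan", "tool_call", "feedback", "answer"), and the scoring rules are all assumptions about what evaluating planning, acting, and feedback-oriented reasoning might require.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical schema for a trajectory-level agentic math benchmark item.
# Nothing here is drawn from the cited paper; it only illustrates what a
# "beyond final-answer accuracy" evaluation might record and score.

@dataclass
class Step:
    kind: str                          # "plan", "tool_call", "feedback", or "answer"
    content: str                       # free-form text or tool invocation
    tool_output: Optional[str] = None  # observed result for tool calls

@dataclass
class BenchmarkItem:
    problem: str
    plan_checkpoints: List[str]        # sub-goals a sound plan should mention
    final_answer: str                  # reference answer for the end state

def score_trajectory(item: BenchmarkItem, steps: List[Step]) -> dict:
    """Score planning coverage, acting, feedback use, and final accuracy."""
    # Planning: fraction of reference checkpoints mentioned in plan steps.
    plan_text = " ".join(s.content.lower() for s in steps if s.kind == "plan")
    covered = sum(1 for cp in item.plan_checkpoints if cp.lower() in plan_text)
    planning = covered / max(len(item.plan_checkpoints), 1)

    # Acting: fraction of tool calls that actually produced an observation.
    tool_calls = [s for s in steps if s.kind == "tool_call"]
    acting = sum(1 for s in tool_calls if s.tool_output is not None) / max(len(tool_calls), 1)

    # Feedback: did the trajectory contain any explicit self-check step?
    feedback = 1.0 if any(s.kind == "feedback" for s in steps) else 0.0

    # Accuracy: exact match of the last answer step against the reference.
    answers = [s.content.strip() for s in steps if s.kind == "answer"]
    accuracy = 1.0 if answers and answers[-1] == item.final_answer.strip() else 0.0

    return {"planning": planning, "acting": acting,
            "feedback": feedback, "accuracy": accuracy}

# Example usage with a toy problem and a hand-written trajectory.
item = BenchmarkItem(
    problem="Compute the sum of the first 100 positive integers.",
    plan_checkpoints=["pairing argument", "n(n+1)/2"],
    final_answer="5050",
)
steps = [
    Step("plan", "Use the pairing argument and the formula n(n+1)/2."),
    Step("tool_call", "python: 100*101//2", tool_output="5050"),
    Step("feedback", "Tool output matches the closed-form estimate."),
    Step("answer", "5050"),
]
print(score_trajectory(item, steps))  # all four components score 1.0 here
```

A real benchmark would need verified reference plans, tool-execution environments, and more robust matching than the substring and exact-match checks used above; the sketch only indicates the kind of trajectory-level signals such a benchmark could report alongside accuracy.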

References

To date, no suitable agentic mathematical benchmark exists to evaluate the corresponding capabilities of instruction-tuned models; this remains an open challenge in the field.

Youtu-LLM: Unlocking the Native Agentic Potential for Lightweight Large Language Models (2512.24618, Lu et al., 31 Dec 2025), Section 5.2: Agentic Evaluation of Instruct Model.