Agentic mathematical benchmarking for instruction-tuned models
Develop and validate an agentic mathematical benchmark that effectively evaluates the agentic capabilities of instruction-tuned large language models in mathematics, covering the planning, acting, and feedback-oriented reasoning components specific to mathematical problem solving.
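To make the three evaluation components concrete, the sketch below shows one possible shape for such a benchmark harness: each task is run through an explicit plan phase, an act phase, and a bounded feedback loop, and the benchmark scores the fraction of tasks solved. All names here (`toy_model`, `run_episode`, `evaluate`, the `PLAN`/`ACT`/`REVISE` prompt prefixes) are illustrative assumptions, not an interface defined in the cited paper; `toy_model` is a trivial stand-in for an instruction-tuned LLM.

```python
from dataclasses import dataclass, field

@dataclass
class Episode:
    """Trace of one plan-act-feedback run on a single math task."""
    problem: str
    plan: str = ""
    steps: list = field(default_factory=list)
    answer: str = ""

def toy_model(prompt: str) -> str:
    # Hypothetical stand-in for an instruction-tuned LLM.
    # It only "solves" addition of the integers in the prompt.
    if prompt.startswith("PLAN"):
        return "1) add the two numbers"
    if prompt.startswith("ACT"):
        nums = [int(t) for t in prompt.split() if t.lstrip("-").isdigit()]
        return str(sum(nums))
    if prompt.startswith("REVISE"):
        return prompt.rsplit(" ", 1)[-1]  # keep the previous answer
    return ""

def run_episode(model, problem: str, target: str, max_rounds: int = 2) -> Episode:
    ep = Episode(problem)
    ep.plan = model(f"PLAN: {problem}")       # planning component
    ep.answer = model(f"ACT: {problem}")      # acting component
    for _ in range(max_rounds):               # feedback-oriented revision
        if ep.answer == target:
            break
        ep.answer = model(f"REVISE: {problem} previous {ep.answer}")
    ep.steps.append(ep.answer)
    return ep

def evaluate(model, tasks) -> float:
    # Benchmark score: fraction of (problem, target) pairs solved
    # after the plan-act-feedback loop.
    solved = sum(run_episode(model, p, t).answer == t for p, t in tasks)
    return solved / len(tasks)
```

For example, `evaluate(toy_model, [("2 3", "5"), ("10 -4", "6")])` returns `1.0` for this toy model. A real benchmark would replace `toy_model` with an actual instruction-tuned model call and score the recorded plan and revision steps as well, not only the final answer.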
Sponsor
References
To date, no suitable agentic mathematical benchmark exists to effectively evaluate the corresponding capabilities of instruction-tuned models. This remains an open challenge in the field.
— Youtu-LLM: Unlocking the Native Agentic Potential for Lightweight Large Language Models
(2512.24618, Lu et al., 31 Dec 2025), Section 5.2, "Agentic Evaluation of Instruct Model"