- The paper presents rStar2-Agent-14B, a 14B-parameter math reasoning model using agentic reinforcement learning to achieve state-of-the-art performance on math benchmarks.
- It introduces the GRPO-RoC algorithm, which filters environment-induced tool errors out of positively rewarded trajectories, and a scalable RL infrastructure that supports up to 45K concurrent code executions per step.
- Through multi-turn rollouts, code-based verification, and self-reflection on execution feedback, the model also generalizes strongly beyond mathematics.
rStar2-Agent: Agentic Reasoning via Scalable Reinforcement Learning in Code Environments
Introduction
The rStar2-Agent technical report presents a 14B-parameter math reasoning model trained with agentic reinforcement learning (RL) in a Python code environment. The model demonstrates advanced cognitive behaviors, including judicious tool invocation, code-based verification, and self-reflection on execution feedback. The work addresses the limitations of long Chain-of-Thought (CoT) reasoning, which often fails to detect or correct subtle intermediate errors, by incentivizing models to "think smarter" through interaction with external tools and adaptive reasoning. The report details three core innovations: a scalable RL infrastructure, the GRPO-RoC algorithm for robust agentic RL, and a compute-efficient training recipe. The resulting model achieves state-of-the-art performance on competitive math benchmarks, surpassing much larger models with significantly shorter responses and strong generalization to other reasoning domains.
Figure 1: rStar2-Agent architecture, illustrating agentic reasoning with tool use and self-reflection in a code environment.
Agentic Reinforcement Learning in Code Environments
rStar2-Agent leverages a multi-turn rollout mechanism, where the model interacts with a Python code execution environment. Each reasoning trajectory consists of alternating assistant and user turns, with the model generating reasoning steps, invoking tools via structured JSON function calls, and receiving execution feedback. The prompt template (Figure 2) enforces separation between reasoning, answer, and tool usage, facilitating clean extraction and verification.
Figure 2: Prompt template specifying reasoning, answer, and tool call format for agentic RL training.
The tool call interface is standardized, supporting extensibility and alignment with LLM APIs. Execution feedback is wrapped in dedicated tags and includes successful outputs, errors, or timeouts, enabling the model to adapt its reasoning based on environment responses.
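To make the turn structure concrete, here is a minimal sketch of a tool call and its wrapped feedback. The tag names, the `python_interpreter` function name, and the helper names are illustrative assumptions; the report's actual format is defined by its own prompt template (Figure 2).

```python
import json

# Hypothetical tag names for illustration; the report defines its own tags.
TOOL_RESULT_OPEN, TOOL_RESULT_CLOSE = "<tool_result>", "</tool_result>"

def format_tool_call(code: str) -> str:
    """Serialize a Python-interpreter call as a structured JSON function call."""
    call = {"name": "python_interpreter", "arguments": {"code": code}}
    return json.dumps(call)

def wrap_feedback(stdout: str, error: str | None, timed_out: bool) -> str:
    """Wrap execution feedback (success, error, or timeout) in dedicated tags
    so the model can condition its next reasoning turn on the outcome."""
    if timed_out:
        body = "Execution timed out."
    elif error:
        body = f"Error: {error}"
    else:
        body = stdout
    return f"{TOOL_RESULT_OPEN}\n{body}\n{TOOL_RESULT_CLOSE}"

# Example: one assistant tool call and the environment's reply turn.
print(format_tool_call("print(2**10)"))
print(wrap_feedback(stdout="1024", error=None, timed_out=False))
```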
GRPO-RoC: Robust Agentic RL under Noisy Environments
The core RL algorithm is Group Relative Policy Optimization with Resample-on-Correct (GRPO-RoC). Standard outcome-only reward schemes, which assign binary rewards based on final answer correctness, are susceptible to environment-induced noise: trajectories with intermediate tool errors may still receive positive rewards, leading to lengthy, low-quality reasoning. GRPO-RoC addresses this by oversampling rollouts and asymmetrically filtering them: negative samples are preserved for diversity, while positive samples are downsampled to prioritize those with minimal tool errors and formatting violations. This strategy avoids reward hacking and stabilizes training without explicit step-level penalties.
Figure 3: GRPO-RoC reduces tool call errors in positively rewarded trajectories compared to naive GRPO, improving reasoning quality.
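A minimal sketch of the resample-on-correct idea follows, assuming each rollout carries an outcome reward plus counts of tool errors and format violations. The positive/negative split, the quality ordering, and the advantage normalization shown here are illustrative simplifications, not the report's exact procedure or hyperparameters.

```python
from dataclasses import dataclass
import random

@dataclass
class Rollout:
    reward: float           # outcome-only reward: 1.0 if final answer correct, else 0.0
    tool_errors: int        # number of failed tool calls in the trajectory
    format_violations: int  # number of formatting violations (e.g. malformed tool calls)

def resample_on_correct(oversampled: list[Rollout], group_size: int) -> list[Rollout]:
    """Asymmetric downsampling sketch: keep negatives uniformly for diversity,
    but prefer positive rollouts with the fewest tool errors and format issues."""
    positives = [r for r in oversampled if r.reward > 0]
    negatives = [r for r in oversampled if r.reward == 0]

    # Assumption: roughly preserve the positive/negative ratio of the oversampled group.
    n_pos = round(group_size * len(positives) / len(oversampled)) if oversampled else 0
    n_pos = min(n_pos, len(positives))
    n_neg = min(group_size - n_pos, len(negatives))

    positives.sort(key=lambda r: (r.tool_errors, r.format_violations))
    return positives[:n_pos] + random.sample(negatives, n_neg)

def group_relative_advantages(group: list[Rollout]) -> list[float]:
    """GRPO-style advantage: reward normalized within the retained group."""
    rewards = [r.reward for r in group]
    mean = sum(rewards) / len(rewards)
    std = (sum((x - mean) ** 2 for x in rewards) / len(rewards)) ** 0.5 or 1.0
    return [(x - mean) / std for x in rewards]
```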
Scalable RL Infrastructure
High-Throughput Code Execution Environment
Agentic RL requires infrastructure capable of handling tens of thousands of concurrent tool calls per training step. rStar2-Agent implements an isolated, distributed code execution service, with centralized task queues and batched dispatch across CPU cores. This design achieves reliable execution of up to 45K tool calls per step with sub-second latency, preventing rollout bottlenecks and ensuring safety against unpredictable LLM-generated code.
Figure 4: Code environment scalability, supporting 45K concurrent tool calls per step with low latency.
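The sketch below illustrates the batched-dispatch idea with a local worker pool and per-call timeouts. The real service is distributed, isolated, and centrally queued; the names (`run_snippet`, `execute_batch`, `TIMEOUT_S`) and the in-process `exec` are assumptions for illustration only.

```python
import concurrent.futures as cf
import contextlib
import io

TIMEOUT_S = 5  # illustrative per-call budget

def run_snippet(code: str) -> dict:
    """Execute one LLM-generated snippet in a worker process and capture output.
    The production service runs this inside an isolated sandbox, not bare exec."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, {})  # fresh namespace; sandboxing is required in practice
        return {"ok": True, "stdout": buf.getvalue()}
    except Exception as e:
        return {"ok": False, "error": f"{type(e).__name__}: {e}"}

def execute_batch(snippets: list[str], max_workers: int = 64) -> list[dict]:
    """Dispatch a batch of tool calls across CPU workers with per-call timeouts.
    A production service would hard-kill hung workers instead of waiting."""
    with cf.ProcessPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_snippet, s) for s in snippets]
        results = []
        for fut in futures:
            try:
                results.append(fut.result(timeout=TIMEOUT_S))
            except cf.TimeoutError:
                results.append({"ok": False, "error": "timeout"})
    return results

if __name__ == "__main__":
    print(execute_batch(["print(1 + 1)", "raise ValueError('boom')"]))
```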
Dynamic Load-Balanced Rollout Scheduling
Static rollout allocation leads to severe GPU idle time and KV cache overflow due to variable response lengths and multi-turn interactions. rStar2-Agent introduces a dynamic scheduler that assigns rollouts based on available KV cache, dispatches tool calls asynchronously, and balances computation across GPUs. This maximizes resource utilization and eliminates wasted computation.
Figure 5: Dynamic load-balanced scheduler improves GPU utilization and rollout efficiency compared to static allocation.
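As a rough illustration of KV-cache-aware assignment (not the report's actual scheduler), the sketch below greedily places the rollouts with the largest estimated lengths onto the GPUs with the most free cache, deferring anything that does not fit.

```python
import heapq

def assign_rollouts(free_kv_tokens: list[int],
                    rollout_estimates: list[int]) -> dict[int, list[int]]:
    """Return a mapping gpu_id -> list of rollout indices (illustrative sketch)."""
    # Max-heap on free capacity (negate because heapq is a min-heap).
    heap = [(-free, gpu) for gpu, free in enumerate(free_kv_tokens)]
    heapq.heapify(heap)
    assignment: dict[int, list[int]] = {gpu: [] for gpu in range(len(free_kv_tokens))}

    # Place the largest rollouts first so long responses do not overflow a cache.
    for idx in sorted(range(len(rollout_estimates)),
                      key=lambda i: -rollout_estimates[i]):
        neg_free, gpu = heapq.heappop(heap)
        free = -neg_free
        if rollout_estimates[idx] <= free:
            assignment[gpu].append(idx)
            free -= rollout_estimates[idx]
        # Otherwise the rollout would be deferred to a later dispatch round.
        heapq.heappush(heap, (-free, gpu))
    return assignment
```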
Training Recipe and Data Curation
The training pipeline begins with a non-reasoning SFT stage, focusing on instruction-following, tool formatting, and basic code usage, avoiding reasoning-heavy SFT to prevent overfitting and maintain concise initial responses. RL data is curated to include only high-quality, integer-answer math problems, ensuring reliable reward verification. Multi-stage RL training gradually increases response length and data difficulty, with GRPO-RoC enabling strong performance even at shorter lengths (8K→12K tokens). The final model is trained in only 510 RL steps on 64 MI300X GPUs.
Figure 6: AIME24/AIME25 accuracy and average response length across multi-stage RL training.
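The integer-answer constraint makes reward verification a simple exact match. The sketch below assumes a `\boxed{...}` answer convention, which is a common format but an assumption here rather than a detail confirmed by the report.

```python
import re

BOXED = re.compile(r"\\boxed\{(-?\d+)\}")  # assumed answer convention

def has_integer_answer(ground_truth: str) -> bool:
    """Keep only problems whose ground-truth answer is a single integer."""
    return re.fullmatch(r"-?\d+", ground_truth.strip()) is not None

def verify(response: str, ground_truth: str) -> float:
    """Outcome-only binary reward: 1.0 iff the final boxed integer matches."""
    m = BOXED.search(response)
    if m is None:
        return 0.0
    return 1.0 if int(m.group(1)) == int(ground_truth) else 0.0
```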
Experimental Results
Frontier-Level Math Reasoning
rStar2-Agent-14B achieves 80.6% pass@1 on AIME24 and 69.8% on AIME25, surpassing DeepSeek-R1 (671B) and other leading models while generating significantly shorter responses (see the table below). The model demonstrates efficient reasoning, with GRPO-RoC yielding concise, high-quality trajectories.
| Model | AIME24 | AIME25 | Avg. Response Length (AIME24) |
| --- | --- | --- | --- |
| DeepSeek-R1-Zero | 71.0 | 53.3 | 14,246.8 |
| QWQ-32B | 79.5 | 65.8 | 11,868.4 |
| Qwen3-14B | 79.3 | 70.4 | 14,747.6 |
| rStar2-Agent-14B | 80.6 | 69.8 | 9,339.7 |
Generalization Beyond Mathematics
Despite math-only RL training, rStar2-Agent-14B generalizes to science reasoning (GPQA-Diamond), agentic tool use (BFCL v3), and alignment tasks (IFEval, Arena-Hard), outperforming DeepSeek-V3 on most benchmarks. This suggests that agentic RL in code environments induces transferable reasoning patterns.
Ablation and Comparative Analysis
Ablation studies confirm the advantage of GRPO-RoC over vanilla agentic RL and CoT-only RL (DAPO), with consistent gains in accuracy and response efficiency (Figure 7). Performance saturates once the base model's reasoning capacity is reached, and continued RL beyond that point leads to training collapse, underscoring the importance of efficient RL for reaching the capacity ceiling.
Figure 7: GRPO-RoC ablation: higher accuracy and shorter responses compared to baselines throughout training.
Analysis of Agentic Reasoning Behaviors
Token entropy analysis reveals that rStar2-Agent-14B produces high-entropy tokens during forking (exploration, self-reflection) and in response to tool feedback, driving error correction and adaptive reasoning. Coding tool call tokens are typically low-entropy, reflecting strong pretraining on code corpora. The emergence of reflection tokens on environment feedback is a distinctive feature of agentic RL, enabling more advanced cognitive behaviors than conventional long CoT.
Figure 8: Example agentic RL trace with coding tool use and self-reflection; high-entropy tokens (green) correspond to exploration and reflection.
Figure 9: Agentic RL trace showing error handling, code correction, and verification; high-entropy tokens mark adaptive reasoning steps.
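For reference, the entropy measure underlying this analysis is the standard entropy of the model's next-token distribution at each generation step; the sketch below computes it from per-step logits (the report's exact instrumentation may differ, and the threshold is a free parameter).

```python
import math

def token_entropy(logits: list[float]) -> float:
    """H(p) = -sum_i p_i log p_i for p = softmax(logits), in nats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # subtract max for numerical stability
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)

def high_entropy_positions(per_step_logits: list[list[float]],
                           threshold: float) -> list[int]:
    """Flag generation steps (e.g. forking or reflection tokens) whose
    next-token entropy exceeds the chosen threshold."""
    return [t for t, logits in enumerate(per_step_logits)
            if token_entropy(logits) > threshold]
```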
Practical and Theoretical Implications
rStar2-Agent demonstrates that agentic RL in code environments can efficiently induce advanced reasoning capabilities in relatively small models, rivaling much larger counterparts. The GRPO-RoC algorithm provides a robust solution to environment-induced noise, enabling stable and effective training under outcome-only reward regimes. The scalable infrastructure and training recipe offer a blueprint for cost-effective development of reasoning agents. The observed generalization suggests that agentic RL may be a promising paradigm for broader cognitive tasks, contingent on the availability of valuable, verifiable environments.
Conclusion
rStar2-Agent establishes a new standard for agentic reasoning in LLMs, achieving state-of-the-art math performance with minimal compute and strong generalization. The work highlights the importance of scalable infrastructure, robust RL algorithms, and efficient training strategies. Future directions include extending agentic RL to diverse environments and reasoning domains, investigating the limits of model capacity, and refining reward and sampling strategies for even more effective cognitive behaviors. The public release of code and recipes will facilitate further research and practical deployment of agentic reasoning systems.