rStar2-Agent: Agentic RL for Math Reasoning
- rStar2-Agent is a 14B parameter LLM that leverages agentic reinforcement learning to integrate autonomous tool-use with concise problem decomposition.
- It employs the innovative GRPO-RoC algorithm and a high-throughput RL infrastructure to efficiently minimize errors and optimize reasoning steps.
- The model demonstrates robust cross-domain generalization and exceptional resource efficiency, establishing new benchmarks in scalable agentic reasoning.
rStar2-Agent is a 14B parameter LLM for mathematical reasoning, distinguished by its agentic reinforcement learning (agentic RL) paradigm, which integrates autonomous tool use, feedback-driven reflection, and concise problem decomposition within a scalable training infrastructure. Designed to surpass traditional chain-of-thought (CoT) models, rStar2-Agent achieves leading performance on mathematics benchmarks and generalizes to alignment, scientific reasoning, and agentic tool-use tasks, while maintaining high resource efficiency and short response lengths (Shang et al., 28 Aug 2025).
1. Architectural Foundations and Innovations
rStar2-Agent is constructed atop a 14B parameter base LLM, which is advanced via agentic RL to exhibit agent-like reasoning traits. Its architecture is characterized by three principal innovations:
- Efficient RL Infrastructure: An execution environment engineered for high-throughput Python tool calls, handling up to 45,000 concurrent executions per RL step with a mean latency of 0.3 s per call. The infrastructure uses a load-balanced rollout scheduler that adapts allocation to per-GPU KV-cache capacity, maximizing throughput and eliminating compute waste from synchronization stalls or imbalance (a scheduling sketch appears at the end of this section).
- GRPO-RoC Algorithm: The model employs Group Relative Policy Optimization (GRPO) with a Resample-on-Correct (RoC) rollout strategy. In each policy improvement step, the model oversamples 2G rollouts per question and downsamples to G for the update, retaining incorrect rollouts as negative signal while keeping only the correct trajectories with the fewest code-tool errors and format issues (a resampling sketch follows this list). The RL objective is:
$$\mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\min\Big(r_{i,t}(\theta)\,\hat{A}_{i,t},\ \mathrm{clip}\big(r_{i,t}(\theta),\,1-\varepsilon_{\text{low}},\,1+\varepsilon_{\text{high}}\big)\,\hat{A}_{i,t}\Big)\right]$$
where $r_{i,t}(\theta) = \pi_\theta(o_{i,t}\mid q, o_{i,<t})\,/\,\pi_{\theta_{\text{old}}}(o_{i,t}\mid q, o_{i,<t})$ is the token-level importance ratio, $\varepsilon_{\text{high}} > \varepsilon_{\text{low}}$ ("Clip-Higher" strategy), and $\hat{A}_{i,t}$ denotes the group-relative advantage, obtained by normalizing each rollout's reward against the group mean and standard deviation.
- Agent Training Recipe: The method uses a “non-reasoning” supervised fine-tuning (SFT) warmstart that imparts structured formatting, function calling, and instruction following; the ensuing RL stages then optimize tool use, response conciseness, and advanced cognitive skills. RL proceeds for 510 update steps over three stages: the first with an 8K and the subsequent two with a 12K maximum response length, with the final stage focused exclusively on difficult examples left unsolved in earlier phases.
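As a rough illustration of the Resample-on-Correct step, the Python sketch below oversamples a group of rollouts and downsamples it to size G, keeping incorrect rollouts for negative signal while preferring correct rollouts with fewer tool errors and formatting issues. The `Rollout` fields and the positive/negative split are illustrative assumptions, not the released implementation.

```python
import random
from dataclasses import dataclass

@dataclass
class Rollout:
    """Illustrative container for one sampled trajectory (field names are assumptions)."""
    trajectory: str        # full tool-interleaved reasoning trace
    is_correct: bool       # binary verifier outcome on the final answer
    tool_errors: int       # number of failed code-tool calls in the trace
    format_penalty: float  # penalty for malformed tags or answer formatting

def resample_on_correct(oversampled: list[Rollout], group_size: int) -> list[Rollout]:
    """Downsample 2G oversampled rollouts to G for the GRPO update.

    Incorrect rollouts are kept (they carry the negative learning signal),
    while correct rollouts are ranked by trace quality so that noisy but
    lucky trajectories are filtered out of the positive set.
    """
    positives = [r for r in oversampled if r.is_correct]
    negatives = [r for r in oversampled if not r.is_correct]

    # Rank correct rollouts by trace quality: fewest tool errors and
    # formatting issues first.
    positives.sort(key=lambda r: (r.tool_errors, r.format_penalty))

    # Preserve roughly the original correct/incorrect ratio in the kept group;
    # the exact split is a simplifying assumption for this sketch.
    n_pos = round(group_size * len(positives) / max(len(oversampled), 1))
    n_pos = min(n_pos, len(positives))
    n_neg = min(group_size - n_pos, len(negatives))

    return positives[:n_pos] + random.sample(negatives, n_neg)
```

Because incorrect trajectories are never quality-filtered, the policy continues to see its own failure modes, while the positive set stays clean; this is consistent with the paper's use of a simple binary reward without rewarding noisy traces.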
These foundations collectively enable rStar2-Agent to reflect on intermediate tool outputs and iteratively refine its problem-solving with minimal wasted computation.
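The load-balanced rollout scheduler mentioned in the first bullet above can be sketched as follows: new rollout requests are dispatched to whichever inference worker currently reports the most free KV-cache, rather than being split statically across GPUs. The `Worker` fields and the `max_new_tokens` budget estimate are assumptions for illustration, not the released scheduler's API.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Worker:
    """One GPU inference worker, ordered by negated free KV-cache capacity."""
    neg_free_kv_tokens: int                     # negated so the max-capacity worker pops first
    worker_id: int = field(compare=False)

def dispatch_rollouts(requests: list[dict], workers: list[Worker]) -> dict[int, list[dict]]:
    """Assign each rollout request to the worker with the most free KV-cache.

    This mirrors the capacity-aware scheduling idea: instead of a static
    per-GPU split, each request goes where it is least likely to stall,
    so long tool-interleaved trajectories do not serialize a whole batch.
    """
    heapq.heapify(workers)
    assignment: dict[int, list[dict]] = {w.worker_id: [] for w in workers}
    for req in requests:
        w = heapq.heappop(workers)                    # worker with the most free capacity
        assignment[w.worker_id].append(req)
        est_tokens = req.get("max_new_tokens", 8192)  # rough KV budget for this request
        w.neg_free_kv_tokens += est_tokens            # free capacity shrinks, negated value grows
        heapq.heappush(workers, w)
    return assignment
```

The actual system balances load continuously as rollouts and tool calls complete; this static assignment only illustrates the capacity-aware dispatch idea.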
2. Training Regimen and Data Pipeline
Training commences with a “non-reasoning” SFT stage that aligns the model to the interaction protocol, deferring reasoning development to RL. The SFT phase teaches the following (an illustrative example appears after this list):
- Structured prompt/response templates (tags for reasoning, final answer).
- Proper formatting, JSON-structured tool calls, and general tool use, without encoding long chain-of-thought explanations.
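To make the “non-reasoning” SFT target more concrete, the sketch below shows what a structured, tool-calling training example of this kind could look like. The tag names (`<reason>`, `<tool_call>`, `<tool_output>`, `<answer>`) and the JSON schema are hypothetical; the released recipe defines the actual template.

```python
# Hypothetical shape of one "non-reasoning" SFT example: it teaches the
# tagging and tool-call protocol, not long chain-of-thought content.
sft_example = {
    "prompt": "Compute the sum of the first 100 positive integers.",
    "response": (
        "<reason>Use the Python tool to compute the sum directly.</reason>\n"
        '<tool_call>{"name": "python", "arguments": {"code": "print(sum(range(1, 101)))"}}</tool_call>\n'
        "<tool_output>5050</tool_output>\n"
        "<answer>5050</answer>"
    ),
}
```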
Subsequently, a three-stage RL process is employed:
| RL Stage | Token Limit | Curriculum | Selection Strategy |
|---|---|---|---|
| 1 | 8K | All problem types | Standard GRPO-RoC |
| 2 | 12K | Broad, more complex | Standard GRPO-RoC |
| 3 | 12K | “Hard” examples only | Filter solved cases |
Binary pass/fail signals (reward = 1 for a correct final answer, 0 otherwise) drive RL optimization and policy improvement via GRPO-RoC, enabling the model to autonomously explore, verify, and revise intermediate solution steps.
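A minimal sketch of this verifiable reward and of the stage-3 filtering, under the simplifying assumptions that answer checking reduces to normalized string comparison (the real verifier is more robust) and that “solved” means a 100% pass rate in earlier stages:

```python
def verify_answer(candidate: str, reference: str) -> bool:
    """Simplified verifier: exact match after whitespace normalization.

    A stand-in for a math-aware answer checker; this equivalence test is an assumption.
    """
    return candidate.strip() == reference.strip()

def binary_reward(candidate: str, reference: str) -> float:
    """Verifiable reward used for RL: 1.0 for a correct final answer, 0.0 otherwise."""
    return 1.0 if verify_answer(candidate, reference) else 0.0

def filter_hard_problems(problems: list[str], pass_rates: dict[str, float]) -> list[str]:
    """Stage-3 curriculum: keep only problems not yet fully solved in earlier stages.

    `pass_rates[p]` is the fraction of earlier-stage rollouts that solved `p`;
    treating "solved" as a pass rate of 1.0 is an assumption for illustration.
    """
    return [p for p in problems if pass_rates.get(p, 0.0) < 1.0]
```

Stage 3 then reuses the same binary reward on this reduced set, concentrating the gradient signal on the hardest remaining problems.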
Notably, the RL infrastructure’s rollout orchestration and error filtering are integral to fast convergence on a limited compute budget (64 MI300X GPUs).
3. Performance Evaluation
rStar2-Agent achieves state-of-the-art mathematics results:
| Model | Params | AIME24 pass@1 | AIME25 pass@1 | Typical Response Length |
|---|---|---|---|---|
| rStar2-Agent | 14B | 80.6% | 69.8% | Shortest among SOTA |
| DeepSeek-R1 | 671B | 79.8% | 70.0% | Significantly longer |
Results are obtained in only 510 RL updates (approximately one week). Training curves indicate simultaneous increases in accuracy and decreases in average response length—implying progressive elimination of redundancy and more efficient reasoning. The model’s performance on “hard” problems improves throughout staged RL by dynamically filtering already-solved cases.
4. Agentic Reasoning and Tool–Use Dynamics
The model learns advanced reasoning behaviors not present in earlier RLHF or SFT-trained models:
- Cautious invocation of code tools, only after sufficient deliberation.
- Autonomous exploration of reasoning branches, including revision when code output or verification steps indicate errors.
- Integration of code execution feedback into the solution refinement loop, producing concise, correct intermediate steps, and anticipating downstream requirements.
These features are emergent properties of the RL approach and infrastructure, which maintains high throughput and filters rollouts for correctness and minimal noise.
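These dynamics play out inside a tool-interleaved rollout loop: generation pauses at a tool call, the code runs in an isolated environment, and the output is fed back into the context before generation resumes. The sketch below shows the general shape of such a loop; `generate_fn`, the tag names, and the subprocess-based sandbox are placeholders rather than the released interfaces.

```python
import json
import re
import subprocess

def run_python_sandboxed(code: str, timeout_s: float = 5.0) -> str:
    """Execute a code snippet in a subprocess; a stand-in for the isolated tool service."""
    try:
        proc = subprocess.run(
            ["python", "-c", code], capture_output=True, text=True, timeout=timeout_s
        )
        return (proc.stdout + proc.stderr).strip()
    except subprocess.TimeoutExpired:
        return "ToolError: execution timed out"

def agentic_rollout(generate_fn, prompt: str, max_turns: int = 8) -> str:
    """Interleave generation with Python tool execution until a final answer appears.

    `generate_fn(context)` is any function that continues the context and stops
    at a closing </tool_call> or </answer> tag (tag names are assumptions).
    """
    context = prompt
    for _ in range(max_turns):
        completion = generate_fn(context)
        context += completion
        answer = re.search(r"<answer>(.*?)</answer>", completion, re.S)
        if answer:
            return answer.group(1).strip()
        call = re.search(r"<tool_call>(.*?)</tool_call>", completion, re.S)
        if call:
            # Feed execution feedback back so the model can verify intermediate
            # results and revise its plan on the next turn.
            payload = json.loads(call.group(1))
            output = run_python_sandboxed(payload["arguments"]["code"])
            context += f"<tool_output>{output}</tool_output>"
    return ""  # no final answer within the turn budget
```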
5. Generalization beyond Mathematical Reasoning
Despite exclusive RL training on mathematical tasks, rStar2-Agent demonstrates robust cross-domain transfer:
- Scientific reasoning: Accuracy improves on GPQA-Diamond, an out-of-domain science benchmark.
- Alignment tasks: Evaluations on IFEval and Arena-Hard show strong generalization.
- Agentic tool-use: On the BFCL v3 benchmark, the model autonomously invokes external Python tools as required to solve tasks.
This suggests the agentic RL process shapes a general-purpose cognitive framework effective in tasks demanding procedural reasoning, verification, and agentic interaction with tools.
6. Resource Efficiency and Open-Source Availability
rStar2-Agent is distinguished by its compute efficiency: only 64 MI300X GPUs over a single week yield SOTA results, attributed to rollout scheduler design and judicious curriculum selection. The complete codebase and training recipes are publicly released at https://github.com/microsoft/rStar, including:
- Full RL infrastructure and scheduling logic.
- GRPO-RoC implementation.
- Training protocols covering SFT initialization, staged RL curriculum, and rollout management.
Open-source access facilitates replication and further investigation of agentic RL at scale.
7. Context and Impact
rStar2-Agent represents a convergence of efficient RL infrastructure, algorithmic refinement to address environment noise (GRPO-RoC), and a curriculum that minimizes reliance on heavy SFT for reasoning, achieving scalable, generalizable, and concise agentic reasoning. Its training efficiency and public release are likely to influence subsequent research on resource-efficient advanced reasoning agents, curriculum design, and scalable RL for tool-integrated LLMs.
A plausible implication is that agentic RL, as instantiated in rStar2-Agent, provides a methodological template for domains requiring autonomous, tool-integrated, and feedback-driven problem-solving that generalizes beyond mathematics to broader scientific and agentic tasks (Shang et al., 28 Aug 2025).