
RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning (2504.20073v2)

Published 24 Apr 2025 in cs.LG, cs.AI, and cs.CL

Abstract: Training LLMs as interactive agents presents unique challenges including long-horizon decision making and interacting with stochastic environment feedback. While reinforcement learning (RL) has enabled progress in static tasks, multi-turn agent RL training remains underexplored. We propose StarPO (State-Thinking-Actions-Reward Policy Optimization), a general framework for trajectory-level agent RL, and introduce RAGEN, a modular system for training and evaluating LLM agents. Our study on four stylized environments reveals three core findings. First, our agent RL training shows a recurring mode of Echo Trap where reward variance cliffs and gradient spikes; we address this with StarPO-S, a stabilized variant with trajectory filtering, critic incorporation, and gradient stabilization. Second, we find the shaping of RL rollouts would benefit from diverse initial states, medium interaction granularity and more frequent sampling. Third, we show that without fine-grained, reasoning-aware reward signals, agent reasoning hardly emerge through multi-turn RL and they may show shallow strategies or hallucinated thoughts. Code and environments are available at https://github.com/RAGEN-AI/RAGEN.

This paper introduces RAGEN, a modular system and research platform designed for training and evaluating LLM agents in multi-turn interactive environments using reinforcement learning (RL). The work addresses the challenges of applying RL to LLMs in settings requiring sequential decision-making, memory, and adaptation to stochastic feedback, which are common in agent tasks like planning, robotics, and tutoring.

The core contribution is the StarPO (State-Thinking-Actions-Reward Policy Optimization) framework, a general approach for trajectory-level RL that optimizes the entire sequence of interactions $\tau = \{s_0, a^T_0, r_0, s_1, \ldots, a^T_{K-1}, r_{K-1}, s_K\}$, including observations, reasoning, actions, and environmental feedback. This contrasts with previous LLM RL methods that often focus on single-turn state-action pairs. StarPO has the LLM generate, at each step, a reasoning-guided structured output $a^T_t$ of the form <think>...</think><answer> $a_t$ </answer>, with the cumulative reward $R(\tau)$ over the trajectory used for policy updates.
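To make the trajectory-level objective concrete, here is a minimal Python sketch of how such a trajectory and its cumulative reward $R(\tau)$ might be represented; the class and field names are illustrative assumptions, not types from the RAGEN codebase.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    state: str      # environment observation rendered as text
    thought: str    # contents of the <think>...</think> span
    action: str     # contents of the <answer>...</answer> span
    reward: float   # per-turn reward r_t from the environment

@dataclass
class Trajectory:
    steps: list[Step] = field(default_factory=list)
    final_state: str = ""

    def cumulative_reward(self, gamma: float = 1.0) -> float:
        # R(tau) = sum_t gamma^t * r_t; the whole trajectory, reasoning
        # tokens included, is optimized against this single return.
        return sum((gamma ** t) * s.reward for t, s in enumerate(self.steps))
```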

StarPO supports various policy optimization algorithms such as PPO (Schulman et al., 2017) and GRPO (DeepSeek-AI et al., 22 Jan 2025). For PPO, a critic estimates token-level values and advantages. For GRPO, a scalar trajectory reward $R(\tau_i)$ is assigned and normalized across a batch to serve as a shared advantage signal for all tokens within that trajectory. The policy probability $\pi_\theta(\tau)$ is factorized into token-level likelihoods, making it compatible with autoregressive LLMs.
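As a rough illustration of this GRPO-style variant, the sketch below normalizes scalar trajectory rewards within a rollout group so that each trajectory's normalized value can be broadcast as the shared advantage for all of its tokens; the function name and example values are assumptions.

```python
import torch

def grpo_trajectory_advantages(traj_rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize scalar trajectory rewards R(tau_i) across a rollout group.

    Each trajectory's normalized reward is then broadcast as the shared
    advantage for every token that trajectory contains, avoiding a critic.
    """
    return (traj_rewards - traj_rewards.mean()) / (traj_rewards.std() + eps)

# Example: four rollouts sampled for the same prompt group.
rewards = torch.tensor([1.0, 0.0, 0.5, 1.0])
advantages = grpo_trajectory_advantages(rewards)  # one scalar per trajectory
```

Under PPO, the critic's token-level value estimates would take the place of this group-normalized scalar.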

The RAGEN system implements the StarPO framework, providing the infrastructure for generating rollouts, assigning rewards, and performing trajectory optimization in controlled environments. Its modular design allows for easy integration of new environments, reward functions, and rollout strategies, serving as a testbed for analyzing RL-based agent learning dynamics.
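The schematic loop below illustrates the rollout-and-update cycle such a system runs; `env`, `policy`, and `update_policy` are hypothetical placeholders rather than RAGEN's actual interfaces.

```python
def collect_trajectory(env, policy, max_turns: int = 10):
    """Roll out one multi-turn episode as (obs, thought, action, reward) tuples."""
    trajectory, obs, done = [], env.reset(), False
    for _ in range(max_turns):
        thought, action = policy.act(obs)      # reasoning text + chosen action
        obs, reward, done = env.step(action)   # possibly stochastic feedback
        trajectory.append((obs, thought, action, reward))
        if done:
            break
    return trajectory

def train(env, policy, update_policy, iterations: int, rollouts_per_update: int = 8):
    for _ in range(iterations):
        batch = [collect_trajectory(env, policy) for _ in range(rollouts_per_update)]
        update_policy(batch)  # PPO- or GRPO-style update over full trajectories
```

Regenerating `batch` from the current policy at every iteration corresponds to the frequent, on-policy rollout strategy favored in the findings below.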

The authors conduct experiments on three stylized symbolic environments:

  • Bandit: A single-turn, stochastic task testing risk-sensitive reasoning under uncertainty.
  • Sokoban: A multi-turn, deterministic puzzle whose irreversible moves require careful planning.
  • Frozen Lake: A multi-turn, stochastic task combining planning with probabilistic transitions.

These minimal environments decouple the study of learning dynamics from complex real-world priors (a toy bandit-style sketch of such an environment follows).
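For concreteness, a stylized environment of this kind can fit in a few lines of code; the two-armed bandit below is an illustrative stand-in for the Bandit task, with arm names, payoff distributions, and the invalid-action penalty chosen for the example rather than taken from the paper.

```python
import random

class TwoArmBandit:
    """Minimal single-turn, stochastic environment in the spirit of the Bandit
    task; the specific arms and payoffs here are illustrative assumptions."""

    def reset(self) -> str:
        return "Choose an arm: 'safe' or 'risky'."

    def step(self, action: str):
        if action == "safe":
            reward = 0.4                                     # deterministic low payoff
        elif action == "risky":
            reward = 1.0 if random.random() < 0.5 else 0.0   # high-variance payoff
        else:
            reward = -0.1                                     # invalid output; assumed penalty analogous to the paper's format penalty
        return "terminal", reward, True                       # single turn: episode ends immediately
```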

Through systematic experiments using the RAGEN system, the paper identifies the following key findings regarding multi-turn agent RL training:

  1. Instability Pattern (Echo Trap): Vanilla adaptations of single-turn RL methods like PPO and GRPO exhibit a recurring instability called "Echo Trap" in multi-turn settings. Agents initially improve but then collapse, overfitting to locally rewarded reasoning patterns, leading to decreased reward variance, entropy drop, and gradient spikes. PPO generally shows better stability than GRPO, suggesting a critic helps, but doesn't prevent collapse.
  2. Collapse Indicators: Reward standard deviation and output entropy serve as early warning signals for collapse, dropping before performance degrades. Gradient norm spikes indicate irreversible training instability. Monitoring these metrics is crucial for diagnosing and mitigating collapse.
  3. StarPO-S Stabilization: To address instability, the paper proposes StarPO-S, a stabilized variant. Key techniques include:
    • Uncertainty-Based Filtering: Training is prioritized on trajectories from task instances with higher reward variance. Filtering out low-variance trajectories improves stability and efficiency, especially under PPO.
    • KL Term Removal: Removing the KL divergence penalty from the objective encourages exploration.
    • Asymmetric Clipping: Decoupling the PPO clipping bounds ($\epsilon_{\text{high}} > \epsilon_{\text{low}}$) allows stronger learning from high-reward trajectories. StarPO-S consistently delays collapse and improves peak performance compared to vanilla StarPO across tasks (a minimal sketch of the filtering and clipping mechanics follows this list).
  4. Rollout Generation Factors: The quality of self-generated rollout trajectories significantly impacts training. Optimal factors include:
    • Higher Task Diversity: Using diverse initial states (prompts) with a moderate number of rollouts per prompt (e.g., 4) improves generalization by exposing the model to broader contexts and enabling comparison between outcomes.
    • Moderate Interaction Granularity: Allowing a moderate number of actions per turn (e.g., 5-6 in Sokoban) provides sufficient planning space without injecting excessive noise from overly long sequences.
    • Frequent Rollout Updates: Collecting fresh rollouts more frequently (Online-1 strategy) ensures better alignment between the optimization targets and the current policy, leading to faster convergence and stronger generalization.
  5. Reasoning Emergence and Reward Signals: Symbolic reasoning, encouraged by <think> tokens, improves generalization in simple single-turn tasks like Bandit (even under semantic-reward misalignment). However, in multi-turn environments like Sokoban and Frozen Lake, reasoning length tends to decay during training, offering limited benefit over a no-thinking baseline. The authors hypothesize that the sparse, delayed, outcome-based reward signals in multi-turn tasks fail to reliably reinforce fine-grained reasoning steps. Models may generate hallucinated reasoning while still achieving task success, highlighting the need for meticulous, reasoning-aware reward design.
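The following sketch illustrates two of the more mechanical StarPO-S ingredients described above, uncertainty-based trajectory filtering and asymmetric clipping; the retention fraction, clip bounds, and function names are illustrative placeholders rather than the paper's reported settings.

```python
import torch

def filter_by_reward_variance(groups, keep_fraction: float = 0.25):
    """Uncertainty-based filtering: retain the prompt groups whose rollout
    rewards vary the most. `groups` maps a prompt id to its list of scalar
    trajectory rewards; the retention fraction is an illustrative choice."""
    scored = sorted(groups.items(),
                    key=lambda kv: float(torch.tensor(kv[1], dtype=torch.float32).std()),
                    reverse=True)
    keep = max(1, int(len(scored) * keep_fraction))
    return dict(scored[:keep])

def asymmetric_clip_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style surrogate (to be maximized) with decoupled clip bounds
    eps_high > eps_low, permitting larger updates on high-reward rollouts;
    the bound values here are placeholders, not the paper's settings."""
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    return torch.min(ratio * advantage, clipped * advantage).mean()
```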

    Implementation Considerations:

    • Framework Implementation: StarPO is implemented on top of existing RL optimization algorithms (PPO, GRPO) but modifies the objective to operate on full trajectories rather than single steps. This requires handling variable-length sequences and attributing reward signals across turns.

    • System Architecture: RAGEN is presented as a modular system likely involving components for environment interaction, rollout generation, reward calculation, and policy optimization, interconnected to support the multi-turn training loop.
    • Stabilization Techniques: Implementing StarPO-S requires incorporating mechanisms for calculating and filtering rollouts based on reward variance, modifying the PPO/GRPO objectives (removing KL, adjusting clipping), and potentially using a critic for value estimation in PPO.
    • Rollout Data Management: Efficiently managing and sampling from rollout data (either on-policy or from a replay buffer) is crucial. The findings emphasize the importance of data freshness.
    • Prompt Engineering: Using structured prompts with explicit tags like <think> and <answer> is key to eliciting reasoning and action outputs, although the effectiveness of reasoning depends heavily on the environment and reward structure.
    • Resource Efficiency: Parameter-efficient fine-tuning methods like LoRA (Hu et al., 2021 ) can significantly reduce computational requirements (GPU memory, utilization, power) while maintaining comparable performance, making it feasible to scale to larger models and longer tasks. LoRA training achieved similar validation success rates to full fine-tuning on Sokoban with substantially reduced resource usage.
    • Reward Function Design: Designing effective reward functions for multi-turn, complex tasks is challenging. The paper suggests the need for signals that reinforce intermediate reasoning steps in addition to final task outcomes, to prevent reasoning collapse and hallucination. Simple format-based penalties (-0.1 for an invalid <think>/<answer> structure) can partially encourage adherence to the desired output format (see the parsing sketch after this list).
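Below is a minimal sketch of how the structured <think>/<answer> output might be parsed and the format penalty applied; the regex and function are assumptions about one possible implementation, while the -0.1 value follows the paper.

```python
import re

THINK_ANSWER = re.compile(
    r"<think>(?P<think>.*?)</think>\s*<answer>(?P<answer>.*?)</answer>",
    re.DOTALL,
)

def parse_and_score_format(output: str, format_penalty: float = -0.1):
    """Extract the reasoning and action spans from a structured model output.
    If the <think>...</think><answer>...</answer> structure is missing, return
    a small negative format reward; the parsing details are an assumption."""
    match = THINK_ANSWER.search(output)
    if match is None:
        return None, None, format_penalty
    return match.group("think").strip(), match.group("answer").strip(), 0.0

# Example
thought, action, penalty = parse_and_score_format(
    "<think>The risky arm has higher expected value.</think><answer>risky</answer>"
)
```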

    Limitations noted:

    • RAGEN has primarily been evaluated on relatively small models (0.5B parameters) and stylized environments; scalability to larger models and complex, real-world multimodal tasks remains to be demonstrated.
    • The current reward system relies on easily verifiable outcomes, which might not be applicable to domains without clear success criteria.
    • Training efficiency is limited by long multi-turn contexts, which require large KV caches.

    The paper concludes that multi-turn RL can effectively train LLM agents for reasoning and action when tailored appropriately. It shifts training emphasis towards reward-driven learning from interaction, offering a potential path for scalable AI system development in complex domains, provided challenges related to stability, efficient data utilization, and fine-grained reward design are addressed.

Authors (18)
  1. Zihan Wang
  2. Kangrui Wang
  3. Qineng Wang
  4. Pingyue Zhang
  5. Linjie Li
  6. Zhengyuan Yang
  7. Kefan Yu
  8. Minh Nhat Nguyen
  9. Licheng Liu
  10. Eli Gottlieb
  11. Yiping Lu
  12. Kyunghyun Cho
  13. Jiajun Wu
  14. Li Fei-Fei
  15. Lijuan Wang
  16. Yejin Choi
  17. Manling Li
  18. Xing Jin