- The paper introduces agentic RL, which reframes LLMs from passive generators into decision-making agents capable of extended planning and interactive tool use.
- It reviews key methodologies including policy gradient methods, hybrid reward frameworks, and advanced memory and self-improvement mechanisms.
- The survey highlights open challenges in trustworthiness, scaling, and environment design to guide future research on autonomous agents.
The Landscape of Agentic Reinforcement Learning for LLMs
Agentic Reinforcement Learning (Agentic RL) represents a paradigm shift in the intersection of LLMs and reinforcement learning (RL), reframing LLMs from passive sequence generators into autonomous, decision-making agents capable of long-horizon, interactive behaviors. This survey provides a comprehensive synthesis of the theoretical foundations, algorithmic advances, capability taxonomies, domain applications, and open challenges in Agentic RL, consolidating insights from over 500 recent works.
Paradigm Shift: From LLM-RL to Agentic RL
Traditional LLM-RL approaches, such as RLHF and DPO, treat the LLM as a static conditional generator and optimize single-turn outputs under a degenerate, single-step MDP. In contrast, Agentic RL formalizes the LLM as a policy embedded in a temporally extended, partially observable Markov decision process (POMDP), enabling sequential decision-making, tool use, memory, and adaptive planning in dynamic environments.
Figure 1: Paradigm shift from LLM-RL to agentic RL.
The key distinctions are:
- State and Observation: Agentic RL operates over evolving world states and partial observations, rather than static prompts.
- Action Space: Actions include both free-form text and structured tool/environment interactions.
- Transition Dynamics: Environment transitions are stochastic and history-dependent.
- Reward Structure: Supports dense, sparse, and learned rewards, enabling long-term credit assignment.
- Learning Objective: Maximizes discounted cumulative reward over trajectories, not just immediate output quality.
This formalization underpins the emergence of LLMs as agentic entities, capable of perceiving, reasoning, planning, and adapting in open-ended tasks.
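As a reference point for this formalization, the objective below writes the agentic RL problem in standard POMDP form; the symbols are conventional notation chosen for illustration rather than the survey's own.

```latex
% Agentic RL objective in standard POMDP form (notation chosen for illustration).
% The LLM policy \pi_\theta conditions on the interaction history h_t (task prompt x,
% observations o_{\le t}, prior actions a_{<t}) and is trained to maximize the expected
% discounted return over whole trajectories, not single-step output quality:
J(\theta) \;=\; \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t=0}^{T} \gamma^{t}\, r(s_t, a_t)\right],
\qquad a_t \sim \pi_\theta(\cdot \mid h_t),\quad h_t = (x,\, o_{\le t},\, a_{<t}).
```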
Agentic RL: Capability-Centric Taxonomy
Agentic RL is best understood through the lens of core agentic capabilities, each of which is transformed from a static module into an RL-optimizable policy:
Figure 2: A summary of the overall six aspects where RL empowers agentic LLMs.
Planning
RL enables LLMs to move beyond prompt-based or SFT-mimicked planning, supporting both external search-guided planning (e.g., MCTS with RL-trained heuristics) and direct policy optimization for intrinsic planning strategies. Progressive RL frameworks (e.g., AdaPlan, PilotRL) demonstrate improved long-horizon coordination and adaptability.
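As a deliberately simplified illustration of search-guided planning, the sketch below scores candidate actions with an RL-trained value estimate plus a UCT-style exploration bonus, the selection rule at the core of MCTS. `value_fn`, `candidate_actions`, and `visit_counts` are hypothetical placeholders, not APIs from AdaPlan or PilotRL.

```python
import math
from collections import defaultdict

def select_action(state, candidate_actions, value_fn, visit_counts, c=1.4):
    """Pick the action with the best learned value plus a UCT-style exploration bonus.

    value_fn(state, action) is assumed to be an RL-trained heuristic; visit_counts
    tracks how often each (state, action) pair has been tried during the search.
    """
    total_visits = sum(visit_counts[(state, a)] for a in candidate_actions) + 1

    def score(action):
        n = visit_counts[(state, action)]
        exploration = c * math.sqrt(math.log(total_visits + 1) / (n + 1))
        return value_fn(state, action) + exploration

    best = max(candidate_actions, key=score)
    visit_counts[(state, best)] += 1
    return best

# Usage: initialize visit_counts = defaultdict(int), then call select_action
# repeatedly inside the planner's search loop.
```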
Tool Use
The evolution from ReAct-style pipelines to deeply interleaved Tool-Integrated Reasoning (TIR) is driven by RL. RL-trained agents autonomously discover effective tool-invocation strategies, adapt to novel scenarios, and recover from errors. Modern systems (e.g., ToolRL, AutoTIR, VTool-R1) exhibit emergent behaviors such as self-correction and multi-tool composition.
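To make the TIR loop concrete, here is a minimal rollout sketch of the kind of trajectory these RL methods optimize; `llm_generate`, `run_tool`, and the `<tool>...</tool>` markup are illustrative assumptions, not the interfaces of ToolRL, AutoTIR, or VTool-R1.

```python
import re

# Assumed markup for tool calls emitted by the model, e.g. <tool>search(query)</tool>.
TOOL_PATTERN = re.compile(r"<tool>(?P<name>\w+)\((?P<arg>.*?)\)</tool>", re.S)

def tir_rollout(llm_generate, run_tool, prompt, max_turns=8):
    """Interleave free-form reasoning with tool calls until the model emits a final answer."""
    history = prompt
    for _ in range(max_turns):
        segment = llm_generate(history)        # action: a text segment, possibly containing a tool call
        history += segment
        call = TOOL_PATTERN.search(segment)
        if call is None:                       # no tool call -> treat the segment as the final answer
            return history
        result = run_tool(call["name"], call["arg"])     # environment step: execute the tool
        history += f"\n<result>{result}</result>\n"      # observation appended to the context
    return history
```

RL then assigns credit over the whole rollout, so the policy learns when to call a tool, which one, and how to use its result.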
Figure 3: The development of agentic tool use, highlighting representative milestones.
Memory
RL transforms memory from a passive store (RAG) into an active, policy-controlled subsystem. RL-based memory managers (e.g., Memory-R1, MemAgent) learn to add, update, or delete memory entries, optimizing retrieval and compression for downstream performance. The frontier lies in RL-driven structured memory (e.g., temporal graphs, atomic notes).
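A minimal sketch of what "memory as an RL-controlled policy" can look like is given below; the action set, data structure, and reward description are assumptions for illustration, not the actual design of Memory-R1 or MemAgent.

```python
from dataclasses import dataclass, field

ACTIONS = ("ADD", "UPDATE", "DELETE", "NOOP")  # assumed action set for the memory-manager policy

@dataclass
class MemoryStore:
    entries: dict = field(default_factory=dict)

    def apply(self, action: str, key: str, value: str = "") -> None:
        """Execute one memory-management action chosen by the RL policy."""
        if action in ("ADD", "UPDATE"):
            self.entries[key] = value
        elif action == "DELETE":
            self.entries.pop(key, None)
        # "NOOP" leaves memory unchanged

# Training signal (assumption): reward the manager with downstream task success after retrieval,
# so the policy learns which entries are worth writing, rewriting, or evicting.
store = MemoryStore()
store.apply("ADD", "user_goal", "book a flight to Tokyo")
store.apply("DELETE", "stale_note")
```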
Self-Improvement
Agentic RL supports both verbal self-correction (e.g., Reflexion, Self-Refine) and internalized, gradient-based self-improvement (e.g., KnowSelf, Reflection-DPO, R-Zero). Advanced agents autonomously generate curricula, synthesize problems, and bootstrap their own learning via self-play and execution-guided feedback.
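The schematic loop below shows the general shape of execution-guided self-improvement via self-play; `propose_problem`, `attempt`, `verify`, and `rl_update` are hypothetical stand-ins for a proposer model, a solver, an automatic checker, and the policy-update step, not the specific mechanisms of KnowSelf, Reflection-DPO, or R-Zero.

```python
def self_improvement_round(propose_problem, attempt, verify, rl_update, n_problems=32):
    """One round of self-play: synthesize problems, solve them, score with a verifier, update."""
    experience = []
    for _ in range(n_problems):
        problem = propose_problem()                          # agent synthesizes its own training problem
        solution = attempt(problem)                          # solver produces a candidate solution
        reward = 1.0 if verify(problem, solution) else 0.0   # execution/verifier-based feedback
        experience.append((problem, solution, reward))
    rl_update(experience)   # proposer and/or solver policies can both be updated from this data
    return experience
```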
Reasoning
RL enables the transition from fast, heuristic-driven reasoning to slow, deliberate, multi-step reasoning. RL-based methods (e.g., GRPO, DAPO, R1-style RL) incentivize chain-of-thought, verification, and planning, yielding higher accuracy and robustness in knowledge-intensive domains.
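As one concrete example, GRPO-style methods replace a learned critic with a group-relative baseline: several responses are sampled for the same prompt, each is scored, and its advantage is its reward standardized against the group. The sketch below shows that normalization; exact details vary across implementations.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each response's reward against its sampling group (GRPO-style sketch)."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: four chains of thought sampled for one problem, scored by a verifier (1 = correct).
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # correct answers receive positive advantage
```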
Perception
RL is critical for aligning vision-language(-action) models (VLMs, LVLMs) with complex, multi-step reasoning objectives. RL-based approaches (e.g., Vision-R1, VLM-R1, SVQA-R1) move beyond passive perception to active visual cognition, grounding reasoning in visual evidence, tool use, and imagination.
Figure 4: The dynamic interaction process between agentic LLMs and the environment.
Agentic RL: Task-Centric Taxonomy
Agentic RL has been instantiated across a broad spectrum of domains, each presenting unique challenges and opportunities:
Figure 5: The evolution tree of RL for domain-specific agents.
- Search/Research Agents: RL optimizes multi-turn search, reasoning, and synthesis, outperforming prompt-based and SFT-only baselines in open-domain research tasks.
- Code Generation and Software Engineering: RL with outcome/process rewards enables robust code synthesis, debugging, and full-stack software engineering, leveraging verifiable execution signals.
- Mathematical Reasoning: RL is essential for both informal (natural language, symbolic computation) and formal (theorem proving) mathematics, supporting tool-integrated reasoning and proof search.
- GUI/Web Agents: RL enables robust, long-horizon interaction in dynamic GUI and web environments, surpassing SFT in grounding, planning, and recovery.
- Vision/Multimodal Agents: RL aligns multimodal reasoning, grounding, and tool use, supporting both perception and generation tasks.
- Embodied Agents: RL is critical for navigation and manipulation in real/simulated environments, supporting generalization and compositionality.
- Multi-Agent Systems: RL enables dynamic coordination, role allocation, and collaborative reasoning in LLM-based MAS.
RL Algorithms and Scaling
The survey provides a detailed taxonomy of RL algorithms for LLMs, including:
- Policy Gradient Methods: REINFORCE, PPO, GRPO, and their variants, with trade-offs in sample efficiency, stability, and computational cost.
- Preference Optimization: DPO, IPO, KTO, and step-wise extensions, enabling alignment without explicit reward models (the standard DPO loss is sketched after this list).
- Hybrid/Process-Based Methods: Integration of outcome and process rewards, expert iteration, and curriculum learning.
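For reference, the standard DPO objective from the preference-optimization family above is reproduced below, following the original DPO formulation; the survey's own notation may differ.

```latex
% Standard DPO loss: \pi_\theta is the policy being trained, \pi_{ref} a frozen reference model,
% (y_w, y_l) a preferred/dispreferred response pair for prompt x, \sigma the logistic function,
% and \beta a temperature controlling deviation from the reference.
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[
\log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
\;-\;
\beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right)\right]
```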
Scaling RL for agentic LLMs requires advances in:
- Compute and Data Efficiency: Distributed, asynchronous RL frameworks (e.g., AREAL, SkyRL, AgentFly) and curriculum generation.
- Environment Complexity: Dynamic, adaptive environments and automated reward/curriculum generation.
- Algorithmic Stability: Robust credit assignment, reward shaping, and hybrid SFT+RL pipelines.
Open Challenges and Future Directions
The survey identifies three grand challenges:
- Trustworthiness: RL amplifies risks of reward hacking, hallucination, and sycophancy. Mitigation requires robust sandboxing, process-based rewards, adversarial training, and continuous monitoring.
- Scaling Agentic Training: Efficient scaling of compute, data, and model size is essential. Hybrid training recipes, meta-learning, and principled difficulty calibration are promising directions.
- Scaling Agentic Environments: Co-evolution of agents and environments, automated reward/curriculum design, and procedural content generation are critical for robust generalization.
Conclusion
Agentic RL marks a fundamental shift in the design and training of LLM-based agents, enabling the emergence of scalable, general-purpose AI systems with adaptive, robust, and interactive capabilities. RL is the critical mechanism for transforming static modules into dynamic, optimizable policies, supporting planning, tool use, memory, reasoning, perception, and self-improvement across diverse domains. The field is rapidly evolving, with open challenges in trustworthiness, scaling, and environment design that will shape the trajectory of general-purpose agentic intelligence.