
Agentic RL Training Systems

Updated 10 September 2025
  • Agentic RL training systems are frameworks that enable LLMs to become active agents by leveraging temporally extended, tool-integrated reinforcement learning.
  • Core architectures use multi-stage pipelines including supervised fine-tuning, agentic RL with explicit action tagging, and hierarchical credit assignment for robust planning.
  • Empirical evaluations demonstrate improvements in pass@1 accuracy, reasoning depth, and process traceability, highlighting enhanced adaptability and self-improvement.

Agentic RL training systems are frameworks and methodologies where reinforcement learning (RL) is used to transform LLMs and related AI models from passive generators into active, adaptable decision-making agents. These agents interact with external environments—typically through sequential, multi-turn decision processes involving planning, tool use, and reasoning—to achieve complex objectives. In contrast to conventional single-step RL for LLMs, agentic RL expands the scope to temporally extended, partially observable Markov decision processes (POMDPs), allowing models to plan, act, perceive, and self-improve in dynamic, multi-modal domains (Zhang et al., 2 Sep 2025, Li et al., 8 Sep 2025).

1. Agentic RL Foundations: Definition and Conceptual Shift

Agentic RL marks a paradigm shift from classic LLM-RL (which often reduces to degenerate, single-step MDPs) to temporally extended POMDPs that require long-horizon credit assignment, exploration, and adaptive behavior (Zhang et al., 2 Sep 2025). In this regime, reinforcement learning is not merely a fine-tuning or alignment tool, but rather the principal mechanism for instilling core agentic capabilities—including planning, multi-turn tool use, memory utilization, dynamic perception, and emergent self-improvement.

Agentic RL systems require agents to interact with tool-rich environments (e.g., code execution, web search, external databases) via tagged action protocols (such as <search> and <answer> segments), enabling closed-loop feedback rather than open-loop supervised imitation (Li et al., 8 Sep 2025). The goal becomes trajectory-level optimization, where the model must recover from unexpected states, synthesize new strategies, and learn from environmental feedback in real time.
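As an illustration of this closed-loop, tag-driven interaction, the following is a minimal sketch of a multi-turn rollout loop. It assumes a hypothetical `llm.generate` interface and a `search_tool` callable, and uses an illustrative `<result>` tag for tool feedback; nothing here is taken from a specific framework.

```python
import re

def run_rollout(llm, search_tool, question, max_turns=8):
    """Multi-turn rollout: the model emits tagged actions; tool results are
    fed back as observations until an <answer> tag terminates the episode."""
    context = f"Question: {question}\n"
    trajectory = []  # (context, model_output) pairs kept for later RL updates
    for _ in range(max_turns):
        output = llm.generate(context)          # assumed LLM interface
        trajectory.append((context, output))
        answer = re.search(r"<answer>(.*?)</answer>", output, re.S)
        if answer:                              # terminal action
            return trajectory, answer.group(1).strip()
        query = re.search(r"<search>(.*?)</search>", output, re.S)
        if query:                               # tool call -> environment feedback
            result = search_tool(query.group(1).strip())
            context += output + f"\n<result>{result}</result>\n"
        else:                                   # malformed action: expose it and continue
            context += output + "\n"
    return trajectory, None  # no answer within the turn budget
```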

2. Core Architectures and Training Pipelines

Typical agentic RL pipelines employ a multi-stage protocol:

  • Supervised fine-tuning (SFT), often process-constrained, to establish output formats and stable tool-call syntax before RL.
  • Agentic RL over multi-turn rollouts with explicit action tagging, so that tool calls and answers appear as traceable segments.
  • Hierarchical or step-level credit assignment over the resulting trajectories to support robust planning.

This architecture abstracts agent state as latent “semantic variables,” and agent actions as full-sequence or segmental outputs (including tool invocations), transforming each multi-step interaction into a trajectory for RL (a data-structure sketch follows).
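One lightweight way to realize this abstraction, sketched here with hypothetical field names rather than any framework's actual schema, is to store each trajectory as a sequence of segments flagged by origin, so that tool outputs can later be excluded from gradient updates:

```python
from dataclasses import dataclass, field

@dataclass
class Segment:
    text: str
    from_agent: bool   # True: model-generated tokens (trainable); False: tool/env output

@dataclass
class Trajectory:
    query: str
    segments: list[Segment] = field(default_factory=list)
    reward: float = 0.0          # outcome reward, optionally combined with step rewards

    def training_text(self) -> str:
        """Full interaction as one sequence; token masks are derived from from_agent."""
        return "".join(s.text for s in self.segments)
```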

3. RL Algorithms and Credit Assignment

Agentic RL training systems have adopted or extended a variety of RL algorithms:

  • Trajectory-level policy optimization using PPO, GRPO, DAPO, REINFORCE++ with KL regularization is standard (Luo et al., 5 Aug 2025, Da et al., 13 Jun 2025).
  • Token masking ensures gradients flow only through the agent’s own decisions, not through tool outputs; this is crucial for optimizing tool-integrated behavior (a masking sketch follows this list) (Singh et al., 28 Apr 2025).
  • Step-level and group-level credit assignment: LightningRL, ARPO, and variants perform fine-grained attribution by splitting rewards and advantages over sub-trajectories, group rollouts, or at tool boundary steps (Dong et al., 26 Jul 2025, Luo et al., 5 Aug 2025).
  • Multi-objective rewards are employed, aggregating final outcome signals (e.g., correctness, factuality), step/process rewards (e.g., format adherence, tool correctness, retrieved information gain), and efficiency terms (e.g., tool call minimization or diversity) (Li et al., 8 Sep 2025, Liu et al., 29 May 2025).
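The token-masking idea referenced above can be made concrete with a short sketch. It assumes per-token log-probabilities, advantages, and a 0/1 float mask (zero on tool-output tokens) have already been computed; the clipped surrogate follows standard PPO/GRPO practice rather than any single cited system:

```python
import torch

def masked_policy_loss(logprobs, old_logprobs, advantages, agent_mask, clip_eps=0.2):
    """Clipped surrogate loss averaged only over agent-generated tokens.

    All arguments are float tensors of shape [batch, seq_len]; agent_mask is
    1.0 on tokens the model produced and 0.0 on tool/environment outputs.
    """
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    per_token = -torch.min(unclipped, clipped)
    # Masked mean: tool-output tokens contribute neither loss nor gradient.
    return (per_token * agent_mask).sum() / agent_mask.sum().clamp(min=1.0)
```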

The general RL objective is represented as:

J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(q, \tau) \right]

where $R(q, \tau)$ is a return function combining path-level and step-level rewards for query $q$ and trajectory $\tau$.
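For concreteness, the score-function gradient of this objective with a group-relative baseline (the baseline choice mirrors GRPO-style practice and is an illustrative assumption, not a formula quoted from the cited works) can be written as:

\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \left( R(q, \tau) - b(q) \right) \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right], \qquad b(q) = \frac{1}{G} \sum_{i=1}^{G} R(q, \tau_i)

where $b(q)$ is the mean return over a group of $G$ rollouts for the same query, and the sum runs over the agent's own (non-masked) action tokens.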

4. Agentic Capabilities: Planning, Tool Use, Memory, and Perception

Agentic RL explicitly targets the emergence of core capabilities: long-horizon planning, multi-turn tool use, memory utilization across interactions, dynamic (often multi-modal) perception, and emergent self-improvement driven by environmental feedback.

5. Frameworks, Engineering Platforms, and Scalability

The infrastructure supporting agentic RL has become increasingly sophisticated: modern frameworks decouple agent rollout from policy training, expose unified tool APIs, and rely on asynchronous, traced, plugin-based engineering to scale multi-turn, tool-heavy rollouts (see the framework table in Section 6 and the toy sketch below).
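A toy sketch of this decoupled rollout/training pattern is shown below; all names are hypothetical, and production frameworks such as Agent Lightning or VerlTool implement far more elaborate, distributed versions of the same idea:

```python
import queue
import threading

trajectory_queue: queue.Queue = queue.Queue(maxsize=256)

def rollout_worker(agent, env_factory, n_episodes=1000):
    """Rollout side: interacts with tool environments and ships trajectories."""
    for _ in range(n_episodes):
        env = env_factory()
        trajectory_queue.put(agent.collect_trajectory(env))  # assumed agent interface

def trainer_loop(learner, batch_size=32):
    """Training side: consumes trajectories asynchronously and updates the policy."""
    batch = []
    while True:
        batch.append(trajectory_queue.get())
        if len(batch) >= batch_size:
            learner.update(batch)    # e.g., a masked GRPO/PPO step
            batch.clear()

# Rollouts and training run concurrently, so slow tool calls never block updates:
# threading.Thread(target=rollout_worker, args=(agent, make_env), daemon=True).start()
# trainer_loop(learner)
```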

6. Evaluation, Benchmarks, and Empirical Gains

Evaluation protocols measure both process and outcome: reported gains span pass@1 accuracy, reasoning depth, and process traceability, alongside task-specific benchmark scores. The table below summarizes representative frameworks, their core RL algorithms, key innovations, and benchmarked domains.

Framework | Core RL Algorithm(s) | Key Innovation(s) | Domains Benchmarked
--- | --- | --- | ---
Agent Lightning | Hierarchical RL, Credit Assignment | Decoupling agent/train, plug-and-play | SQL, Retrieval-Aug. QA, Math Tools
VerlTool | GRPO, Async Rollout | Unified tool APIs, multi-modality | Math, QA, SQL, Vision, Web, SWE
Kimi K2 | Joint RL (GRPO-like), MuonClip | Joint real & synthetic environments | Code, Math, Reasoning, Agentics
MUA-RL | GRPO, User Simulation | LLM-simulated users, multi-turn RL | Multi-turn Tool Use (TAU2, ACEBench)
rStar2-Agent | GRPO-RoC | Resample-on-correct, code exec infra | Math, Alignment, Tool-use
Deep-DxSearch | GRPO + Multi-reward | End-to-end RL on RAG | Medical Diagnosis, Reasoning
Agent-RLVR | DPO with Guidance | Guidance-aug. rollouts, trajectory densification | SWE, code repair
Chain-of-Agents | Agentic RL + Distillation | Multi-agent-to-LLM trajectory distill | Web Agents, Code Contests, Research

7. Practical Guidance and Open Directions

Agentic RL system design should follow several best practices:

  • Two-stage pipelines: SFT or process-constrained SFT before RL for stable rollouts.
  • Templated rollouts: Use explicit tagged segments for process supervision and traceability.
  • Reward design: Combine outcome and step-level rewards, with masking for tool outputs, groupwise baselines, and multi-objective terms (a reward-aggregation sketch follows this list).
  • Curriculum learning: Stage complexity, moving from simple to hard environments/tasks.
  • Distributed, modular infrastructure: Exploit asynchronous, traced, plugin-based engineering for rollout and training scalability.
  • Evaluation: Capture both solution finality and process quality, using task-specific and process-informed metrics.
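As an illustration of the reward-design bullet, the sketch below aggregates an outcome term, simple process signals, and an efficiency penalty into a single scalar. The component list mirrors Section 3; the specific weights, scaling, and argument names are assumptions for illustration only.

```python
def combined_reward(outcome_correct: bool,
                    format_ok: bool,
                    valid_tool_calls: int,
                    total_tool_calls: int,
                    w_outcome: float = 1.0,
                    w_process: float = 0.2,
                    w_efficiency: float = 0.05) -> float:
    """Aggregate outcome, step/process, and efficiency signals into one reward."""
    outcome = 1.0 if outcome_correct else 0.0
    process = 1.0 if format_ok else 0.0
    efficiency = 0.0
    if total_tool_calls > 0:
        process += valid_tool_calls / total_tool_calls   # fraction of well-formed calls
        efficiency = -float(total_tool_calls) / 10.0     # mild penalty on call count
    return w_outcome * outcome + w_process * process + w_efficiency * efficiency
```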

Future open challenges include credit assignment in extremely long, multi-agent settings, reward specification for multimodal and real-world domains, and further reduction of reliance on human priors and handcrafted schemas (Zhang et al., 2 Sep 2025, Li et al., 8 Sep 2025). RL frameworks that can decouple strategic planning, tool integration, and multi-agent coordination at scale are likely focal points of ongoing research as agentic LLMs become more complex and capable.


Agentic RL training systems constitute the functional and methodological core underlying the current wave of general-purpose AI agents—a shift from static, heuristic models to robust, learning-based, and adaptive decision-makers that plan, perceive, reason, and act autonomously in dynamic digital worlds.
