Optimal RL recipe for agentic reasoning

Develop the optimal reinforcement learning training recipe for agentic reasoning in large language model agents that integrate external tools, specifying the algorithmic components and settings that yield the best performance and stability.

Background

The paper observes that GRPO-based variants for agentic reasoning differ widely in optimization granularity (token-, sequence-, or trajectory-level), clipping, KL regularization, and entropy management, yet no principled best practice has been established. This uncertainty motivates the authors’ systematic comparison of GRPO recipes and techniques.
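To make the granularity distinction concrete, the sketch below shows one common way a GRPO-style objective can be reduced at the token level versus the sequence level. It is an illustration under stated assumptions, not the paper's implementation: the function names (`group_relative_advantages`, `clipped_objective`, `aggregate`), the clip range of 0.2, and the reading of "token-level" as a mean over all tokens in the group and "sequence-level" as a mean of per-rollout means are all assumptions made here. KL regularization and entropy terms are omitted.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own sampling group.

    Assumption: one scalar reward per rollout, shared by all of its tokens.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_objective(ratio, advantage, clip_eps=0.2):
    """PPO-style clipped surrogate for a single token (to be maximized).

    `ratio` is the new-policy / old-policy probability ratio for that token.
    `clip_eps=0.2` is an illustrative default, not a recommended setting.
    """
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)


def aggregate(per_token, level):
    """Reduce per-token objectives (one list per rollout) to a scalar.

    "token":    every token in the group gets equal weight, so long
                rollouts contribute more to the result.
    "sequence": every rollout gets equal weight, regardless of length.
    """
    if level == "token":
        flat = [v for seq in per_token for v in seq]
        return sum(flat) / len(flat)
    if level == "sequence":
        return mean([mean(seq) for seq in per_token])
    raise ValueError(f"unknown level: {level!r}")


if __name__ == "__main__":
    # Hypothetical group: a short successful rollout and a long failed one.
    rewards = [1.0, 0.0]
    lengths = [2, 6]
    advantages = group_relative_advantages(rewards)

    # With ratio = 1.0 (first update step), each token's objective is its advantage.
    per_token = [
        [clipped_objective(1.0, adv) for _ in range(n)]
        for adv, n in zip(advantages, lengths)
    ]
    print("token-level:   ", aggregate(per_token, "token"))
    print("sequence-level:", aggregate(per_token, "sequence"))
```

In this toy group the two reductions disagree: the sequence-level mean cancels the two rollouts against each other, while the token-level mean is dominated by the longer rollout. That sensitivity to rollout length is one reason the choice of granularity can matter when tool calls make trajectories vary widely in length.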

Clarifying the optimal recipe is crucial for scaling agentic RL reliably, avoiding entropy collapse, and ensuring efficient exploration when LLM agents interleave internal reasoning with external tool calls.

References

Despite rapid progress in GRPO-based variants, the optimal RL recipe for agentic reasoning remains unclear.

Source: Demystifying Reinforcement Learning in Agentic Reasoning (2510.11701 - Yu et al., 13 Oct 2025), Introduction, Algorithm-wise paragraph
