Exploration–exploitation dynamics (pass@k vs average@k) in agentic RL
Determine the relationship between pass@k and average@k metrics during agentic reinforcement learning with external tool interactions, clarifying the exploration–exploitation dynamics that govern training efficiency and performance.
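The minimal Python sketch below makes the two metrics concrete. It rests on several assumptions:

- Each problem receives n sampled rollouts, and each rollout is scored as correct or incorrect (for example, by a verifier that checks the tool-assisted final answer).
- pass@k uses the unbiased estimator from Chen et al. (2021).
- average@k is taken to mean per-rollout accuracy, averaged over problems.

The cited paper's exact definitions may differ. The function names `pass_at_k`, `average_at_k`, and `summarize` are hypothetical.

```python
from math import comb
from statistics import mean


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    Probability that at least one of k rollouts, drawn without replacement
    from n scored rollouts of which c are correct, is correct.
    """
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= n")
    # Fewer than k incorrect rollouts: every size-k draw contains a correct one.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def average_at_k(results: list[bool]) -> float:
    """Assumed definition: mean accuracy over the rollouts of one problem."""
    if not results:
        raise ValueError("need at least one rollout")
    return sum(results) / len(results)


def summarize(per_problem: list[list[bool]], k: int) -> tuple[float, float]:
    """Dataset-level (pass@k, average@k), each averaged uniformly over problems."""
    pass_k = mean(pass_at_k(len(r), sum(r), k) for r in per_problem)
    avg_k = mean(average_at_k(r) for r in per_problem)
    return pass_k, avg_k


# Usage: one problem solved in only 1 of 4 rollouts, one solved in all 4.
p, a = summarize([[True, False, False, False], [True, True, True, True]], k=4)
print(p, a)  # pass@4 = 1.0, average@4 = 0.625
```

At k = 1 the two metrics coincide, because pass_at_k(n, c, 1) reduces to c / n. As k grows, pass@k rises toward the fraction of problems solved at least once, while average@k stays fixed. One common reading treats the gap as a measure of problems the policy can reach through exploration but does not yet solve reliably. That reading is an interpretation offered here, not a finding of the cited work.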
References
For agentic RL, however, it remains unclear (i) what techniques work best for policy optimization, (ii) what is the relationship between the exploration(pass@k)-exploitation(average@k), and (iii) how does entropy affect training effectiveness, stability, and final performance.
— Demystifying Reinforcement Learning in Agentic Reasoning
(2510.11701 - Yu et al., 13 Oct 2025) in Section 4 (Algorithmic Design and Training Dynamics in Agentic RL), opening paragraph