
Exploration–exploitation dynamics (pass@k vs average@k) in agentic RL

Determine the relationship between the pass@k and average@k metrics during agentic reinforcement learning with external tool interactions, clarifying the exploration–exploitation dynamics that govern training efficiency and performance.


Background

Prior work on self-contained generation suggests that RL tends to improve pass@1 while suppressing further exploration, as measured by pass@k. Agentic RL, by leveraging tool feedback, may alter this dynamic, but the relationship remains unclear.

The authors emphasize that resolving this uncertainty is key to understanding how exploration can be maintained or amplified under agentic RL, and how it translates into exploitable performance gains.
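For concreteness, both metrics can be computed from binary per-rollout success indicators, as in the minimal sketch below. The function names are illustrative; the unbiased pass@k estimator is the standard one from Chen et al. (2021), and the paper does not prescribe this exact implementation.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): probability that
    at least one of k samples drawn without replacement from n rollouts,
    of which c succeeded, is correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a success
    return 1.0 - comb(n - c, k) / comb(n, k)

def average_at_k(successes: list[int], k: int) -> float:
    """average@k: mean success rate over k rollouts, i.e. the expected
    single-sample (exploitation) performance."""
    assert len(successes) >= k
    return sum(successes[:k]) / k

# Example: 8 rollouts for one task, 2 of them successful.
rollout_success = [1, 0, 0, 1, 0, 0, 0, 0]
n, c = len(rollout_success), sum(rollout_success)
print(pass_at_k(n, c, k=4))                # exploration: ~0.786
print(average_at_k(rollout_success, k=4))  # exploitation: 0.5
```

The gap between the two values illustrates the dynamic in question: a policy can hold pass@k high (diverse rollouts occasionally succeed) while average@k stays low, or collapse diversity to raise average@k at the cost of pass@k.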

References

For agentic RL, however, it remains unclear (i) what techniques work best for policy optimization, (ii) what the relationship is between exploration (pass@k) and exploitation (average@k), and (iii) how entropy affects training effectiveness, stability, and final performance.

Demystifying Reinforcement Learning in Agentic Reasoning (2510.11701 - Yu et al., 13 Oct 2025) in Section 4 (Algorithmic Design and Training Dynamics in Agentic RL), opening paragraph