From Trial-and-Error to Improvement: A Systematic Analysis of LLM Exploration Mechanisms in RLVR (2508.07534v1)
Abstract: Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of LLMs. Unlike traditional RL approaches, RLVR leverages rule-based feedback to guide LLMs in generating and refining complex reasoning chains -- a process critically dependent on effective exploration strategies. While prior work has demonstrated RLVR's empirical success, the fundamental mechanisms governing LLMs' exploration behaviors remain underexplored. This technical report presents a systematic investigation of exploration capacities in RLVR, covering three main aspects: (1) exploration space shaping, where we develop quantitative metrics to characterize LLMs' capability boundaries; (2) entropy-performance exchange, analyzed across training stages, individual instances, and token-level patterns; and (3) RL performance optimization, examining methods to effectively translate exploration gains into measurable improvements. By unifying previously identified insights with new empirical evidence, this work aims to provide a foundational framework for advancing RLVR systems.
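The abstract's second aspect, entropy-performance exchange, hinges on measuring policy entropy at the token level during training. As a rough illustration (not the paper's actual metric or code), the sketch below computes per-token Shannon entropy from a PyTorch policy's logits; the function name `token_entropy` and the tensor shapes are assumptions for the example only.

```python
import torch
import torch.nn.functional as F

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Per-token Shannon entropy of the policy distribution.

    logits: [batch, seq_len, vocab_size] raw model outputs.
    Returns: [batch, seq_len] entropy in nats for each position.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    return -(probs * log_probs).sum(dim=-1)

# Hypothetical rollout batch: track mean generation entropy per training
# step as a simple proxy for how much the policy is still exploring.
logits = torch.randn(4, 16, 32000)           # [batch, seq_len, vocab]
mean_entropy = token_entropy(logits).mean()  # scalar to log each step
print(f"mean token entropy: {mean_entropy:.3f} nats")
```

Logging such a quantity per step, per instance, or per token position is one plausible way to relate entropy trajectories to downstream reward, in the spirit of the entropy-performance analysis described above.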