First Return, Entropy-Eliciting Explore (2507.07017v1)
Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) improves the reasoning abilities of LLMs but it struggles with unstable exploration. We propose FR3E (First Return, Entropy-Eliciting Explore), a structured exploration framework that identifies high-uncertainty decision points in reasoning trajectories and performs targeted rollouts to construct semantically grounded intermediate feedback. Our method provides targeted guidance without relying on dense supervision. Empirical results on mathematical reasoning benchmarks (AIME24) show that FR3E promotes more stable training, produces longer and more coherent responses, and increases the proportion of fully correct trajectories. These results highlight the framework's effectiveness in improving LLM reasoning through more robust and structured exploration.
Summary
- The paper presents a novel structured exploration framework that uses token entropy to pinpoint high-uncertainty positions for focused RL in LLMs.
- It employs a two-phase strategy, combining a base trajectory with targeted partial rollouts to improve intermediate credit assignment in sparse-reward settings.
- Empirical results demonstrate improved training stability, efficient exploration, and up to +6.1% accuracy gains on challenging math reasoning benchmarks.
FR3E: Structured Exploration for Reinforcement Learning from Verifiable Rewards in LLMs
The paper "First Return, Entropy-Eliciting Explore" (FR3E) (2507.07017) introduces a structured exploration framework for reinforcement learning from verifiable rewards (RLVR) in LLMs, with a focus on mathematical reasoning tasks. The work addresses the persistent challenge of unstable and inefficient exploration in RLVR, particularly the difficulty of assigning credit to intermediate reasoning steps in long, sparse-reward trajectories.
Motivation and Context
RLVR has become a standard approach for improving LLM reasoning, but existing methods—such as PPO, GRPO, and their variants—struggle with credit assignment and exploration. Value-based methods require training a critic over a vast state space, leading to instability and computational overhead. Trajectory-level reward assignment, as in GRPO, fails to distinguish between pivotal and inconsequential steps, resulting in suboptimal learning. Heuristic or sampling-based intermediate reward estimation (e.g., VinePPO, PRMs, PRIME) introduces additional complexity, variance, or labeling costs.
FR3E is motivated by the need for a value-model-free, data-efficient, and semantically grounded exploration strategy that can provide targeted feedback at critical decision points in LLM reasoning trajectories.
Methodology
FR3E decomposes RL training into two complementary phases:
- First Return: The model generates a base trajectory for a given prompt. Token-wise entropy is computed along the trajectory to identify high-uncertainty positions—tokens where the model is least confident. The top-K entropy positions are selected as anchors for further exploration.
- Entropy-Eliciting Explore: From each identified anchor (intermediate state), the model performs multiple partial rollouts, generating alternative continuations. Each rollout is evaluated for correctness, and the empirical value of the anchor state is estimated as the average reward across rollouts.
This process yields semantically meaningful, localized feedback signals that are not available in standard autoregressive generation. The approach is inspired by the "First Return, Then Explore" paradigm from Go-Explore, adapted to the sequential nature of LLMs.
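To make the two phases concrete, the following is a minimal Python sketch, assuming access to the policy's per-token logits for the base trajectory and hypothetical `rollout_fn` / `verify_fn` helpers (sampling a continuation from a prefix, and checking the final answer against the verifiable reward). It illustrates the idea and is not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def token_entropies(logits: torch.Tensor) -> torch.Tensor:
    """Per-token entropy H_k = -sum_v p_k(v) log p_k(v), for logits of shape [T, V]."""
    log_probs = F.log_softmax(logits, dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1)          # shape [T]

def select_anchors(logits: torch.Tensor, top_k: int) -> list[int]:
    """First Return: pick the top-K highest-entropy positions of the base trajectory."""
    entropy = token_entropies(logits)
    idx = torch.topk(entropy, k=min(top_k, entropy.numel())).indices
    return idx.sort().values.tolist()                           # anchors in sequence order

def estimate_anchor_value(prompt_ids, base_ids, anchor, rollout_fn, verify_fn, m=8):
    """Entropy-Eliciting Explore: Monte-Carlo value of the intermediate state S_j,
    i.e. the average correctness of M partial rollouts continued from the anchor."""
    prefix = base_ids[: anchor + 1]                             # base trajectory up to the anchor
    rewards = [verify_fn(rollout_fn(prompt_ids, prefix)) for _ in range(m)]
    return sum(rewards) / len(rewards)
```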
Key Implementation Details
- Entropy Computation: For each token position k in the base trajectory, entropy H_k is computed over the model's output distribution. High-entropy positions are selected globally, not just locally, to focus exploration on the most uncertain reasoning steps.
- Block Segmentation: The trajectory is segmented into blocks at the selected entropy positions, enabling fine-grained policy refinement and credit propagation.
- Partial Rollouts: For each anchor state, M rollouts are generated, and their correctness is used to estimate the value V(S_j) of the state.
- Adaptive Advantage Modulation: The advantage function is dynamically scaled based on the marginal improvement in value between consecutive anchor states, stabilizing learning and preventing premature convergence (see the sketch after this list).
- Rejection Sampling: Prompts that yield degenerate batches (all correct or all incorrect rollouts) are filtered out to maintain informative gradient estimates.
- Clip-Higher: An asymmetric clipping strategy is used in policy updates to encourage exploration by allowing more aggressive increases in the probability of underexplored actions.
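A compact sketch of the last three mechanisms follows. The (1 + ΔV) modulation factor, the clipping bounds, and the function signatures are illustrative assumptions rather than the paper's exact formulation.

```python
import torch

def modulate_advantages(advantages: torch.Tensor, anchor_values: list[float]) -> torch.Tensor:
    """Scale per-block advantages by the marginal value improvement between consecutive
    anchor states, Delta_j = V(S_{j+1}) - V(S_j); the (1 + Delta) factor is illustrative."""
    values = torch.as_tensor(anchor_values, dtype=torch.float32)
    deltas = torch.diff(values, append=values[-1:])   # last block gets Delta = 0
    return advantages * (1.0 + deltas)

def keep_prompt(rewards: list[float]) -> bool:
    """Rejection sampling: discard prompts whose rollouts are all correct or all wrong,
    since such degenerate groups provide no useful gradient signal."""
    mean_reward = sum(rewards) / len(rewards)
    return 0.0 < mean_reward < 1.0

def clip_higher_loss(logp_new, logp_old, adv, eps_low=0.2, eps_high=0.28):
    """PPO-style surrogate with asymmetric ('clip-higher') bounds: a looser upper clip
    lets the probability of under-explored actions increase more aggressively."""
    ratio = (logp_new - logp_old).exp()
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    return -torch.minimum(ratio * adv, clipped * adv).mean()
```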
Empirical Results
FR3E is evaluated on a suite of mathematical reasoning benchmarks (AIME24, GSM8K, Math500, Minerva Math, Gaokao2023en, OlympiadBench, College Math, AMC23) using Qwen2.5 model variants (7B, Math-7B, 32B). The main findings are:
- Training Stability: FR3E maintains higher and more stable entropy during training, especially in larger and general-purpose models, indicating healthier exploration and avoidance of entropy collapse.
- Performance: On AIME24, FR3E achieves up to +6.1% accuracy improvement over GRPO++ on Qwen2.5-32B. Across all benchmarks, FR3E consistently matches or outperforms GRPO++, with the largest gains in generalist models.
- Trajectory Quality: FR3E increases the proportion of "All-Right" (fully correct) trajectories and reduces "All-Wrong" ones during exploration, indicating more reliable and consistent reasoning.
- Response Length: FR3E supports the generation of longer, more coherent reasoning chains, particularly in models not already specialized for mathematics.
- Advantage Estimation: The modulated advantage remains tightly centered around zero, preserving unbiased policy gradients and minimizing distributional shift.
Implications and Discussion
Practical Implications
- Value-Model-Free RL: FR3E eliminates the need for a value network, reducing computational complexity and instability, and making it more practical for large-scale LLM training.
- Efficient Exploration: By focusing rollouts on high-uncertainty anchors, FR3E achieves more data-efficient exploration, reducing the number of full-sequence rollouts required.
- Generalization: The method generalizes well across diverse reasoning tasks and model sizes, with the most significant benefits in generalist and large-scale models.
- Domain-Specific Models: Gains are limited in highly specialized models (e.g., Qwen2.5-Math-7B), suggesting that RL strategies must be carefully aligned with model pretraining and domain priors.
Theoretical Implications
- Credit Assignment: FR3E provides a principled approach to intermediate credit assignment by leveraging model-intrinsic uncertainty, rather than relying on external heuristics or dense supervision.
- Structured Exploration: The adaptation of Go-Explore principles to LLMs demonstrates the value of structured, uncertainty-driven exploration in high-dimensional, sequential decision spaces.
Limitations and Future Directions
- Computational Overhead: While more efficient than full-trajectory rollouts, partial rollouts from multiple anchors still incur nontrivial inference costs, especially for long sequences and large models.
- Anchor Selection: The choice of top-K entropy positions may not always align with true causal decision points; further research into more sophisticated uncertainty metrics or causal analysis could enhance anchor selection.
- Extension to Other Domains: While demonstrated on mathematical reasoning, the approach could be extended to code generation, multi-turn dialogue, or other complex reasoning tasks, provided suitable reward signals are available.
Speculation on Future Developments
- Integration with Automated Process Supervision: Combining FR3E with automated subgoal identification (e.g., via LLMs or process reward models) could further improve credit assignment and exploration.
- Hierarchical RL: The block segmentation induced by entropy anchors could serve as a foundation for hierarchical RL in LLMs, enabling multi-level policy optimization.
- Adaptive Exploration Schedules: Dynamic adjustment of the number and location of anchors based on training progress or task difficulty could yield further efficiency gains.
- Broader Application: The structured exploration paradigm may inform RL approaches in other domains with sparse, delayed rewards and complex action spaces, such as robotics or planning.
Conclusion
FR3E represents a significant advancement in structured exploration for RLVR in LLMs, offering a practical, value-model-free framework that leverages model uncertainty to guide exploration and credit assignment. The empirical results demonstrate improved stability, efficiency, and reasoning quality, particularly in large and generalist models. The approach opens new avenues for principled exploration strategies in sequential decision-making with LLMs and suggests promising directions for future research in RL-based LLM training.