
Entropy-guided sequence weighting for efficient exploration in RL-based LLM fine-tuning (2503.22456v2)

Published 28 Mar 2025 in cs.LG and cs.AI

Abstract: We introduce Entropy-Guided Sequence Weighting (EGSW), a novel approach that enhances the exploration-exploitation tradeoff by dynamically assigning weights to generated outputs based on their advantage and entropy for Reinforcement Learning-based LLM fine-tuning. EGSW integrates entropy regularization with advantage-based weighting to balance policy updates, enabling efficient exploration in high-dimensional state spaces. By employing temperature-scaled softmax weighting over sequences, EGSW prioritizes high-reward, high-uncertainty steps while maintaining training stability. Although originally developed to improve Group Relative Policy Optimization (GRPO) during LLM fine-tuning, EGSW is generalizable to other reinforcement learning (RL) algorithms and can be implemented in both step-wise and trajectory-wise settings. Empirical evaluations demonstrate that EGSW enhances GRPO reasoning ability, yielding improvements in sample efficiency. Future work will explore the application of EGSW to advanced RL methodologies.

Summary

  • The paper introduces Entropy-Guided Sequence Weighting (EGSW) to dynamically balance exploration and exploitation in RL-based LLM fine-tuning.
  • EGSW integrates entropy regularization with advantage-based weighting to boost reward scores and enhance reasoning capabilities in LLMs.
  • Experiments on Qwen2.5-Math models highlight that precise hyperparameter tuning of α and P is crucial for achieving stable and improved performance.

Entropy-Guided Sequence Weighting for Efficient Exploration in RL-Based LLM Fine-Tuning

This paper introduces Entropy-Guided Sequence Weighting (EGSW), a novel method designed to enhance exploration during RL-based fine-tuning of LLMs. EGSW dynamically adjusts the weights of generated sequences based on both their advantage and entropy, effectively balancing exploration and exploitation. The method aims to improve reasoning capabilities in LLMs by prioritizing high-reward, high-uncertainty steps, and it is generalizable to both step-wise and trajectory-wise RL frameworks.

Background and Motivation

The primary challenge in fine-tuning LLMs with RL lies in efficiently exploring high-dimensional state spaces. Traditional search-based methods such as Monte Carlo Tree Search (MCTS) are computationally expensive, rendering them impractical for LLM fine-tuning. While search-free methods such as GRPO offer computational efficiency, they often suffer from suboptimal exploration due to their reliance on policy-driven sampling. EGSW addresses these limitations by integrating entropy regularization with advantage-based weighting, promoting a more balanced and efficient exploration strategy.

Entropy-Guided Sequence Weighting (EGSW)

EGSW enhances the exploration-exploitation tradeoff by dynamically assigning weights to generated sequences based on their advantage and entropy. The raw weight for sequence $i$ at step $t$ is computed as:

$$w_{i,t}^{\text{raw}} = \exp\left(\frac{A_{i,t} + \alpha H_{i,t}}{P}\right),$$

where $A_{i,t}$ is the advantage, $H_{i,t}$ is the entropy, $\alpha$ is a hyperparameter scaling the entropy contribution, and $P$ is a temperature parameter controlling weight distribution sparsity. The entropy at step $t$, $H_{i,t}$, is calculated as:

$$H_{i,t} = - \sum_{a \in \mathcal{A}} \pi_{\theta}(a \mid q, a_{i,<t}) \log \pi_{\theta}(a \mid q, a_{i,<t}),$$

where $\pi_{\theta}(a \mid q, a_{i,<t})$ is the probability of selecting action $a$ given the prompt $q$ and previously generated tokens $a_{i,<t}$ under policy $\pi_{\theta}$, and $\mathcal{A}$ represents the action space. These raw weights are then normalized using a softmax function to ensure proper scaling and training stability:

$$w_{i,t} = \frac{w_{i,t}^{\text{raw}}}{\sum_{j=1}^{N} w_{j,t}^{\text{raw}}} = \frac{\exp\left(\frac{A_{i,t} + \alpha H_{i,t}}{P}\right)}{\sum_{j=1}^{N} \exp\left(\frac{A_{j,t} + \alpha H_{j,t}}{P}\right)}.$$
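A minimal PyTorch sketch of this weighting step is shown below. It is an illustrative reconstruction, not the authors' implementation: the function names (`token_entropy`, `egsw_weights`), the `(N, T)` tensor shapes, and the choice to take the softmax across the group of sampled sequences at each step are assumptions inferred from the formulas above.

```python
import torch
import torch.nn.functional as F

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Per-step policy entropy H_{i,t} from token logits of shape (N, T, vocab)."""
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    return -(probs * log_probs).sum(dim=-1)             # shape (N, T)

def egsw_weights(advantages: torch.Tensor,
                 entropy: torch.Tensor,
                 alpha: float = 0.3,
                 temperature: float = 1.0) -> torch.Tensor:
    """EGSW weights w_{i,t}: temperature-scaled softmax of (A + alpha * H) / P.

    advantages, entropy: tensors of shape (N, T) for a group of N sequences.
    The softmax runs over the group dimension, matching the sum over j = 1..N.
    """
    raw = (advantages + alpha * entropy) / temperature  # (A_{i,t} + alpha * H_{i,t}) / P
    return torch.softmax(raw, dim=0)                    # normalize across the N sequences
```

In a trajectory-wise variant, advantage and entropy would be aggregated per sequence before the softmax, which matches the paper's remark that EGSW can be applied in both step-wise and trajectory-wise settings.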

The normalized weights $w_{i,t}$ are used to reweight the policy gradient update, given by:

$$\nabla_\theta \mathcal{J}_{\text{EGSW}}(\theta) = \frac{1}{K} \sum_{i=1}^{K} \frac{1}{N_{i}} \sum_{t=1}^{N_{i}} w_{i,t} \left[ \hat{A}_{i,t} + \beta \left( \frac{\pi_{\text{ref}}(a_{i,t} \mid q, a_{i,<t})}{\pi_{\theta}(a_{i,t} \mid q, a_{i,<t})} - 1 \right) \right] \nabla_\theta \log \pi_\theta(a_{i,t} \mid q, a_{i,<t}).$$

Figure 1: Training reward of the methods based on Qwen2.5-Math-7B.
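
The update can be read as a weighted REINFORCE-style surrogate with GRPO's KL-penalty term. The sketch below is a hedged reconstruction rather than the authors' code: it builds a scalar loss whose gradient matches the expression above by detaching the bracketed coefficient, so that gradients flow only through $\log \pi_\theta$. The `beta` default and the masking scheme are illustrative assumptions.

```python
import torch

def egsw_policy_loss(logp: torch.Tensor,        # log pi_theta(a_{i,t} | q, a_{i,<t}), shape (N, T)
                     ref_logp: torch.Tensor,    # log pi_ref(a_{i,t} | q, a_{i,<t}),   shape (N, T)
                     advantages: torch.Tensor,  # A_hat_{i,t},                         shape (N, T)
                     weights: torch.Tensor,     # EGSW weights w_{i,t},                shape (N, T)
                     mask: torch.Tensor,        # 1.0 for valid tokens, 0.0 for pad,   shape (N, T)
                     beta: float = 0.04) -> torch.Tensor:
    """Surrogate loss whose gradient matches the weighted EGSW/GRPO update above."""
    ratio_ref = (ref_logp - logp).exp()                        # pi_ref / pi_theta
    coeff = weights * (advantages + beta * (ratio_ref - 1.0))  # w * [A_hat + beta * (ratio - 1)]
    # Detach the coefficient so only grad(log pi_theta) carries gradient,
    # then average per sequence (1 / N_i) and over the group (1 / K).
    per_seq = (coeff.detach() * logp * mask).sum(-1) / mask.sum(-1).clamp(min=1)
    return -per_seq.mean()  # negate: the optimizer minimizes, the update ascends J
```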

Experimental Results

The authors implemented EGSW on top of the GRPO framework, adopting the Simple-RL Reason approach with Qwen2.5-Math-7B and Qwen2.5-Math-7B-Instruct as base models. The integration was built using the Hugging Face open-r1, transformers, and trl repositories.

Experiments demonstrated that EGSW consistently outperformed standard GRPO in terms of reward scores. For hyperparameter tuning, the scaling coefficient $\alpha$ was adjusted between 0.15 and 0.50, and the temperature parameter $P$ was explored between 1 and 2. The optimal performance for Qwen2.5-Math-7B-Instruct was achieved with normalized entropy, $\alpha = 0.8$, and $P = 1.8$, while Qwen2.5-Math-7B performed best with normalized entropy, $\alpha = 0.3$, and $P = 1$.

Figure 2: Training reward of the methods based on Qwen2.5-Math-7B-Instruct.
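
Purely as a usage illustration, the reported settings could be bundled into a small configuration mapping and fed to the hypothetical `egsw_weights` helper from the earlier sketch; none of these names come from the paper's code, and the tensors here are dummies.

```python
import torch

# Best-performing settings reported in the summary above; "normalized entropy"
# means H_{i,t} is assumed to be normalized before weighting.
EGSW_CONFIGS = {
    "Qwen2.5-Math-7B":          {"alpha": 0.3, "temperature": 1.0},
    "Qwen2.5-Math-7B-Instruct": {"alpha": 0.8, "temperature": 1.8},
}

# Dummy (N, T) tensors stand in for real per-step advantages and entropies.
advantages = torch.randn(8, 16)
entropy = torch.rand(8, 16)

cfg = EGSW_CONFIGS["Qwen2.5-Math-7B"]
weights = egsw_weights(advantages, entropy, **cfg)  # helper from the earlier sketch
```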

Discussion and Observations

A key observation is that fine-tuning Qwen2.5-Math-7B-Instruct with GRPO using an 8K MATH dataset alone did not improve the model’s reasoning capability, likely due to the model’s prior fine-tuning. However, incorporating EGSW enhanced the model’s reasoning ability, attributed to EGSW’s capacity to encourage better exploration. Moreover, the authors noted that effective exploration enables the model to generate fewer tokens while achieving higher rewards.

Figure 3: Completion length of the methods based on Qwen2.5-Math-7B.

The authors also highlight that EGSW is highly sensitive to hyperparameters and requires careful tuning, particularly of the entropy coefficient $\alpha$, which must be balanced to prevent excessive exploration. Additionally, the weighting reduces the overall gradient norm, so the learning rate needs to be adjusted for stable learning.

Conclusion

The paper introduces EGSW as an effective method for enhancing the exploration-exploitation tradeoff in RL-based LLM fine-tuning. By integrating entropy into the weighting mechanism, EGSW promotes more diverse and informative trajectories while maintaining a focus on high-reward outputs. Empirical results demonstrate that EGSW enhances GRPO by achieving higher reward scores and improving reasoning capabilities. Future work could explore integrating EGSW with other reinforcement learning-based fine-tuning strategies to further enhance model performance across broader tasks.
