
GHPO: Adaptive Guidance for Stable and Efficient LLM Reinforcement Learning (2507.10628v2)

Published 14 Jul 2025 in cs.LG and cs.AI

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a powerful paradigm for facilitating the self-improvement of LLMs, particularly in the domain of complex reasoning tasks. However, prevailing on-policy RL methods often contend with significant training instability and inefficiency. This is primarily due to a capacity-difficulty mismatch, where the complexity of training data frequently outpaces the model's current capabilities, leading to critically sparse reward signals and stalled learning progress. This challenge is particularly acute for smaller, more resource-efficient LLMs. To overcome this, we introduce the Guided Hybrid Policy Optimization (GHPO), a novel difficulty-aware reinforcement learning framework. GHPO dynamically calibrates task difficulty by employing adaptive prompt refinement to provide targeted guidance. This unique approach adaptively balances direct imitation learning for problems currently beyond the model's reach with exploration-based reinforcement learning for more manageable tasks, effectively creating a smooth and optimized learning curriculum. Extensive experiments demonstrate that GHPO achieves an average performance gain of approximately 5% across six challenging mathematics benchmarks, consistently outperforming strong on-policy reinforcement learning and curriculum learning baselines. Further analysis confirms that our framework significantly enhances both training stability and final reasoning performance, thus offering a scalable and efficient solution for developing powerful and robust reasoning models.

Summary

  • The paper proposes GHPO, which mitigates training instability through adaptive prompt refinement combined with imitation and reinforcement learning.
  • It leverages automated difficulty detection and a multi-stage guidance approach to dynamically calibrate task difficulty based on model competence.
  • Experimental results show an average performance gain of approximately 5% across six mathematics benchmarks and smoother optimization dynamics than GRPO and curriculum learning baselines.

GHPO: Adaptive Guidance for Stable and Efficient LLM Reinforcement Learning

This paper introduces Guided Hybrid Policy Optimization (GHPO), a novel framework designed to address the challenges of training instability and inefficiency in Reinforcement Learning with Verifiable Rewards (RLVR) for LLMs. The method dynamically adjusts task difficulty through adaptive prompt refinement, balancing imitation learning and exploration-based RL, leading to improved training stability, sample efficiency, and overall performance in complex reasoning tasks.

Motivation and Problem Statement

On-policy RLVR methods such as GRPO suffer from training instability and poor sample efficiency due to a capacity-difficulty mismatch: the inherent difficulty of the training data often exceeds the model's current capabilities, producing sparse reward signals and stalled learning progress, especially for smaller LLMs. To mitigate these issues, the paper proposes GHPO, which adaptively calibrates task difficulty by employing prompt refinement to provide targeted guidance.

Guided Hybrid Policy Optimization (GHPO) Framework

Figure 1: A visual representation of the GHPO framework, highlighting the automated difficulty detection and adaptive prompt refinement modules.

GHPO combines online RL and imitation learning within a unified framework: a dynamic mechanism assesses sample difficulty, and adaptive prompt refinement then provides varying levels of guidance. For problems the model can likely handle, GHPO primarily uses standard on-policy RL; for more challenging samples, it shifts toward imitation learning by injecting partial ground-truth solution traces into the prompt. The framework comprises two core modules:

  • Automated Difficulty Detection: Assesses the inherent difficulty of the current problem to determine the subsequent learning process.
  • Adaptive Prompt Refinement: Based on the detected difficulty, this module adaptively refines the prompt by incorporating different levels of ground truth guidance.

    The GHPO method employs a dynamic data augmentation strategy. As the policy model's capabilities improve throughout the learning process, queries that initially required a high hint ratio ω may eventually need only a lower level of guidance, or even no guidance at all.
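
To make the interaction of these modules concrete, below is a minimal Python sketch of how a GHPO-style step could route a single training example. It assumes one plausible reading of the detection rule, namely that a sample is flagged as difficult when none of its sampled rollouts earn a verifiable reward, and it escalates the hint ratio within the same step; the callables `sample_fn`, `verify_fn`, and `refine_fn`, the group size, and the within-step escalation are illustrative assumptions rather than the authors' released implementation.

```python
from typing import Callable, List, Tuple

# Illustrative GHPO-style routing for one training example (not the authors' code).
# Assumed callables: sample_fn(prompt, k) returns k model rollouts (strings);
# verify_fn(rollout, solution) returns 1.0 for a verifiably correct answer, else 0.0;
# refine_fn(problem, solution, omega) prepends a hint covering a fraction omega of
# the reference solution (see the sketch under Implementation Details below).

HINT_STAGES = [0.0, 0.25, 0.5, 0.75]  # hint ratio ω per escalation stage

def ghpo_route(
    problem: str,
    solution: str,
    sample_fn: Callable[[str, int], List[str]],
    verify_fn: Callable[[str, str], float],
    refine_fn: Callable[[str, str, float], str],
    k: int = 8,
) -> Tuple[str, List[str], List[float]]:
    """Choose a guidance level and return (prompt, rollouts, rewards) to train on."""
    prompt, rollouts, rewards = problem, [], []
    for omega in HINT_STAGES:
        prompt = problem if omega == 0.0 else refine_fn(problem, solution, omega)
        rollouts = sample_fn(prompt, k)
        rewards = [verify_fn(r, solution) for r in rollouts]
        if any(r > 0 for r in rewards):
            # At least one rollout succeeded: the sample is within reach at this
            # guidance level, so proceed with standard on-policy (GRPO-style) updates.
            return prompt, rollouts, rewards
        # All rollouts failed: the sample is flagged as difficult; escalate the hint.
    # Even the largest hint failed; keep the most-guided attempt (imitation-leaning).
    return prompt, rollouts, rewards
```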

Implementation Details

In the GHPO framework, hints are extracted from the ground-truth solution and appended to the input problem with a guiding sentence: "The following text is the beginning part of the answer, which you can refer to for solving the problem:". The hint ratio ω is dynamically adjusted using a multi-stage guidance approach with a linear schedule. In the experiments, a maximum of three stages with ω = {0.25, 0.5, 0.75} is used.

Figure 2: This figure illustrates the application of GHPO to address a difficult problem by extracting 50% of the ground truth solution as a hint.
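
As a concrete illustration of the prompt refinement step quoted above, the snippet below builds a hinted prompt from the leading fraction ω of the reference solution. The whitespace-token split, the prompt layout, and the example data are assumptions made for illustration; only the guiding sentence is taken verbatim from the paper.

```python
# Illustrative prompt-refinement helper (assumption: the hint is the leading ω
# fraction of the ground-truth solution, split on whitespace tokens; the exact
# tokenization used in the paper may differ).

GUIDE = ("The following text is the beginning part of the answer, "
         "which you can refer to for solving the problem:")

def build_hinted_prompt(problem: str, solution: str, omega: float) -> str:
    """Append the first `omega` fraction of the reference solution as a hint."""
    tokens = solution.split()
    n_hint = max(1, int(len(tokens) * omega))
    hint = " ".join(tokens[:n_hint])
    return f"{problem}\n\n{GUIDE}\n{hint}"

# Hypothetical usage: a 50% hint, as in the scenario shown in Figure 2.
# print(build_hinted_prompt("Solve x^2 - 5x + 6 = 0.",
#                           "Factor the quadratic: (x-2)(x-3)=0, so x=2 or x=3.",
#                           omega=0.5))
```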

The paper also introduces an optional cold-start strategy where, for the first N optimization steps, the difficulty detection mechanism is temporarily disabled, and the original GRPO training process is applied. This allows the model to develop fundamental formatting capabilities and prevents the introduction of early bias before adaptive guidance is implemented.
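
A minimal sketch of that cold-start gate follows, assuming a hypothetical step counter and threshold N; the concrete value of N is a tuning choice not specified here.

```python
# Illustrative cold-start gate (assumption: N is a per-run hyperparameter).
COLD_START_STEPS_N = 20  # hypothetical value for N

def training_mode(step: int, cold_start_steps: int = COLD_START_STEPS_N) -> str:
    """Run plain GRPO (no difficulty detection) for the first N steps, then GHPO."""
    return "grpo" if step < cold_start_steps else "ghpo"
```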

Experimental Results

The authors conduct extensive experiments on six mathematics benchmarks, demonstrating that GHPO outperforms state-of-the-art RL methods and baselines like curriculum learning.

Figure 3: The bar chart compares the performance of GHPO against other models across various benchmarks, highlighting GHPO's superior performance.

Key findings include:

  • GHPO achieves an average performance gain of approximately 5% across six challenging mathematics benchmarks.
  • GHPO enhances training stability, leading to more reliable convergence.
  • GHPO demonstrates superior performance over curriculum learning and fixed hint injection strategies.

Figure 4: This plot shows the proportion of problems detected as difficult within a mini-batch during training.

Figure 5: This figure displays the format reward during training.

The training dynamics analysis reveals that GHPO maintains smaller gradient magnitudes compared to GRPO, indicating a smoother and more stable optimization process. GHPO also generates longer responses than GRPO in later stages, suggesting an enhanced capacity to construct more detailed and elaborate reasoning processes.

Implications and Future Work

The GHPO framework offers a practical and scalable solution for developing powerful reasoning LLMs by intelligently adapting the learning process to the model's evolving capabilities, leading to more stable and effective RL fine-tuning. The adaptive guidance mechanism effectively complements even advanced pre-training, enabling more efficient and effective fine-tuning for complex reasoning tasks. Future work could explore extending GHPO to other complex reasoning tasks and investigating alternative difficulty detection and prompt refinement strategies.

Conclusion

This paper presents a compelling approach to address the challenges of training instability and inefficiency in RLVR for LLMs. The GHPO framework's ability to dynamically calibrate task difficulty through adaptive prompt refinement leads to significant performance gains and improved training stability. The results demonstrate the effectiveness and robustness of GHPO, offering a promising direction for developing powerful and robust reasoning models.