- The paper introduces AMPO, which adaptively replaces on-policy failures with teacher guidance to improve exploration and reasoning efficiency.
- The paper demonstrates significant performance gains on in-distribution and out-of-distribution tasks while using far fewer training samples.
- The paper validates that comprehension-based guidance selection and mixed-objective optimization are key to achieving robust pass@k and stable training dynamics.
Adaptive Multi-Guidance Policy Optimization: Enhancing LLM Reasoning via Diverse, On-Demand Teacher Guidance
Introduction
The paper introduces Adaptive Multi-Guidance Policy Optimization (AMPO), a reinforcement learning (RL) framework for LLMs that addresses the limitations of single-teacher mixed-policy RL approaches in reasoning tasks. AMPO leverages a pool of multiple teacher models, providing external guidance only when the on-policy model fails, and selects guidance adaptively based on the student model's comprehension. This approach is motivated by the need to overcome the exploration bottleneck and capacity-difficulty mismatch inherent in on-policy RLVR (Reinforcement Learning with Verifiable Rewards), and to avoid the diversity and generalization limitations of single-teacher distillation.
Methodology
Group Relative Policy Optimization (GRPO) Foundation
AMPO builds upon GRPO, which normalizes rewards within a group of sampled solutions for a given query, obviating the need for a value model and improving training efficiency. For a group of $G$ responses $\{o_1, \dots, o_G\}$ sampled for a query, the group-normalized advantage is:

$$A_{i,t} = \frac{R(o_i) - \operatorname{mean}\left(\{R(o_j)\}_{j=1}^{G}\right)}{\operatorname{std}\left(\{R(o_j)\}_{j=1}^{G}\right)}$$

where $R(\cdot)$ is a rule-based verifier. The policy update uses a PPO-style clipped surrogate objective, with this group-based advantage taking the place of a learned value baseline.
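As a concrete illustration, here is a minimal PyTorch sketch of the two pieces just described: group-normalized advantages and a PPO-style clipped surrogate. The tensor shapes, the epsilon for numerical stability, and the clipping range are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-normalize rewards; `rewards` has shape (G,) for one query's G rollouts."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def clipped_surrogate(logp_new, logp_old, advantages, clip_eps: float = 0.2):
    """PPO-style clipped objective; advantages are broadcast over each response's tokens."""
    ratio = torch.exp(logp_new - logp_old)            # importance sampling ratio per token
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()      # negated because we minimize the loss
```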
Adaptive Multi-Guidance Replacement
AMPO introduces a Multi-Guidance Pool PG containing correct responses from multiple teacher models. For each query, the on-policy model generates G responses. If all responses are incorrect (reward below threshold τ), k of these are replaced with top-k off-policy responses from PG, selected via a comprehension-based metric. This ensures that external guidance is only injected when the model is unable to solve the problem, preserving self-exploration and minimizing unnecessary intervention.
Figure 1: The AMPO training framework. It enhances exploration by adaptively replacing on-policy failures with external solutions from a Multi-Guidance Pool only when sparse rewards occur. The selection of external guidance is prioritized based on the Policy Model's comprehension score for each option, ensuring effective learning.
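The replacement rule can be sketched as follows, assuming each query comes with a list of verifier-scored on-policy rollouts and a pool of verified-correct teacher responses; the function and variable names here are hypothetical, not the paper's code.

```python
def adaptive_replace(on_policy, guidance_pool, comprehension_fn, tau=0.0, k=2):
    """Replace k on-policy rollouts with teacher guidance only if *all* rollouts fail.

    on_policy: list of (response, reward) pairs from the current policy.
    guidance_pool: list of correct teacher responses for this query.
    comprehension_fn: scores how well the student can follow a teacher response.
    """
    if any(reward > tau for _, reward in on_policy):
        return on_policy  # the model solved it on its own; keep pure self-exploration

    # Rank teacher responses by the student's comprehension score and keep the top-k.
    ranked = sorted(guidance_pool, key=comprehension_fn, reverse=True)[:k]
    mixed = list(on_policy)
    for i, teacher_resp in enumerate(ranked):
        mixed[i] = (teacher_resp, 1.0)  # teacher responses are verified correct
    return mixed
```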
Comprehension-Based Guidance Selection
To maximize learning efficiency, AMPO ranks candidate teacher responses by the student model's likelihood of generating the correct answer given the teacher's reasoning path. The comprehension score $r_p$ is:

$$r_p(o^{\text{off}}) = \operatorname{clip}\!\left(\exp\!\left(\frac{1}{|y^*|}\sum_{y_i^* \in y^*} \log \pi_\theta\!\left(y_i^* \mid z^{\text{off}}, y^*_{<i}\right)\right),\; 0,\; 1\right)$$

where $o^{\text{off}} = (z^{\text{off}}, y)$ is the teacher's response with reasoning trace $z^{\text{off}}$ and answer $y$, and $y^*$ is the ground-truth answer. This mechanism ensures that the student learns from guidance it is most likely to assimilate, balancing exploration and exploitation.
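A possible implementation of this score, assuming a HuggingFace-style causal LM and tokenizer, computes the student's average log-probability over the ground-truth answer tokens conditioned on the teacher's reasoning trace, then exponentiates and clips; prompt formatting and tokenization details are simplified.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def comprehension_score(model, tokenizer, teacher_reasoning: str, gt_answer: str) -> float:
    """r_p: exp of the student's mean log-prob over ground-truth answer tokens,
    conditioned on the teacher's reasoning trace, clipped to [0, 1]."""
    ctx_ids = tokenizer(teacher_reasoning, return_tensors="pt").input_ids
    ans_ids = tokenizer(gt_answer, return_tensors="pt", add_special_tokens=False).input_ids
    input_ids = torch.cat([ctx_ids, ans_ids], dim=1)

    logits = model(input_ids).logits                      # (1, T, vocab)
    logprobs = F.log_softmax(logits[:, :-1], dim=-1)      # position t predicts token t+1
    targets = input_ids[:, 1:]
    token_lp = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

    ans_lp = token_lp[:, -ans_ids.shape[1]:]              # keep only the answer positions
    return float(torch.clamp(ans_lp.mean().exp(), 0.0, 1.0))
```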
Mixed-Objective Policy Optimization
The final policy objective aggregates on-policy and off-policy losses, with sequence-level aggregation for off-policy responses to avoid length bias, and token-level aggregation for on-policy responses. The importance sampling ratios are adjusted according to the origin of each response (student or teacher policy).
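A rough sketch of how the two aggregation schemes might be combined is given below; the tensor shapes, masking convention, and clipping range are assumptions rather than the paper's exact formulation.

```python
import torch

def mixed_policy_loss(ratio_on, adv_on, mask_on,
                      ratio_off, adv_off, mask_off,
                      clip_eps: float = 0.2) -> torch.Tensor:
    """ratio_*: per-token importance ratios (B, T); adv_*: per-sequence advantages (B,);
    mask_*: 1 for real tokens, 0 for padding."""
    # On-policy branch: token-level aggregation (every valid token weighted equally).
    on = torch.min(ratio_on * adv_on[:, None],
                   torch.clamp(ratio_on, 1 - clip_eps, 1 + clip_eps) * adv_on[:, None])
    loss_on = -(on * mask_on).sum() / mask_on.sum().clamp(min=1)

    # Off-policy branch: sequence-level aggregation (average within each sequence first,
    # so long teacher traces do not dominate the gradient).
    off = torch.min(ratio_off * adv_off[:, None],
                    torch.clamp(ratio_off, 1 - clip_eps, 1 + clip_eps) * adv_off[:, None])
    per_seq = (off * mask_off).sum(dim=1) / mask_off.sum(dim=1).clamp(min=1)
    loss_off = -per_seq.mean()

    return loss_on + loss_off
```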
Experimental Results
AMPO was evaluated on six in-distribution mathematical reasoning benchmarks and three out-of-distribution (OOD) tasks, using Qwen2.5-7B-Ins as the primary base model. AMPO outperformed GRPO by 4.3% on the in-distribution benchmarks and by 12.2% on the OOD tasks, and matched the performance of LUFFY (a strong single-teacher baseline) while using only 8.5k training samples versus LUFFY's 46k. This demonstrates the data efficiency and generalization benefits of multi-teacher guidance.
Reasoning Efficiency
AMPO produces more concise solutions than SFT, SFT+GRPO, and LUFFY, with an average response length of 1268 tokens, indicating efficient reasoning without sacrificing accuracy.
Figure 2: Average validation score during GRPO and AMPO training.
Exploration and Pass@k Metrics
AMPO maintains higher pass@k values across multiple reasoning benchmarks, indicating superior exploration and the ability to generate diverse solutions. On challenging datasets such as AIME24/25, AMPO's pass@256 curve significantly outperforms both GRPO and the base model.
Figure 3: Pass@k performance of different RL algorithms across several reasoning benchmarks.
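For reference, pass@k is typically computed with the standard unbiased estimator over n sampled completions of which c are correct; a minimal implementation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples drawn, c of them correct.
    Returns the probability that at least one of k randomly chosen samples is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```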
Training Dynamics
AMPO exhibits higher and more stable training rewards, longer and more thoughtful responses, and maintains higher policy entropy throughout training compared to GRPO. This prevents premature convergence to suboptimal solutions and supports robust long-term learning.
Figure 4: Training rewards of AMPO and GRPO during training.
Ablation Studies
Ablations confirm the necessity of each AMPO component:
- Removing adaptive replacement increases response length without accuracy gain.
- Replacing comprehension-based selection with random or length-based heuristics reduces accuracy by up to 1.7%.
- Sequence-level aggregation for off-policy loss is critical to avoid bias.
Increasing the number of guidance replacements ($k_0$) improves accuracy and exploration but can destabilize training and increase response length, indicating a trade-off between diversity and stability.
Teacher Pool Composition
Multi-teacher pools consistently outperform single-teacher setups, even when the single teacher is a much larger model (e.g., DeepSeek-R1). Combining LongCoT and ShortCoT teachers can further optimize the trade-off between accuracy and response length.
Computational Resource Analysis
AMPO incurs moderate computational overhead compared to GRPO, justified by its performance gains. The adaptive replacement mechanism is critical for efficiency, and teacher pool composition can be tuned to manage resource requirements.
Implications and Future Directions
AMPO demonstrates that multi-teacher, adaptive guidance in RLVR for LLMs yields superior reasoning, generalization, and exploration compared to single-teacher or static-mixing approaches. The framework is data-efficient, scalable, and robust across model sizes and architectures. The results challenge the prevailing assumption that a single, more powerful teacher is optimal for mixed-policy RL, showing that diversity and adaptivity are more important for robust reasoning.
Future research directions include:
- Systematic exploration of optimal teacher pool composition and size.
- Dynamic adjustment of $k_0$ and rollout size for further stability and efficiency.
- Extension to other domains beyond mathematical reasoning, including code and multimodal tasks.
- Integration with verifier-free RLVR paradigms and open-ended exploration.
Conclusion
AMPO provides a principled, practical framework for leveraging diverse, on-demand teacher guidance in RLVR for LLMs. Its adaptive, comprehension-driven approach enables effective exploration, robust generalization, and efficient reasoning, setting a new standard for mixed-policy RL in complex reasoning domains. The empirical results and ablation analyses substantiate the claim that multi-teacher, adaptive guidance is a more effective and scalable path to advanced LLM reasoning than reliance on a single, powerful teacher.