- The paper identifies the 'Prefix Dominance Trap' and quantifies its effect on large reasoning models’ performance.
- The paper proposes LeaP, a novel approach where parallel reasoning paths share summaries to correct errors and improve inference outcomes.
- The paper demonstrates that LeaP significantly reduces performance drops and enhances accuracy across multiple challenging benchmarks.
The paper "Learning from Peers in Reasoning Models" (2505.07787) introduces a novel inference-time strategy called Learning from Peers (LeaP) to address a limitation in large reasoning models (LRMs) referred to as the "Prefix Dominance Trap". This trap describes the phenomenon where a short, low-quality start to a reasoning process significantly hinders the model's ability to self-correct and reach the correct answer. Inspired by psychological findings on the benefits of peer interaction, LeaP enables different parallel reasoning paths of an LRM to share and integrate insights from each other during inference.
The Prefix Dominance Trap
The authors quantify the Prefix Dominance Trap by showing that when LRMs (specifically, DeepSeek-R1-Distill-Qwen series and QwQ-32B) are forced to start their reasoning from the initial tokens of incorrect responses, their performance drops by nearly 20% on benchmarks like AIME 2024, even when these prefixes are only about 15% of the average response length. This highlights the fragility of LRM self-correction in the face of initial errors.
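To make the setup concrete, here is a minimal sketch of the prefix-forcing experiment, assuming a Hugging Face checkpoint; the ~15% prefix ratio follows the paper, while the prompt handling and decoding settings are illustrative:

```python
# Sketch (not the authors' code): prepend the first ~15% of a known-incorrect
# response to the prompt, let the model continue, and compare accuracy against
# unconstrained generation. Decoding settings here are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

def continue_from_prefix(question: str, bad_response: str, ratio: float = 0.15) -> str:
    """Force generation to continue from the first `ratio` of a bad response."""
    bad_ids = tokenizer(bad_response, add_special_tokens=False).input_ids
    prefix = tokenizer.decode(bad_ids[: int(len(bad_ids) * ratio)])
    inputs = tokenizer(question + "\n" + prefix, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=8192, do_sample=True, temperature=0.6)
    return tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
```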
Learning from Peers (LeaP) Method
LeaP integrates cross-path interaction into the standard parallel inference process. Instead of generating multiple reasoning paths independently, LeaP periodically inserts "LeaP blocks" where paths communicate. Each LeaP block consists of two stages:
- Summarization: Each reasoning path condenses its current state, insights, and intermediate results into a concise summary (limited to 256 tokens) using dynamic prompts.
- Routing: These summaries are shared among peer paths. To manage the information flow, a routing mechanism selects a subset of peer summaries for each path. The paper explores three routing strategies based on Levenshtein similarity between summaries:
- Dispersed Routing: Selects the most dissimilar summaries to introduce diverse perspectives.
- Clustered Routing: Selects the most similar summaries to reinforce aligned reasoning.
- Hybrid Routing: Selects a mix of similar and dissimilar summaries.
After receiving peer summaries, each path can incorporate these insights into its subsequent reasoning, leveraging peer verification to potentially correct errors or explore better approaches.
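A minimal sketch of one LeaP block under these definitions follows; `generate` is a hypothetical stand-in for the underlying LLM call, and the summarization prompt is illustrative rather than the paper's exact wording:

```python
# Illustrative sketch of one LeaP block (summarize + route), not the authors'
# implementation. Edit distance is used directly; the paper phrases routing in
# terms of Levenshtein similarity, which orders peers equivalently (inverted).
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def leap_block(paths, generate, top_k=4, mode="dispersed"):
    """One summarize-and-route round over parallel paths (assumes len(paths) > top_k)."""
    # Stage 1: each path condenses its current reasoning into a short summary.
    summaries = [generate("Summarize your reasoning so far in under 256 tokens:\n" + p)
                 for p in paths]
    # Stage 2: route Top-k peer summaries to each path (large distance = dissimilar).
    routed = []
    for i in range(len(summaries)):
        peers = sorted((levenshtein(summaries[i], summaries[j]), j)
                       for j in range(len(summaries)) if j != i)
        if mode == "dispersed":      # most dissimilar peers
            picked = peers[-top_k:]
        elif mode == "clustered":    # most similar peers
            picked = peers[:top_k]
        else:                        # hybrid: half similar, half dissimilar
            picked = peers[: top_k // 2] + peers[-(top_k - top_k // 2):]
        routed.append([summaries[j] for _, j in picked])
    return summaries, routed
```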
Evaluation of LeaP
The effectiveness of LeaP is evaluated in two main scenarios:
- Mitigating Prefix Dominance: Experiments show that LeaP significantly reduces the performance gap when starting from bad prefixes. For example, on DeepSeek-R1-Distill-Qwen-14B, the performance drop on AIME 2024 shrinks from 19.88% to 7.81%. LeaP also provides improvements when starting from good prefixes, suggesting it helps mitigate subtle errors.
- Comprehensive Reasoning Benchmarks: LeaP is evaluated on AIME 2024, AIME 2025, AIMO 2025, and GPQA Diamond. LRMs using LeaP show substantial Pass@1 improvements over independent reasoning baselines. For instance, QwQ-32B with LeaP (Top-4 Dispersed) scores nearly 5 points higher on average across benchmarks and surpasses DeepSeek-R1-671B on all three math benchmarks. Dispersed and Hybrid routing generally outperform Clustered and Random routing (reported in the appendix). LeaP also improves Cons@N; a sketch of both metrics follows this list.
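For reference, the two metrics under their usual definitions (a hedged sketch, not the authors' evaluation code):

```python
# Pass@1 averages per-sample correctness over N sampled answers;
# Cons@N scores whether the majority-vote answer matches the gold answer.
from collections import Counter

def pass_at_1(answers: list[str], gold: str) -> float:
    return sum(a == gold for a in answers) / len(answers)

def cons_at_n(answers: list[str], gold: str) -> float:
    majority, _ = Counter(answers).most_common(1)[0]
    return float(majority == gold)
```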
Efficiency and "Aha" Moments
Analysis of token usage shows that LeaP does not significantly increase the total number of generated tokens compared to baselines. Furthermore, LeaP models exhibit fewer "Aha" moments (keywords indicating self-correction attempts). This suggests that receiving peer insights helps models reach consensus or identify correct paths earlier, potentially reducing the need for extensive self-reflection and redundant exploration.
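A simple way to reproduce this kind of analysis is keyword counting over reasoning traces; the keyword list below is an illustrative assumption, as the paper's exact list is not reproduced here:

```python
# Sketch of a keyword-based "Aha" moment count over a reasoning trace.
# The keyword list is assumed for illustration, not taken from the paper.
import re

AHA_KEYWORDS = ["wait", "aha", "let me reconsider", "on second thought"]  # assumed

def count_aha_moments(trace: str) -> int:
    text = trace.lower()
    return sum(len(re.findall(re.escape(k), text)) for k in AHA_KEYWORDS)
```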
LeaP-T: Fine-tuned Models
The authors observed that smaller models sometimes struggle to follow summarization and reflection instructions. To address this, they fine-tuned DeepSeek-R1-Distill-Qwen models (1.5B, 7B, 14B) on synthetic data generated using LeaP with a 32B model, creating the LeaP-T series. LeaP-T models are specifically adapted to the LeaP framework. Experiments show that LeaP-T models achieve further performance gains over their base models (with or without standard SFT), particularly on math benchmarks. For example, LeaP-T-7B achieves a Pass@1 score on AIME 2024 comparable to DeepSeek-R1-Distill-Qwen-14B. LeaP-T also demonstrates more efficient test-time scaling than independent parallel reasoning.
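A hedged sketch of how such synthetic SFT data might be assembled; the record fields and the correctness filter are assumptions rather than the paper's exact recipe:

```python
# Pair each question with a successful 32B LeaP trace (including its LeaP
# blocks) and write standard SFT records. Field names are illustrative.
import json

def build_sft_records(examples, out_path="leap_t_train.jsonl"):
    with open(out_path, "w") as f:
        for ex in examples:  # ex: {"question", "leap_trace", "answer", "gold"}
            if ex["answer"] != ex["gold"]:
                continue  # assumed filter: keep only traces reaching the correct answer
            f.write(json.dumps({"prompt": ex["question"],
                                "response": ex["leap_trace"]}) + "\n")
```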
In-depth Analysis of LeaP Mechanics
- Communication Granularity (T): More frequent communication (a smaller interval T between LeaP blocks) generally leads to slightly better performance but consumes more tokens, highlighting a trade-off (see the configuration sketch after this list).
- Communication Traffic (Top-k): Performance on AIME 2024 peaks for DeepSeek-R1-Distill-Qwen-14B with Top-4 summaries, suggesting that too few summaries limit information, while too many can cause overload and reduce effectiveness.
- Evolution of Communication Types: Analysis shows that peer influence ("Influenced" category) is most impactful in the early to mid-stages of reasoning, decreasing towards the end, while paths increasingly become "Unaffected".
- Communication Position (Single Communication): A simplified LeaP variant where communication occurs only once demonstrates that early communication (e.g., at 4K tokens) is more effective than later communication, yielding significant improvements over the baseline even with a single interaction.
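These knobs can be collected into a single configuration; the Top-4 Dispersed default follows the paper's reported best setting, while the remaining defaults are illustrative:

```python
# One way to expose the parameters discussed above as a config object.
from dataclasses import dataclass

@dataclass
class LeaPConfig:
    interval_tokens: int = 4096   # T: tokens generated between LeaP blocks (illustrative)
    top_k: int = 4                # peer summaries routed to each path (paper's best)
    routing: str = "dispersed"    # "dispersed" | "clustered" | "hybrid"
    summary_budget: int = 256     # max tokens per summary (from the paper)
    num_paths: int = 8            # parallel reasoning paths (illustrative)
```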
Robustness Analysis
LeaP exhibits robustness in practical scenarios:
- Error Tolerance: LeaP consistently outperforms baselines even when the parallel paths start from mixed prefixes in which only a small proportion are correct. Models with LeaP can distill useful signals even from noisy peer summaries, showing strong error tolerance.
- Difficulty Levels: LeaP improves accuracy across all difficulty levels, including "Very Hard" problems where the baseline fails completely. This suggests LeaP can help models solve problems previously beyond their reach by facilitating recovery from complete failures. LeaP often uses fewer "reasoning" tokens (excluding peer summaries) than the baseline on harder problems, implying earlier consensus.
Human Verification
Human evaluation of case studies on AIME 2024 shows that many incorrect reasoning paths are corrected after peer communication (Incorrect -> Correct transitions). Crucially, very few correct paths are disrupted (Correct -> Incorrect transitions are rare or absent). This indicates that LeaP primarily acts as a corrective mechanism that leverages peer insights to fix errors without negatively impacting already correct reasoning.
Implementation Considerations
Implementing LeaP involves:
- Parallel generation of multiple reasoning paths.
- Implementing LeaP blocks at chosen token intervals (T).
- Developing prompts for summarization and reflection.
- Implementing a routing mechanism (e.g., Levenshtein similarity calculation and Top-k selection).
- Managing token consumption and computational resources for summarization and processing peer summaries.
The choice of T and Top-k impacts the trade-off between performance and efficiency. The analysis suggests that Top-4 Dispersed or Hybrid routing and communication in early to mid-stages are effective settings.
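Putting the pieces together, a high-level control-flow sketch under the assumptions above; `generate_chunk` and `summarize_and_route` are hypothetical stand-ins for a real inference backend and the routing sketch shown earlier:

```python
# End-to-end sketch: alternate independent decoding with LeaP blocks.
def leap_inference(question, generate_chunk, summarize_and_route, cfg, max_rounds=8):
    paths = [question] * cfg.num_paths
    for _ in range(max_rounds):
        # 1. Each path independently generates up to T more tokens.
        paths = [generate_chunk(p, max_new_tokens=cfg.interval_tokens) for p in paths]
        if all("</answer>" in p for p in paths):  # assumed completion marker
            break
        # 2. LeaP block: summarize every path, then inject Top-k peer summaries.
        _, routed = summarize_and_route(paths, top_k=cfg.top_k, mode=cfg.routing)
        paths = [p + "\n[Peer insights]\n" + "\n".join(r)
                 for p, r in zip(paths, routed)]
    return paths
```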
Conclusion
The paper successfully identifies the Prefix Dominance Trap and proposes LeaP as an effective method to enhance LRM reasoning by enabling structured peer interaction during parallel inference. The substantial performance gains on challenging benchmarks, the analysis demonstrating error tolerance and robustness across difficulty levels, and the development of the fine-tuned LeaP-T series provide strong evidence for the practical value of this approach. LeaP represents a significant step towards enabling LRMs to collaborate in solving complex problems. Future work could explore integrating LeaP into RL frameworks and leveraging peers with specialized expertise.