LeaP-T Series: Fine-Tuned Reasoning Models
- The LeaP-T series is a family of fine-tuned large reasoning models that enables structured peer-to-peer communication for enhanced chain-of-thought reasoning.
- It uses continued supervised fine-tuning with explicit instruction tokens for summarization and peer integration to effectively mitigate the Prefix Dominance Trap.
- Empirical evaluations show substantial improvements in Pass@1 and majority-vote accuracy on challenging mathematical benchmarks across model sizes from 1.5B to 14B parameters.
The LeaP-T series denotes a family of fine-tuned Large Reasoning Models (LRMs) designed to enhance multi-chain, collaborative chain-of-thought (CoT) reasoning by enabling structured peer-to-peer communication. The LeaP-T adaptation builds on the Learning from Peers (LeaP) paradigm by endowing smaller-scale transformer models with robust instruction-following abilities for peer summarization, reflection, and routing on mathematical reasoning benchmarks. The central aim is to address the Prefix Dominance Trap: a failure mode in which a short but erroneous reasoning prefix dominates subsequent inference, causing persistent errors that conventional self-verification cannot correct. LeaP-T achieves robust error correction and strong single-sample (Pass@1) and majority-vote accuracy on competitive math and multi-step evaluation tasks, matching or surpassing much larger baseline models with only 7B or 14B parameters (Luo et al., 12 May 2025).
1. Model Foundations and Architectural Design
The LeaP-T series originates from the DeepSeek-R1-Distill-Qwen transformer backbone, with variant sizes at 1.5B, 7B, and 14B parameters (R1-1.5B, R1-7B, R1-14B), and also leverages QwQ-32B for inference-time peer generation. The adaptation to the LeaP-T protocol is realized exclusively through continued supervised fine-tuning (SFT) with no modification to the underlying transformer architecture. The protocol integrates explicit instruction tokens for “Summarize” and “Incorporate peers” actions within CoT sequences, ensuring that models internalize and reliably execute peer-interaction logic during downstream inference. This addresses an observed limitation where smaller LRMs fail to consistently execute complex, multi-instruction reasoning without tailored SFT exposure (Luo et al., 12 May 2025).
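As a hypothetical illustration, the sketch below shows how such instruction segments might be interleaved into a CoT trace. The literal tag strings, the block layout, and the helper function are assumptions for illustration, not the exact LeaP-T format.

```python
# Hypothetical sketch of interleaving LeaP-style instruction segments
# into a chain-of-thought trace. Tag strings are illustrative assumptions.

SUMMARIZE_TAG = "<summarize>"            # assumed instruction token
INCORPORATE_TAG = "<incorporate_peers>"  # assumed instruction token

def build_leap_trace(problem: str,
                     segments: list[str],
                     own_summaries: list[str],
                     peer_summaries: list[list[str]]) -> str:
    """Concatenate reasoning segments, inserting a LeaP block
    (own summary + incorporated peer summaries) between segments."""
    parts = [problem]
    for i, segment in enumerate(segments):
        parts.append(segment)
        if i < len(own_summaries):  # a LeaP block follows all but the final segment
            parts.append(f"{SUMMARIZE_TAG}\n{own_summaries[i]}")
            parts.append(f"{INCORPORATE_TAG}\n" + "\n".join(peer_summaries[i]))
    return "\n".join(parts)
```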
2. Supervised Fine-Tuning and Data Generation
Fine-tuning utilizes ∼1,000 AIME problems (years 1984–2023) as source data. Each input is processed by the LeaP inference engine on QwQ-32B, yielding 32 distinct CoT traces augmented with intermediate peer summaries. Only answer-correct, summary-compliant (≤256 tokens per summary) traces are retained. As a control, standard CoT SFT traces without peer interaction are also assembled. The autoregressive cross-entropy loss is applied across the concatenated sequence: problem statement, periodic “Summarize” and “Incorporate peers” segments, and the completion up to the final answer. The loss is not decomposed: all instruction-following and peer-incorporation behaviors are learned end-to-end.
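A minimal sketch of this retention filter follows, assuming per-trace records with answer, gold-answer, and summary fields; the field names and the whitespace tokenizer are stand-ins for the actual pipeline:

```python
def tokenize(text: str) -> list[str]:
    # Stand-in for the model tokenizer; whitespace split for illustration.
    return text.split()

def keep_trace(trace: dict, max_summary_tokens: int = 256) -> bool:
    """Retention filter: keep only answer-correct traces whose every
    intermediate peer summary fits the 256-token budget."""
    return (
        trace["answer"] == trace["gold_answer"]
        and all(len(tokenize(s)) <= max_summary_tokens
                for s in trace["summaries"])
    )
```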
| Hyperparameter | 1.5B | 7B | 14B |
|---|---|---|---|
| Batch size | 16 | 16 | 16 |
| Learning rate | | | |
| LR scheduler | Cosine decay | Cosine decay | Cosine decay |
| Warmup ratio | 5% | 5% | 5% |
| Optimizer | AdamW | AdamW | AdamW |
| Weight decay | | | |
| Max seq length | 16,384 | 16,384 | 16,384 |
| Epochs | 8 | 8 | 5 |
| Precision | bfloat16 | bfloat16 | bfloat16 |
No explicit auxiliary losses are used for summarization or routing behaviors; the model learns comprehensive instruction adherence through SFT (Luo et al., 12 May 2025).
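As a concrete illustration of this single end-to-end objective, a minimal PyTorch-style sketch follows; it assumes a Hugging Face-style causal LM whose forward pass returns logits, and is not LeaP-T's actual training code:

```python
import torch
import torch.nn.functional as F

def sft_loss(model, input_ids: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy over the full concatenated trace
    (problem, LeaP blocks, completion); no auxiliary terms."""
    logits = model(input_ids).logits       # (batch, seq_len, vocab_size)
    shift_logits = logits[:, :-1, :]       # predict token t+1 from prefix up to t
    shift_labels = input_ids[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )
```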
3. Peer Routing and Interaction Framework
Inference-time chain evolution is organized around periodic “LeaP blocks,” in which the parallel chains each execute the following process every $T$ tokens, for a fixed interval $T$:
- Summarization: Each chain emits a ≤256-token summary of its current reasoning state.
- Routing: For every pair of summaries $(s_i, s_j)$, the normalized Levenshtein similarity is calculated to quantify textual proximity: $\mathrm{sim}(s_i, s_j) = 1 - \mathrm{Lev}(s_i, s_j) / \max(|s_i|, |s_j|)$, where $\mathrm{Lev}$ denotes edit distance.
- Peer Context Selection: Each chain selects $k$ peer summaries for integration, using one of three schemes (sketched in code after this list):
  - Clustered: The top-$k$ most similar summaries.
  - Dispersed: The bottom-$k$ least similar summaries.
  - Hybrid: A mix of the top and bottom $k/2$ summaries.
- The selected peer summaries are appended to each chain’s context before continuing token sampling.
- This cyclical summarization-and-routing framework allows collective correction and information sharing among reasoning paths, providing resilience against early, locally dominant errors (Luo et al., 12 May 2025).
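The following self-contained Python sketch illustrates the routing step under the similarity formula above. The scheme names follow the list; the tie-breaking, the hybrid split, and the assumption of at least $k$ peers are illustrative choices, not the reference implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Normalized Levenshtein similarity in [0, 1]."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def select_peers(summaries: list[str], i: int, k: int, scheme: str) -> list[str]:
    """Pick k peer summaries for chain i under one of the three schemes."""
    scored = sorted(((similarity(summaries[i], s), j)
                     for j, s in enumerate(summaries) if j != i), reverse=True)
    if scheme == "clustered":      # top-k most similar
        chosen = scored[:k]
    elif scheme == "dispersed":    # bottom-k least similar
        chosen = scored[-k:]
    else:                          # hybrid: half from each end
        chosen = scored[:k // 2] + scored[-(k - k // 2):]
    return [summaries[j] for _, j in chosen]
```

In this sketch, the clustered scheme reinforces consensus among similar chains, while the dispersed scheme maximizes exposure to divergent lines of reasoning.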
4. Benchmark Evaluation and Empirical Findings
LeaP-T models are evaluated on Pass@1 (single-sample correctness) and Cons@N (majority-vote correctness over N sampled chains) across AIME 2024, AIME 2025, AIMO 2025, and GPQA Diamond; a sketch of both metrics appears at the end of this section. Results indicate:
- Substantial performance gains:
- R1-7B with vanilla LeaP gains roughly 6 points over independent sampling; LeaP-T-7B adds a further 2–3 points, reaching 64.38% Pass@1 on AIME 2024 (matching the R1-14B SFT baseline).
- R1-14B with LeaP(-T) achieves 60.66% average Pass@1, closing over 40% of the math performance gap to DeepSeek-R1-671B using only 14B parameters.
- Error tolerance:
- On AIME 2024, even when all chains are seeded with incorrect prefixes (0% correct starts), R1-14B+LeaP achieves 41.9% Pass@1 vs. a 28.8% baseline.
- On very hard problems with baseline Pass@1 = 0%, LeaP models achieve nonzero accuracy, e.g., ~10%.
- Robust majority voting:
- Cons@N scores consistently increase, reflecting more reliable consensus across peer chains.
| Model | AIME 24 | AIME 25 | AIMO 25 | GPQA | Avg. | Cons@N |
|---|---|---|---|---|---|---|
| R1-1.5B Baseline | 32.00 | 24.69 | 14.00 | 46.91 | 29.40 | 36.67 |
| +SFT | 31.04 | 23.23 | 15.31 | 52.97 | 30.89 | 41.11 |
| +LeaP | 34.90 | 26.46 | 15.63 | 53.28 | 32.57 | 35.56 |
| LeaP-T-1.5B | 37.08 | 26.67 | 20.31 | 55.56 | 34.90 | 45.56 |
| ... | ... | ... | ... | ... | ... | ... |
| LeaP-T-14B | 76.46 | 54.27 | 52.50 | 57.42 | 60.66 | 71.11 |
The key takeaway is that structured peer reasoning yields substantial performance improvements without any parameter scaling (Luo et al., 12 May 2025).
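For reference, the two reported metrics can be computed as in the generic sketch below, which assumes string-matched final answers rather than the paper's evaluation harness:

```python
from collections import Counter

def pass_at_1(samples: list[list[str]], gold: list[str]) -> float:
    """Average per-problem fraction of individually correct samples."""
    rates = [sum(a == g for a in answers) / len(answers)
             for answers, g in zip(samples, gold)]
    return sum(rates) / len(rates)

def cons_at_n(samples: list[list[str]], gold: list[str]) -> float:
    """Fraction of problems whose majority-vote answer is correct."""
    hits = sum(Counter(answers).most_common(1)[0][0] == g
               for answers, g in zip(samples, gold))
    return hits / len(samples)
```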
5. Analysis: Prefix Dominance Trap and Error Correction
The Prefix Dominance Trap describes a failure mode of chain-of-thought reasoning in which a brief initial prefix (e.g., the first ~15% of reasoning steps), if incorrect, determines the remainder of the solution path, producing a drop of roughly 20 percentage points in Pass@1 for independent chains. The LeaP protocol reduces this error gap by roughly 10 points, demonstrating that peer summarization and integration can rescue chains from local minima. Further robustness analysis shows that LeaP-T models remain resilient under varying fractions of good and bad prefixes and maintain consistent gains across the full problem-difficulty spectrum. Reflective “aha” moments (where models self-correct via explicit reflection) decrease by ~16%, indicating that collaborative peer input offloads some of the burden traditionally placed on self-verification (Luo et al., 12 May 2025).
Case studies show that peer integration is strongly net-positive: correct-to-incorrect answer flips are essentially absent, while incorrect-to-correct flips occur in ~40% of cases examined.
6. Significance and Implications
LeaP-T demonstrates that small and mid-scale transformer models, when appropriately fine-tuned with structured peer interaction instructions, can close a substantial portion of the performance gap to much larger state-of-the-art reasoning systems. The SFT regime tailored to the LeaP protocol not only improves instruction-following behaviors but also mitigates previously hard-to-correct failure modes such as the Prefix Dominance Trap. These findings suggest that collaborative path sampling and mid-chain reflection, orchestrated by explicit, learned instructions, represent a scalable technique for error correction and robustness in multi-step symbolic and mathematical reasoning. LeaP-T provides a framework for transforming individual, self-verifying LLMs into collectives that benefit from structured, periodic information exchange, thereby enabling new avenues for research in collaborative reasoning model architectures (Luo et al., 12 May 2025).