Merge-of-Thought Distillation (2509.08814v1)

Published 10 Sep 2025 in cs.LG, cs.AI, and cs.CL

Abstract: Efficient reasoning distillation for long chain-of-thought (CoT) models is increasingly constrained by the assumption of a single oracle teacher, despite the practical availability of multiple candidate teachers and growing CoT corpora. We revisit teacher selection and observe that different students have different "best teachers," and even for the same student the best teacher can vary across datasets. Therefore, to unify multiple teachers' reasoning abilities into a student while overcoming conflicts among the teachers' supervision, we propose Merge-of-Thought Distillation (MoT), a lightweight framework that alternates between teacher-specific supervised fine-tuning branches and weight-space merging of the resulting student variants. On competition math benchmarks, using only about 200 high-quality CoT samples, applying MoT to a Qwen3-14B student surpasses strong models including DeepSeek-R1, Qwen3-30B-A3B, Qwen3-32B, and OpenAI-o1, demonstrating substantial gains. Besides, MoT consistently outperforms the best single-teacher distillation and the naive multi-teacher union, raises the performance ceiling while mitigating overfitting, and shows robustness to distribution-shifted and peer-level teachers. Moreover, MoT reduces catastrophic forgetting, improves general reasoning beyond mathematics and even cultivates a better teacher, indicating that consensus-filtered reasoning features transfer broadly. These results position MoT as a simple, scalable route to efficiently distilling long CoT capabilities from diverse teachers into compact students.

Summary

  • The paper presents a novel multi-teacher framework that combines teacher-specific supervised fine-tuning with weight-space merging to distill coherent reasoning into a unified student model.
  • The paper applies its method on Qwen3-14B using just 200 high-quality chain-of-thought samples, achieving significant improvements over existing models on math benchmarks.
  • The paper demonstrates that the framework mitigates catastrophic forgetting and enhances robustness, paving the way for scalable reasoning in large language models.

Merge-of-Thought Distillation

The paper presents Merge-of-Thought Distillation (MoT), a framework designed to distill long chain-of-thought (CoT) reasoning into LLMs more effectively. Rather than relying on a single oracle teacher, MoT leverages multiple teacher models, combining teacher-specific supervised fine-tuning (SFT) with weight-space merging to consolidate diverse reasoning styles into a unified student model and to curb the complexity and noise associated with long CoT supervision. This article covers the methodology, experimental setup, findings, and implications of MoT.

Methodology

MoT is structured around an iterative process alternating between teacher-specific SFT and weight-space merging. Initially, multiple candidate teacher models generate teacher-specific distillation datasets from a shared set of seed problems. The MoT algorithm then proceeds through iterative rounds, each consisting of three key steps:

  1. Branch Training: Each teacher's reasoning ability is internalized into separate branches through SFT. This stage aligns the student model with each teacher's unique reasoning trajectory.
  2. Weight-Space Merging: The parameters of these separate branches are averaged to consolidate shared reasoning features while diminishing teacher-specific idiosyncrasies.
  3. Next-Round Initialization: The merged model serves as the base for subsequent iterations, gradually evolving into a student that reflects multi-teacher consensus reasoning.

Figure 1: Workflow of Merge-of-Thought Distillation (MoT).

This process aims to unify different teachers' reasoning styles, overcoming conflicts and enhancing the model's reasoning capabilities without amplifying noise, a common issue with long CoT processes.
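
To make the alternation concrete, the following is a minimal sketch of the MoT loop in Python, assuming an externally supplied `run_sft` fine-tuning routine and one distillation dataset per teacher. The uniform parameter averaging and all function names are illustrative assumptions, not the authors' released implementation.

```python
import copy

import torch


def merge_state_dicts(state_dicts):
    """Weight-space merging: uniformly average parameters across student branches."""
    merged = copy.deepcopy(state_dicts[0])
    for key in merged:
        stacked = torch.stack([sd[key].float() for sd in state_dicts])
        merged[key] = stacked.mean(dim=0).to(merged[key].dtype)
    return merged


def merge_of_thought(student, teacher_datasets, run_sft, num_rounds=5):
    """Minimal MoT sketch: per-teacher branch SFT followed by weight-space merging.

    `run_sft(model, dataset)` is an assumed helper that fine-tunes a copy of the
    student on one teacher's CoT corpus and returns the updated model.
    """
    for _ in range(num_rounds):
        # 1) Branch training: internalize each teacher's reasoning in a separate branch.
        branches = [run_sft(copy.deepcopy(student), ds) for ds in teacher_datasets]
        # 2) Weight-space merging: keep consensus features, average out teacher-specific quirks.
        merged_state = merge_state_dicts([b.state_dict() for b in branches])
        # 3) Next-round initialization: the merged model seeds the next round of SFT.
        student.load_state_dict(merged_state)
    return student
```

In practice, `run_sft` would wrap a standard SFT trainer, and the uniform average could be replaced by weighted or selective merging without changing the overall loop.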

Experimental Setup

The methodology was applied to Qwen3-14B as the student model, using only about 200 high-quality CoT samples. The resulting MoT students were benchmarked against models such as DeepSeek-R1, Qwen3-30B-A3B, Qwen3-32B, and OpenAI-o1 on competition math benchmarks.

Datasets and Training

  • Datasets: The experiments utilized BOBA-200 and S1K-200, derived from high-quality, open-source mathematical problems.
  • Base Models: Qwen3-8B, Qwen3-14B, and Qwen3-30B-A3B served as the base student models across experiments.
  • Training Configuration: Fine-tuning used a batch size of 64 and an initial learning rate of 1e-5. Each teacher-specific branch was trained for 50 steps on its teacher-distilled corpus before merging, and this SFT-then-merge cycle was repeated for five rounds (a configuration sketch follows this list).
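
For concreteness, the reported hyperparameters can be collected into a single configuration object. This is a minimal sketch assuming a dictionary-based config; only the values come from the paper, while the field names and structure are illustrative.

```python
# Illustrative MoT training configuration; the values are reported in the paper,
# but the field names and structure are assumptions made for readability.
mot_training_config = {
    "student_models": ["Qwen3-8B", "Qwen3-14B", "Qwen3-30B-A3B"],
    "datasets": ["BOBA-200", "S1K-200"],  # ~200 high-quality CoT samples each
    "batch_size": 64,
    "initial_learning_rate": 1e-5,
    "steps_per_branch": 50,   # SFT steps on each teacher-distilled corpus before merging
    "num_rounds": 5,          # SFT-then-merge iterations
}
```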

Findings

Performance and Gains

MoT demonstrated substantial improvements over both single-teacher distillation and the naive multi-teacher union. Specifically, MoT applied to Qwen3-14B surpassed DeepSeek-R1, Qwen3-30B-A3B, Qwen3-32B, and OpenAI-o1, achieving a higher performance ceiling while mitigating overfitting.

  • Numerical Results: MoT consistently outperformed competing distillation setups on AIME-sourced benchmarks (AIME24 and AIME25), with gains in average accuracy over both the best single-teacher distillation and the naive multi-teacher union.

Robustness to Teacher Variability

The research confirmed that no single teacher is universally best across all students or datasets, an observation MoT capitalizes on. The framework thus effectively integrates diverse reasoning abilities while remaining robust to distribution-shifted and peer-level teachers.

Figure 2: Teacher choice is not universal.

Mitigation of Catastrophic Forgetting

MoT also proved advantageous in reducing catastrophic forgetting and improving general reasoning beyond mathematics, suggesting that consensus-filtered reasoning features transfer broadly.

Figure 3: Reverse-merge probe highlights smoother trajectories under MoT, indicating a flatter loss region.

Conclusion

MoT represents a versatile and scalable approach to distilling long CoT capabilities into compact students while sidestepping the limitations of single-teacher distillation. By unifying diverse teacher signals into a coherent consensus reasoning structure, MoT not only elevates the performance of the student model but also enhances its general reasoning abilities, making it a promising framework for future LLM developments in complex reasoning tasks. Future work could explore alternative merging strategies and further validate the framework's effectiveness across more diverse datasets and domains.
