Merge-of-Thought Distillation (MoT)
- Merge-of-Thought Distillation (MoT) is a framework that integrates heterogeneous teacher rationales into a unified, reasoning-aware student model.
- It employs alternating teacher-specific fine-tuning and weight-space merging to consolidate diverse chain-of-thought insights, enhancing benchmark performance.
- MoT mitigates catastrophic forgetting while boosting generalization and transferability, paving the way for self-reinforcing teacher-student cycles.
Merge-of-Thought Distillation (MoT) is a framework for reasoning-aware knowledge distillation in LLMs that consolidates the reasoning strengths of multiple heterogeneous teacher models into a single compact student model. MoT specifically addresses the practical and theoretical limitations arising when only a single “oracle” teacher is used for chain-of-thought (CoT) distillation, despite the ready availability of diverse, high-quality CoT corpora and a multitude of candidate teachers. By alternating between teacher-specific supervised fine-tuning branches and weight-space merging of student model variants, MoT yields a student that outperforms single-teacher and simplistic multi-teacher approaches on competition-level reasoning benchmarks, demonstrates robustness to distributional shift, mitigates catastrophic forgetting, and produces transferable reasoning features that can seed the next generation of teachers (Shen et al., 10 Sep 2025).
1. MoT Framework: Alternating Branch Training and Weight-Space Merging
MoT operates in iterative cycles, each consisting of:
- Teacher-Specific Supervised Fine-Tuning: The base student model is cloned into $K$ parallel branches, each branch trained independently on distilled CoT data from one teacher. The supervised fine-tuning objective for branch $k$ is
$$\mathcal{L}^{(k)}_{\text{SFT}}(\theta) = -\,\mathbb{E}_{(x,\, y^{(k)}) \sim \mathcal{D}_k}\left[\sum_{i} \log p_\theta\!\left(y^{(k)}_i \mid x,\, y^{(k)}_{<i}\right)\right],$$
where $y^{(k)}$ is the teacher-$k$ rationale and $\mathcal{D}_k$ is the teacher-specific data, optionally filtered for correctness.
- Weight-Space Merging: After branch-wise SFT, the fine-tuned student variants’ weights are averaged:
$$\theta^{(t+1)} = \frac{1}{K}\sum_{k=1}^{K} \theta^{(t)}_k,$$
where $\theta^{(t)}_k$ are the branch-specific weights for round $t$ and $K$ is the number of teachers.
This merge–then–fine-tune cycle is repeated for several rounds. Each cycle reinforces consensus features and suppresses teacher-specific idiosyncrasies or noise, yielding a student model aligned to the reasoning span covered by all sources.
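As a concrete illustration of one merge-then-fine-tune cycle, the following is a minimal PyTorch-style sketch; the `sft_on_teacher_data` helper and the uniform state-dict averaging are assumptions made for illustration, not the authors' released implementation.

```python
# Minimal sketch of MoT (illustrative assumptions, not the authors' code):
# clone the current student, fine-tune each clone on one teacher's CoT corpus,
# then merge the clones by uniform weight averaging, and repeat.
import copy
import torch

def mot_round(student, teacher_corpora, sft_on_teacher_data):
    """One MoT round over K teacher-specific CoT corpora."""
    branch_states = []
    for corpus in teacher_corpora:              # one branch per teacher
        branch = copy.deepcopy(student)         # clone the current consensus student
        sft_on_teacher_data(branch, corpus)     # teacher-specific SFT (user-supplied)
        branch_states.append(branch.state_dict())

    # Weight-space merge: uniform average of the K fine-tuned branches.
    merged = copy.deepcopy(branch_states[0])
    for key in merged:
        merged[key] = torch.stack(
            [state[key].float() for state in branch_states]
        ).mean(dim=0).to(merged[key].dtype)
    student.load_state_dict(merged)
    return student

def mot_distill(student, teacher_corpora, sft_on_teacher_data, rounds=3):
    """Repeat the merge-then-fine-tune cycle for several rounds."""
    for _ in range(rounds):
        student = mot_round(student, teacher_corpora, sft_on_teacher_data)
    return student
```

Uniform averaging is the simplest instantiation consistent with the description above; weighted or parameter-subset merges are natural variants but are not implied by the source.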
2. Multi-Teacher Distillation and Teacher Selection
Empirical analysis demonstrates the inadequacy of manual, fixed teacher selection: for a fixed student architecture, the “best” teacher differs across datasets, tasks, and training checkpoints. MoT responds by:
- Aggregating CoT traces and final-answer-aligned references from diverse teachers (e.g., QwQ, Qwen3-32B, DeepSeek-R1).
- Constructing teacher-specific distillation sets with optional answer-matching to ensure rationale validity.
- Training each branch so that it is optimized for its teacher’s “style” before fusing the updates into a consensus parameter set.
This explicit alternation avoids the brittleness of fixed teacher selection and exploits complementary reasoning patterns, allowing the student to generalize beyond what is possible with any individual teacher or a naive union of all distillation corpora; a sketch of the per-teacher set construction follows.
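The per-teacher distillation-set construction with answer-matching can be sketched as below; the `####` answer delimiter, the field names, and the `teacher_generate` callable are hypothetical conventions chosen for illustration rather than details taken from the paper.

```python
# Illustrative sketch (assumed conventions): build a teacher-specific
# distillation set, keeping only rationales whose final answer matches
# the reference answer (the optional "answer-matching" filter).
def extract_final_answer(rationale: str) -> str:
    """Hypothetical convention: the final answer follows a '####' marker."""
    return rationale.rsplit("####", 1)[-1].strip()

def build_distill_set(problems, teacher_generate, filter_answers=True):
    """problems: dicts with 'question' and 'answer' fields.
    teacher_generate: callable mapping a question to a CoT rationale string."""
    kept = []
    for item in problems:
        rationale = teacher_generate(item["question"])
        if filter_answers and extract_final_answer(rationale) != item["answer"].strip():
            continue  # drop rationales whose final answer is wrong
        kept.append({"prompt": item["question"], "completion": rationale})
    return kept
```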
3. Superiority Over Baseline Distillation Strategies
MoT consistently outperforms both single-teacher distillation (STD) and naive multi-teacher distillation (MTD):
| Distillation Strategy | Early Training Loss | Final Benchmark Score | Overfitting Risk | Generalization |
|---|---|---|---|---|
| Single-Teacher (STD) | Lower | Lower ceiling | Higher | Narrow |
| Naive Multi-Teacher (MTD) | Variable | Prone to noise | High | Unstable |
| Merge-of-Thought (MoT) | Moderate | Highest | Lower | Robust |
- Although STD may achieve lower token-level loss early, MoT attains higher final AIME math scores, raises the achievable “performance ceiling,” mitigates overfitting, and is less sensitive to teacher choice or data idiosyncrasy.
- MoT avoids the quality dilution and conflicting supervision endemic to naive aggregation by enforcing teacher-specific specialization followed by a consensus merge.
4. Performance, Robustness, and Transferability
MoT’s efficacy is empirically validated on competition mathematics benchmarks:
- With only 200 high-quality CoT samples, a Qwen3-14B student trained via MoT surpasses DeepSeek-R1, Qwen3-30B-A3B, Qwen3-32B, and OpenAI-o1.
- The average score over AIME24 and AIME25 improves by +3.54–4.86 points over strong baselines, a substantial gain on these high-difficulty benchmarks.
- MoT generalizes: it is robust to teachers with distributional shift (e.g., DeepSeek-R1, which as a sole teacher would harm performance), functions effectively even when teachers are peer-level rather than strictly stronger, and lifts performance on out-of-mathematics reasoning tasks such as SimpleQA, MMLU, PhyBench, and LiveCodeBench.
- The consensus student model not only achieves superior benchmark metrics but, when used as a teacher in follow-on distillation, produces a more effective “student-as-teacher” signal than any of its original sources.
5. Mitigating Catastrophic Forgetting and Raising the Reasoning Ceiling
A persistent issue in CoT distillation is catastrophic forgetting, where focused fine-tuning on reasoning-rich tasks diminishes capabilities on factual recall or lower-level linguistic knowledge. MoT mitigates this by:
- Blending the “reasoning features” of multiple teachers during the consensus merge, which has a regularizing effect on the shared parameters.
- As a result, MoT incurs smaller performance drops on catastrophic-forgetting–sensitive benchmarks (relative to the best single-teacher baseline), while achieving significant gains on pure reasoning tasks.
Furthermore, through consensus merging, MoT smooths the internal representation of “thoughts,” as observed in reverse-trajectory merge probes, indicating a more robust and transferable reasoning manifold.
6. Key Technical Mechanisms and Mathematical Formulations
The MoT distillation process is mathematically characterized as:
- Branch SFT per Teacher: minimize the per-teacher objective $\mathcal{L}^{(k)}_{\text{SFT}}$ defined above.
- Weight Averaging: $\theta^{(t+1)} = \frac{1}{K}\sum_{k=1}^{K}\theta^{(t)}_k$.
- Iterative cycling (for $T$ rounds) to reinforce consensus.
- Optional filtering of training trajectories based on final-answer correctness.
No additional architectural modifications are required; all merging is performed at the parameter level.
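Putting these pieces together, one MoT round can be written compactly as follows; this is a restatement of the formulas above (with $\operatorname{SFT}(\theta;\,\mathcal{D}_k)$ denoting supervised fine-tuning on $\mathcal{D}_k$ under $\mathcal{L}^{(k)}_{\text{SFT}}$, starting from $\theta$), not an additional result from the paper:
$$\theta^{(t)}_k = \operatorname{SFT}\big(\theta^{(t)};\, \mathcal{D}_k\big), \qquad \theta^{(t+1)} = \frac{1}{K}\sum_{k=1}^{K}\theta^{(t)}_k, \qquad t = 0, 1, \dots, T-1.$$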
7. Implications and Future Research
MoT’s paradigm demonstrates that cyclical teacher-specific training followed by weight-space fusion not only consolidates reasoning strengths and raises the performance ceiling, but also yields features with broader transfer and regularization across unrelated reasoning domains.
A promising implication is that recursively using MoT-distilled students as next-generation teachers may further propagate robust CoT features, approaching a self-reinforcing cycle of consensus distillation. Additionally, MoT’s effectiveness with only ~200 high-quality CoT samples suggests high sample efficiency for future work in limited-data or domain-specialized settings.
Structural similarities between MoT’s parameter-space merging and other “branch-merge” (Sun et al., 6 Mar 2025), curriculum, and multi-modality distillation frameworks suggest potential directions for scaling to more complex reasoning settings, richer structures (e.g., mixture- or matrix-of-thought approaches), and applications beyond mathematics.
References to Related Work
MoT’s consensus-based weight-fusion strategy conceptually relates to parameter merging in “Branch-Merge Distillation” (Sun et al., 6 Mar 2025), guidance in metastable chain-of-thought search (Kim et al., 2 Feb 2025), and reasoning-awareness in distillation across modalities (Zheng et al., 21 May 2025, Li et al., 2023). The underlying methodology is tightly anchored in empirical findings that diverse supervision and consensus merging yield superior and more robust reasoning models than single-source or naively aggregated approaches.