
Merge-of-Thought Distillation (MoT)

Updated 11 September 2025
  • Merge-of-Thought Distillation (MoT) is a framework that integrates heterogeneous teacher rationales into a unified, reasoning-aware student model.
  • It employs alternating teacher-specific fine-tuning and weight-space merging to consolidate diverse chain-of-thought insights, enhancing benchmark performance.
  • MoT mitigates catastrophic forgetting while boosting generalization and transferability, paving the way for self-reinforcing teacher-student cycles.

Merge-of-Thought Distillation (MoT) is a framework for reasoning-aware knowledge distillation in LLMs that consolidates the reasoning strengths of multiple heterogeneous teacher models into a single compact student model. MoT specifically addresses the practical and theoretical limitations arising when only a single “oracle” teacher is used for chain-of-thought (CoT) distillation, despite the ready availability of diverse, high-quality CoT corpora and a multitude of candidate teachers. By alternating between teacher-specific supervised fine-tuning branches and weight-space merging of student model variants, MoT yields a student that outperforms single-teacher and simplistic multi-teacher approaches on competition-level reasoning benchmarks, demonstrates robustness to distributional shift, mitigates catastrophic forgetting, and produces transferable reasoning features that can seed the next generation of teachers (Shen et al., 10 Sep 2025).

1. MoT Framework: Alternating Branch Training and Weight-Space Merging

MoT operates in iterative cycles, each consisting of:

  • Teacher-Specific Supervised Fine-Tuning: The base student model is cloned into $K$ parallel branches, each branch trained independently on distilled CoT data from one teacher. The supervised fine-tuning objective for branch $k$ is

$$\mathcal{L}_{\mathrm{SFT}}^{(k)}(\theta) = \mathbb{E}_{(x,\, r^{(k)},\, y) \in \mathcal{D}^{(k)}} \left[\sum_t -\log p_\theta(z_t \mid x, z_{<t})\right],$$

where $r^{(k)}$ is the rationale from teacher $k$, and $\mathcal{D}^{(k)}$ is the teacher-specific data, optionally filtered for correctness.

  • Weight-Space Merging: After branch-wise SFT, the $K$ fine-tuned student variants’ weights are averaged:

$$\theta^{(t)} = \frac{1}{K} \sum_{k=1}^{K} \theta^{(t,k)},$$

where $\theta^{(t,k)}$ denotes the branch-$k$ weights in round $t$.

This merge–then–fine-tune cycle is repeated for several rounds. Each cycle reinforces consensus features and suppresses teacher-specific idiosyncrasies or noise, yielding a student model aligned to the reasoning span covered by all sources.
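The merge-then-fine-tune cycle can be sketched in a few lines. In this toy sketch, branch fine-tuning is replaced by a simple update that pulls the weights toward a hypothetical per-teacher optimum (real training minimizes the token-level SFT loss on that teacher's CoT data), so only the alternation structure is faithful to MoT:

```python
def sft_branch(theta, teacher_opt, lr=0.5):
    # Stand-in for teacher-specific SFT: move the branch weights toward a
    # hypothetical teacher optimum (a placeholder for gradient descent on
    # the branch's CoT distillation loss).
    return [w + lr * (t - w) for w, t in zip(theta, teacher_opt)]

def merge(branches):
    # Weight-space merging: uniform average of the K branch weight vectors.
    k = len(branches)
    return [sum(ws) / k for ws in zip(*branches)]

def mot_round(theta, teacher_opts):
    # One MoT cycle: clone the student into K branches, run branch-wise
    # fine-tuning, then merge the resulting weights.
    return merge([sft_branch(list(theta), opt) for opt in teacher_opts])

theta = [0.0, 0.0, 0.0]                     # toy 3-parameter student
teachers = [[1.0, 0.0, 0.0],                # K = 3 teachers with
            [0.0, 1.0, 0.0],                # differing "optima"
            [0.0, 0.0, 1.0]]
for _ in range(5):                          # T merge-then-fine-tune rounds
    theta = mot_round(theta, teachers)
# theta contracts toward the consensus point [1/3, 1/3, 1/3]
```

Each round halves the distance to the teachers' centroid in this toy setting, illustrating how repeated cycles reinforce consensus features while averaging out branch-specific deviations.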

2. Multi-Teacher Distillation and Teacher Selection

Empirical analysis demonstrates the inadequacy of manual, fixed teacher selection: for a fixed student architecture, the “best” teacher differs across datasets, tasks, and even checkpoints, so no single choice dominates. MoT responds by:

  • Aggregating CoT traces and final-answer-aligned references from diverse teachers (e.g., QwQ, Qwen3-32B, DeepSeek-R1).
  • Constructing teacher-specific distillation sets with optional answer-matching to ensure rationale validity.
  • Training branches with a guarantee that each branch is optimized for its teacher’s “style,” before fusing updates into a consensus parameter set.

This explicit alternation circumvents brittleness and enables exploitation of complementary reasoning patterns. This structure allows a student to generalize beyond what is possible with any individual teacher or a naive union of all distillation corpora.
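A minimal sketch of constructing one teacher's distillation set with the optional answer-matching filter described above. The `teacher_generate` and `extract_answer` hooks and the `#### <answer>` trace format are illustrative assumptions, not details from the paper:

```python
def build_distill_set(problems, teacher_generate, extract_answer):
    # Keep only (question, rationale, answer) triples whose final answer
    # matches the reference, so each branch trains on validated rationales.
    kept = []
    for x, y_ref in problems:
        rationale = teacher_generate(x)
        if extract_answer(rationale) == y_ref:
            kept.append((x, rationale, y_ref))
    return kept

# Toy usage with a stub teacher whose traces end in "#### <answer>".
problems = [("2+2", "4"), ("3*3", "9")]
teacher = lambda x: f"compute {x} step by step #### {eval(x)}"
extract = lambda r: r.rsplit("####", 1)[-1].strip()
data = build_distill_set(problems, teacher, extract)
```

A teacher whose final answers disagree with the references contributes nothing to its branch's set, which is how answer-matching guards rationale validity before branch training begins.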

3. Superiority Over Baseline Distillation Strategies

MoT consistently outperforms both single-teacher and naive multi-teacher (“MTD”) distillation strategies:

| Distillation Strategy | Early Training Loss | Final Benchmark Score | Overfitting Risk | Generalization |
| --- | --- | --- | --- | --- |
| Single-Teacher (STD) | Lower | Lower ceiling | Higher | Narrow |
| Naive Multi-Teacher (MTD) | Variable | Prone to noise | High | Unstable |
| Merge-of-Thought (MoT) | Moderate | Highest | Lower | Robust |
  • Although STD may achieve lower token-level loss early, MoT attains higher final AIME math scores, raises the achievable “performance ceiling,” mitigates overfitting, and is less sensitive to teacher choice or data idiosyncrasy.
  • MoT avoids quality dilution and conflicting supervision endemic to naive aggregations by enforcing teacher-specific specialization followed by consensus merge.

4. Performance, Robustness, and Transferability

MoT’s efficacy is empirically validated on competition mathematics benchmarks:

  • With only 200 high-quality CoT samples, a Qwen3-14B student trained via MoT surpasses DeepSeek-R1, Qwen3-30B-A3B, Qwen3-32B, and OpenAI-o1.
  • AVG(AIME24, AIME25) is improved by +3.54–4.86 points over strong baselines, a substantial gain in high-difficulty domains.
  • MoT generalizes: it is robust to teachers with distributional shift (e.g., including DeepSeek-R1, which alone would harm performance), functions effectively even when teachers are peer-level rather than strictly stronger, and lifts performance on out-of-mathematics reasoning tasks such as SimpleQA, MMLU, PhyBench, and LiveCodeBench.
  • The consensus student model not only achieves superior benchmark metrics but, when used as a teacher in follow-on distillation, produces a more effective “student-as-teacher” signal than any of its original sources.

5. Mitigating Catastrophic Forgetting and Raising the Reasoning Ceiling

A persistent issue in CoT distillation is catastrophic forgetting, where focused fine-tuning on reasoning-rich tasks diminishes capabilities on factual recall or lower-level linguistic knowledge. MoT mitigates this by:

  • Blending the “reasoning features” from multiple teachers, which has a regularizing effect.
  • Empirically, MoT incurs smaller performance drops on catastrophic-forgetting–sensitive benchmarks (relative to the best single-teacher baseline) while achieving significant gains on pure reasoning tasks.

Furthermore, through consensus merging, MoT smooths the internal representation of “thoughts,” as observed in reverse-trajectory merge probes, indicating a more robust and transferable reasoning manifold.

6. Key Technical Mechanisms and Mathematical Formulations

The MoT distillation process is mathematically characterized as:

  • Branch SFT per teacher: $\mathcal{L}_{\mathrm{SFT}}^{(k)}(\theta)$ as above.
  • Weight averaging: $\theta^{(t)} = \frac{1}{K} \sum_{k=1}^{K} \theta^{(t,k)}$.
  • Iterative cycling (for $T$ rounds) to reinforce consensus.
  • Optional filtering of training trajectories based on final-answer coherence.

No additional architectural modifications are required; all merging is performed at the parameter level.
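Because merging happens purely at the parameter level, it amounts to a per-parameter average over branch checkpoints of the same architecture. This sketch uses plain Python lists keyed by parameter name in place of real tensors (in practice one would average framework state dicts):

```python
def merge_state_dicts(state_dicts):
    # Average each named parameter across K branch checkpoints that share
    # one architecture; no architectural modification is needed.
    k = len(state_dicts)
    return {
        name: [sum(vals) / k for vals in zip(*(sd[name] for sd in state_dicts))]
        for name in state_dicts[0]
    }

# Two toy branch checkpoints with identical parameter names and shapes.
branch_a = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
branch_b = {"layer.weight": [3.0, 4.0], "layer.bias": [2.0]}
merged = merge_state_dicts([branch_a, branch_b])
# merged["layer.weight"] == [2.0, 3.0]; merged["layer.bias"] == [1.0]
```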

7. Implications and Future Research

MoT’s paradigm demonstrates that cyclical teacher-specific training followed by weight-space fusion not only consolidates reasoning strengths and raises the performance ceiling, but also yields features with broader transfer and regularization across unrelated reasoning domains.

A promising implication is that recursively using MoT-distilled students as next-generation teachers may further propagate robust CoT features, approaching a self-reinforcing cycle of consensus distillation. Additionally, MoT’s effectiveness with only ~200 high-quality CoT samples suggests high sample efficiency for future work in limited-data or domain-specialized settings.

Structural similarities between MoT’s parameter-space merging and other “branch-merge” (Sun et al., 6 Mar 2025), curriculum, and multi-modality distillation frameworks suggest potential directions for scaling to more complex reasoning settings, richer structures (e.g., mixture- or matrix-of-thought approaches), and applications beyond mathematics.

MoT’s consensus-based weight-fusion strategy conceptually relates to parameter merging in “Branch-Merge Distillation” (Sun et al., 6 Mar 2025), guidance in metastable chain-of-thought search (Kim et al., 2 Feb 2025), and reasoning-awareness in distillation across modalities (Zheng et al., 21 May 2025, Li et al., 2023). The underlying methodology is tightly anchored in empirical findings that diverse supervision and consensus merging yield superior and more robust reasoning models than single-source or naively aggregated approaches.
