Multi-Teacher OPD Distillation

Updated 2 May 2026

Multi-Teacher OPD is a knowledge distillation framework where a student learns from multiple teacher models simultaneously, ensuring diverse and robust learning signals.
It employs adaptive aggregation methods such as weighted ensembling, decision-attention, and router-based selection to compute target distributions and rewards for effective training.
Empirical studies show that MOPD improves performance across modalities—achieving gains in language, vision, and RL tasks while supporting privacy-preserving and efficient training.

Multi-Teacher Output Probability Distillation (MOPD), also called multi-teacher OPD, refers to a family of knowledge distillation techniques wherein a student model (or learner) absorbs information simultaneously from multiple teacher models or sources. Rather than relying on a single teacher—which can lead to limited or biased student performance—MOPD approaches integrate diverse knowledge streams, leveraging several experts, prompts, distributions, or peer policies. These frameworks have demonstrated consistent improvements across language, vision, and reinforcement learning tasks by carefully specifying how teacher outputs are combined, how rewards or losses are constructed, and how assignment or aggregation mechanisms select optimal teachers per example, task, or environment.

1. Theoretical Foundations and General Formulation

MOPD extends standard output probability distillation, in which the student seeks to imitate a single teacher’s output probabilities (soft labels), to the case of multiple teachers. In a basic setting, given $K$ teacher models $\{\pi^{*(1)}, \ldots, \pi^{*(K)}\}$, one constructs target distributions, reward functions, or aggregate loss terms that synthesize their contributions. MOPD can be seen as a special case of dense KL-constrained RL or supervised fine-tuning where teacher outputs are ensembled or individually weighted in guiding the student’s learning signal (Yang et al., 12 Feb 2026, Wu et al., 2021).

Let $\pi_\theta$ denote the student, $\pi^{*(k)}$ the $k$-th teacher, and $w_k$ the teacher weights (typically summing to 1). Target probability distributions may be combined as

$p_\text{target}(y|x) = \sum_{k=1}^K w_k\,\pi^{*(k)}(y|x)$

and reward functions for sequence-level or trajectory-based learning may similarly be aggregated: $r(\tau) = \sum_{k=1}^K w_k \sum_{t=1}^T \left[\log \pi^{*(k)}(y_t|x, y_{<t}) - \log \pi_\text{ref}(y_t|x, y_{<t})\right]$ with $\pi_\text{ref}$ a reference policy, e.g., a base model or a fixed mixture (Yang et al., 12 Feb 2026).
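The two aggregation formulas above can be illustrated with a minimal Python sketch. It assumes all teachers share one vocabulary and that the sum over time steps in the reward applies to both log terms. The function names and toy numbers are illustrative only and do not come from the cited works.

```python
import math


def mixture_target(teacher_probs, weights):
    """p_target(y|x) = sum_k w_k * pi_k(y|x) over a shared vocabulary."""
    vocab_size = len(teacher_probs[0])
    return [
        sum(w * probs[y] for w, probs in zip(weights, teacher_probs))
        for y in range(vocab_size)
    ]


def trajectory_reward(teacher_logps, ref_logps, weights):
    """r(tau) = sum_k w_k * sum_t [log pi_k(y_t|.) - log pi_ref(y_t|.)]."""
    return sum(
        w * sum(lp - lr for lp, lr in zip(logps, ref_logps))
        for w, logps in zip(weights, teacher_logps)
    )


# Toy setup (assumed values): K = 2 teachers, 3 tokens, a 2-token trajectory.
weights = [0.75, 0.25]
teacher_probs = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25]]
p_target = mixture_target(teacher_probs, weights)

# Per-token log-probabilities of the sampled tokens under each teacher
# and under the reference policy.
teacher_logps = [
    [math.log(0.5), math.log(0.5)],    # teacher 1
    [math.log(0.25), math.log(0.25)],  # teacher 2 (agrees with the reference)
]
ref_logps = [math.log(0.25), math.log(0.25)]
reward = trajectory_reward(teacher_logps, ref_logps, weights)
```

In this toy case the second teacher matches the reference on every token, so only the first teacher contributes to the reward.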

Therefore, the student receives dense distillation feedback not just from a single teacher distribution but (often adaptively) from a spectrum of expert models. Variants may involve soft ensembling, reward extrapolation, personalized assignment, privacy-preserving decentralized optimization, or mutual online peer distillation.

2. Key MOPD Instantiations and Methodologies

MOPD encompasses diverse algorithmic implementations tailored to different modalities and settings:

(a) Generalized On-Policy Distillation with Reward Extrapolation

The G-OPD framework introduces a reward scaling factor $\alpha>0$, controlling the weight of aggregate multi-teacher rewards relative to KL regularization. For $\alpha>1$ (ExOPD), the reward signal is extrapolated, enabling the student to exceed all individual domain teacher performances. The reference policy is formed as a mixture of base student models, and trajectory-level rewards combine per-teacher log-probability shifts with weights $w_k$. The final loss is the sum of scaled reward and dense KL, and optimization proceeds via gradient descent over student parameters (Yang et al., 12 Feb 2026).
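To make the role of the scaling factor concrete, the following sketch works on a single categorical step. It assumes the objective takes the standard KL-regularized form, maximizing $\mathbb{E}_\pi[\alpha\, r] - \mathrm{KL}(\pi \,\|\, \pi_\text{ref})$, whose maximizer is $\pi \propto \pi_\text{ref} \exp(\alpha r)$. Whether this matches the exact loss of the cited work is an assumption. The function name and toy distributions are hypothetical.

```python
import math


def extrapolated_policy(teacher_probs, ref_probs, weights, alpha):
    """Closed-form maximizer of E_pi[alpha * r] - KL(pi || pi_ref) on one step,
    with r(y) = sum_k w_k * (log pi_k(y) - log pi_ref(y))."""
    logits = []
    for y, ref in enumerate(ref_probs):
        r = sum(
            w * (math.log(probs[y]) - math.log(ref))
            for w, probs in zip(weights, teacher_probs)
        )
        logits.append(math.log(ref) + alpha * r)
    # Normalize with a max-shifted softmax for numerical stability.
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy distributions (assumed values) over three tokens.
ref = [0.5, 0.25, 0.25]
teacher = [0.7, 0.2, 0.1]

standard = extrapolated_policy([teacher], ref, [1.0], alpha=1.0)
extrapolated = extrapolated_policy([teacher], ref, [1.0], alpha=1.5)
```

With a single teacher and $\alpha = 1$ the optimum coincides with the teacher. With $\alpha > 1$ the optimum moves further along the direction from the reference to the teacher, which is one way to read "extrapolation" in this setting.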

(b) Online Policy Distillation with Decision-Attention

In MOPD-DA, an ensemble of policies simultaneously learns via RL and serves as teachers for one another. Each policy forms its group-derived distillation targets using a decision-attention mechanism—a scaled dot-product attention over peer outputs—yielding distinct, state-dependent weighting for each peer. Losses include a standard RL loss, a KL-based decision (output) loss, and a mean-squared error feature loss. This bidirectional, online, mutual distillation lifts all ensemble members beyond independent training baselines by facilitating complementary knowledge transfer while avoiding homogenization (Yu et al., 2024).
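The attention-based target construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the query and key vectors are hypothetical stand-ins for learned projections of each peer's output, each peer is assumed to attend over all peers including itself, and the direction of the KL term is an assumption.

```python
import math


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def decision_attention_targets(peer_probs, queries, keys):
    """For each peer i, weight all peers' action distributions by
    softmax_j(q_i . k_j / sqrt(d)) and return (targets, attention rows)."""
    d = len(queries[0])
    n_actions = len(peer_probs[0])
    targets, attention = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        row = softmax(scores)
        attention.append(row)
        targets.append([
            sum(row[j] * peer_probs[j][a] for j in range(len(peer_probs)))
            for a in range(n_actions)
        ])
    return targets, attention


def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy state (assumed values): three peers, three actions, 2-d queries and keys.
peer_probs = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7]]
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

targets, attention = decision_attention_targets(peer_probs, queries, keys)
# Decision (output) loss per peer: KL from its attended target to its own policy.
decision_losses = [kl_divergence(t, p) for t, p in zip(targets, peer_probs)]
```

Because each peer has its own query, every peer receives a different weighting over the group, which is the property the text credits with avoiding homogenization.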

(c) Routing-Based Multi-Teacher Distillation (PerSyn)

Here, a student model is distilled using synthetic data generated by a set of teacher models. For each input prompt, a lightweight router determines which teacher is optimal based on a reward that blends response quality and student learnability. Instead of generating all possible teacher responses and selecting post-hoc (as in "Generate then Select"), PerSyn’s "Route then Generate" paradigm uses the trained router to pick one teacher per prompt, substantially improving computational efficiency without sacrificing accuracy. The student is then supervised on the selected responses, achieving consistent gains over single-teacher and naive multi-teacher baselines (Zhang et al., 13 Oct 2025).
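The difference between the two paradigms can be sketched with a toy example that counts teacher generations. Everything here is hypothetical: the score tables, the `toy_reward` blend with weight `beta`, and the rule-based `toy_router` stand in for a trained router and real reward models. For simplicity, both paradigms use the same selection rule, so they yield the same training data and differ only in cost.

```python
# Hypothetical per-teacher scores; a real router would be trained to predict them.
QUALITY = {"small": 0.6, "medium": 0.8, "large": 0.9}
LEARNABILITY = {"small": 0.9, "medium": 0.8, "large": 0.4}
NAMES = ["small", "medium", "large"]


class CountingTeacher:
    """Stand-in teacher that records how many generations it performed."""

    def __init__(self, name):
        self.name = name
        self.calls = 0

    def generate(self, prompt):
        self.calls += 1
        return f"[{self.name}] response to: {prompt}"


def toy_reward(prompt, name, beta=0.5):
    # Assumption: weaker teachers lose quality on prompts marked "hard".
    penalty = 0.4 if "hard" in prompt and name != "large" else 0.0
    return beta * (QUALITY[name] - penalty) + (1.0 - beta) * LEARNABILITY[name]


def toy_router(prompt):
    """Return the index of the teacher with the highest blended reward."""
    return max(range(len(NAMES)), key=lambda i: toy_reward(prompt, NAMES[i]))


def generate_then_select(prompts, teachers):
    data = []
    for prompt in prompts:
        responses = [t.generate(prompt) for t in teachers]  # K generations
        data.append((prompt, responses[toy_router(prompt)]))
    return data


def route_then_generate(prompts, teachers):
    # One generation per prompt, from the routed teacher only.
    return [(p, teachers[toy_router(p)].generate(p)) for p in prompts]


prompts = ["easy sum", "easy product", "hard integral", "easy ratio"]
baseline_teachers = [CountingTeacher(n) for n in NAMES]
routed_teachers = [CountingTeacher(n) for n in NAMES]

baseline_data = generate_then_select(prompts, baseline_teachers)
routed_data = route_then_generate(prompts, routed_teachers)
baseline_calls = sum(t.calls for t in baseline_teachers)
routed_calls = sum(t.calls for t in routed_teachers)
```

With four prompts and three teachers, the baseline performs one generation per teacher per prompt, while routing performs one generation per prompt.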

(d) Decentralized and Privacy-Preserving MOPD

Collaborative machine teaching via consensus optimization addresses data privacy and communication efficiency. Here, $K$ distributed teachers each select informative local data subsets without sharing raw data, and coordinate with a learner through dual variables and low-dimensional aggregates. The process converges to a globally optimal teaching risk, often using adaptive Lasso-style sparsity and block-coordinate descent, and is applicable to convex learners (e.g., logistic regression) (Han et al., 2019).

(e) Mixture-of-Prompts Distillation for Vision-Language

MoPD leverages a set of hand-crafted (“hard”) teacher prompts in vision-language models (e.g., CLIP). A gating network predicts image-specific weights over teacher prompts, and a soft (student) prompt is trained via a combination of cross-entropy, KL-based distillation, and mixture-of-prompts selection losses. After optimization, the student soft prompt generalizes better to unseen classes and shots, as the gating ensures knowledge transfer from a diverse prompt pool (Chen et al., 2024).
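A minimal sketch of the gating step, assuming a top-k selection followed by softmax renormalization over the selected prompts. The gate logits would come from a gating network applied to the image features; here they are fixed toy values. The function names, the value of `k`, and the KL direction are illustrative assumptions.

```python
import math


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def topk_gate(gate_logits, k):
    """Keep the k largest gate logits, renormalize them with a softmax,
    and assign zero weight to every other teacher prompt."""
    order = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    selected = order[:k]
    probs = softmax([gate_logits[i] for i in selected])
    weights = [0.0] * len(gate_logits)
    for i, p in zip(selected, probs):
        weights[i] = p
    return weights


def mixture(prompt_probs, weights):
    n_classes = len(prompt_probs[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, prompt_probs))
        for c in range(n_classes)
    ]


def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy values (assumed): four teacher prompts, three classes, one image.
gate_logits = [2.0, 0.5, -1.0, 1.5]
prompt_probs = [
    [0.7, 0.2, 0.1],
    [0.4, 0.4, 0.2],
    [0.2, 0.3, 0.5],
    [0.6, 0.3, 0.1],
]
student_probs = [0.5, 0.3, 0.2]

gate_weights = topk_gate(gate_logits, k=2)
teacher_target = mixture(prompt_probs, gate_weights)
distill_loss = kl_divergence(teacher_target, student_probs)
```

Zeroing the unselected prompts means a distractor prompt with a low gate score contributes nothing to the distillation target for that image.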

(f) Multi-Teacher Output Probability Distillation in LLMs

MT-BERT serves as a language model compression framework leveraging co-finetuned teacher ensembles, a reliability-weighted output probability distillation loss, and a hidden-state alignment loss. Teachers’ outputs are first aligned via shared pooling and heads; the student is then trained with soft label and feature matching, with teacher reliabilities determined per-example (Wu et al., 2021).

3. Aggregation, Assignment, and Weighting Schemes

A critical component of MOPD frameworks is the specification of how teacher outputs are combined. Common strategies include:

Fixed-weight ensembling: Arithmetic averaging or weighted sum of teacher softmax (probability) outputs, with weights $w_k$ possibly reflecting domain balance or prior expertise.
Adaptive reliability weighting: Example-specific reliabilities are computed as functions of teacher accuracy or cross-entropy with gold labels, prioritizing more accurate or aligned teachers (Wu et al., 2021).
Attention-based aggregation: State-conditional attention (e.g., decision-attention) dynamically computes aggregation weights via dot-product mechanisms over peer outputs (Yu et al., 2024).
Router-based assignment: Query-level routers learn to dispatch each instance to its optimal teacher, optimizing a mixture of output quality and student learnability as a reward function (Zhang et al., 13 Oct 2025).
Prompt gating: A gating network predicts image-dependent weights over teacher prompts, selecting a sparse mixture for distillation (Chen et al., 2024).

The aggregation mechanism crucially determines the diversity/consensus tradeoff, cross-domain transfer potential, computational efficiency, and robustness to noisy or suboptimal teachers.
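As one concrete instance of adaptive reliability weighting, the sketch below weights each teacher on a single example by $1 / (1 + \mathrm{CE}_k)$, where $\mathrm{CE}_k$ is that teacher's cross-entropy with the gold label. This particular functional form is an assumption chosen for illustration; the text above only states that reliabilities are functions of teacher accuracy or cross-entropy. All names and numbers are hypothetical.

```python
import math


def cross_entropy(target, pred):
    """Cross-entropy of a predicted distribution against a soft target."""
    return -sum(t * math.log(p) for t, p in zip(target, pred) if t > 0)


def reliability_weights(teacher_probs, gold_index):
    """w_k = 1 / (1 + CE(gold, teacher_k)) for one example; a teacher that
    puts more mass on the gold label receives a larger weight."""
    return [1.0 / (1.0 - math.log(probs[gold_index])) for probs in teacher_probs]


def weighted_distill_loss(teacher_probs, student_probs, gold_index):
    """Sum over teachers of reliability-weighted soft-label cross-entropy."""
    weights = reliability_weights(teacher_probs, gold_index)
    return sum(
        w * cross_entropy(probs, student_probs)
        for w, probs in zip(weights, teacher_probs)
    )


# Toy example (assumed values): two teachers, three classes, gold class 0.
teacher_probs = [[0.8, 0.1, 0.1], [0.3, 0.4, 0.3]]
student_probs = [0.6, 0.2, 0.2]
gold_index = 0

weights = reliability_weights(teacher_probs, gold_index)
loss = weighted_distill_loss(teacher_probs, student_probs, gold_index)
```

Unlike fixed-weight ensembling, these weights change from example to example, so a teacher that is wrong on one input is downweighted there without being discarded elsewhere.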

4. Empirical Results, Ablations, and Comparative Performance

Multi-teacher OPD consistently outperforms single-teacher, naive ensemble, or independent training baselines across modalities:

Language and code reasoning: ExOPD (with $\alpha>1$) is the only method to produce a multi-domain student surpassing every domain expert on all math/code benchmarks; ablations show a U-shaped sensitivity to $\alpha$, with the best values lying in an intermediate range (Yang et al., 12 Feb 2026).
Instruction tuning and math synthesis: Router-guided distillation (PerSyn) achieves 2–8% absolute accuracy gains over prior single- and multi-teacher strategies, especially in matching the quality/learnability frontier (Zhang et al., 13 Oct 2025).
Vision-language prompt adaptation: MoPD yields a +0.90 harmonic mean improvement across 11 datasets, with robust performance even under noisy teacher prompt pools. Performance is sensitive to the number of selected prompts and to gating efficacy (Chen et al., 2024).
Reinforcement learning (Atari): Online multi-peer distillation with decision-attention yields consistent 40–50% improvements over independent RL baselines; attention-based aggregation is crucial, as uniform aggregation degrades results by up to 72% (Yu et al., 2024).
LLM compression: MT-BERT, integrating MOPD and hidden-state loss, surpasses DistilBERT, TinyBERT, and PKD by 1–5 accuracy points and further improves upon naive teacher ensembling (Wu et al., 2021).
Privacy-preserving federated teaching: Consensus-based MOPD achieves optimal teaching risk with a dramatically reduced trainset (3–7% of points needed) and communication cost, converging efficiently even on large real-world benchmarks (Han et al., 2019).

Ablations uniformly emphasize the loss of performance when removing adaptive aggregation (router/gating/attention), teacher weighting, or multi-expert alignment steps.

5. Practical Considerations and Hyperparameter Sensitivities

Successful MOPD design requires several critical choices:

Reference model selection: In RL-style MOPD, using the base model prior to RL as the reference enables more accurate reward signals, at the expense of higher compute overhead (Yang et al., 12 Feb 2026).
Reward scaling $\alpha$ and loss balance: Excessively large $\alpha$ destabilizes training; empirically, the best gains come from a bounded range of $\alpha$ for G-OPD and a loss weight in [1e-4, 1e-1] for MoPD (Yang et al., 12 Feb 2026, Chen et al., 2024).
Teacher pool curation: In routing and prompt-based schemes, most queries are best routed to intermediate-scale, in-family teachers, with only the hardest examples leveraging specialist teachers (Zhang et al., 13 Oct 2025).
Router/gating network training: The router's Hit@3 accuracy saturates with only 2.5K annotated pairs, with performance plateauing beyond this (Zhang et al., 13 Oct 2025).
Loss composition: Equal or slightly learnability-favoring weighting between output quality and learnability is most effective (Zhang et al., 13 Oct 2025). Gating network performance is robust to noisy or distractor prompts (Chen et al., 2024).
Computational efficiency: Route-then-generate reduces teacher generation passes per prompt from $K$ to $1$; decentralized consensus MOPD exchanges only low-dimensional aggregates per round (Zhang et al., 13 Oct 2025, Han et al., 2019).

6. Limitations and Future Directions

MOPD approaches remain subject to several limitations:

Task scope and learner assumptions: Some instantiations are confined to convex learners or require knowledge of an oracle target model (Han et al., 2019).
Reference and teacher access: Certain performance improvements require access to (potentially unavailable) pre-RL or pre-finetune model checkpoints (Yang et al., 12 Feb 2026).
Privacy and communication: While decentralized MOPD preserves privacy over raw data, formal differential privacy is not generally guaranteed (Han et al., 2019).
Noisiness of teacher pool: Performance can degrade if the aggregation mechanism is ineffective in downweighting uninformative or adversarial teachers (Chen et al., 2024).
Optimal assignment learning: Although router-based assignment matches “oracle” performance, no formal global convergence guarantees are provided; empirical sufficiency is established (Zhang et al., 13 Oct 2025).

Suggested directions include extending consensus MOPD to non-convex neural learners, integrating stronger privacy mechanisms, developing asynchronous update protocols, and exploring richer aggregation architectures (e.g., metaprompt learning or dynamic teacher selection).

7. Summary Table: MOPD Approaches and Domains

Paper / Method	Aggregation/Assignment	Application Domain(s)
(Yang et al., 12 Feb 2026) G-OPD/ExOPD	Weighted sum, reward extrap.	RL, math, code LMs
(Yu et al., 2024) OPD-DA	Attention over outputs	RL (Atari, policy distil)
(Zhang et al., 13 Oct 2025) PerSyn	Query-level router	LLMs, synthesis (math/instr)
(Han et al., 2019) Consensus MOPD	Decentralized sparse optim.	Convex learning, privacy
(Chen et al., 2024) MoPD	Image-gated prompt mixing	Vision-language models
(Wu et al., 2021) MT-BERT	Reliability-weighted logits	LLM distil.

In summary, MOPD frameworks enable robust and transferable student learning by systematically integrating information from multiple, diverse teachers, with aggregation and assignment mechanisms specified according to task modality, privacy, efficiency, and cross-teacher complementarities. The field continues to innovate on aggregation schemes, adaptive assignment, and privacy-aware large-scale deployment.