Multi-teacher Domain-Routed OPD

Updated 30 June 2026

The paper extends standard OPD by integrating multiple domain-specific RL teachers that guide a student policy with dense, token-level supervision.
It employs a rigorous domain routing mechanism that assigns each student trajectory to the appropriate teacher, thereby reducing cross-domain interference.
Empirical results show that this approach improves performance in math, code, and image tasks, outperforming individual teacher baselines.

Multi-teacher Domain-routed On-Policy Distillation (OPD) is a paradigm for integrating heterogeneous, domain-specialized expertise into a single student policy using on-policy trajectories and dense, token- or step-level teacher supervision. This class of algorithms extends standard On-Policy Distillation with multiple domain-specific RL teachers and task/domain routing mechanisms, so that the student inherits (and in some cases surpasses) the domain capabilities of its expert teachers while mitigating negative interference and exposure bias.

1. Theoretical Foundations of Generalized On-Policy Distillation

Generalized On-Policy Distillation (G-OPD) formalizes OPD as a dense KL-constrained RL problem. The standard OPD objective for student parameters $\theta$ is

$J_{\mathrm{OPD}}(\theta) = \mathbb{E}_{x \sim D,\, y \sim \pi_\theta(\cdot|x)} \Bigl[- KL\bigl(\pi_\theta(y|x)\,\|\,\pi^*(y|x)\bigr) \Bigr]$

where $\pi^*$ is a fixed (“teacher”) policy and sampling occurs over the student’s own trajectories. Yang et al. showed that this is equivalent to maximizing a dense reward

$r(x, y) = \log \frac{\pi^*(y|x)}{\pi_{\textrm{ref}}(y|x)}$

with a KL penalty to an arbitrary reference policy $\pi_{\textrm{ref}}$, where reward and penalty are always weighted equally ($\beta = 1$):

$J_{\mathrm{OPD}}(\theta) = \mathbb{E}_{x, y \sim \pi_\theta} \left[ r(x, y) - KL(\pi_\theta(\cdot|x) \,\|\, \pi_{\textrm{ref}}(\cdot|x)) \right]$

G-OPD generalizes OPD by introducing a reward scaling factor $\lambda \equiv 1/\beta$:

$J_{\mathrm{G\text{-}OPD}}(\theta) = \mathbb{E}_{x, y} \left[ \lambda \cdot \log \frac{\pi^*(y|x)}{\pi_{\textrm{ref}}(y|x)} - KL(\pi_\theta(\cdot|x)\,\|\,\pi_{\textrm{ref}}(\cdot|x)) \right]$

When $\lambda = 1$ and $\pi_{\textrm{ref}}$ is the student’s initialization, standard OPD is recovered; $\lambda > 1$ enables reward extrapolation, allowing the student to surpass the teacher in target domains (Yang et al., 12 Feb 2026).
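The equivalence between the KL form and the reward form can be checked in two lines. The derivation below is a sketch that uses only the symbols already defined in this section; the closed-form maximizer in the second display is the standard result for KL-regularized objectives and assumes an unconstrained policy class, which the source does not state explicitly.

```latex
\begin{align*}
\mathbb{E}_{y \sim \pi_\theta}\bigl[ r(x,y) \bigr] - \mathrm{KL}\bigl(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\bigr)
  &= \mathbb{E}_{y \sim \pi_\theta}\!\left[ \log\frac{\pi^*(y|x)}{\pi_{\mathrm{ref}}(y|x)}
     - \log\frac{\pi_\theta(y|x)}{\pi_{\mathrm{ref}}(y|x)} \right] \\
  &= \mathbb{E}_{y \sim \pi_\theta}\!\left[ \log\frac{\pi^*(y|x)}{\pi_\theta(y|x)} \right]
   = -\,\mathrm{KL}\bigl(\pi_\theta \,\|\, \pi^*\bigr).
\end{align*}

% Maximizer of the G-OPD objective over all policies (standard KL-regularized RL result):
\begin{equation*}
\pi_\lambda(y|x) \;\propto\; \pi_{\mathrm{ref}}(y|x)
   \left( \frac{\pi^*(y|x)}{\pi_{\mathrm{ref}}(y|x)} \right)^{\lambda}
   = \pi_{\mathrm{ref}}(y|x)^{\,1-\lambda}\, \pi^*(y|x)^{\lambda}.
\end{equation*}
```

The reference policy cancels in the first display, which is why any $\pi_{\textrm{ref}}$ gives the same objective at $\lambda = 1$. In the second display, $\lambda = 1$ returns the teacher itself, while $\lambda > 1$ pushes the maximizer further along the direction from the reference to the teacher.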

2. Multi-Teacher Domain-Routed Distillation Algorithms

Multi-teacher, domain-routed OPD frameworks extend G-OPD by training multiple independent domain teachers from a shared base, then orchestrating their integration through a strict domain-routing regime.

Domain-specialized RL teachers: Each RL teacher $\pi^*_k$ is fine-tuned from a base policy $\pi_{\textrm{base}}$ using domain-specific data $D_k$ and reward structures optimized for its target objective (e.g., math, code, or image caption quality). Teacher policies are fixed (“frozen”) after RL fine-tuning.
Prompt/task routing: During student distillation, each datapoint or trajectory is tagged with its domain. At each training step, rollouts from the student are compared to the logits or action distributions of the corresponding domain teacher only.
Distillation loss: The typical loss for token-level integration can be formulated as a reverse KL divergence or a reward-weighted policy gradient, e.g.:

$L_{\mathrm{MOPD}}(\theta) = \mathbb{E}_{x \sim D,\, y \sim \pi_\theta(\cdot|x)} \Bigl[ \sum_{t} KL\bigl(\pi_\theta(\cdot|x, y_{<t})\,\|\,\pi^*_{d(x)}(\cdot|x, y_{<t})\bigr) \Bigr]$

where $\pi^*_{d(x)}$, the teacher for the domain $d(x)$ of prompt $x$, is routed per-prompt (Ma et al., 29 Jun 2026).

Practical orchestration: Student on-policy rollouts are routed to domain-appropriate teacher(s). Losses are accumulated only for the teacher matched by the router, mitigating cross-domain interference.
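The routing and loss accumulation described above can be illustrated with a minimal sketch. It assumes toy categorical distributions over a three-token vocabulary; the trajectory schema (`domain`, `contexts`, `student_probs`) and the teacher callables are hypothetical names chosen for illustration, not an interface from the cited papers.

```python
import math

def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher) for two categorical distributions (teacher needs full support)."""
    return sum(
        p * math.log(p / q)
        for p, q in zip(student_probs, teacher_probs)
        if p > 0.0
    )

def routed_distillation_loss(batch, teachers):
    """Mean per-token reverse KL; each trajectory is scored only by its own domain teacher.

    batch:    list of dicts with hypothetical keys "domain", "contexts", "student_probs"
    teachers: dict mapping a domain label to a callable(context) -> teacher distribution
    """
    total, n_tokens = 0.0, 0
    for traj in batch:
        teacher = teachers[traj["domain"]]  # routing: exactly one teacher per trajectory
        for ctx, p_student in zip(traj["contexts"], traj["student_probs"]):
            total += reverse_kl(p_student, teacher(ctx))
            n_tokens += 1
    return total / max(n_tokens, 1)

# Toy frozen teachers that ignore the context (assumption made for brevity).
teachers = {
    "math": lambda ctx: [0.7, 0.2, 0.1],
    "code": lambda ctx: [0.1, 0.2, 0.7],
}
batch = [
    {"domain": "math", "contexts": ["m0"], "student_probs": [[0.7, 0.2, 0.1]]},
    {"domain": "code", "contexts": ["c0"], "student_probs": [[0.7, 0.2, 0.1]]},
]
loss = routed_distillation_loss(batch, teachers)
```

In this toy batch the math trajectory already matches its teacher and contributes zero, so the whole loss comes from the code trajectory. The math teacher never sees code rollouts, which is the interference-avoidance property the routing is meant to provide.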

3. Domain Routing, Task Gating, and Debate Variants

Domain routing in multi-teacher OPD employs explicit mapping of prompts to their origin domain, ensuring that each teacher only supervises in-domain trajectories. Simple label-based task routers suffice for most LLM and flow models (Yang et al., 12 Feb 2026, Fang et al., 8 May 2026, Ma et al., 29 Jun 2026). In more advanced settings such as MAD-OPD, domain routing is extended with collective supervision and task-adaptive divergence selection:

MAD-OPD: Multiple teachers conduct $R$ rounds of “debate” per student rollout, producing a debate transcript $\mathcal{T}$. Confidence weights $w_k$ (computed from teacher self-reported confidence via softmax) produce a collective distribution:

$\bar{\pi}(\cdot|x, y_{<t}) = \sum_{k} w_k\, \pi^*_k(\cdot|x, y_{<t}, \mathcal{T})$

(Wang et al., 2 May 2026).

Task-adaptive divergence: The distillation divergence is selected per domain. Jensen-Shannon divergence (JSD) is used for multi-step agentic tasks for stability, while the mode-seeking reverse KL is preferred for open-ended code/text generation. This mitigates the risk of policy collapse and exposure bias in settings with heterogeneous task types.
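The two ideas above can be sketched together under explicit assumptions. The sketch assumes that the collective distribution is a confidence-weighted mixture of the teachers' next-token distributions, which the description suggests but does not spell out. The task labels in `DIVERGENCE_BY_TASK` and all function names are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def collective_distribution(teacher_dists, confidences):
    """Mixture of teacher distributions, weighted by softmax of self-reported confidences."""
    weights = softmax(confidences)
    vocab = len(teacher_dists[0])
    return [sum(w * d[i] for w, d in zip(weights, teacher_dists)) for i in range(vocab)]

def kl(p, q):
    """KL(p || q); q must be positive wherever p is positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def jsd(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by log(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical task labels: JSD for agentic tasks, reverse KL (student first) otherwise.
DIVERGENCE_BY_TASK = {"agentic": jsd, "code": kl, "text": kl}

def distill_divergence(task, student, teacher):
    return DIVERGENCE_BY_TASK[task](student, teacher)

# Two teachers with equal confidence yield an even mixture.
mix = collective_distribution([[0.8, 0.2], [0.2, 0.8]], [0.0, 0.0])
```

Because JSD is bounded by log(2), a single badly mismatched step cannot produce an arbitrarily large loss. That is one plausible reading of why it is described as the more stable choice for multi-step agentic tasks.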

4. Workflow and Algorithmic Implementation

A unified workflow for multi-teacher, domain-routed OPD is:

Teacher Preparation: Independently fine-tune a set of domain-specific RL teacher policies from a shared initialization using specialized RL recipes (GRPO, PPO, etc.).
Student Initialization: Initialize the student from the pre-trained base, SFT, or merged teacher checkpoints (flow-based merging is used in generative models (Fang et al., 8 May 2026)).
On-policy rollouts: The student performs rollouts on prompts $x$ sampled from a multi-domain pool, producing student trajectories $y \sim \pi_\theta(\cdot|x)$.
Task/Domain Routing: For each trajectory, route supervision to the appropriate teacher (by simple label, learned gate, or collective debate).
Distillation Update: At each timestep, compute the per-token or per-action distillation objective (reverse KL, JSD, or other), and aggregate for a gradient step. Reward scaling ($\lambda$) and flexible reference models are optionally included (Yang et al., 12 Feb 2026).
Specialized correction: In “strong-to-weak” distillation (teacher significantly larger than student), reward correction by setting the reference as the teacher’s pre-RL base further refines the reward signal and accuracy (Yang et al., 12 Feb 2026).
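Reward scaling and the choice of reference model in the last two steps can be made concrete with a small sketch. It evaluates the integrand of the G-OPD objective for one token sampled from the student. The function name and argument names are hypothetical, and how the signal is turned into a gradient (for example, as a policy-gradient advantage) is left open.

```python
def gopd_token_signal(logp_student, logp_teacher, logp_ref, lam=1.0):
    """Per-token integrand of the G-OPD objective for a token sampled from the student:

        lam * (log pi*(y_t) - log pi_ref(y_t)) - (log pi_theta(y_t) - log pi_ref(y_t))

    For reward correction in strong-to-weak distillation, logp_ref would be the
    log-probability under the teacher's pre-RL base model (assumption based on the text).
    """
    reward = lam * (logp_teacher - logp_ref)   # scaled dense reward
    kl_term = logp_student - logp_ref          # single-sample estimate of the KL penalty
    return reward - kl_term

# With lam = 1 the reference cancels and the signal is log pi* - log pi_theta.
same_a = gopd_token_signal(-1.0, -0.5, logp_ref=-2.0, lam=1.0)
same_b = gopd_token_signal(-1.0, -0.5, logp_ref=-3.0, lam=1.0)
# With lam > 1 the reference matters and the teacher-vs-reference gap is amplified.
extrapolated = gopd_token_signal(-1.0, -0.5, logp_ref=-2.0, lam=1.25)
```

The reference model only influences training once $\lambda \neq 1$, which is why its choice becomes a design decision in the extrapolation and strong-to-weak settings.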

A canonical pseudocode sketch for domain-routed OPD is given in (Yang et al., 12 Feb 2026).
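As an illustration of the workflow steps listed above, the following toy loop trains a tabular student against two frozen teachers. It is not the pseudocode from the cited paper. The teachers are fixed three-token distributions, the student is a table of logits indexed by (prompt, previous token), and routing is by label; all names and hyperparameters are assumptions made for the example.

```python
import math
import random

VOCAB = 3
TEACHERS = {                      # frozen toy teachers: one fixed distribution per domain
    "math": [0.7, 0.2, 0.1],
    "code": [0.1, 0.2, 0.7],
}
PROMPTS = [("p_math", "math"), ("p_code", "code")]   # (prompt id, domain label)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def reverse_kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def train(steps=400, horizon=2, lr=0.5, seed=0):
    rng = random.Random(seed)
    student = {}                                  # tabular student: state -> logits
    for step in range(steps):
        prompt, domain = PROMPTS[step % len(PROMPTS)]   # multi-domain prompt pool
        teacher = TEACHERS[domain]                      # domain routing by label
        prev = None
        for _ in range(horizon):                        # on-policy rollout
            state = (prompt, prev)
            logits = student.setdefault(state, [0.0] * VOCAB)
            p = softmax(logits)
            kl = reverse_kl(p, teacher)
            # Exact gradient of KL(softmax(z) || q) w.r.t. z_i is p_i * (log(p_i / q_i) - KL).
            for i in range(VOCAB):
                logits[i] -= lr * p[i] * (math.log(p[i] / teacher[i]) - kl)
            # The next state is sampled from the student, not taken from teacher data.
            prev = rng.choices(range(VOCAB), weights=p)[0]
    return student

student = train()
```

Only states the student actually visits receive supervision, which is the on-policy property. The initial state of each prompt is visited on every rollout, so the student converges there to the routed teacher's distribution.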

5. Empirical Results and Benchmarks

Extensive experimental evaluation demonstrates:

Single-teacher regime (ExOPD, $\lambda > 1$): For math and code domains, a Qwen3-4B student distilled from RL-Math/Code teachers peaks at an extrapolating $\lambda$; e.g., on AIME24, ExOPD(1.25) is reported to exceed both OPD ($\lambda = 1$) and the teacher in accuracy (Yang et al., 12 Feb 2026).
Multi-teacher aggregation: When merging distinct domain RL experts, ExOPD and MOPD approaches produce a single student that consistently surpasses all domain teachers; e.g., on Qwen3-30B-A3B, MOPD is reported to attain a higher normalized capability score than Mix-RL, Cascade RL, and Param-Merge (Ma et al., 29 Jun 2026).
Robustness and transfer: Reward correction in strong-to-weak settings provides an additional accuracy gain (Yang et al., 12 Feb 2026).
Stability requirements: For both MOPD and Flow-OPD, teacher–student initialization overlap (“same-origin”) and a careful choice of divergence type are crucial for convergence and entropy preservation (Ma et al., 29 Jun 2026, Fang et al., 8 May 2026).
Image generation: Flow-OPD improves both GenEval and OCR-Acc on SD 3.5 M, demonstrating scalability and fidelity preservation in non-autoregressive models (Fang et al., 8 May 2026).

6. Comparative Analysis and Methodological Variants

A summary of methods for multi-domain expert integration:

Method	Dense token-level supervision	On-policy	Parallel teacher RL	Exposure bias
Param-Merge	✗	n/a	✓	n/a
Off-Policy Finetune	✓	✗	✓	severe
Mix-RL	✗	✓	✗	none
Cascade RL	✗	✓	✗	none
MOPD (multi-teacher OPD)	✓	✓	✓	none

Only the multi-teacher OPD family offers dense token-level feedback, exposure-bias-free on-policy rollouts, and complete parallelizability of expert teacher construction (Ma et al., 29 Jun 2026).

7. Extensions and Domain-Adapted Innovations

Recent innovations include:

Flow-OPD adapts multi-teacher OPD to Flow Matching text-to-image models. It introduces a task-routing labeling scheme and an auxiliary “manifold anchor regularization” (MAR) penalty using a task-agnostic teacher to safeguard generative quality (Fang et al., 8 May 2026).
MAD-OPD employs a multi-agent debate to form a collective teacher distribution, which addresses the single-teacher capability ceiling. It remains stable even in agentic RL environments by selecting the divergence type per domain (Wang et al., 2 May 2026).
Iterative capability accumulation: Multi-round OPD/MOPD cycles (“Iter-2 teachers”) further enhance student performance without catastrophic forgetting (Ma et al., 29 Jun 2026).

In summary, Multi-teacher Domain-routed On-Policy Distillation unifies dense, on-policy student learning from multiple, domain-specific RL experts with rigorous domain-routing and reward-weighting mechanisms, yielding scalable capability integration across LLMs and diffusion models and supporting teacher-surpassing generalization (Yang et al., 12 Feb 2026, Ma et al., 29 Jun 2026, Wang et al., 2 May 2026, Fang et al., 8 May 2026).