Importance-Weighted On-Policy Distillation

Updated 4 July 2026

IW-OPD is a method that assigns non-uniform weights to student-generated trajectories to reallocate teacher feedback based on factors like accumulated discrepancy and reward signals.
The technique uses weight sources such as teacher–student log-ratios, confidence scores, and entropy to stabilize updates and to counter the problems that arise in long-horizon rollouts.
Empirical studies show that adaptive weighting schemes improve performance over standard OPD by effectively mitigating the degradation of teacher supervision in extended sequences.

Importance-Weighted On-Policy Distillation (IW-OPD) denotes a family of on-policy distillation methods in which dense teacher-derived supervision on student-generated trajectories is modulated by non-uniform weights rather than being uniformly averaged. In the narrow sense used by “On the Position Bias of On-Policy Distillation,” IW-OPD is a specific correction for the degradation of teacher supervision quality along long student rollouts, where earlier tokens are upweighted and later, more drifted tokens are downweighted (Xie et al., 21 Jun 2026). In a broader 2026 usage, surveys describe IW-OPD not as a single loss family but as an allocation rule over student-induced states, with weights defined by accumulated teacher–student discrepancy, reward tilting, signed advantages, confidence, entropy, trajectory quality, or attribution signals (Song et al., 1 Apr 2026, Zhang, 22 Jun 2026).

1. Problem setting and motivation

On-Policy Distillation (OPD) trains a student policy on its own trajectories while querying a teacher on the visited prefixes. In the standard sequence-level reverse-KL form, one writes

$\mathcal{J}_{\mathrm{OPD}}(\theta) = -\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)} \left[ \sum_{t=1}^T \log \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_T(y_t \mid x, y_{<t})} \right],$

with token-level advantage

$A_t^{\mathrm{OPD}} = \log \pi_T(y_t \mid x, y_{<t}) - \log \pi_\theta(y_t \mid x, y_{<t}).$

This basic formulation already mitigates exposure bias by training on learner-visited states rather than static teacher-forced data (Xie et al., 21 Jun 2026, Song et al., 1 Apr 2026).
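The token-level advantage above is simple enough to compute by hand. The following is a minimal sketch, assuming the teacher and student log-probabilities of the sampled tokens are already available as plain lists; the function names and the numbers are hypothetical and chosen only for illustration.

```python
import math


def opd_token_advantages(teacher_logps, student_logps):
    """A_t = log pi_T(y_t | x, y_<t) - log pi_theta(y_t | x, y_<t) per sampled token."""
    if len(teacher_logps) != len(student_logps):
        raise ValueError("need one teacher and one student log-prob per sampled token")
    return [lt - ls for lt, ls in zip(teacher_logps, student_logps)]


def opd_objective_sample(teacher_logps, student_logps):
    """Single-rollout Monte Carlo estimate of J_OPD: the sum of token advantages."""
    return sum(opd_token_advantages(teacher_logps, student_logps))


# Hypothetical log-probabilities of three tokens sampled from the student.
teacher = [math.log(0.5), math.log(0.25), math.log(0.1)]
student = [math.log(0.5), math.log(0.5), math.log(0.4)]
adv = opd_token_advantages(teacher, student)
```

Tokens on which teacher and student agree contribute zero advantage, while tokens the student likes more than the teacher does contribute negative advantage.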

The motivation for importance weighting is that uniform token averaging assumes that all visited positions provide supervision of comparable quality. The position-bias analysis shows that this assumption fails systematically: as student rollouts become longer, prefixes deviate further from the teacher’s distribution, teacher supervision degrades at later positions, and “OPD using only the first 30% of tokens can perform comparably to using all tokens, whereas OPD using only the last 30% of tokens barely learns anything” (Xie et al., 21 Jun 2026). More generally, several later formulations argue that not all trajectories, tokens, or teachers are equally informative. Incorrect student-generated outputs may carry more useful correction signal than correct ones; high-entropy student positions may deserve more attention than routine tokens; and some privileged teachers or peers may be more reliable on a given state than others (Lin et al., 22 Jun 2026, Li et al., 1 Jun 2026, Yu et al., 29 Jun 2026).

This motivates a shift from uniform OPD to weighted OPD, in which learning signal is redistributed toward positions, trajectories, or supervision sources deemed more reliable or more action-relevant.

2. Mathematical formulations

A general formulation from the feedback-to-update view treats IW-OPD as OPD with explicit token- or step-level weights. In the direct-loss route,

$\mathcal{L}_{\mathrm{IW\text{-}direct}} = \mathbb{E}_{c_t \sim d_{\pi_\theta}} \left[ w_t \, D_{\Omega_t}\!\left(\pi_\theta(\cdot \mid c_t), q(\cdot \mid c_t)\right) \right],$

where $c_t$ is a student-induced prefix, $q$ is the teacher or self-teacher signal, $D_{\Omega_t}$ is a local divergence on a declared support $\Omega_t$, and $w_t$ is the importance weight. In the policy-gradient route,

$g_{\mathrm{IW\text{-}OPD}} = \mathbb{E}_{\tau \sim \pi_\theta} \Bigg[ \sum_{t=1}^T w_t A_t \nabla_{\theta}\log\pi_\theta(y_t\mid c_t) \Bigg].$

These two routes differ by differentiation path rather than by an essential difference in supervision source (Zhang, 22 Jun 2026).
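The relationship between the two routes can be checked numerically in a toy case. The sketch below assumes a softmax student over a three-token vocabulary, takes $D$ to be the full-vocabulary reverse KL, sets $A(y) = \log q(y) - \log \pi_\theta(y)$, and treats $w_t$ as a constant; `direct_grad` and `expected_pg` are illustrative names, not functions from the cited work.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def reverse_kl(p, q):
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)


def direct_grad(z, q, w):
    """Gradient of w * KL(softmax(z) || q) with respect to the logits z."""
    p = softmax(z)
    kl = reverse_kl(p, q)
    return [w * pj * ((math.log(pj) - math.log(qj)) - kl) for pj, qj in zip(p, q)]


def expected_pg(z, q, w):
    """E_{y ~ p}[ w * A(y) * d log p(y) / dz ] with A(y) = log q(y) - log p(y)."""
    p = softmax(z)
    n = len(z)
    g = [0.0] * n
    for y in range(n):
        a = math.log(q[y]) - math.log(p[y])
        for j in range(n):
            score = (1.0 if j == y else 0.0) - p[j]  # d log p(y) / d z_j
            g[j] += p[y] * w * a * score
    return g


logits = [0.2, -1.0, 0.7]     # hypothetical student logits
target = [0.6, 0.3, 0.1]      # hypothetical teacher distribution
d = direct_grad(logits, target, 0.5)
e = expected_pg(logits, target, 0.5)
```

Under these assumptions the expected policy-gradient ascent direction `e` is the negative of the direct-loss gradient `d`, so both routes push the student toward the same target.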

The narrow IW-OPD derivation in the position-bias paper starts from a local trust-region projection

$\min_{q} \; D_{\mathrm{KL}}(q \,\|\, \pi_T) \quad\text{s.t.}\quad D_{\mathrm{KL}}(q \,\|\, \pi_\theta) \le \rho,$

whose optimum is the geometric interpolation between student and teacher,

$q^\star(y_t \mid x, y_{<t}) \propto \pi_\theta(y_t \mid x, y_{<t})^{1-\lambda}\, \pi_T(y_t \mid x, y_{<t})^{\lambda}, \qquad \lambda \in [0, 1],$

with $\lambda$ set by the trust-region radius $\rho$. This yields causal prefix weights based on the teacher-to-student likelihood ratio over the sampled prefix,

$w_t \propto \prod_{s<t} \frac{\pi_T(y_s \mid x, y_{<s})}{\pi_\theta(y_s \mid x, y_{<s})},$

and motivates token reweighting by accumulated prefix discrepancy (Xie et al., 21 Jun 2026).
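As a check on the trust-region projection above, a short Lagrangian argument recovers the form of its optimum. This is a standard derivation, sketched here for a finite vocabulary and full-support distributions at a fixed prefix; the multipliers $\eta \ge 0$ and $\nu$ and the coefficient $\lambda$ are notation introduced for this sketch and may differ from the cited paper.

$\mathcal{L}(q, \eta, \nu) = D_{\mathrm{KL}}(q \,\|\, \pi_T) + \eta \left( D_{\mathrm{KL}}(q \,\|\, \pi_\theta) - \rho \right) + \nu \left( \sum_y q(y) - 1 \right)$

$\frac{\partial \mathcal{L}}{\partial q(y)} = \log \frac{q(y)}{\pi_T(y)} + 1 + \eta \left( \log \frac{q(y)}{\pi_\theta(y)} + 1 \right) + \nu = 0$

$q^\star(y) \propto \pi_T(y)^{\lambda}\, \pi_\theta(y)^{1-\lambda}, \qquad \lambda = \frac{1}{1+\eta}$

When the constraint is inactive ( $\eta = 0$ ), the solution is the teacher itself; as $\rho \to 0$ the multiplier grows and the solution collapses onto the student. In between, $q^\star / \pi_\theta \propto (\pi_T / \pi_\theta)^{\lambda}$, a tempered teacher-to-student ratio.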

Because raw ratio products are high variance, the practical surrogate replaces them with a normalized cumulative discrepancy, and the final weight $w_t$ blends that cumulative-share term with the uniform OPD weight. The weighted token advantage then takes the form

$\tilde{A}_t = \mathrm{sg}(w_t)\, A_t^{\mathrm{OPD}},$

where $\mathrm{sg}(\cdot)$ denotes stop-gradient. The stop-gradient is essential in that construction, because $w_t$ is itself computed from the student's log-probabilities and would otherwise receive gradient (Xie et al., 21 Jun 2026).
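One way to picture a cumulative-share weighting is the sketch below. It is not the cited paper's formula: it assumes the per-token discrepancy is $|A_t^{\mathrm{OPD}}|$, that a token's weight shrinks with the share of total discrepancy accumulated strictly before it, and that a hypothetical coefficient `alpha` blends the result with uniform weights. Weights are plain floats, which plays the role of the stop-gradient.

```python
def cumulative_share_weights(advantages, alpha=0.5):
    """Blend uniform weights with (1 - share of discrepancy accumulated before token t).

    Assumptions of this sketch: discrepancy is |A_t|, the adaptive part is rescaled
    to mean one, and alpha in [0, 1] controls the blend (alpha = 0 is plain OPD).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    disc = [abs(a) for a in advantages]
    n = len(disc)
    total = sum(disc)
    if n == 0 or total == 0.0:
        return [1.0] * n
    raw, running = [], 0.0
    for d in disc:
        # Causal: only discrepancy from earlier tokens lowers this token's weight.
        raw.append(max(0.0, 1.0 - running / total))
        running += d
    scale = n / sum(raw)  # rescale the adaptive part to mean one
    return [(1.0 - alpha) + alpha * r * scale for r in raw]


def weighted_advantages(advantages, alpha=0.5):
    weights = cumulative_share_weights(advantages, alpha)  # constants: no gradient path
    return [w * a for w, a in zip(weights, advantages)]


# Hypothetical rollout whose teacher-student gap grows with position.
adv = [-0.1, -0.2, -1.5, -2.0]
w = cumulative_share_weights(adv, alpha=0.5)
```

Because the weights depend on each rollout's own discrepancy profile, two trajectories of equal length can receive different schedules, which is the sense in which such a scheme is trajectory-adaptive rather than a fixed positional decay.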

A second major mathematical route appears in reward-regularized self-distillation. There the target is not the teacher distribution itself but a reward-reweighted teacher,

$\tilde{\pi}_T(y \mid x) \propto w(x, y)\, \pi_T(y \mid x),$

with weights $w(x, y)$ that increase with the reward, so that probability mass moves toward higher-reward outputs the teacher already supports. In that view, importance weighting operationalizes reward-aware reallocation within teacher support rather than prefix correction alone (Yu et al., 6 May 2026).
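A reward-reweighted teacher can be illustrated on a small set of candidate outputs. The sketch below assumes an exponential tilt $\exp(r / \beta)$ with a hypothetical temperature `beta`; this is a common choice for reward tilting, not necessarily the exact form used in the cited paper.

```python
import math


def reward_tilted_teacher(teacher_probs, rewards, beta=1.0):
    """Return pi_T(y) * exp(r(y) / beta), renormalized over the candidates."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    # Subtract the largest reward on the teacher's support for numerical stability.
    m = max(r for r, p in zip(rewards, teacher_probs) if p > 0)
    unnorm = [p * math.exp((r - m) / beta) for p, r in zip(teacher_probs, rewards)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


# Hypothetical candidates: the last one has high reward but zero teacher probability.
teacher = [0.5, 0.3, 0.2, 0.0]
rewards = [0.0, 1.0, 0.0, 5.0]
tilted = reward_tilted_teacher(teacher, rewards, beta=0.5)
```

The tilt raises the probability of the rewarded candidate, preserves the teacher's ratio between equally rewarded candidates, and cannot create mass outside the teacher's support.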

3. Sources of importance weights

The literature uses “importance” in several distinct senses. Some papers use explicit probability ratios; others use advantage weights, confidence scores, entropy-derived proxies, or value attribution signals. The common structure is non-uniform allocation of OPD signal.

Representative line	Weight source	Characteristic target
Position-bias IW-OPD (Xie et al., 21 Jun 2026)	Accumulated teacher–student prefix discrepancy	Upweight earlier tokens, downweight later drifted tokens
Reward-regularized self-distillation / PBSD (Yu et al., 6 May 2026)	Reward-derived weights on the teacher distribution	Distill to a reward-reweighted teacher
OISD (Liu et al., 27 May 2026)	Signed clipped GRPO advantages	Advantage-weighted JS alignment of logits and attention
ReNIO and FiRe-OPD (Lin et al., 22 Jun 2026, Li et al., 1 Jun 2026)	Sample weights from selected log-ratios; token weights from teacher confidence and student confusion after trajectory filtering	Emphasize likely negative trajectories and informative tokens
DOPD, MAD-OPD, OPD-DA, OPD-Evolver (Yu et al., 29 Jun 2026, Wang et al., 2 May 2026, Yu et al., 2024, Zhang et al., 16 Jun 2026)	Advantage/confidence routing, debate confidence, attention over peers, outcome-calibrated attribution	Weight supervision sources, token types, or lifecycle decisions
DiffusionOPD extension (Li et al., 14 May 2026)	State-density or transition-level ratios in continuous-state diffusion	Correct teacher–student state-distribution mismatch when desired

These variants are not interchangeable. In some formulations the weight modulates tokens within a single trajectory; in others it weights whole trajectories, chooses among multiple teachers, or adjusts supervision on latent or continuous-state transitions. This suggests that IW-OPD is better understood as a design pattern than as a single canonical algorithm.

Several papers also distinguish between “teacher matching” and “teacher reweighting.” PBSD explicitly argues that plain KL matching uniformly imitates the teacher, whereas reward-aware distillation should target a reward-tilted teacher distribution (Yu et al., 6 May 2026). ReNIO, by contrast, uses the student-to-teacher probability ratio to detect pivotal tokens in wrong reasoning traces and aggregates them into a normalized sample weight, thereby emphasizing likely negative trajectories without observing final-answer correctness (Lin et al., 22 Jun 2026). FiRe-OPD decomposes the weighting problem into hard trajectory filtering and soft token reweighting, arguing that low-quality rollouts should be removed entirely while token informativeness should be adjusted continuously (Li et al., 1 Jun 2026).

Not all papers presenting weighted OPD actually require explicit importance sampling. DiffusionOPD is explicit that the original method “does not use importance weighting” because the per-step KL is already defined under student rollouts and exactly computable at student-visited states; the proposed IW-OPD variant is an extension for correcting distribution mismatch rather than a component of the original method (Li et al., 14 May 2026).

4. Stabilization, routing, and algorithmic structure

The central practical issue in IW-OPD is variance. Multiple papers identify heavy-tailed updates when raw teacher–student ratios or log-ratios are used too directly. Asymmetric On-Policy Distillation notes that the sampled-token log-ratio

$A_t^{\mathrm{OPD}} = \log \pi_T(y_t \mid x, y_{<t}) - \log \pi_\theta(y_t \mid x, y_{<t})$

has a broad negative tail, and that using this quantity directly as a weight would exacerbate heavy-tailed updates, especially when the student assigns near-zero probability to teacher-favored tokens (Jia et al., 7 May 2026). The position-bias paper reaches a similar conclusion from a different direction: the ideal likelihood-ratio weight alone is unstable, while the practical cumulative-share surrogate plus OPD blend is effective (Xie et al., 21 Jun 2026).

One stabilization family therefore clips, tempers, or normalizes weights. DiffusionOPD’s proposed IW extension recommends stop-gradient weights or weights computed with a lagged student, together with clipping, self-normalization, or tempering to control variance (Li et al., 14 May 2026). ReNIO clips selected token log-ratios, aggregates them by a geometric mean, and batch-normalizes the resulting sample weights; removing batch normalization hurts most in ablations (Lin et al., 22 Jun 2026). FiRe-OPD normalizes token weights within each trajectory so that the weighted advantage preserves gradient scale (Li et al., 1 Jun 2026).
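The stabilizers named above compose naturally. The following sketch assumes a clipping range of $[-2, 2]$ on log-ratios, a tempering exponent of 0.5, and reads "batch normalization" as rescaling weights to mean one across the batch; all three are illustrative choices, not values reported by the cited papers.

```python
import math


def clipped_geometric_mean(log_ratios, lo=-2.0, hi=2.0):
    """Clip token log-ratios, then aggregate them by a geometric mean of the ratios."""
    clipped = [min(max(v, lo), hi) for v in log_ratios]
    return math.exp(sum(clipped) / len(clipped))


def temper(weights, alpha=0.5):
    """Raise weights to a power in (0, 1] to shrink their dynamic range."""
    return [w ** alpha for w in weights]


def self_normalize(weights):
    """Rescale weights to mean one so the overall gradient scale is preserved."""
    total = sum(weights)
    n = len(weights)
    return [w * n / total for w in weights]


# Hypothetical batch: selected token log-ratios for three trajectories.
batch = [[-0.1, -0.3], [-4.0, -6.0], [0.5, 3.5]]
raw = [clipped_geometric_mean(x) for x in batch]
tempered = temper(raw, alpha=0.5)
weights = self_normalize(tempered)
```

Clipping bounds each sample weight before aggregation, tempering narrows the spread between the largest and smallest weight, and mean-one normalization keeps the effective learning rate comparable to unweighted OPD.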

A second stabilization family changes the update route rather than merely rescaling it. REOPOLD interprets OPD as policy optimization with a token-level reward, clips that reward using a mixture-derived lower bound, and combines this with entropy-based token-level dynamic sampling and a two-phase exploration-to-refinement curriculum (Ko et al., 11 Mar 2026). The survey literature separates temporal credit from vocabulary routing and argues that negative feedback on the sampled token does not, by itself, specify which teacher-supported alternative should receive probability mass. That distinction motivates GAE-OPD for temporal credit and Counterfactual Routed OPD for explicit probability routing on a declared support (Zhang, 22 Jun 2026).

A third response is to replace risky scalar-weighted negative reinforcement with bounded distributional guidance. AOPD routes positive-advantage positions through standard OPD but replaces non-positive regions with teacher-centered forward KL on top-$k$ support, thereby reducing variance, avoiding vanishing gradients in neutral regions, and addressing the “exploration black hole” in which the correct unsampled alternative receives negligible direct support (Jia et al., 7 May 2026). This does not reject importance weighting in principle, but it makes explicit that naive weighting can be inferior to structured divergence minimization in adverse regions.
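The asymmetric routing idea can be sketched per position. This is an illustration, not the AOPD implementation: it assumes the top-$k$ support is chosen from the teacher distribution, that both distributions are renormalized on that support, and that the positive-advantage branch uses the usual surrogate $-A_t \log \pi_\theta(y_t)$ with $A_t$ held constant. All function names are hypothetical.

```python
import math


def top_k_indices(probs, k):
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]


def forward_kl_on_support(teacher, student, support):
    """KL(teacher || student) after renormalizing both on the given support."""
    zt = sum(teacher[i] for i in support)
    zs = sum(student[i] for i in support)
    kl = 0.0
    for i in support:
        t = teacher[i] / zt
        s = student[i] / zs
        if t > 0:
            kl += t * (math.log(t) - math.log(s))
    return kl


def routed_position_loss(teacher, student, sampled, k=2):
    """Route one position by the sign of its sampled-token advantage."""
    adv = math.log(teacher[sampled]) - math.log(student[sampled])
    if adv > 0:
        # Positive advantage: standard OPD surrogate on the sampled token.
        return "opd", -adv * math.log(student[sampled])
    # Non-positive advantage: bounded distributional guidance toward the teacher.
    support = top_k_indices(teacher, k)
    return "forward_kl", forward_kl_on_support(teacher, student, support)


# Hypothetical three-token vocabulary where the student favors the wrong token.
teacher = [0.7, 0.2, 0.1]
student = [0.1, 0.1, 0.8]
route_bad, loss_bad = routed_position_loss(teacher, student, sampled=2)
route_good, loss_good = routed_position_loss(teacher, student, sampled=0)
```

In the negative branch the loss names the teacher-preferred alternatives explicitly, rather than only penalizing the sampled token and leaving the destination of the freed probability mass unspecified.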

Across these formulations, a recurring algorithmic template emerges: sample on-policy trajectories; compute teacher or self-teacher feedback on visited prefixes; derive weights from discrepancy, reward, entropy, confidence, or attribution; stabilize them with clipping, normalization, or stop-gradient; and apply them either to direct distributional losses, to policy-gradient-style actor terms, or to both.
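The template above can be written as a short pipeline. The sketch below picks one concrete option at each step purely for illustration: student entropy as the weight source, mean-one normalization followed by clipping as the stabilizer, and a policy-gradient-style coefficient $w_t A_t$ as the output. The function name and clipping range are assumptions.

```python
import math


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)


def iw_opd_coefficients(student_dists, teacher_dists, sampled_tokens, clip=(0.2, 5.0)):
    """Return (per-token coefficients w_t * A_t, weights w_t) for one rollout."""
    adv, raw = [], []
    for ps, pt, y in zip(student_dists, teacher_dists, sampled_tokens):
        # Teacher feedback on the visited prefix: sampled-token log-ratio.
        adv.append(math.log(pt[y]) - math.log(ps[y]))
        # Weight source (assumed here): student entropy at this position.
        raw.append(entropy(ps))
    mean_raw = sum(raw) / len(raw)
    if mean_raw == 0.0:
        weights = [1.0] * len(raw)
    else:
        # Stabilize: normalize to mean one, then clip. Weights are constants.
        weights = [min(max(r / mean_raw, clip[0]), clip[1]) for r in raw]
    return [w * a for w, a in zip(weights, adv)], weights


# Hypothetical two-step rollout: an uncertain position, then a routine one.
student = [[0.25, 0.25, 0.25, 0.25], [0.97, 0.01, 0.01, 0.01]]
teacher = [[0.7, 0.1, 0.1, 0.1], [0.9, 0.05, 0.03, 0.02]]
coeffs, weights = iw_opd_coefficients(student, teacher, sampled_tokens=[0, 0])
```

Swapping the weight source for accumulated discrepancy, reward, or confidence changes only the line that fills `raw`; the stabilization and application steps stay the same.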

5. Empirical profile

The empirical record is heterogeneous because the literature studies different notions of “importance,” but several patterns recur: weights help most under strong teacher–student mismatch, long horizons, or sparse high-value supervision; trajectory-adaptive or state-adaptive schemes outperform fixed schedules; and stabilization is usually decisive.

Setting	Method	Reported result
Position bias in LLM OPD	Trajectory-adaptive cumulative-share IW-OPD	AIME-2025: 48.9 vs 43.3 for standard OPD; step-10 AIME-2025: 49.3 vs 42.4; cross-scale final average: 57.1 vs 55.3 (Xie et al., 21 Jun 2026)
Negative-trajectory reweighting	ReNIO	Qwen3-1.7B OPSD math Avg: 42.78 vs 40.83; OPD math Avg: 42.04 vs 40.37 (Lin et al., 22 Jun 2026)
Filter-then-reweight	FiRe-OPD	Strong-to-Weak Avg@8: 60.83 vs 58.70 for OPD; Multi-Teacher code: 64.16 vs 59.79; +18.81 on MinervaMATH in multi-teacher (Li et al., 1 Jun 2026)
Dual routed supervision	DOPD	LLM average: 51.4 vs 43.9 for Vanilla OPD; VLM average: 58.4 vs 52.4 (Yu et al., 29 Jun 2026)
Debate-weighted multi-teacher OPD	MAD-OPD	In the 14B+8B $\to$ 4B setting, Ag-Avg 25.69 vs 23.26 and Co-Avg 48.12 vs 44.41 over stronger single-teacher OPD (Wang et al., 2 May 2026)
Internal signed-advantage weighting	OISD	Qwen3-4B average: 64.75 vs 55.08 for GRPO (Liu et al., 27 May 2026)

The position-bias results are especially central because they directly instantiate the term IW-OPD. They show that merely preferring earlier positions is not enough: “Amplify fixed prefix ratio” reaches 43.7, “Linear decay” 44.1, “Manual curriculum” 44.8, while the trajectory-adaptive cumulative-share method reaches 48.9 on AIME-2025 (Xie et al., 21 Jun 2026). The same paper also reports that smaller students benefit more and that stronger teachers become more sample-efficient once weighting corrects early-stage mismatch.

Other weighted formulations support the broader claim that selective allocation of OPD signal improves training. ReNIO reports that incorrect-only training outperforms correct-only training in controlled filtering experiments and that its ratio-based weighting improves both OPD and OPSD on math and code (Lin et al., 22 Jun 2026). FiRe-OPD finds that hard trajectory filtering plus soft token reweighting outperforms hard–hard, soft–soft, and soft–hard alternatives, which suggests that optimization granularity matters: trajectory-level noise removal and token-level proportional emphasis are complementary rather than redundant (Li et al., 1 Jun 2026).

In nonstandard OPD settings the same pattern persists. DOPD reports that advantage-aware dual routing “consistently outperforms Vanilla OPD and other counterparts” in LLM and VLM settings, with large gains in high-mismatch configurations (Yu et al., 29 Jun 2026). MAD-OPD shows that confidence-weighted aggregation after multi-agent debate can beat stronger single-teacher OPD in both agentic and code tasks (Wang et al., 2 May 2026). DiffusionOPD, although not itself an IW method in the original paper, reports that closed-form on-policy distillation surpasses both multi-reward RL and cascade RL and that ODE sampling can yield efficiency gains over SDE sampling with moderate noise, which indicates that weighting questions also arise in continuous-state generative dynamics (Li et al., 14 May 2026).

6. Limits, ambiguities, and research directions

A persistent conceptual ambiguity is that “importance weighting” may refer either to unbiased correction of a sampling mismatch or to an intentional reallocation of learning signal. The surveys explicitly note that in strictly on-policy minibatches, off-policy correction is often unnecessary, and many OPD weights are instead allocation weights derived from log-ratios, entropy, verifier signals, or routing rules (Song et al., 1 Apr 2026, Zhang, 22 Jun 2026). This distinction matters because an allocation weight can be useful even when it is not an importance ratio in the classical Monte Carlo sense.

A second limitation is that weighting is not universally necessary or universally helpful. DiffusionOPD’s original objective already matches teacher transitions exactly at student-visited states and therefore “no explicit importance sampling is used or needed” (Li et al., 14 May 2026). AOPD further argues that naive importance weighting can be actively harmful in negative-advantage regions because it amplifies precisely the heavy-tailed updates that destabilize OPD (Jia et al., 7 May 2026). This suggests that the right comparison is often not weighted versus unweighted OPD, but weighted scalar reinforcement versus bounded distributional or routed correction.

The main technical failure modes are repeated across the literature: severe teacher–student mismatch and long horizons can exacerbate off-policy gaps; importance weights can have large variance; sparse reward teachers or brittle experts can limit guidance quality; privileged context may be weak or misleading; memory attribution may be noisy; and layer or support selection may be underexplored (Li et al., 14 May 2026, Yu et al., 6 May 2026, Zhang et al., 16 Jun 2026, Liu et al., 27 May 2026). The surveys add broader unresolved questions: distillation scaling laws, uncertainty-aware weighting, agent-level distillation, cross-tokenizer and latent-space IW, robust weighting under distribution shift, and principled integration of distillation with reward-guided learning (Song et al., 1 Apr 2026).

A plausible implication is that future IW-OPD work will become more explicit about three separations that the current literature sometimes conflates: temporal credit versus vocabulary routing, correction versus allocation, and local token weighting versus higher-level weighting over trajectories, teachers, or latent modules. The recent formula-driven agenda already frames those as distinct design variables rather than a single knob (Zhang, 22 Jun 2026). Under that view, IW-OPD is less a fixed algorithm than a unifying principle for deciding where dense on-policy teacher feedback should matter most.