Policy-Label Divergence (PLD) Overview
- Policy-Label Divergence (PLD) is a metric defined by the KL divergence between static training labels and evolving model policies, highlighting supervision mismatches.
- PLD diagnoses catastrophic forgetting and mode collapse by measuring how fine-tuning shifts output distributions away from oracle references.
- Trajectory-Mixed Supervision (TMS) utilizes PLD insights to blend static and dynamic supervision, thereby preserving model capabilities without expensive RL pipelines.
Policy-Label Divergence (PLD) is a formal metric that quantifies the distributional mismatch between a learning agent’s evolving policy and the fixed supervision provided by training labels. PLD is central to diagnosing catastrophic forgetting and mode collapse during supervised fine-tuning (SFT) of large models, particularly when the solution space is inherently multi-modal. Mitigating PLD underlies the design of recent reward-free, on-policy supervision strategies, such as Trajectory-Mixed Supervision (TMS), which aim to preserve broad model capabilities without expensive reinforcement learning (RL) pipelines (Khan et al., 3 Feb 2026).
1. Formal Definition and Theoretical Basis
PLD is mathematically defined as the forward Kullback-Leibler (KL) divergence between the static label distribution $p^{*}(\cdot \mid x)$ and the current model policy $\pi_{\theta}(\cdot \mid x)$, averaged over the input distribution $\mathcal{D}$:

$$\mathrm{PLD}(\theta) = \mathbb{E}_{x \sim \mathcal{D}}\big[\, \mathrm{KL}\big(p^{*}(\cdot \mid x) \,\|\, \pi_{\theta}(\cdot \mid x)\big) \,\big]$$
In standard single-reference SFT, where each input $x$ is paired with a unique label $y^{*}$, the oracle distribution is a Dirac delta at $y^{*}$, and PLD reduces to the token-level cross-entropy (modulo constants). For practical monitoring, held-out PLD drift is estimated by the average negative log-likelihood (NLL) on a validation set $V$:

$$\widehat{\mathrm{PLD}} \approx \frac{1}{|V|} \sum_{(x,\, y^{*}) \in V} -\log \pi_{\theta}(y^{*} \mid x)$$
This formulation captures the degree to which the model’s current conditional output distribution aligns with the static labels used for supervision (Khan et al., 3 Feb 2026).
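As a concrete sketch of the definition above (assuming discrete output distributions; `forward_kl` and `pld` are illustrative helper names, not from the paper), the forward KL can be computed directly, and the single-reference case recovers the label's negative log-likelihood:

```python
import numpy as np

def forward_kl(p_star, pi):
    """Forward KL(p* || pi) over a discrete output space; 0 * log(0/q) := 0."""
    mask = p_star > 0
    return float(np.sum(p_star[mask] * np.log(p_star[mask] / pi[mask])))

def pld(oracle_dists, policy_dists):
    """Monte Carlo estimate of PLD: average forward KL over sampled inputs."""
    return float(np.mean([forward_kl(p, q) for p, q in zip(oracle_dists, policy_dists)]))

# Single-reference SFT: the oracle is a Dirac delta on the label token,
# so forward KL reduces to the label's negative log-likelihood.
pi = np.array([0.1, 0.6, 0.1, 0.1, 0.1])   # current policy over a toy vocabulary
p_star = np.zeros(5)
p_star[1] = 1.0                             # Dirac delta at the label token
assert np.isclose(forward_kl(p_star, pi), -np.log(pi[1]))
```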
2. PLD and Failure Modes in Supervised Fine-Tuning
PLD quantitatively encodes the supervision mismatch that arises when a model’s output distribution diverges from empirical labels during fine-tuning. As SFT proceeds, the model often uncovers valid alternative outputs, but single-reference cross-entropy continually refocuses probability on the unique label $y^{*}$, causing sharp corrective gradients. This leads to:
- Catastrophic forgetting: retention scores on unrelated benchmarks degrade as the policy loses support for modes not covered by $y^{*}$.
- Mode collapse: on tasks with non-unique solutions (e.g., multi-step math), the solution entropy contracts, resulting in poorer Pass@K or coverage metrics.
Empirical signatures include monotonic improvement on training split accuracy coupled with sharp loss in held-out NLL and unrelated task performance, reflecting an increase in PLD drift (Khan et al., 3 Feb 2026).
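The entropy-contraction signature of mode collapse can be illustrated with a toy measurement over sampled solutions (illustrative code, not from the paper; `solution_entropy` is a hypothetical helper):

```python
import numpy as np
from collections import Counter

def solution_entropy(samples):
    """Shannon entropy (nats) of the empirical distribution over distinct solutions."""
    counts = np.array(list(Counter(samples).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

diverse   = ["a+b", "b+a", "sum(a, b)", "a+b"]  # several valid modes survive
collapsed = ["a+b"] * 4                          # policy concentrated on one mode

assert solution_entropy(collapsed) == 0.0
assert solution_entropy(diverse) > solution_entropy(collapsed)
```

A collapsed policy can still score well on single-answer accuracy while this entropy, and with it Pass@K-style coverage, shrinks.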
3. PLD as a Predictor and Bound for Forgetting
A rigorous upper bound links increases in PLD to the magnitude of catastrophic forgetting. Under a bounded scoring function $f$ with $\sup_y |f(y)| \le M$:

$$\big|\, \mathbb{E}_{\pi_{\theta}}[f] - \mathbb{E}_{\pi_{0}}[f] \,\big| \le M \sqrt{2\, \mathrm{KL}\big(\pi_{\theta} \,\|\, \pi_{0}\big)}$$
This implies that large KL drift between the current and base policy permits correspondingly large changes in expected scores—a mechanistic link between PLD drift and observed retention loss. In practice, SFT accumulates considerable PLD drift and high negative forgetting, whereas on-policy RL and TMS maintain low PLD drift and commensurately preserve retention (Khan et al., 3 Feb 2026).
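A bound of this Pinsker-inequality flavor can be checked numerically (an illustrative sketch, with random categorical distributions standing in for the current policy $\pi_\theta$ and base policy $\pi_0$):

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for strictly positive categorical distributions."""
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(0)
M = 1.0                                        # bound on |f|
for _ in range(200):
    p = rng.dirichlet(np.ones(8))              # current policy pi_theta
    q = rng.dirichlet(np.ones(8))              # base policy pi_0
    f = rng.uniform(-M, M, size=8)             # bounded scoring function
    gap = abs(np.dot(p, f) - np.dot(q, f))     # |E_p[f] - E_q[f]|
    assert gap <= M * np.sqrt(2.0 * kl(p, q)) + 1e-12
```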
4. Minimizing PLD with Trajectory-Mixed Supervision
TMS provides a practical, reward-free approach to reduce PLD during supervised fine-tuning. Instead of training exclusively on static oracle references, TMS constructs a dynamic supervision set by harvesting trajectories from a sequence of model checkpoints along the SFT path. At training time, the supervision for each input is drawn as a mixture: with probability $\alpha$ from the oracle reference and with probability $1-\alpha$ from checkpoints. This recouples the supervision distribution to the evolving policy, ensuring the student model produces output distributions that are closer (in the PLD sense) to what the current or recent models likely generate.
Algorithmically, the TMS process consists of:
- Trajectory harvesting: Save checkpoints along the SFT path; cache model-generated outputs for each input at each checkpoint.
- Student training: For each sample, the target is drawn from the mixed label-buffer comprising references and model-generated outputs (controlled by the mixing weight $\alpha$), and standard cross-entropy is minimized.
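The mixed sampling step above can be sketched as follows (a minimal sketch; `oracle_ref` and `checkpoint_buffer` are hypothetical data structures, and `alpha` is the oracle mixing weight):

```python
import random

def tms_target(x, oracle_ref, checkpoint_buffer, alpha=0.5, rng=random):
    """Draw the supervision target for input x: the oracle reference with
    probability alpha, otherwise a cached trajectory harvested from a
    past checkpoint along the SFT path."""
    if rng.random() < alpha:
        return oracle_ref[x]
    return rng.choice(checkpoint_buffer[x])

oracle = {"q1": "y_star"}
buffer = {"q1": ["traj_ckpt1", "traj_ckpt2"]}
rng = random.Random(0)
assert tms_target("q1", oracle, buffer, alpha=1.0, rng=rng) == "y_star"
assert tms_target("q1", oracle, buffer, alpha=0.0, rng=rng) in buffer["q1"]
```

The sampled target then feeds the ordinary cross-entropy loss, so no reward model or verifier is needed.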
Experiments demonstrate that TMS maintains in-domain accuracy gains while drastically reducing PLD drift and catastrophic forgetting compared to standard SFT, approaching the behavior of on-policy RL methods but without reward/verification signals (Khan et al., 3 Feb 2026).
5. PLD in Broader Sequential Decision-Making
The concept of PLD generalizes beyond supervised LLM fine-tuning to sequential prediction and control domains. In Masked Trajectory Models (MTM), for example, policy-label divergence is controlled by training a single Transformer-based network to reconstruct masked segments of trajectories with random masking patterns. By enabling the network to condition on a wide array of input-output distributions, MTM regularizes the conditional modeling and limits mode collapse observed in more specialized architectures. This approach, akin to TMS, implicitly maintains low PLD by virtue of its mixed-modality training process (Wu et al., 2023).
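A random masking pattern of the kind MTM trains over can be sketched as follows (illustrative only; the exact masking distribution used in MTM may differ):

```python
import numpy as np

def random_mask(seq_len, rng, min_visible=1):
    """Sample a binary visibility mask over trajectory tokens (1 = visible,
    0 = masked); the model is trained to reconstruct the masked positions."""
    mask_ratio = rng.uniform(0.15, 0.95)       # a fresh ratio per trajectory
    mask = (rng.random(seq_len) > mask_ratio).astype(int)
    if mask.sum() < min_visible:               # keep at least one token visible
        mask[rng.integers(seq_len)] = 1
    return mask

rng = np.random.default_rng(0)
m = random_mask(16, rng)
assert m.shape == (16,) and set(np.unique(m)) <= {0, 1} and m.sum() >= 1
```

Varying the pattern per trajectory exposes the network to many conditional prediction problems, which is what regularizes the conditional modeling.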
6. Practical Monitoring and Experimental Findings
PLD drift is operationalized in practice by continuous monitoring of validation NLL and by explicitly computing the KL divergence between candidate policies and a reference (often the initialization). In comprehensive experiments on LLMs (Qwen-3B, LLaMA-8B) and across tasks such as reasoning (MATH, GSM8K), the correlation between held-out PLD and forgetting scores is empirically validated. PLD-minimizing frameworks like TMS consistently yield Pareto improvements in the accuracy-retention tradeoff, recovering most of the capability loss otherwise observed in static-label SFT (Khan et al., 3 Feb 2026).
| Method | Target Accuracy (MATH-500) | Retention (ARC-Challenge) | PLD Drift (Val NLL trend) |
|---|---|---|---|
| SFT | 76.4% | 42.7% | High |
| RL (GRPO) | 77.4% | 80.2% | Low |
| TMS (PLD-min) | 77.8% | 79.2% | Low |
PLD monitoring is thus a practical diagnostic for early warning of retention hazards and provides actionable guidance for curriculum design in both language and trajectory modeling domains.
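A minimal drift monitor along these lines can be sketched as follows (illustrative; `policy_drift` is a hypothetical helper that averages per-position KL from per-token logits of a reference and a candidate policy):

```python
import numpy as np

def policy_drift(ref_logits, cur_logits):
    """Average KL(pi_ref || pi_theta) over validation positions, from logits."""
    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)
    p, q = softmax(ref_logits), softmax(cur_logits)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))

ref = np.zeros((4, 10))                            # uniform reference policy
assert abs(policy_drift(ref, ref.copy())) < 1e-9   # identical policies: zero drift

shifted = ref.copy()
shifted[:, 0] = 3.0                                # mass concentrates on one token
assert policy_drift(ref, shifted) > 0.0            # drift registers as positive KL
```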
7. Implications and Limitations
By formalizing the supervision mismatch as PLD, research practitioners gain a theoretically principled and empirically corroborated lens on the mechanisms underlying catastrophic forgetting and retention loss. However, computation of full-distribution PLD is tractable only in discrete, modest output spaces or via approximations (e.g., surrogate distributions, NLL). While TMS and MTM architectures mitigate PLD by recoupling supervision to the evolving policy, their efficacy depends on checkpoint selection and coverage of the supervised solution space. The paradigm highlights a tradeoff: static SFT improves speed and efficiency but risks severe PLD-driven collapse; on-policy RL improves retention at high computational cost; TMS offers an intermediate, reward-free path to controlling PLD with modest complexity (Khan et al., 3 Feb 2026, Wu et al., 2023).