TAID: Temporally Adaptive Interpolated Distillation for Efficient Knowledge Transfer in Language Models

Published 28 Jan 2025 in cs.LG, cs.AI, and cs.CL | (2501.16937v4)

Abstract: Causal LLMs have demonstrated remarkable capabilities, but their size poses significant challenges for deployment in resource-constrained environments. Knowledge distillation, a widely-used technique for transferring knowledge from a large teacher model to a small student model, presents a promising approach for model compression. A significant remaining issue lies in the major differences between teacher and student models, namely the substantial capacity gap, mode averaging, and mode collapse, which pose barriers during distillation. To address these issues, we introduce $\textit{Temporally Adaptive Interpolated Distillation (TAID)}$, a novel knowledge distillation approach that dynamically interpolates student and teacher distributions through an adaptive intermediate distribution, gradually shifting from the student's initial distribution towards the teacher's distribution. We provide a theoretical analysis demonstrating TAID's ability to prevent mode collapse and empirically show its effectiveness in addressing the capacity gap while balancing mode averaging and mode collapse. Our comprehensive experiments demonstrate TAID's superior performance across various model sizes and architectures in both instruction tuning and pre-training scenarios. Furthermore, we showcase TAID's practical impact by developing two state-of-the-art compact foundation models: $\texttt{TAID-LLM-1.5B}$ for language tasks and $\texttt{TAID-VLM-2B}$ for vision-language tasks. These results demonstrate TAID's effectiveness in creating high-performing and efficient models, advancing the development of more accessible AI technologies.


Authors (5)

Summary

The paper presents a novel knowledge distillation approach that uses a time-dependent interpolation between the student and teacher outputs to mitigate mode collapse.
It employs a momentum-based adaptive update for the interpolation parameter; ablations attribute gains of roughly 2.2% to 17.7% to this adaptive update.
Experimental results in language and vision settings demonstrate that TAID consistently outperforms conventional methods and stabilizes training dynamics.

The paper presents a knowledge distillation approach for LLMs that targets two well-known problems: large capacity gaps between teacher and student, and the tension between mode averaging and mode collapse. The proposed method, Temporally Adaptive Interpolated Distillation (TAID), uses an adaptive intermediate distribution to move the training target gradually from the student's own output toward that of a high-capacity teacher. The following summary details the technical contributions and key findings.

Key Innovations and Methodology

TAID reformulates the typical KD objective by introducing a time-dependent intermediate teacher distribution defined as

$p_t(y_s \mid y^{<s}) = \operatorname{softmax}\Big((1-t)\,\mathrm{logit}_{q_\theta'}(y_s \mid y^{<s}) + t\,\mathrm{logit}_p(y_s \mid y^{<s})\Big),$

where:

$t \in [0, 1]$ is a scheduling parameter that evolves throughout training,
$\mathrm{logit}_{q_\theta'}$ is a detached copy of the student's logits (so that no gradient flows through the intermediate target), and
$\mathrm{logit}_p$ are the teacher logits.

This formulation lets the teacher's richer, high-capacity distribution gradually dominate as training progresses, avoiding the abrupt target shifts that commonly lead to mode collapse or oversmoothing in conventional KD.
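The interpolation can be illustrated with a minimal sketch in plain Python. It assumes the objective is a forward KL divergence from the intermediate target to the student, which this summary does not spell out; the function names and the example logits are hypothetical. Plain Python has no autograd, so the "detached" student logits are simply ordinary numbers here; in an autograd framework the student logits used inside the target would be excluded from the gradient.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def taid_target(student_logits, teacher_logits, t):
    # Intermediate target: softmax((1 - t) * student_logits + t * teacher_logits).
    mixed = [(1.0 - t) * s + t * p for s, p in zip(student_logits, teacher_logits)]
    return softmax(mixed)

def kl_divergence(p, q):
    # KL(p || q) for two discrete distributions given as lists.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Hypothetical logits over a three-token vocabulary.
student_logits = [2.0, 0.5, -1.0]
teacher_logits = [0.0, 3.0, 1.0]

for t in (0.0, 0.5, 1.0):
    target = taid_target(student_logits, teacher_logits, t)
    # Assumed objective: KL from the intermediate target to the student.
    loss = kl_divergence(target, softmax(student_logits))
    print(t, [round(x, 3) for x in target], round(loss, 4))
```

At $t = 0$ the target coincides with the student's own distribution, so the assumed loss vanishes; at $t = 1$ the target is exactly the teacher's distribution.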

In addition, TAID incorporates an adaptive update strategy for the interpolation parameter $t$. The update tracks the relative change in the TAID objective, denoted $\delta_n$, through a momentum-based mechanism. In early stages, when the target is still close to the student's own distribution, the parameter can be increased aggressively; as the student nears the teacher's performance, the update becomes more gradual. The momentum variable is

$m_n = \beta m_{n-1} + (1-\beta)\delta_n,$

and the subsequent update is

$t_{n+1} \leftarrow \min\Big(t_\text{end},\,\max\Big(t_\text{linear},\,t_n + \alpha\,\text{sigmoid}(m_n)\Big)\Big),$

where $\beta$ is the momentum coefficient, $\alpha$ scales the step size, $t_\text{linear}$ acts as a lower bound, and $t_\text{end}$ caps the schedule. This balances stability and responsiveness.
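The update rule can be sketched in a few lines of Python. Several details are assumptions rather than facts from this summary: $\delta_n$ is taken to be the relative decrease of the objective between consecutive steps, $t_\text{linear}$ is taken to be a linear ramp from a hypothetical `t_start` to `t_end`, and every numeric value below is an arbitrary placeholder, not a setting reported by the authors.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def update_t(t, momentum, loss_prev, loss_curr, step, total_steps,
             alpha, beta, t_start, t_end, eps=1e-12):
    # Assumption: delta is the relative decrease of the objective.
    delta = (loss_prev - loss_curr) / (loss_prev + eps)
    # Momentum update: m_n = beta * m_{n-1} + (1 - beta) * delta_n.
    momentum = beta * momentum + (1.0 - beta) * delta
    # Assumption: t_linear is a linear ramp from t_start to t_end.
    t_linear = t_start + (t_end - t_start) * step / total_steps
    # t_{n+1} = min(t_end, max(t_linear, t_n + alpha * sigmoid(m_n))).
    t_new = min(t_end, max(t_linear, t + alpha * sigmoid(momentum)))
    return t_new, momentum

# Toy run: pretend the objective shrinks by 3% at every step.
t, m = 0.2, 0.0
loss_prev = 1.0
total_steps = 100
history = []
for step in range(1, total_steps + 1):
    loss_curr = loss_prev * 0.97
    t, m = update_t(t, m, loss_prev, loss_curr, step, total_steps,
                    alpha=0.02, beta=0.9, t_start=0.2, t_end=1.0)
    history.append(t)
    loss_prev = loss_curr

print(round(history[0], 4), round(history[49], 4), round(history[-1], 4))
```

Because the sigmoid term is strictly positive, $t$ never decreases under this rule; the `max` keeps it at or above the linear ramp and the `min` stops it at `t_end`.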

Theoretical Analysis

To theoretically ground the method, the authors analyze TAID using a regression model as a surrogate for the language modeling objective. By framing the KD objective in the context of least-squares regression with an interpolation target, they show that TAID avoids the pitfalls of self-distillation. In self-distillation, the recursive use of the student’s predictions as targets can lead to collapse when the capacity gap is significant. In contrast, by injecting a fixed, strong teacher signal into the intermediate target, TAID guarantees the persistence of meaningful, non-collapsed modes. The authors demonstrate that if the norm of the teacher signal satisfies

$\|\mathbf{y}_0\| = \Omega\left(\sqrt{T\epsilon}\right),$

for a total of $T$ distillation steps, then the prediction dynamics remain nontrivial for all steps. This result is formalized through a comparison against bounds established for self-distillation, illustrating that TAID can safely traverse later phases of training without succumbing to collapse.
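The intuition behind this contrast can be shown with a deliberately simplified scalar toy, which is not the authors' analysis and does not reproduce the bound above. It assumes that each regularized least-squares fit acts on a single direction as multiplication by a shrink factor in $(0, 1)$; the names `shrink`, `self_distill`, and `interpolated_distill` and all numeric values are hypothetical.

```python
def self_distill(y0, shrink, steps):
    # Each round refits on the previous prediction, so the signal
    # is multiplied by the shrink factor again and decays to zero.
    y = y0
    for _ in range(steps):
        y = shrink * y
    return y

def interpolated_distill(y0, shrink, t, steps):
    # Each round fits an interpolated target that mixes the previous
    # prediction with the fixed teacher signal y0.
    y = y0
    for _ in range(steps):
        target = (1.0 - t) * y + t * y0
        y = shrink * target
    return y

y0, shrink, t, steps = 1.0, 0.8, 0.3, 50
collapsed = self_distill(y0, shrink, steps)
preserved = interpolated_distill(y0, shrink, t, steps)
# Fixed point of y = shrink * ((1 - t) * y + t * y0).
fixed_point = shrink * t * y0 / (1.0 - shrink * (1.0 - t))
print(collapsed, preserved, fixed_point)
```

In this toy, pure self-distillation shrinks the prediction geometrically toward zero, while the interpolated target converges to a nonzero fixed point proportional to the teacher signal.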

Empirical Analysis and Experimental Results

TAID is evaluated extensively across both instruction tuning and pre-training scenarios for LLMs, as well as in experiments extended to image classification tasks.

Instruction Tuning:
- The method is compared with several alternatives—such as standard KL divergence, Reverse KL, Total Variation Distance, Adaptive KL, GKD, DistiLLM, CTKD, and DKD—using diverse teacher-student pairs.
- Across all configurations, TAID yields higher MT-Bench scores, with improvements that range from modest (when the student-teacher gap is small) to significant (when the capacity gap is large).
- Ablation studies show a performance increase of approximately 2.2% to 17.7% when using the adaptive update for $t$, underscoring its contribution to learning dynamics.
Pre-training Scenarios:
- Experiments conducted on a substantial corpus (covering approximately 20 billion tokens) show that TAID outperforms its baselines on established evaluation benchmarks (e.g., those following Open LLM Leaderboard methodologies).
- The results consistently indicate that TAID’s adaptive mechanism yields a more stable training loss with lower variance, aligning the training difficulty with the evolving student capabilities.
Capacity Gap and Mode Balance:
- A systematic study over varying teacher sizes reveals that while traditional methods (KL and RKL) exhibit inconsistent trends with increasing teacher capacity, TAID shows a monotonic improvement.
- Additional statistical analyses confirm that TAID maintains a balanced probability mass between high-frequency (“head”) and low-frequency (“tail”) tokens, thereby effectively mitigating both mode averaging and mode collapse.
Vision-Language and Image Classification Extensions:
- Beyond language tasks, experiments on image classification (CIFAR-100 and ImageNet) indicate that TAID delivers modest gains on the simpler task (CIFAR-100), while its advantages become clearer on more challenging distributions such as ImageNet.
- The paper also provides a detailed comparison with the Skew KL approach, emphasizing that TAID’s dynamic interpolation allows for a direct, adaptive transfer of knowledge from the teacher to the student, rather than an indirect and fixed mixture.
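The contrast with Skew KL in the last point can be made concrete with a small Python sketch. It assumes the common definition of skew KL, in which the teacher is compared against a fixed mixture of teacher and student probabilities, and it assumes a forward KL for the TAID objective; neither formula is given in this summary, and all names and numbers are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    # KL(p || q) for two discrete distributions given as lists.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def skew_kl(teacher_probs, student_probs, lam):
    # Assumed skew KL: KL(p || lam * p + (1 - lam) * q) with a fixed lam.
    # The mixture is formed in probability space on the student side.
    mixture = [lam * p + (1.0 - lam) * q
               for p, q in zip(teacher_probs, student_probs)]
    return kl_divergence(teacher_probs, mixture)

def taid_loss(teacher_logits, student_logits, t):
    # Assumed TAID objective: KL(p_t || q), where p_t interpolates the
    # logits and t changes over the course of training.
    mixed = [(1.0 - t) * s + t * p
             for s, p in zip(student_logits, teacher_logits)]
    return kl_divergence(softmax(mixed), softmax(student_logits))

student_logits = [2.0, 0.5, -1.0]
teacher_logits = [0.0, 3.0, 1.0]
p = softmax(teacher_logits)
q = softmax(student_logits)

print("skew KL, fixed lam=0.1:", round(skew_kl(p, q, 0.1), 4))
for t in (0.1, 0.5, 0.9):
    print("TAID loss at t =", t, ":", round(taid_loss(teacher_logits, student_logits, t), 4))
```

Under these assumptions, the skew variant keeps the teacher as the target and softens the comparison with a fixed mixture, whereas TAID moves the target itself from the student toward the teacher as $t$ grows.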

Applications to State-of-the-Art Model Development

The practical impact of TAID is further demonstrated by the development of two new state-of-the-art compact models:

TAID-LLM-1.5B:
- An LLM under 2B parameters that outperforms its contemporaries on LightEval benchmarks.
- The model achieves superior performance by effectively transferring knowledge from much larger teacher models without incurring the computational cost associated with self-generated outputs.
TAID-VLM-2B:
- A vision-language model that surpasses existing competitors (even some with larger parameter counts) on tasks from the Open VLM Leaderboard.
- This cross-modal application highlights TAID’s versatility in handling diverse output distributions beyond traditional language tasks.

Conclusion

By reframing the traditional knowledge distillation objective using a temporally adaptive interpolation mechanism, TAID bridges the capacity gap between teacher and student networks while preventing detrimental mode collapse. The theoretical analysis, supported by empirical validation across multiple tasks and model architectures, establishes TAID as a robust and efficient distillation technique for producing high-quality, compact language and vision-language models.
