SWITCH: Studying with Teacher for Knowledge Distillation of Large Language Models (2410.19503v2)

Published 25 Oct 2024 in cs.CL

Abstract: Despite the success of LLMs, they still face challenges related to high inference costs and memory requirements. To address these issues, Knowledge Distillation (KD) has emerged as a popular method for model compression, with student-generated outputs (SGOs) as training data being particularly notable for reducing the mismatch between training and inference. However, SGOs often produce noisy and biased sequences, which can lead to misguidance from the teacher model, especially in long sequences. To mitigate these challenges, we propose SWITCH (Studying WIth TeaCHer for Knowledge Distillation), a novel approach that strategically incorporates the teacher model during the student's sequence generation. SWITCH identifies discrepancies between the token probabilities of the teacher and student models, allowing the teacher to intervene selectively, particularly in long sequences that are more prone to teacher misguidance. Extensive experimental results across three model families and five instruction-following datasets show that SWITCH surpasses traditional KD methods, particularly excelling in the generation of long sequential data.

Summary

  • The paper introduces SWITCH, a novel KD approach that employs selective teacher intervention using Jensen-Shannon Divergence to reduce error accumulation.
  • The methodology uses an exponentially decaying threshold to optimize teacher involvement, improving long sequence generation and maintaining Rouge-L scores.
  • Experimental results on models like GPT-2, OPT, and OpenLLaMA2 show that SWITCH outperforms traditional SGO-based methods for efficient model compression.

Overview of "SWITCH: Studying with Teacher for Knowledge Distillation of LLMs"

The paper "SWITCH: Studying with Teacher for Knowledge Distillation of LLMs" introduces a novel approach to enhancing the efficiency of Knowledge Distillation (KD) processes in the context of LLMs. Given the considerable computational costs associated with LLMs, model compression via KD is an attractive area of research. The SWITCH method addresses challenges inherent in KD, particularly those arising from reliance on Student-Generated Outputs (SGOs), which are prone to noise and bias.

Context and Motivation

LLMs have demonstrated impressive capabilities, but their deployment is often constrained by significant resource demands. KD is a prominent method for reducing model size while attempting to maintain performance levels. Traditional KD methods have been extended from natural language understanding to generation tasks. In the context of text generation, SGOs are used to close the gap between training and inference. However, because these outputs are often noisy and biased, they can elicit misleading feedback from the teacher, especially in longer sequences where errors accumulate due to the autoregressive nature of the models.

Methodology

SWITCH strategically incorporates the teacher model while the student generates its own training sequences. The approach identifies discrepancies between the token probabilities of the student and teacher models and allows the teacher to intervene selectively, which is particularly useful in long sequences, where the risk of cumulative errors is greater.
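
The summary does not give the paper's exact formulas, but the selection rule it describes can be written in our own (assumed) notation, where p_T and p_S are the teacher's and student's next-token distributions, m is their mixture, and τ_t is a position-dependent threshold:

```latex
% Assumed notation; the paper's exact formulation may differ.
\[
\mathrm{JSD}(p_T \,\|\, p_S)
  = \tfrac{1}{2}\,\mathrm{KL}(p_T \,\|\, m) + \tfrac{1}{2}\,\mathrm{KL}(p_S \,\|\, m),
\qquad m = \tfrac{1}{2}(p_T + p_S),
\]
\[
y_t \sim
\begin{cases}
  p_T(\cdot \mid x, y_{<t}) & \text{if } \mathrm{JSD}\big(p_T(\cdot \mid x, y_{<t}) \,\|\, p_S(\cdot \mid x, y_{<t})\big) > \tau_t,\\
  p_S(\cdot \mid x, y_{<t}) & \text{otherwise.}
\end{cases}
\]
```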

Key Components

  • Selective Intervention: SWITCH employs Jensen-Shannon Divergence (JSD) to measure the difference in distributions between student and teacher models. If the divergence exceeds a defined threshold, the token is generated using the teacher model.
  • Exponentially Decaying Threshold: To mitigate the bias that accumulates in lengthy sequences, SWITCH increases the teacher's involvement as more tokens are generated, employing a threshold that decays exponentially with position to modulate this interaction efficiently; a minimal code sketch follows this list.
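
As a concrete illustration, the sketch below implements this kind of JSD-gated decoding loop with an exponentially decaying threshold. The model names, the threshold schedule (tau_t = tau0 · decay^t), the sampling strategy, and all hyperparameter values are illustrative assumptions, not the paper's configuration.

```python
# Hypothetical sketch of SWITCH-style decoding: the student proposes each next
# token, but the teacher takes over whenever the Jensen-Shannon divergence
# between their next-token distributions exceeds a position-dependent threshold.
# Model names, threshold schedule, and sampling details are assumptions.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two probability vectors."""
    m = 0.5 * (p + q)
    kl_pm = torch.sum(p * (torch.log(p + eps) - torch.log(m + eps)))
    kl_qm = torch.sum(q * (torch.log(q + eps) - torch.log(m + eps)))
    return 0.5 * (kl_pm + kl_qm)

@torch.no_grad()
def switch_generate(prompt, student, teacher, tokenizer,
                    max_new_tokens=64, tau0=0.3, decay=0.95):
    """Generate a training sequence, letting the teacher intervene when it
    disagrees strongly with the student."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for t in range(max_new_tokens):
        p_student = F.softmax(student(ids).logits[0, -1], dim=-1)
        p_teacher = F.softmax(teacher(ids).logits[0, -1], dim=-1)
        tau_t = tau0 * (decay ** t)  # exponentially decaying threshold
        if jsd(p_teacher, p_student) > tau_t:
            next_id = torch.multinomial(p_teacher, 1)  # teacher intervenes
        else:
            next_id = torch.multinomial(p_student, 1)  # student keeps control
        ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(ids[0], skip_special_tokens=True)

if __name__ == "__main__":
    tok = AutoTokenizer.from_pretrained("gpt2")
    student = AutoModelForCausalLM.from_pretrained("gpt2").eval()
    teacher = AutoModelForCausalLM.from_pretrained("gpt2-large").eval()
    print(switch_generate("Explain knowledge distillation:", student, teacher, tok))
```

Because the threshold shrinks as generation proceeds, the teacher is consulted more often late in the sequence, which is where the summary says accumulated student errors are most damaging.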

Experimental Results

The experimental setup involved testing across different model families (GPT-2, OPT, and OpenLLaMA2) and datasets. SWITCH consistently outperformed baseline approaches, showcasing notable performance improvements in generating long sequential data. The authors observed that performance gains were more substantial with a greater disparity in model sizes between student and teacher models, illustrating effective mitigation of misguidance.

Numerical Highlights

The performance of SWITCH was validated on five instruction-following benchmarks and remained robust across varying student model sizes. Compared to traditional KD and SGO-based methods, SWITCH better maintained Rouge-L scores on longer generated sequences.
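
For reference, Rouge-L scoring of generated responses is commonly computed with Google's rouge-score package; the snippet below is a generic usage example, not the paper's evaluation pipeline, and the reference/prediction strings are made up.

```python
# Generic Rouge-L computation with the rouge-score package
# (pip install rouge-score); not the paper's evaluation code.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
reference = "knowledge distillation compresses a large teacher into a small student"
prediction = "distillation compresses a large teacher model into a smaller student"
score = scorer.score(reference, prediction)["rougeL"]
print(f"Rouge-L P={score.precision:.3f} R={score.recall:.3f} F1={score.fmeasure:.3f}")
```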

Implications and Future Work

SWITCH presents a significant advance in compressing LLMs while retaining their performance, especially on lengthy generative tasks. By mitigating teacher misguidance through selective intervention during sequence generation, the method not only improves model efficiency but also opens pathways for deploying smaller, faster models in contexts where resource constraints are critical.

Future research might explore the application of SWITCH in other types of models and tasks, as well as the potential integration with other model compression techniques. The robustness and flexibility of the method across various loss functions suggest promising extensions into novel training regimes for further efficiency and scalability improvements in AI systems.

In conclusion, the SWITCH approach contributes a valuable method to the field of model compression, enhancing the effective deployment of smaller LLMs and providing a framework that could influence future developments in AI model training and deployment strategies.