This paper surveys recent advancements in Transformer model compression, focusing on methods like pruning, quantization, knowledge distillation, and efficient architecture design.
It discusses various techniques within each method, such as post-training quantization (PTQ), quantization-aware training (QAT), logits-based and hint-based knowledge distillation, and architecture innovations to reduce computational complexity.
The paper highlights the intertwined relationship between different compression strategies and the importance of combining techniques for optimal compression without sacrificing performance.
It also identifies a promising future direction: exploring efficient architectures beyond the traditional Transformer, such as RWKV and RetNet, which improve computational efficiency while maintaining or enhancing performance.
Large Transformer-based models have achieved remarkable success across various domains like NLP and computer vision (CV), underpinned by their unique architecture comprising alternating attention and Feedforward Neural Network (FFN) modules. However, their extensive parameter count and computational demands necessitate efficient compression techniques to facilitate deployment on resource-constrained devices. This survey delves into recent advancements in Transformer model compression, examining methods including pruning, quantization, knowledge distillation, and efficient architecture design.
Quantization aims to reduce model size and accelerate inference by representing model parameters and activations with lower bit-widths. Techniques span post-training quantization (PTQ) and quantization-aware training (QAT), enabling models to retain performance despite reduced precision. Advanced strategies involve adaptive and learned quantization schemes for both weights and activations, acknowledging the complexity and varying sensitivity across Transformer layers.
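To make the PTQ idea concrete, the following is a minimal sketch of symmetric per-tensor int8 weight quantization. The function names and the single-scale scheme are illustrative assumptions, not a method from the survey; practical PTQ pipelines typically use per-channel scales and calibration data for activations.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor post-training quantization to int8.

    Maps float weights into [-127, 127] with a single scale factor;
    a minimal illustration of PTQ, not a production scheme.
    """
    scale = max(np.abs(w).max() / 127.0, 1e-12)  # guard against all-zero weights
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover a float approximation of the original weights.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
```

The round-trip error is bounded by half the scale, which is why larger weight ranges (or outliers) degrade low-bit quantization and motivate the adaptive, per-layer schemes mentioned above.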
Knowledge distillation (KD) involves training a compact "student" model to mimic the behavior of a larger "teacher" model. Recent approaches in Transformers include logits-based KD, where the student learns from the teacher’s output logits, and hint-based KD, utilizing intermediate representations. Certain studies also explore API-based and adversarial KD techniques, particularly useful for distilling knowledge from LLMs available through APIs.
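A logits-based KD objective can be sketched as the KL divergence between temperature-softened teacher and student distributions (the classic Hinton-style loss; the function names and temperature value here are illustrative):

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def kd_loss(student_logits, teacher_logits, T=2.0):
    """Logits-based distillation loss: KL(teacher || student) on
    temperature-softened distributions, scaled by T**2 so gradients
    stay comparable across temperatures. A minimal sketch."""
    p_t = softmax(teacher_logits / T)
    p_s = softmax(student_logits / T)
    kl = np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1)
    return (T ** 2) * kl.mean()
```

In practice this term is combined with the ordinary cross-entropy on ground-truth labels; hint-based KD adds analogous losses on intermediate hidden states.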
Pruning techniques refine the model structure by removing redundant parameters or neural network components, such as attention heads or FFN layers. Methods vary in pruning granularity, importance criteria, and learning-based strategies; some focus on context and token pruning to reduce the cost of long input sequences in Transformers.
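As one concrete instance of structured pruning, the sketch below drops whole attention heads by a magnitude (L2-norm) importance score. The magnitude criterion and the tensor layout are illustrative assumptions; many published methods instead use gradient- or loss-sensitivity scores.

```python
import numpy as np

def prune_heads(head_weights, keep_ratio=0.5):
    """Magnitude-based structured pruning of attention heads.

    head_weights: array of shape (num_heads, head_dim, d_model);
    heads with the smallest L2 norm are dropped. The L2-norm score
    is a simplified importance criterion for illustration only.
    """
    num_heads = head_weights.shape[0]
    scores = np.linalg.norm(head_weights.reshape(num_heads, -1), axis=1)
    k = max(1, int(round(num_heads * keep_ratio)))
    keep = np.sort(np.argsort(scores)[-k:])  # indices of the k most important heads
    return head_weights[keep], keep
```

Because entire heads are removed, the pruned model stays dense and needs no sparse kernels, unlike unstructured weight pruning.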
Innovations in Transformer architecture aim to mitigate the quadratic computational complexity of attention. Enhanced local information processing, streamlined attention, and alternative modules that replace attention entirely are all under exploration. Hierarchical and pyramid structures, attention simplification, and borrowing from Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) offer pathways to more computationally efficient Transformers.
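As an example of attention simplification, linearized attention replaces the softmax with a positive feature map phi, turning the O(N^2 d) computation into O(N d^2). The sketch below uses phi(x) = elu(x) + 1 in the style of linear Transformers; it is a single-head, unmasked illustration, not a drop-in replacement for any specific model.

```python
import numpy as np

def linear_attention(Q, K, V):
    """Linearized attention via a positive feature map phi(x) = elu(x) + 1.

    Computes phi(Q) @ (phi(K)^T V) with per-query normalization, so the
    key/value summary is built once in O(N d^2) rather than forming the
    full N x N attention matrix. Q, K: (N, d); V: (N, d_v)."""
    def phi(x):
        return np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1, strictly positive
    Qp, Kp = phi(Q), phi(K)
    kv = Kp.T @ V                 # (d, d_v) summary of keys and values
    z = Qp @ Kp.sum(axis=0)       # per-query normalizer (N,)
    return (Qp @ kv) / z[:, None]
```

Since phi is positive, each output row is a convex combination of the value rows, mirroring the averaging behavior of softmax attention at linear cost in sequence length.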
The analysis underscores a nuanced relationship between different compression methods, highlighting potential synergies in adopting a combination of techniques for extreme compression. Addressing the training cost associated with these methods, especially for large models, emerges as a pivotal concern, prompting the need for training-efficient strategies.
Beyond refining existing Transformer models, there is growing interest in efficient architectures that depart from the traditional Transformer design. Architectures such as RWKV and RetNet reduce computational complexity while maintaining or even enhancing model performance, paving the way for future innovations.
This survey encapsulates a wealth of strategies aimed at compressing Transformer models, offering insights into the methodological intricacies, practical implications, and prospective frontiers in model compression. As Transformers continue to dominate the landscape of machine learning models across numerous applications, efforts to enhance their computational efficiency remain crucial. The exploration of efficient architectures beyond traditional designs holds promise, potentially ushering in a new era of resource-aware, scalable machine learning models.