A Survey on Transformer Compression (2402.05964v2)

Published 5 Feb 2024 in cs.LG, cs.CL, and cs.CV

Abstract: Transformers play a vital role in NLP and computer vision (CV), especially for constructing large language models (LLMs) and large vision models (LVMs). Model compression methods reduce the memory and computational cost of Transformers, which is a necessary step for deploying large language/vision models on practical devices. Given the unique architecture of the Transformer, featuring alternating attention and feedforward neural network (FFN) modules, specific compression techniques are usually required. The efficiency of these compression methods is also paramount, as retraining large models on the entire training dataset is usually impractical. This survey provides a comprehensive review of recent compression methods, with a specific focus on their application to Transformer-based models. The compression methods are primarily categorized into pruning, quantization, knowledge distillation, and efficient architecture design (Mamba, RetNet, RWKV, etc.). In each category, we discuss compression methods for both language and vision tasks, highlighting common underlying principles. Finally, we delve into the relation between various compression methods, and discuss further directions in this domain.

A Comprehensive Survey on Transformer Model Compression Techniques

Introduction to Transformer Model Compression

Large Transformer-based models have achieved remarkable success across various domains like NLP and computer vision (CV), underpinned by their unique architecture comprising alternating attention and Feedforward Neural Network (FFN) modules. However, their extensive parameter count and computational demands necessitate efficient compression techniques to facilitate deployment on resource-constrained devices. This survey explores recent advancements in Transformer model compression, examining methods including pruning, quantization, knowledge distillation, and efficient architecture design.

Compression Techniques Explored

Quantization

Quantization aims to reduce model size and accelerate inference by representing model parameters and activations with lower bit-widths. Techniques span post-training quantization (PTQ) and quantization-aware training (QAT), enabling models to retain performance despite reduced precision. Advanced strategies involve adaptive and learned quantization schemes for both weights and activations, acknowledging the complexity and varying sensitivity across Transformer layers.
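
As a concrete illustration of the basic idea (a minimal sketch in PyTorch, not a specific method from the survey), the snippet below applies per-tensor symmetric int8 quantization to a weight matrix; practical PTQ methods additionally use calibration data, per-channel scales, and activation quantization.

```python
# Minimal sketch: per-tensor symmetric int8 post-training quantization.
# Illustrative only; function names are ours, not from the survey.
import torch

def quantize_int8(w: torch.Tensor):
    """Map a float weight tensor to int8 values plus one float scale."""
    scale = w.abs().max() / 127.0                      # symmetric range [-127, 127]
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    """Recover an approximate float tensor for computation."""
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)                            # e.g., one FFN weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print((w - w_hat).abs().max().item())                  # rounding error is at most scale / 2
```

Storing `q` in int8 with a single float scale cuts weight memory roughly 4x relative to float32, which is the baseline trade-off that more advanced quantization schemes then refine.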

Knowledge Distillation

Knowledge distillation (KD) involves training a compact "student" model to mimic the behavior of a larger "teacher" model. Recent approaches for Transformers include logits-based KD, where the student learns from the teacher's output logits, and hint-based KD, which uses intermediate representations. Some studies also explore adversarial and API-based KD, which is particularly useful when the teacher is an LLM accessible only through an API.
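
The logits-based variant can be written down compactly. The sketch below (PyTorch assumed; the temperature and the soft/hard weighting `alpha` are illustrative hyperparameters) blends a softened KL term against the teacher with the usual cross-entropy on ground-truth labels.

```python
# Minimal sketch of logits-based knowledge distillation (soft-target loss).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Blend a soft-target KL term (teacher) with hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2                               # rescale gradients for the softened targets
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1.0 - alpha) * hard

# Toy usage with random logits over 10 classes.
student = torch.randn(8, 10, requires_grad=True)
teacher = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
distillation_loss(student, teacher, labels).backward()
```

Hint-based KD adds analogous terms that match selected intermediate representations of student and teacher, typically through a learned projection when their hidden sizes differ.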

Pruning

Pruning refines the model by removing redundant parameters or structural components, such as attention heads or FFN layers. Methods differ in pruning granularity, pruning criteria, and whether the sparsity pattern is learned; some focus on context and token pruning, shortening the effective input sequence to reduce the computational cost of attention.
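
As a minimal illustration of the simplest criterion, the sketch below (PyTorch assumed; names are ours) applies unstructured magnitude pruning to the Linear layers of a model; structured variants would instead remove whole attention heads or FFN channels so the savings are realized without sparse kernels.

```python
# Minimal sketch: unstructured magnitude pruning of Linear layers.
import torch
import torch.nn as nn

@torch.no_grad()
def magnitude_prune(model: nn.Module, sparsity: float = 0.5):
    """Zero out the smallest-magnitude weights of every Linear layer."""
    for module in model.modules():
        if isinstance(module, nn.Linear):
            w = module.weight
            k = int(w.numel() * sparsity)
            if k == 0:
                continue
            threshold = w.abs().flatten().kthvalue(k).values
            module.weight.mul_((w.abs() > threshold).to(w.dtype))

# Toy usage on a small FFN block.
ffn = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
magnitude_prune(ffn, sparsity=0.5)
```

After pruning, a short fine-tuning pass is usually needed to recover accuracy, which is where the survey's concern with training efficiency becomes relevant.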

Efficient Architecture Design

Innovations in Transformer architecture aim to mitigate the quadratic computational complexity of the attention mechanism. Directions under exploration include enhanced local information processing, hierarchical and pyramid structures, simplified attention, and alternative modules that replace attention entirely, often drawing on Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) to design more computationally efficient Transformers.
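
As one illustrative example of attention simplification (a kernel-based linear attention in the spirit of approaches the survey covers, not a particular published model), the sketch below shows how replacing the softmax with a positive feature map lets the computation be regrouped so cost grows linearly, rather than quadratically, with sequence length.

```python
# Minimal sketch of kernel-based linear attention (single head, no masking).
# Because softmax(Q K^T) V is replaced by phi(Q) (phi(K)^T V), the key/value
# summary is a (dim x dim) matrix whose size is independent of sequence length.
import torch

def linear_attention(q, k, v, eps: float = 1e-6):
    """q, k, v: (batch, seq_len, dim) -> (batch, seq_len, dim)."""
    phi_q = torch.relu(q) + 1e-3                       # simple positive feature map (assumption)
    phi_k = torch.relu(k) + 1e-3
    kv = torch.einsum("bnd,bne->bde", phi_k, v)        # O(N * d^2), not O(N^2 * d)
    normalizer = torch.einsum("bnd,bd->bn", phi_q, phi_k.sum(dim=1))
    out = torch.einsum("bnd,bde->bne", phi_q, kv)
    return out / (normalizer.unsqueeze(-1) + eps)

q = k = v = torch.randn(2, 1024, 64)
print(linear_attention(q, k, v).shape)                 # torch.Size([2, 1024, 64])
```

RNN-style architectures such as RWKV and RetNet push the same idea further by maintaining a recurrent state, so per-token inference cost stays constant as the sequence grows.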

Implications and Future Directions

The analysis underscores a nuanced relationship between different compression methods, highlighting potential synergies in adopting a combination of techniques for extreme compression. Addressing the training cost associated with these methods, especially for large models, emerges as a pivotal concern, prompting the need for training-efficient strategies.

Beyond refining existing Transformer models, there is growing interest in efficient architectures that depart from the traditional Transformer design. Architectures such as RWKV and RetNet, which reduce computational complexity while maintaining or even improving model performance, pave the way for future innovations.

Conclusion

This survey encapsulates a wealth of strategies aimed at compressing Transformer models, offering insights into the methodological intricacies, practical implications, and prospective frontiers in model compression. As Transformers continue to dominate the landscape of machine learning models across numerous applications, efforts to enhance their computational efficiency remain crucial. The exploration of efficient architectures beyond traditional designs holds promise, potentially ushering in a new era of resource-aware, scalable machine learning models.

Authors (7)
  1. Yehui Tang (63 papers)
  2. Yunhe Wang (145 papers)
  3. Jianyuan Guo (40 papers)
  4. Zhijun Tu (32 papers)
  5. Kai Han (184 papers)
  6. Hailin Hu (16 papers)
  7. Dacheng Tao (826 papers)
Citations (16)

HackerNews

  1. A Survey on Transformer Compression (1 point, 0 comments)

Reddit

  1. A Survey on Transformer Compression (11 points, 1 comment)