Emergent Mind

A Survey on Transformer Compression

Published Feb 5, 2024 in cs.LG, cs.CL, cs.CV and


Large models based on the Transformer architecture play increasingly vital roles in artificial intelligence, particularly in natural language processing (NLP) and computer vision (CV). Model compression methods reduce their memory and computational cost, a necessary step for deploying Transformer models on practical devices. Because the Transformer architecture alternates attention and Feedforward Neural Network (FFN) modules, it calls for dedicated compression techniques. The efficiency of these compression methods is also paramount, as it is usually impractical to retrain large models on the entire training dataset. This survey provides a comprehensive review of recent compression methods, with a specific focus on their application to Transformer models. The compression methods are primarily categorized into pruning, quantization, knowledge distillation, and efficient architecture design. In each category, we discuss compression methods for both CV and NLP tasks, highlighting common underlying principles. Finally, we examine the relations among the various compression methods and discuss future directions in this domain.


  • This paper surveys recent advancements in Transformer model compression, focusing on methods like pruning, quantization, knowledge distillation, and efficient architecture design.

  • It discusses various techniques within each method, such as post-training quantization (PTQ), quantization-aware training (QAT), logits-based and hint-based knowledge distillation, and architecture innovations to reduce computational complexity.

  • The paper highlights the intertwined relationship between different compression strategies and the importance of combining techniques for optimal compression without sacrificing performance.

  • It also points out the future direction of exploring efficient architectures beyond traditional Transformers, such as RWKV and RetNet, to improve computational efficiency while maintaining or enhancing performance.

Introduction to Transformer Model Compression

Large Transformer-based models have achieved remarkable success across various domains like NLP and computer vision (CV), underpinned by their unique architecture comprising alternating attention and Feedforward Neural Network (FFN) modules. However, their extensive parameter count and computational demands necessitate efficient compression techniques to facilitate deployment on resource-constrained devices. This survey delves into recent advancements in Transformer model compression, examining methods including pruning, quantization, knowledge distillation, and efficient architecture design.

Compression Techniques Explored


Quantization

Quantization aims to reduce model size and accelerate inference by representing model parameters and activations with lower bit-widths. Techniques span post-training quantization (PTQ), which quantizes an already trained model, and quantization-aware training (QAT), which simulates reduced precision during training; both aim to preserve accuracy despite the lower precision. Advanced strategies use adaptive and learned quantization schemes for both weights and activations, reflecting the fact that sensitivity to quantization varies across Transformer layers.
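The core idea of low bit-width representation can be illustrated with a minimal sketch of symmetric, per-tensor post-training quantization. This is only one simple scheme, not a method prescribed by the survey; the function names `quantize_symmetric` and `dequantize` and the sample weights are hypothetical, chosen for illustration.

```python
def quantize_symmetric(values, num_bits=8):
    """Map floats to signed integers using one shared scale (symmetric, per-tensor)."""
    qmax = 2 ** (num_bits - 1) - 1                 # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from integer levels."""
    return [qi * scale for qi in q]


# Illustrative weights (assumed values, not taken from any real model).
weights = [0.42, -1.27, 0.05, 0.88, -0.31]
q, scale = quantize_symmetric(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is stored as an 8-bit integer plus one shared floating-point scale, and the rounding error per weight is at most half a quantization step. Per-channel scales or learned step sizes, as used by more advanced schemes, would refine this same pattern.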

Knowledge Distillation

Knowledge distillation (KD) involves training a compact "student" model to mimic the behavior of a larger "teacher" model. Recent approaches in Transformers include logits-based KD, where the student learns from the teacher’s output logits, and hint-based KD, utilizing intermediate representations. Certain studies also explore API-based and adversarial KD techniques, particularly useful for distilling knowledge from LLMs available through APIs.
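Logits-based KD can be sketched as a loss that compares temperature-softened teacher and student distributions. The formulation below, a KL divergence scaled by the squared temperature, is a commonly used one and is assumed here for illustration; the survey covers many variants, and the names `softmax`, `kd_loss`, and the sample logits are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)   # soft targets from the teacher
    q = softmax(student_logits, temperature)   # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


# Illustrative logits for a three-class problem (assumed values).
teacher = [4.0, 1.0, 0.5]
student = [2.5, 1.5, 0.2]
loss = kd_loss(student, teacher)
```

The loss is zero when the student reproduces the teacher's distribution and positive otherwise. In practice it is usually combined with the ordinary task loss on ground-truth labels; hint-based KD would add a further term on intermediate representations.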


Pruning

Pruning techniques slim the model by removing redundant parameters or network components, such as attention heads or FFN layers. Methods differ in pruning granularity, in the criteria used to decide what to remove, and in whether the pruning decisions are learned. Some methods instead prune context or tokens, shortening the input sequence and thereby reducing the computational cost of Transformers.
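As a minimal sketch of one pruning criterion at the finest granularity, the example below zeroes out the weights with the smallest magnitude. Magnitude is only one of several possible criteria and is assumed here for simplicity; the function name `magnitude_prune` and the sample weights are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    num_prune = int(len(weights) * sparsity)
    # Rank weight indices from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_idx = set(order[:num_prune])
    return [0.0 if i in pruned_idx else w for i, w in enumerate(weights)]


# Illustrative weights (assumed values).
weights = [0.9, -0.05, 0.3, -0.7, 0.01, 0.2]
pruned = magnitude_prune(weights, sparsity=0.5)
```

The same ranking idea extends to coarser granularities: scoring whole attention heads or FFN blocks and dropping the lowest-scoring ones gives structured pruning, which tends to translate into speedups more directly on standard hardware.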

Efficient Architecture Design

Innovations in Transformer architecture aim to mitigate the quadratic computational complexity of attention mechanisms. Enhanced local information processing, streamlined attention mechanisms, and alternative modules to replace attention mechanisms are under exploration. Hierarchical and pyramid structures, attention simplification, and leveraging architectures like Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) offer pathways to design more computationally efficient Transformers.
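One way to see how attention simplification avoids the quadratic cost is kernelized (linear) attention, sketched below. It replaces the softmax with a positive feature map so that the matrix product can be reordered, which means it is a different operator from standard softmax attention. The feature map `elu(x) + 1`, the helper names, and the sample inputs are assumptions for illustration, not the survey's formulation.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def feature_map(x):
    """elu(x) + 1, which is positive everywhere (assumed choice of kernel)."""
    return [[v + 1.0 if v > 0 else math.exp(v) for v in row] for row in x]


def quadratic_attention(q, k, v):
    """Builds the full n x n score matrix: cost grows as n^2 * d."""
    fq, fk = feature_map(q), feature_map(k)
    scores = matmul(fq, transpose(fk))            # n x n
    out = matmul(scores, v)                       # n x d
    return [[o / sum(s) for o in row] for row, s in zip(out, scores)]


def linear_attention(q, k, v):
    """Reorders the product as fq @ (fk^T @ v): cost grows as n * d^2."""
    fq, fk = feature_map(q), feature_map(k)
    kv = matmul(transpose(fk), v)                 # d x d summary of keys and values
    k_sum = [sum(col) for col in zip(*fk)]        # length-d normalizer
    out = matmul(fq, kv)                          # n x d
    norms = [sum(a * b for a, b in zip(row, k_sum)) for row in fq]
    return [[o / z for o in row] for row, z in zip(out, norms)]


# Illustrative inputs: sequence length n = 3, head dimension d = 2 (assumed values).
q = [[0.1, -0.2], [0.5, 0.3], [-0.4, 0.8]]
k = [[0.3, 0.1], [-0.6, 0.2], [0.7, -0.5]]
v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
full = quadratic_attention(q, k, v)
fast = linear_attention(q, k, v)
```

Both functions compute the same result, but the second never materializes the n x n matrix, so its cost is linear in sequence length. This sketch is non-causal; recurrent-style architectures build on a related idea by maintaining a running summary of past keys and values.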

Implications and Future Directions

The analysis underscores a nuanced relationship between different compression methods, highlighting potential synergies in adopting a combination of techniques for extreme compression. Addressing the training cost associated with these methods, especially for large models, emerges as a pivotal concern, prompting the need for training-efficient strategies.

Besides refining existing Transformer models, there is growing interest in efficient architectures beyond the traditional Transformer design. Architectures such as RWKV and RetNet, which reduce computational complexity while maintaining or even enhancing model performance, pave the way for future innovations.


Conclusion

This survey encapsulates a wealth of strategies aimed at compressing Transformer models, offering insights into the methodological intricacies, practical implications, and prospective frontiers in model compression. As Transformers continue to dominate the landscape of machine learning models across numerous applications, efforts to enhance their computational efficiency remain crucial. The exploration of efficient architectures beyond traditional designs holds promise, potentially ushering in a new era of resource-aware, scalable machine learning models.
