DiffRate: Differentiable Compression Rate for Efficient Vision Transformers
The paper "DiffRate: Differentiable Compression Rate for Efficient Vision Transformers" introduces a novel framework, DiffRate, designed to enhance the efficiency of Vision Transformers (ViTs) by advancing token compression techniques. The primary motivation is to address the computational burden ViTs pose due to their quadratic complexity with respect to input size, especially when applied to large-scale datasets.
Overview of DiffRate
Token compression reduces the number of tokens a transformer must process, which directly reduces the computation spent in its self-attention layers. The paper critiques prior approaches, which typically perform either token pruning (removing tokens deemed unimportant) or token merging (combining tokens with high semantic similarity) at pre-defined, static compression rates. The limitation of these approaches is their reliance on hand-crafted rates, which are often sub-optimal and cumbersome to tune on a per-layer basis.
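To make the baseline concrete, the following is a minimal PyTorch sketch of fixed-rate token pruning and bipartite (ToMe-style) token merging. The function names, the use of a generic importance score, and the specific ratios are illustrative assumptions, not the exact procedures of DiffRate or of any particular prior method.

```python
import torch
import torch.nn.functional as F

def prune_tokens(x, importance, keep_ratio):
    """Fixed-rate pruning sketch: keep the top-k tokens per image, ranked by
    an importance score (e.g. the CLS-token attention column in practice)."""
    B, N, C = x.shape
    k = max(1, int(N * keep_ratio))
    idx = importance.topk(k, dim=1).indices                      # (B, k)
    return x.gather(1, idx.unsqueeze(-1).expand(-1, -1, C))      # (B, k, C)

def merge_tokens(x, r):
    """Fixed-rate merging sketch (bipartite, ToMe-style): split tokens into two
    alternating sets A and B, then average the r most redundant A-tokens into
    their most cosine-similar B-token."""
    B, N, C = x.shape
    a, b = x[:, 0::2], x[:, 1::2]
    sim = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).transpose(1, 2)
    best_sim, best_dst = sim.max(dim=-1)                         # best B match per A-token
    src = best_sim.topk(r, dim=1).indices                        # A-tokens to merge away
    out = []
    for i in range(B):                                           # per-sample loop, for clarity
        bi = b[i].clone()
        cnt = torch.ones(bi.shape[0], 1, device=x.device)
        for s in src[i].tolist():
            d = best_dst[i, s].item()
            bi[d] += a[i, s]
            cnt[d] += 1
        keep = torch.ones(a.shape[1], dtype=torch.bool, device=x.device)
        keep[src[i]] = False
        out.append(torch.cat([a[i][keep], bi / cnt], dim=0))
    return torch.stack(out)                                      # (B, N - r, C)

# Self-attention cost grows quadratically with the token count, so every token
# removed in an early block also saves compute in all later blocks.
x = torch.randn(2, 197, 768)                              # e.g. ViT-B/16: 196 patches + CLS
x = prune_tokens(x, torch.rand(2, 197), keep_ratio=0.7)   # -> (2, 137, 768)
x = merge_tokens(x, r=16)                                 # -> (2, 121, 768)
```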
DiffRate instead makes the compression rate itself differentiable, allowing token reduction to be allocated dynamically and more optimally across the network's layers. Rather than being fixed a priori, the compression rates are learned during training through back-propagation.
Key Contributions
- Differentiable Compression Rate: DiffRate introduces the Differentiable Discrete Proxy (DDP) module, which lets token pruning and merging rates be learned as part of network training. The DDP module uses a re-parameterization trick that propagates gradients back to the compression rates, yielding layer-specific rates without adding computational overhead at inference (a sketch of this idea follows this list).
- Unified Compression Framework: Unlike previous methods that handle pruning and merging separately, DiffRate combines both operations into a single, cohesive framework. This lets the strengths of one operation compensate for the weaknesses of the other: informative tokens that pruning alone would discard can instead be preserved through merging, improving overall efficiency.
- Demonstrated Effectiveness: Comprehensive experiments show that DiffRate performs strongly across several benchmarks. Without any fine-tuning, it reduces FLOPs by up to 40% with a minimal accuracy drop (0.16% on ImageNet), outperforming many previous approaches that require fine-tuning.
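The following is a minimal PyTorch sketch of the re-parameterization idea behind a learnable discrete compression rate. It is not the paper's exact DDP module; the candidate rate set, the straight-through relaxation, and the compute penalty mentioned in the comments are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class LearnableKeepRate(nn.Module):
    """Sketch of a differentiable discrete keep-rate (not the exact DDP module).

    A learnable categorical distribution over candidate keep-rates is collapsed
    to a hard choice in the forward pass; a straight-through estimator lets the
    loss gradients flow back into the logits, so each layer's rate is optimized
    jointly with the rest of the network."""

    def __init__(self, candidates=(1.0, 0.9, 0.8, 0.7, 0.6, 0.5)):
        super().__init__()
        self.register_buffer("candidates", torch.tensor(candidates))
        self.logits = nn.Parameter(torch.zeros(len(candidates)))

    def forward(self):
        probs = self.logits.softmax(dim=0)
        hard = torch.zeros_like(probs)
        hard[probs.argmax()] = 1.0
        # Straight-through: forward value is the hard one-hot,
        # gradient is that of the soft probabilities.
        onehot = hard + probs - probs.detach()
        return (onehot * self.candidates).sum()

# One learnable rate per layer; gradients reach the logits through, e.g., a
# hypothetical compute penalty:  loss = task_loss + lam * keep_ratio
rate = LearnableKeepRate()
keep_ratio = rate()   # tensor(1.0) at initialization, differentiable w.r.t. rate.logits
```

In a DiffRate-style block, one such learnable rate for pruning and one for merging could drive operations like the `prune_tokens` and `merge_tokens` sketches above, with a compute-aware term in the training loss encouraging lower rates wherever accuracy permits.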
Implications and Future Work
Adopting DiffRate in practice promises more efficient ViTs without the cumbersome manual tuning of per-layer compression hyperparameters. Because the method optimizes token compression rates automatically for each setting, it could ease deployment of vision transformers on a broader range of devices and applications, particularly where computational resources are constrained.
Looking ahead, the paradigm established by DiffRate opens pathways for combining it with other efficiency-oriented methods, such as hardware-specific acceleration or neural architecture search. As ViTs are applied ever more broadly, such flexible and adaptive methods will become increasingly important.
In conclusion, DiffRate presents a significant advancement in the efficient processing of vision transformers, demonstrating how differentiable approaches to architecture optimization can lead to robust and scalable solutions in machine learning. This contribution holds promise for enhancing both the theoretical understanding and practical implementation of advanced neural networks.