Pruning Self-attentions into Convolutional Layers in Single Path
The paper addresses two prominent issues of Vision Transformers (ViTs): computational inefficiency and a lack of locality. It proposes a pruning methodology named Single-Path Vision Transformer pruning (SPViT), which compresses pre-trained ViTs by integrating convolutional operations, thereby introducing the locality that multi-head self-attention (MSA) layers inherently lack.
Key Contributions
The primary contribution is a weight-sharing scheme between MSA and convolutional operations. The scheme expresses convolutional operations with a subset of the MSA parameters, so an MSA layer can be transformed into a convolution within the same computational graph. Building on this, the architecture search is cast as a subset-selection problem inside the MSA layers, which keeps the search within a single path and substantially reduces the computational burden and optimization complexity.
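To make the idea concrete, here is a minimal PyTorch sketch of how a convolution could be expressed with a subset of MSA parameters: each attention head is tied to one spatial offset of a k x k kernel, and the per-offset kernel weights are recovered from that head's value and output projections. The class name, the head-to-offset assignment, and the requirement that the number of heads equals k * k are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedMSAConv(nn.Module):
    """Hypothetical sketch: derive a k x k convolution from standard MSA
    parameters by assigning each head to one kernel position."""
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.dim, self.k = dim, kernel_size
        self.heads = kernel_size * kernel_size      # one head per kernel offset
        assert dim % self.heads == 0, "dim must be divisible by k*k in this sketch"
        self.head_dim = dim // self.heads
        # Ordinary MSA parameters (fused qkv plus output projection).
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def conv_weight(self):
        # Value projection split per head: (heads, head_dim, dim).
        w_v = self.qkv.weight[2 * self.dim:].view(self.heads, self.head_dim, self.dim)
        # Output projection split per head: (heads, dim, head_dim).
        w_o = self.proj.weight.view(self.dim, self.heads, self.head_dim).permute(1, 0, 2)
        # Compose them into one linear map per head: (heads, dim, dim).
        w = torch.einsum('hod,hdc->hoc', w_o, w_v)
        # Lay the heads out as the k x k spatial positions of a conv kernel.
        return w.permute(1, 2, 0).reshape(self.dim, self.dim, self.k, self.k)

    def forward_as_conv(self, x):  # x: (B, dim, H, W) feature map
        return F.conv2d(x, self.conv_weight(), self.proj.bias, padding=self.k // 2)
```

Because the kernel is materialized from the attention layer's own projections, switching a layer from MSA to convolution reuses existing parameters rather than introducing new ones.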
Through this single-path weight-sharing scheme, SPViT introduces learnable binary gates that, during pruning, select at each layer whether to keep the MSA operation or replace it with a convolution. In parallel, learnable gates over the MLP expansion ratios of the feed-forward networks (FFNs) prune hidden dimensions, reducing computational overhead while preserving performance.
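A minimal sketch of the gating mechanism follows, assuming straight-through binary gates and simple per-channel gates on the FFN hidden units as a stand-in for the paper's expansion-ratio parameterization; GatedBlock, global_op, and local_op are hypothetical names, with the two operations passed in as interchangeable token-to-token modules.

```python
import torch
import torch.nn as nn

def binarize(logits):
    """Straight-through binary gate: hard 0/1 in the forward pass,
    sigmoid gradient in the backward pass."""
    soft = torch.sigmoid(logits)
    hard = (soft > 0.5).float()
    return hard + soft - soft.detach()

class GatedBlock(nn.Module):
    """Illustrative sketch: one gate picks the global (MSA) path or the
    local (conv) path, and per-channel gates prune FFN hidden units."""
    def __init__(self, global_op, local_op, dim, ffn_hidden):
        super().__init__()
        self.global_op, self.local_op = global_op, local_op
        self.fc1 = nn.Linear(dim, ffn_hidden)
        self.fc2 = nn.Linear(ffn_hidden, dim)
        self.op_gate = nn.Parameter(torch.zeros(1))            # MSA vs. conv
        self.ffn_gate = nn.Parameter(torch.zeros(ffn_hidden))  # hidden units

    def forward(self, x):                                      # x: (B, N, dim)
        g = binarize(self.op_gate)
        x = x + g * self.global_op(x) + (1.0 - g) * self.local_op(x)
        h = torch.relu(self.fc1(x)) * binarize(self.ffn_gate)
        return x + self.fc2(h)
```

The straight-through estimator keeps the selection discrete in the forward pass while still letting gradients reach the gate parameters; in this simplified sketch both paths are evaluated, whereas a practical implementation could skip the gated-off branch.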
Methodology
The weight-sharing scheme allows SPViT to automatically determine efficient hybrid architectures that balance global operations (MSAs) with local operations (convolutions).
SPViT defines a search space covering both the MSA and FFN modifications. In the search phase, the learnable binary gate parameters are optimized to identify architecture configurations with reduced computational complexity. In the fine-tuning phase, knowledge distillation is used to recover any accuracy lost through pruning.
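The two objectives might look roughly like the sketch below; the FLOPs penalty, loss weights, and distillation temperature are illustrative assumptions rather than the paper's exact choices.

```python
import torch
import torch.nn.functional as F

def search_loss(logits, targets, gate_logits, flops_per_op, budget_weight=0.1):
    """Search phase (sketch): task loss plus a penalty on the expected
    computational cost implied by the gate probabilities."""
    ce = F.cross_entropy(logits, targets)
    keep_probs = torch.sigmoid(gate_logits)          # probability of keeping each costly op
    expected_cost = (keep_probs * flops_per_op).sum()
    return ce + budget_weight * expected_cost

def finetune_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    """Fine-tuning phase (sketch): hard-label cross-entropy combined with
    soft-label distillation from the uncompressed teacher model."""
    ce = F.cross_entropy(student_logits, targets)
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction='batchmean') * (T * T)
    return alpha * ce + (1.0 - alpha) * kd
```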
Experimental Results
Experiments on representative Vision Transformer models, such as DeiT and Swin, show that SPViT achieves state-of-the-art (SOTA) pruning performance on ImageNet-1k. Notably, SPViT compresses DeiT-B by 52.0% while improving top-1 accuracy by 0.6%. These results underscore the method's ability to balance reduced computational complexity against model performance, providing a robust ViT pruning framework that also introduces beneficial locality.
Implications and Future Directions
The implications of this research are substantial. By integrating convolutional operations into the MSA framework through pruning, SPViT not only reduces computation cost but also adds an inductive bias favorable to visual processing tasks, which matters for deployment in resource-constrained environments.
For future work, investigating finer granularity in the pruning process could further refine the balance between efficiency and performance. Extending the weight-sharing paradigm to a wider range of convolutional operations, or improving adaptability to different input resolutions, would also be worthwhile and could broaden the method's applicability. In sum, this paper provides a significant step forward in efficient model compression and architecture transformation for ViTs.