Pruning Self-attentions into Convolutional Layers in Single Path
The paper addresses two prominent issues of Vision Transformers (ViTs): computational inefficiency and a lack of locality. It proposes a pruning methodology named Single-Path Vision Transformer pruning (SPViT), which compresses pre-trained ViTs by integrating convolutional operations, thereby introducing the locality that multi-head self-attention (MSA) layers inherently lack.
Key Contributions
The primary contribution is a weight-sharing scheme between MSA and convolutional operations. The scheme expresses convolutional operations with a subset of the MSA parameters, so an MSA layer can be transformed into a convolution within the same computational graph. Building on this, the architecture search is cast as a subset-selection problem inside the MSA layers, which keeps the search within a single path and substantially reduces the computational burden and optimization complexity.
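To make the idea concrete, here is a minimal PyTorch sketch of how a convolution could be expressed with a subset of MSA parameters: each attention head is tied to one spatial offset of a k x k kernel, and the per-offset kernel weights are recovered from that head's value and output projections. The class name, the head-to-offset assignment, and the requirement that the number of heads equals k * k are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedMSAConv(nn.Module):
    """Hypothetical sketch: derive a k x k convolution from standard MSA
    parameters by assigning each head to one kernel position."""
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.dim, self.k = dim, kernel_size
        self.heads = kernel_size * kernel_size      # one head per kernel offset
        assert dim % self.heads == 0, "dim must be divisible by k*k in this sketch"
        self.head_dim = dim // self.heads
        # Ordinary MSA parameters (fused qkv plus output projection).
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def conv_weight(self):
        # Value projection split per head: (heads, head_dim, dim).
        w_v = self.qkv.weight[2 * self.dim:].view(self.heads, self.head_dim, self.dim)
        # Output projection split per head: (heads, dim, head_dim).
        w_o = self.proj.weight.view(self.dim, self.heads, self.head_dim).permute(1, 0, 2)
        # Compose them into one linear map per head: (heads, dim, dim).
        w = torch.einsum('hod,hdc->hoc', w_o, w_v)
        # Lay the heads out as the k x k spatial positions of a conv kernel.
        return w.permute(1, 2, 0).reshape(self.dim, self.dim, self.k, self.k)

    def forward_as_conv(self, x):  # x: (B, dim, H, W) feature map
        return F.conv2d(x, self.conv_weight(), self.proj.bias, padding=self.k // 2)
```

Because the kernel is materialized from the attention layer's own projections, switching a layer from MSA to convolution reuses existing parameters rather than introducing new ones.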
Through this single-path weight-sharing scheme, SPViT introduces learnable binary gates that, during pruning, select at each layer whether to keep the MSA operation or replace it with a convolution. In parallel, learnable gates over the MLP expansion ratios of the feed-forward networks (FFNs) prune hidden dimensions, reducing computational overhead while preserving performance.
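A minimal sketch of the gating mechanism follows, assuming straight-through binary gates and simple per-channel gates on the FFN hidden units as a stand-in for the paper's expansion-ratio parameterization; GatedBlock, global_op, and local_op are hypothetical names, with the two operations passed in as interchangeable token-to-token modules.

```python
import torch
import torch.nn as nn

def binarize(logits):
    """Straight-through binary gate: hard 0/1 in the forward pass,
    sigmoid gradient in the backward pass."""
    soft = torch.sigmoid(logits)
    hard = (soft > 0.5).float()
    return hard + soft - soft.detach()

class GatedBlock(nn.Module):
    """Illustrative sketch: one gate picks the global (MSA) path or the
    local (conv) path, and per-channel gates prune FFN hidden units."""
    def __init__(self, global_op, local_op, dim, ffn_hidden):
        super().__init__()
        self.global_op, self.local_op = global_op, local_op
        self.fc1 = nn.Linear(dim, ffn_hidden)
        self.fc2 = nn.Linear(ffn_hidden, dim)
        self.op_gate = nn.Parameter(torch.zeros(1))            # MSA vs. conv
        self.ffn_gate = nn.Parameter(torch.zeros(ffn_hidden))  # hidden units

    def forward(self, x):                                      # x: (B, N, dim)
        g = binarize(self.op_gate)
        x = x + g * self.global_op(x) + (1.0 - g) * self.local_op(x)
        h = torch.relu(self.fc1(x)) * binarize(self.ffn_gate)
        return x + self.fc2(h)
```

The straight-through estimator keeps the selection discrete in the forward pass while still letting gradients reach the gate parameters; in this simplified sketch both paths are evaluated, whereas a practical implementation could skip the gated-off branch.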
Methodology
The weight-sharing scheme allows SPViT to automatically determine efficient hybrid architectures that balance global operations (MSAs) with local operations (convolutions).
SPViT defines a search space covering both the MSA and FFN modifications. In the search phase, the learnable binary gate parameters are optimized to identify architecture configurations with reduced computational complexity. In the fine-tuning phase, knowledge distillation is used to recover any accuracy lost through pruning.
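The two objectives might look roughly like the sketch below; the FLOPs penalty, loss weights, and distillation temperature are illustrative assumptions rather than the paper's exact choices.

```python
import torch
import torch.nn.functional as F

def search_loss(logits, targets, gate_logits, flops_per_op, budget_weight=0.1):
    """Search phase (sketch): task loss plus a penalty on the expected
    computational cost implied by the gate probabilities."""
    ce = F.cross_entropy(logits, targets)
    keep_probs = torch.sigmoid(gate_logits)          # probability of keeping each costly op
    expected_cost = (keep_probs * flops_per_op).sum()
    return ce + budget_weight * expected_cost

def finetune_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    """Fine-tuning phase (sketch): hard-label cross-entropy combined with
    soft-label distillation from the uncompressed teacher model."""
    ce = F.cross_entropy(student_logits, targets)
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction='batchmean') * (T * T)
    return alpha * ce + (1.0 - alpha) * kd
```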
Experimental Results
Experiments on representative Vision Transformer models, such as DeiT and Swin, show that SPViT achieves state-of-the-art (SOTA) pruning performance on ImageNet-1k. Notably, SPViT compresses DeiT-B by 52.0% while improving top-1 accuracy by 0.6%. These results underscore the method's ability to balance reduced computational complexity against model performance, providing a robust ViT pruning framework that also introduces beneficial locality.
Implications and Future Directions
The implications of this research are substantial. By integrating convolutional operations into the MSA framework through pruning, SPViT not only reduces computation cost but also adds an inductive bias favorable to visual processing tasks, which matters for deployment in resource-constrained environments.
For future work, investigating finer granularity in the pruning process could further refine the balance between efficiency and performance. Extending the weight-sharing paradigm to a wider range of convolutional operations, or improving adaptability to different input resolutions, would also be worthwhile and could broaden the method's applicability. In sum, this paper provides a significant step forward in efficient model compression and architecture transformation for ViTs.