- The paper introduces ViT-Slim, a novel continuous optimization framework that efficiently identifies optimal sub-networks in vision transformers.
- It leverages ℓ1 sparsity constraints and learnable masks to reduce parameters and FLOPs by up to 40% while improving accuracy by approximately 0.6%.
- A single-shot training scheme completes the search in roughly 43 GPU hours on DeiT-S, and the discovered sub-networks transfer well across various downstream datasets.
An Insightful Overview of "Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space"
The paper "Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space" introduces a novel approach to optimize vision transformer models through an architecture search framework termed ViT-Slim. The framework aims to compress these models by finding optimal sub-models, thus addressing the common challenge of large model sizes and high computational costs associated with vision transformers. The central contribution lies in efficiently searching for structured sub-networks within a vision transformer by leveraging a novel continuous, multidimensional optimization strategy.
Methodology and Techniques
ViT-Slim uses ℓ1 sparsity constraints to rank importance across different dimensions of a vision transformer: input tokens, Multi-Head Self-Attention (MHSA) modules, and Multi-Layer Perceptron (MLP) modules. Unlike traditional approaches that rely on discrete search spaces with pre-defined candidate blocks, ViT-Slim operates in a continuous search space. It defines a supernet and introduces learnable masks that implicitly rank dimensions by their contribution to model performance, from which efficient sub-models can be derived, as sketched below.
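To make the mask mechanism concrete, here is a minimal PyTorch-style sketch of an MLP block whose hidden dimension carries a learnable soft mask, together with the ℓ1 penalty on all masks. This is an illustrative reconstruction rather than the authors' implementation; the names `MaskedMLP`, `sparsity_loss`, and the penalty weight `lam` are assumptions.

```python
import torch
import torch.nn as nn

class MaskedMLP(nn.Module):
    """Transformer MLP block with a learnable soft mask over its hidden units.

    Each mask entry acts as a continuous importance score; an l1 penalty
    drives unimportant entries toward zero so the corresponding hidden
    units can be pruned after the search.
    """
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, dim)
        # One learnable scalar per hidden unit, initialized to 1 (fully active).
        self.mask = nn.Parameter(torch.ones(hidden_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale every hidden unit by its mask value before the output projection.
        return self.fc2(self.mask * self.act(self.fc1(x)))

def sparsity_loss(model: nn.Module, lam: float = 1e-4) -> torch.Tensor:
    """l1 penalty over all mask parameters, to be added to the task loss."""
    return lam * sum(m.mask.abs().sum()
                     for m in model.modules() if isinstance(m, MaskedMLP))
```

The same construction applies to the other searched dimensions, such as MHSA and input tokens, by changing which tensor the mask multiplies.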
The framework uses a single-shot training scheme, which significantly reduces computational overhead compared to conventional methods such as reinforcement learning and evolutionary search: searching for an optimized structure in DeiT-S takes only about 43 GPU hours.
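In this single-shot setting, the search amounts to one ordinary training run whose loss adds the mask penalty to the task objective, followed by a one-off, budget-driven selection of the highest-scoring mask entries. The sketch below builds on the hypothetical `MaskedMLP`/`sparsity_loss` code above; the `keep_ratio` budget and helper names are our own, not from the paper.

```python
def search_step(model, batch, optimizer, criterion, lam=1e-4):
    """One supernet training step: task loss plus l1 mask penalty,
    optimized jointly with the network weights in a single run."""
    images, targets = batch
    loss = criterion(model(images), targets) + sparsity_loss(model, lam)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def select_subnetwork(model, keep_ratio=0.6):
    """Derive a sub-network under a budget by globally ranking mask
    magnitudes and zeroing the lowest ones; the zeroed hidden units
    can then be physically removed before fine-tuning."""
    scores = torch.cat([m.mask.abs().flatten()
                        for m in model.modules() if isinstance(m, MaskedMLP)])
    threshold = torch.quantile(scores, 1.0 - keep_ratio)
    for m in model.modules():
        if isinstance(m, MaskedMLP):
            m.mask.data[m.mask.abs() < threshold] = 0.0
```

After selection, the pruned model would typically be fine-tuned; because mask magnitudes are ranked globally, a single trained supernet can serve different budget constraints.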
Empirical Results
The empirical evaluation of ViT-Slim on the ImageNet-1K dataset demonstrates substantial compression: parameters and FLOPs are reduced by up to 40% while accuracy increases by approximately 0.6%. The framework also transfers well to several downstream datasets, showcasing its robustness and practicality in diverse settings.
Table comparisons in the paper highlight ViT-Slim's superior performance over existing methods such as GLiT, DynamicViT, and AutoFormer in both computational efficiency and compression ability. Notably, ViT-Slim outperforms these methods at various budget constraints while requiring significantly fewer search resources.
Implications and Future Directions
ViT-Slim carries both practical and theoretical implications for the field of AI. Practically, it offers a scalable, efficient way to deploy high-performing vision transformers on resource-constrained devices, broadening their applicability in real-world scenarios. Theoretically, it challenges existing norms in architecture search by emphasizing continuous optimization and differentiable search, potentially paving the way for research that extends these principles beyond vision transformers to other domains.
Looking ahead, similar continuous search frameworks could plausibly be adapted to other families of deep neural networks, including convolutional and recurrent architectures. Reducing computational cost while maintaining or improving performance will likely remain a pivotal area of investigation.
In conclusion, the paper presents a comprehensive framework that optimizes vision transformers by identifying highly efficient sub-structures within them. ViT-Slim's use of a continuous search space for multi-dimensional optimization sets a promising precedent for efficient model design and deployment, and lays a foundation for further advances in AI model optimization.