Chasing Sparsity in Vision Transformers: An End-to-End Exploration (2106.04533v3)

Published 8 Jun 2021 in cs.CV and cs.AI

Abstract: Vision transformers (ViTs) have recently received explosive popularity, but their enormous model sizes and training costs remain daunting. Conventional post-training pruning often incurs higher training budgets. In contrast, this paper aims to trim down both the training memory overhead and the inference complexity, without sacrificing the achievable accuracy. We carry out the first-of-its-kind comprehensive exploration, on taking a unified approach of integrating sparsity in ViTs "from end to end". Specifically, instead of training full ViTs, we dynamically extract and train sparse subnetworks, while sticking to a fixed small parameter budget. Our approach jointly optimizes model parameters and explores connectivity throughout training, ending up with one sparse network as the final output. The approach is seamlessly extended from unstructured to structured sparsity, the latter by considering to guide the prune-and-grow of self-attention heads inside ViTs. We further co-explore data and architecture sparsity for additional efficiency gains by plugging in a novel learnable token selector to adaptively determine the currently most vital patches. Extensive results on ImageNet with diverse ViT backbones validate the effectiveness of our proposals which obtain significantly reduced computational cost and almost unimpaired generalization. Perhaps most surprisingly, we find that the proposed sparse (co-)training can sometimes improve the ViT accuracy rather than compromising it, making sparsity a tantalizing "free lunch". For example, our sparsified DeiT-Small at (5%, 50%) sparsity for (data, architecture), improves 0.28% top-1 accuracy, and meanwhile enjoys 49.32% FLOPs and 4.40% running time savings. Our codes are available at https://github.com/VITA-Group/SViTE.

Overview of Sparse Vision Transformers

The paper "Chasing Sparsity in Vision Transformers: An End-to-End Exploration" investigates the integration of sparsity into Vision Transformers (ViTs) to enhance computational efficiency without sacrificing model accuracy. The authors introduce methods to exploit both unstructured and structured sparsity throughout the entire training pipeline of ViTs, offering a comprehensive exploration into their efficient training and deployment.

Main Contributions

The major contributions of the paper can be categorized into three innovative approaches:

  1. Sparse Vision Transformer Exploration (SViTE):
    • Rather than training full dense ViTs, this approach dynamically extracts and trains sparse subnetworks while sticking to a fixed, small parameter budget, jointly optimizing weights and connectivity throughout training. By drawing on the dynamic sparse training literature, the authors substantially reduce memory overhead and inference cost; a minimal prune-and-grow sketch follows this list.
    • Numerical results indicate that SViTE-Tiny, -Small, and -Base can achieve up to 57.50% FLOPs reduction while maintaining competitive accuracy, demonstrating the approach's effectiveness at varying sparsity levels.
  2. Structured Sparse Vision Transformer Exploration (S²ViTE):
    • S²ViTE extends the sparse exploration to structured sparsity, which maps more directly onto hardware acceleration. It uses a first-order importance approximation to guide the pruning and regrowing of self-attention heads inside ViTs.
    • At 40% structured sparsity, S²ViTE-Base retains or even improves accuracy while yielding up to 33.13% FLOPs and 24.70% running-time savings compared to its dense counterpart.
  3. Sparse Vision Transformer Co-Exploration (SViTE+):
    • The authors introduce a novel learnable token selector that adaptively identifies the most informative input patches, sparsifying the input data alongside the model architecture; a sketch of such a selector also follows this list.
    • Combined with structured sparsity, SViTE+ not only speeds up data processing and model execution but also turns data sparsity into an implicit regularizer that can improve generalization, yielding substantial FLOPs and running-time savings.
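
To make the prune-and-grow idea in item 1 concrete, here is a minimal sketch of one dynamic sparse training update in the spirit of the literature the paper builds on: magnitude-based dropping plus gradient-based regrowth under a fixed budget. The function name, the `drop_fraction` argument, and the per-tensor masking are illustrative assumptions, not the released SViTE implementation (see the linked GitHub repository for that).

```python
import torch


def prune_and_grow(weight: torch.Tensor, grad: torch.Tensor,
                   mask: torch.Tensor, drop_fraction: float = 0.1) -> torch.Tensor:
    """One illustrative prune-and-grow update under a fixed parameter budget.

    `mask` is a binary 0/1 tensor with the same shape as `weight`. The step
    drops the smallest-magnitude active weights and regrows the same number of
    previously inactive connections with the largest gradient magnitude, so the
    number of nonzero parameters never changes.
    """
    n_active = int(mask.sum().item())
    n_change = max(1, int(drop_fraction * n_active))
    inactive_before = mask == 0

    # Prune: among active weights, drop the n_change smallest magnitudes
    # (inactive positions get +inf so they are never selected).
    prune_scores = torch.where(mask.bool(), weight.abs(),
                               torch.full_like(weight, float("inf")))
    drop_idx = torch.topk(prune_scores.flatten(), n_change, largest=False).indices
    mask.view(-1)[drop_idx] = 0.0

    # Grow: among previously inactive connections, activate the n_change
    # entries with the largest gradient magnitude.
    grow_scores = torch.where(inactive_before, grad.abs(),
                              torch.full_like(grad, float("-inf")))
    grow_idx = torch.topk(grow_scores.flatten(), n_change, largest=True).indices
    mask.view(-1)[grow_idx] = 1.0
    weight.data.view(-1)[grow_idx] = 0.0  # newly grown weights start from zero

    weight.data.mul_(mask)  # keep pruned positions at exactly zero
    return mask
```

In SViTE a fixed-budget drop-and-grow cycle of this kind is applied to the ViT's weight tensors throughout training, so only a sparse subnetwork is ever stored and updated; per the paper's description, S²ViTE applies an analogous cycle at the granularity of whole self-attention heads, scoring them with a first-order importance estimate rather than raw weight magnitude.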

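Item 3's learnable token selector can likewise be pictured with a short sketch. The module below scores each patch token and keeps only the top-scoring fraction, using a Gumbel-style relaxation during training so the selection stays differentiable; the class name, `keep_ratio`, and the relaxation details are assumptions for illustration rather than the paper's released code.

```python
import torch
import torch.nn as nn


class TokenSelector(nn.Module):
    """Illustrative learnable token selector (not the released SViTE+ code).

    Scores each patch token and keeps roughly `keep_ratio` of them. During
    training, a Gumbel-style relaxation keeps the selection differentiable;
    at inference, only the top-scoring tokens are passed on.
    """

    def __init__(self, dim: int, keep_ratio: float = 0.5, tau: float = 1.0):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)  # per-token "keep" logit
        self.keep_ratio = keep_ratio
        self.tau = tau

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim), excluding the class token
        b, n, d = tokens.shape
        k = max(1, int(self.keep_ratio * n))
        logits = self.scorer(tokens).squeeze(-1)  # (batch, num_tokens)

        if self.training:
            # Soft, differentiable selection: perturb logits with Gumbel noise
            # and rescale tokens by a relaxed keep probability, so gradients
            # flow back into the scorer.
            gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-9) + 1e-9)
            soft_keep = torch.sigmoid((logits + gumbel) / self.tau)
            return tokens * soft_keep.unsqueeze(-1)

        # Hard selection at inference: keep only the k highest-scoring tokens.
        top_idx = torch.topk(logits, k, dim=1).indices  # (batch, k)
        return torch.gather(tokens, 1, top_idx.unsqueeze(-1).expand(b, k, d))
```

Passing fewer patch tokens into the transformer blocks is what couples data sparsity with the architectural sparsity above, and it is the source of the additional FLOPs and running-time savings the paper reports for the co-exploration setting.
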
Implications and Future Directions

The paper's findings suggest that well-designed sparsity strategies can significantly reduce ViT model size and computational burden, extending the applicability of large-scale transformer models to resource-constrained environments. This end-to-end exploration of sparsity in ViTs paves the way for further innovations in AI deployment, particularly in deep learning accelerators and embedded AI systems.

Looking forward, this research encourages a holistic view of sparsity in which both architecture and data play integral roles. The learnable token selector opens avenues for discovering models that are both lightweight and robust. Additionally, combining SViTE methods with other efficiency techniques, such as attention approximation and quantization, may further improve computational scaling for more powerful AI applications. The principles outlined in this work are also ripe for exploration on newer hardware that can exploit sparsity for real-time execution, an important frontier in AI efficiency research.

In conclusion, this paper stands as a significant exploration in optimizing transformers for vision tasks, setting a precedent for future endeavours in efficient AI model design and implementation.

Authors (6)
  1. Tianlong Chen (202 papers)
  2. Yu Cheng (354 papers)
  3. Zhe Gan (135 papers)
  4. Lu Yuan (130 papers)
  5. Lei Zhang (1689 papers)
  6. Zhangyang Wang (374 papers)
Citations (193)