Dynamic Spatial Sparsification for Efficient Vision Transformers and Convolutional Neural Networks
The paper introduces a framework for improving the efficiency of Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) by dynamically exploiting spatial sparsity, aiming to reduce computational cost without significantly degrading accuracy.
Dynamic Token Sparsification Framework
The authors observe that ViTs base accurate image recognition on a small subset of informative regions, which suggests that spatial sparsity can be exploited for acceleration. They therefore propose a dynamic token sparsification technique that progressively prunes redundant tokens according to importance scores learned from the input.
Several key components define the framework:
- Prediction Module: A lightweight module that estimates token importance at several layers, combining local (per-token) and global (image-level) features to decide which tokens to keep; a sketch of such a predictor follows this list.
- Hierarchical Sparsification: Pruning is applied progressively across multiple stages, so the token set shrinks gradually through the network rather than in one aggressive step, which helps preserve accuracy.
- Attention Masking: During training, an attention masking strategy addresses the non-differentiability introduced by token pruning, enabling end-to-end learning without altering the backbone architecture; see the second sketch below.
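The paper's exact predictor design is not reproduced here; the following is a minimal PyTorch sketch of the idea under stated assumptions (the layer sizes, the two-logit head, and the Gumbel-Softmax temperature are illustrative choices, not the paper's). Each token's local feature is concatenated with a global feature pooled over the currently kept tokens, and a small head outputs a keep/drop decision that stays differentiable during training:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenPredictor(nn.Module):
    """Lightweight keep/drop predictor (illustrative sketch, not the paper's exact module)."""
    def __init__(self, dim: int):
        super().__init__()
        self.local_proj = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim // 2), nn.GELU())
        self.global_proj = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim // 2), nn.GELU())
        # Two logits per token: index 0 = drop, index 1 = keep.
        self.head = nn.Linear(2 * (dim // 2), 2)

    def forward(self, x: torch.Tensor, prev_mask: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) token embeddings; prev_mask: (B, N), 1.0 = still kept.
        local = self.local_proj(x)
        # Global context: mean over the tokens that are still kept.
        w = prev_mask / prev_mask.sum(dim=1, keepdim=True).clamp(min=1e-6)
        glob = (self.global_proj(x) * w.unsqueeze(-1)).sum(dim=1, keepdim=True)
        glob = glob.expand(-1, x.size(1), -1)
        logits = self.head(torch.cat([local, glob], dim=-1))         # (B, N, 2)
        # Differentiable hard decision via straight-through Gumbel-Softmax.
        keep = F.gumbel_softmax(logits, tau=1.0, hard=True)[..., 1]  # (B, N)
        return keep * prev_mask  # a token dropped at an earlier stage stays dropped
```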
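During training, pruned tokens cannot simply be removed from the sequence: batches would then hold ragged token counts, and the keep/drop decision would leave the computation graph. A masked-attention sketch under the same assumptions is shown below; dropped tokens are hidden as attention keys while every token still attends to itself, and at inference the kept tokens can instead be gathered explicitly so the masking overhead disappears:

```python
import torch

def masked_attention(q, k, v, keep_mask):
    # q, k, v: (B, H, N, d); keep_mask: (B, N) with 1.0 = kept, 0.0 = dropped.
    B, H, N, d = q.shape
    scores = (q @ k.transpose(-2, -1)) / d ** 0.5                  # (B, H, N, N)
    # A key j is visible iff it is kept, or it is the query itself (diagonal).
    eye = torch.eye(N, device=q.device).view(1, 1, N, N)
    visible = torch.maximum(keep_mask.view(B, 1, 1, N).expand(B, 1, N, N), eye)
    # Numerically stable masked softmax over the key dimension.
    scores = scores - scores.max(dim=-1, keepdim=True).values
    weights = scores.exp() * visible
    weights = weights / weights.sum(dim=-1, keepdim=True).clamp(min=1e-6)
    return weights @ v                                             # (B, H, N, d)
```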
Asymmetric Computation for Hierarchical Models
For hierarchical models such as CNNs and Swin Transformers, whose operations are sensitive to the 2D structure of the feature map, the paper extends the sparsification framework with asymmetric computation: essential structure is preserved while less informative regions are routed through a fast path, maintaining expressiveness at reduced cost.
- Fast and Slow Paths: Informative features are processed by a slow path resembling the original layers, while less informative features take a lightweight fast path; see the sketch after this list.
- Generic Application: The scheme adapts to a wide range of architectures, avoiding the structural rigidity of alternatives such as sparse convolutions.
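A minimal sketch of the fast/slow routing is shown below; the 3×3 "slow" branch and the pointwise "fast" branch are hypothetical stand-ins, since the paper only requires the fast path to be substantially cheaper than the original operation. This dense formulation evaluates both branches everywhere for clarity; an efficient deployment would gather each region's positions and run only the corresponding branch on them:

```python
import torch
import torch.nn as nn

class FastSlowBlock(nn.Module):
    """Asymmetric computation sketch: informative positions take an expressive
    'slow' branch, the rest a lightweight 'fast' branch (illustrative choices)."""
    def __init__(self, channels: int):
        super().__init__()
        self.slow = nn.Sequential(                 # heavy path, mirrors the original layer
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        self.fast = nn.Conv2d(channels, channels, kernel_size=1)  # cheap path

    def forward(self, x: torch.Tensor, keep_mask: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W); keep_mask: (B, 1, H, W) with 1.0 = informative region.
        return keep_mask * self.slow(x) + (1.0 - keep_mask) * self.fast(x)
```

Because the slow path mirrors the original layer, pretrained weights can in principle be reused, with the savings coming from restricting the expensive branch to the gathered informative positions at inference time.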
Empirical Validation
Experiments on ImageNet, ADE20K, and COCO underscore the efficacy of the proposed method:
- ViTs: FLOPs reduced by 31%-35% with accuracy drops of only 0.2%-0.5%.
- CNNs: FLOPs reduced by more than 20% with no loss of accuracy.
- Hierarchical Models: Computational cost reduced while performance is maintained on complex visual tasks such as segmentation and detection.
Implications and Future Directions
Practically, the framework offers a route to deploying deep learning models on resource-constrained devices by reducing energy consumption and inference time. Theoretically, it highlights spatial sparsity in the input data as a robust lever for model acceleration.
Looking forward, this work may motivate further research into combining sparsification with other acceleration techniques and into better support from emerging hardware architectures. The trade-off between discarded information and retained performance, particularly across input resolutions and downstream tasks, remains a promising direction for exploration.