DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification (2106.02034v2)

Published 3 Jun 2021 in cs.CV, cs.AI, and cs.LG

Abstract: Attention is sparse in vision transformers. We observe the final prediction in vision transformers is only based on a subset of most informative tokens, which is sufficient for accurate image recognition. Based on this observation, we propose a dynamic token sparsification framework to prune redundant tokens progressively and dynamically based on the input. Specifically, we devise a lightweight prediction module to estimate the importance score of each token given the current features. The module is added to different layers to prune redundant tokens hierarchically. To optimize the prediction module in an end-to-end manner, we propose an attention masking strategy to differentiably prune a token by blocking its interactions with other tokens. Benefiting from the nature of self-attention, the unstructured sparse tokens are still hardware friendly, which makes our framework easy to achieve actual speed-up. By hierarchically pruning 66% of the input tokens, our method greatly reduces 31%~37% FLOPs and improves the throughput by over 40% while the drop of accuracy is within 0.5% for various vision transformers. Equipped with the dynamic token sparsification framework, DynamicViT models can achieve very competitive complexity/accuracy trade-offs compared to state-of-the-art CNNs and vision transformers on ImageNet. Code is available at https://github.com/raoyongming/DynamicViT

DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification

The paper "DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification" explores a method to enhance the efficiency and speed of vision transformers (ViTs), a leading architecture in computer vision tasks such as image classification and object detection. The authors propose a novel approach involving dynamic token sparsification, which intelligently prunes redundancy in the input tokens, thus reducing computational complexity without significant loss in accuracy.

Key Contributions

Hierarchical Token Pruning: DynamicViT introduces a hierarchical framework that prunes redundant tokens dynamically across different layers of the transformer. A lightweight prediction module estimates an importance score for each token from its features at that layer, and tokens are pruned progressively, yielding substantial reductions in computational load.
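
As a rough illustration, such a prediction module can be thought of as a small MLP that scores each token from its local feature and a pooled global context of the tokens still kept, with a Gumbel-Softmax making the keep/drop decision differentiable. The sketch below is a minimal PyTorch approximation under these assumptions; the module name, hidden size, and pooling choice are illustrative rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenScorePredictor(nn.Module):
    """Lightweight MLP that scores each token's importance from its local
    feature and a pooled global context of the currently kept tokens."""

    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.local_proj = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, hidden))
        self.global_proj = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, hidden))
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.GELU(),
            nn.Linear(hidden, 2),  # logits for (drop, keep)
        )

    def forward(self, x: torch.Tensor, keep_mask: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) token features; keep_mask: (B, N, 1) in {0, 1}
        local = self.local_proj(x)
        # Global context: average over tokens that are still kept.
        pooled = (self.global_proj(x) * keep_mask).sum(1, keepdim=True)
        pooled = pooled / keep_mask.sum(1, keepdim=True).clamp(min=1e-6)
        feat = torch.cat([local, pooled.expand(-1, x.size(1), -1)], dim=-1)
        logits = self.head(feat)                                  # (B, N, 2)
        # Hard keep/drop decision, differentiable via straight-through Gumbel-Softmax.
        keep_prob = F.gumbel_softmax(logits, hard=True, dim=-1)[..., 1:]  # (B, N, 1)
        return keep_prob * keep_mask                              # a dropped token stays dropped
```

Because the hard keep decision stays differentiable through the straight-through Gumbel-Softmax estimator, a module of this kind can be optimized end-to-end together with the backbone.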

Attention Masking Strategy: To enable end-to-end optimization, the authors propose an attention masking strategy that prunes tokens differentiably by blocking their interactions with other tokens during training. At inference, the pruned tokens are actually discarded, which is where the computational savings are realized.
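
A schematic version of this masked attention is sketched below; tensor shapes and the exact normalization are simplifying assumptions for readability, not the paper's verbatim code. The key idea is that the attention columns of pruned tokens are zeroed out and the softmax is renormalized over the columns that remain visible.

```python
import torch

def masked_attention(q, k, v, keep_mask, eps=1e-6):
    """Self-attention where pruned tokens (keep_mask == 0) cannot be attended
    to by other tokens, while every token may still attend to itself.
    q, k, v: (B, H, N, d); keep_mask: (B, N) in [0, 1]."""
    B, H, N, d = q.shape
    scores = (q @ k.transpose(-2, -1)) / d ** 0.5             # (B, H, N, N) raw scores
    scores = scores - scores.max(dim=-1, keepdim=True).values  # numerical stability

    # Column j is visible only if token j is kept; the diagonal stays visible
    # so that a pruned token still produces a valid (self-attended) output.
    mask = keep_mask[:, None, None, :].expand(B, H, N, N).clone()
    idx = torch.arange(N, device=q.device)
    mask[..., idx, idx] = 1.0

    # Softmax renormalized over visible columns only, keeping pruning differentiable.
    attn = scores.exp() * mask
    attn = attn / (attn.sum(dim=-1, keepdim=True) + eps)
    return attn @ v                                             # (B, H, N, d)
```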

Numerical Results and Implications

In experiments on ImageNet, the authors report that DynamicViT prunes roughly 66% of the input tokens by the final pruning stage. This yields a 31% to 37% reduction in FLOPs and improves throughput by over 40%, while the accuracy drop stays within 0.5% across several transformer backbones, including DeiT and LV-ViT models.
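
A quick back-of-envelope check, assuming a per-stage keep ratio of 0.7 applied at three pruning stages (the setting used for most results in the paper) and a simplified linear cost model, shows how these numbers fit together:

```python
# Illustrative arithmetic only; not taken from the paper's code.
rho = 0.7
remaining = rho ** 3
print(f"tokens remaining after 3 stages: {remaining:.1%}")   # ~34%, i.e. ~66% pruned

# If the backbone's blocks are split into 4 equal segments and per-block cost
# is assumed to scale roughly linearly with token count, relative compute is:
segments = [1.0, rho, rho ** 2, rho ** 3]
relative_flops = sum(segments) / len(segments)
print(f"approximate relative FLOPs: {relative_flops:.2f}")   # ~0.63, i.e. ~37% fewer FLOPs
```

The exact figure depends on the backbone and on the quadratic attention term, which plausibly explains why the paper reports a 31%-37% range rather than a single number.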

These results indicate that DynamicViT offers a competitive complexity/accuracy trade-off relative to state-of-the-art CNNs and vision transformers. The implications of this work are significant for real-time applications where computational resources are limited.

Theoretical and Practical Implications

Theoretically, this work marks a shift in how transformer models can leverage input sparsity. By focusing on the most informative tokens and hierarchically pruning the less important ones, DynamicViT capitalizes on the self-attention mechanism's ability to handle variable-length inputs. This approach not only retains accuracy but also optimizes the model for speed and efficiency.
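
Because self-attention places no constraint on sequence length, the kept tokens can simply be gathered into a shorter dense sequence at inference time, which is why the resulting "unstructured" sparsity still maps well onto standard dense kernels. A minimal sketch of this step is shown below; the function name and shapes are illustrative assumptions.

```python
import torch

def prune_tokens_for_inference(x: torch.Tensor, scores: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """At inference there is no need for masking: the lowest-scoring tokens are
    simply dropped, and later blocks run on a shorter, dense sequence.
    x: (B, N, C) token features; scores: (B, N) keep scores."""
    B, N, C = x.shape
    num_keep = max(1, int(N * keep_ratio))
    # Indices of the highest-scoring tokens for each sample in the batch,
    # re-sorted so the original token order is preserved.
    keep_idx = scores.topk(num_keep, dim=1).indices.sort(dim=1).values   # (B, num_keep)
    # Dense gather: the result is a regular (B, num_keep, C) tensor, so the
    # remaining attention/MLP blocks run with ordinary dense kernels.
    return torch.gather(x, 1, keep_idx.unsqueeze(-1).expand(-1, -1, C))
```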

Practically, the reduced computational footprint makes DynamicViT well-suited for deployment in edge computing scenarios, where resources are constrained. It is particularly beneficial for applications that require real-time processing without compromising prediction accuracy.

Future Directions

Future research may build upon this work by exploring similar sparsification strategies in other types of transformers or other model architectures. Additionally, integrating such dynamic mechanisms into multi-modal models, or extending the approach to video processing, where temporal redundancy could be exploited in a similar way, might open new avenues of research. Applying these strategies to tasks beyond classification, such as dense prediction tasks like segmentation, is another interesting direction.

Conclusion

DynamicViT addresses a critical need in the efficient deployment of vision transformers through intelligent token pruning. By maintaining a fine balance between computational efficiency and model accuracy, this work presents a robust framework that could significantly impact practical deployments in various AI applications. The methods and results set a strong precedent for future innovations aimed at optimizing neural network performance in resource-constrained environments.

Authors (6)
  1. Yongming Rao (50 papers)
  2. Wenliang Zhao (22 papers)
  3. Benlin Liu (11 papers)
  4. Jiwen Lu (192 papers)
  5. Jie Zhou (687 papers)
  6. Cho-Jui Hsieh (211 papers)
Citations (580)