Exploring Token Pruning in Vision State Space Models
The paper "Exploring Token Pruning in Vision State Space Models," introduces a novel approach to improve the efficiency of State Space Models (SSMs) in vision tasks by implementing token pruning strategies. The paper identifies the limitations of applying traditional token pruning methods, designed for Vision Transformers (ViTs), to SSMs and proposes an alternative strategy that better aligns with the computational characteristics of SSMs.
Overview and Motivation
State Space Models are gaining traction in vision tasks because their computational complexity grows linearly with sequence length, in contrast to the quadratic complexity of self-attention in transformers. SSM-based models such as VMamba achieve this efficiency through a linear scan mechanism, which motivates the search for further gains in these models, particularly through token pruning.
Token pruning methods for ViTs reduce computational load by retaining only a subset of informative tokens. The paper highlights a significant obstacle to reusing them, however: because an SSM processes tokens as an ordered sequence, naively deleting tokens shifts the positions of the remaining ones and disrupts the scan. As a result, applying ViT-style token pruning to SSMs causes substantial drops in accuracy, undermining the potential benefits of the approach.
Methodology and Contributions
To address this misalignment, the researchers introduce a token pruning strategy tailored to SSM-based vision models. The approach has three components:
- Pruning-Aware Hidden State Alignment: The method keeps the positions of surviving tokens stable across pruning steps, so each token retains its processing neighborhood in the scan sequence. This is the critical departure from traditional methods, which disrupt token adjacency and consequently degrade performance (a minimal masking sketch appears after this list).
- Token Importance Evaluation: Token importance is assessed via an aggregation over the SSM's high-dimensional channel space, which identifies tokens that can be pruned without sacrificing performance. This sets the method apart from conventional metric-based evaluations such as per-token ℓ1 or ℓ2 norms (a scoring sketch appears after this list).
- Efficient Implementation: Practical acceleration techniques turn the pruning decisions into real speedups, delivering substantial reductions in FLOPs with minimal accuracy loss.
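To make the channel-space idea concrete, here is a minimal PyTorch sketch of one way to score tokens by aggregating over channels rather than taking a plain per-token norm. The function name `token_importance` and the particular normalization scheme are illustrative assumptions, not the authors' formula:

```python
import torch

def token_importance(hidden_states: torch.Tensor) -> torch.Tensor:
    """Score each token by aggregating evidence across the channel space.

    hidden_states: (batch, num_tokens, channels)
    returns:       (batch, num_tokens) importance scores
    """
    # Use activation magnitudes as the raw evidence.
    mag = hidden_states.abs()
    # Normalize each channel across the token axis so every channel
    # contributes comparably; a plain per-token l1 norm would instead
    # let a few large-scale channels dominate the score.
    mag = mag / (mag.sum(dim=1, keepdim=True) + 1e-6)
    # Sum the normalized contributions over channels: one score per token.
    return mag.sum(dim=-1)
```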
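And here is a hypothetical sketch of the alignment idea: rather than physically removing pruned tokens, which would shift every later token's position in the scan, it zeroes them in place so that the surviving tokens keep their original order and neighbors. The paper's actual alignment operates on hidden states during the scan and is more involved; `prune_with_alignment` and the masking scheme are assumptions for illustration:

```python
import torch

def prune_with_alignment(tokens: torch.Tensor,
                         scores: torch.Tensor,
                         keep_ratio: float = 0.7):
    """Drop low-importance tokens without shifting the survivors' positions.

    tokens: (batch, num_tokens, channels)
    scores: (batch, num_tokens) importance scores
    Returns the masked token tensor and the boolean keep-mask.
    """
    batch, num_tokens, _ = tokens.shape
    num_keep = max(1, int(num_tokens * keep_ratio))
    # Pick the highest-scoring tokens in each sequence.
    keep_idx = scores.topk(num_keep, dim=1).indices
    mask = torch.zeros(batch, num_tokens, dtype=torch.bool,
                       device=tokens.device)
    mask.scatter_(1, keep_idx, True)
    # Zero pruned positions instead of deleting them: surviving tokens
    # keep their original scan order and adjacency, the property the
    # paper argues naive ViT-style pruning breaks.
    return tokens * mask.unsqueeze(-1), mask
```

In a real deployment the mask would presumably feed a kernel that skips pruned positions entirely rather than multiplying by zero, which is where the efficient-implementation component's FLOP savings would come from.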
Results and Implications
The paper documents extensive experiments showing that the proposed method achieves significant computation reductions with little to no impact on performance across image classification on ImageNet-1K and object detection and segmentation on COCO 2017. For instance, the pruned PlainMamba-L3 achieves up to a 41.4% reduction in FLOPs while maintaining around 81.7% top-1 accuracy on ImageNet.
The results demonstrate that this novel token pruning approach not only boosts computational efficiency but also maintains or even enhances the interpretability and reliability of SSM-based models for computer vision tasks.
Future Directions
The research opens several avenues for future work on deep learning model optimization:
- Model Interpretability: By further exploring how token adjacency and sequence alignment influence model interpretability, practitioners can gain deeper insights into model decision-making processes.
- Generalizability Across Architectures: The adaptability of this pruning method to a broader set of backbones, beyond SSMs, could be explored, potentially extending its benefits to a wider range of neural network architectures.
- Fine-Tuning for Recovery: Further research might examine how fine-tuning can offset the accuracy lost to aggressive token pruning, and establish benchmarks for recovering pruned models.
In conclusion, the proposed token pruning methodology paves the way for more computationally efficient deep learning models without sacrificing performance, by preserving the sequential structure on which SSM-based vision models depend. The paper contributes valuable insights and tools for improving the efficiency of this emerging class of state-space architectures in computer vision.