Dynamic Spatial Sparsification for Efficient Vision Transformers and Convolutional Neural Networks
The paper introduces a framework for improving the efficiency of Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) by dynamically exploiting spatial sparsity, aiming to reduce computational cost without significantly degrading accuracy.
Dynamic Token Sparsification Framework
The authors observe that ViTs base accurate image recognition on a small subset of informative regions, which suggests that spatial sparsity can be exploited for acceleration. They therefore propose a dynamic token sparsification technique that progressively prunes redundant tokens according to importance scores learned from the input.
Several key components define the framework:
- Prediction Module: A lightweight module that estimates token importance at several layers, combining local (per-token) and global (image-level) features to decide which tokens to keep; a sketch of such a predictor follows this list.
- Hierarchical Sparsification: Pruning is applied progressively across multiple stages, so the token set shrinks gradually through the network rather than in one aggressive step, which helps preserve accuracy.
- Attention Masking: During training, an attention masking strategy addresses the non-differentiability introduced by token pruning, enabling end-to-end learning without altering the backbone architecture; see the second sketch below.
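The paper's exact predictor design is not reproduced here; the following is a minimal PyTorch sketch of the idea under stated assumptions (the layer sizes, the two-logit head, and the Gumbel-Softmax temperature are illustrative choices, not the paper's). Each token's local feature is concatenated with a global feature pooled over the currently kept tokens, and a small head outputs a keep/drop decision that stays differentiable during training:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenPredictor(nn.Module):
    """Lightweight keep/drop predictor (illustrative sketch, not the paper's exact module)."""
    def __init__(self, dim: int):
        super().__init__()
        self.local_proj = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim // 2), nn.GELU())
        self.global_proj = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim // 2), nn.GELU())
        # Two logits per token: index 0 = drop, index 1 = keep.
        self.head = nn.Linear(2 * (dim // 2), 2)

    def forward(self, x: torch.Tensor, prev_mask: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) token embeddings; prev_mask: (B, N), 1.0 = still kept.
        local = self.local_proj(x)
        # Global context: mean over the tokens that are still kept.
        w = prev_mask / prev_mask.sum(dim=1, keepdim=True).clamp(min=1e-6)
        glob = (self.global_proj(x) * w.unsqueeze(-1)).sum(dim=1, keepdim=True)
        glob = glob.expand(-1, x.size(1), -1)
        logits = self.head(torch.cat([local, glob], dim=-1))         # (B, N, 2)
        # Differentiable hard decision via straight-through Gumbel-Softmax.
        keep = F.gumbel_softmax(logits, tau=1.0, hard=True)[..., 1]  # (B, N)
        return keep * prev_mask  # a token dropped at an earlier stage stays dropped
```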
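During training, pruned tokens cannot simply be removed from the sequence: batches would then hold ragged token counts, and the keep/drop decision would leave the computation graph. A masked-attention sketch under the same assumptions is shown below; dropped tokens are hidden as attention keys while every token still attends to itself, and at inference the kept tokens can instead be gathered explicitly so the masking overhead disappears:

```python
import torch

def masked_attention(q, k, v, keep_mask):
    # q, k, v: (B, H, N, d); keep_mask: (B, N) with 1.0 = kept, 0.0 = dropped.
    B, H, N, d = q.shape
    scores = (q @ k.transpose(-2, -1)) / d ** 0.5                  # (B, H, N, N)
    # A key j is visible iff it is kept, or it is the query itself (diagonal).
    eye = torch.eye(N, device=q.device).view(1, 1, N, N)
    visible = torch.maximum(keep_mask.view(B, 1, 1, N).expand(B, 1, N, N), eye)
    # Numerically stable masked softmax over the key dimension.
    scores = scores - scores.max(dim=-1, keepdim=True).values
    weights = scores.exp() * visible
    weights = weights / weights.sum(dim=-1, keepdim=True).clamp(min=1e-6)
    return weights @ v                                             # (B, H, N, d)
```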
Asymmetric Computation for Hierarchical Models
For hierarchical models such as CNNs and Swin Transformers, whose operations are sensitive to the 2D structure of the feature map, the paper extends the sparsification framework with asymmetric computation: essential structure is preserved while less informative regions are routed through a fast path, maintaining expressiveness at reduced cost.
- Fast and Slow Paths: Informative features are processed by a slow path resembling the original layers, while less informative features take a lightweight fast path; see the sketch after this list.
- Generic Application: The scheme adapts to a wide range of architectures, avoiding the structural rigidity of alternatives such as sparse convolutions.
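A minimal sketch of the fast/slow routing is shown below; the 3×3 "slow" branch and the pointwise "fast" branch are hypothetical stand-ins, since the paper only requires the fast path to be substantially cheaper than the original operation. This dense formulation evaluates both branches everywhere for clarity; an efficient deployment would gather each region's positions and run only the corresponding branch on them:

```python
import torch
import torch.nn as nn

class FastSlowBlock(nn.Module):
    """Asymmetric computation sketch: informative positions take an expressive
    'slow' branch, the rest a lightweight 'fast' branch (illustrative choices)."""
    def __init__(self, channels: int):
        super().__init__()
        self.slow = nn.Sequential(                 # heavy path, mirrors the original layer
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        self.fast = nn.Conv2d(channels, channels, kernel_size=1)  # cheap path

    def forward(self, x: torch.Tensor, keep_mask: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W); keep_mask: (B, 1, H, W) with 1.0 = informative region.
        return keep_mask * self.slow(x) + (1.0 - keep_mask) * self.fast(x)
```

Because the slow path mirrors the original layer, pretrained weights can in principle be reused, with the savings coming from restricting the expensive branch to the gathered informative positions at inference time.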
Empirical Validation
Experiments on ImageNet, ADE20K, and COCO underscore the efficacy of the proposed method:
- ViTs: FLOPs reduced by 31%-35% with accuracy drops of only 0.2%-0.5%.
- CNNs: FLOPs reduced by more than 20% with no loss of accuracy.
- Hierarchical Models: Computational cost reduced while performance is maintained on complex visual tasks such as segmentation and detection.
Implications and Future Directions
Practically, the framework offers a route to deploying deep learning models on resource-constrained devices by reducing energy consumption and inference time. Theoretically, it highlights spatial sparsity in the input data as a robust lever for model acceleration.
Looking forward, this work may motivate further research into combining sparsification with other acceleration techniques and into better support from emerging hardware architectures. The trade-off between discarded information and retained performance, particularly across input resolutions and downstream tasks, remains a promising direction for exploration.