Dynamic Vision Mamba (DyVM)
- Dynamic Vision Mamba (DyVM) is an SSM-based architecture that dynamically prunes tokens and selects blocks to optimize spatial and spatiotemporal processing.
- It leverages learnable mechanisms, including Gumbel-Softmax for token pruning and per-sample block gating, to reduce computational redundancy and improve efficiency.
- Empirical results show significant reductions in FLOPs and latency with minimal accuracy loss, making DyVM a hardware-friendly solution across image, video, and dense prediction tasks.
Dynamic Vision Mamba (DyVM) extends the Mamba family of State Space Model (SSM)-based neural architectures to realize efficient, hardware-friendly processing of spatial and spatiotemporal vision data. Designed to mitigate spatial redundancy and computational bottlenecks arising in vanilla Vision Mamba models, DyVM introduces learnable dynamic mechanisms for both token-level (patch) and block-level (layer) adaptivity. This enables significant reductions in floating-point operation counts (FLOPs) and wall-clock latency with only minor accuracy trade-offs, while broadly generalizing across image, video, and dense prediction modalities (Wu et al., 7 Apr 2025, Liu et al., 2024). The DyVM methodology encompasses token pruning through sequence rearrangement, dynamic SSM block skipping, and, in related frameworks, adaptive resolution assignment and high-resolution dynamic state-space variants.
1. Foundations: SSMs and Vision Mamba
Vision Mamba replaces transformer-style self-attention with recurrent-like SSM layers, termed Mamba blocks, achieving linear time and memory complexity in sequence length by exploiting convolutional and scan-based SSMs rather than the quadratic scaling of attention layers (Wu et al., 7 Apr 2025, Liu et al., 2024). Each token sequence $x_1, \dots, x_N$ undergoes the discretized recurrence
$$h_t = \bar{A} h_{t-1} + \bar{B} x_t, \qquad y_t = C h_t,$$
where $\bar{A} = \exp(\Delta A)$ and $\bar{B} = (\Delta A)^{-1}(\exp(\Delta A) - I)\,\Delta B$, with the global output kernel $\bar{K} = (C\bar{B}, C\bar{A}\bar{B}, \dots, C\bar{A}^{N-1}\bar{B})$ interpretable as a data-dependent convolution. This mechanism underpins high-throughput, long-range context modeling across vision and sequence data, forming the backbone for DyVM adaptations.
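To make the scan/convolution duality concrete, here is a minimal sketch under simplifying assumptions: a scalar state and fixed parameters `a_bar`, `b_bar`, `c` (hypothetical names). Real Mamba blocks use input-dependent, multi-dimensional parameters, so this shows only the structure of the recurrence, not the actual layer.

```python
def ssm_scan(x, a_bar, b_bar, c):
    """Recurrent form: h_t = a_bar*h_{t-1} + b_bar*x_t, y_t = c*h_t."""
    h, ys = 0.0, []
    for x_t in x:
        h = a_bar * h + b_bar * x_t
        ys.append(c * h)
    return ys


def ssm_conv(x, a_bar, b_bar, c):
    """Convolutional form with the unrolled kernel K_k = c * a_bar**k * b_bar."""
    kernel = [c * (a_bar ** k) * b_bar for k in range(len(x))]
    return [
        sum(kernel[k] * x[t - k] for k in range(t + 1))
        for t in range(len(x))
    ]


x = [1.0, 0.5, -2.0, 3.0]
scan_out = ssm_scan(x, a_bar=0.9, b_bar=0.5, c=2.0)
conv_out = ssm_conv(x, a_bar=0.9, b_bar=0.5, c=2.0)
```

Both functions produce the same outputs up to floating-point rounding. The scan does so in time linear in the sequence length, which is the property the Mamba family relies on.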
2. DyVM Token Pruning: Consistent, Learnable Adaptivity
Spatial redundancy in Vision Mamba arises because many patch tokens contribute negligibly to global context: hidden-attention analysis (e.g., HiddenMambaAttn) places 95% of pixels below 70% attention (Wu et al., 7 Apr 2025). Prior token pruning methods either create mismatches between training and inference, because pruning changes token positions and the recurrence distances between hidden states, or increase inference computation, as in the HiddenAlign approach.
DyVM overcomes this by interleaving lightweight, learnable pruning predictors at several model depths. At pruning stage $s$, a predictor outputs per-token retain/prune logits, and the Gumbel-Softmax trick turns them into a binary mask $M^s \in \{0,1\}^N$ over the $N$ tokens. During training, retained tokens are reordered into a contiguous prefix and pruned tokens are appended after them, recreating the same token layout as at inference. This rearrangement keeps hidden-state recurrences and outputs mathematically and computationally aligned between training and deployment, preserving the SSM properties and avoiding excess computation. Successive pruning stages target progressively lower retained-token ratios while always retaining the class token (Wu et al., 7 Apr 2025).
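The effect of the rearrangement can be illustrated with a toy causal scan. This is a sketch under stated assumptions, not the authors' implementation: `forward_scan` is a scalar stand-in for an SSM block, `rearrange` is a hypothetical helper, and only the forward direction is shown. How the backward scan treats the appended tokens is not covered here.

```python
def forward_scan(tokens, a_bar=0.8, b_bar=1.0, c=1.0):
    """Toy causal recurrence standing in for a forward SSM pass."""
    h, ys = 0.0, []
    for x_t in tokens:
        h = a_bar * h + b_bar * x_t
        ys.append(c * h)
    return ys


def rearrange(tokens, keep_mask):
    """Move retained tokens to a contiguous prefix and pruned tokens to the
    tail, preserving relative order within each group."""
    kept = [t for t, m in zip(tokens, keep_mask) if m]
    pruned = [t for t, m in zip(tokens, keep_mask) if not m]
    return kept + pruned, len(kept)


tokens = [0.3, -1.2, 0.7, 2.0, -0.5, 1.1]
keep_mask = [1, 0, 1, 1, 0, 1]

train_seq, n_kept = rearrange(tokens, keep_mask)
# Training: the full-length sequence is scanned, outputs are read from the prefix.
train_out = forward_scan(train_seq)[:n_kept]
# Inference: pruned tokens are physically dropped before the scan.
infer_out = forward_scan(train_seq[:n_kept])
```

Because the recurrence is causal, the appended tokens cannot influence the prefix, so `train_out` and `infer_out` coincide exactly. Masking pruned tokens in place, by contrast, would leave them between retained tokens and change the recurrence distances.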
3. Dynamic Block Selection: Per-Sample SSM Skipping
Block redundancy arises because Vision Mamba models run both a forward and a backward SSM scan in each layer. Empirical measurements reveal that removing one or both scans yields substantial throughput gains for only a small FLOPs decrease, indicating that memory and arithmetic bottlenecks are dominated by the SSM blocks (Wu et al., 7 Apr 2025).
In DyVM, at each layer $l$, the class token embedding is processed by a two-layer MLP that produces two logits, one for forward and one for backward block utilization. Gumbel-Sigmoid is applied to obtain gating masks, effecting per-sample selection of whether to execute the forward and/or backward SSM pass. The layer output is then formed from the selected scans only, with each scan's contribution weighted by its gating mask. Learned block gating outperforms random or static skipping in ablation studies (Wu et al., 7 Apr 2025).
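A minimal sketch of per-sample block gating follows. Several parts are assumptions for illustration: the logits are passed in directly (the class-token MLP is omitted), `scan` and `backward_scan` are toy stand-ins for the SSM passes, and the additive combination of gated outputs is one plausible form rather than the paper's exact equation.

```python
import math
import random


def scan(x, a_bar=0.8):
    """Toy causal scan h_t = a_bar*h_{t-1} + x_t (stand-in for an SSM pass)."""
    h, ys = 0.0, []
    for x_t in x:
        h = a_bar * h + x_t
        ys.append(h)
    return ys


def backward_scan(x, a_bar=0.8):
    """The same toy scan applied right-to-left."""
    return scan(x[::-1], a_bar)[::-1]


def gumbel_sigmoid(logit, tau=1.0, rng=random):
    """Relaxed Bernoulli sample: sigmoid((logit + logistic noise) / tau)."""
    u = min(max(rng.random(), 1e-9), 1.0 - 1e-9)
    noise = math.log(u) - math.log(1.0 - u)
    return 1.0 / (1.0 + math.exp(-(logit + noise) / tau))


def gated_layer(x, logit_fwd, logit_bwd, training=False, rng=random):
    """Combine forward/backward scans according to per-sample gates."""
    if training:
        # Soft, noisy gates keep the selection differentiable in a real framework.
        m_f = gumbel_sigmoid(logit_fwd, rng=rng)
        m_b = gumbel_sigmoid(logit_bwd, rng=rng)
    else:
        # Hard decision at inference: a gated-off scan is never executed.
        m_f = 1.0 if logit_fwd > 0 else 0.0
        m_b = 1.0 if logit_bwd > 0 else 0.0
    out = [0.0] * len(x)
    if m_f > 0.0:
        out = [o + m_f * y for o, y in zip(out, scan(x))]
    if m_b > 0.0:
        out = [o + m_b * y for o, y in zip(out, backward_scan(x))]
    return out, (m_f, m_b)


x = [1.0, -0.5, 2.0]
out, masks = gated_layer(x, logit_fwd=2.0, logit_bwd=-1.0)
```

With these logits only the forward scan runs at inference, so the backward pass costs nothing for this sample. That skipped computation is where the throughput gain would come from.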
4. Algorithmic and Implementation Details
DyVM’s token pruning predictor consists of a convolution, followed by LayerNorm, GeLU, and a linear projection to two logits. Temperature annealing is used for Gumbel-Softmax, and the class token is always preserved. The block selector MLP uses a 128-dimensional hidden layer; bias initialization favors retaining both blocks initially to maintain compatibility with pretrained backbones. For segmentation tasks, pruned tokens are not dropped but their updates are halted, and full features are restored for the decoder (Wu et al., 7 Apr 2025).
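The sampling step and temperature annealing can be sketched as below. The exponential schedule, its endpoints, and the `cls_index` argument are assumptions for illustration; the source states only that temperature annealing is used and that the class token is always preserved.

```python
import math
import random


def gumbel_softmax_keep(logits, tau, rng):
    """Given (keep, prune) logits, return a soft keep probability and a
    hard 0/1 keep decision drawn with Gumbel noise."""
    noisy = []
    for logit in logits:
        u = min(max(rng.random(), 1e-9), 1.0 - 1e-9)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)
    top = max(noisy)
    exps = [math.exp(v - top) for v in noisy]  # numerically stable softmax
    soft_keep = exps[0] / sum(exps)
    hard_keep = 1 if noisy[0] >= noisy[1] else 0
    return soft_keep, hard_keep


def annealed_tau(step, total_steps, tau_start=1.0, tau_end=0.1):
    """Hypothetical exponential decay from tau_start to tau_end."""
    frac = step / max(total_steps - 1, 1)
    return tau_start * (tau_end / tau_start) ** frac


def sample_keep_mask(token_logits, tau, cls_index, rng):
    """Sample a keep mask for all tokens, then force-keep the class token."""
    mask = [gumbel_softmax_keep(pair, tau, rng)[1] for pair in token_logits]
    mask[cls_index] = 1
    return mask


rng = random.Random(0)
token_logits = [(2.0, -2.0), (-5.0, 5.0), (0.0, 0.0), (-9.0, 9.0)]
mask = sample_keep_mask(token_logits, annealed_tau(0, 10), cls_index=3, rng=rng)
```

Lower temperatures make the soft probabilities approach the hard decisions, which is why the temperature is typically decayed as training progresses.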
Losses include: (a) token and block ratio objectives to match target sparsity; (b) primary classification loss; (c) distillation losses for both output and tokens. This joint loss maintains task accuracy during aggressive pruning/block-skipping schedules.
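One way the three loss groups might be combined is sketched here. The squared-error form of the ratio objective, the weights `w_ratio` and `w_distill`, and all numeric values are assumptions for illustration, not values from the paper.

```python
def ratio_loss(mask, target_ratio):
    """Squared gap between the observed keep rate and its target."""
    keep_rate = sum(mask) / len(mask)
    return (keep_rate - target_ratio) ** 2


def joint_loss(cls_loss, token_masks, token_targets, block_mask, block_target,
               distill_out, distill_tok, w_ratio=1.0, w_distill=1.0):
    """Classification + sparsity-ratio + distillation terms (hypothetical weights)."""
    token_term = sum(
        ratio_loss(m, t) for m, t in zip(token_masks, token_targets)
    ) / len(token_masks)
    block_term = ratio_loss(block_mask, block_target)
    return (cls_loss
            + w_ratio * (token_term + block_term)
            + w_distill * (distill_out + distill_tok))


total = joint_loss(
    cls_loss=0.9,
    token_masks=[[1, 1, 1, 0], [1, 1, 0, 0]],  # one mask per pruning stage
    token_targets=[0.75, 0.5],                 # progressively lower keep ratios
    block_mask=[1, 1, 1, 0],                   # executed-block indicators
    block_target=0.5,
    distill_out=0.2,
    distill_tok=0.1,
)
```

In this example the token masks hit their targets exactly, so only the block term (keep rate 0.75 against a target of 0.5) adds a sparsity penalty.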
5. Empirical Results: FLOPs, Accuracy, and Generality
Experiments on ImageNet classification, Kinetics-400 video action recognition, and ADE20K segmentation demonstrate the efficacy of DyVM (Wu et al., 7 Apr 2025):
- ImageNet: DyVM achieves a 35.2% FLOPs reduction on Vim-S (5.08 G to 3.29 G) with only a 1.7-point top-1 accuracy drop (80.5% to 78.8%). Vim-T and Vim-B models show similar FLOPs/accuracy trade-offs.
- Kinetics-400: DyVM reduces FLOPs by 26.2% with 1.3% accuracy drop.
- Semantic segmentation (ADE20K): DyVM obtains large FLOPs savings at a cost of 2.9 mIoU points (Vim-S + DyVM 42.0 vs. 44.9 for the baseline).
Ablation studies confirm that combining token and block adaptivity yields optimal efficiency/accuracy trade-offs, learnable block/token selection outperforms static or random, and incorporating distillation terms recovers 0.3–0.4% accuracy (Wu et al., 7 Apr 2025). Throughput gains of up to 56% are observed on A6000/A100 GPUs for large models.
6. Broader DyVM Variants and Related Frameworks
Coarse-to-Fine and High-Resolution Variants
- Coarse-to-Fine Vision Mamba (CF-ViM/MambaScope): Instead of uniform token pruning, CF-ViM processes all images in coarse resolution first and selectively refines only high-importance or uncertain regions at fine resolution. This approach leverages token-importance metrics computed from SSM activations and achieves up to 47% FLOPs reduction with no loss or a gain in top-1 accuracy compared to DyVM and baselines (Liu et al., 29 Nov 2025).
- HRVMamba (with Dynamic Visual State Space, DVSS blocks): For dense prediction, the DVSS block augments SSMs with deformable convolutions and multi-scale depthwise convolutions, mitigating long-range forgetting and restoring multi-scale locality. HRVMamba maintains high-resolution parallel feature streams for output, achieving state-of-the-art results on COCO pose estimation and semantic segmentation while preserving linear scaling and high spatial fidelity (Zhang et al., 2024).
3D/Spatiotemporal and Biomedical Extensions
- Vision Mamba for 3D/Video (DyVM): Frameworks extend SSM blocks to handle spatiotemporal tensors, implementing spatial and temporal selective scan parameters. Dynamic state matrices enable time-varying SSM updates within video, allowing efficient accumulation of global temporal and spatial context (Liu et al., 2024).
- 3D MRI Medical Imaging: DyVM architectures achieve efficient inference and competitive accuracy on Alzheimer’s disease detection benchmarks by applying adaptive selective scan mechanisms and parallel convolutional/SSM processing to high-dimensional voxel data, with FLOPs and memory improvements over ViT and CNNs (A et al., 2024).
7. Limitations, Theoretical Insights, and Future Directions
While DyVM provides consistent token/block adaptivity mechanisms and reduces computation across the reported benchmarks, certain limitations persist:
- Linear-complexity SSMs cannot adaptively focus tokens with full selectivity like self-attention, limiting fine-grained adaptivity in complex scenes (Liu et al., 2024).
- Sequential SSM recurrence in inference, though linear in length, introduces latency bottlenecks for extremely long sequences or dense spatial grids (A et al., 2024).
- FLOPs savings sometimes correspond to smaller real wall-clock savings, as memory and parallelization constraints become limiting (Wu et al., 7 Apr 2025).
- DyVM efficacy depends on careful tuning of target sparsity and gating thresholds; performance on challenging datasets may saturate due to information loss in extreme sparsification (Liu et al., 29 Nov 2025).
Open research topics include jointly learnable pruning/block ratios per layer, hybrid DyVM-attention architectures, low-rank or quantization co-optimization, dynamic multi-scale scan strategies, and extension to multimodal spatiotemporal and event-based vision systems (Wu et al., 7 Apr 2025, Liu et al., 2024).
Summary Table: DyVM Efficiency and Accuracy (ImageNet, Vim-S)
| Model | FLOPs (G) | Top-1 (%) | FLOPs Reduction | Δ Top-1 |
|---|---|---|---|---|
| Vim-S (base) | 5.08 | 80.5 | – | – |
| Vim-S + DyVM | 3.29 | 78.8 | –35.2% | –1.7 |
| Vim-S + HiddenAlign (HA) | 3.60 | 78.8 | –29.1% | –1.7 |
(From Wu et al., 7 Apr 2025, Table 1)
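The reduction column follows directly from the FLOPs column. As a quick check, a small helper (the name `flops_reduction_pct` is hypothetical) reproduces it from the tabulated values:

```python
def flops_reduction_pct(base_gflops, pruned_gflops):
    """Relative FLOPs saving, in percent, with respect to the base model."""
    return 100.0 * (base_gflops - pruned_gflops) / base_gflops


dyvm_reduction = round(flops_reduction_pct(5.08, 3.29), 1)  # Vim-S + DyVM
ha_reduction = round(flops_reduction_pct(5.08, 3.60), 1)    # Vim-S + HA
```

These evaluate to 35.2 and 29.1 respectively, matching the table: at the same 78.8% top-1 accuracy, DyVM removes about six percentage points more FLOPs than HA.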
DyVM establishes a broadly applicable, efficient, and mathematically consistent framework for dynamic adaptivity in SSM-based vision backbones. Its token and block pruning techniques, coherence between training and inference, and adaptability to a wide range of input and task domains make it a pivotal variant among recent state space vision models (Wu et al., 7 Apr 2025, Liu et al., 29 Nov 2025, Zhang et al., 2024, Liu et al., 2024, A et al., 2024).