Dynamic Vision Mamba (DyVM)

Updated 22 February 2026
  • Dynamic Vision Mamba (DyVM) is an SSM-based architecture that dynamically prunes tokens and selects blocks to optimize spatial and spatiotemporal processing.
  • It leverages learnable mechanisms, including Gumbel-Softmax for token pruning and per-sample block gating, to reduce computational redundancy and improve efficiency.
  • Empirical results show significant reductions in FLOPs and latency with minimal accuracy loss, making DyVM a hardware-friendly solution across image, video, and dense prediction tasks.

Dynamic Vision Mamba (DyVM) extends the Mamba family of State Space Model (SSM)-based neural architectures to realize efficient, hardware-friendly processing of spatial and spatiotemporal vision data. Designed to mitigate spatial redundancy and computational bottlenecks arising in vanilla Vision Mamba models, DyVM introduces learnable dynamic mechanisms for both token-level (patch) and block-level (layer) adaptivity. This enables significant reductions in floating-point operation counts (FLOPs) and wall-clock latency with only minor accuracy trade-offs, while broadly generalizing across image, video, and dense prediction modalities (Wu et al., 7 Apr 2025, Liu et al., 2024). The DyVM methodology encompasses token pruning through sequence rearrangement, dynamic SSM block skipping, and, in related frameworks, adaptive resolution assignment and high-resolution dynamic state-space variants.

1. Foundations: SSMs and Vision Mamba

Vision Mamba replaces transformer-style self-attention with recurrent-like SSM layers, termed Mamba blocks, achieving linear time and memory complexity $O(L)$ in sequence length $L$ by exploiting convolutional and scan-based SSMs rather than $O(L^2)$ scaling of attention layers (Wu et al., 7 Apr 2025, Liu et al., 2024). Each token sequence $X\in\mathbb{R}^{L\times D}$ undergoes the recurrence:

$$h_t = \bar A h_{t-1} + \bar B x_t,\qquad y_t = C h_t,$$

where $\bar A = \exp(\Delta A)$ and $\bar B = (\Delta A)^{-1}(\exp(\Delta A)-I)\,\Delta B$, with global output kernel $\bar K$ interpretable as a data-dependent convolution. This mechanism underpins high-throughput, long-range context modeling across vision and sequence data, forming the backbone for DyVM adaptations.
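The recurrence and its zero-order-hold discretization can be illustrated with a minimal sketch. It assumes a diagonal state matrix with nonzero entries, a scalar input channel, and fixed parameters; in Mamba itself the step size and projections depend on the input. The function names `zoh_discretize` and `ssm_scan` are hypothetical and do not come from any library.

```python
import math


def zoh_discretize(a, b, delta):
    """Zero-order-hold discretization for a diagonal SSM (assumes every a_i != 0).

    a, b  : per-state entries of the diagonal A and the input column B
    delta : positive step size (fixed here; input-dependent in Mamba)
    """
    a_bar = [math.exp(delta * ai) for ai in a]
    # Per-entry form of (delta*A)^{-1} (exp(delta*A) - I) (delta*B)
    b_bar = [(ab - 1.0) / (delta * ai) * (delta * bi)
             for ab, ai, bi in zip(a_bar, a, b)]
    return a_bar, b_bar


def ssm_scan(xs, a, b, c, delta):
    """Run h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t over a scalar sequence."""
    a_bar, b_bar = zoh_discretize(a, b, delta)
    h = [0.0] * len(a)
    ys = []
    for x in xs:
        h = [ab * hi + bb * x for ab, hi, bb in zip(a_bar, h, b_bar)]
        ys.append(sum(ci * hi for ci, hi in zip(c, h)))
    return ys


# One pass over the sequence: cost grows linearly with its length.
ys = ssm_scan([1.0, 0.0, 0.0], a=[-1.0], b=[1.0], c=[1.0], delta=0.1)
```

Each step touches the state once, which is where the $O(L)$ cost comes from. Production implementations replace the Python loop with a parallel scan kernel.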

2. DyVM Token Pruning: Consistent, Learnable Adaptivity

Spatial redundancy in Vision Mamba arises because many patch tokens contribute negligibly to global context: in HiddenMambaAttn-style attention maps, for example, 95% of pixels score below 70% attention (Wu et al., 7 Apr 2025). Prior token pruning methods either create mismatches between training and inference (because pruning changes token positions and the recurrence distances between hidden states) or increase inference computation, as in the HiddenAlign approach.

DyVM overcomes this by interleaving lightweight, learnable pruning predictors at several model depths. At pruning stage $s$, a predictor $P$ outputs per-token retain/prune logits, with the Gumbel-Softmax trick producing a mask $M^s$. During training, retained tokens are reordered to the contiguous prefix and pruned tokens are appended, recreating the same token layout as at inference. This rearrangement ensures hidden-state recurrences and outputs are mathematically and computationally aligned between training and deployment, preserving the SSM properties and avoiding excess computation. Pruning stages (typically $S=3$) target progressively lower retained token ratios $[\rho, \rho^2, \rho^3]$ while always retaining the class token (Wu et al., 7 Apr 2025).
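The masking and rearrangement steps can be sketched as follows. This is a forward-pass illustration only: the soft relaxation, temperature, and straight-through gradient needed for training are omitted. The helper names (`sample_keep_mask`, `rearrange`, `stage_ratios`) and the `cls_index` parameter are assumptions made for the example, not the authors' code.

```python
import math
import random


def _gumbel(rng):
    """Draw one Gumbel(0, 1) sample; u is clamped away from 0 and 1 for stability."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def sample_keep_mask(logits, rng, cls_index):
    """Hard keep(1)/prune(0) decision per token from (keep, prune) logit pairs."""
    mask = []
    for keep_logit, prune_logit in logits:
        keep = keep_logit + _gumbel(rng) >= prune_logit + _gumbel(rng)
        mask.append(1 if keep else 0)
    mask[cls_index] = 1  # the class token is never pruned
    return mask


def rearrange(tokens, mask):
    """Stable partition: kept tokens form the prefix, pruned tokens are appended."""
    kept = [t for t, m in zip(tokens, mask) if m]
    pruned = [t for t, m in zip(tokens, mask) if not m]
    return kept + pruned, len(kept)


def stage_ratios(rho, stages=3):
    """Target retained-token ratios [rho, rho^2, ..., rho^stages]."""
    return [rho ** (s + 1) for s in range(stages)]


rng = random.Random(0)
tokens = ["p0", "p1", "cls", "p2", "p3"]
logits = [(2.0, 0.0), (-2.0, 0.0), (0.0, 0.0), (0.5, 0.0), (-1.0, 0.0)]
mask = sample_keep_mask(logits, rng, cls_index=2)
sequence, num_kept = rearrange(tokens, mask)
```

Because the partition is stable, the first `num_kept` entries of `sequence` are the tokens an inference-time model would scan, in the same order. That shared layout is the train/inference consistency the paragraph describes.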

3. Dynamic Block Selection: Per-Sample SSM Skipping

Block redundancy arises because Vision Mamba models typically run both a forward and a backward SSM scan in each layer. Empirical measurements show that removing one or both scans yields up to $2.83\times$ throughput gains for only a small decrease in FLOPs, indicating that the SSM blocks dominate the memory and arithmetic bottlenecks (Wu et al., 7 Apr 2025).

In DyVM, at each layer $l$, the class token embedding $C^l$ is processed by a two-layer MLP $G$, producing logits for forward/backward block utilization. Gumbel-Sigmoid is applied to obtain $\mathbf{Q}^l\in\{0,1\}^{B\times 2}$ gating masks, effecting per-sample selection of whether to execute forward and/or backward SSM passes. The actual layer output is:

$$\text{LayerNorm}\big(\text{ForwardSSM}(H^l) \odot \mathbf{Q}^l_{:,0} + \text{BackwardSSM}(H^l) \odot \mathbf{Q}^l_{:,1} + \text{Res}(H^l)\big).$$

Learned block gating outperforms random or static skipping in ablation studies (Wu et al., 7 Apr 2025).
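A per-sample view of this gating can be sketched as below. Several simplifications are assumptions of the example: the hidden state is a flat list of floats, the two SSM passes are arbitrary callables, LayerNorm has no learned affine parameters, and the gate sampling covers only the hard forward decision. At inference a gate of 0 means the corresponding scan is never executed, which is equivalent to multiplying its output by the mask.

```python
import math
import random


def _gumbel(rng):
    """Draw one Gumbel(0, 1) sample; u is clamped away from 0 and 1 for stability."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def gumbel_sigmoid_gate(logit, tau, rng):
    """Hard 0/1 gate from one logit (forward pass only).

    Training would use sigmoid(noisy) for the gradient; thresholding
    sigmoid(noisy) at 0.5 is the same as testing noisy > 0.
    """
    noisy = (logit + _gumbel(rng) - _gumbel(rng)) / tau
    return 1.0 if noisy > 0.0 else 0.0


def layer_norm(v, eps=1e-5):
    """LayerNorm over one feature vector, without learned scale or shift."""
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]


def gated_layer(h, forward_ssm, backward_ssm, gates):
    """LayerNorm(fwd(h) * q_fwd + bwd(h) * q_bwd + h), skipping any gated-off scan."""
    q_fwd, q_bwd = gates
    out = list(h)  # residual path
    if q_fwd:
        out = [o + f for o, f in zip(out, forward_ssm(h))]
    if q_bwd:
        out = [o + b for o, b in zip(out, backward_ssm(h))]
    return layer_norm(out)


rng = random.Random(0)
gates = (gumbel_sigmoid_gate(1.5, 1.0, rng), gumbel_sigmoid_gate(-1.5, 1.0, rng))
out = gated_layer([1.0, 2.0, 3.0], lambda h: h, lambda h: h[::-1], gates)
```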

4. Algorithmic and Implementation Details

DyVM’s token pruning predictor consists of a $1\times 1$ convolution, followed by LayerNorm, GeLU, and a linear projection to two logits. Temperature annealing is used for Gumbel-Softmax, and the class token is always preserved. The block selector MLP uses a 128-dimensional hidden layer; bias initialization favors retaining both blocks initially to maintain compatibility with pretrained backbones. For segmentation tasks, pruned tokens are not dropped but their updates are halted, and full features are restored for the decoder (Wu et al., 7 Apr 2025).

Losses include: (a) token and block ratio objectives to match target sparsity; (b) primary classification loss; (c) distillation losses for both output and tokens. This joint loss maintains task accuracy during aggressive pruning/block-skipping schedules.
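How these terms might combine can be sketched as follows. The source does not give the exact form of the ratio objectives or the loss weights, so the squared-gap penalty and the weights `w_token`, `w_block`, and `w_distill` are assumptions made purely for illustration. The classification and distillation losses are passed in as precomputed scalars.

```python
def ratio_loss(masks, targets):
    """Mean squared gap between each realized keep ratio and its target.

    masks   : list of 0/1 lists, one per pruning stage (or one list of block gates)
    targets : matching list of target ratios
    """
    terms = [(sum(m) / len(m) - t) ** 2 for m, t in zip(masks, targets)]
    return sum(terms) / len(terms)


def total_loss(cls_loss, token_masks, token_targets, block_gates, block_target,
               distill_output, distill_token,
               w_token=1.0, w_block=1.0, w_distill=1.0):
    """Weighted sum of the loss terms; the weights are hypothetical."""
    token_term = ratio_loss(token_masks, token_targets)
    block_term = ratio_loss([block_gates], [block_target])
    distill_term = distill_output + distill_token
    return cls_loss + w_token * token_term + w_block * block_term + w_distill * distill_term


loss = total_loss(
    cls_loss=1.0,
    token_masks=[[1, 1, 1, 0], [1, 1, 0, 0]],  # two pruning stages
    token_targets=[0.75, 0.5],
    block_gates=[1, 1, 1, 0],                  # forward/backward gates over two layers
    block_target=0.75,
    distill_output=0.2,
    distill_token=0.3,
)
```

In this example every realized ratio matches its target, so both ratio penalties vanish and only the classification and distillation terms remain.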

5. Empirical Results: FLOPs, Accuracy, and Generality

Experiments on ImageNet classification, Kinetics-400 video action recognition, and ADE20K segmentation demonstrate the efficacy of DyVM (Wu et al., 7 Apr 2025):

  • ImageNet: DyVM achieves a 35.2% FLOPs reduction (Vim-S, $5.08\to 3.29$G) with only a 1.7% top-1 accuracy drop (80.5% to 78.8%). Vim-T and Vim-B models show similar FLOPs/accuracy trade-offs.
  • Kinetics-400: DyVM reduces FLOPs by 26.2% with a 1.3% accuracy drop.
  • Semantic segmentation: DyVM costs 2.9 mIoU points (Vim-S+DyVM 42.0 vs. 44.9) in exchange for large FLOPs savings.

Ablation studies confirm that combining token and block adaptivity yields the best efficiency/accuracy trade-offs, that learnable block/token selection outperforms static or random selection, and that incorporating distillation terms recovers 0.3–0.4% accuracy (Wu et al., 7 Apr 2025). Throughput gains of up to 56% are observed on A6000/A100 GPUs for large models.

Coarse-to-Fine and High-Resolution Variants

  • Coarse-to-Fine Vision Mamba (CF-ViM/MambaScope): Instead of uniform token pruning, CF-ViM processes all images in coarse resolution first and selectively refines only high-importance or uncertain regions at fine resolution. This approach leverages token-importance metrics computed from SSM activations and achieves up to 47% FLOPs reduction with no loss or a gain in top-1 accuracy compared to DyVM and baselines (Liu et al., 29 Nov 2025).
  • HRVMamba (with Dynamic Visual State Space, DVSS blocks): For dense prediction, the DVSS block augments SSMs with deformable convolutions and multi-scale depthwise convolutions, mitigating long-range forgetting and restoring multi-scale locality. HRVMamba maintains high-resolution parallel feature streams for output, achieving state-of-the-art results on COCO pose estimation and semantic segmentation while preserving $O(L)$ scaling and high spatial fidelity (Zhang et al., 2024).

3D/Spatiotemporal and Biomedical Extensions

  • Vision Mamba for 3D/Video (DyVM): Frameworks extend SSM blocks to handle $B\times T\times H\times W\times C$ tensors, implementing spatial and temporal selective scan parameters. Dynamic state matrices enable time-varying SSM updating within video, allowing efficient global temporal and spatial context accumulation (Liu et al., 2024).
  • 3D MRI Medical Imaging: DyVM architectures achieve efficient inference and competitive accuracy on Alzheimer’s disease detection benchmarks by applying adaptive selective scan mechanisms and parallel convolutional/SSM processing to high-dimensional voxel data, with FLOPs and memory improvements over ViT and CNNs (A et al., 2024).
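To make the spatiotemporal case concrete, the sketch below shows two ways a $T\times H\times W$ grid of tokens could be flattened into a 1D sequence for scanning. The source does not specify which scan orders these frameworks use, so both orderings and the function names are assumptions for illustration only.

```python
def spatial_first_order(T, H, W):
    """Visit every position of frame 0, then frame 1, and so on."""
    return [(t, i, j) for t in range(T) for i in range(H) for j in range(W)]


def temporal_first_order(T, H, W):
    """Visit one spatial position across all frames before moving to the next."""
    return [(t, i, j) for i in range(H) for j in range(W) for t in range(T)]


# Both orders cover the same T*H*W tokens; only the neighbours seen by the
# recurrence differ (same-frame patches vs. the same patch over time).
spatial = spatial_first_order(2, 2, 2)
temporal = temporal_first_order(2, 2, 2)
```

Either order keeps the sequence length at $T \cdot H \cdot W$, so the linear-in-length cost of the scan is unchanged. The choice affects only which tokens are close together in the recurrence.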

7. Limitations, Theoretical Insights, and Future Directions

While DyVM provides consistent token/block adaptivity mechanisms and universally reduces computation, certain limitations persist:

  • Linear-complexity SSMs cannot adaptively focus tokens with full selectivity like self-attention, limiting fine-grained adaptivity in complex scenes (Liu et al., 2024).
  • Sequential SSM recurrence in inference, though linear in length, introduces latency bottlenecks for extremely long sequences or dense spatial grids (A et al., 2024).
  • FLOPs savings sometimes correspond to smaller real wall-clock savings, as memory and parallelization constraints become limiting (Wu et al., 7 Apr 2025).
  • DyVM efficacy depends on careful tuning of target sparsity and gating thresholds; performance on challenging datasets may saturate due to information loss in extreme sparsification (Liu et al., 29 Nov 2025).

Open research topics include jointly learnable pruning/block ratios per layer, hybrid DyVM-attention architectures, low-rank or quantization co-optimization, dynamic multi-scale scan strategies, and extension to multimodal spatiotemporal and event-based vision systems (Wu et al., 7 Apr 2025, Liu et al., 2024).


Summary Table: DyVM Efficiency and Accuracy (ImageNet, Vim-S)

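| Model        | FLOPs (G) | Top-1 (%) | FLOPs Reduction | Δ Top-1 |
|--------------|-----------|-----------|-----------------|---------|
| Vim-S (base) | 5.08      | 80.5      |                 |         |
| Vim-S + DyVM | 3.29      | 78.8      | -35.2%          | -1.7    |
| Vim-S + HA   | 3.60      | 78.8      | -29.1%          | -1.7    |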

(From (Wu et al., 7 Apr 2025), Table 1)
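The reduction and delta columns follow directly from the FLOPs and top-1 figures in the table, as this short arithmetic check shows. The helper name `flops_reduction` is hypothetical.

```python
def flops_reduction(base_gflops, pruned_gflops):
    """Relative FLOPs saving in percent."""
    return 100.0 * (base_gflops - pruned_gflops) / base_gflops


dyvm_saving = round(flops_reduction(5.08, 3.29), 1)  # Vim-S + DyVM
ha_saving = round(flops_reduction(5.08, 3.60), 1)    # Vim-S + HA
top1_delta = round(78.8 - 80.5, 1)                   # both pruned variants

print(dyvm_saving, ha_saving, top1_delta)  # 35.2 29.1 -1.7
```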


DyVM establishes a broadly applicable, efficient, and mathematically consistent framework for dynamic adaptivity in SSM-based vision backbones. Its token and block pruning techniques, coherence between training and inference, and adaptability to a wide range of input and task domains make it a pivotal variant among recent state space vision models (Wu et al., 7 Apr 2025, Liu et al., 29 Nov 2025, Zhang et al., 2024, Liu et al., 2024, A et al., 2024).
