Mamba Token Reduction (MTR)

Updated 12 March 2026

MTR is a computational strategy that reduces token counts in state-space models by leveraging importance and similarity metrics.
It employs a hybrid prune-and-merge routine with adaptive scheduling to maintain token order and signal continuity during inference.
Empirical results demonstrate significant FLOPs reduction and speedup with minimal accuracy loss across both language and vision applications.

Mamba Token Reduction (MTR) refers to a family of computational strategies specifically developed for Mamba State-Space Models (SSMs) and their vision extensions to decrease the number of tokens processed during model inference or training. These methods offer substantial acceleration and memory reduction while maintaining, or in some cases even improving, accuracy. MTR is necessary because naïve token pruning or merging approaches from Transformer-based architectures degrade performance in SSMs due to differences in signal propagation and token importance accumulation. The MTR framework, in its various domain-specific forms, is marked by the use of SSM-aware importance metrics, similarity-based merging, token rearrangement, and in select cases, adaptive, dynamic, or progressive schedules.

1. Theoretical Foundations: SSM Signal Propagation and Token Importance

Mamba models, whether for language or vision, are grounded in continuous-time SSMs discretized for sequence processing. Each token update propagates information through hidden states according to

$h_t = \overline{A}\,h_{t-1} + \overline{B}\,x_t,\quad y_t = C\,h_t,$

where the recurrence makes the model strongly sensitive to sequence order and induces global convolutional dependencies (Zhan et al., 2024, Ma et al., 18 Jul 2025). When the parameters are time-invariant, the same map can be written as a convolution with the kernel $\overline{K} = [C\overline{B},\,C\overline{A}\,\overline{B},\,\ldots,\,C\overline{A}^{L-1}\overline{B}],\quad y = x * \overline{K}.$ A token's contribution therefore cannot be “masked out” without permanently losing its signal across the entire sequence: pruning irrevocably eliminates components from this convolution, leading to compounded error. Furthermore, selective gating in SSMs such as Mamba renders attention-based pruning approaches (which rely on Q/K/V maps or [CLS] similarity) inapplicable or actively harmful, because SSM token propagation requires careful treatment of both importance and position (Zhan et al., 2024, Ma et al., 18 Jul 2025).
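
The equivalence between the recurrence and the kernel form, and the effect of masking a single token, can be checked numerically. The following is a minimal sketch that assumes a time-invariant SSM with a diagonal transition matrix and a scalar input channel; the sizes, seed, and function names are illustrative. Mamba's selective parameters are input-dependent, so the fixed kernel here only illustrates the propagation argument and is not Mamba's actual computation.

```python
import numpy as np

def ssm_recurrence(A_bar, B_bar, C, x):
    """Run h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t over a scalar sequence x."""
    h = np.zeros(A_bar.shape[0])
    ys = []
    for x_t in x:
        h = A_bar @ h + B_bar * x_t
        ys.append(C @ h)
    return np.array(ys)

def ssm_kernel(A_bar, B_bar, C, L):
    """Build K = [C B_bar, C A_bar B_bar, ..., C A_bar^{L-1} B_bar]."""
    K, v = [], B_bar.copy()
    for _ in range(L):
        K.append(C @ v)
        v = A_bar @ v
    return np.array(K)

rng = np.random.default_rng(0)
N, L = 4, 8                                    # state size, sequence length (illustrative)
A_bar = np.diag(rng.uniform(0.5, 0.95, N))     # stable diagonal transition (assumption)
B_bar = rng.normal(size=N)
C = rng.normal(size=N)
x = rng.normal(size=L)

y_rec = ssm_recurrence(A_bar, B_bar, C, x)
K = ssm_kernel(A_bar, B_bar, C, L)
y_conv = np.convolve(x, K)[:L]                 # causal convolution, first L outputs

# Masking token 2 removes the term K[t-2] * x[2] from every output at t >= 2.
x_masked = x.copy()
x_masked[2] = 0.0
delta = y_rec - ssm_recurrence(A_bar, B_bar, C, x_masked)
```

Here `delta` is zero before position 2 and equals `K[:L-2] * x[2]` afterwards, which is the sense in which a removed token's signal is lost at every later position rather than only at its own.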

2. Unified Mamba Token Reduction Algorithms: Importance, Similarity, and Reduction Steps

Modern MTR methods operate on two principal criteria—importance and similarity—integrated into a hybrid prune-and-merge routine. The canonical intra-layer Mamba MTR algorithm (Zhan et al., 2024) follows:

Token importance assignment: After SSM block application, each token receives an importance score

$S_i = \frac{1}{D'} \sum_{d=1}^{D'}\max(0, y_{i,d}),$

leveraging channel-averaged, ReLU-clipped activations or, for vision, the Mamba per-token selective timescale parameter $\Delta_t$ (Ma et al., 18 Jul 2025).

Token partitioning: Tokens are sorted and split into high- and low-importance groups.
Similarity determination: For each low-importance token $a_i$, the most similar high-importance token $f_i$ is found via cosine similarity:

$\operatorname{sim}(a,b) = \frac{a \cdot b}{\|a\|\|b\|}.$

Top pair selection: Only the top $p\%$ most similar $(a_i, f_i)$ pairs are considered for reduction.
Prune and merge: Of the selected pairs, $q\%$ of the $a_i$ are pruned, while the remaining $(100-q)\%$ are merged into their partners via $f_i \leftarrow (a_i + f_i)/2$.
Sequence reconstruction: Reduced tokens are reassembled in the correct order for continuity and downstream SSM recurrence integrity.
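
The steps above can be sketched in NumPy as follows. This is an illustrative reading of the routine, not the reference implementation: the function names, the `keep_ratio` parameter, and the rule for deciding which selected pairs are pruned versus merged are assumptions, and `p` and `q` are passed as fractions rather than percentages.

```python
import numpy as np

def importance_scores(y):
    """S_i = mean over channels of max(0, y_{i,d}); y has shape (L, D')."""
    return np.maximum(y, 0.0).mean(axis=1)

def prune_and_merge(tokens, y, keep_ratio=0.5, p=0.5, q=0.5):
    """One hybrid reduction step. Returns (reduced tokens, surviving indices)."""
    L = tokens.shape[0]
    order = np.argsort(-importance_scores(y), kind="stable")
    n_high = max(1, int(round(keep_ratio * L)))      # keep_ratio is hypothetical
    high, low = order[:n_high], order[n_high:]

    # Cosine similarity between every low-importance and high-importance token.
    unit = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    sim = unit[low] @ unit[high].T
    best = sim.argmax(axis=1)                        # partner f_i for each a_i
    best_sim = sim[np.arange(len(low)), best]

    # Only the top-p fraction of most similar pairs is reduced.
    n_reduce = int(round(p * len(low)))
    reduce_rows = np.argsort(-best_sim, kind="stable")[:n_reduce]

    # ASSUMPTION: among selected pairs, the most similar are merged and the
    # least similar q fraction are pruned; the text does not fix this choice.
    n_prune = int(round(q * n_reduce))
    merge_rows = reduce_rows[: n_reduce - n_prune]

    out = tokens.astype(float).copy()
    for r in merge_rows:
        f = high[best[r]]
        out[f] = 0.5 * (out[low[r]] + out[f])        # f_i <- (a_i + f_i) / 2

    removed = low[reduce_rows]
    kept = np.setdiff1d(np.arange(L), removed)       # sorted: original order preserved
    return out[kept], kept

rng = np.random.default_rng(0)
tokens = rng.normal(size=(12, 8))
# The block output y is replaced by the tokens themselves purely for demonstration.
reduced, kept = prune_and_merge(tokens, y=tokens)
```

With 12 tokens and the default settings, three low-importance tokens are removed (some merged, some pruned) and the nine survivors are returned in their original sequence order.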

In vision, an analogous framework partitions tokens into Keep, Target, and Source groups by importance, with Source tokens softly merged to their best Target partners and Keep tokens maintained for maximal fidelity (Ma et al., 18 Jul 2025). Training-free variants for vision use only the structural Mamba $\Delta_t$ gating value for importance, avoiding the need for additional model parameters or retraining (Ma et al., 18 Jul 2025).
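
A training-free Keep/Target/Source step of this kind might look like the sketch below. Several details are assumptions made for illustration and are not taken from the cited work: importance is the channel mean of $\Delta_t$, Keep holds the most important tokens and Source the least important, and the "soft" merge is an importance-weighted average. All names and sizes are hypothetical.

```python
import numpy as np

def delta_importance(delta):
    """Per-token importance as the channel mean of Delta_t; delta has shape (L, D)."""
    return delta.mean(axis=1)

def keep_target_source_merge(tokens, delta, n_keep, n_source):
    """Merge the n_source least important tokens into Targets; Keep tokens are untouched."""
    L = tokens.shape[0]
    if n_keep + n_source >= L:
        raise ValueError("at least one Target token is required")
    score = delta_importance(delta)                  # assumed strictly positive
    order = np.argsort(-score, kind="stable")
    keep, target, source = order[:n_keep], order[n_keep:L - n_source], order[L - n_source:]

    # Best Target partner for each Source token by cosine similarity.
    unit = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    partner = target[(unit[source] @ unit[target].T).argmax(axis=1)]

    # ASSUMPTION: soft merge = importance-weighted average of a Target and its Sources.
    weighted = tokens * score[:, None]
    num, den = weighted.copy(), score.astype(float).copy()
    np.add.at(num, partner, weighted[source])        # accumulates repeated partners
    np.add.at(den, partner, score[source])

    out = tokens.astype(float).copy()
    out[target] = num[target] / den[target][:, None]
    survivors = np.sort(np.concatenate([keep, target]))  # restore scan order
    return out[survivors], survivors

rng = np.random.default_rng(0)
tokens = rng.normal(size=(10, 4))
delta = rng.uniform(0.01, 1.0, size=(10, 4))         # stand-in for Delta_t values
merged, survivors = keep_target_source_merge(tokens, delta, n_keep=3, n_source=4)
```

Because the score comes from a quantity the model already computes, a step like this adds no parameters; a Target that attracts no Source tokens passes through unchanged.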

3. Domain-Specific MTR Extensions and Algorithmic Variants

Multiple domain-specific adaptations of MTR exist, driven by modality requirements and Mamba's architectural versatility.

Language Mamba (standard SSMs): Post-training intra-layer prune-&-merge via unified importance and similarity metrics (Zhan et al., 2024).
Vision Mamba (ViM, VMamba, etc.): Parameter-free, training-free importance scores (using $\Delta_t$), with asymmetric bipartite merging yielding up to 40% FLOPs savings at a top-1 accuracy drop below 2% on ImageNet-1K. Importance scoring based on internal per-token gating is critical (Ma et al., 18 Jul 2025).
Dynamic and Progressive Schedules: DyVM rearranges kept/pruned tokens to preserve SSM recurrence and introduces per-sample, per-layer dynamic block selection (Wu et al., 7 Apr 2025). Progressive reduction in long-video VLMs uses a low-to-high pruning schedule that intensifies after SSM memory has absorbed sufficient input (Jiang et al., 27 Feb 2026).
Merged Token Re-Training: Pairwise merging with brief retraining (R-MeeTo) enables fast, robust recovery of accuracy at high compression rates via minute-level fine-tuning (Shi et al., 2024).
Cross-Layer Token Fusion: Famba-V explores fusing tokens across selected layers (all, interleaved, lower, or upper), offering tunable trade-offs between efficiency and performance in Vision Mamba (Shen et al., 2024).
Hierarchical and Adaptive Schemes: Coarse-to-fine frameworks first process coarsely patched images and refine only ambiguous regions, adaptively allocating compute as needed to maximize efficiency at iso-accuracy (Liu et al., 29 Nov 2025).
Clustering-Guided Token Reduction: CSSMamba for hyperspectral vision exploits learned spatial clusters and dual-attention driven selection to reduce sequence length per-cluster (often to ≈50%) without degrading boundary delineation or accuracy (Dewis et al., 22 Jan 2026).
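
As one illustration of a low-to-high schedule, the sketch below keeps every token for a number of warm-up layers and then ramps the prune ratio linearly. It assumes the schedule runs over layer depth and that each ratio applies to the original sequence length; the cited progressive method may define its schedule differently, and the function names and numbers are hypothetical.

```python
def progressive_prune_ratios(num_layers, warmup_layers, final_ratio):
    """Per-layer prune ratio: 0 during warm-up, then a linear ramp up to final_ratio."""
    if not 0 <= warmup_layers < num_layers:
        raise ValueError("warmup_layers must be in [0, num_layers)")
    ramp = num_layers - warmup_layers
    return [
        0.0 if i < warmup_layers else final_ratio * (i - warmup_layers + 1) / ramp
        for i in range(num_layers)
    ]

def tokens_per_layer(num_tokens, ratios):
    """Tokens retained at each layer when each ratio applies to the original length."""
    return [max(1, int(round(num_tokens * (1.0 - r)))) for r in ratios]

ratios = progressive_prune_ratios(num_layers=8, warmup_layers=4, final_ratio=0.75)
counts = tokens_per_layer(1024, ratios)
# counts == [1024, 1024, 1024, 1024, 832, 640, 448, 256]
```

The warm-up length plays the role of letting the SSM state absorb the input before reduction intensifies; in practice it would be tuned per model and task.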

4. Empirical Performance: Accuracy, Efficiency, and Trade-offs

Reported results show MTR achieving better accuracy-efficiency trade-offs than Transformer-derived token pruning or merging baselines applied to SSMs. Key findings include:

Model/Setting	Baseline Method	Baseline Acc. Drop	FLOPs Reduction	MTR Acc. Drop	MTR Speedup
Mamba-2.7B (Zhan et al., 2024)	PuMer (merge)	–23.0 pp	20%	–4.2 pp (vs. baseline)	1.17×–1.37× (20–30%)
ViM-B (Ma et al., 18 Jul 2025)	HSA (prune)	~2.0%	40%	1.6%	Up to 1.5× throughput
Vim-S (Shi et al., 2024)	HSA (prune)	–1.7%	31%	–1.0% w/ retraining	1.2×–2.2×
Hyb. VLM (Jiang et al., 27 Feb 2026)	Attn (prune)	–3.75	75% (25% kept)	No loss; +1.37 w/finetune	4.1×
CSSMamba (Dewis et al., 22 Jan 2026)	—	<0.5%	≈50% (per-cluster)	—	Marked scan reduction

In all settings, token reduction via MTR ensures that high-importance or novel tokens are retained, merging or pruning only where redundancy or low impact is measurable using SSM-specific metrics.
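
When reading the FLOPs and speedup columns together, a useful reference point is the speedup that would result if runtime were exactly proportional to FLOPs. The helper below computes that idealized figure; it is a back-of-the-envelope aid under that stated assumption, not a model of any cited system, and its name is hypothetical.

```python
def ideal_speedup(flops_reduction):
    """Speedup if runtime were exactly proportional to FLOPs: 1 / (1 - r)."""
    if not 0.0 <= flops_reduction < 1.0:
        raise ValueError("flops_reduction must be in [0, 1)")
    return 1.0 / (1.0 - flops_reduction)

# Reduction levels that appear in the table above.
for r in (0.20, 0.40, 0.75):
    print(f"{r:.0%} fewer FLOPs -> {ideal_speedup(r):.2f}x idealized speedup")
```

The idealized values are 1.25×, about 1.67×, and 4×. The measured figures in the table fall on either side of these, which is plausible because wall-clock time also depends on factors such as memory traffic, kernel efficiency, and the cost of the reduction step itself.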

5. Limitations, Open Issues, and Future Directions

Current MTR techniques carry several limitations and active research directions:

Results largely pertain to post-training reduction; fine-tuning after token reduction can further improve performance (Zhan et al., 2024, Shi et al., 2024, Jiang et al., 27 Feb 2026).
Uniform, global token reduction ratios may not be optimal; adaptive, layerwise, or task-specific reduction schedules are subject to ongoing investigation (Liu et al., 29 Nov 2025).
Some approaches incur non-negligible computational overhead during token similarity calculation (quadratic in token count), although practical implementations amortize this via batching and infrequent invocation (Zhan et al., 2024).
Application to dense prediction tasks (segmentation, detection) remains an open challenge, as order sensitivity and importance scoring may differ by task (Ma et al., 18 Jul 2025).
The extension of MTR to other state-space neural architectures (e.g., S4) or standard attention models is plausible but under-explored.
Joint token reduction and quantization, or scheduled integration at training time, represents a promising efficiency frontier (Zhan et al., 2024, Liu et al., 29 Nov 2025).

6. Practical Integration and Guidelines

MTR modules are inserted before, after, or within SSM or Vision Mamba blocks and typically do not require model retraining at modest compression rates (Zhan et al., 2024, Ma et al., 18 Jul 2025, Wu et al., 7 Apr 2025):

For plug-and-play application, the scoring, sorting, selection, and merging steps are applied after a block using internal SSM signals (e.g., selective timescale $\Delta_t$ or hidden state activations).
For merge-then-retrain workflows, fine-tuning over a few epochs recovers nearly all accuracy, even with aggressive reduction (Shi et al., 2024).
Cross-layer and adaptive/hierarchical schemes require minimal modifications to standard training loops, but tuning of reduction schedules or grouping strategies is essential for best accuracy-efficiency positioning (Shen et al., 2024, Liu et al., 29 Nov 2025).
The preservation of token order after reduction or merging is mandatory; even small deviations can catastrophically degrade SSM-based models (Shi et al., 2024).
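
The order-preservation requirement is easy to violate by accident, because selecting tokens by importance returns indices in importance order rather than sequence order. The sketch below shows the pitfall and a simple fix; the helper name and toy data are illustrative.

```python
import numpy as np

def gather_in_scan_order(tokens, keep_idx):
    """Select surviving tokens and return them in their original sequence order."""
    keep_idx = np.sort(np.asarray(keep_idx))
    return tokens[keep_idx], keep_idx

tokens = np.arange(6, dtype=float)[:, None] * np.ones((1, 2))   # token i is the row [i, i]
importance = np.array([0.1, 0.7, 0.3, 0.9, 0.2, 0.8])

top3 = np.argsort(-importance)[:3]      # indices in importance order: 3, 5, 1
scrambled = tokens[top3]                # rows 3, 5, 1: sequence order is lost
ordered, idx = gather_in_scan_order(tokens, top3)   # rows 1, 3, 5: order restored
```

Feeding `scrambled` to a recurrent scan would present token 3 before token 1, whereas `ordered` keeps the survivors in the positions the recurrence expects.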

7. Significance and Distinction from Transformer Token Reduction

MTR fundamentally departs from Transformer-centric token reduction in three ways:

SSM state continuity and convolutional signal integrity must be preserved, so arbitrary masking, pruning, or reordering is highly detrimental in SSMs, whereas multi-head attention is comparatively robust to such interventions (Zhan et al., 2024).
Importance estimation leverages internal gating or timescale mechanisms unique to Mamba-style SSMs, as opposed to attention scores or [CLS] resemblance (Ma et al., 18 Jul 2025).
The merging strategy accounts for similarity and importance in tandem, avoiding blind information blending—a critical difference for layered recurrent propagation (Zhan et al., 2024, Ma et al., 18 Jul 2025).

As a result, MTR strategies make SSM-based large models more efficient to deploy across domains while limiting accuracy loss, which is particularly relevant for long-range sequence modeling and high-resolution vision tasks.