Mamba-Shedder: Efficient SSM Pruning

Updated 9 February 2026
  • Mamba-Shedder is a structured compression framework for selective SSM architectures, using training-free, evaluation-guided pruning to remove redundant modules.
  • It iteratively eliminates unnecessary blocks, SSM heads, and MLP channels based on calibration metrics, preserving dense matrix operations for hardware efficiency.
  • Empirical results show that pruning ~20% of modules can achieve a 10–20% model size reduction and 1.3–1.4× inference speedup with only minor accuracy drops.

Mamba-Shedder is a structured compression framework for Selective Structured State Space Model (SSM) architectures, primarily Mamba and its hybrid variants, that prunes redundant components post-training to achieve significant reductions in model size and computational cost. Mechanistically, Mamba-Shedder exploits structure and redundancy at multiple module granularities, guided by calibration metrics, yielding accelerated inference with minimal impact on downstream accuracy, even for sequence modeling networks previously regarded as highly parameter-efficient. The framework is model-agnostic within the SSM/Transformer spectrum and is fully training-free; optional fine-tuning can recover accuracy lost during pruning. Mamba-Shedder constitutes a principled methodology for hardware-aware, automated model compression of contemporary large sequence models (Muñoz et al., 28 Jan 2025).

1. Background: SSMs, Mamba Architectures, and Hybrids

Structured State Space Models (SSMs) generalize autoregressive sequence models by maintaining a fixed-dimensional hidden state $h_t \in \mathbb{R}^d$ and updating it linearly via

$$h_t = A h_{t-1} + B x_t, \qquad y_t = C^\top h_t,$$

with $A, B, C$ learnable, time-invariant parameters. SSM inference cost therefore scales linearly with sequence length, without the growing cache memory of Transformer-based models.
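
A minimal NumPy sketch of this recurrence (the dimensions and random parameters are illustrative only, not taken from any actual Mamba configuration):

```python
import numpy as np

def ssm_scan(A, B, C, x):
    """Linear SSM recurrence: h_t = A h_{t-1} + B x_t,  y_t = C^T h_t."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:                # one sequential step per token: cost linear in length
        h = A @ h + B @ x_t      # fixed-dimensional state update (no growing cache)
        ys.append(C.T @ h)       # linear readout
    return np.stack(ys)

# Illustrative sizes only: state dim 16, input/output dim 4, sequence length 8.
rng = np.random.default_rng(0)
d, m, T = 16, 4, 8
A = 0.9 * np.eye(d)              # time-invariant, stable transition
B = rng.normal(size=(d, m))
C = rng.normal(size=(d, m))
print(ssm_scan(A, B, C, rng.normal(size=(T, m))).shape)   # -> (8, 4)
```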

Mamba architectures extend SSMs into the selective (input-dependent) domain by parameterizing $A, B, C$ as functions of the input $x_t$, increasing expressivity:

$$h_t = A(x_t)\,h_{t-1} + B(x_t)\,x_t, \qquad y_t = C(x_t)^\top h_t.$$

These blocks are further interleaved with gated MLPs, depthwise convolutions, and SiLU activations. Mamba-2 (the SSD core) replaces $A$ by $\alpha I$ (a scalar times the identity) to optimize memory and hardware access, and also deploys multiple SSM “heads” in a grouped configuration. Hybrid models, such as Zamba and Hymba, combine Mamba blocks with Transformer layers, using various shared and parallel attention structures.
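
A corresponding sketch of the selective update, where the state matrices are produced from the current input by small projections and the transition follows the Mamba-2 scalar-times-identity form; the projection shapes and sizes are assumptions for illustration, not the actual Mamba-2 layout:

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 16, 4                                       # illustrative state and input dims
W_a = rng.normal(size=(m,))                        # projections producing A, B, C from x_t
W_b = rng.normal(size=(m, d * m))
W_c = rng.normal(size=(m, d * m))

def selective_step(h, x_t):
    """One selective SSM step: h_t = A(x_t) h_{t-1} + B(x_t) x_t, y_t = C(x_t)^T h_t."""
    alpha = 1.0 / (1.0 + np.exp(-(W_a @ x_t)))     # A(x_t) = alpha * I (Mamba-2 style scalar gate)
    B_t = (W_b.T @ x_t).reshape(d, m)              # input-dependent B(x_t)
    C_t = (W_c.T @ x_t).reshape(d, m)              # input-dependent C(x_t)
    h = alpha * h + B_t @ x_t
    return h, C_t.T @ h

h = np.zeros(d)
for x_t in rng.normal(size=(8, m)):                # scan over an 8-token toy sequence
    h, y_t = selective_step(h, x_t)
print(y_t.shape)                                   # -> (4,)
```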

2. Compression Methodology in Mamba-Shedder

Mamba-Shedder pursues structured, training-free pruning. Modules or channel groups judged least critical (per calibration-set evaluation) are iteratively eliminated. Supported granularities include:

  • Entire Mamba or Transformer blocks.
  • SSM heads or submodules within blocks.
  • Transformer subblocks: MLPs, multi-head attention units.
  • Channel groups in the MLP (“width pruning”), preserving dense matrix structures for hardware efficiency.

The importance of module $M_i$ is assessed as

$$S_i = \phi(m \setminus M_i;\, \mathcal{C}) - \phi(m;\, \mathcal{C}),$$

where $\phi$ is the average perplexity or accuracy on a calibration set $\mathcal{C}$ (typically 256 Alpaca instruction-tuning samples). Modules with the lowest $S_i$ are pruned first in sequential passes, using simple iterative algorithms, as sketched below. Finer-grained strategies remove channel groups, e.g., consecutive width-$g$ units in MLPs, based on minimum impact as scored by the same metric.
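
A compact Python sketch of this evaluation-guided loop; the `calibration_perplexity`, `disable_module`, and `enable_module` helpers are hypothetical stand-ins for whatever masking and evaluation utilities a concrete implementation provides:

```python
def prune_least_important(model, modules, calib_set, n_to_prune,
                          calibration_perplexity, disable_module, enable_module):
    """Greedy, training-free structured pruning guided by a calibration metric.

    Each pass scores every remaining module M_i as
        S_i = phi(model without M_i; calib_set) - phi(model; calib_set)
    and permanently removes the module with the lowest score (least damage).
    """
    removed = []
    for _ in range(n_to_prune):
        base = calibration_perplexity(model, calib_set)      # phi(m; C)
        scores = {}
        for mod in modules:
            if mod in removed:
                continue
            disable_module(model, mod)                       # temporarily drop M_i
            scores[mod] = calibration_perplexity(model, calib_set) - base
            enable_module(model, mod)                        # restore before the next candidate
        victim = min(scores, key=scores.get)                 # smallest perplexity increase
        disable_module(model, victim)                        # prune it for good
        removed.append(victim)
    return removed
```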

3. Formal Model Evolution Under Pruning

Pruning of blocks and channels modifies the global SSM by reducing the number of parallel heads or shrinking the width of MLPs:

  • For $H$ total SSM heads, removing $r$ of them yields $h_t = \sum_{k=1}^{H-r} A_k h_{t-1} + B_k x_t, \quad y_t = \sum_{k=1}^{H-r} C_k^\top h_t$, with associated reductions in both memory and FLOPs.
  • Channel pruning in MLPs directly shrinks the hidden size $d \to d - g$ for groups of width $g$.
  • In hybrid Mamba/Transformer layers (e.g., Zamba, Hymba), block or subblock removal systematically eliminates selected computation paths.

For “grouped value attention” in Mamba-2, pruning an SSM head nullifies its projection matrix $C_h$.
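
A minimal NumPy illustration of what head and channel-group removal does to the underlying dense weights; the grouped tensor layout here (heads stored along the leading dimension) is an assumption for the example, not the exact Mamba-2 parameterization:

```python
import numpy as np

def drop_ssm_heads(B, C, heads_to_drop):
    """Remove whole SSM heads; B and C have shape (H, d, m), and kept heads stay dense."""
    keep = [h for h in range(B.shape[0]) if h not in set(heads_to_drop)]
    return B[keep], C[keep]

def drop_mlp_channel_group(W_in, W_out, start, g):
    """Width pruning: delete one contiguous group of g hidden channels from an MLP."""
    keep = np.r_[0:start, start + g:W_in.shape[0]]
    return W_in[keep, :], W_out[:, keep]                     # hidden size shrinks d -> d - g

rng = np.random.default_rng(2)
H, d, m, d_mlp, d_model, g = 8, 16, 4, 4096, 1024, 512       # illustrative sizes
B, C = rng.normal(size=(H, d, m)), rng.normal(size=(H, d, m))
B2, C2 = drop_ssm_heads(B, C, heads_to_drop=[3, 5])          # 8 -> 6 heads
W_in, W_out = rng.normal(size=(d_mlp, d_model)), rng.normal(size=(d_model, d_mlp))
W_in2, W_out2 = drop_mlp_channel_group(W_in, W_out, start=1024, g=g)
print(B2.shape, W_in2.shape)                                 # (6, 16, 4) (3584, 1024)
```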

4. Efficiency, Accuracy, and Hardware Implications

Mamba-Shedder pruning enables explicit control over the trade-off among parameter count, inference speed, and downstream accuracy, as outlined below.

| Model | Pruning Type | Prune Fraction | ΔPPL | ΔAvg-Acc | Inference Speedup |
|---|---|---|---|---|---|
| Mamba-2.8B | block | 14/64 (20.9%) | +3.28 | –6.1% | 1.29× |
| Mamba2-2.7B | SSM module | 20/64 (31%) | +1.79 | –1.6% | 1.18× |
| Zamba2-2.7B | mixed | 15.5% | +1.17 | –1.3% | 1.39× |
| Falcon-Mamba | block | 10/64 | +1.82 | –5.6% | N/A |

Pruning up to ~20% of blocks or submodules yields a 10–20% model size reduction and a ~1.3–1.4× speedup, with single-digit relative accuracy drops. Hardware-friendliness is achieved by restricting pruning to entire blocks or contiguous groups, maintaining dense GEMM patterns and avoiding slow sparse kernels.

A brief post-pruning fine-tuning (1–2 epochs) can restore nearly all lost performance (e.g., for Mamba2-2.7B, the post-prune ΔPPL of +1.79 drops to +0.34 after tuning, and Δacc recovers to +1.0%).

5. Empirical Results and Evaluation Tasks

Mamba-Shedder was evaluated on:

  • Autoregressive language modeling using Lambada (perplexity).
  • Zero-shot reasoning benchmarks (HellaSwag, PIQA, ARC-Easy, ARC-Challenge, WinoGrande, OBQA); an example harness invocation is sketched below.
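
For context, a hedged sketch of how such zero-shot evaluations are typically run with the lm-evaluation-harness; the checkpoint path is a placeholder, this is not part of Mamba-Shedder itself, and the exact harness API may differ across versions:

```python
# Assumes the pruned checkpoint is loadable as a Hugging Face causal LM and that
# lm-evaluation-harness (lm_eval, v0.4+) is installed; API details may vary by version.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=/path/to/pruned-mamba",   # hypothetical local checkpoint path
    tasks=["lambada_openai", "hellaswag", "piqa", "arc_easy",
           "arc_challenge", "winogrande", "openbookqa"],
    batch_size=8,
)
print(results["results"])                            # per-task perplexity / accuracy
```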

Pruning studies (training-free and post-fine-tuning) were conducted on Mamba-2.8B, Mamba2-2.7B, Zamba2-2.7B, Falcon-Mamba-7B, and Hymba-1.5B. Core findings include:

  • Block and head pruning individually or jointly reduce inference time by up to 1.4×.
  • Hybrid (Mamba + Transformer) pruning strategies further amplify efficiency with only moderate impact on accuracy.
  • Recovery-tuned pruned models match or approach original baseline performance.

This operationalizes the discovery that substantial “structural redundancy” remains even in state-of-the-art SSM sequence models (Muñoz et al., 28 Jan 2025).

6. Deployment Guidelines and Best Practices

Mamba-Shedder's workflow recommendations are as follows:

  • Employ ~256 calibration samples for importance estimation.
  • Set the MLP channel group size $g$ (e.g., $g = 1024$ for Zamba2) to balance computational granularity and pruning impact.
  • Prune sequentially, starting from coarser units (blocks) toward finer ones (SSM heads, MLP channels), stopping around ~20% to keep the impact low (<3% average-accuracy loss).
  • Restrict pruning to entire dense modules/contiguous groups to maximize hardware acceleration.
  • Post-prune, conduct 1–2 epochs’ fine-tuning on in-domain (calibration or downstream) data.
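
These recommendations can be summarized as a simple configuration sketch; the field names and structure are illustrative, not Mamba-Shedder's actual configuration schema:

```python
# Illustrative pruning configuration mirroring the guidelines above
# (keys are hypothetical; values follow the recommendations in this section).
PRUNING_CONFIG = {
    "calibration": {
        "dataset": "alpaca",            # instruction-tuning samples used for scoring
        "num_samples": 256,             # ~256 calibration samples
    },
    "granularity_order": [              # prune coarse-to-fine
        "block",                        # whole Mamba / Transformer blocks first
        "ssm_head",                     # then SSM heads within remaining blocks
        "mlp_channel_group",            # finally contiguous MLP channel groups
    ],
    "mlp_group_size": 1024,             # g, e.g. 1024 for Zamba2
    "max_prune_fraction": 0.20,         # stop near 20% for <3% avg-accuracy loss
    "structured_only": True,            # keep dense GEMMs; no unstructured sparsity
    "recovery_finetune_epochs": 2,      # 1-2 epochs on in-domain data after pruning
}
```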

7. Significance and Implications

Mamba-Shedder demonstrates that systematic, evaluation-guided, training-free structured pruning is feasible and effective for SSM-based and hybrid sequence models. The methodology enables practitioners to realize practical acceleration with minimal loss of accuracy, simplified by hardware-friendly, dense-matrix preserving modifications. This suggests a fundamental “structural redundancy” present even in highly optimized post-Transformer models, facilitating their adaptation to deployment settings with strict efficiency or latency requirements (Muñoz et al., 28 Jan 2025).

References (1)

  • Muñoz et al., Mamba-Shedder, 28 Jan 2025.