Mamba-Shedder: Structured Model Pruning

Updated 21 February 2026
  • Mamba-Shedder is a structured post-training compression framework for Selective SSMs that uses metric-based sensitivity analysis to prune redundant architectural units.
  • It employs a training-free, iterative pruning strategy that targets blocks, SSM modules, attention heads, and MLP channels to improve inference efficiency.
  • Empirical results show up to 1.4× speedup and 10–20% compression with minimal impact on perplexity and accuracy, supporting efficient deployment.

Mamba-Shedder refers to a structured post-training compression framework specifically developed for Selective Structured State Space Models (SSMs), including the Mamba family and its hybrid derivatives. It leverages sensitivity analysis to identify and prune redundant architectural units—blocks, SSM modules, multi-head attention heads (MHAs), MLP sublayers, and channel groups—thus reducing model size and improving inference efficiency with minimal performance degradation. The approach is training-free, relying on metric-based importance scores computed over a small calibration set. Mamba-Shedder achieves up to 1.4× speedup and compression ratios of 10–20% while preserving primary accuracy and perplexity metrics (Muñoz et al., 28 Jan 2025).

1. Selective Structured State Space Models: Architectural Foundations

Selective SSMs, such as the original Mamba, are based on discretized linear state-space systems:

$$\dot{x}(t) = A x(t) + B u(t), \qquad y(t) = C^\top x(t) + D u(t)$$

which yield the classic recurrence form after discretization:

$$h_t = A h_{t-1} + B x_t, \qquad y_t = C^\top h_t$$

where $A \in \mathbb{R}^{N \times N}$, $B \in \mathbb{R}^{N \times d}$, and $C \in \mathbb{R}^{N \times m}$. The convolutional implementation exploits the Toeplitz structure of the discrete-time kernel, allowing the SSM to be computed as a 1D convolution $y = K * x$, with $K_n = C^\top A^n B$. Mamba innovates on this by introducing time-varying, input-selective SSM parameters via a lightweight selection network.
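As a concrete check of the recurrence–convolution equivalence, the following NumPy sketch evaluates the same linear SSM both ways. This is a toy single-input, single-output, time-invariant system (not Mamba's selective parameterization), and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 4, 16                                   # state size, sequence length

A = 0.9 * np.diag(rng.uniform(0.5, 1.0, N))    # stable diagonal state matrix
B = rng.standard_normal((N, 1))                # input projection (d = 1)
C = rng.standard_normal((N, 1))                # output projection (m = 1)
x = rng.standard_normal(T)                     # scalar input sequence

# Recurrent form: h_t = A h_{t-1} + B x_t,  y_t = C^T h_t  (h_{-1} = 0)
h = np.zeros((N, 1))
y_rec = np.empty(T)
for t in range(T):
    h = A @ h + B * x[t]
    y_rec[t] = (C.T @ h).item()

# Convolutional form: y = K * x with Toeplitz kernel K_n = C^T A^n B
K = np.array([(C.T @ np.linalg.matrix_power(A, n) @ B).item() for n in range(T)])
y_conv = np.array([sum(K[n] * x[t - n] for n in range(t + 1)) for t in range(T)])

assert np.allclose(y_rec, y_conv)              # both forms agree
```

In the time-invariant case the two forms are interchangeable; Mamba's input-dependent parameters break the Toeplitz structure, which is why it relies on a hardware-aware scan instead of a plain convolution.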

Variants in the Mamba family include:

  • Mamba-1 (S6 module): Structured diagonalizable $A$ with learnable complex spectrum.
  • Mamba-2 (SSD module): Scalar–identity $A$, plus grouped-value attention (GVA) to mirror Transformer MHA.
  • Hybrids: Zamba (alternating Mamba–Transformer interleaving) and Hymba (hybrid head with joint SSM and attention) (Muñoz et al., 28 Jan 2025).

Each block typically interleaves:

  • An SSM module (S6/SSD)
  • Pointwise convolution + SiLU
  • Gated MLP

All are wrapped within LayerNorm and residual connections in a Transformer-style arrangement.
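The sub-layer arrangement above can be sketched at the shape level. The following NumPy schematic is not the actual Mamba implementation (the real SSM is input-selective and the convolution is depthwise causal); it only illustrates the pre-LayerNorm/residual wrapping of the three sub-layers, and all names are hypothetical:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def silu(x):
    return x / (1.0 + np.exp(-x))

def ssm_scan(x, A, B, C):
    """Sequential stand-in for the selective scan: h_t = A h_{t-1} + B x_t, y_t = C^T h_t."""
    h = np.zeros((A.shape[0], x.shape[1]))     # state, shared across channels
    y = np.empty_like(x)
    for t in range(len(x)):
        h = A @ h + B * x[t]                   # broadcast input over channels
        y[t] = (C * h).sum(0)
    return y

def mamba_block(x, params):
    """One schematic block: SSM -> pointwise nonlinearity -> gated MLP,
    each sub-layer wrapped in pre-LayerNorm and a residual connection."""
    A, B, C, W_up, W_gate, W_down = params
    x = x + ssm_scan(layer_norm(x), A, B, C)   # SSM sub-layer
    x = x + silu(layer_norm(x))                # stands in for depthwise conv + SiLU
    z = layer_norm(x)
    x = x + (silu(z @ W_gate) * (z @ W_up)) @ W_down   # gated MLP sub-layer
    return x

rng = np.random.default_rng(1)
d, hidden, N, T = 8, 16, 4, 10
params = (
    0.9 * np.eye(N),                           # A: stable state matrix
    rng.standard_normal((N, 1)),               # B
    rng.standard_normal((N, 1)),               # C
    0.1 * rng.standard_normal((d, hidden)),    # W_up
    0.1 * rng.standard_normal((d, hidden)),    # W_gate
    0.1 * rng.standard_normal((hidden, d)),    # W_down
)
out = mamba_block(rng.standard_normal((T, d)), params)
```

Because each sub-layer is a residual branch, whole sub-layers (or whole blocks) can be removed while keeping the input–output signature intact, which is precisely what makes the structured pruning in the next section possible without retraining.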

2. Mamba-Shedder Pruning Methodology

Mamba-Shedder implements two core pruning strategies: block/module-level pruning and MLP channel pruning. The methodology is training-free, utilizing importance scores derived from forward passes over a small calibration set ($|\mathcal{C}| = 256$) and a metric function $\phi$ (typically perplexity).

Given model $m$ and substructures $\mathcal{M} = \{M_i\}$, the importance score for structure $M_i$ is

$$S_i = \phi(m \setminus M_i, \mathcal{C}) - \phi(m, \mathcal{C})$$

A lower $S_i$ suggests less impact upon removal, so the unit with minimum $S_i$ is pruned; this repeats iteratively until a desired compression level is reached.
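In code, the score computation can be sketched as follows. Here the "model" is just a list of per-structure contributions and `toy_metric` is an additive stand-in for the perplexity metric $\phi$; both are hypothetical illustrations, not the paper's actual model or evaluator:

```python
def importance_scores(model, metric, calib):
    """S_i = metric(model without M_i, calib) - metric(model, calib).

    A lower score means the structure matters less, so it is safer to prune."""
    base = metric(model, calib)
    return {i: metric(model[:i] + model[i + 1:], calib) - base
            for i in range(len(model))}

# Toy stand-in: each "structure" contributes additively; the metric is a
# perplexity-like quantity (lower is better).
toy_model = [0.5, 0.01, 0.3, 0.02]
toy_metric = lambda m, calib: 10.0 - sum(m)
scores = importance_scores(toy_model, toy_metric, calib=None)
least_important = min(scores, key=scores.get)   # candidate to prune first
```

With this toy metric, removing structure $i$ raises the metric by exactly its contribution, so the unit with the smallest contribution (index 1) scores lowest and would be pruned first.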

Pruning Algorithms

Algorithm 1 (Block/Module Pruning): Iteratively remove blocks or modules with the lowest importance scores, recalculating at each step.

Algorithm 2 (MLP Channel Pruning): Within MLP blocks, iteratively mask and prune groups of channels with the lowest impact, as measured by changes in the evaluation metric.

These techniques can be cascaded for multi-granularity pruning—blocks, MHAs, MLPs, channel groups, and SSMs.
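The greedy iterate–score–remove pattern shared by both algorithms might be sketched as below, again with a toy additive metric standing in for perplexity; this illustrates the loop structure, not the released implementation:

```python
def iterative_prune(model, metric, calib, target_ratio):
    """Greedy training-free pruning: repeatedly drop the structure whose
    removal degrades the metric least, recomputing scores after every removal."""
    original = len(model)
    while 1 - len(model) / original < target_ratio:
        base = metric(model, calib)
        # score every remaining structure, then remove the least important one
        best_i = min(range(len(model)),
                     key=lambda i: metric(model[:i] + model[i + 1:], calib) - base)
        model = model[:best_i] + model[best_i + 1:]
    return model

# Hypothetical 8-structure model with additive contributions.
toy_model = [0.5, 0.01, 0.3, 0.02, 0.4, 0.05, 0.6, 0.03]
toy_metric = lambda m, calib: 10.0 - sum(m)
pruned = iterative_prune(toy_model, toy_metric, calib=None, target_ratio=0.25)
```

Recomputing scores after every removal matters because importance is not independent across units: once one structure is gone, the remaining ones can become more or less redundant, which a single up-front ranking would miss.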

3. Empirical Compression Results

The primary evaluation benchmarks are LAMBADA perplexity and average zero-shot accuracy across HellaSwag, PIQA, ARC*, WinoGrande, and OBQA via lm-eval-harness. Models assessed include Mamba-2.8B, Mamba2-2.7B, Zamba2-2.7B, Hymba-1.5B, and Falcon-Mamba-7B.

| Model | Pruning Target | Compression | ΔPPL | ΔAvgAcc |
|---|---|---|---|---|
| Mamba-2.8B | Block (20.86%) | 20.86% | +3.28 | –6.1% |
| Mamba2-2.7B | SSM (16/64 modules) | 25% | +0.16 | –0.4% |
| Zamba2-2.7B | Multi-granularity (10%) | 10.27% | +1.44 | –0.9% |
| Hymba-1.5B | Block (8/32) | 25% | n/a | –3.3% |
| Falcon-Mamba-7B | Block or SSM (10/64) | 15.6% | +1.82 | –5.6% |

Notably, Mamba2-2.7B and Zamba2-2.7B tolerate removal of up to 16 of 64 SSM modules (25%) with minimal impact on perplexity and accuracy. Multi-granularity pruning (blocks + MHA + MLPs + SSMs) achieves the best accuracy–compression trade-off, e.g., Zamba2-2.7B sees only a –0.9% drop in average accuracy at 10% compression (Muñoz et al., 28 Jan 2025).

4. Inference Acceleration and Efficiency

Inference speedups are measured on NVIDIA Tesla V100s. With block pruning, Mamba-2.8B achieves up to 1.29× faster decoding (14/64 blocks pruned). For Mamba2-2.7B, SSM pruning (24/64) improves prefill and decode speeds by up to 1.20× and 1.18×, respectively. In the Zamba2-2.7B hybrid, multi-target pruning gives decode accelerations up to 1.39× (Table 8 of (Muñoz et al., 28 Jan 2025)). The relationship is monotonic: larger pruning ratios yield greater acceleration but increased accuracy loss.

5. Ablation, Sensitivity, and Model-specific Behaviors

Pruning sensitivity is model-specific. Mamba-1 (S6) is block-pruning tolerant, with degradation accelerating beyond 20%. Mamba-2 is more robust to SSM module pruning, maintaining flat perplexity increases until >24/64 SSMs are removed. Hybrids (e.g., Zamba) are sensitive to block-level pruning and require multi-granular, fine-grained removal strategies to prevent abrupt accuracy drops. The most robust configuration involves sequential application of block, MLP, and SSM pruning with recalibration at each step.

A plausible implication is that redundancy is localized to distinct architectural units depending on the specific SSM variant, and thus, tailoring the pruning schedule to model design is essential for optimal results.

6. Recovery Tuning and Practical Recommendations

Recovery tuning is optional and consists of brief (2-epoch) fine-tuning on the calibration set (e.g., Alpaca). This suffices to close the gap to the dense baseline for both perplexity and average accuracy. For example, Zamba2-2.7B pruned at 10.27% with 18 SSMs removed achieves post-tuning accuracy (67.0%) within 1% of the dense model (67.2%).

Best practices for Mamba-Shedder application include:

  1. Define pruning targets (blocks, modules, MLP channels, SSM heads, etc.).
  2. Compute training-free importance scores on a small calibration set, using a held-out metric (typically perplexity or accuracy).
  3. Prune the least important unit iteratively, recalculating scores after each removal.
  4. Optionally, perform minimal fine-tuning for recovery.

7. Extension and Impact

Mamba-Shedder is generalizable to any SSM-based or hybrid attention–SSM (e.g., Zamba, Hymba) architecture. Across the Mamba family, 10–15% structural pruning typically yields 1–2% absolute accuracy loss and 1–3 point increases in perplexity. The method enables efficient deployment of large-scale SSM-based models, supporting edge or latency-sensitive inference scenarios without architectural redesign or retraining. The code is publicly available, enhancing the reproducibility and adoption of these compression strategies (Muñoz et al., 28 Jan 2025).
