Mamba-Shedder: Structured Model Pruning
- Mamba-Shedder is a structured post-training compression framework for Selective SSMs that uses metric-based sensitivity analysis to prune redundant architectural units.
- It employs a training-free, iterative pruning strategy that targets blocks, SSM modules, attention heads, and MLP channels to improve inference efficiency.
- Empirical results show up to 1.4× speedup and 10–20% compression with minimal impact on perplexity and accuracy, supporting efficient deployment.
Mamba-Shedder refers to a structured post-training compression framework specifically developed for Selective Structured State Space Models (SSMs), including the Mamba family and its hybrid derivatives. It leverages sensitivity analysis to identify and prune redundant architectural units—blocks, SSM modules, multi-head attention (MHA) heads, MLP sublayers, and channel groups—thus reducing model size and improving inference efficiency with minimal performance degradation. The approach is training-free, relying on metric-based importance scores computed over a small calibration set. Mamba-Shedder achieves up to 1.4× speedup and compression ratios of 10–20% while largely preserving accuracy and perplexity (Muñoz et al., 28 Jan 2025).
1. Selective Structured State Space Models: Architectural Foundations
Selective SSMs, such as the original Mamba, are based on discretized linear state-space systems:

$$h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t),$$

which yield the classic recurrence form after discretization:

$$h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t,$$

where $\bar{A} = \exp(\Delta A)$, $\bar{B} = (\Delta A)^{-1}(\exp(\Delta A) - I)\,\Delta B$, and $\Delta$ is the discretization step. The convolutional implementation exploits the Toeplitz structure of the discrete-time kernel, allowing the SSM to be implemented as a 1D convolution $y = x * \bar{K}$, with $\bar{K} = (C\bar{B},\, C\bar{A}\bar{B},\, \dots,\, C\bar{A}^{L-1}\bar{B})$. Mamba innovates on this by making the SSM parameters time-varying and input-selective, computed by a lightweight selection network.
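The equivalence between the sequential recurrence and the 1D-convolution view can be checked numerically. The sketch below is a minimal single-channel toy (dense $\bar{A}$, no selectivity), not Mamba's actual hardware-aware scan:

```python
import numpy as np

def ssm_scan(A_bar, B_bar, C, x):
    """Sequential recurrence: h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t."""
    h = np.zeros(A_bar.shape[0])
    ys = []
    for xt in x:
        h = A_bar @ h + B_bar * xt  # B_bar: (N,) vector, xt: scalar input
        ys.append(C @ h)
    return np.array(ys)

def ssm_conv(A_bar, B_bar, C, x):
    """Equivalent causal 1D convolution with kernel K_k = C A_bar^k B_bar."""
    L = len(x)
    K = np.array([C @ np.linalg.matrix_power(A_bar, k) @ B_bar for k in range(L)])
    return np.array([sum(K[k] * x[t - k] for k in range(t + 1)) for t in range(L)])
```

Both paths produce identical outputs for any stable $\bar{A}$; Mamba-1 trades between them (scan for decoding, convolution/parallel scan for training-time throughput).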
Variants in the Mamba family include:
- Mamba-1 (S6 module): Structured diagonal state matrix $A$ with learnable complex spectrum.
- Mamba-2 (SSD module): Scalar–identity state matrix $A = aI$, plus grouped-value attention (GVA) to mirror Transformer MHA.
- Hybrids: Zamba (ABAB Mamba–Transformer interleaving), and Hymba (hybrid head with joint SSM and attention) (Muñoz et al., 28 Jan 2025).
Each block typically interleaves:
- An SSM module (S6/SSD)
- Pointwise convolution + SiLU
- Gated MLP

All are wrapped within LayerNorm and residual connections in a Transformer-style arrangement.
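The block composition above can be sketched as a forward pass. This is a simplified NumPy toy (the pointwise convolution is folded into a projection, and `ssm_fn` stands in for the S6/SSD mixer); the weight names are hypothetical:

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def layer_norm(x, eps=1e-5):
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def mamba_block(x, ssm_fn, W_in, W_gate, W_out):
    """Transformer-style block: LayerNorm -> (SiLU + SSM path) * (gate path)
    -> output projection -> residual add. Shapes: x is (L, D)."""
    z = layer_norm(x)
    main = silu(z @ W_in)      # input projection + SiLU (conv omitted for brevity)
    main = ssm_fn(main)        # sequence mixing via the SSM module
    gate = silu(z @ W_gate)    # gating branch of the gated MLP
    return x + (main * gate) @ W_out  # gated combine, project, residual
```

Block-level pruning in Mamba-Shedder amounts to skipping such a block entirely, which the residual connection makes well-defined (the block reduces to the identity).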
2. Mamba-Shedder Pruning Methodology
Mamba-Shedder implements two core pruning strategies: block/module-level pruning and MLP channel pruning. The methodology is training-free, utilizing importance scores derived from forward passes over a small calibration set and a metric function (typically perplexity).
Given a model $F$ and prunable substructures $\{s_1, \dots, s_n\}$, the importance score for structure $s_i$ is the calibration metric evaluated with $s_i$ removed:

$$I(s_i) = \mathcal{M}(F \setminus s_i).$$

A lower $I(s_i)$ suggests less impact upon removal, so the unit with the minimum score is pruned iteratively until a desired compression level is reached.
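In code, the scoring step is a loop of masked evaluations. The sketch below is a minimal illustration, assuming a hypothetical `evaluate` callback that returns the calibration metric (e.g., perplexity) with the given units disabled:

```python
def importance_scores(evaluate, structures):
    """Training-free importance: score each prunable unit by the calibration
    metric measured with that unit removed. `evaluate(removed)` is assumed to
    run forward passes on the calibration set with the units in `removed`
    masked out and return a lower-is-better metric such as perplexity."""
    return {s: evaluate({s}) for s in structures}

def least_important(scores):
    """The unit whose removal degrades the metric least."""
    return min(scores, key=scores.get)
```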
Pruning Algorithms
Algorithm 1 (Block/Module Pruning): Iteratively remove blocks or modules with the lowest importance scores, recalculating at each step.
Algorithm 2 (MLP Channel Pruning): Within MLP blocks, iteratively mask and prune groups of channels with the lowest impact, as measured by changes in the evaluation metric.
These techniques can be cascaded for multi-granularity pruning—blocks, MHAs, MLPs, channel groups, and SSMs.
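The iterative procedure shared by both algorithms can be sketched as a greedy loop with re-scoring after every removal. This is an illustrative skeleton, not the paper's exact implementation; `evaluate` and `tolerance` are assumptions:

```python
def iterative_prune(evaluate, structures, max_removed, tolerance):
    """Greedy pruning in the spirit of Algorithms 1-2: repeatedly remove the
    unit whose removal hurts the calibration metric least, re-scoring after
    each removal, until the budget is reached or further removal would
    degrade the metric by more than `tolerance` over the dense baseline."""
    removed = set()
    baseline = evaluate(removed)
    while len(removed) < max_removed:
        candidates = [s for s in structures if s not in removed]
        scores = {s: evaluate(removed | {s}) for s in candidates}
        best = min(scores, key=scores.get)
        if scores[best] - baseline > tolerance:
            break  # next-best removal already costs too much
        removed.add(best)
    return removed
```

Running the loop once per granularity (blocks, then MHAs, then MLP channel groups, then SSMs) yields the cascaded multi-granularity schedule described above.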
3. Empirical Compression Results
The primary evaluation benchmarks are LAMBADA perplexity and average zero-shot accuracy across HellaSwag, PIQA, ARC*, WinoGrande, and OBQA via lm-eval-harness. Models assessed include Mamba-2.8B, Mamba2-2.7B, Zamba2-2.7B, Hymba-1.5B, and Falcon-Mamba-7B.
| Model | Pruning Target | Compression | ∆PPL | ∆AvgAcc |
|---|---|---|---|---|
| Mamba-2.8B | Block (20.86%) | 20.86% | +3.28 | –6.1% |
| Mamba2-2.7B | SSM (16/64 modules) | 25% | +0.16 | –0.4% |
| Zamba2-2.7B | Multi-granularity (10%) | 10.27% | +1.44 | –0.9% |
| Hymba-1.5B | Block (8/32) | 25% | n/a | –3.3% |
| Falcon-Mamba-7B | Block or SSM (10/64) | 15.6% | +1.82 | –5.6% |
Notably, Mamba2-2.7B and Zamba2-2.7B tolerate removal of up to 16 SSM modules (<25%) with minimal impact on perplexity and accuracy. Multi-granularity pruning (blocks + MHA + MLPs + SSMs) achieves the best accuracy–compression trade-off, e.g., Zamba2-2.7B sees only a –0.9% drop in average accuracy at 10% compression (Muñoz et al., 28 Jan 2025).
4. Inference Acceleration and Efficiency
Inference speedups are measured on NVIDIA Tesla V100s. With block pruning, Mamba-2.8B achieves up to 1.29× faster decoding (14/64 blocks pruned). For Mamba2-2.7B, SSM pruning (24/64) improves prefill and decode speeds by up to 1.20× and 1.18×, respectively. In the Zamba2-2.7B hybrid, multi-target pruning gives decode accelerations up to 1.39× (Table 8 of (Muñoz et al., 28 Jan 2025)). The relationship is monotonic: larger pruning ratios yield greater acceleration but increased accuracy loss.
5. Ablation, Sensitivity, and Model-specific Behaviors
Pruning sensitivity is model-specific. Mamba-1 (S6) is block-pruning tolerant, with degradation accelerating beyond 20%. Mamba-2 is more robust to SSM module pruning, maintaining flat perplexity increases until >24/64 SSMs are removed. Hybrids (e.g., Zamba) are sensitive to block-level pruning and require multi-granular, fine-grained removal strategies to prevent abrupt accuracy drops. The most robust configuration involves sequential application of block, MLP, and SSM pruning with recalibration at each step.
A plausible implication is that redundancy is localized to distinct architectural units depending on the specific SSM variant, and thus, tailoring the pruning schedule to model design is essential for optimal results.
6. Recovery Tuning and Practical Recommendations
Recovery tuning is optional and consists of brief (2-epoch) fine-tuning on the calibration set (e.g., Alpaca). This suffices to close the gap to the dense baseline for both perplexity and average accuracy. For example, Zamba2-2.7B pruned at 10.27% with 18 SSMs removed achieves post-tuning accuracy (67.0%) within 1% of the dense model (67.2%).
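The intuition behind recovery tuning—that a few gradient steps on the surviving parameters can absorb much of the pruning damage—can be demonstrated on a toy linear model. This is a synthetic illustration with made-up data, not Mamba-Shedder's actual Alpaca fine-tuning recipe:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 8))
w_true = rng.standard_normal(8)
y = X @ w_true                      # dense "model" fits the data exactly

# Structured pruning: zero out two channels and freeze them at zero.
keep = np.ones(8, dtype=bool)
keep[[3, 5]] = False
w_pruned = w_true * keep

def mse(w):
    return float(np.mean((X @ w - y) ** 2))

loss_pruned = mse(w_pruned)

# Recovery tuning: brief gradient descent on the surviving weights only.
w_rec = w_pruned.copy()
for _ in range(200):
    grad = 2 * X.T @ (X @ w_rec - y) / len(X)
    w_rec -= 0.05 * grad * keep     # pruned channels never receive updates
loss_recovered = mse(w_rec)
```

The surviving weights re-adjust to partially compensate for the removed channels, so `loss_recovered` falls below `loss_pruned`—the same mechanism, at much larger scale, that lets 2-epoch tuning close most of the gap to the dense baseline.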
Best practices for Mamba-Shedder application include:
- Define pruning targets (blocks, modules, MLP channels, SSM heads, etc.).
- Compute training-free importance scores on a small validation set, using a held-out metric (typically perplexity or accuracy).
- Prune the least important unit iteratively, recalculating scores after each removal.
- Optionally, perform minimal fine-tuning for recovery.
7. Extension and Impact
Mamba-Shedder is generalizable to any SSM-based or hybrid attention–SSM (e.g., Zamba, Hymba) architecture. Across the Mamba family, 10–15% structural pruning typically yields 1–2% absolute accuracy loss and 1–3 point increases in perplexity. The method enables efficient deployment of large-scale SSM-based models, supporting edge or latency-sensitive inference scenarios without architectural redesign or retraining. The code is publicly available, enhancing the reproducibility and adoption of these compression strategies (Muñoz et al., 28 Jan 2025).