MambaIR: State-Space Model for Restoration

Updated 17 September 2025
  • MambaIR is a state-space model that restores images by combining global receptive field modeling with linear computational complexity.
  • It integrates local pixel enhancement and channel attention to mitigate issues like local pixel forgetting and channel redundancy.
  • Quantitative tests show MambaIR outperforms models like SwinIR by up to 0.45 dB PSNR under comparable computational budgets.

MambaIR is a state-space model–based image restoration framework designed to balance global receptive field modeling and computational efficiency for low-level vision tasks such as super-resolution and denoising. Leveraging the Selective Structured State Space Model (Mamba), MambaIR overcomes the typical limitations of CNNs and Transformers—namely limited receptive field for CNNs and quadratic computational cost for Transformers—by providing global context with linear complexity. The model architecture introduces key enhancements, specifically local pixel enhancement and channel attention, to address the "local pixel forgetting" and "channel redundancy" inherent to vanilla Mamba when applied to visual data. Quantitative and qualitative experiments substantiate its superiority, with MambaIR outperforming models like SwinIR by up to 0.45 dB PSNR under similar computational budgets and exhibiting a larger effective receptive field.

1. Theoretical Motivation and Context

MambaIR arises from the need to bridge the gap between existing visual backbone architectures—CNNs, which struggle to capture long-range dependencies due to local convolution, and Transformers, which provide global modeling via self-attention at a prohibitive quadratic cost. MambaIR is based on the Mamba state-space model, which exhibits linear scaling with input size and excels in modeling long-range dependencies by sequential 1D processing. However, naively flattening 2D images into 1D sequences causes dissociation of spatially adjacent pixels (local pixel forgetting) and requires many hidden states to maintain global context (channel redundancy). MambaIR resolves these constraints through architectural modifications, aiming to maximize both locality and global awareness with efficient computation.

2. Architectural Composition and Innovations

The architecture of MambaIR consists of three principal stages:

  • Shallow Feature Extraction: Initial image features are extracted using a 3×3 convolutional layer.
  • Deep Feature Extraction: A series of Residual State-Space Groups (RSSGs), each composed of several Residual State-Space Blocks (RSSBs), processes the features. The RSSB is the central architectural innovation, described below.
  • Reconstruction: Aggregates shallow and deep features to restore high-quality images.
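
A minimal PyTorch-style sketch of this three-stage layout is given below; module names, default hyperparameters, and the pixel-shuffle reconstruction head are illustrative assumptions rather than the authors' reference implementation, and the RSSG internals are injected through a placeholder factory.

```python
import torch
import torch.nn as nn

class MambaIRSketch(nn.Module):
    """Illustrative three-stage layout: shallow conv -> RSSG stack -> reconstruction."""
    def __init__(self, in_ch=3, dim=64, num_groups=4, upscale=2, rssg_factory=None):
        super().__init__()
        # Shallow feature extraction: a single 3x3 convolution.
        self.shallow = nn.Conv2d(in_ch, dim, kernel_size=3, padding=1)
        # Deep feature extraction: a stack of Residual State-Space Groups (RSSGs).
        # Each group would contain several RSSBs; an identity module stands in here.
        factory = rssg_factory or (lambda: nn.Identity())
        self.groups = nn.ModuleList([factory() for _ in range(num_groups)])
        self.conv_after_body = nn.Conv2d(dim, dim, kernel_size=3, padding=1)
        # Reconstruction: aggregate features and upsample with pixel shuffle.
        self.reconstruct = nn.Sequential(
            nn.Conv2d(dim, dim * upscale ** 2, kernel_size=3, padding=1),
            nn.PixelShuffle(upscale),
            nn.Conv2d(dim, in_ch, kernel_size=3, padding=1),
        )

    def forward(self, x):
        shallow = self.shallow(x)
        feat = shallow
        for group in self.groups:
            feat = group(feat)
        # Global residual: combine shallow and deep features before reconstruction.
        feat = self.conv_after_body(feat) + shallow
        return self.reconstruct(feat)
```

For example, `MambaIRSketch(upscale=2)(torch.randn(1, 3, 64, 64))` yields a 1×3×128×128 output.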

RSSB Design

Each RSSB addresses both local and global limitations of state-space modeling for images:

  • Vision State-Space Module (VSSM):
    • Applies layer normalization, then processes the flattened features through two parallel branches: a depthwise convolution followed by a 2D selective scan and layer normalization (capturing global context), and a linear layer with SiLU activation.
    • These branches are fused using a Hadamard product and projected back to the original channel dimensionality.
  • Local Enhancement:
    • After VSSM, a bottleneck convolution branch (compressing and then expanding channels by a factor γ) is used to restore neighborhood pixel interactions lost during 1D scanning.
  • Channel Attention:
    • Following local enhancement, a channel attention (CA) mechanism akin to squeeze-and-excitation selectively scales channels to suppress redundancy and amplify discriminative information.
  • Learnable Scaling and Residuals:
    • Two learnable scale factors (s and s′) weight the skip connections for balanced feature integration:
    • Intermediate: $Z^l = \mathrm{VSSM}(\mathrm{LN}(F_D^l)) + s \cdot F_D^l$
    • Output: $F_D^{l+1} = \mathrm{CA}(\mathrm{Conv}(\mathrm{LN}(Z^l))) + s' \cdot Z^l$
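
A hedged PyTorch-style sketch of this RSSB update follows. A squeeze-and-excitation block stands in for CA, per-channel scale parameters stand in for s and s′, and the VSSM (which would contain the 2D selective scan) is injected as a placeholder; these are illustrative choices, not the paper's exact code.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention: global pooling -> gating weights."""
    def __init__(self, dim, reduction=16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(dim, dim // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(dim // reduction, dim, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.gate(x)

class RSSB(nn.Module):
    """Residual State-Space Block: Z = VSSM(LN(F)) + s*F ; F' = CA(Conv(LN(Z))) + s'*Z."""
    def __init__(self, dim, vssm, gamma=2):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.vssm = vssm                    # Vision State-Space Module (2D selective scan)
        self.ln2 = nn.LayerNorm(dim)
        # Local enhancement: bottleneck convolution (compress by gamma, then expand).
        self.local_conv = nn.Sequential(
            nn.Conv2d(dim, dim // gamma, 3, padding=1), nn.GELU(),
            nn.Conv2d(dim // gamma, dim, 3, padding=1),
        )
        self.ca = ChannelAttention(dim)
        # Learnable scale factors s and s' weighting the two skip connections.
        self.s = nn.Parameter(torch.ones(dim))
        self.s_prime = nn.Parameter(torch.ones(dim))

    def forward(self, f):                   # f: (B, C, H, W)
        def ln(x, norm):                    # apply LayerNorm over the channel dimension
            return norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        z = self.vssm(ln(f, self.ln1)) + self.s.view(1, -1, 1, 1) * f
        return self.ca(self.local_conv(ln(z, self.ln2))) + self.s_prime.view(1, -1, 1, 1) * z
```

With a stand-in such as `RSSB(64, vssm=nn.Identity())`, a `(1, 64, 32, 32)` tensor passes through with its shape unchanged, the same invariant the real block maintains.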

These enhancements directly address the architectural deficiencies of vanilla Mamba in the vision domain.

3. Comparison with Contemporary Restoration Backbones

MambaIR surpasses both CNN- and Transformer-based models with respect to global modeling and computational efficiency:

  • CNN Backbones: Effective receptive field is typically local due to the finite kernel size. Achieving a comparable global field requires deep stacking and increased parameters.
  • Transformer Backbones (e.g., SwinIR): Global context is modeled via patch-based self-attention, incurring quadratic cost or limiting global interaction through windowed attention.
  • MambaIR: Achieves a global receptive field with linear computational complexity, directly activating all pixels in the image without sacrificing efficiency.
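
The complexity contrast can be made concrete with a back-of-the-envelope cost model (token count N = H·W, an illustrative channel width d = 180, SSM state size d_state; constant factors are omitted, so the figures indicate scaling trends only, not measured FLOPs):

```python
def global_attention_cost(n_tokens: int, dim: int) -> int:
    """Dense self-attention over all tokens scales as O(N^2 * d)."""
    return n_tokens ** 2 * dim

def selective_scan_cost(n_tokens: int, dim: int, d_state: int = 16) -> int:
    """A selective state-space scan scales as O(N * d * d_state) -- linear in N."""
    return n_tokens * dim * d_state

for side in (64, 128, 256):                      # feature-map side length (H = W)
    n = side * side
    print(f"{side}x{side}: attention ~{global_attention_cost(n, 180):.3e}, "
          f"selective scan ~{selective_scan_cost(n, 180):.3e}")
```

Doubling the feature-map side multiplies N by four; the attention term then grows sixteen-fold while the scan term grows only four-fold.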

On standard super-resolution benchmarks, MambaIR delivers up to 0.45 dB greater PSNR than SwinIR with similar compute, demonstrating its efficacy across image scales and datasets (e.g., Manga109).

4. Experimental and Ablation Studies

Extensive experiments validate the design:

  • Super-resolution tasks (scales ×2, ×3, ×4): MambaIR demonstrates consistently higher PSNR/SSIM than EDSR, RCAN, and SwinIR.
  • Denoising tasks: Gaussian and real image denoising comparisons with DRUNet and Restormer indicate competitive or improved PSNR.
  • Ablation on RSSB: Removal of the convolution branch or substituting it with an MLP results in measurable performance degradation, confirming the necessity of local enhancement. Absence of channel attention also reduces performance, highlighting channel redundancy suppression as critical.
  • Effective Receptive Field Visualizations: Show that MambaIR uniquely provides a truly global receptive field, which sets it apart from other methods.
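
A common way to produce such an effective-receptive-field map (a generic recipe, not necessarily the authors' exact protocol) is to backpropagate from a central output pixel and inspect the magnitude of the resulting input gradient:

```python
import torch

def effective_receptive_field(model, image_size=128, channels=3):
    """Gradient of a central output pixel w.r.t. the input; locations with large
    gradient magnitude are the ones that actually influence that output (the ERF)."""
    x = torch.randn(1, channels, image_size, image_size, requires_grad=True)
    y = model(x)
    cy, cx = y.shape[-2] // 2, y.shape[-1] // 2
    y[..., cy, cx].sum().backward()              # scalar objective: centre pixel, all channels
    erf = x.grad.abs().sum(dim=1).squeeze(0)     # aggregate over input channels -> (H, W)
    return erf / (erf.max() + 1e-12)             # normalise for visualisation

# e.g. erf_map = effective_receptive_field(MambaIRSketch()); in practice the map is
# usually averaged over many random or natural inputs before plotting.
```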

Table: Representative Quantitative Gains

| Model   | Task / Dataset      | PSNR Improvement                                      |
|---------|---------------------|-------------------------------------------------------|
| MambaIR | SR / Manga109       | +0.45 dB (vs. SwinIR)                                  |
| MambaIR | Denoising / Various | Competitive with or higher than DRUNet and Restormer  |

5. Implementation and Scalability Considerations

  • Computational Complexity: Linear in the sequence/image size owing to selective state-space modeling; scaling the image merely increases computation proportionally.
  • Memory and Resource Usage: Comparable to or lower than Transformer-based backbones at equivalent performance, owing to linear-complexity operations and the suppression of channel redundancy.
  • Deployment: Well-suited for both high-performance server-side processing and edge deployments in consumer photography pipelines, due to its efficiency and scalability.
  • Limitations and Future Directions: The paper highlights more advanced 2D selective scan strategies and tighter control of redundancy in the state representations as promising directions. Refining locality modeling and the scan strategy may yield additional gains in performance and efficiency, particularly for very high-resolution imagery.
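
For very high-resolution inputs, a standard deployment pattern (generic to restoration backbones, not specific to MambaIR) is overlapping tiled inference; the sketch below assumes a `model(tile)` call that returns a restored tile with the same channel count and a spatial size equal to the input size times a known `scale` factor.

```python
import torch

@torch.no_grad()
def tiled_restore(model, image, tile=256, overlap=16, scale=1):
    """Restore a large image tile by tile; overlapping regions are averaged to hide seams."""
    b, c, h, w = image.shape
    out = torch.zeros(b, c, h * scale, w * scale, device=image.device)
    weight = torch.zeros_like(out)
    step = tile - overlap
    for top in range(0, h, step):
        for left in range(0, w, step):
            t = min(top, max(h - tile, 0))       # clamp so tiles never run off the image
            l = min(left, max(w - tile, 0))
            restored = model(image[:, :, t:t + tile, l:l + tile])
            ts, ls = t * scale, l * scale
            th, tw = restored.shape[-2], restored.shape[-1]
            out[:, :, ts:ts + th, ls:ls + tw] += restored
            weight[:, :, ts:ts + th, ls:ls + tw] += 1
    return out / weight.clamp(min=1)
```

Because the cost per tile is linear in the tile's pixel count, the total cost stays roughly linear in the full image size, which matches the model's scaling behaviour.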

6. Real-World Applications and Impact

  • Super-resolution: Suitable for demanding use-cases like remote sensing, medical imaging enhancement, and consumer photo upscaling.
  • Denoising: Effective for camera pipelines and restoration of real-world noisy images.
  • JPEG Artifact Reduction: Directly relevant for compression artifact removal in practical systems.
  • General Restoration: Adaptable for related restoration tasks such as deblurring and inpainting where global context is important.

This paradigm introduces a new class of efficient, globally-aware visual backbones, positioning state-space models as viable contenders with CNNs and Transformers for low-level vision tasks. MambaIR is foundational for ongoing research, with subsequent work suggesting refinements in scan strategy, SSM discretization (such as first-order hold conditions), and non-causal modeling extensions to further push state-of-the-art results.
