MMMamba: Cross-Modal Fusion for Pan-Sharpening
- MMMamba is a cross-modal fusion framework based on structured state-space models designed for pan-sharpening and zero-shot image enhancement.
- It utilizes a multimodal interleaved scanning mechanism and deep token-level conditioning to achieve robust spatial-spectral correspondence with linear computational complexity.
- Experimental benchmarks on datasets like WV-II show superior PSNR, SSIM, and visual quality compared to conventional CNN, Transformer, and ViT-based methods.
MMMamba refers to a class of cross-modal frameworks and architectures built around structured state-space models (SSMs), specifically variants of the “Mamba” paradigm, for efficient and adaptable fusion of multimodal information. Most prominently, MMMamba denotes the family of architectures for pan-sharpening and zero-shot image enhancement in remote sensing. These models exploit in-context fusion within SSM layers, achieving state-of-the-art results across image fusion benchmarks, while maintaining linear computational complexity. Core innovations include multimodal interleaved scanning mechanisms, deep token-level conditioning, and flexibility for zero-shot generalization to super-resolution tasks. Below, the topic is elucidated across major technical axes and empirical findings, focusing on “MMMamba: A Versatile Cross-Modal In Context Fusion Framework for Pan-Sharpening and Zero-Shot Image Enhancement” (Wang et al., 17 Dec 2025), with cross-references to related Mamba, multimodal, and in-context SSM research.
1. Conceptual Foundation and Objectives
MMMamba models are designed principally for cross-modal fusion, targeting remote sensing pan-sharpening: reconstructing high-resolution multispectral (HRMS) images by integrating a high-resolution panchromatic (PAN) band with a low-resolution multispectral (MS) stack. Conventional deep networks (CNN, Transformer, ViT) rely on channel concatenation or cross-attention, but such operators either lack adaptive fusion capacity or incur quadratic computational cost ($\mathcal{O}(N^2)$ for sequence length $N$). MMMamba circumvents these limitations by leveraging in-context SSM fusion: MS and PAN streams are concatenated and processed jointly in linear time ($\mathcal{O}(N)$) via Mamba blocks. This deep bidirectional token-level interaction yields robust spatial-spectral correspondence extraction and enables additional capabilities such as zero-shot MS super-resolution without retraining.
A notable aspect is the generality of in-context conditioning: when only an MS image is available (e.g., during inference in super-resolution), MMMamba omits the PAN branch but runs identical fusion logic, treating missing modalities as latent in-context inputs and generalizing across application scenarios.
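The linear-time behaviour stems from the underlying state-space recurrence. As a minimal orientation sketch, the snippet below runs a plain (non-selective, diagonal) discretized recurrence over a joint MS+PAN token sequence; the function, parameter values, and sizes are illustrative assumptions, not the Mamba layer used in the paper:

```python
import torch

def diagonal_ssm_scan(x, a, b, c):
    """Discretized linear recurrence h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.
    x: (L, D) token sequence; a, b, c: (D,) per-channel parameters.
    Cost grows as O(L), unlike the O(L^2) pairwise interactions of attention.
    (Mamba additionally makes a, b, c input-dependent, i.e. "selective".)"""
    L, D = x.shape
    h = torch.zeros(D)
    ys = []
    for t in range(L):
        h = a * h + b * x[t]          # state update
        ys.append(c * h)              # readout
    return torch.stack(ys)

# Joint in-context sequence: MS tokens and PAN tokens scanned together.
ms_tokens, pan_tokens = torch.randn(64, 16), torch.randn(64, 16)
joint = torch.cat([ms_tokens, pan_tokens], dim=0)     # length 2L, still linear cost
y = diagonal_ssm_scan(joint,
                      a=torch.full((16,), 0.9),
                      b=torch.ones(16),
                      c=torch.ones(16))
print(y.shape)   # torch.Size([128, 16])
```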
2. High-Level Architecture and Multimodal Interleaved Scanning
MMMamba is built atop Mamba SSMs, which are state-space layers processing sequences via discretized linear recurrences, adapted for images via directional patch scanning. The architecture comprises the following components (a minimal skeleton sketch is given after the list):
- Shallow modality-specific gated-convolution encoders for MS and PAN (producing features $F_{ms}$ and $F_p$)
- A stack of MMMamba blocks, each handling fused global token streams
- A local decoder acting only on MS features, followed by residual summation with the upsampled LRMS input
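A minimal PyTorch-style skeleton of this layout is sketched below; the class names, hyperparameters, and the placeholder fusion block are assumptions for illustration and stand in for the MMMamba block detailed in Section 4:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedConvEncoder(nn.Module):
    """Shallow modality-specific encoder: conv features gated by a sigmoid branch."""
    def __init__(self, in_ch, dim):
        super().__init__()
        self.feat = nn.Conv2d(in_ch, dim, 3, padding=1)
        self.gate = nn.Conv2d(in_ch, dim, 3, padding=1)

    def forward(self, x):
        return self.feat(x) * torch.sigmoid(self.gate(x))

class PlaceholderBlock(nn.Module):
    """Stand-in for the MMMamba block of Section 4 (MI-scan + local-window SSM fusion)."""
    def __init__(self, dim):
        super().__init__()
        self.mix = nn.Conv2d(2 * dim, 2 * dim, 1)

    def forward(self, f_ms, f_p):
        fused = self.mix(torch.cat([f_ms, f_p], dim=1))
        return fused.chunk(2, dim=1)

class MMMambaNet(nn.Module):
    """Encoders -> stacked fusion blocks -> local decoder on MS features -> residual."""
    def __init__(self, ms_bands=4, dim=32, num_blocks=4, block_cls=PlaceholderBlock):
        super().__init__()
        self.enc_ms = GatedConvEncoder(ms_bands, dim)
        self.enc_pan = GatedConvEncoder(1, dim)
        self.blocks = nn.ModuleList(block_cls(dim) for _ in range(num_blocks))
        self.decoder = nn.Conv2d(dim, ms_bands, 3, padding=1)

    def forward(self, lrms, pan):
        up = F.interpolate(lrms, size=pan.shape[-2:], mode="bicubic", align_corners=False)
        f_ms, f_p = self.enc_ms(up), self.enc_pan(pan)
        for blk in self.blocks:
            f_ms, f_p = blk(f_ms, f_p)    # cross-modal fusion of the two token streams
        return up + self.decoder(f_ms)    # residual summation with upsampled LRMS
```

A forward pass with lrms of shape (B, C, H/4, W/4) and pan of shape (B, 1, H, W) returns an HRMS estimate of shape (B, C, H, W).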
Central to fusion is the Multimodal Interleaved (MI) scanning mechanism. At each block:
- MS and PAN features are normalized, linearly projected, and depthwise convolved with SiLU nonlinearity.
- Patch sequences from MS and PAN are extracted in four spatial directions and interleaved, generating four directional token streams.
- Local-window SSM updates execute fusion per direction.
- Output streams are summed across directions, gated by pointwise multiplication with the original activations, and reprojected per modality.
This MI-scan ensures that each patch-level token sees both modalities from aligned spatial neighborhoods, enhancing cross-modal correspondence and long-range dependency modeling.
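As a concrete illustration of the token layout produced by interleaving along one scan direction (the exact stacking order is an assumption; the SSM update itself is omitted):

```python
import torch

def interleave_tokens(ms_tokens, pan_tokens):
    """Interleave two (B, L, C) token streams from the same scan direction so that
    each spatial position contributes an adjacent MS/PAN token pair (length 2L)."""
    B, L, C = ms_tokens.shape
    return torch.stack([ms_tokens, pan_tokens], dim=2).reshape(B, 2 * L, C)

def deinterleave_tokens(joint):
    """Split a (B, 2L, C) interleaved sequence back into per-modality streams."""
    B, L2, C = joint.shape
    pairs = joint.reshape(B, L2 // 2, 2, C)
    return pairs[:, :, 0], pairs[:, :, 1]

# Example: 8x8 feature maps flattened along one scan direction (64 tokens, 32 channels).
ms, pan = torch.randn(1, 64, 32), torch.randn(1, 64, 32)
seq = interleave_tokens(ms, pan)                  # (1, 128, 32): MS/PAN tokens alternate
ms_back, pan_back = deinterleave_tokens(seq)
assert torch.equal(ms, ms_back) and torch.equal(pan, pan_back)
```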
3. Mathematical Building Blocks and Complexity Analysis
The general pan-sharpening problem is formulated as $\hat{H} = \mathcal{F}_\theta(M, P)$, where $M$ is the low-resolution MS input, $P$ is the high-resolution PAN image, and $\hat{H}$ is the HRMS output.
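For concreteness, a sketch of typical tensor shapes at an illustrative 4x resolution ratio (band count and image sizes are assumptions, not dataset specifics):

```python
import torch

B, C = 1, 4                                  # batch size, number of MS bands (illustrative)
H, W = 256, 256                              # PAN spatial size (illustrative)
M = torch.randn(B, C, H // 4, W // 4)        # low-resolution multispectral input
P = torch.randn(B, 1, H, W)                  # high-resolution panchromatic band
# A pan-sharpening network F_theta maps (M, P) to an HRMS estimate of shape (B, C, H, W).
```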
The MI-scan fusion at each block first forms gated activations $A_{ms} = \sigma(\mathrm{DWConv}(W_{ms}\,\mathrm{LN}(F_{ms})))$ and $A_{p} = \sigma(\mathrm{DWConv}(W_{p}\,\mathrm{LN}(F_{p})))$, and applies the post-SSM fusion $F'_{ms} = \mathrm{LN}(S^{out}_{ms}) \odot A_{ms}$, $F'_{p} = \mathrm{LN}(S^{out}_{p}) \odot A_{p}$, where $\sigma$ is typically SiLU and $W_{ms}$, $W_{p}$, and the per-modality output projections are learned.
Each MMMamba block processes sequences of length $L = 2HW$ (two modalities, image height $H$, width $W$) at $\mathcal{O}(L)$ cost, compared to $\mathcal{O}(L^2)$ for self-attention. Local windows can replace global SSM kernels to enforce spatial locality and further reduce memory.
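A back-of-the-envelope comparison of the two scaling terms (interaction counts only, ignoring channel width and constant factors; the feature-map size is illustrative):

```python
H, W = 256, 256                  # illustrative feature-map size
L = 2 * H * W                    # interleaved sequence: MS + PAN tokens
attention_pairs = L ** 2         # self-attention: O(L^2) pairwise interactions
ssm_steps = L                    # Mamba-style scan: O(L) recurrent updates

print(f"L = {L:,}")                                        # L = 131,072
print(f"attention ~ {attention_pairs:.2e} interactions")   # ~1.72e+10
print(f"SSM scan  ~ {ssm_steps:.2e} steps")                # ~1.31e+05
```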
4. Algorithmic Pipeline
The MMMamba block pseudocode is:
```
A_ms = SiLU(DWConv(Linear(LN(F_ms))))        # gated MS activation
A_p  = SiLU(DWConv(Linear(LN(F_p))))         # gated PAN activation
Tokenize A_ms, A_p into patches along 4 scan directions
S_int = Interleave(all 4 directional patch sequences)
Split S_int into 4 directional streams
for k in directions:
    (S1, S2)     = SplitModalityStream(S_int^k)   # separate MS / PAN tokens
    (M1^k, M2^k) = LocalWindowMI-Scan(S1, S2)     # local-window SSM fusion per direction
S_out1 = sum_k M1^k                          # merge directional outputs (MS)
S_out2 = sum_k M2^k                          # merge directional outputs (PAN)
F_ms' = LN(S_out1) * A_ms                    # elementwise gating
F_p'  = LN(S_out2) * A_p
F_ms_out = reshape(Linear(F_ms'))            # per-modality reprojection
F_p_out  = reshape(Linear(F_p'))
return F_ms_out, F_p_out
```
The design is modular: the MI-scan, SSM-convolution, and fusion functions are interchangeable, allowing the block to be tailored to specific cross-modal or single-modal enhancement tasks.
5. Zero-Shot Image Enhancement
MMMamba achieves zero-shot MS super-resolution by reusing its in-context modality fusion logic. At inference, only the MS image is fed: the MS stream is duplicated in the MI-scan, treating the missing PAN as an all-zero input, and the network produces a high-resolution MS output without any retraining or fine-tuning. This extension rests on the architecture's deep token-level in-context fusion and its lack of fixed cross-attention operators, which together allow generalization to unseen tasks.
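A minimal sketch of this zero-shot usage pattern, reusing the hypothetical MMMambaNet and PlaceholderBlock classes from the skeleton in Section 2; both ways of filling the missing PAN slot are shown, and neither is claimed to match the paper's exact procedure:

```python
import torch
import torch.nn.functional as F

# Assumes MMMambaNet / PlaceholderBlock from the Section 2 sketch are in scope.
model = MMMambaNet(ms_bands=4, dim=32, num_blocks=4).eval()

lrms = torch.randn(1, 4, 64, 64)            # only an MS image is available at inference
target_size = (256, 256)                    # desired super-resolution output size

with torch.no_grad():
    # Option 1: treat the missing PAN modality as an all-zero in-context input.
    pan_zero = torch.zeros(1, 1, *target_size)
    sr_zero = model(lrms, pan_zero)

    # Option 2: duplicate MS content (here its band mean) into the PAN slot.
    pan_dup = F.interpolate(lrms.mean(dim=1, keepdim=True), size=target_size,
                            mode="bicubic", align_corners=False)
    sr_dup = model(lrms, pan_dup)

print(sr_zero.shape, sr_dup.shape)          # both torch.Size([1, 4, 256, 256])
```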
6. Experimental Results and Ablation Findings
Empirical benchmarks demonstrate MMMamba’s superiority:
- WV-II: PSNR=42.31 dB, SSIM=0.9733, SAM=0.0209, ERGAS=0.8888, outperforming Pan-Mamba by 0.08 dB PSNR.
- GF-2, WV-III: gains up to +0.3 dB PSNR.
- Full-resolution GF-2: QNR=0.8312
- Visual outputs: sharper edges and cleaner residuals than Panformer, FAME, and others.
- Zero-shot super-resolution on WV-II: PSNR=36.49 dB, SSIM=0.9114, SAM=0.0299, ERGAS=1.5515; outperforms Bicubic, SFINet++, Pan-Mamba.
Ablation studies on WV-II isolate each module's contribution:
- Mamba SSM vs. self-attention: PSNR drops by 0.9 dB (42.31 → 41.40 dB)
- In-context deep fusion vs. naive channel-concatenation: PSNR 41.29 dB
- No interleaving: PSNR 36.47 dB—critical token adjacency is lost
- Single-direction scan: 0.2 dB loss
- Local-window SSM: induces spatial locality and marginal gain over global kernel
7. Impact and Extensions
MMMamba establishes a scalable SSM-based paradigm for cross-modal image fusion, pan-sharpening, and zero-shot super-resolution with linear computational complexity. Its in-context conditioning mechanism, multimodal interleaved scanning, and residual fusion logic deliver consistently superior accuracy, robustness, and convergence speed across remote sensing and image enhancement tasks (Wang et al., 17 Dec 2025). These architectural ideas—deep in-context fusion, MI scanning, and linear-time cross-modal interaction—are extensible to other domains, including medical imaging (as seen in GFE-Mamba (Fang et al., 2024), MVSMamba (Jiang et al., 3 Nov 2025), and Mamba-driven U-Net derivatives (Bansal et al., 2024)), and general multimodal reasoning frameworks in the vision–language modeling landscape (Liao et al., 18 Feb 2025, Lu et al., 15 Oct 2025).
A plausible implication is that MMMamba’s core architectural inventions—modality-interleaved sequence processing and generic in-context fusion—provide a path forward for multimodal AI with unified, memory-efficient, and flexible state-space mechanisms that are broadly superior to quadratic-attention Transformer analogs for long-context and high-resolution tasks.