Hierarchical Mamba Module (HMM)

Updated 24 November 2025
  • The Hierarchical Mamba Module (HMM) is a neural architecture that integrates global state-space modeling, Fourier analysis, and local convolutions for multi-scale feature learning.
  • It employs dual spatial and frequency domain branches to capture long-range dependencies while preserving local details in image restoration tasks.
  • Empirical studies show that HMM enhances PSNR and SSIM in applications like image deraining and MR image super-resolution with lower computational overhead.

The Hierarchical Mamba Module (HMM) is a specialized neural architecture devised to facilitate robust multi-scale feature learning by integrating global contextual modeling with local detail recovery, drawing on advances in state-space models (SSMs), Fourier-domain processing, and attention-based prior fusion. The module has been deployed in recent high-performance computer vision systems, notably as the backbone of the Multi-Prior Hierarchical Mamba (MPHM) framework for image deraining (Yu et al., 17 Nov 2025) and in related designs for multi-modality medical image super-resolution (Ji et al., 14 Apr 2025).

1. Foundational Design and Motivation

The HMM is characterized by a dual-path architecture designed to address the limitations of traditional convolutional and Transformer-based models, in particular the fixed local receptive fields of CNNs and the computational burden of global attention. HMMs hierarchically combine attention-free global modeling based on state-space models (specifically, Mamba blocks) with Fourier-based frequency-domain analysis and efficient local convolutions. This synergy enables precise long-range dependency modeling and edge-preserving local feature enhancement at linear computational complexity in the input size.

Within the overall MPHM system, the HMM backbone serves as the core encoder–decoder structure, refined at each scale by auxiliary prior information when used in multi-prior systems (Yu et al., 17 Nov 2025). The hierarchical, staged deployment of HMMs enables progressive abstraction and restoration across multiple feature resolutions.

2. Architectural Components and Data Flow

Hierarchical Mamba Modules process input features via two parallel paths:

  • Spatial-domain branch:
    • Input feature channels are split into four groups.
    • Two groups are processed by Visual Selective Spatial Mamba (VSSM) blocks, which apply global SSM-based modeling.
    • The other two groups undergo depthwise 3×3 convolution for local spatial detail retention.
    • The processed groups are regrouped and subjected to a further round of global-local fusion.
  • Frequency-domain branch:

    • Uses a learnable complex-weighted Fast Fourier Convolutional Module (FFCM). Given an input feature F_{in}, the operation is

    F_{fre} = \text{IFFT}\bigl( \text{FFT}(F_{in}) \odot W \bigr)

    where W \in \mathbb{C}^{C \times H \times W} denotes a trainable frequency-domain reweighting tensor.

  • Fusion and Residual Connection:

    • The outputs of the spatial and frequency branches are concatenated across the channel dimension, projected with a 1\times1 convolution, and a skip connection adds the input:

    F_{out} = F_{in} + \text{Conv}_{1\times1}\bigl(\left[\, F_{spa},\, F_{fre} \,\right]\bigr)

This module is repeatedly applied at each stage of both encoders and decoders, with varying depth per stage (e.g., {4, 6, 8, 6, 4} blocks across five U-Net scales in image deraining) (Yu et al., 17 Nov 2025).
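
The data flow above can be illustrated with a minimal PyTorch sketch. The class names (FFCM, HMMBlock) and the identity placeholder standing in for the VSSM blocks are assumptions for illustration; the actual channel regrouping, second fusion round, and Mamba internals from the cited papers are not reproduced, and the frequency weights here cover only the half-spectrum kept by a real FFT.

```python
import torch
import torch.nn as nn


class FFCM(nn.Module):
    """Fast Fourier Convolutional Module: learnable complex reweighting of the spectrum."""
    def __init__(self, channels, height, width):
        super().__init__()
        # Trainable complex weights; rfft2 stores only width // 2 + 1 frequency bins.
        self.weight = nn.Parameter(
            0.02 * torch.randn(channels, height, width // 2 + 1, dtype=torch.cfloat))

    def forward(self, x):
        spec = torch.fft.rfft2(x, norm="ortho")        # FFT(F_in)
        spec = spec * self.weight                      # elementwise reweighting FFT(F_in) * W
        return torch.fft.irfft2(spec, s=x.shape[-2:], norm="ortho")  # IFFT(...)


class HMMBlock(nn.Module):
    """Dual-branch HMM block sketch: global/local spatial path plus FFCM frequency path."""
    def __init__(self, channels, height, width, vssm_block=None):
        super().__init__()
        assert channels % 4 == 0
        half = channels // 2
        # Two of the four channel groups -> global SSM modeling (identity placeholder here).
        self.vssm = vssm_block if vssm_block is not None else nn.Identity()
        # Remaining two groups -> depthwise 3x3 convolution for local detail retention.
        self.local = nn.Conv2d(half, half, kernel_size=3, padding=1, groups=half)
        self.ffcm = FFCM(channels, height, width)
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)  # 1x1 conv on [F_spa, F_fre]

    def forward(self, x):
        half = x.shape[1] // 2
        f_spa = torch.cat([self.vssm(x[:, :half]), self.local(x[:, half:])], dim=1)
        f_fre = self.ffcm(x)
        return x + self.fuse(torch.cat([f_spa, f_fre], dim=1))  # residual fusion


# Example: one block at a 64-channel, 128x128 feature resolution.
block = HMMBlock(channels=64, height=128, width=128)
out = block(torch.randn(1, 64, 128, 128))  # -> (1, 64, 128, 128)
```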

3. Integration with Multi-Prior Guidance

While the HMM can function independently, leading systems employ it within architectures that fuse task-specific priors at multiple decoding levels. For instance, the Multi-Prior Hierarchical Mamba (MPHM) for deraining operates as follows:

  • Macro-semantic priors: Sentence-level cues (e.g., "No rain") are encoded using a frozen CLIP text encoder and adapted to the feature space by a bottleneck adapter and cross-attention.
  • Micro-structural visual priors: Features derived from a frozen DINOv2 visual encoder are adapted and upsampled to match feature dimensions.
  • Progressive Priors Fusion Injection (PFI): At every decoder stage, priors are injected via a two-stage cross-attention (DINOv2 then CLIP), a self-attention layer, and a Gated Depth-wise Feed-Forward Network (GDFN). A learnable, data-dependent scalar \alpha_l modulates the injection at each scale:

F^{l+1} = \text{Decoder}^{l}(F^{l}) + \alpha_{l}\,\psi\bigl(F^{l},\,T,\,V\bigr)

where T and V are the adapted CLIP and DINOv2 priors, respectively (Yu et al., 17 Nov 2025).
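
A hedged sketch of this injection step is given below. The class name PriorsFusionInjection is an assumption, the GDFN is approximated by a simple gated feed-forward layer, and the input f is assumed to already be the output of the stage-l decoder with the priors projected to the same channel dimension.

```python
import torch
import torch.nn as nn


class PriorsFusionInjection(nn.Module):
    """Simplified PFI: DINOv2 cross-attention, CLIP cross-attention, self-attention, gated FFN."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.cross_dino = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_clip = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Linear(dim, dim)             # gated feed-forward stand-in for the GDFN
        self.proj = nn.Linear(dim, dim)
        self.alpha = nn.Parameter(torch.zeros(1))   # learnable per-scale injection weight alpha_l

    def forward(self, f, clip_prior, dino_prior):
        # f: (B, N, C) decoded features as tokens; priors: (B, M, C) adapted CLIP / DINOv2 features.
        x, _ = self.cross_dino(f, dino_prior, dino_prior)  # stage 1: micro-structural visual prior
        x, _ = self.cross_clip(x, clip_prior, clip_prior)  # stage 2: macro-semantic text prior
        x, _ = self.self_attn(x, x, x)
        x = self.proj(x) * torch.sigmoid(self.gate(x))     # gated feed-forward network
        return f + self.alpha * x                          # f + alpha_l * psi(f, T, V)
```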

This suggests that HMMs are especially effective when exploited at the intersection of internal architectural hierarchy and external domain or task priors, providing adaptive, scale-specific guidance for complex restoration tasks.

4. State-Space Mamba Formulation in HMMs

Mamba blocks within HMMs adopt a discretized linear continuous-time SSM formulation:

\frac{d\,h(t)}{dt} = A h(t) + B x(t), \qquad y(t) = C h(t)

Discretization yields

h_t = \bar{A} h_{t-1} + \bar{B} x_t, \qquad y_t = C h_t

with \bar{A}, \bar{B} parameterized via a zero-order hold (ZOH) mechanism as functions of the trainable matrices A and B and of a learned timescale \Delta.
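
For concreteness, the discretized recurrence can be written as a naive sequential scan over a single channel with a diagonal A. This is only an illustration of the equations above, not the parallel selective-scan kernel used in practice.

```python
import torch

def ssm_scan(x, A, B, C, delta):
    """
    Naive discretized SSM scan for one input channel.
    x: (L,) input sequence; A, B, C: (N,) diagonal state matrix and projections;
    delta: scalar learned timescale.
    """
    # Zero-order hold discretization: A_bar = exp(delta*A), B_bar = (exp(delta*A) - 1) / A * B.
    A_bar = torch.exp(delta * A)
    B_bar = (A_bar - 1.0) / A * B
    h = torch.zeros_like(A)
    ys = []
    for x_t in x:                       # sequential scan over the unfolded patch sequence
        h = A_bar * h + B_bar * x_t     # h_t = A_bar h_{t-1} + B_bar x_t
        ys.append(torch.dot(C, h))      # y_t = C h_t
    return torch.stack(ys)


# Example: a length-16 sequence with a 4-dimensional diagonal (stable) state.
y = ssm_scan(torch.randn(16), A=-torch.rand(4) - 0.1,
             B=torch.randn(4), C=torch.randn(4), delta=torch.tensor(0.1))
```

Multi-directional scanning, described next, amounts to applying such a recurrence along several orderings of the same patch sequence and merging the outputs.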

Within global modeling, feature maps are unfolded into patch sequences for SSM processing along multiple scan directions, enabling both intra- and inter-patch dependency modeling. Local Mamba variants process shorter, local sequences (e.g., per quadrant), thereby emphasizing short-range structure (Ji et al., 14 Apr 2025).

5. Empirical Impact and Ablation Analyses

Systematic ablations have demonstrated HMM’s centrality to high-fidelity reconstruction:

  • Removing the frequency-domain path (FFCM) reduces PSNR by over 2.5 dB on Rain200H (Yu et al., 17 Nov 2025).
  • Omitting depthwise convolutions (local path) modestly reduces PSNR but increases FLOPs by 20%.
  • The branch merging strategy critically affects outcomes: channelwise concatenation plus a 1\times1 convolution yields higher PSNR than direct addition or cross-attention for path fusion.
  • Progressive fusion of both semantic and visual priors yields additional quality improvements over single-stage or unimodal prior injection.

The dual-path, global-local structure is therefore vital for both efficient context integration and recovery of high-frequency image content. The module’s linear time complexity in sequence length and quadratic complexity in hidden dimension further support its scalability to large inputs and deep hierarchies (Ji et al., 14 Apr 2025).

6. Comparative Performance and Application Domains

HMM-based models surpass prior state-of-the-art benchmarks in both synthetic and real-world tasks:

  • On Rain200H, MPHM with HMM backbone achieves 33.53 dB PSNR and 0.9475 SSIM, a +0.57 dB gain over previous leaders in single-image deraining (Yu et al., 17 Nov 2025).
  • In multi-modality MR image super-resolution, the GLMamba (a two-branch HMM variant) achieves high performance with approximately 1.2M parameters and 32 GFLOPs, outperforming more resource-intensive Transformer counterparts (Ji et al., 14 Apr 2025).
  • These results extend to real-world scenarios, with the lowest reported BRISQUE (21.22) and NIQE (3.787) scores on unpaired Internet RE-RAIN data.

A plausible implication is that the modular, hierarchical, and dual-domain nature of HMMs generalizes beyond image deraining and medical image SR, forming a template applicable to other vision restoration, enhancement, and multi-modal tasks.

7. Loss Functions and Training Regimens

Training strategies for architectures with HMMs integrate data-appropriate objective terms:

  • Reconstruction loss: \ell_1 or similar pixel-wise discrepancies between predicted and reference images.
  • Contrastive or domain-specific regularizations: a frequency-domain contrastive loss (Yu et al., 17 Nov 2025) or a contrastive edge loss (CELoss) built on Laplacian kernels for edge enhancement (Ji et al., 14 Apr 2025); a simplified sketch of such a composite objective follows this list.
  • Loss functions are typically balanced via fixed or learned coefficients to optimize sharpness, perceptual quality, and consistency across modalities or priors.
  • Optimization typically uses Adam with learning rates scheduled by cosine annealing or similar; batch and crop sizes vary with the application and the available computational resources.
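
As a rough illustration of how such terms combine, the sketch below pairs an \ell_1 reconstruction loss with a Laplacian-based edge term. The lambda_edge weight and the simplified edge term (standing in for the contrastive edge loss) are assumptions, not the exact objectives from the cited papers.

```python
import torch
import torch.nn.functional as F

# 3x3 Laplacian kernel used to extract edge maps.
_LAPLACIAN = torch.tensor([[0., 1., 0.],
                           [1., -4., 1.],
                           [0., 1., 0.]]).view(1, 1, 3, 3)

def laplacian_edges(img):
    # Depthwise application of the Laplacian kernel to each channel.
    c = img.shape[1]
    kernel = _LAPLACIAN.to(img.device, img.dtype).repeat(c, 1, 1, 1)
    return F.conv2d(img, kernel, padding=1, groups=c)

def restoration_loss(pred, target, lambda_edge=0.05):
    rec = F.l1_loss(pred, target)                                     # pixel-wise l1 reconstruction
    edge = F.l1_loss(laplacian_edges(pred), laplacian_edges(target))  # edge consistency term
    return rec + lambda_edge * edge
```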

This approach has demonstrated effectiveness in achieving both high quantitative gains and visually plausible outputs in challenging low-level vision settings.
