Multi-Prior Hierarchical Mamba Network
- The MPHM network is an architectural paradigm that fuses complementary semantic and structural priors within a hierarchical Mamba backbone to enhance image restoration and medical imaging tasks.
- The design leverages dual-path hierarchical modules and attention-based prior injection to capture multi-scale, domain-rich contextual features, achieving superior PSNR and classification accuracy.
- Empirical evaluations in image deraining and CECT tumor subtyping demonstrate significant gains, such as improved PSNR (+0.57 dB) and 97.4% accuracy, validating its robust multi-prior integration.
The Multi-Prior Hierarchical Mamba (MPHM) network is an architectural paradigm that systematically integrates heterogeneous prior knowledge within a hierarchical Mamba backbone to advance both image restoration and medical image analysis tasks. Across its instantiations in image deraining (Yu et al., 17 Nov 2025) and multi-phase contrast-enhanced CT tumor subtyping (Gong et al., 16 Sep 2025), MPHM is characterized by its fusion of complementary priors—semantic and structural—and its use of dual-path hierarchical modules (often leveraging the Mamba architecture) to capture multi-scale and domain-rich contextual information. MPHM achieves state-of-the-art performance in both domains, with rigorous ablation and evaluation protocols demonstrating the advantage of macro-micro prior integration and hierarchical dual-domain modeling.
1. Architectural Overview and Rationale
The defining architectural principle of MPHM is the fusion of distinct and complementary priors at multiple abstraction levels within a hierarchical network backbone.
Deraining Context: In single-image deraining, the objective is to separate rain streaks from scene content—requiring both macro-level semantic understanding (e.g., the concept “no rain on a car”) and micro-level structural discrimination (e.g., fine edges and textures). MPHM addresses these requirements using global semantic priors from the CLIP text encoder (prompted with “No rain”) and detailed structural priors from a frozen DINOv2 visual encoder. These priors are injected at each stage of a five-level U-shaped encoder-decoder network, whose backbone comprises Hierarchical Mamba Modules (HMM) operating in both spatial and frequency domains (Yu et al., 17 Nov 2025).
CECT Tumor Subtyping Context: In contrast-enhanced CT (CECT) analysis for tumor subtyping, MPHM as instantiated in CECT-Mamba processes spatial and temporal contrast patterns across phases (arterial, venous, delayed). The architecture integrates a 3D-CNN with a Dual-Hierarchical Contrast-enhanced-aware Mamba (DHCM) encoder, featuring both spatial and temporal tokenization and contrast-guided refinement (Gong et al., 16 Sep 2025).
The following table summarizes MPHM’s macro-architecture in its two major application domains:
| Instantiation | Prior Types | Backbone Module | Injection/Fusion Scheme |
|---|---|---|---|
| Deraining (Yu et al., 17 Nov 2025) | CLIP (text), DINOv2 (visual) | HMM (dual-domain) | Priors Fusion Injection (PFI) |
| CECT-Mamba (Gong et al., 16 Sep 2025) | Spatial/temporal context tokens | DHCM (spatial/temp) | Dual-hierarchical, SGR, MGF |
2. Prior Fusion and Injection Mechanisms
A key innovation in MPHM is the progressive, level-wise injection of priors into the decoder or encoder streams via attention-based modules.
Image Deraining – Priors Fusion Injection (PFI):
- At each decoder level $\ell$, both the adapted visual prior $P_v^{\ell}$ (from DINOv2) and the textual prior $P_t^{\ell}$ (from CLIP) are injected into the decoder feature $F_b^{\ell}$ by sequential application of cross-attention (first with $P_v^{\ell}$, then with $P_t^{\ell}$), followed by self-attention and Gated Depth-wise Feedforward (GDFN) refinement, as sketched in the code after this list.
- Mathematical formalization:

$$F_{\mathrm{pfi}}^{\ell} = \mathrm{GDFN}\!\left(\mathrm{SA}\!\left(\mathrm{CA}\!\left(\mathrm{CA}\!\left(F_b^{\ell},\, P_v^{\ell}\right),\, P_t^{\ell}\right)\right)\right),$$

where $\mathrm{CA}$ and $\mathrm{SA}$ denote cross- and self-attention, respectively.
- This design contrasts with naive fusion (addition or concatenation), which empirically yields lower performance (e.g., PSNR 31.95 dB vs. 33.53 dB for hierarchical PFI on Rain200H (Yu et al., 17 Nov 2025)).
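A minimal PyTorch-style sketch of such a prior-injection block is given below. The class name `PFI`, the head count, and the plain feedforward standing in for the GDFN are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class PFI(nn.Module):
    """Priors Fusion Injection sketch: cross-attend to the visual prior,
    then the textual prior, then self-attend and refine."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.ca_visual = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ca_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.sa = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Plain feedforward as a stand-in for the Gated Depth-wise Feedforward (GDFN).
        self.ffn = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 2 * dim),
                                 nn.GELU(), nn.Linear(2 * dim, dim))

    def forward(self, f_b, p_v, p_t):
        # f_b: decoder tokens (B, N, C); p_v: visual prior (B, M, C); p_t: text prior (B, 1, C)
        x = f_b + self.ca_visual(f_b, p_v, p_v)[0]   # inject visual prior
        x = x + self.ca_text(x, p_t, p_t)[0]         # inject textual prior
        x = x + self.sa(x, x, x)[0]                  # self-attention
        return x + self.ffn(x)                       # feedforward refinement

# Usage: fuse priors into a 32x32 feature map flattened to 1024 tokens.
pfi = PFI(dim=64)
out = pfi(torch.randn(2, 1024, 64), torch.randn(2, 256, 64), torch.randn(2, 1, 64))
```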
CECT-Mamba – Dual-Hierarchical Tokenization:
- At each encoder stage, feature maps are simultaneously processed via two tokenization streams:
- Spatial sampling: patches extracted within each CECT phase, concatenated across phases.
- Temporal sampling: per-voxel feature concatenation across the three phases.
- Temporal features with high inter-phase change are refined with a dedicated Mamba module (Similarity-Guided Refinement, SGR) before fusion with spatial branch outputs (Gong et al., 16 Sep 2025).
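The two tokenization streams can be illustrated with a short sketch; the `(B, P, C, D, H, W)` tensor layout with `P = 3` phases, the patch size, and the function names are assumptions made for exposition.

```python
import torch

def spatial_tokens(feats: torch.Tensor, patch: int = 4) -> torch.Tensor:
    """Extract non-overlapping 3D patches within each phase, then concatenate
    the per-phase token sequences along the token axis."""
    B, P, C, D, H, W = feats.shape
    x = feats.reshape(B * P, C, D, H, W)
    x = x.unfold(2, patch, patch).unfold(3, patch, patch).unfold(4, patch, patch)
    x = x.permute(0, 2, 3, 4, 1, 5, 6, 7)          # patches first, channels last
    return x.reshape(B, -1, C * patch ** 3)        # all phases' tokens, concatenated

def temporal_tokens(feats: torch.Tensor) -> torch.Tensor:
    """Concatenate features of the same voxel across the P phases, yielding
    one (P*C)-dimensional token per voxel."""
    B, P, C, D, H, W = feats.shape
    x = feats.permute(0, 3, 4, 5, 1, 2)            # (B, D, H, W, P, C)
    return x.reshape(B, D * H * W, P * C)

feats = torch.randn(2, 3, 16, 8, 8, 8)             # B=2, three CECT phases
print(spatial_tokens(feats).shape)                 # (2, 24, 1024)
print(temporal_tokens(feats).shape)                # (2, 512, 48)
```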
3. Hierarchical Mamba and Dual-Domain Feature Modeling
MPHM architectures exploit Mamba-based modules to model long-range dependencies and fine local features through specialized hierarchical designs.
Hierarchical Mamba Module (HMM) – Deraining:
- Operates with bifurcated spatial- and frequency-domain branches.
- Spatial branch: Channel-split features processed via Visual Selective Spatial Mamba (VSSM) blocks and depth-wise convolutions, fused and refined through further Mamba and convolution layers.
- Frequency branch: the 2D FFT of the input is passed through a lightweight Frequency-domain Feature Coupling Module (FFCM).
- The outputs of the two branches are concatenated, projected, and added residually:

$$F_{\mathrm{out}} = F_{\mathrm{in}} + \mathrm{Proj}\!\left(\mathrm{Concat}\!\left(F_{\mathrm{spa}},\, F_{\mathrm{freq}}\right)\right).$$
- Ablations demonstrate that both frequency-coupling (−2.50 dB PSNR if omitted) and depth-wise convolution are critical to restoration quality (Yu et al., 17 Nov 2025).
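A compact dual-domain block in this spirit can be sketched as follows; the depth-wise convolution standing in for the VSSM Mamba branch and the 1x1 coupling standing in for the FFCM are deliberate simplifications.

```python
import torch
import torch.nn as nn

class DualDomainBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # Spatial branch: depth-wise conv as a stand-in for VSSM Mamba scanning.
        self.spatial = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        # Frequency branch: 1x1 coupling over real/imag parts (FFCM stand-in).
        self.freq = nn.Conv2d(2 * dim, 2 * dim, 1)
        self.proj = nn.Conv2d(2 * dim, dim, 1)

    def forward(self, x):
        s = self.spatial(x)
        f = torch.fft.rfft2(x, norm="ortho")            # 2D FFT of the input
        f = self.freq(torch.cat([f.real, f.imag], dim=1))
        re, im = f.chunk(2, dim=1)
        f = torch.fft.irfft2(torch.complex(re, im), s=x.shape[-2:], norm="ortho")
        return x + self.proj(torch.cat([s, f], dim=1))  # concat, project, residual

block = DualDomainBlock(32)
y = block(torch.randn(1, 32, 64, 64))                   # shape preserved
```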
Dual-Hierarchical Mamba – CECT-Mamba:
- The DHCM block processes spatial-patch tokens and temporally sampled tokens (across CECT phases) via independent Mamba pathways.
- Temporal token refinement is channelled through SGR, focusing modeling capacity on regions of highest inter-phase variability, vital for discriminative tumor subtyping.
- Downsampling and multi-hierarchical feature extraction mirror U-Net and UNETR designs (Gong et al., 16 Sep 2025).
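The similarity-guided routing can be sketched as below: temporal tokens are scored by inter-phase variability and only the most dynamic fraction is sent through an extra refinement module. The variance-based score, the top-k ratio, and the linear placeholder refiner are all assumptions.

```python
import torch
import torch.nn as nn

def sgr(tokens: torch.Tensor, refiner: nn.Module, P: int = 3, ratio: float = 0.25):
    """tokens: (B, N, P*C) per-voxel temporal tokens (one per voxel)."""
    B, N, PC = tokens.shape
    phases = tokens.view(B, N, P, PC // P)
    score = phases.var(dim=2).mean(dim=-1)          # inter-phase change per voxel
    k = max(1, int(ratio * N))
    idx = score.topk(k, dim=1).indices              # most contrast-dynamic voxels
    gather = idx.unsqueeze(-1).expand(-1, -1, PC)
    refined = refiner(tokens.gather(1, gather))     # a Mamba module in the paper
    return tokens.scatter(1, gather, refined)       # write refined tokens back

refiner = nn.Linear(48, 48)                         # placeholder for the SGR Mamba
out = sgr(torch.randn(2, 512, 48), refiner)
```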
4. Training Objectives, Losses, and Hyperparameters
Deraining (Yu et al., 17 Nov 2025):
- The total loss combines an $\ell_1$ pixel-space reconstruction loss with a frequency-domain contrastive regularization (FCR) term:

$$\mathcal{L} = \mathcal{L}_{1} + \lambda\,\mathcal{L}_{\mathrm{FCR}},$$

with

$$\mathcal{L}_{\mathrm{FCR}} = \frac{\lVert \mathcal{F}(I_{\mathrm{pred}}) - \mathcal{F}(I_{\mathrm{gt}}) \rVert_{1}}{\sum_{k=1}^{K} \lVert \mathcal{F}(I_{\mathrm{pred}}) - \mathcal{F}(I_{k}) \rVert_{1}},$$

where $\mathcal{F}$ denotes the Discrete Fourier Transform and $\{I_k\}_{k=1}^{K}$ are random negatives; a loss sketch follows this list.
- Training uses the Adam optimizer with a cosine-annealed learning-rate schedule, random patch crops, batch size 4, and stage-wise HMM depths (Yu et al., 17 Nov 2025).
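Under the stated form of the objective, the loss can be sketched as follows; the weight `lam` and the exact construction of the negative set are assumptions.

```python
import torch
import torch.nn.functional as F

def deraining_loss(pred, gt, negatives, lam: float = 0.1):
    """pred, gt: (B, C, H, W); negatives: (K, B, C, H, W) random negatives."""
    l1 = F.l1_loss(pred, gt)                             # pixel-space l1 term
    fp, fg = torch.fft.fft2(pred), torch.fft.fft2(gt)
    num = (fp - fg).abs().mean()                         # distance to the positive
    den = sum((fp - torch.fft.fft2(n)).abs().mean() for n in negatives)
    return l1 + lam * num / (den + 1e-8)                 # contrastive DFT ratio

pred = torch.randn(4, 3, 64, 64, requires_grad=True)
gt, negs = torch.randn(4, 3, 64, 64), torch.randn(2, 4, 3, 64, 64)
deraining_loss(pred, gt, negs).backward()
```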
CECT-Mamba (Gong et al., 16 Sep 2025):
- The objective is standard cross-entropy over tumor class labels (PDAC, PNET); no contrastive or frequency-based regularization is applied.
- Optimization uses Adam with a cosine-decayed learning rate over 100 epochs, batch size 4, and heavy data augmentation (including random masking of up to 50% of tokens in the spatial path; see the sketch after this list).
- Preprocessing entails per-phase cropping, intensity normalization, and ROI localization via nnU-Net.
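The random token-masking augmentation can be sketched as below; zeroing whole tokens with a per-sample masking ratio drawn up to 50% is one plausible reading of the augmentation.

```python
import torch

def random_token_mask(tokens: torch.Tensor, max_ratio: float = 0.5) -> torch.Tensor:
    """tokens: (B, N, C). Zero a random fraction in [0, max_ratio) per sample."""
    B, N, _ = tokens.shape
    ratio = torch.rand(B, device=tokens.device) * max_ratio
    keep = torch.rand(B, N, device=tokens.device) >= ratio.unsqueeze(1)
    return tokens * keep.unsqueeze(-1)

masked = random_token_mask(torch.randn(4, 24, 1024))    # spatial-path tokens
```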
5. Empirical Performance and Ablation Analyses
MPHM consistently advances the empirical state of the art in its target domains.
Deraining Results (Yu et al., 17 Nov 2025):
- Achieves PSNR 33.53 dB (+0.57 dB over TransMamba, +1.05 dB over FADformer) on Rain200H.
- Delivers perceptual improvements on real-world data: BRISQUE improves from 21.67 (NeRD-Rain) to 21.22, and NIQE from 3.84 to 3.79.
- Outperforms ablated configurations (e.g., single-prior or naive fusion), with both priors yielding PSNR 33.53 dB/SSIM 0.9475 versus 33.06 dB/0.9421 with no priors.
CECT-Mamba Results (Gong et al., 16 Sep 2025):
- On an in-house 270-patient dataset, achieves 97.4% accuracy and 98.6% AUC for PDAC vs. PNET classification.
- Designs such as SGR and multi-granularity fusion are empirically validated for their contributions to final classification accuracy.
6. Data Flow, Pseudocode, and Pipeline Design
Both instantiations of MPHM provide explicit algorithmic schematics:
Image Deraining Forward Pass (Yu et al., 17 Nov 2025):
```
function MPHM_forward(I_rain, text="No rain"):
    # Frozen foundation-model priors, adapted by lightweight adapters
    Pv_base    = DINOv2_encoder(I_rain)      # frozen
    Pv_adapted = DINOv2_adapter(Pv_base)
    Pt_base    = CLIP_text_encoder(text)     # frozen
    Pt_adapted = CLIP_adapter(Pt_base)

    # Five-level encoder of Hierarchical Mamba Modules (HMM)
    F_enc[0] = I_rain
    for s in 1..5:
        F_enc[s] = downsample(HMM_s(F_enc[s-1]))

    # Decoder with level-wise Priors Fusion Injection (PFI)
    F_dec[5] = F_enc[5]
    for s in 5..1:
        up  = upsample(F_dec[s])
        F_b = concat(up, F_enc[s-1])
        Pv_l  = resize(Pv_adapted, size(F_b))
        Pt_l  = resize(Pt_adapted, size(F_b))
        F_pfi = PFI(F_b, Pv_l, Pt_l)
        F_dec[s-1] = HMM_{s-1}(F_pfi)

    # Predict the rain layer and subtract it from the input
    R = Conv_final(F_dec[0])
    I_pred = I_rain - R
    return I_pred
```
CECT-Mamba Forward-Backward Iteration (Gong et al., 16 Sep 2025):
```
for each minibatch of B patients:
    # ROI localization and cropping
    ...
    # Initial 3D feature encoding + SCI
    ...
    # Dual-hierarchical Mamba encoder: spatial and temporal branches
    ...
    # Multi-granularity fusion & classification
    ...
    # Loss & backward
    ...
```
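Since the published iteration is elided above, a generic sketch of one training step is shown below; the stub model standing in for the CECT-Mamba encoder, the input shapes, and the one-batch "loader" are hypothetical.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 ** 3, 2))  # stub, not CECT-Mamba
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
criterion = nn.CrossEntropyLoss()                               # PDAC vs. PNET

for volumes, labels in [(torch.randn(4, 3, 32, 32, 32), torch.randint(0, 2, (4,)))]:
    logits = model(volumes)            # encoder + fusion + classification head
    loss = criterion(logits, labels)   # cross-entropy objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
scheduler.step()                       # cosine decay, stepped once per epoch
```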
7. Significance, Generalization, and Limitations
MPHM demonstrates clear advantages in multifaceted data integration and dual-domain feature modeling across vision and medical imaging settings. In deraining, its hierarchical prior fusion strategy delivers both higher fidelity restoration and superior perceptual metrics with moderate computational overhead (~10M parameters, ~62 GFLOPs) (Yu et al., 17 Nov 2025). In CECT tumor subtyping, MPHM yields clinically significant gains in classification with explicit modeling of spatial-temporal context (Gong et al., 16 Sep 2025).
A plausible implication is that the MPHM design pattern—namely, systematic hierarchical fusion of heterogeneous priors and dual-domain contextualization—may readily generalize to other restoration, segmentation, or classification problems involving multi-modal or multi-phase inputs. Conversely, performance and computational cost may be sensitive to the design of prior adapters and the calibration of hierarchical fusion hyperparameters; ablation results in (Yu et al., 17 Nov 2025) confirm sensitivity to fusion strategy.
MPHM stands as an exemplar for the interplay between foundation model priors and structured hierarchical modeling in contemporary deep learning pipelines.