Mamba-UIE: Hybrid Physics-Based Underwater Image Enhancement
- Mamba-UIE is a computational framework that integrates state-space modeling with physically constrained decomposition to enhance underwater images.
- It adopts a revised image formation model to decouple direct transmission and backscatter, ensuring reconstruction consistency and improved PSNR/SSIM metrics.
- The architecture employs parallel modules and a novel Mamba-in-Convolution block to efficiently capture both global and local dependencies.
Mamba-UIE is a computational framework for underwater image enhancement (UIE) that fuses state-space modeling (Mamba) with a physically constrained, decompositional approach to the underwater image formation process. It addresses the deficiencies of CNNs (limited long-range modeling) and Transformers (quadratic complexity) by leveraging Mamba, a state space model with linear sequence complexity, and by incorporating a physical consistency constraint derived from a revised underwater image formation model. Mamba-UIE emphasizes both global and local information integration, achieving state-of-the-art performance on several UIE benchmarks (Zhang et al., 2024).
1. Revised Underwater Image Formation Model
At the core of Mamba-UIE is the adoption of an energy-conserving underwater image formation model, replacing the classic Koschmieder formulation, which uses a single transmission map for both of its terms. The Akkaynak & Treibitz model distinguishes direct transmission from backscatter, each governed by its own attenuation coefficient ($\beta^D$ and $\beta^B$):
$$I(x) = J(x)\,T_D(x) + B^\infty\,\bigl(1 - T_B(x)\bigr), \qquad T_D(x) = e^{-\beta^D d(x)}, \quad T_B(x) = e^{-\beta^B d(x)}$$
where:
- $I(x)$: observed underwater image
- $J(x)$: latent scene radiance (“clean” image)
- $T_D(x)$: direct transmission
- $T_B(x)$: backscatter transmission
- $B^\infty$: global background light
- $d(x)$: scene depth at pixel $x$
This decomposition explicitly predicts four latent components: $J$, $T_D$, $T_B$, $B^\infty$. A reconstruction step synthesizes the image from these, enforcing consistency with the input through a joint loss.
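The recombination step can be sketched in NumPy: the observed image is the scene radiance times the direct transmission, plus the background light times one minus the backscatter transmission. This is a minimal illustration rather than the authors' code; the function names, the `(H, W, 3)` array layout, and all coefficient values are assumptions chosen for the example.

```python
import numpy as np

def transmission(depth, beta):
    """Per-channel transmission exp(-beta * depth).

    depth: (H, W) scene depth; beta: (3,) attenuation coefficients.
    Returns an (H, W, 3) map with values in (0, 1].
    """
    return np.exp(-depth[..., None] * beta)

def synthesize_underwater(J, T_D, T_B, B):
    """Recombine the four components: I = J * T_D + B * (1 - T_B)."""
    return J * T_D + B * (1.0 - T_B)

# Toy inputs (all values are illustrative, not taken from the paper).
rng = np.random.default_rng(0)
J = rng.uniform(0.0, 1.0, size=(4, 4, 3))   # "clean" scene radiance
depth = np.full((4, 4), 2.0)                # constant depth map
beta_D = np.array([0.60, 0.20, 0.10])       # direct attenuation (R, G, B)
beta_B = np.array([0.50, 0.25, 0.15])       # backscatter attenuation (R, G, B)
B = np.array([0.05, 0.35, 0.45])            # global background light

T_D = transmission(depth, beta_D)
T_B = transmission(depth, beta_B)
I = synthesize_underwater(J, T_D, T_B, B)
```

Two limiting cases are a useful sanity check: at zero depth both transmissions equal one and the output equals the scene radiance, while at very large depth both vanish and the output approaches the background light.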
2. Physical-Consistency and Total Loss Formulation
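The network is trained with a physically motivated reconstruction consistency loss, which penalizes the discrepancy between the input image and the image re-synthesized from the predicted components via the formation model of Section 1.
The total loss combines this reconstruction term with a direct prediction loss on the scene radiance, each scaled by a weighting coefficient. Ground-truth supervision ensures the enhanced image approximates the ideal “clear” scene while maintaining image formation fidelity via the reconstructed image.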
Ablation shows that removing this physics-based reconstruction penalty lowers PSNR by an amount that matches the gain achieved by including the Mamba branch (Zhang et al., 2024).
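A two-term objective of this kind can be sketched as follows. The mean absolute error and the unit default weights are placeholders assumed for illustration; the distance function and weighting actually used by Mamba-UIE are not reproduced here, and the names `l1` and `total_loss` are hypothetical.

```python
import numpy as np

def l1(a, b):
    """Mean absolute error between two arrays of the same shape."""
    return float(np.mean(np.abs(a - b)))

def total_loss(J_pred, J_gt, I_rec, I_in, w_pred=1.0, w_rec=1.0):
    """Weighted sum of a prediction term and a reconstruction term.

    J_pred: predicted scene radiance      J_gt: ground-truth clear image
    I_rec:  image re-synthesized from the predicted components
    I_in:   original underwater input
    w_pred, w_rec: assumed placeholder weights, not the paper's values.
    """
    pred_term = l1(J_pred, J_gt)   # supervises the enhanced image
    rec_term = l1(I_rec, I_in)     # enforces physical consistency
    return w_pred * pred_term + w_rec * rec_term, pred_term, rec_term
```

Returning the two terms separately makes it easy to log them independently, which helps when checking that the reconstruction term is not dominated by the prediction term during training.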
3. Mamba-UIE Network Architecture
Mamba-UIE factorizes estimation into four parallel modules:
- J-Net: estimates the scene radiance using a hybrid CNN+Mamba SSM with no spatial downsampling.
- TD-Net: regresses direct transmission (6-layer CNN).
- TB-Net: regresses backscatter transmission (6-layer CNN).
- GBL module: regresses global background light using per-channel statistics and a small regression head.
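As an illustration of the "statistics plus regression head" idea behind the GBL module, the sketch below pools an image into per-channel statistics and maps them to one light value per channel. The choice of statistics (mean, standard deviation, maximum), the single linear layer, and the sigmoid output are all assumptions made for this example; the source does not specify them.

```python
import numpy as np

def channel_stats(img):
    """img: (H, W, 3) -> (9,) vector of per-channel mean, std, and max."""
    flat = img.reshape(-1, img.shape[-1])
    return np.concatenate([flat.mean(axis=0), flat.std(axis=0), flat.max(axis=0)])

def gbl_head(stats, weight, bias):
    """Hypothetical head: one linear layer followed by a sigmoid.

    weight: (3, 9), bias: (3,). Returns one value in (0, 1) per channel.
    """
    logits = weight @ stats + bias
    return 1.0 / (1.0 + np.exp(-logits))

# Usage with random (untrained) parameters.
rng = np.random.default_rng(0)
img = rng.uniform(0.0, 1.0, size=(8, 8, 3))
B_hat = gbl_head(channel_stats(img), rng.normal(size=(3, 9)), np.zeros(3))
```

Because the background light is a single value per channel rather than a map, a global pooling step of this kind is a natural fit, and the head stays very small compared with the other three modules.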
a) SSM Integration via Mamba-in-Convolution
J-Net integrates Mamba blocks—linear complexity state-space layers—directly with CNN modules. Each Mamba-in-Convolution (MIC) block:
- Applies Conv → InstanceNorm → Mish activation to modulate channel count.
- Uses a Channel-Spatial Siamese (CSS) structure: the feature map is reshaped both along spatial (for channel-level modeling) and channel (for spatial modeling) axes and processed by stacked Mamba layers.
- Aggregates global (via CSS/Mamba) and local (CNN path) representations through residual summation.
By design, this hybrid modeling enables Mamba-UIE to capture both long-range and local dependencies, overcoming the CNN’s locality and Transformer’s prohibitive complexity for high-resolution images.
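The two reshapes used by the Channel-Spatial Siamese structure can be made concrete with a short NumPy sketch. A toy linear recurrence stands in for the Mamba layers, and an identity shortcut stands in for the local CNN path; both substitutions, along with the function names and the `decay` parameter, are assumptions for illustration and do not reproduce the actual selective state-space layer.

```python
import numpy as np

def linear_scan(seq, decay=0.9):
    """Toy recurrence h_t = decay * h_{t-1} + x_t along the first axis.

    A stand-in for a Mamba layer: a single pass over the sequence,
    so the cost grows linearly with the sequence length.
    """
    out = np.empty_like(seq)
    h = np.zeros(seq.shape[1:], dtype=seq.dtype)
    for t in range(seq.shape[0]):
        h = decay * h + seq[t]
        out[t] = h
    return out

def css_branches(feat):
    """feat: (C, H, W) float array. Returns an array of the same shape."""
    C, H, W = feat.shape
    # Spatial modeling: a sequence of H*W tokens, each a C-dim vector.
    spatial_seq = feat.reshape(C, H * W).T                  # (H*W, C)
    spatial_out = linear_scan(spatial_seq).T.reshape(C, H, W)
    # Channel modeling: a sequence of C tokens, each an (H*W)-dim vector.
    channel_seq = feat.reshape(C, H * W)                    # (C, H*W)
    channel_out = linear_scan(channel_seq).reshape(C, H, W)
    # Residual summation; the identity term replaces the local CNN path.
    return feat + spatial_out + channel_out

feat = np.random.default_rng(0).normal(size=(8, 16, 16))
out = css_branches(feat)
print(out.shape)  # (8, 16, 16)
```

The point of the sketch is the data layout: the same feature map is read once as a long sequence of pixels and once as a short sequence of channels, and both results are reshaped back so they can be summed with the local features.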
b) Parallel Estimation, Supervision, and Output
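TD-Net and TB-Net use shallow CNNs to estimate the direct and backscatter transmission maps, respectively; the GBL module estimates the global background light via per-channel statistics and regression. The four components are recombined (see Section 1), and both the predicted scene radiance and the reconstructed image are directly supervised.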
4. Training Procedure and Quantitative Results
Training leverages:
- UIEB: $890$ paired images (underwater/clear) for training; evaluation also covers Challenging60 (no-reference), EUVP ($1,200$ pairs), and U45 ($45$ unpaired challenging images).
- All images are resized/cropped to a common fixed size.
- Adam optimizer, batch size $1$.
Full-reference results are reported as MSE, PSNR, and SSIM:
- UIEB: SSIM $0.93$
- EUVP: SSIM $0.83$
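For reference, MSE and PSNR follow their standard definitions, with PSNR computed as $10 \log_{10}(\text{MAX}^2 / \text{MSE})$. The sketch below assumes 8-bit images with a peak value of 255; the peak value and any averaging convention used in the reported results are assumptions here.

```python
import numpy as np

def mse(pred, ref):
    """Mean squared error, computed in float64 to avoid uint8 wraparound."""
    pred = np.asarray(pred, dtype=np.float64)
    ref = np.asarray(ref, dtype=np.float64)
    return float(np.mean((pred - ref) ** 2))

def psnr(pred, ref, max_val=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    err = mse(pred, ref)
    if err == 0.0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / err))

# Usage: compare an enhanced image against its reference.
rng = np.random.default_rng(0)
ref = rng.integers(0, 256, size=(16, 16, 3), dtype=np.uint8)
noisy = np.clip(ref.astype(np.int16) + rng.integers(-5, 6, size=ref.shape), 0, 255)
score = psnr(noisy, ref)
```

Since PSNR is a logarithmic function of MSE, the two metrics always rank methods identically on a single image; they can diverge only when averaged over a dataset.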
No-reference metrics:
- Challenging60: UIQM $2.72$, UCIQE $0.59$
- U45: UIQM $3.32$, UCIQE $0.61$
Comparative ranking: On UIEB, Mamba-UIE ranks first in MSE/PSNR and second in SSIM; on EUVP it is top-3 for all scores (Zhang et al., 2024).
Ablation tests confirm the necessity of both the physical reconstruction penalty and the Mamba SSM branch; removing either lowers SSIM by about $0.01$, with a corresponding drop in PSNR.
A comparison of formation models shows the revised Akkaynak model achieves superior PSNR/SSIM/UIQM over Koschmieder, Retinex, and Jaffe–McGlamery alternatives.
5. Context and Contributions in the UIE Landscape
Mamba-UIE is positioned within a class of recent SSM-based UIE models that address the dual need for efficiency and global context modeling. While methods like O-Mamba (Dong et al., 2024), RD-UIE (Jiang et al., 2025), and MambaUIE (Chen et al., 2024) emphasize novel state-space scanning or dynamic fusion mechanisms, Mamba-UIE is distinguished by its explicit, physics-driven image formation decomposition, parallel estimation, and reconstruction-consistency penalty.
The MIC block architecture, which exposes feature maps to both channel-wise and spatial-wise Mamba modeling and includes a residual CNN path, is notable for efficient sequence modeling at linear cost in large images. This design allows Mamba-UIE to outperform competing Transformer/CNN hybrids on both accuracy and computational complexity, especially at high input resolutions.
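The scaling argument can be made concrete with a little arithmetic. With no spatial downsampling, every pixel is a token, so self-attention compares all token pairs while a state-space scan visits each token once. The resolutions below are illustrative assumptions, and constant factors (channel width, state size, number of heads) are deliberately ignored.

```python
def token_count(height, width):
    """Number of tokens when every pixel of the feature map is one token."""
    return height * width

def attention_pairs(n_tokens):
    """Pairwise interactions in full self-attention: quadratic in tokens."""
    return n_tokens * n_tokens

def scan_steps(n_tokens):
    """Recurrence steps in a single state-space scan: linear in tokens."""
    return n_tokens

for side in (256, 512, 1024):
    n = token_count(side, side)
    ratio = attention_pairs(n) // scan_steps(n)
    print(f"{side}x{side}: tokens={n}, attention pairs={attention_pairs(n)}, "
          f"scan steps={scan_steps(n)}, ratio={ratio}")
```

Doubling the image side quadruples the token count, which multiplies the attention cost by sixteen but the scan cost by only four; this gap is why the linear-cost design matters most at high input resolutions.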
6. Implications, Limitations, and Future Work
Mamba-UIE demonstrates that linear-complexity SSMs combined with physically grounded constraints yield robust UIE across diverse datasets, even in the absence of explicit adversarial loss or perceptual penalties. Both reconstruction physics and advanced global dependency modeling provide substantial and separable performance gains.
A plausible implication is that future UIE models may further leverage hybrid designs that pair SSMs with physical constraints, including end-to-end learned depth/attenuation estimation for more challenging, uncalibrated scenarios, or extend to multi-frame (video) settings with temporal SSMs. The current design, however, relies on sufficient paired data and explicit supervision for each latent component. Exploring self-supervised estimates of the scene radiance and attenuation maps is a potential future extension (Zhang et al., 2024).