EndoIR: Unified Endoscopic Image Restoration
- EndoIR is an endoscopic image restoration framework leveraging a unified diffusion model to overcome degradations like low lighting, smoke, and bleeding.
- Its dual-domain prompting and dual-stream diffusion structure fuse spatial and frequency features to guide denoising while preserving critical anatomical details.
- Noise-aware routing and task-adaptive embeddings reduce computation and improve restoration metrics (PSNR, SSIM) as well as downstream segmentation accuracy.
EndoIR is an endoscopic image restoration framework that addresses real-world clinical degradations, such as low lighting, smoke, and bleeding, with a unified, degradation-agnostic diffusion model built on novel conditioning and computation-efficient architectures. It requires no degradation-specific labels or expert priors, operates as a single model across multiple simultaneous corruption types, and reports state-of-the-art restoration and downstream clinical performance (Chen et al., 8 Nov 2025).
1. Model Design and Theoretical Principles
The architectural foundation of EndoIR is a conditional diffusion generative model based on the DDIM formulation, in which learned denoising is guided by explicitly designed spatial–frequency prompts and dynamic, task-conditioned embeddings. Instead of the standard U-Net backbone, EndoIR introduces a bespoke Dual-Stream Diffusion structure.
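The DDIM formulation mentioned above admits a deterministic update (the η = 0 case). The following is a minimal sketch of that generic update, assuming a noise-prediction network and a cumulative schedule value ᾱ at each step. The function names `add_noise` and `ddim_step` are illustrative, and nothing here reflects EndoIR's actual sampler settings, which are not stated in this summary.

```python
import math


def add_noise(x0, eps, abar_t):
    """Forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    a, b = math.sqrt(abar_t), math.sqrt(1.0 - abar_t)
    return [a * x + b * e for x, e in zip(x0, eps)]


def ddim_step(x_t, eps_pred, abar_t, abar_prev):
    """One deterministic DDIM update (eta = 0) on a flat list of values.

    eps_pred stands in for the network's noise prediction at step t.
    """
    sqrt_abar_t = math.sqrt(abar_t)
    sqrt_one_minus_t = math.sqrt(1.0 - abar_t)
    sqrt_abar_prev = math.sqrt(abar_prev)
    sqrt_one_minus_prev = math.sqrt(1.0 - abar_prev)
    out = []
    for x, e in zip(x_t, eps_pred):
        # Estimate the clean value implied by the predicted noise.
        x0_hat = (x - sqrt_one_minus_t * e) / sqrt_abar_t
        # Re-noise the estimate to the previous (less noisy) level.
        out.append(sqrt_abar_prev * x0_hat + sqrt_one_minus_prev * e)
    return out
```

If the noise prediction is exact, a single step to ᾱ = 1 recovers the clean input, which is a convenient sanity check for the update.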
Central design features include:
- Dual-Domain Prompter (DDP): Generates fine-grained prompts by fusing spatial features (from raw image pixels) and frequency domain features (via FFT over the input) to guide the denoising process on both anatomical and degradation-specific signatures.
- Task Adaptive Embedding (TAE): Splits guidance into two branches: “shared” (content-invariant), formed by multiple parallel MLPs, and “specific” (corruption-aware), comprising K MLPs selected via a soft-voting TopK mechanism against the prompt vector.
- Dual-Stream Encoder (DSE): Separately encodes the corrupted image and the noise-added “clean” latent; the resulting representations are fused only after parallel spatial attention and FFN blocks, which mitigates the feature confusion found in naïve concatenation-based approaches.
- Rectified Fusion Block (RFB): Integrates the feature maps through modulated cross-attention, balancing sharp (softmax) and smooth (GELU) alignments controlled by two learnable weights.
- Noise-Aware Routing Block (NARB): Positioned in the decoder, routes computational effort dynamically onto the subset of channels most affected by noise, as determined by a softmax attention over channel descriptors. The active set is refined via residual blocks; inactive channels are passed through unchanged, reducing computational waste.
All parameters are optimized with the standard diffusion denoising objective, in which both the prompt and the task embedding condition every denoising block.
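For reference, the standard noise-prediction objective for conditional diffusion models is commonly written as below. The notation is illustrative and may differ from the paper's: $x_0$ is assumed to be the clean image, $y$ the corrupted input, $p$ the prompt, $e$ the task embedding, and $\bar{\alpha}_t$ the cumulative noise schedule.

```latex
\mathcal{L}
  = \mathbb{E}_{x_0,\, y,\, t,\, \epsilon \sim \mathcal{N}(0, I)}
    \left[ \left\| \epsilon - \epsilon_\theta(x_t, y, t, p, e) \right\|_2^2 \right],
\qquad
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon
```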
2. Dual-Domain Prompting and Adaptive Conditioning
The DDP and TAE combination is central to EndoIR’s ability to generalize over unknown degradations:
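- Dual-Domain Prompter (DDP): Given the input image, Conv2D layers produce a spatial feature map and a frequency feature map (the latter from an FFT of the input); these are fused and combined with a prompt dictionary to form the conditioning prompt.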
- Task Adaptive Embedding (TAE): The embedding is the sum of two terms. The first, summed over the shared MLPs, encodes shared anatomical content; the second, summed over the TopK-selected specific MLPs, is sparsely activated according to the inferred degradation distribution, letting the model specialize its adaptation to each input.
Ablation demonstrates a loss of ≥1 dB PSNR if either DDP or TAE is omitted, establishing their necessity for high restoration fidelity.
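The shared-plus-specific structure of TAE resembles a small mixture-of-experts layer. The sketch below illustrates that idea under several assumptions that are not taken from the paper: each "MLP" is reduced to a single matrix, the soft vote is a softmax over dot products between the prompt and hypothetical per-expert `gate_keys`, and the selected votes are renormalized to sum to one.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [v / total for v in exps]


def linear(weights, x):
    """Stand-in for an MLP: a single matrix-vector product."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def task_adaptive_embedding(prompt, shared, specific, gate_keys, k):
    """Return (embedding, selected expert indices).

    shared    : list of weight matrices, all always applied
    specific  : list of K weight matrices, only the top-k applied
    gate_keys : one key vector per specific expert (hypothetical gating)
    """
    # Shared branch: every shared expert contributes.
    out = [0.0] * len(shared[0])
    for w in shared:
        out = [a + b for a, b in zip(out, linear(w, prompt))]

    # Specific branch: soft vote against the prompt, keep the top-k experts.
    votes = softmax([sum(g * p for g, p in zip(key, prompt)) for key in gate_keys])
    top = sorted(range(len(votes)), key=lambda i: votes[i], reverse=True)[:k]
    norm = sum(votes[i] for i in top)
    for i in top:
        contrib = linear(specific[i], prompt)
        out = [a + (votes[i] / norm) * b for a, b in zip(out, contrib)]
    return out, sorted(top)
```

Because only `k` of the specific experts are evaluated, the cost of the specific branch scales with `k` rather than with the total number of experts.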
3. Dual-Stream Diffusion and Rectified Fusion
EndoIR’s encoder isolates degraded and clean features from the first layer:
- Dual-Stream Encoding: The corrupted image and the noised clean latent are processed independently, then concatenated and passed to a shared spatial attention block. The joint attention matrix is chunked, yielding two disentangled feature streams after attention, each passed through its own FFN.
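- Rectified Fusion: Fuses these streams via a compound attention mechanism that blends a sharp (softmax) alignment with a smooth (GELU) alignment, weighted by two learnable parameters.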
These modules ensure degradation cues are leveraged without compromising shared anatomical priors.
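One way to read the rectified fusion idea is as cross-attention whose weights are a learnable blend of a softmax map and a GELU map over the same scores. The sketch below is an assumption-laden illustration, not the paper's formula: the blend `w_sharp * softmax(S) + w_smooth * GELU(S)` with `S = QKᵀ/√d`, the choice of which stream supplies queries versus keys and values, and the names `w_sharp` and `w_smooth` are all hypothetical.

```python
import math


def gelu(x):
    """Exact GELU using the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [v / total for v in exps]


def rectified_fusion(q, k, v, w_sharp, w_smooth):
    """Blend softmax and GELU attention over the same scaled dot-product scores.

    q       : query vectors from one stream (list of lists)
    k, v    : key and value vectors from the other stream
    w_sharp : weight on the softmax (sharp) alignment
    w_smooth: weight on the GELU (smooth) alignment
    """
    d = len(q[0])
    fused = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        sharp = softmax(scores)
        smooth = [gelu(s) for s in scores]
        attn = [w_sharp * a + w_smooth * b for a, b in zip(sharp, smooth)]
        fused.append(
            [sum(attn[j] * v[j][c] for j in range(len(v))) for c in range(len(v[0]))]
        )
    return fused
```

Setting `w_smooth` to zero and `w_sharp` to one reduces this to ordinary softmax cross-attention, which makes the blend easy to ablate.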
4. Noise-Aware Routing for Efficient Computation
NARB improves runtime and parameter efficiency by routing computation dynamically:
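- Channel descriptors: a descriptor is computed for each channel of the decoder feature map.
- Relevance scoring: a softmax attention over the channel descriptors scores each channel.
- Selection: the top-scoring channel indices form the active set.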
- Refinement: Gather, process with residual blocks, and scatter back.
An optimized γ = 0.5 achieves the best PSNR/SSIM trade-off, halving FLOPs with no significant loss in restoration quality.
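The score, select, gather, refine, and scatter steps can be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: the channel descriptor is taken to be a global average, γ is interpreted as the fraction of channels kept (so `k = ceil(γ · C)`), and `refine` is a placeholder callable standing in for the residual blocks.

```python
import math


def noise_aware_route(features, gamma, refine):
    """Refine only the top-scoring channels; pass the rest through unchanged.

    features: list of channels, each a flat list of activations
    gamma   : assumed fraction of channels to route through `refine`
    refine  : callable mapping one channel to a residual of the same length
    """
    # Channel descriptors: global average pooling (an assumption).
    descriptors = [sum(ch) / len(ch) for ch in features]

    # Relevance scoring: softmax over the descriptors.
    m = max(descriptors)
    exps = [math.exp(d - m) for d in descriptors]
    total = sum(exps)
    scores = [e / total for e in exps]

    # Selection: indices of the k highest-scoring channels.
    k = max(1, math.ceil(gamma * len(features)))
    ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
    active = sorted(ranked[:k])

    # Gather, refine with a residual connection, and scatter back.
    out = [list(ch) for ch in features]
    for c in active:
        out[c] = [x + r for x, r in zip(features[c], refine(features[c]))]
    return out, active
```

With γ = 0.5 only half of the channels reach `refine`, which is consistent with the roughly halved compute described above, assuming the refinement dominates the block's cost.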
5. Training, Datasets, and Benchmarked Performance
- Training: End-to-end Adam optimizer, lr=, batch size 8, 100 epochs.
- No adversarial, perceptual, or auxiliary losses; pure denoising loss is used.
- Key datasets: SegSTRONG-C (joint blood/low-light/smoke), CEC (capsule endoscopy illumination).
- Quantitative results on SegSTRONG-C:
- EndoIR (γ=0.5), 21.3M parameters, 11.18 FPS, achieves PSNR 32.23 dB, SSIM 87.64%, LPIPS 0.0664 (all averaged); it exceeds the previous SOTA (AMIR) by +0.12 dB PSNR and +1.79 percentage points SSIM, and lowers LPIPS from 0.0777 to 0.0664.
- Per-subtask: Blood removal 31.33 dB/86.76%, Low-light 32.64 dB/87.96%, Smoke 32.71 dB/88.20%.
- On CEC dataset: 32.65 dB PSNR (+1.96 dB over best prior), SSIM 96.98%, LPIPS 0.0481.
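For readers less familiar with the headline metric, PSNR is defined as 10·log10(MAX² / MSE). The sketch below computes it on flat pixel lists; the function name `psnr` and the default 8-bit range are illustrative choices, and the paper's exact evaluation protocol (color space, value range, averaging) is not specified in this summary.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(restored):
        raise ValueError("inputs must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Because the scale is logarithmic, a gain of about 3 dB corresponds to halving the mean squared error, which helps put differences such as +0.12 dB and +1.96 dB in perspective.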
6. Downstream Clinical Impact
EndoIR’s restored images are directly applicable in subsequent clinical tasks:
- SegSTRONG-C downstream segmentation: Feeding EndoIR outputs to ResNet-101 + DeepLabV3+ yields Dice = 96.22%, mIoU = 91.40%, Acc = 98.97%—superior to all other restoration pipeline front-ends by >0.5% Dice/mIoU.
- A plausible implication is that EndoIR not only enhances visual quality and SNR, but also preserves critical anatomical structures required by downstream surgical or diagnostic models.
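The segmentation metrics quoted above follow standard definitions: Dice = 2|A∩B| / (|A| + |B|) and IoU = |A∩B| / |A∪B|, with mIoU being the IoU averaged over classes. A minimal sketch for a single binary mask follows; the function name `dice_and_iou` and the convention of returning 1.0 for two empty masks are assumptions for illustration.

```python
def dice_and_iou(pred, target):
    """Dice coefficient and IoU for two flat binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    pred_sum = sum(1 for p in pred if p)
    target_sum = sum(1 for t in target if t)
    union = pred_sum + target_sum - inter
    # Convention (assumed): two empty masks count as a perfect match.
    dice = 2.0 * inter / (pred_sum + target_sum) if (pred_sum + target_sum) else 1.0
    iou = inter / union if union else 1.0
    return dice, iou
```

For a single class the two are related by Dice = 2·IoU / (1 + IoU); that identity does not carry over exactly to class-averaged scores such as mIoU.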
7. Comparative Characteristics and Limitations
| Model | Params (M) | FPS | PSNR (dB) | SSIM (%) | LPIPS |
|---|---|---|---|---|---|
| EndoIR (γ=0.5) | 21.3 | 11.18 | 32.23 | 87.64 | 0.0664 |
| AMIR (prev best) | 44.2 | 5.09 | 32.11 | 85.85 | 0.0777 |
Key properties:
- Unified, degradation-agnostic operation: No labels or manual tuning per degradation class.
- Parameter efficiency: EndoIR attains SOTA with fewer parameters (21.3M vs. 44.2M) and higher throughput (11.18 vs. 5.09 FPS) than the prior best, AMIR.
- Dynamic computation: NARB allows real-time restoration (11+ FPS) with adaptive channel selection.
- Robustness and clinical usefulness: Outperforms others not just in visual metrics but also for clinical segmentation tasks.
Ablation findings and training notes:
- DDP FFT is superior to wavelet variants (+0.40 dB PSNR).
- Omitting DDP or TAE reduces performance by ≥1 dB PSNR.
- γ below 0.5 degrades results; above wastes compute.
- All training used batch size 8 and standard denoising loss; no adversarial or classifier heads.
EndoIR represents a tightly integrated, modular approach for restoring endoscopic images corrupted by a range of real-world degradations, leveraging dual-domain prompts, disentangled dual-stream fusion, and noise-aware computation. Its performance and practical versatility are validated across public benchmarks and domain-relevant clinical metrics (Chen et al., 8 Nov 2025).