EndoIR: Unified Endoscopic Image Restoration

Updated 16 November 2025
  • EndoIR is an endoscopic image restoration framework leveraging a unified diffusion model to overcome degradations like low lighting, smoke, and bleeding.
  • Its dual-domain prompting and dual-stream diffusion structure fuse spatial and frequency features to guide denoising while preserving critical anatomical details.
  • Noise-aware routing and task-adaptive embeddings reduce computational cost and improve restoration and downstream metrics such as PSNR, SSIM, and segmentation accuracy.

EndoIR is an endoscopic image restoration framework that targets real-world clinical degradations (low lighting, smoke, and bleeding) with a unified, degradation-agnostic diffusion model that combines novel conditioning with a computation-efficient architecture. It requires no degradation-specific labels or expert priors, operates as a single model across multiple simultaneous corruption types, and reports state-of-the-art restoration and downstream clinical performance (Chen et al., 8 Nov 2025).

1. Model Design and Theoretical Principles

The architectural foundation of EndoIR is a conditional diffusion generative model based on the DDIM formulation, in which learned denoising is guided by explicitly designed spatial–frequency prompts and dynamic, task-conditioned embeddings. Instead of the standard U-Net backbone, EndoIR introduces a bespoke Dual-Stream Diffusion structure.

Central design features include:

  • Dual-Domain Prompter (DDP): Generates fine-grained prompts by fusing spatial features (from raw image pixels) and frequency domain features (via FFT over the input) to guide the denoising process on both anatomical and degradation-specific signatures.
  • Task Adaptive Embedding (TAE): Splits guidance into two branches: “shared” (content-invariant), formed by multiple parallel MLPs, and “specific” (corruption-aware), comprising K MLPs selected via a soft-voting TopK mechanism against the prompt vector.
  • Dual-Stream Encoder (DSE): Separately encodes the corrupted image $x$ and the noise-added “clean” latent $y_t$; the resulting representations are fused only after parallel spatial attention and FFN blocks, which mitigates feature confusion found in naïve concatenation-based approaches.
  • Rectified Fusion Block (RFB): Integrates the feature maps through modulated cross-attention, balancing sharp (softmax) and smooth (GELU) alignments controlled by two learnable weights.
  • Noise-Aware Routing Block (NARB): Positioned in the decoder, routes computational effort dynamically onto the subset of channels most affected by noise, as determined by a softmax attention over channel descriptors. The active set is refined via $N_{\mathrm{res}}$ residual blocks; inactive channels are passed through unchanged, reducing computational waste.

All parameters are optimized via the standard diffusion denoising objective:

$$\mathcal{L}_{\mathrm{denoise}} = \mathbb{E}_{x_{0},\epsilon,t} \Bigl\| \epsilon - \epsilon_{\theta}\bigl( y_{t}\mid\mathbb{P},\ E_{\mathrm{task}} \bigr) \Bigr\|_{2}^{2}$$

where $y_t = \sqrt{\hat\alpha_t}\, x_0 + \sqrt{1-\hat\alpha_t}\,\epsilon$ with $\epsilon\sim\mathcal{N}(0,I)$, and both the prompt ($\mathbb{P}$) and the task embedding ($E_{\mathrm{task}}$) condition every denoising block.
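The forward noising step and the denoising objective above can be illustrated with a minimal pure-Python sketch. The linear beta schedule, the helper names, and the toy four-pixel "image" are assumptions made for illustration; the source does not state which noise schedule EndoIR uses, and the learned network $\epsilon_\theta$ is replaced here by two trivial stand-in predictors.

```python
import math
import random

def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, alpha_bar_t, eps):
    """y_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps, elementwise."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * x + b * e for x, e in zip(x0, eps)]

def denoise_loss(eps, eps_pred):
    """Squared L2 distance between the true and the predicted noise."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))

rng = random.Random(0)
x0 = [0.2, 0.5, 0.8, 0.1]                # toy flattened clean image
eps = [rng.gauss(0.0, 1.0) for _ in x0]  # eps ~ N(0, I)
alpha_bars = alpha_bar_schedule(1000)
t = 500
y_t = add_noise(x0, alpha_bars[t], eps)

# A perfect predictor returns eps itself, so its loss is zero;
# a predictor that always outputs zeros pays ||eps||^2.
loss_perfect = denoise_loss(eps, eps)
loss_zero = denoise_loss(eps, [0.0] * len(eps))
```

In the full model the prediction would come from the network conditioned on $\mathbb{P}$ and $E_{\mathrm{task}}$; the sketch only shows how $y_t$ and the loss target are formed.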

2. Dual-Domain Prompting and Adaptive Conditioning

The DDP and TAE combination is central to EndoIR’s ability to generalize over unknown degradations:

  • Dual-Domain Prompter (DDP): Given image $x$, produce $F_{\mathrm{img}}$ and $F_{\mathrm{freq}}$ via Conv2D layers; combine as:

$$\mathbb{P} = \mathrm{AvgPool}\Bigl( \mathrm{Softmax}(F_{\mathrm{img}} + F_{\mathrm{freq}})\cdot \mathbb{D} \Bigr)$$

with prompt dictionary $\mathbb{D}\in\mathbb{R}^{C_p\times d}$.

  • Task Adaptive Embedding (TAE):

$$E_{\text{task}} = \sum_{i=1}^{n} \mathrm{MLP}_i(\mathbb{P}) + \sum_{k=1}^{K} \mathrm{MLP}_k(\mathbb{P})\cdot\mathrm{TopK}(\mathrm{Softmax}(\mathbb{P}))$$

The first sum encodes shared anatomical content; the second is sparsely activated according to the inferred degradation distribution, enabling the model to specialize adaptation for each input.
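The two formulas in this section can be traced with a small pure-Python sketch. Several choices here are assumptions rather than details from the source: the softmax is taken over the $C_p$ axis at each spatial position, each MLP is reduced to a single linear map, and the number of "specific" experts is set equal to the prompt dimension so that $\mathrm{Softmax}(\mathbb{P})$ can gate them directly. All function names are hypothetical.

```python
import math

def softmax(v):
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def ddp_prompt(f_img, f_freq, dictionary):
    """P = AvgPool(Softmax(F_img + F_freq) . D).

    f_img, f_freq: N positions x C_p channels; dictionary: C_p x d.
    Each position forms a convex combination of dictionary rows
    (softmax over C_p, an assumption), then positions are averaged.
    """
    d = len(dictionary[0])
    pooled = [0.0] * d
    for row_img, row_freq in zip(f_img, f_freq):
        weights = softmax([a + b for a, b in zip(row_img, row_freq)])
        for w, entry in zip(weights, dictionary):
            for j in range(d):
                pooled[j] += w * entry[j]
    return [p / len(f_img) for p in pooled]

def top_k_mask(scores, k):
    """Keep the k largest scores and zero out the rest."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(order[:k])
    return [s if i in keep else 0.0 for i, s in enumerate(scores)]

def linear(weight, v):
    """Stand-in for an MLP: a single matrix-vector product."""
    return [sum(w * x for w, x in zip(row, v)) for row in weight]

def task_embedding(prompt, shared, specific, k):
    """E_task = sum_i MLP_i(P) + sum_k MLP_k(P) * TopK(Softmax(P))_k."""
    out = [0.0] * len(shared[0])
    for weight in shared:                      # always-on shared branch
        out = [o + y for o, y in zip(out, linear(weight, prompt))]
    gates = top_k_mask(softmax(prompt), k)     # sparse specific branch
    for gate, weight in zip(gates, specific):
        if gate > 0.0:
            out = [o + gate * y for o, y in zip(out, linear(weight, prompt))]
    return out, gates

# Toy inputs: 2 positions, C_p = 3 dictionary entries, d = 3.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
f_img = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
f_freq = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]]
prompt = ddp_prompt(f_img, f_freq, identity)
e_task, gates = task_embedding(prompt, [identity, identity],
                               [identity, identity, identity], k=2)
```

With these toy inputs the third expert receives a zero gate, so only two of the three specific branches contribute, which is the sparse activation the paragraph above describes.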

Ablation demonstrates a loss of ≥1 dB PSNR if either DDP or TAE is omitted, establishing their necessity for high restoration fidelity.

3. Dual-Stream Diffusion and Rectified Fusion

EndoIR’s encoder isolates degraded and clean features from the first layer:

  • Dual-Stream Encoding: Inputs $x$ and $y_t$ are processed independently, then concatenated and given to a shared spatial attention block. The joint attention matrix is chunked, resulting in two disentangled feature streams post-attention, each passed through independent FFNs.
  • Rectified Fusion: Fuses these streams via a compound attention mechanism:

$$\mathrm{Attn} = \bigl[w_1\,\mathrm{Softmax}(QK^\top) + w_2\,\mathrm{GELU}(QK^\top)\bigr]\,V$$

$$F_{\text{fuse}} = \mathrm{FFN}\bigl(\mathrm{Conv2D}(\mathrm{Attn}) + F_{\overline{x}}\bigr)$$

These modules ensure degradation cues are leveraged without compromising shared anatomical priors.
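As a rough illustration of the compound attention in the Rectified Fusion formula, the following pure-Python sketch mixes a softmax ("sharp") and a GELU ("smooth") view of the same score matrix. It follows the formula as written, so no $1/\sqrt{d}$ scaling is applied, and it omits the Conv2D and FFN of $F_{\text{fuse}}$. Which stream supplies $Q$ versus $K$ and $V$ is not stated in this summary, so the inputs are plain toy matrices and the function names are hypothetical.

```python
import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(x - peak) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def rectified_attention(q, k, v, w1, w2):
    """Attn = [w1 * Softmax(Q K^T) + w2 * GELU(Q K^T)] V."""
    scores = matmul(q, transpose(k))
    sharp = softmax_rows(scores)
    smooth = [[gelu(s) for s in row] for row in scores]
    mixed = [[w1 * a + w2 * b for a, b in zip(r1, r2)]
             for r1, r2 in zip(sharp, smooth)]
    return matmul(mixed, v)

q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]

softmax_only = rectified_attention(q, k, v, w1=1.0, w2=0.0)
gelu_only_zero_q = rectified_attention([[0.0, 0.0], [0.0, 0.0]], k, v, w1=0.0, w2=1.0)
blended = rectified_attention(q, k, v, w1=0.5, w2=0.5)
```

With `w2 = 0` the block reduces to ordinary softmax attention, whose output rows are convex combinations of the rows of `v`; the GELU term is not normalized, so it can rescale the output rather than only reweight it.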

4. Noise-Aware Routing for Efficient Computation

NARB improves runtime and parameter efficiency by routing computation dynamically:

  1. Channel descriptors: $f_c = \mathrm{AdaptiveAvgPool}(F_{\mathrm{in}})$.
  2. Relevance scoring: $a = \mathrm{Softmax}(W_2(W_1 f_c))$.
  3. Selection of the top-$k$ indices: $\mathcal{I} = \mathrm{TopK}(a, k = \lceil \gamma\,C\rceil)$.
  4. Refinement: Gather, process with $N_{\mathrm{res}}$ residual blocks, and scatter back.

The tuned value $\gamma\approx 0.5$ achieves the best PSNR/SSIM tradeoff, halving FLOPs with no significant loss in restoration quality.
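The four routing steps can be sketched in a few lines of pure Python. This is an illustrative reading of the steps, not the authors' implementation: each channel is a flat list of spatial values, `w1` and `w2` are plain matrices with no nonlinearity between them (as in the scoring formula), and `refine` is a hypothetical stand-in for the stack of residual blocks.

```python
import math

def softmax(v):
    peak = max(v)
    exps = [math.exp(x - peak) for x in v]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(weight, v):
    return [sum(w * x for w, x in zip(row, v)) for row in weight]

def narb(features, w1, w2, gamma, refine):
    """Route refinement to the top ceil(gamma * C) channels only."""
    num_channels = len(features)
    # 1. Channel descriptors: global average pool per channel.
    f_c = [sum(ch) / len(ch) for ch in features]
    # 2. Relevance scores: a = Softmax(W2 (W1 f_c)).
    a = softmax(matvec(w2, matvec(w1, f_c)))
    # 3. Top-k selection with k = ceil(gamma * C).
    k = math.ceil(gamma * num_channels)
    order = sorted(range(num_channels), key=lambda c: a[c], reverse=True)
    active = sorted(order[:k])
    # 4. Gather the active channels, refine them, scatter back;
    #    inactive channels pass through unchanged.
    out = [list(ch) for ch in features]
    for c in active:
        out[c] = refine(features[c])
    return out, active

def toy_residual(channel):
    """Hypothetical residual update: x + 0.1 * tanh(x)."""
    return [x + 0.1 * math.tanh(x) for x in channel]

identity4 = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
features = [[0.1, 0.1], [2.0, 2.0], [-1.0, -1.0], [1.0, 1.0]]
routed, active = narb(features, identity4, identity4, gamma=0.5, refine=toy_residual)
```

With `gamma = 0.5` and four channels, only the two highest-scoring channels (indices 1 and 3 here) are refined, so the refinement cost scales with $\lceil \gamma C \rceil$ rather than $C$.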

5. Training, Datasets, and Benchmarked Performance

  • Training: End-to-end with the Adam optimizer, learning rate $2\times 10^{-4}$, batch size 8, 100 epochs.
  • No adversarial, perceptual, or auxiliary losses; pure denoising loss is used.
  • Key datasets: SegSTRONG-C (joint blood/low-light/smoke), CEC (capsule endoscopy illumination).
  • Quantitative results on SegSTRONG-C:
    • EndoIR (γ=0.5), 21.3M parameters, 11.18 FPS, achieves PSNR 32.23 dB, SSIM 87.64%, LPIPS 0.0664 (all averaged); exceeds the previous SOTA [AMIR] by +0.12 dB PSNR and +1.79% SSIM, with lower LPIPS (0.0664 vs. 0.0777).
    • Per-subtask: Blood removal 31.33 dB/86.76%, Low-light 32.64 dB/87.96%, Smoke 32.71 dB/88.20%.
  • On CEC dataset: 32.65 dB PSNR (+1.96 dB over best prior), SSIM 96.98%, LPIPS 0.0481.
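For readers less familiar with the headline metric, the sketch below shows the standard PSNR definition and checks that the reported SegSTRONG-C averages are consistent with an unweighted mean of the three per-subtask figures. The assumption that pixel values lie in [0, 1] and the helper name `psnr` are illustrative; the exact evaluation protocol of the paper is not given in this summary.

```python
import math

def psnr(reference, restored, max_value=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    mse = sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)

# A uniform error of 0.1 on [0, 1] pixels gives MSE = 0.01, i.e. about 20 dB.
example_db = psnr([0.5, 0.5, 0.5, 0.5], [0.4, 0.4, 0.4, 0.4])

# Per-subtask figures quoted above (blood, low-light, smoke).
subtask_psnr = [31.33, 32.64, 32.71]
subtask_ssim = [86.76, 87.96, 88.20]
mean_psnr = sum(subtask_psnr) / len(subtask_psnr)  # rounds to 32.23
mean_ssim = sum(subtask_ssim) / len(subtask_ssim)  # rounds to 87.64
```

Because PSNR is logarithmic in MSE, a gain of about 2 dB (as on CEC) corresponds to a substantially larger error reduction than the 0.12 dB margin on SegSTRONG-C.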

6. Downstream Clinical Impact

EndoIR’s restored images are directly applicable in subsequent clinical tasks:

  • SegSTRONG-C downstream segmentation: Feeding EndoIR outputs to ResNet-101 + DeepLabV3+ yields Dice = 96.22%, mIoU = 91.40%, Acc = 98.97%—superior to all other restoration pipeline front-ends by >0.5% Dice/mIoU.
  • A plausible implication is that EndoIR not only enhances visual quality and SNR, but also preserves critical anatomical structures required by downstream surgical or diagnostic models.
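The segmentation metrics quoted above follow standard definitions, sketched here for binary masks in pure Python. Averaging IoU over the foreground and background classes to obtain mIoU is an assumption about the evaluation setup, and the toy masks are invented for illustration.

```python
def dice(pred, target):
    """Dice = 2 |A ∩ B| / (|A| + |B|) for binary masks given as 0/1 lists."""
    inter = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    denom = sum(pred) + sum(target)
    return 2.0 * inter / denom if denom else 1.0

def iou(pred, target):
    """IoU = |A ∩ B| / |A ∪ B|."""
    inter = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    union = sum(1 for p, t in zip(pred, target) if p == 1 or t == 1)
    return inter / union if union else 1.0

def mean_iou(pred, target):
    """Average of foreground and background IoU (two-class assumption)."""
    background = iou([1 - p for p in pred], [1 - t for t in target])
    return (iou(pred, target) + background) / 2.0

def accuracy(pred, target):
    return sum(1 for p, t in zip(pred, target) if p == t) / len(pred)

pred = [1, 1, 0, 0]    # toy predicted mask
target = [1, 0, 0, 0]  # toy ground-truth mask

d = dice(pred, target)      # 2 * 1 / (2 + 1) = 2/3
m = mean_iou(pred, target)  # (1/2 + 2/3) / 2 = 7/12
acc = accuracy(pred, target)  # 3/4
```

Dice and IoU both penalize the single false-positive pixel more heavily than accuracy does, which is why they are the more informative numbers when the segmented structure occupies a small part of the frame.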

7. Comparative Characteristics and Limitations

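| Model            | Params (M) | FPS   | PSNR (dB) | SSIM (%) | LPIPS  |
|------------------|------------|-------|-----------|----------|--------|
| EndoIR (γ=0.5)   | 21.3       | 11.18 | 32.23     | 87.64    | 0.0664 |
| AMIR (prev best) | 44.2       | 5.09  | 32.11     | 85.85    | 0.0777 |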
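The relative differences implied by the comparison can be worked out directly from the tabulated values. The short sketch below only performs that arithmetic; the dictionary keys are hypothetical names.

```python
endoir = {"params_m": 21.3, "fps": 11.18, "psnr": 32.23, "ssim": 87.64, "lpips": 0.0664}
amir = {"params_m": 44.2, "fps": 5.09, "psnr": 32.11, "ssim": 85.85, "lpips": 0.0777}

param_ratio = endoir["params_m"] / amir["params_m"]          # about 0.48 of AMIR's size
speedup = endoir["fps"] / amir["fps"]                        # about 2.2x the throughput
psnr_gain = endoir["psnr"] - amir["psnr"]                    # about +0.12 dB
ssim_gain = endoir["ssim"] - amir["ssim"]                    # about +1.79 points
lpips_drop = (amir["lpips"] - endoir["lpips"]) / amir["lpips"]  # about 15% lower
```

On these figures the largest relative gains are in model size and speed, while the PSNR margin is small and the LPIPS improvement is roughly 15%.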

Key properties:

  • Unified, degradation-agnostic operation: No labels or manual tuning per degradation class.
  • Parameter efficiency: EndoIR attains SOTA using fewer parameters and higher throughput than prior best.
  • Dynamic computation: NARB allows real-time restoration (11+ FPS) with adaptive channel selection.
  • Robustness and clinical usefulness: Outperforms others not just in visual metrics but also for clinical segmentation tasks.

Noted ablation findings and constraints:

  • DDP FFT is superior to wavelet variants (+0.40 dB PSNR).
  • Omitting DDP or TAE reduces performance by ≥1 dB PSNR.
  • Setting γ below 0.5 degrades results; setting it above 0.5 spends additional compute without a significant quality gain.
  • All training used batch size 8 and standard denoising loss; no adversarial or classifier heads.

EndoIR represents a tightly integrated, modular approach for restoring endoscopic images corrupted by a range of real-world degradations, leveraging dual-domain prompts, disentangled dual-stream fusion, and noise-aware computation. Its performance and practical versatility are validated across public benchmarks and domain-relevant clinical metrics (Chen et al., 8 Nov 2025).
