
Depth-Aware Scanning in DMDNet

Updated 8 January 2026
  • Depth-Aware Scanning (DAScan) is a depth-guided sequential method that orders pixel processing by proximity to effectively disentangle transmission and reflection layers.
  • It employs complementary region-based and global scanning strategies to maximize structural clarity and enhance recovery in complex, low-contrast images.
  • Integrated with DS-SSM and MECM in DMDNet, DAScan contributes to state-of-the-art performance on both daytime and nighttime benchmarks.

The Depth-Memory Decoupling Network (DMDNet) is an architecture designed for single-image reflection separation, with a particular emphasis on challenging low-contrast conditions such as nighttime scenes. It aims to recover the transmission layer (T) and reflection layer (R) from a blended image using a synergy of depth-guided scanning, depth-modulated state-space modeling, and memory-based cross-image compensation. DMDNet integrates three key algorithmic modules—Depth-Aware Scanning (DAScan), Depth-Synergized State-Space Model (DS-SSM), and the Memory Expert Compensation Module (MECM)—and introduces the Nighttime Image Reflection Separation (NightIRS) dataset to address the scarcity of annotated nighttime benchmarks (Fang et al., 1 Jan 2026).

1. Architectural Overview

DMDNet processes an input blended image $I \in \mathbb{R}^{3\times H\times W}$ to output the separated layers $T$ and $R$. The architecture consists of the following components:

  • Encoding Branch: A two-stream extractor (MuGI) that learns multi-scale features relevant for both $T$ and $R$.
  • Depth Semantic Modulation Branch: Processes a precomputed proximity (depth) map $P$ to yield depth semantic features $\{D^i_S\}$ via lightweight convolutional layers.
  • Decoding Branch: Stacked Depth-Memory Decoupling blocks (DMBlocks) reconstruct $T$ and $R$. Each DMBlock contains:
    • DSMamba (a depth-synergized Mamba variant for decoupling)
    • Memory Expert Compensation Module (MECM)
    • Efficient feed-forward network (EFFN)

The motivation stems from the observation that, especially under low-contrast conditions, $T$ and $R$ may have similar intensities, complicating disentanglement. Depth cues are used to highlight structurally coherent regions likely belonging to $T$, while memory-based patterns leverage historical cross-image knowledge for robust restoration (Fang et al., 1 Jan 2026).

2. Depth-Aware Scanning (DAScan)

DAScan directs sequential state-space processing by ordering the spatial scan to emphasize regions of semantic and geometric salience, as indicated by the proximity (depth) map $P \in \mathbb{R}^{H\times W}$. Two complementary scanning strategies are defined:

  • Region-based Scan for Transmission (DA-RScan):
    • Segments the image into connected regions using a binary threshold on $P$.
    • Regions are sorted by area (largest first), and pixels within each region are ordered by proximity (near-to-far).
    • Generates a permutation $\pi_T$ that maximizes early inclusion of salient structural information.
  • Global Scan for Reflection (DA-GScan):
    • All pixels are globally ranked by descending proximity.
    • The resulting permutation $\pi_R$ supports a holistic, depth-biased scan relevant for reflection modeling.

This strategy aims to preserve object continuity and prevent the propagation of ambiguity in state processing. The sequential order is programmable according to the layer (transmission or reflection) being reconstructed, as formalized in the provided pseudocode (Fang et al., 1 Jan 2026).
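The two scanning orders can be sketched as follows. This is an illustrative reimplementation rather than the authors' released pseudocode; in particular, appending below-threshold background pixels after all regions in DA-RScan is an assumption made here for completeness.

```python
import numpy as np
from collections import deque

def da_gscan(P):
    """DA-GScan sketch: rank all pixels globally by descending proximity."""
    flat = P.reshape(-1)
    return np.argsort(-flat, kind="stable")  # permutation pi_R over flat indices

def da_rscan(P, thresh=0.5):
    """DA-RScan sketch: threshold P, extract 4-connected regions, order
    regions by area (largest first), pixels near-to-far within each region."""
    H, W = P.shape
    mask = P >= thresh
    labels = -np.ones((H, W), dtype=int)
    regions = []
    for i in range(H):
        for j in range(W):
            if mask[i, j] and labels[i, j] < 0:
                # BFS flood fill collecting one connected region
                q = deque([(i, j)])
                labels[i, j] = len(regions)
                pix = []
                while q:
                    y, x = q.popleft()
                    pix.append(y * W + x)
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < H and 0 <= nx < W and mask[ny, nx] and labels[ny, nx] < 0:
                            labels[ny, nx] = len(regions)
                            q.append((ny, nx))
                regions.append(pix)
    regions.sort(key=len, reverse=True)          # largest regions first
    order = []
    for pix in regions:
        pix.sort(key=lambda idx: -P.flat[idx])   # near-to-far within region
        order.extend(pix)
    # assumption: below-threshold pixels follow, also ordered by proximity
    rest = [idx for idx in range(H * W) if not mask.flat[idx]]
    rest.sort(key=lambda idx: -P.flat[idx])
    return np.array(order + rest)                # permutation pi_T
```

Both functions return a flat-index permutation that can be used to gather pixels into the sequence consumed by the state-space scan.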

3. Depth-Synergized State-Space Model (DS-SSM)

DS-SSM generalizes standard state-space models by introducing modulation of state updates based on local depth. For the $t$-th scanned pixel $x_t$, the update equations become:

$$h_t = A\,h_{t-1} + B^{\mathrm{aware}}_t\,x_t, \qquad y_t = C^{\mathrm{aware}}_t\,h_t + D\,x_t$$

where the awareness tensors are

$$B^{\mathrm{aware}}_t = (1-\gamma_t)B + \gamma_t B_{\mathrm{depth}}, \quad C^{\mathrm{aware}}_t = (1-\gamma_t)C + \gamma_t C_{\mathrm{depth}}$$

$$\gamma_t = \sigma\big(\alpha\,(P(\pi(t)) - \tau)\big)$$

with $\sigma$ the sigmoid function and $\alpha, \tau$ learnable parameters. $B_{\mathrm{depth}}$ and $C_{\mathrm{depth}}$ are depth-conditioned transforms that enhance the influence of high-proximity (less ambiguous) regions. This suppresses the propagation of ambiguous features, especially in low-contrast, structure-poor areas.

Spatial Positional Encoding is further incorporated by augmenting $h_t$ with a sine/cosine encoding of normalized pixel coordinates, providing translational context for the state update (Fang et al., 1 Jan 2026).
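The depth-modulated recurrence can be written as a shape-level sketch. The dimensions, fixed $\alpha$ and $\tau$, and dense matrix parameterization here are simplifying assumptions; the actual model uses learnable $\alpha, \tau$ and a per-channel selective-SSM parameterization.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ds_ssm_scan(x_seq, p_seq, A, B, B_depth, C, C_depth, D, alpha=10.0, tau=0.5):
    """Depth-synergized SSM recurrence (illustrative dense-matrix form).
    x_seq: (L, d_in) inputs in scan order; p_seq: (L,) proximities P(pi(t)).
    h_t = A h_{t-1} + B_aware x_t ;  y_t = C_aware h_t + D x_t."""
    L, _ = x_seq.shape
    h = np.zeros(A.shape[0])
    ys = []
    for t in range(L):
        g = sigmoid(alpha * (p_seq[t] - tau))   # gamma_t: depth gate in [0, 1]
        B_aware = (1 - g) * B + g * B_depth     # interpolate input transform
        C_aware = (1 - g) * C + g * C_depth     # interpolate output transform
        h = A @ h + B_aware @ x_seq[t]
        ys.append(C_aware @ h + D @ x_seq[t])
    return np.stack(ys)                         # (L, d_out)
```

High-proximity pixels (γ → 1) are processed through the depth-conditioned transforms, while low-proximity pixels fall back toward the base $B$, $C$ matrices.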

4. Memory Expert Compensation Module (MECM)

MECM introduces learnable cross-image "memory experts" for modulation and compensation of both global patterns and local context:

  • Expert Gate: Dynamically selects $K$ out of $N$ experts for each sample, producing separate gating weights for transmission and reflection to enable layer-specific compensation.
  • Memory Experts: Each consists of two parallel streams:
    • GPStream (Global-Pattern Interaction): Computes batch-to-memory and memory-to-batch affinities to extract and reinforce globally consistent patterns using a memory bank $\mathrm{Mem}\in\mathbb{R}^{M \times C}$. Retrieval, fusion, and memory writes operate as explicit steps balancing adaptation and stability.
    • SCStream (Spatial-Context Refinement): Reshapes memories as spatial kernels and performs top-$k$ retrieval, with softmax-weighted aggregation at each spatial location, refining the reconstruction with spatially relevant compensatory features.

Layer-specific routes allow $T$- and $R$-related experts to specialize in transmission or reflection restoration, respectively, enforced through appropriate gating (Fang et al., 1 Jan 2026).
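A minimal sketch of top-$K$ expert routing in the gate follows. The renormalization over the selected experts is standard mixture-of-experts practice and an assumption about MECM's exact gating form; `W_gate` is a hypothetical learned projection.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def expert_gate(feat, W_gate, K=2):
    """Top-K gating sketch: score N experts from a sample feature vector,
    keep the K highest-scoring experts, renormalize their weights, zero the rest."""
    logits = W_gate @ feat                  # (N,) one score per expert
    topk = np.argsort(-logits)[:K]          # indices of the K selected experts
    weights = np.zeros_like(logits)
    weights[topk] = softmax(logits[topk])   # renormalize over selected experts
    return weights, topk
```

Running two such gates (one on transmission features, one on reflection features) yields the layer-specific routing described above.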

5. Loss Functions and Training Protocol

DMDNet blends multiple objective terms:

  • Appearance Loss ($\mathcal{L}_{app}$): Combination of pixel-wise $L_1$ loss and perceptual (VGG-based) loss on both $\hat{T}$ and $\hat{R}$.
  • Memory Matching Loss ($\mathcal{L}_{mem}$): Includes a triplet loss (pull image features toward their top memory, push away from the second-best) and an alignment loss (keep image features close to their memory assignment).
  • Load Balancing Loss ($\mathcal{L}_{load}$): Penalizes variance in expert gate weights to discourage mode collapse and promote expert diversity.
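Simplified forms of the triplet and load-balancing terms might look like the following; the margin, the Euclidean distance, and the variance penalty are illustrative choices, and the paper's exact formulations may differ.

```python
import numpy as np

def load_balance_loss(gate_weights):
    """Load-balancing sketch: variance of the per-expert average gate weight
    across a batch. Zero when all experts carry equal load.
    gate_weights: (batch, N), rows nonnegative and summing to 1."""
    usage = gate_weights.mean(axis=0)        # average load per expert
    return float(np.var(usage))

def triplet_memory_loss(feat, mem, margin=0.2):
    """Memory-matching sketch: pull feat toward its best-matching memory slot
    (positive) and away from the second-best (negative), hinge/triplet form."""
    d = np.linalg.norm(mem - feat, axis=1)   # distance to each memory slot
    order = np.argsort(d)
    pos, neg = d[order[0]], d[order[1]]
    return float(max(0.0, pos - neg + margin))
```

The triplet term is zero once the nearest memory is closer than the runner-up by at least the margin; the balance term is zero only under perfectly uniform expert usage.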

Training proceeds on a mixture of PASCAL VOC (7,643 pairs), Nature (200), and Real (89) datasets, using Adam with progressive learning rate decay and basic augmentations (random crop, flip). Batch size is set to 1 on a single RTX 4090 GPU (Fang et al., 1 Jan 2026).

6. NightIRS Dataset

NightIRS is a dedicated nighttime reflection separation dataset introduced to address the scarcity of real-world low-light benchmarks. Its key properties include:

  • Composition: Contains 1,000 triplets $(I, T, R)$.
  • Capture: Acquired with a Sony LYTIA-T808 in urban, street, indoor, and other low-light environments. Reflection effects are introduced using acrylic/glass sheets (1–8 mm thick).
  • Depth Annotation: Proximity maps provided by MiDaS v3.1 Next-ViT-L.
  • Image Quality: HDR imaging and tripod stabilization reduce noise and ensure high-fidelity ground truth (Fang et al., 1 Jan 2026).

7. Evaluation and Empirical Results

Quantitative assessment uses PSNR, SSIM, and LPIPS metrics. Summary results are as follows:

  • Transmission Layer:
    • Daytime: PSNR = 26.27 dB, SSIM = 0.889, LPIPS = 0.093 (surpasses DSIT, RDNet, and other SOTA baselines).
    • NightIRS: PSNR = 25.24 dB, SSIM = 0.832, LPIPS = 0.144.
  • Reflection Layer:
    • Daytime: PSNR = 22.31 dB, SSIM = 0.522, LPIPS = 0.403 (best across evaluated models).
    • NightIRS: PSNR = 28.37 dB, SSIM = 0.633, LPIPS = 0.286.
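For reference, the PSNR figures above follow the standard definition, $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$, which can be computed as:

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Standard peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")          # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))
```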

The ablation studies presented in the original work empirically isolate the contributions of DAScan, DS-SSM, and MECM. Notably, replacing DS-SSM with a vanilla SSM or omitting either DAScan modality degrades PSNR by 0.4–1.4 dB. Excluding GPStream or SCStream similarly reduces quality, with the removal of MECM producing the most pronounced performance drop (while also reducing model complexity). Higher-quality depth estimation yields consistently stronger results; omitting depth reduces average PSNR by over 2 dB across test sets (Fang et al., 1 Jan 2026).

8. Component Ablations and Depth Quality Assessment

Ablations conducted on DSMamba, MECM, and depth model variants reveal the following:

| Configuration | PSNR (dB) | SSIM | LPIPS | Params (M) | FLOPs (G) |
|---|---|---|---|---|---|
| DA-R, DA-G, DS-SSM, SPE | 26.27 | 0.889 | 0.093 | 87.2 | 39.3 |
| DA-R, DA-G, Orig. SSM | 25.78 | 0.884 | 0.098 | 83.3 | 38.6 |
| Orig. scanning, DS-SSM | 25.69 | 0.884 | 0.096 | 89.4 | 39.2 |
| DA-G, DA-R, DS-SSM | 26.09 | 0.887 | 0.096 | 87.2 | 39.3 |
| DA-R, DA-G, DS-SSM, no SPE | 25.66 | 0.882 | 0.105 | 87.2 | 39.3 |

For MECM, including both GPStream and SCStream, with four experts and two selected per sample, gives the best trade-off. Using a lower-quality depth model, or omitting depth entirely, systematically degrades all metrics, particularly PSNR and LPIPS in nighttime scenarios.

9. Significance and Impact

DMDNet achieves state-of-the-art separation for both transmission and reflection layers under a wide variety of lighting conditions. Its design demonstrates that depth-synergized sequential feature propagation and cross-image memory can address the ambiguity inherent in single-image reflection separation, especially in low-contrast, real-world scenes. The introduction of NightIRS further enables systematic benchmarking of new models under previously underrepresented nighttime scenarios (Fang et al., 1 Jan 2026).
