
Nested Unfolding Network (NUN) for Concealed Segmentation

Updated 29 November 2025
  • The NUN framework decouples restoration and segmentation via a nested DUN-in-DUN architecture that iteratively refines both tasks.
  • It employs a degradation-resistant unfolding network (DeRUN) with vision-language model guidance to adaptively handle arbitrary degradations.
  • Empirical results show that NUN achieves state-of-the-art accuracy on both clean and degraded benchmarks without relying on preset degradation models.

The Nested Unfolding Network (NUN) is a unified, interpretable architecture for real-world concealed object segmentation (COS) under arbitrary and unknown degradations. NUN decouples image restoration from segmentation through a DUN-in-DUN structure: a degradation-resistant unfolding network (DeRUN) is embedded within each stage of a segmentation-oriented deep unfolding network (SODUN), and bidirectional iterative interactions let the two refine each other. Vision-language model (VLM) guidance supplies degradation semantics without prior specification, and a multi-stage image-quality assessment mechanism adaptively selects which restoration output to propagate. The framework is reported to attain leading performance on both clean and degraded benchmarks (He et al., 22 Nov 2025).

1. Architectural Foundation: DUN-in-DUN Structure

NUN embodies two hierarchically nested deep unfolding networks (DUNs):

  • The outer Segmentation-Oriented Deep Unfolding Network (SODUN) is composed of $K$ stages. In each stage $k$, SODUN receives:
    • The previous stage’s high-quality restoration $X_{k-1}^{T1}$,
    • Foreground mask $M_{k-1}$,
    • Background estimate $B_{k-1}$.

SODUN yields updated estimates $M_k$ (foreground mask) and $B_k$ (background).

  • The inner Degradation-Resistant Unfolding Network (DeRUN), with $N_k$ iterations per SODUN stage, accepts the same set of inputs in addition to the original degraded observation $Y$. DeRUN iteratively reconstructs increasingly clean versions $\{X_{k,n}\}_{n=0}^{N_k}$ of the image.

This architecture enforces an explicit separation of the restoration and segmentation tasks. Interactions between the two DUNs are facilitated by the Bi-directional Unfolding Interaction (BUI) mechanism:

  • After $N_k$ DeRUN iterations, each candidate restoration $X_{k,n}$ is scored using image-quality assessment (IQA), and the highest-scoring candidate, $X_k^{T1}$, is propagated to the next SODUN stage.
  • Concurrently, the current SODUN outputs $(M_k, B_k)$ are injected as structural priors into DeRUN’s proximal step via a lightweight network $\mathcal X_2$, focusing restoration efforts on ambiguous regions.

This design ensures that segmentation and restoration are independently optimized while allowing reciprocal refinement across stages.
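The control flow described above can be sketched as a pair of nested loops. The following is a minimal illustration only, not the authors' implementation: `sodun_stage`, `derun_iteration`, and `iqa_score` are hypothetical toy stand-ins for the learned networks and the IQA metrics, and the initial values of $X_0^{T1}$, $M_0$, $B_0$, and $X_{k,0}$ as well as the SODUN-then-DeRUN ordering within a stage are assumptions inferred from the update rules.

```python
import numpy as np

def sodun_stage(x_best, mask, bg, y):
    # Toy stand-in for one learned SODUN stage: returns updated (mask, background).
    residual = x_best - x_best * mask - bg
    new_mask = np.clip(mask + 0.1 * x_best * residual, 0.0, 1.0)
    new_bg = bg + 0.1 * (x_best - x_best * new_mask - bg)
    return new_mask, new_bg

def derun_iteration(x_prev, mask, bg, y):
    # Toy stand-in for one learned DeRUN iteration: nudges the estimate
    # toward the observation inside the current foreground mask.
    return x_prev + 0.1 * mask * (y - x_prev)

def iqa_score(x):
    # Toy stand-in for a composite IQA score (here: global contrast).
    return float(x.std())

def nun_forward(y, num_stages=3, inner_iters=2):
    x_best = y.copy()            # X_0^{T1}: assumed initialised with Y
    mask = np.full_like(y, 0.5)  # M_0: assumed uninformative mask
    bg = np.zeros_like(y)        # B_0: assumed zero background
    masks = []
    for _ in range(num_stages):                      # outer DUN (SODUN)
        mask, bg = sodun_stage(x_best, mask, bg, y)
        x, candidates = x_best, []                   # X_{k,0}: assumed X_{k-1}^{T1}
        for _ in range(inner_iters):                 # inner DUN (DeRUN)
            x = derun_iteration(x, mask, bg, y)
            candidates.append(x)
        x_best = max(candidates, key=iqa_score)      # IQA-based selection
        masks.append(mask)
    return masks, x_best
```

Calling `nun_forward(y)` on a 2-D array returns one mask per outer stage together with the last selected restoration, mirroring how only the selected $X_k^{T1}$ crosses from the inner loop to the next outer stage.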

2. Mathematical Formulation of Iterative Segmentation and Restoration

The mechanics of NUN are formalized as follows (notation: $Y$ = degraded input, $X$ = ground-truth clean image):

SODUN (Stage $k$)

  • Mask update (gradient descent):

$$\hat M_k = M_{k-1} - \alpha_M \nabla_M \Bigl(\tfrac{1}{2}\lVert X_{k-1}^{T1} - X_{k-1}^{T1} \odot M_{k-1} - B_{k-1} \rVert_2^2 \Bigr)$$

  • Mask update (proximal step):

$$M_k = \mathcal M(\hat M_k,\, X_{k-1}^{T1},\, B_{k-1},\, Y)$$

where $\mathcal M$ utilizes variational structure separation (VSS) with convolutional refinement.

  • Background update (gradient descent):

$$\hat B_k = B_{k-1} - \alpha_B \nabla_B \Bigl(\tfrac{1}{2}\lVert X_{k-1}^{T1} - X_{k-1}^{T1}\odot M_k - B_{k-1} \rVert_2^2 \Bigr)$$

  • Background update (proximal step):

$$B_k = \mathcal B(\hat B_k,\, M_k,\, X_{k-1}^{T1},\, Y)$$

with $\mathcal B$ implemented via a compact U-Net.
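The two gradient-descent steps have closed forms. Writing the residual as $R = X - X \odot M - B$, the data term $\tfrac{1}{2}\lVert R \rVert_2^2$ has gradient $-X \odot R$ with respect to $M$ and $-R$ with respect to $B$. The NumPy sketch below illustrates only these gradient steps under that derivation; the learned proximal networks $\mathcal M$ and $\mathcal B$ are omitted, and the function names and step sizes are illustrative assumptions.

```python
import numpy as np

def residual(x, mask, bg):
    # R = X - X * M - B (element-wise product)
    return x - x * mask - bg

def energy(x, mask, bg):
    # Data term 0.5 * ||R||_2^2 shared by both updates.
    return 0.5 * np.sum(residual(x, mask, bg) ** 2)

def mask_gradient_step(x, mask, bg, alpha_m):
    # grad_M energy = -X * R, so the descent step adds alpha_m * X * R.
    return mask + alpha_m * x * residual(x, mask, bg)

def background_gradient_step(x, mask, bg, alpha_b):
    # grad_B energy = -R, so the descent step adds alpha_b * R.
    return bg + alpha_b * residual(x, mask, bg)
```

In an unfolded stage, `mask_gradient_step` would be followed by the proximal network before the background step is taken with the updated mask, matching the order of the equations above.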

DeRUN (Stage $k$, Iteration $n$)

  • VLM-based degradation inference:

$$d_{k,n} = \mathrm{DA\text{-}CLIP}(X_{k,n-1})$$

  • Degradation operator construction:

$$RC_{k,n}^D(Z) = \sigma_{k,n} \odot (\mathrm{CRC}(Z) + Z) + \mu_{k,n}$$

  • Restoration (gradient descent):

$$\hat X_{k,n} = X_{k,n-1} - \alpha_X\, RC_{k,n}^{D\,T}\bigl(RC_{k,n}^D(X_{k,n-1}) - X_{k-1}^{T1}\bigr)$$

  • Restoration (proximal step):

$$X_{k,n} = \mathcal X_1(\hat X_{k,n}) + \mathcal X_2(B_k, M_k, Y)$$

$\mathcal X_2$ leverages $M_k$, $B_k$, and $Y$ for targeted enhancement of segmentation-ambiguous regions.

  • Stage-wise selection: From $\{X_{k,n}\}_{n=1}^{N_k}$, the best $X_k^{T1}$ is chosen via a composite IQA score combining TOPIQ, Q-Align, and MUSIQ metrics.
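The restoration gradient step can be illustrated on flattened vectors. In this sketch the learned block $\mathrm{CRC}$ is replaced by a fixed matrix `A`, which is purely an assumption made so that the transpose $RC^{D\,T}$ can be written exactly; in the actual network both operators are presumably learned modules. Under that assumption the step is ordinary gradient descent on $\tfrac{1}{2}\lVert RC^D(X) - X_{k-1}^{T1} \rVert_2^2$.

```python
import numpy as np

def rc_forward(z, A, sigma, mu):
    # RC^D(Z) = sigma * (CRC(Z) + Z) + mu, with CRC modelled as the matrix A.
    return sigma * (A @ z + z) + mu

def rc_transpose(r, A, sigma):
    # Adjoint of the linear part z -> sigma * ((A + I) z).
    s = sigma * r
    return A.T @ s + s

def restoration_gradient_step(x_prev, target, A, sigma, mu, alpha_x):
    # X_hat = X_prev - alpha_x * RC^T(RC(X_prev) - target)
    r = rc_forward(x_prev, A, sigma, mu) - target
    return x_prev - alpha_x * rc_transpose(r, A, sigma)
```

Here `target` plays the role of $X_{k-1}^{T1}$, while `sigma` and `mu` stand for the VLM-derived modulation parameters $\sigma_{k,n}$ and $\mu_{k,n}$.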

3. Vision-Language Model Guidance for Degradation Semantics

A vision-language model (VLM), specifically DA-CLIP, is integrated to infer degradation semantics directly from input images, removing the requirement for pre-defined or explicit priors. Formally, the VLM approximates a posterior over possible degradation types:

$$p(d \mid Y) \approx \mathrm{softmax}\bigl(E_\mathrm{img}(Y)\, E_\mathrm{text}(d)\bigr)$$

where $\{d\}$ is a pre-specified vocabulary (e.g., haze, low-light), and $E_\mathrm{img}$, $E_\mathrm{text}$ are embedding functions for images and text respectively.

The VLM output modulates the unfolding-step operators via $\sigma_{k,n}$ and $\mu_{k,n}$, which are derived using convolutional transformations on the DA-CLIP embedding $d_{k,n}$. This enables DeRUN to adapt restoration strategies dynamically to varying types of degradation encountered in real-world imagery.
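A CLIP-style posterior and the subsequent modulation can be sketched as follows. This is not the DA-CLIP API: the embeddings are plain arrays supplied by the caller, the cosine normalisation and `temperature` value are assumptions borrowed from common CLIP practice, and the linear maps `w_sigma` and `w_mu` are hypothetical stand-ins for the convolutional transformations mentioned above.

```python
import numpy as np

def softmax(v):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(v - v.max())
    return e / e.sum()

def degradation_posterior(img_emb, text_embs, temperature=0.07):
    # Cosine similarity between one image embedding and each text embedding,
    # turned into a distribution over the degradation vocabulary.
    img = img_emb / np.linalg.norm(img_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return softmax(txt @ img / temperature)

def modulation_params(d, w_sigma, w_mu):
    # Hypothetical linear stand-in for the learned maps d -> (sigma, mu).
    return w_sigma @ d, w_mu @ d
```

With one row of `text_embs` per vocabulary entry (for example haze and low-light), the index of the largest posterior entry gives the most likely degradation type for the current restoration estimate.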

4. Loss Functions, Optimization Strategy, and Consistency

NUN applies comprehensive supervision across all stages for both restoration and segmentation, introducing cross-stage regularization for stability:

  • Restoration-fidelity loss:

$$L_\mathrm{rest} = \sum_{k=1}^K \frac{1}{2^{K-k}} \lVert X_k^{T1} - X \rVert_2^2$$

  • Segmentation loss: Sum of weighted binary cross-entropy (BCE) and weighted intersection-over-union (IoU) terms:

$$L_\mathrm{seg} = \sum_{k=1}^K \frac{1}{2^{K-k}} \bigl[L^w_\mathrm{BCE}(M_k, GT_s) + L^w_\mathrm{IoU}(M_k, GT_s)\bigr]$$

  • Cross-stage consistency loss: Mask stability under alternate restoration:

$$L_\mathrm{csc} = \sum_{k=1}^K \frac{1}{2^{K-k}} \bigl[L^w_\mathrm{BCE}(M_k, M_k^{T2}) + L^w_\mathrm{IoU}(M_k, M_k^{T2})\bigr]$$

  • Total training objective:

$$L_\mathrm{total} = L_\mathrm{rest} + L_\mathrm{seg} + \epsilon\, L_\mathrm{csc}$$

Here $\epsilon$ is a hyperparameter controlling the strength of the consistency regularization, and the superscript $w$ in $L^w$ marks the weighted variants of the BCE and IoU losses.
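The stage weighting $1/2^{K-k}$ doubles from one stage to the next, so the final stage carries weight 1 and earlier stages contribute progressively less. A small sketch of the weights, the restoration term, and the total objective follows; the function names are illustrative, and the segmentation and consistency terms are passed in as precomputed scalars because their weighted BCE and IoU definitions are not spelled out here.

```python
import numpy as np

def stage_weights(num_stages):
    # Weight 1 / 2^(K - k) for stages k = 1..K.
    return [1.0 / 2 ** (num_stages - k) for k in range(1, num_stages + 1)]

def restoration_loss(restorations, clean):
    # Sum over stages of weight_k * ||X_k^{T1} - X||_2^2.
    weights = stage_weights(len(restorations))
    return sum(w * np.sum((x_k - clean) ** 2)
               for w, x_k in zip(weights, restorations))

def total_loss(l_rest, l_seg, l_csc, eps):
    # L_total = L_rest + L_seg + eps * L_csc
    return l_rest + l_seg + eps * l_csc
```

For example, `stage_weights(4)` evaluates to `[0.125, 0.25, 0.5, 1.0]`.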

Image-quality assessment in each DeRUN stage ensures only the highest-IQA restoration is propagated, and the cross-stage consistency loss promotes mask robustness to subtle changes in restorations.
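The selection step amounts to an argmax over a composite score. The sketch below assumes the composite is a weighted sum of per-metric scores, which the text does not specify; `scorers` is a list of hypothetical callables standing in for TOPIQ, Q-Align, and MUSIQ, each mapping an image to a scalar where higher is better.

```python
import numpy as np

def composite_score(image, scorers, weights):
    # Weighted sum of the individual quality scores (assumed combination rule).
    return sum(w * f(image) for f, w in zip(scorers, weights))

def select_best(candidates, scorers, weights):
    # Return the index and image of the highest-scoring candidate.
    scores = [composite_score(c, scorers, weights) for c in candidates]
    best = int(np.argmax(scores))
    return best, candidates[best]
```

Because real IQA metrics live on different scales, a practical variant would likely normalise each metric before summing; that refinement is left out here.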

5. Bi-Directional Feature Exchange and Cross-Stage Refinement

The BUI mechanism is central to NUN’s iterative, interpretable refinement loop:

  • Segmentation-to-Restoration: DeRUN’s proximal network $\mathcal X_2$ takes SODUN’s current outputs $(M_k, B_k)$ together with $Y$ to prioritize restoration in regions where segmentation is ambiguous.
  • Restoration-to-Segmentation: SODUN, rather than operating directly on the raw degraded $Y$, uses the progressively restored $X^{T1}$ generated by DeRUN, which yields more accurate gradient estimates during mask and background separation.

This exchange is realized mathematically in the update rules for both networks, maintaining decoupling of optimization objectives while enabling synergistic improvement for both tasks across multiple stages.

6. Performance, Benchmarking, and Significance

Extensive empirical evaluation demonstrates that NUN attains leading segmentation accuracy on both clean and degraded test sets for concealed object segmentation. The mechanism of selecting the best restoration output via IQA and enforcing cross-stage segmentation mask stability by self-consistency loss yields robustness against a broad spectrum of real-world degradation scenarios, without reliance on pre-defined degradation models (He et al., 22 Nov 2025).

This suggests that NUN’s architectural paradigm (alternating, decoupled unfolding with reciprocal guidance) may generalize to other tasks where restoration and semantic estimation have conflicting or complementary objectives. A plausible implication is applicability in domains such as biomedical imaging and remote sensing, where similar degradation-agnostic strategies are advantageous.

7. Interpretability and Future Implications

The iterative, stage-wise nature of NUN supports interpretability: each sub-network’s role and the information flow between them are explicit in the update rules. The explicit decoupling of restoration and segmentation is intended to reduce conflicting learning signals, and the vision-language interface lets the restoration adapt to unknown and variable degradation phenomena.

Given its robust empirical performance and principled bidirectional refinement schema, further exploration of DUN-in-DUN architectures is warranted in tasks involving joint low-level and high-level vision. Open research directions include applying similar nested unfolding paradigms to other structured optimization tasks, expanding the set of guided priors via more advanced VLMs, and formal analysis of convergence and interpretability guarantees under diverse degradation conditions.
