RCDNet: Interpretable Deep Architectures
- RCDNet is a family of interpretable deep neural networks characterized by modular architectures grounded in physical and semantic priors.
- In single image deraining, it employs convolutional dictionary modeling and proximal gradient unfolding to separate rain streaks from clean images, achieving state-of-the-art PSNR and SSIM on standard benchmarks.
- For referring change detection, RCDNet uses Siamese encoders and CLIP-based cross-modal fusion to accurately identify semantic changes in remote sensing imagery.
RCDNet is an architectural designation used independently in multiple research domains, including single image deraining (Wang et al., 2020; Wang et al., 2021) and referring change detection in remote sensing imagery (Korkmaz et al., 2025). In each context, RCDNet denotes a physically or semantically interpretable deep neural network whose modular operations correspond to algorithmic steps grounded in mathematical models, optimization solvers, or cross-modal fusion logic. Below, the major instantiations of RCDNet are profiled.
1. RCDNet for Single Image Deraining
RCDNet, in the context of image restoration, refers to the "Rain Convolutional Dictionary Network"—a model-driven deep neural network that implements an interpretable layer-wise architecture for decomposing a rainy image into clean background and rain streak components using convolutional dictionary modeling and proximal gradient descent (Wang et al., 2020, Wang et al., 2021).
Rain Model and Optimization Formulation
Given a rainy RGB image $\mathcal{O}$, RCDNet postulates the additive decomposition

$$\mathcal{O} = \mathcal{B} + \mathcal{R},$$

where $\mathcal{B}$ is the target clean background and $\mathcal{R}$ the rain layer. The rain layer is parameterized using a convolutional dictionary:

$$\mathcal{R} = \sum_{n=1}^{N} \mathcal{C}_n \otimes \mathcal{M}_n,$$

with $\mathcal{C}_n$ denoting physically motivated "rain kernels" and $\mathcal{M}_n$ their spatially varying coefficient maps. Layer separation is achieved via regularized least-squares minimization:

$$\min_{\mathcal{B},\,\{\mathcal{M}_n\}} \left\| \mathcal{O} - \mathcal{B} - \sum_{n=1}^{N} \mathcal{C}_n \otimes \mathcal{M}_n \right\|_F^2 + \alpha\, g_1(\{\mathcal{M}_n\}) + \beta\, g_2(\mathcal{B}),$$

with $g_1$, $g_2$ as learned (rather than fixed) priors, whose proximal operators are realized by compact ResNets.
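As a concrete illustration of the dictionary model (a minimal PyTorch sketch with assumed shapes; the kernel count $N$, kernel size, and image size here are illustrative, not the papers' configuration):

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the dictionary synthesis R = sum_n C_n (x) M_n.
# Shapes are illustrative assumptions, not the papers' exact configuration.
N, k = 32, 9                        # number of rain kernels and kernel size (assumed)
O = torch.rand(1, 3, 64, 64)        # rainy RGB image
M = torch.rand(1, N, 64, 64)        # coefficient maps M_n, one channel per kernel
C = torch.randn(3, N, k, k) * 0.01  # rain kernels C_n, one set per RGB channel

# conv2d sums over the N map channels, realizing R = sum_n C_n (x) M_n.
R = F.conv2d(M, C, padding=k // 2)  # rain layer, shape (1, 3, 64, 64)
B = O - R                           # background estimate given the rain layer
```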
Proximal Gradient Deep Unfolding
Optimization is operationalized through iterative proximal gradient steps alternating between the rain maps $\{\mathcal{M}_n\}$ and the background $\mathcal{B}$. Each step is unfolded into distinct network stages, indexed by $s$:
- M-net updates the rain maps by a gradient step on the data term, back-projected through the transposed rain kernels, followed by a learned proximal mapping:
$$\mathcal{M}^{(s)} = \mathrm{proxNet}_{\theta_m^{(s)}}\!\Big(\mathcal{M}^{(s-1)} - \eta_1\, \mathcal{C} \otimes^{\!\top} \big(\mathcal{C} \otimes \mathcal{M}^{(s-1)} - (\mathcal{O} - \mathcal{B}^{(s-1)})\big)\Big),$$
where $\mathcal{C} \otimes \mathcal{M} := \sum_n \mathcal{C}_n \otimes \mathcal{M}_n$ and $\otimes^{\!\top}$ denotes the transposed (back-projection) convolution.
- B-net updates the background given the latest rain separation, fusing the prior background with the rain-subtracted reconstruction:
$$\mathcal{B}^{(s)} = \mathrm{proxNet}_{\theta_b^{(s)}}\!\Big((1-\eta_2)\,\mathcal{B}^{(s-1)} + \eta_2\,\big(\mathcal{O} - \mathcal{C} \otimes \mathcal{M}^{(s)}\big)\Big).$$
Every layer is interpretable: dictionaries encode prototypical streak geometries, while proxNets embody adaptive regularization on rain/background structure. End-to-end learning trains all kernels, step sizes, and proxNet weights directly from data.
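A minimal PyTorch-style sketch of one unfolded stage is given below; the single-convolution `prox_m`/`prox_b` modules, step-size parameters, and shapes are illustrative stand-ins for the paper's compact ResNet proxNets and learned hyperparameters:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RCDStage(nn.Module):
    """One unfolded proximal-gradient stage (illustrative sketch)."""
    def __init__(self, n_kernels=32, k=9):
        super().__init__()
        self.C = nn.Parameter(torch.randn(3, n_kernels, k, k) * 0.01)  # rain kernels
        self.eta1 = nn.Parameter(torch.tensor(0.1))  # step size for the M-update
        self.eta2 = nn.Parameter(torch.tensor(0.1))  # fusion weight for the B-update
        self.prox_m = nn.Conv2d(n_kernels, n_kernels, 3, padding=1)  # stand-in proxNet
        self.prox_b = nn.Conv2d(3, 3, 3, padding=1)                  # stand-in proxNet
        self.pad = k // 2

    def forward(self, O, B, M):
        # M-net: gradient step on the rain maps, then learned proximal mapping.
        R = F.conv2d(M, self.C, padding=self.pad)                  # current rain layer
        resid = R - (O - B)                                        # rain residual
        grad_M = F.conv_transpose2d(resid, self.C, padding=self.pad)  # back-projection
        M = self.prox_m(M - self.eta1 * grad_M)
        # B-net: fuse the prior background with the rain-subtracted estimate.
        R = F.conv2d(M, self.C, padding=self.pad)
        B = self.prox_b((1 - self.eta2) * B + self.eta2 * (O - R))
        return B, M
```

Stacking several such stages and supervising the final $\mathcal{B}$ (and optionally intermediate outputs) recovers the end-to-end training setup described above.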
Dynamic RCDNet (DRCDNet) for Cross-domain Generalization
The dynamic extension, DRCDNet, incorporates a small global kernel dictionary $\mathcal{D} = \{\mathcal{D}_k\}_{k=1}^{K}$ and per-image kernel mixing weights $\alpha$, forming the image-specific rain kernels $\mathcal{C}_n = \sum_{k=1}^{K} \alpha_{nk}\, \mathcal{D}_k$. This reparameterization enables kernel adaptation to new rain patterns, shrinking the search space and preserving strong generalization when train/test rain statistics differ.
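A hedged sketch of the dynamic kernel generation (the dictionary size, shapes, and softmax mixing are assumptions for illustration):

```python
import torch

# Illustrative sketch of DRCDNet-style dynamic kernel mixing (notation assumed).
K, N, k = 8, 32, 9                   # dictionary atoms, rain kernels, kernel size
D = torch.randn(K, k, k)             # small global rain-kernel dictionary (shared)
alpha = torch.softmax(torch.randn(N, K), dim=1)  # per-image mixing weights (predicted)

# Each image-specific kernel C_n is a mixture of the shared dictionary atoms;
# color channels are omitted here for brevity.
C = torch.einsum('nk,khw->nhw', alpha, D)        # shape (N, k, k)
```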
Empirical Performance and Interpretability
RCDNet and DRCDNet demonstrate state-of-the-art PSNR and SSIM on synthetic and real-world benchmarks (Rain100L, Rain100H, Rain1400, SPA-Data), with DRCDNet attaining superior cross-domain results. All modules are directly visualizable: rain maps become sparse across stages, dictionaries reveal direction and thickness patterns of rain, and the white-box design aids analysis.
2. RCDNet for Referring Change Detection in Remote Sensing
In the domain of change detection, RCDNet denotes a cross-modal fusion architecture for "Referring Change Detection": the identification of specific semantic changes between temporally separated remote sensing images conditioned on natural-language prompts (Korkmaz et al., 2025).
Problem and Two-Stage Framework
Traditional semantic change detection is hindered by rigid output-channel coupling to label sets and poor adaptability. RCDNet addresses these via:
- RCDGen: a diffusion-based pipeline synthesizing post-change images and corresponding change masks from pre-change images and class names, mitigating annotated-data scarcity and class imbalance.
- RCDNet: a neural network ingesting the pre-change image, the post-change image, and a user-provided textual description, and producing a binary change-of-interest map (interface sketched below).
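As an interface-level illustration only (the `refer_change` wrapper and the model's forward signature are hypothetical, not the paper's API), referring change detection maps an image pair plus a prompt embedding to a binary mask:

```python
import torch
import torch.nn as nn

def refer_change(model: nn.Module,
                 img_pre: torch.Tensor,   # (B, 3, H, W) pre-change image
                 img_post: torch.Tensor,  # (B, 3, H, W) post-change image
                 text_emb: torch.Tensor,  # (B, L, D) prompt token embeddings
                 thresh: float = 0.5) -> torch.Tensor:
    """Return a binary change-of-interest mask of shape (B, 1, H, W)."""
    logits = model(img_pre, img_post, text_emb)   # assumed forward signature
    return (torch.sigmoid(logits) > thresh).float()
```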
Network Architecture
RCDNet is composed of:
- Siamese Encoder Blocks (SEB): parallel VMamba state-space models extract features from pre- and post-change images.
- Fusion Module (FM): performs local 1×1 projections and depth-wise convolutions, concatenates spatial tokens from both images, and applies multi-head self-attention for cross-image feature interaction (sketched after this list).
- Mask Decoder Blocks (MDB): hierarchical decoder stacks upsample spatially while injecting the language prompt via CLIP-based cross-attention Transformers. Each decoder block implements channel-wise attention and cross-modal fusion at each scale.
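The fusion step can be sketched as follows (a minimal PyTorch sketch under an assumed channel width and head count; the paper's exact layer configuration may differ):

```python
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    """Sketch of the FM: local projection + depth-wise conv, then joint MHSA."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)                        # 1x1 projection
        self.dw = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)   # depth-wise conv
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, f1, f2):                   # (B, C, H, W) pre-/post-change features
        f1 = self.dw(self.proj(f1))
        f2 = self.dw(self.proj(f2))
        tokens = torch.cat([f1.flatten(2), f2.flatten(2)], dim=2)  # (B, C, 2HW)
        tokens = tokens.transpose(1, 2)                            # (B, 2HW, C)
        out, _ = self.attn(tokens, tokens, tokens)                 # cross-image MHSA
        return self.norm(out + tokens)           # fused token sequence for the decoder
```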
Training and Inference
A CLIP text encoder generates prompt embeddings, and a pixel-wise binary cross-entropy loss guides mask prediction (see the training sketch below). Training leverages both synthetic (RCDGen) and real labeled data, improving robustness to class imbalance and mixed domain statistics.
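A hedged sketch of a single training step, combining a frozen CLIP text encoder (here via Hugging Face `transformers`) with pixel-wise binary cross-entropy; the model's forward signature is an assumption:

```python
import torch
import torch.nn.functional as F
from transformers import CLIPTokenizer, CLIPTextModel

# Frozen CLIP ViT-B/32 text encoder for prompt embeddings.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

def training_step(model, img_pre, img_post, prompt, gt_mask, optimizer):
    """One optimization step: prompt encoding, mask prediction, pixel-wise BCE."""
    tokens = tokenizer([prompt], padding=True, return_tensors="pt")
    with torch.no_grad():                        # text encoder frozen (or LoRA-tuned)
        text_emb = text_encoder(**tokens).last_hidden_state  # (1, L, 512)
    logits = model(img_pre, img_post, text_emb)  # (1, 1, H, W), assumed signature
    loss = F.binary_cross_entropy_with_logits(logits, gt_mask)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```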
Quantitative Performance
RCDNet achieves superior mIoU, SeK, and OA on targeted semantic change detection benchmarks (SECOND, CNAM-CD), cross-domain building change detection (WHU-CD, LEVIR-CD), and binary change detection, compared with both traditional and prior semantic baselines. Pretraining on synthetic data improves transferability.
3. Interpretability and Design Principles
Across both domains, RCDNet is distinguished by direct correspondences between network modules and interpretable algorithmic operations:
- Dictionary layers embody explicit spatial priors.
- Proximal Net modules realize adaptive regularization (rain removal context).
- Cross-attention Transformers in MDB connect individual spatial features to semantic token streams (remote sensing context).
Every operator can be traced to a mathematically grounded step or physical prior, facilitating visualization, debugging, and scientific reasoning about inference and generalization characteristics.
4. Implementation, Training Protocols, and Datasets
Image Deraining
- Network stages: typically $S = 17$ unfolded stages with $N = 32$ rain kernels of size $9 \times 9$; optimization via Adam (initial learning rate $10^{-3}$, 100 epochs, patch size $64 \times 64$).
- Datasets: Rain100L/H, Rain1400, Rain12, SPA-Data.
- White-box operation: each parameter learned end-to-end, all modules inspectable.
Change Detection
- Datasets: SECOND, CNAM-CD (semantic); WHU-CD, LEVIR-CD (binary/cross-domain).
- CLIP ViT-B/32 used for text encoding, optionally LoRA-fine-tuned.
- Network built on a VMamba backbone, trained via AdamW (200 epochs, batch size 4).
- Synthetic data: pretraining uses 55,000 synthetic pairs generated by the RCDGen pipeline.
5. Empirical Summary
| Domain | Architecture Principle | Interpretable Modules | Best Results (Typical) |
|---|---|---|---|
| Single-image deraining | Dictionary + proximal deep unfolding | rain kernels, proxNets | Rain100L: PSNR 40.00, SSIM 0.986 |
| Change detection | Siamese, cross-modal fusion | visual state-space, fusion, cross-attention | SECOND: mIoU 73.04, OA 89.00 |
RCDNet in both contexts demonstrates state-of-the-art performance, modular interpretability, and generalization to unseen or mixed training/test distributions. The central theme—modular architectures rooted in optimization modeling or cross-modal fusion—enables both visual inspection of learned components and robust empirical results.
6. Significance and Future Directions
RCDNet exemplifies the methodological trend toward embedding explicit priors, whether physical (rain streaks) or semantic (change type), into deep architectures whose main learning task is structured, interpretable regularization or fusion. By aligning neural layers with proximal gradient steps or cross-attention operations, RCDNet achieves:
- End-to-end trainable, easily inspectable models.
- Robustness to domain shifts (via adaptive kernels or synthetic data pretraining).
- Applicability across vision domains: restoration, change detection, semantic object removal.
A plausible implication is that the RCDNet approach—physical modeling + unfolded or cross-modal fusion architectures, with clear module semantics—is extensible to other structured image analysis domains (e.g., medical imaging artifact removal, targeted object manipulation) where both interpretability and generalization are needed.