Zero-Shot Video Deraining Methods
- The paper introduces a zero-shot video deraining framework that leverages pretrained diffusion models to remove rain artifacts without paired training data.
- It employs negative prompting and block-wise attention switching to steer the generative process away from rain, preserving structural and temporal consistency.
- Experimental evaluations on benchmarks like NTURain and GT-Rain show improved artifact removal and temporal coherence with strong quantitative metrics.
Zero-shot video deraining methods constitute a class of restoration algorithms that utilize pretrained (often generative, diffusion-based) models to remove rain artifacts from videos without requiring task-specific, paired training data or supervised fine-tuning. Recent approaches exploit the powerful priors of large-scale diffusion models—either pretrained on images or videos—and offer generalized, data-agnostic solutions that demonstrate improved robustness to complex, real-world rain, occlusion, and motion scenarios. Unlike traditional or finetuned methods, these frameworks operate in a zero-shot paradigm, leveraging mechanisms such as prompt manipulation, attention reweighting, and temporal consistency guidance to achieve both artifact removal and preservation of temporal coherence across dynamic scenes.
1. Pretrained Diffusion Models and Inversion Schemes
Zero-shot video deraining methods are built atop large-scale diffusion models trained either on images or videos. Two key paradigms have emerged: one adapts pretrained image diffusion models by introducing cross-frame mechanisms, while the other employs video diffusion models directly and manipulates their conditional generation process.
For direct video-diffusion-based approaches, the input video is first inverted into the latent space of the diffusion backbone—for example, CogVideoX. The inversion must guarantee editability while retaining high-fidelity reconstruction of the original content. Among various inversion techniques, DDPM inversion (following the edit-friendly algorithm of Huberman-Spiegelglas et al. [HP24]) yields superior structural fidelity and supports subsequent conditional interventions. The forward process is described by

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, I),$$

with the noise $\epsilon_t$ sampled independently at each step, while the reverse, edit-friendly inversion iteratively recovers $x_{t-1}$ from $x_t$ and stored model-predicted noise estimates under the given conditioning (Varanka et al., 23 Nov 2025).
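A minimal PyTorch sketch of this edit-friendly inversion is given below, under simplifying assumptions: `alpha_bar` holds the backbone's cumulative noise schedule $\bar{\alpha}_1,\dots,\bar{\alpha}_T$ and `predict_noise(x_t, t)` is a hypothetical stand-in for the latent-space noise predictor; neither name comes from the cited papers.

```python
import torch

def edit_friendly_ddpm_inversion(x0, predict_noise, alpha_bar):
    """Sketch of edit-friendly DDPM inversion (after Huberman-Spiegelglas et al.):
    each x_t is drawn independently from the forward marginal q(x_t | x_0), and
    the per-step noise z_t that the reverse DDPM update would need to reproduce
    the trajectory exactly is solved for and stored."""
    T = alpha_bar.numel()                       # alpha_bar[t-1] = \bar{alpha}_t
    alpha_bar_prev = torch.cat([torch.ones(1, dtype=alpha_bar.dtype,
                                           device=alpha_bar.device), alpha_bar[:-1]])
    alpha = alpha_bar / alpha_bar_prev          # per-step alpha_t
    beta = 1.0 - alpha
    sigma = ((1 - alpha_bar_prev) / (1 - alpha_bar) * beta).sqrt()  # sigma_1 = 0

    # 1) Forward marginals with statistically independent noise at every step.
    xs = [x0] + [alpha_bar[t].sqrt() * x0 + (1 - alpha_bar[t]).sqrt() * torch.randn_like(x0)
                 for t in range(T)]

    # 2) Reverse pass: store the z_t that makes the DDPM update hit x_{t-1} exactly.
    zs = {}
    for t in range(T, 0, -1):
        x_t, x_prev = xs[t], xs[t - 1]
        eps_hat = predict_noise(x_t, t)
        mu = (x_t - beta[t - 1] / (1 - alpha_bar[t - 1]).sqrt() * eps_hat) / alpha[t - 1].sqrt()
        zs[t] = (x_prev - mu) / sigma[t - 1] if sigma[t - 1] > 0 else torch.zeros_like(x0)
    return xs, zs
```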
2. Negative Prompting and Deraining via Conditional Guidance
To achieve rain removal, these frameworks harness "negative prompting," a variant of classifier-free guidance that steers the generative process away from undesired (rain-related) content encoded in the model's learned representations. At every diffusion step (beyond an initial skip that preserves coarse structural information), the noise prediction is manipulated as

$$\tilde{\epsilon}_\theta(x_t) = \epsilon_\theta(x_t, \varnothing) - \lambda\,\bigl(\epsilon_\theta(x_t, c_{\text{rain}}) - \epsilon_\theta(x_t, \varnothing)\bigr),$$

where $c_{\text{rain}}$ is a rain-related prompt (e.g., "light rain") and $\varnothing$ denotes the null prompt. The difference $\epsilon_\theta(x_t, c_{\text{rain}}) - \epsilon_\theta(x_t, \varnothing)$ directionally encourages sampling away from rain concepts, with $\lambda$ acting as the amplification scale. This negative guidance is central to the zero-shot rain removal mechanism, providing strong generalization to unseen real-world precipitation (Varanka et al., 23 Nov 2025).
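In code, the guidance amounts to a one-line combination of the two noise predictions. The sketch below assumes both predictions have already been computed; the `model.predict_noise` and `embed` names in the usage comment are illustrative, not an actual API.

```python
import torch

def negative_prompt_guidance(eps_null, eps_rain, scale=15.0):
    """Combine noise predictions so sampling is pushed *away* from the rain
    concept: start from the null-prompt prediction and subtract the direction
    pointing toward the rain prompt.  `scale` plays the role of the
    amplification factor lambda; 15-25 is the range discussed in the text."""
    return eps_null - scale * (eps_rain - eps_null)

# Usage inside a denoising loop (names are illustrative, not the paper's API):
# eps_rain = model.predict_noise(x_t, t, prompt_embeds=embed("light rain"))
# eps_null = model.predict_noise(x_t, t, prompt_embeds=embed(""))
# eps_guided = negative_prompt_guidance(eps_null, eps_rain, scale=15.0)
```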
3. Temporal Structure and Attention Manipulation
Temporal stability and structural fidelity are maintained through two interlinked mechanisms:
- Attention Switching Module: Applying negative prompting naively can introduce visual artifacts—blurring, color shifts, or spatial warping—due to indiscriminate global alteration of the conditional representations. To counteract this, the attention key/value representations of the text tokens in select transformer blocks are swapped for those obtained under the null prompt, while the image (video-token) features remain conditioned on the "rain" prompt. This block-wise attention switch (typically the early $0$–$4$ and late $15$–$29$ blocks) preserves critical high-frequency, structure-consistent information while still applying the deraining guidance; a minimal sketch of the key/value swap follows this list.
- Temporal Consistency Guidance (Image-to-Video): In image-based pipelines applied to video, short-term temporal correlation is injected by replacing spatial self-attention with cross-previous-frame attention. Letting $z^i$ denote the current frame's features and $z^{i-1}$ those of the previous frame, attention is computed as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V, \qquad Q = W^Q z^i,\quad K = W^K z^{i-1},\quad V = W^V z^{i-1}.$$

Further, a temporal consistency loss is defined over RAFT-computed optical flow and occlusion masks, enforcing per-frame restoration consistency through classifier-free-style gradient guidance in the denoising step (Cao et al., 2 Jul 2024); a second sketch after this list illustrates both mechanisms.
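A minimal sketch of the block-wise key/value switch follows, assuming a DiT-style block whose attention operates jointly over image and text tokens; the block indices, tensor shapes, and function names are illustrative assumptions rather than the papers' implementation.

```python
import torch

# Blocks in which the *text-token* keys/values are taken from the null prompt
# (early and late blocks, following the ranges quoted above); other blocks keep
# the rain-prompt conditioning.  Indices are illustrative.
SWITCH_BLOCKS = set(range(0, 5)) | set(range(15, 30))

def switched_text_kv(block_idx, kv_rain, kv_null):
    """Pick the (key, value) pair for the text tokens of one transformer block:
    null-prompt K/V in switched blocks, rain-prompt K/V elsewhere."""
    return kv_null if block_idx in SWITCH_BLOCKS else kv_rain

def block_attention(block_idx, q_img, kv_img, kv_text_rain, kv_text_null):
    """Joint attention over image and text tokens; only the text-token K/V are
    ever switched, so the image features stay conditioned on the rain prompt."""
    k_txt, v_txt = switched_text_kv(block_idx, kv_text_rain, kv_text_null)
    k = torch.cat([kv_img[0], k_txt], dim=1)      # (B, N_img + N_txt, d)
    v = torch.cat([kv_img[1], v_txt], dim=1)
    attn = torch.softmax(q_img @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)
    return attn @ v
```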
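The cross-previous-frame attention and a warp-based temporal loss can be sketched as follows; the projection matrices, the flow/occlusion conventions, and the reliance on an external RAFT estimator are assumptions for illustration, not the exact formulation of Cao et al.

```python
import torch
import torch.nn.functional as F

def cross_previous_frame_attention(z_cur, z_prev, w_q, w_k, w_v):
    """Self-attention whose keys/values come from the previous frame's features,
    injecting short-term temporal correlation.  z_* are (N_tokens, d_in) feature
    maps; w_q/w_k/w_v stand in for the block's existing projection weights."""
    q, k, v = z_cur @ w_q, z_prev @ w_k, z_prev @ w_v
    attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v

def temporal_consistency_loss(frame_cur, frame_prev, flow, occlusion_mask):
    """Penalize the difference between the current restored frame and the
    previous frame warped by externally computed (e.g. RAFT) optical flow,
    ignoring occluded pixels.  Shapes: frames (1, C, H, W), flow (1, 2, H, W)
    with channel 0 = horizontal displacement, occlusion_mask (1, 1, H, W)."""
    _, _, h, w = frame_cur.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0)     # (1, 2, H, W)
    # grid_sample expects normalized (x, y) coordinates in [-1, 1].
    gx = 2.0 * (base[:, 0] + flow[:, 0]) / (w - 1) - 1.0
    gy = 2.0 * (base[:, 1] + flow[:, 1]) / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                         # (1, H, W, 2)
    warped_prev = F.grid_sample(frame_prev, grid, align_corners=True)
    return ((frame_cur - warped_prev).abs() * occlusion_mask).mean()
```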
4. Inference Pipeline and Algorithmic Workflow
Video Diffusion Model-Based Pipeline (Varanka et al., 23 Nov 2025)
- Invert the input video into the latent space using DDPM inversion, storing the model-predicted noise estimates at all denoising steps.
- For $t = T$ down to $1$:
- For all steps beyond the initial skip, compute the negative-prompt guidance on the predicted noise as above.
- In selected transformer blocks, swap attention keys/values to those from null-prompted text features.
- Perform the denoising update using the guided noise estimate $\tilde{\epsilon}_\theta$ and the stored inversion noise (the full loop is sketched after this list).
- Reconstruct the derained latent and decode via VAE.
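A compact orchestration sketch of this loop is given below, reusing the inversion and guidance helpers from the earlier sketches; `backbone`, `vae`, `embed`, and the values of `t_skip` and `scale` are illustrative assumptions rather than the paper's actual interfaces, and attention switching is assumed to happen inside the backbone's forward pass.

```python
import torch

def derain_video(video_latent, backbone, vae, alpha_bar, t_skip=5, scale=15.0):
    """End-to-end sketch: edit-friendly DDPM inversion of the rainy clip, then
    guided resampling that reuses the stored noise maps.  All names and default
    values are illustrative."""
    # 1) Invert under the null prompt, storing per-step noise maps z_t.
    xs, zs = edit_friendly_ddpm_inversion(
        video_latent, lambda x, t: backbone.predict_noise(x, t, embed("")), alpha_bar)

    T = alpha_bar.numel()
    x = xs[T]
    for t in range(T, 0, -1):
        eps_null = backbone.predict_noise(x, t, embed(""))
        if t > T - t_skip:
            eps = eps_null                       # initial skip: keep coarse structure
        else:
            eps_rain = backbone.predict_noise(x, t, embed("light rain"))
            eps = negative_prompt_guidance(eps_null, eps_rain, scale)

        # 2) DDPM update that re-injects the stored inversion noise z_t (key/value
        #    switching in selected blocks is assumed inside backbone.predict_noise).
        a_bar_t = alpha_bar[t - 1]
        a_bar_prev = alpha_bar[t - 2] if t > 1 else torch.tensor(1.0)
        alpha_t = a_bar_t / a_bar_prev
        beta_t = 1.0 - alpha_t
        sigma_t = ((1 - a_bar_prev) / (1 - a_bar_t) * beta_t).sqrt()
        mu = (x - beta_t / (1 - a_bar_t).sqrt() * eps) / alpha_t.sqrt()
        x = mu + sigma_t * zs[t]

    # 3) Decode the derained latent.
    return vae.decode(x)
```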
Image Diffusion Model Adapted to Video (Cao et al., 2 Jul 2024)
- Initialize the latents of all frames with identical noise.
- For each denoising step $t$ from $T$ down to the early-stopping step:
- Predict the noise, estimate the clean samples, and enforce the framewise content constraint (the generic projection is replaced by a deraining operator).
- Compute and apply temporal consistency loss, sample and share inter-frame noise, blend under occlusion masks.
- Update each frame's latent with a mean shift given by the temporal-guidance gradient and the spatially shared noise.
- Output: the early-stopped denoised (clean) estimates of all frames; a high-level sketch of this loop follows the list.
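The image-to-video loop can be summarized in the following hedged sketch. Here `predict_noise`, `derain_op`, `temporal_loss_fn`, `guidance_weight`, and `stop_t` are all illustrative assumptions, the occlusion-mask blending is omitted, and the re-noising step is a simplified stand-in for the paper's exact update.

```python
import torch

def derain_frames(rainy_frames, predict_noise, alpha_bar, derain_op,
                  temporal_loss_fn, guidance_weight=0.1, stop_t=100):
    """Sketch of the image-diffusion-to-video loop: identical initial noise for
    all frames, a framewise deraining constraint on the clean estimates, a
    temporal-consistency gradient shifting the update mean, and shared
    inter-frame noise.  Every name and default value here is illustrative."""
    n_frames = rainy_frames.shape[0]
    T = alpha_bar.numel()
    shared = torch.randn_like(rainy_frames[0])
    x = shared.unsqueeze(0).repeat(n_frames, 1, 1, 1)            # identical init noise

    for t in range(T, stop_t, -1):
        a_bar = alpha_bar[t - 1]
        eps = predict_noise(x, t)                                # per-frame noise estimates
        x0_hat = (x - (1 - a_bar).sqrt() * eps) / a_bar.sqrt()   # clean estimates
        x0_hat = derain_op(x0_hat, rainy_frames)                 # framewise content constraint

        # Temporal-consistency guidance: gradient of a warp-based loss over
        # consecutive clean estimates shifts the mean of the update.
        x0_hat = x0_hat.detach().requires_grad_(True)
        loss = sum(temporal_loss_fn(x0_hat[i], x0_hat[i - 1]) for i in range(1, n_frames))
        grad = torch.autograd.grad(loss, x0_hat)[0]
        x0_hat = (x0_hat - guidance_weight * grad).detach()

        # Simplified re-noising toward t-1 with noise shared across frames
        # (occlusion-mask blending omitted for brevity).
        a_bar_prev = alpha_bar[t - 2] if t > 1 else torch.tensor(1.0)
        noise = torch.randn_like(shared).unsqueeze(0).repeat(n_frames, 1, 1, 1)
        x = a_bar_prev.sqrt() * x0_hat + (1 - a_bar_prev).sqrt() * noise

    return x0_hat                                                # early-stopped clean estimates
```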
The following table summarizes core methodological distinctions:
| Approach | Main Temporal Module | Rain Removal Mechanism |
|---|---|---|
| (Varanka et al., 23 Nov 2025) | Attention switching in DiT transformer | Negative prompting |
| (Cao et al., 2 Jul 2024) | Cross-previous-frame attention, temporal gradient | Image-to-video deraining loss |
5. Quantitative and Qualitative Evaluation
Zero-shot video deraining models have been empirically validated on multiple real-rain benchmarks, including NTURain (real), GT-Rain, and the internally curated RealRain13. Evaluation metrics span no-reference image quality (CLIP-IQA ↓), perceptual frame quality (MUSIQ ↑), temporal consistency (Warp Error ↓), and standard PSNR/SSIM where synthetic ground truth is available.
A representative performance summary ((Varanka et al., 23 Nov 2025), main Table 2 and Appendix Table 1):
| Dataset | Method | CLIP-IQA ↓ | Warp ↓ | MUSIQ ↑ | PSNR ↑ | SSIM ↑ |
|---|---|---|---|---|---|---|
| NTURain | RainMamba | 0.734 | 0.057 | 56.84 | 37.87 | 0.9738 |
| NTURain | Ours | 0.324 | 0.047 | 55.43 | 27.66 | 0.8492 |
| GT-Rain | Ours | 0.075 | 0.0064 | 50.18 | — | — |
| RealRain13 | Ours | 0.447 | 0.015 | 57.00 | — | — |
The inclusion of the attention switching module yields significantly enhanced structural quality (CLIP-IQA 0.324 vs 0.437 w/o switch) and stable perceptual gains.
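For reference, the Warp Error reported above is essentially the same warp-and-compare computation as the temporal loss sketched in Section 3, averaged over consecutive restored frame pairs. The sketch below reuses that helper; the flow/mask conventions remain assumptions rather than the papers' exact evaluation protocol.

```python
import torch

def warp_error(frames, flows, masks):
    """Mean warping error over consecutive restored frames, reusing the
    temporal_consistency_loss sketch from Section 3.  frames[i] is a
    (1, C, H, W) tensor; flows[i-1]/masks[i-1] relate frame i-1 to frame i."""
    errs = [temporal_consistency_loss(frames[i], frames[i - 1], flows[i - 1], masks[i - 1])
            for i in range(1, len(frames))]
    return torch.stack(errs).mean()
```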
6. Ablation Studies, Limitations, and Adaptation
Systematic ablations confirm:
- Video inversion quality: DDPM inversion outperforms DDIM and SDEdit in editability and fidelity.
- Prompt engineering: "light rain" outperforms implicit null or mean-of-rain-embedding prompts.
- Attention switching specificity: Block selection is critical; early and late blocks work best.
- Guidance scale: Stronger guidance in select blocks (λ = 25 vs. 15) further reduces rain artifacts with minimal aesthetic cost.
Limitations include diminished deraining effectiveness under heavy or dense rain (which can be misinterpreted as scene content) and reliance on the generative capacity of the backbone model. Real-time deployment is constrained by decoding speed (∼3 min per clip on an H100 GPU). Proposed future directions involve gradient-based negative guidance, learned negative-prompt embeddings, and adaptive skip-timestep scheduling (Varanka et al., 23 Nov 2025).
For the image-diffusion–based approach (Cao et al., 2 Jul 2024), explicit adaptation is needed for rain residue handling: the projector enforcing coarse content matching must be replaced by a differentiable deraining operator, and robust optical flow is essential under heavy occlusions.
7. Relation to Prior Work and Generalization
Zero-shot video deraining methods contrast sharply with supervised, paired-data solutions, which struggle with generalization beyond synthetic or static-camera datasets. By leveraging the strong inductive priors of pretrained diffusion models, the zero-shot paradigm provides enhanced robustness across dynamic backgrounds and varied precipitation conditions, without per-video retraining or fine-tuning that could risk overfitting or prior degradation. This generalization, as validated on real-world benchmarks, marks a substantial advance in scalable video restoration and supports future research on unsupervised multi-artifact removal, adaptive prompt engineering, and efficient, robust temporal stabilizers (Varanka et al., 23 Nov 2025, Cao et al., 2 Jul 2024).