Zero-Shot Video Deraining Methods
- The paper introduces a zero-shot video deraining framework that leverages pretrained diffusion models to remove rain artifacts without paired training data.
- It employs negative prompting and block-wise attention switching to steer the generative process away from rain, preserving structural and temporal consistency.
- Experimental evaluations on benchmarks like NTURain and GT-Rain show improved artifact removal and temporal coherence with strong quantitative metrics.
Zero-shot video deraining methods constitute a class of restoration algorithms that utilize pretrained (often generative, diffusion-based) models to remove rain artifacts from videos without requiring task-specific, paired training data or supervised fine-tuning. Recent approaches exploit the powerful priors of large-scale diffusion models—either pretrained on images or videos—and offer generalized, data-agnostic solutions that demonstrate improved robustness to complex, real-world rain, occlusion, and motion scenarios. Unlike traditional or finetuned methods, these frameworks operate in a zero-shot paradigm, leveraging mechanisms such as prompt manipulation, attention reweighting, and temporal consistency guidance to achieve both artifact removal and preservation of temporal coherence across dynamic scenes.
1. Pretrained Diffusion Models and Inversion Schemes
Zero-shot video deraining methods are built atop large-scale diffusion models trained either on images or videos. Two key paradigms have emerged: one adapts pretrained image diffusion models by introducing cross-frame mechanisms, while the other employs video diffusion models directly and manipulates their conditional generation process.
For direct video-diffusion-based approaches, the input video is first inverted into the latent space of the diffusion backbone—for example, CogVideoX. The inversion must guarantee editability while retaining high-fidelity reconstruction of the original content. Among various inversion techniques, DDPM inversion (following the edit-friendly algorithm of Huberman-Spiegelglas et al. [HP24]) yields superior structural fidelity and supports subsequent conditional interventions. The forward process is described by

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, I),$$

with the noise $\epsilon_t$ sampled independently at each step, while the reverse, edit-friendly inversion iteratively recovers $x_{t-1}$ from $x_t$ and stored model-predicted noise estimates under the given conditioning (Varanka et al., 23 Nov 2025).
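A minimal PyTorch sketch of this edit-friendly inversion is given below, under simplifying assumptions: `alpha_bar` holds the backbone's cumulative noise schedule $\bar{\alpha}_1,\dots,\bar{\alpha}_T$ and `predict_noise(x_t, t)` is a hypothetical stand-in for the latent-space noise predictor; neither name comes from the cited papers.

```python
import torch

def edit_friendly_ddpm_inversion(x0, predict_noise, alpha_bar):
    """Sketch of edit-friendly DDPM inversion (after Huberman-Spiegelglas et al.):
    each x_t is drawn independently from the forward marginal q(x_t | x_0), and
    the per-step noise z_t that the reverse DDPM update would need to reproduce
    the trajectory exactly is solved for and stored."""
    T = alpha_bar.numel()                       # alpha_bar[t-1] = \bar{alpha}_t
    alpha_bar_prev = torch.cat([torch.ones(1, dtype=alpha_bar.dtype,
                                           device=alpha_bar.device), alpha_bar[:-1]])
    alpha = alpha_bar / alpha_bar_prev          # per-step alpha_t
    beta = 1.0 - alpha
    sigma = ((1 - alpha_bar_prev) / (1 - alpha_bar) * beta).sqrt()  # sigma_1 = 0

    # 1) Forward marginals with statistically independent noise at every step.
    xs = [x0] + [alpha_bar[t].sqrt() * x0 + (1 - alpha_bar[t]).sqrt() * torch.randn_like(x0)
                 for t in range(T)]

    # 2) Reverse pass: store the z_t that makes the DDPM update hit x_{t-1} exactly.
    zs = {}
    for t in range(T, 0, -1):
        x_t, x_prev = xs[t], xs[t - 1]
        eps_hat = predict_noise(x_t, t)
        mu = (x_t - beta[t - 1] / (1 - alpha_bar[t - 1]).sqrt() * eps_hat) / alpha[t - 1].sqrt()
        zs[t] = (x_prev - mu) / sigma[t - 1] if sigma[t - 1] > 0 else torch.zeros_like(x0)
    return xs, zs
```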
2. Negative Prompting and Deraining via Conditional Guidance
To achieve rain removal, these frameworks harness "negative prompting," a variant of classifier-free guidance that steers the generative process away from undesired (rain-related) content encoded in the model's learned representations. At every diffusion step (beyond an initial skip that preserves coarse structural information), the noise prediction is manipulated as

$$\tilde{\epsilon}_\theta(x_t) = \epsilon_\theta(x_t, \varnothing) - \lambda\,\bigl(\epsilon_\theta(x_t, c_{\text{rain}}) - \epsilon_\theta(x_t, \varnothing)\bigr),$$

where $c_{\text{rain}}$ is a rain-related prompt (e.g., "light rain") and $\varnothing$ denotes the null prompt. The difference $\epsilon_\theta(x_t, c_{\text{rain}}) - \epsilon_\theta(x_t, \varnothing)$ directionally encourages sampling away from rain concepts, with $\lambda$ acting as the amplification scale. This negative guidance is central to the zero-shot rain removal mechanism, providing strong generalization to unseen real-world precipitation (Varanka et al., 23 Nov 2025).
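In code, the guidance amounts to a one-line combination of the two noise predictions. The sketch below assumes both predictions have already been computed; the `model.predict_noise` and `embed` names in the usage comment are illustrative, not an actual API.

```python
import torch

def negative_prompt_guidance(eps_null, eps_rain, scale=15.0):
    """Combine noise predictions so sampling is pushed *away* from the rain
    concept: start from the null-prompt prediction and subtract the direction
    pointing toward the rain prompt.  `scale` plays the role of the
    amplification factor lambda; 15-25 is the range discussed in the text."""
    return eps_null - scale * (eps_rain - eps_null)

# Usage inside a denoising loop (names are illustrative, not the paper's API):
# eps_rain = model.predict_noise(x_t, t, prompt_embeds=embed("light rain"))
# eps_null = model.predict_noise(x_t, t, prompt_embeds=embed(""))
# eps_guided = negative_prompt_guidance(eps_null, eps_rain, scale=15.0)
```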
3. Temporal Structure and Attention Manipulation
Temporal stability and structural fidelity are maintained through two interlinked mechanisms:
- Attention Switching Module: Applying negative prompting naively can introduce visual artifacts—blurring, color shifts, or spatial warping—due to indiscriminate global alteration of the conditional representations. To counteract this, the attention key/value representations of the text tokens in select transformer blocks are swapped for those obtained under the null prompt, while the image (video-token) features remain conditioned on the "rain" prompt. This block-wise attention switch (typically the early $0$–$4$ and late $15$–$29$ blocks) preserves critical high-frequency, structure-consistent information while still applying the deraining guidance; a minimal sketch of the key/value swap follows this list.
- Temporal Consistency Guidance (Image-to-Video): In image-based pipelines applied to video, short-term temporal correlation is injected by replacing spatial self-attention with cross-previous-frame attention. Letting $z^i$ denote the current frame's features and $z^{i-1}$ those of the previous frame, attention is computed as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V, \qquad Q = W^Q z^i,\quad K = W^K z^{i-1},\quad V = W^V z^{i-1}.$$

Further, a temporal consistency loss is defined over RAFT-computed optical flow and occlusion masks, enforcing per-frame restoration consistency through classifier-free-style gradient guidance in the denoising step (Cao et al., 2 Jul 2024); a second sketch after this list illustrates both mechanisms.
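A minimal sketch of the block-wise key/value switch follows, assuming a DiT-style block whose attention operates jointly over image and text tokens; the block indices, tensor shapes, and function names are illustrative assumptions rather than the papers' implementation.

```python
import torch

# Blocks in which the *text-token* keys/values are taken from the null prompt
# (early and late blocks, following the ranges quoted above); other blocks keep
# the rain-prompt conditioning.  Indices are illustrative.
SWITCH_BLOCKS = set(range(0, 5)) | set(range(15, 30))

def switched_text_kv(block_idx, kv_rain, kv_null):
    """Pick the (key, value) pair for the text tokens of one transformer block:
    null-prompt K/V in switched blocks, rain-prompt K/V elsewhere."""
    return kv_null if block_idx in SWITCH_BLOCKS else kv_rain

def block_attention(block_idx, q_img, kv_img, kv_text_rain, kv_text_null):
    """Joint attention over image and text tokens; only the text-token K/V are
    ever switched, so the image features stay conditioned on the rain prompt."""
    k_txt, v_txt = switched_text_kv(block_idx, kv_text_rain, kv_text_null)
    k = torch.cat([kv_img[0], k_txt], dim=1)      # (B, N_img + N_txt, d)
    v = torch.cat([kv_img[1], v_txt], dim=1)
    attn = torch.softmax(q_img @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)
    return attn @ v
```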
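The cross-previous-frame attention and a warp-based temporal loss can be sketched as follows; the projection matrices, the flow/occlusion conventions, and the reliance on an external RAFT estimator are assumptions for illustration, not the exact formulation of Cao et al.

```python
import torch
import torch.nn.functional as F

def cross_previous_frame_attention(z_cur, z_prev, w_q, w_k, w_v):
    """Self-attention whose keys/values come from the previous frame's features,
    injecting short-term temporal correlation.  z_* are (N_tokens, d_in) feature
    maps; w_q/w_k/w_v stand in for the block's existing projection weights."""
    q, k, v = z_cur @ w_q, z_prev @ w_k, z_prev @ w_v
    attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v

def temporal_consistency_loss(frame_cur, frame_prev, flow, occlusion_mask):
    """Penalize the difference between the current restored frame and the
    previous frame warped by externally computed (e.g. RAFT) optical flow,
    ignoring occluded pixels.  Shapes: frames (1, C, H, W), flow (1, 2, H, W)
    with channel 0 = horizontal displacement, occlusion_mask (1, 1, H, W)."""
    _, _, h, w = frame_cur.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0)     # (1, 2, H, W)
    # grid_sample expects normalized (x, y) coordinates in [-1, 1].
    gx = 2.0 * (base[:, 0] + flow[:, 0]) / (w - 1) - 1.0
    gy = 2.0 * (base[:, 1] + flow[:, 1]) / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                         # (1, H, W, 2)
    warped_prev = F.grid_sample(frame_prev, grid, align_corners=True)
    return ((frame_cur - warped_prev).abs() * occlusion_mask).mean()
```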
4. Inference Pipeline and Algorithmic Workflow
Video Diffusion Model-Based Pipeline (Varanka et al., 23 Nov 2025)
- Invert the input video into the latent space using DDPM inversion, storing the model-predicted noise estimates at all denoising steps.
- For $t = T$ down to $1$:
- For all steps beyond the initial skip, compute the negative-prompt guidance on the predicted noise as above.
- In selected transformer blocks, swap attention keys/values to those from null-prompted text features.
- Perform the denoising update using the guided noise estimate $\tilde{\epsilon}_\theta$ and the stored inversion noise (the full loop is sketched after this list).
- Reconstruct the derained latent and decode via VAE.
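A compact orchestration sketch of this loop is given below, reusing the inversion and guidance helpers from the earlier sketches; `backbone`, `vae`, `embed`, and the values of `t_skip` and `scale` are illustrative assumptions rather than the paper's actual interfaces, and attention switching is assumed to happen inside the backbone's forward pass.

```python
import torch

def derain_video(video_latent, backbone, vae, alpha_bar, t_skip=5, scale=15.0):
    """End-to-end sketch: edit-friendly DDPM inversion of the rainy clip, then
    guided resampling that reuses the stored noise maps.  All names and default
    values are illustrative."""
    # 1) Invert under the null prompt, storing per-step noise maps z_t.
    xs, zs = edit_friendly_ddpm_inversion(
        video_latent, lambda x, t: backbone.predict_noise(x, t, embed("")), alpha_bar)

    T = alpha_bar.numel()
    x = xs[T]
    for t in range(T, 0, -1):
        eps_null = backbone.predict_noise(x, t, embed(""))
        if t > T - t_skip:
            eps = eps_null                       # initial skip: keep coarse structure
        else:
            eps_rain = backbone.predict_noise(x, t, embed("light rain"))
            eps = negative_prompt_guidance(eps_null, eps_rain, scale)

        # 2) DDPM update that re-injects the stored inversion noise z_t (key/value
        #    switching in selected blocks is assumed inside backbone.predict_noise).
        a_bar_t = alpha_bar[t - 1]
        a_bar_prev = alpha_bar[t - 2] if t > 1 else torch.tensor(1.0)
        alpha_t = a_bar_t / a_bar_prev
        beta_t = 1.0 - alpha_t
        sigma_t = ((1 - a_bar_prev) / (1 - a_bar_t) * beta_t).sqrt()
        mu = (x - beta_t / (1 - a_bar_t).sqrt() * eps) / alpha_t.sqrt()
        x = mu + sigma_t * zs[t]

    # 3) Decode the derained latent.
    return vae.decode(x)
```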
Image Diffusion Model Adapted to Video (Cao et al., 2 Jul 2024)
- Initialize the latents of all frames with identical noise.
- For each denoising step $t$ from $T$ down to the early-stopping step:
- Predict the noise, estimate the clean samples, and enforce the framewise content constraint (the generic projection is replaced by a deraining operator).
- Compute and apply temporal consistency loss, sample and share inter-frame noise, blend under occlusion masks.
- Update each frame's latent with a mean shift given by the temporal-guidance gradient and the spatially shared noise.
- Output: the early-stopped denoised (clean) estimates of all frames; a high-level sketch of this loop follows the list.
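The image-to-video loop can be summarized in the following hedged sketch. Here `predict_noise`, `derain_op`, `temporal_loss_fn`, `guidance_weight`, and `stop_t` are all illustrative assumptions, the occlusion-mask blending is omitted, and the re-noising step is a simplified stand-in for the paper's exact update.

```python
import torch

def derain_frames(rainy_frames, predict_noise, alpha_bar, derain_op,
                  temporal_loss_fn, guidance_weight=0.1, stop_t=100):
    """Sketch of the image-diffusion-to-video loop: identical initial noise for
    all frames, a framewise deraining constraint on the clean estimates, a
    temporal-consistency gradient shifting the update mean, and shared
    inter-frame noise.  Every name and default value here is illustrative."""
    n_frames = rainy_frames.shape[0]
    T = alpha_bar.numel()
    shared = torch.randn_like(rainy_frames[0])
    x = shared.unsqueeze(0).repeat(n_frames, 1, 1, 1)            # identical init noise

    for t in range(T, stop_t, -1):
        a_bar = alpha_bar[t - 1]
        eps = predict_noise(x, t)                                # per-frame noise estimates
        x0_hat = (x - (1 - a_bar).sqrt() * eps) / a_bar.sqrt()   # clean estimates
        x0_hat = derain_op(x0_hat, rainy_frames)                 # framewise content constraint

        # Temporal-consistency guidance: gradient of a warp-based loss over
        # consecutive clean estimates shifts the mean of the update.
        x0_hat = x0_hat.detach().requires_grad_(True)
        loss = sum(temporal_loss_fn(x0_hat[i], x0_hat[i - 1]) for i in range(1, n_frames))
        grad = torch.autograd.grad(loss, x0_hat)[0]
        x0_hat = (x0_hat - guidance_weight * grad).detach()

        # Simplified re-noising toward t-1 with noise shared across frames
        # (occlusion-mask blending omitted for brevity).
        a_bar_prev = alpha_bar[t - 2] if t > 1 else torch.tensor(1.0)
        noise = torch.randn_like(shared).unsqueeze(0).repeat(n_frames, 1, 1, 1)
        x = a_bar_prev.sqrt() * x0_hat + (1 - a_bar_prev).sqrt() * noise

    return x0_hat                                                # early-stopped clean estimates
```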
The following table summarizes core methodological distinctions:
| Approach | Main Temporal Module | Rain Removal Mechanism |
|---|---|---|
| (Varanka et al., 23 Nov 2025) | Attention switching in DiT transformer | Negative prompting |
| (Cao et al., 2 Jul 2024) | Cross-previous-frame attention, temporal gradient | Image-to-video deraining loss |
5. Quantitative and Qualitative Evaluation
Zero-shot video deraining models have been empirically validated on multiple real-rain benchmarks, including NTURain (real), GT-Rain, and the internally curated RealRain13. Evaluation metrics span no-reference image quality (CLIP-IQA ↓), perceptual frame quality (MUSIQ ↑), temporal consistency (Warp Error ↓), and standard PSNR/SSIM where synthetic ground truth is available.
A representative performance summary ((Varanka et al., 23 Nov 2025), main Table 2 and Appendix Table 1):
| Dataset | Method | CLIP-IQA ↓ | Warp ↓ | MUSIQ ↑ | PSNR ↑ | SSIM ↑ |
|---|---|---|---|---|---|---|
| NTURain | RainMamba | 0.734 | 0.057 | 56.84 | 37.87 | 0.9738 |
| NTURain | Ours | 0.324 | 0.047 | 55.43 | 27.66 | 0.8492 |
| GT-Rain | Ours | 0.075 | 0.0064 | 50.18 | — | — |
| RealRain13 | Ours | 0.447 | 0.015 | 57.00 | — | — |
The inclusion of the attention switching module yields significantly enhanced structural quality (CLIP-IQA 0.324 vs 0.437 w/o switch) and stable perceptual gains.
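For reference, the Warp Error reported above is essentially the same warp-and-compare computation as the temporal loss sketched in Section 3, averaged over consecutive restored frame pairs. The sketch below reuses that helper; the flow/mask conventions remain assumptions rather than the papers' exact evaluation protocol.

```python
import torch

def warp_error(frames, flows, masks):
    """Mean warping error over consecutive restored frames, reusing the
    temporal_consistency_loss sketch from Section 3.  frames[i] is a
    (1, C, H, W) tensor; flows[i-1]/masks[i-1] relate frame i-1 to frame i."""
    errs = [temporal_consistency_loss(frames[i], frames[i - 1], flows[i - 1], masks[i - 1])
            for i in range(1, len(frames))]
    return torch.stack(errs).mean()
```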
6. Ablation Studies, Limitations, and Adaptation
Systematic ablations confirm:
- Video inversion quality: DDPM inversion outperforms DDIM and SDEdit in editability and fidelity.
- Prompt engineering: "light rain" outperforms implicit null or mean-of-rain-embedding prompts.
- Attention switching specificity: Block selection is critical; early and late blocks work best.
- Guidance scale: Stronger guidance in select blocks (λ = 25 vs. 15) further reduces rain artifacts with minimal aesthetic cost.
Limitations include diminished deraining effectiveness under heavy or dense rain (which can be misinterpreted as scene content) and reliance on the generative capacity of the backbone model. Real-time deployment is constrained by decoding speed (∼3 min per clip on an H100 GPU). Proposed future directions involve gradient-based negative guidance, learned negative-prompt embeddings, and adaptive skip-timestep scheduling (Varanka et al., 23 Nov 2025).
For the image-diffusion–based approach (Cao et al., 2 Jul 2024), explicit adaptation is needed for rain residue handling: the projector enforcing coarse content matching must be replaced by a differentiable deraining operator, and robust optical flow is essential under heavy occlusions.
7. Relation to Prior Work and Generalization
Zero-shot video deraining methods contrast sharply with supervised, paired-data solutions, which struggle with generalization beyond synthetic or static-camera datasets. By leveraging the strong inductive priors of pretrained diffusion models, the zero-shot paradigm provides enhanced robustness across dynamic backgrounds and varied precipitation conditions, without per-video retraining or fine-tuning that could risk overfitting or prior degradation. This generalization, as validated on real-world benchmarks, marks a substantial advance in scalable video restoration and supports future research on unsupervised multi-artifact removal, adaptive prompt engineering, and efficient, robust temporal stabilizers (Varanka et al., 23 Nov 2025, Cao et al., 2 Jul 2024).