
VipDiff: Training-Free Video Inpainting

Updated 5 January 2026
  • VipDiff is a framework that uses training-free denoising diffusion models with optical flow guidance to restore missing regions in videos while maintaining temporal coherence.
  • It integrates a pre-trained RAFT-based network for flow completion with iterative pixel propagation to ensure both spatial fidelity and reduced artifacts.
  • The method optimizes latent noise through a constrained reverse diffusion process, achieving state-of-the-art metrics on benchmarks like YouTube-VOS and DAVIS.

VipDiff refers to a class of frameworks and methods for video inpainting via training-free denoising diffusion models, specifically the approach introduced in "VipDiff: Towards Coherent and Diverse Video Inpainting via Training-free Denoising Diffusion Models" (Xie et al., 21 Jan 2025). This method addresses the deficiencies of flow- and frame-propagation based inpainting methods on videos—most notably their inability to fill large masked regions with coherent and diverse content—and leverages pre-trained diffusion models alongside optical flow guidance to deliver temporally coherent and diverse inpainting without any additional model training.

1. Problem Formulation and Motivation

Video inpainting comprises the restoration of missing or occluded regions across sequences $\mathcal{X} = \{x_0^k \mid k = 1, \ldots, N\}$, given binary masks $\mathcal{M} = \{m^k\}$ indicating holes, as

$x_0^k = x^k \odot (1 - m^k)$

for each frame $k$. The principal objectives are the plausible filling of holes (visual fidelity), temporal consistency of the synthesized regions (coherence), and the capacity to generate multiple, diverse valid solutions.
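The masking operation can be sketched in NumPy (an illustrative toy; the shapes and pixel values are assumptions, not from the paper):

```python
import numpy as np

def apply_masks(frames, masks):
    """Form the observed frames x_0^k = x^k * (1 - m^k).

    frames: (N, H, W, C) ground-truth frames.
    masks:  (N, H, W) binary hole masks (1 inside the hole).
    """
    return frames * (1.0 - masks[..., None])

# Toy example: a single 2x2 gray frame with the top-left pixel masked.
frames = np.full((1, 2, 2, 3), 0.5)
masks = np.zeros((1, 2, 2))
masks[0, 0, 0] = 1.0
observed = apply_masks(frames, masks)
```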

Flow-based propagation methods, which transfer pixels from reference frames into target holes using optical flow, suffer from severe degradation when the mask is large or when flow has no valid correspondences. In these contexts, artifact-laden or blurry inpainted regions emerge. Conversely, diffusion models excel at generating diverse, high-quality image content, but naive frame-by-frame application loses temporal consistency. The core motivation of VipDiff is to achieve video-level coherence and diversity in inpainting by harmonizing diffusion-based synthesis with flow-guided constraints—without any fine-tuning or additional supervision.

2. VipDiff Pipeline and Methodological Components

VipDiff operates on a per-frame basis while propagating completed pixel information forward and backward. Each target frame $k$ undergoes the following stages:

a. Flow Completion

A pre-trained RAFT-based network $F$ predicts dense flows, including inside masked regions, as

$\tilde f_{k \rightarrow j} = F(x_0^k, x_0^j, m^k, m^j)$
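A structural placeholder for this step, assuming nothing about the actual network: holes in the flow field are filled with the nearest valid vector along each row. The paper instead uses a pre-trained RAFT-based completion network; this sketch only illustrates the input/output contract.

```python
import numpy as np

def complete_flow(flow, mask):
    """Fill flow vectors inside the hole with the nearest valid vector to the
    left on the same row. Purely a structural placeholder for the paper's
    learned flow-completion network.

    flow: (H, W, 2) flow field; mask: (H, W) binary, 1 where flow is invalid.
    """
    out = flow.copy()
    H, W = mask.shape
    for y in range(H):
        last_valid = None
        for x in range(W):
            if mask[y, x]:
                if last_valid is not None:
                    out[y, x] = last_valid
            else:
                last_valid = flow[y, x]
    return out

# One row with an invalid vector at column 2.
flow = np.zeros((1, 4, 2))
flow[0, :, 1] = [0.0, 1.0, 9.0, 3.0]  # 9.0 is garbage inside the hole
mask = np.array([[0, 0, 1, 0]])
completed = complete_flow(flow, mask)
```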

b. Optical Flow-Guided Pixel Propagation

For each reference frame $j$, backward warping is performed:

$\tilde x_0^k \leftarrow x_0^k + m^{(j \rightarrow k)} \odot \omega(x_0^j, \tilde f_{k \rightarrow j})$

where $m^{(j \rightarrow k)} = m^k \odot [1 - \omega(m^j, \tilde f_{k \rightarrow j})]$. Propagation continues until the invalid mask $\tilde m^k$ is depleted or references are exhausted. Color and brightness mismatches across frames are mitigated by an error-compensation network.
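The warping and mask-update rule can be sketched as follows. Nearest-neighbour sampling, single-channel frames, and the tiny shapes are simplifying assumptions (bilinear sampling is the usual choice in practice); in the real pipeline the flow is the completed $\tilde f_{k \rightarrow j}$ from the previous stage:

```python
import numpy as np

def backward_warp(img, flow):
    """Sample img at locations displaced by flow ((dy, dx) per pixel).
    Nearest-neighbour sampling keeps the sketch short."""
    H, W = img.shape
    ys, xs = np.mgrid[0:H, 0:W]
    sy = np.clip(np.round(ys + flow[..., 0]).astype(int), 0, H - 1)
    sx = np.clip(np.round(xs + flow[..., 1]).astype(int), 0, W - 1)
    return img[sy, sx]

def propagate(target, tmask, ref, rmask, flow):
    """One flow-guided propagation step for a single-channel frame: fill
    target holes with warped reference pixels wherever the warped reference
    mask marks those pixels as valid."""
    warped = backward_warp(ref, flow)
    warped_mask = backward_warp(rmask, flow)   # 1 where the ref pixel was itself a hole
    fill = tmask * (1.0 - warped_mask)         # m^(j->k) in the paper's notation
    return target + fill * warped, tmask - fill

# Identity flow from a fully valid reference fills the single hole pixel.
target = np.zeros((2, 2))
tmask = np.zeros((2, 2))
tmask[0, 0] = 1.0
ref = np.full((2, 2), 0.3)
rmask = np.zeros((2, 2))
flow = np.zeros((2, 2, 2))
filled, remaining = propagate(target, tmask, ref, rmask, flow)
```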

c. Training-Free Diffusion with Constrained Reverse Process

With partial holes remaining, constrained denoising diffusion is used. A latent noise $z_T$ is optimized (rather than model parameters) to ensure the decoded frame $\hat{y}^k$ fits the unmasked and propagated pixels:

$L_{\text{cond}}(z_T) = \|[\hat{y}^k \odot (1-\tilde m^k)] - [\tilde x_0^k \odot (1-\tilde m^k)]\|^2 + \gamma \|z_T - z_0\|^2$

with $z_0 \sim \mathcal{N}(0, I)$ and $\gamma \approx 10^{-3}$. Gradient descent is performed over $z_T$, not the model, for around 50 diffusion iterations per frame.
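The constrained optimization can be sketched in a toy form. Here the reverse-diffusion sampler is replaced by the identity map so the gradient is closed-form; the actual method backpropagates the loss through the full sampler. The 0.9 decay and $\gamma = 10^{-3}$ follow the paper's stated settings, while the initial learning rate of the toy is an arbitrary choice:

```python
import numpy as np

def optimize_latent(x_prop, valid, z0, steps=50, lr=0.1, gamma=1e-3):
    """Gradient descent on a toy version of L_cond(z_T).

    The diffusion sampler is replaced by the identity map (decoded frame
    equals z), so grad L = 2*valid*(z - x_prop) + 2*gamma*(z - z0).
    x_prop: propagated frame; valid: 1 - remaining mask; z0: initial noise.
    """
    z = z0.copy()
    for _ in range(steps):
        grad = 2.0 * valid * (z - x_prop) + 2.0 * gamma * (z - z0)
        z = z - lr * grad
        lr *= 0.9  # decay schedule, as in the paper's optimization setup
    return z

# Fully valid toy frame: the optimized latent converges toward the target.
x_prop = np.ones((4, 4))
valid = np.ones((4, 4))
z = optimize_latent(x_prop, valid, np.zeros((4, 4)))
```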

d. Iterative Propagation

Once a frame is fully inpainted, its completed pixels are used as new references for subsequent (neighboring) frames, reducing the need to apply diffusion to every frame.

3. Diffusion Model Foundations

VipDiff leverages pre-trained latent diffusion models (LDMs) as fixed backbones. Framewise noising/diffusion processes are standard:

  • Forward process: $q(x_t^k \mid x_{t-1}^k) = \mathcal{N}(x_t^k; \sqrt{1-\beta_t}\, x_{t-1}^k, \beta_t I)$
  • Reverse process: $p_\theta(x_{t-1}^k \mid x_t^k) = \mathcal{N}(x_{t-1}^k; \mu_\theta(x_t^k, t), \sigma_t^2 I)$, with $\epsilon_\theta(x_t, t)$ predicting the noise.

Conditional inpainting is implemented by searching for noise seeds such that the sampled output matches known/unmasked regions. No model training or fine-tuning is required.
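The standard DDPM processes above can be sketched in NumPy. The linear $\beta$-schedule endpoints below are the common DDPM defaults and an assumption here, not values stated by the paper:

```python
import numpy as np

def make_schedule(T=1000, beta_min=1e-4, beta_max=0.02):
    """Linear beta schedule and cumulative product alpha_bar_t = prod(1 - beta_s)."""
    betas = np.linspace(beta_min, beta_max, T)
    return betas, np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, eps):
    """Closed form of the forward process:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

betas, alpha_bar = make_schedule()
x0 = np.ones((2, 2))
x_early = q_sample(x0, 0, alpha_bar, np.zeros((2, 2)))  # nearly unchanged at t=0
```

By the final step the signal coefficient is nearly zero, so $x_T$ is approximately pure noise, which is what makes the choice of seed $z_T$ the natural optimization variable.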

4. Temporal Consistency and Diversity

Temporal coherence is maintained by propagating inpainted pixels as anchor constraints: after a frame’s successful inpainting, resulting pixel values are “hard-wired” into the warping process for neighboring frames, enforcing consistency through flow. Diversity is naturally achieved by drawing multiple $z_0$ samples and running independent noise optimization, resulting in diverse yet temporally plausible hole-fills. Metrics such as LPIPS or VFID quantify this diversity/fidelity tradeoff.

5. Implementation and Computational Aspects

Diffusion Backbone and Flow Models

  • LDM (Rombach et al., CVPR 2022) with $T = 1000$ steps and a standard $\beta$-schedule.
  • Flow completion: RAFT (Teed & Deng, ECCV 2020).
  • Color compensation: model from Kang et al. (ECCV 2022).

Optimization

  • 50 forward-backward iterations per frame (learning rate $\eta_0 = 0.01$, decayed by a factor of $0.9$ at each step); regularization weight $\gamma = 10^{-3}$.
  • No model weights updated; only the input noise $z_T$ is optimized.
  • Computational cost: ≈2.7 s per frame on the DAVIS dataset (RTX 3090).

6. Experimental Evaluation

VipDiff achieves strong results on the YouTube-VOS and DAVIS benchmarks (432×240 resolution):

| Dataset | PSNR | SSIM | VFID | $E_{\text{warp}}$ |
| --- | --- | --- | --- | --- |
| YouTube-VOS | 34.21 | 0.9773 | 0.041 | 0.0828 |
| DAVIS | 34.23 | 0.9745 | 0.102 | 0.1280 |
  • SSIM and VFID scores distinctly surpass those of prior state-of-the-art methods. Comparisons to ProPainter, ECFVI, and FGT demonstrate improved sharpness and semantic plausibility, especially under large or object-shaped masks.
  • Ablation studies confirm that pixel propagation alone or naive per-frame LDM inpainting yields misaligned or temporally unstable videos. The full VipDiff pipeline, including flow guidance and noise optimization, is required for optimal PSNR/SSIM/VFID/$E_{\text{warp}}$.

7. Limitations and Outlook

The principal limitations of VipDiff are computational: the per-frame reverse-diffusion optimization induces higher latency than feedforward inpainting networks. Flow-based propagation can also propagate artifacts in cases of fast motion or large occlusions if the flow estimation is incorrect—a limitation inherent to current RAFT architectures. Anticipated future improvements include fast diffusion samplers and integrated flow+diffusion optimization to mitigate these issues.

8. Summary and Broader Significance

VipDiff demonstrates that a training-free, diffusion-based pipeline with optical flow guidance can achieve both state-of-the-art spatial fidelity and temporal coherence for video inpainting, matching or exceeding purpose-trained models—even under large or structurally complex masks. The capacity for diverse sampling via input noise optimization and the complete avoidance of any additional model training on video datasets position VipDiff as a significant advance for practical, adaptable video inpainting in domains where re-training is infeasible or unaffordable (Xie et al., 21 Jan 2025).
