
ViBiDSampler: Bidirectional Diffusion Interpolation

Updated 2 May 2026
  • ViBiDSampler is a bidirectional diffusion sampling strategy that generates temporally coherent video interpolations by leveraging sequential, bidirectional denoising paths and advanced guidance mechanisms.
  • It employs Classifier-Free Guidance++ and Decomposed Diffusion Scaling to mitigate off-manifold artifacts and improve frame fidelity, as evidenced by lower LPIPS, FID, and FVD scores on challenging benchmarks.
  • Implemented on the Stable Video Diffusion backbone, ViBiDSampler requires no fine-tuning and efficiently interpolates high-resolution videos, offering state-of-the-art performance in bounded video interpolation.

ViBiDSampler is a bidirectional diffusion sampling strategy designed to overcome the limitations of previous image-to-video diffusion models for bounded video interpolation, specifically in the context of generating intermediate frames given two keyframes. By leveraging sequential bidirectional paths, as well as advanced guidance mechanisms including Classifier-Free Guidance++ (CFG++) and Decomposed Diffusion Scaling (DDS), ViBiDSampler achieves state-of-the-art results in synthesizing high-quality, temporally coherent videos without requiring model fine-tuning or extensive iterative re-noising (Yang et al., 2024).

1. Video Interpolation in Diffusion Models

The video keyframe interpolation problem is defined as generating a temporally smooth and coherent sequence of intermediate frames $\{I_1,\dots,I_{N-1}\}$ given two boundary keyframes $I_0$ (start) and $I_N$ (end). Modern diffusion-based image-to-video models formulate the process via a stochastic forward noising transition:

$$q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{\alpha_t}\,x_{t-1}, (1-\alpha_t)\mathbf{I}),$$

corrupting a clean latent $x_0$ to nearly Gaussian noise $x_T$. The reverse process,

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1};\,\mu_\theta(x_t, t),\,\Sigma_\theta(t)),$$

restores $x_0$ from $x_T$ (e.g., using DDPM or Euler methods).
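As a minimal sketch of the forward noising transition above, the snippet below draws one sample per step from the Gaussian with mean $\sqrt{\alpha_t}\,x_{t-1}$ and variance $(1-\alpha_t)$. Plain Python lists stand in for latent tensors, and the function name and the three-value schedule are hypothetical choices made only for illustration.

```python
import math
import random


def forward_noise_step(x_prev, alpha_t, rng):
    """Sample x_t ~ N(sqrt(alpha_t) * x_prev, (1 - alpha_t) * I), elementwise."""
    scale = math.sqrt(alpha_t)
    std = math.sqrt(1.0 - alpha_t)
    return [scale * x + std * rng.gauss(0.0, 1.0) for x in x_prev]


rng = random.Random(0)
x = [1.0, -0.5, 0.25]                 # toy clean latent x_0
for alpha_t in [0.99, 0.95, 0.90]:    # hypothetical schedule, not from the source
    x = forward_noise_step(x, alpha_t, rng)
```

With `alpha_t` close to 1 each step adds little noise; repeated application drives the latent toward a standard Gaussian.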

Conditioning the denoising process on a single frame is well-handled using Classifier-Free Guidance (CFG), which modifies predicted noise as

$$\hat\epsilon(x_t) = \hat\epsilon_\varnothing(x_t) + \omega\,[\hat\epsilon_c(x_t) - \hat\epsilon_\varnothing(x_t)],$$

where $\omega$ denotes the guidance scale, and $\hat\epsilon_c(x_t)$, $\hat\epsilon_\varnothing(x_t)$ are the U-Net’s predictions with and without conditional input.
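The CFG combination above is a per-element linear extrapolation from the unconditional prediction toward the conditional one. A small sketch, with a hypothetical function name and lists of floats standing in for the U-Net outputs:

```python
def cfg_noise(eps_uncond, eps_cond, omega):
    """Return eps_uncond + omega * (eps_cond - eps_uncond), elementwise."""
    return [u + omega * (c - u) for u, c in zip(eps_uncond, eps_cond)]


eps_uncond = [1.0, 0.0]   # toy prediction without the conditioning input
eps_cond = [3.0, 2.0]     # toy prediction with the conditioning input
guided = cfg_noise(eps_uncond, eps_cond, omega=2.0)
```

Setting `omega=0.0` recovers the unconditional prediction and `omega=1.0` the conditional one; larger values extrapolate past the conditional prediction.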

For two-frame conditioned generation, naive approaches fuse independent forward (from $I_0$) and backward (from $I_N$) passes. Common fusion strategies, such as linear latent interpolation, often yield samples that lie off the learned diffusion data manifold, which produces artifacts and makes additional re-noising or fine-tuning necessary.
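One simplified way to see why linear fusion can leave the manifold, assuming for illustration only that the forward and backward latents at step $t$ carry independent Gaussian noise $\epsilon_1, \epsilon_2 \sim \mathcal{N}(0, \mathbf{I})$ at a common noise level $\sigma_t$ (notation introduced here, not taken from the source):

$$\mathrm{Cov}\!\left[\tfrac{1}{2}\left(\sigma_t\epsilon_1 + \sigma_t\epsilon_2\right)\right] = \tfrac{1}{4}\left(\sigma_t^2 + \sigma_t^2\right)\mathbf{I} = \tfrac{1}{2}\sigma_t^2\,\mathbf{I}.$$

Under that assumption the averaged latent has noise standard deviation $\sigma_t/\sqrt{2}$ rather than the $\sigma_t$ a denoiser expects at step $t$, which is consistent with the extra re-noising that fusion-based methods need.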

2. Sequential Bidirectional Sampling

ViBiDSampler introduces a strictly sequential, bidirectional denoising strategy to maintain on-manifold sampling throughout the interpolation process. At each time step $t$, the algorithm alternates between forward and backward conditioned paths:

  1. Forward Denoising: denoise the current latent conditioned on the start keyframe.
  2. Re-noising: lift the forward latent back to the appropriate noise level.
  3. Time-Flip: reverse the temporal axis for alignment with the backward process.
  4. Backward Denoising: denoise the flipped latent conditioned on the end keyframe.
  5. Reverse Time-Flip: flip the temporal axis again to return to the forward orientation.

This strategy ensures that each intermediate latent remains on the diffusion manifold, addressing the manifold mismatch issue observed in fusion-based approaches.
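The control flow of the five steps can be sketched as below. This is a toy illustration under stated assumptions, not the authors' implementation: each frame is a single float, `toy_denoise` is a hypothetical stand-in for the conditioned video denoiser, the re-noising rule assumes a variance-exploding (EDM-style) noise level `sigma`, and the schedule values are made up.

```python
import math
import random


def toy_denoise(frames, keyframe, sigma_t, sigma_prev):
    """Hypothetical one-step Euler update that treats `keyframe` as the
    denoised estimate of every frame (a placeholder, not a video model)."""
    ratio = sigma_prev / sigma_t
    return [keyframe + ratio * (x - keyframe) for x in frames]


def renoise(frames, sigma_t, sigma_prev, rng):
    """Lift latents from noise level sigma_prev back to sigma_t (assumed form)."""
    std = math.sqrt(sigma_t ** 2 - sigma_prev ** 2)
    return [x + std * rng.gauss(0.0, 1.0) for x in frames]


def time_flip(frames):
    """Reverse the temporal (frame) axis."""
    return frames[::-1]


def bidirectional_step(frames, sigma_t, sigma_prev, start_key, end_key, rng):
    x = toy_denoise(frames, start_key, sigma_t, sigma_prev)  # 1. forward denoising
    x = renoise(x, sigma_t, sigma_prev, rng)                 # 2. re-noising
    x = time_flip(x)                                         # 3. time-flip
    x = toy_denoise(x, end_key, sigma_t, sigma_prev)         # 4. backward denoising
    return time_flip(x)                                      # 5. reverse time-flip


rng = random.Random(0)
sigmas = [10.0, 5.0, 2.0, 1.0, 0.0]   # hypothetical decreasing noise levels
frames = [sigmas[0] * rng.gauss(0.0, 1.0) for _ in range(5)]
for s_t, s_prev in zip(sigmas, sigmas[1:]):
    frames = bidirectional_step(frames, s_t, s_prev, 0.0, 1.0, rng)
```

Because the placeholder denoiser ignores frame position, the numeric output carries no meaning; only the ordering of the operations and the single shared latent passed between the two directions mirror the description above.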

3. Guidance Enhancements: CFG++ and DDS

ViBiDSampler enhances denoising via two mechanisms:

  • Classifier-Free Guidance++ (CFG++): CFG++ modifies the Euler denoising step so that the unconditional score $\hat\epsilon_\varnothing(x_t)$ is used in the correction term, which mitigates off-manifold drift. Empirical ablations indicate that increasing the guidance scale $\omega$ up to 1.0 improves frame fidelity, as seen in reduced LPIPS, FID, and FVD.
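A minimal sketch of how such a step might look for an EDM-style Euler sampler, assuming the commonly described CFG++ form in which the guided prediction supplies the denoised estimate while the unconditional prediction supplies the noise used to step to the next level. The function name, the list-of-floats latents, and the exact parameterization are illustrative assumptions rather than the authors' code.

```python
def euler_step_cfgpp(x_t, denoised_uncond, denoised_cond, sigma_t, sigma_prev, omega):
    """One Euler step from noise level sigma_t to sigma_prev (assumed CFG++ form)."""
    out = []
    for x, d_u, d_c in zip(x_t, denoised_uncond, denoised_cond):
        d_guided = d_u + omega * (d_c - d_u)   # guided denoised estimate
        eps_uncond = (x - d_u) / sigma_t       # unconditional noise estimate
        out.append(d_guided + sigma_prev * eps_uncond)
    return out


x_next = euler_step_cfgpp(
    x_t=[2.0], denoised_uncond=[0.0], denoised_cond=[1.0],
    sigma_t=2.0, sigma_prev=1.0, omega=1.0,
)
```

In this sketch the only difference from a standard guided Euler step is which prediction feeds the noise term; when the conditional and unconditional predictions coincide, the two updates are identical.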

  • Decomposed Diffusion Scaling (DDS): DDS aligns the last-frame latent of the forward path to the target endpoint by solving an optimization problem that matches the extracted last-frame latent to the encoded keyframe. DDS can be symmetrically applied to both paths, further reducing divergence at the sequence endpoints.
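To make the idea of endpoint alignment concrete, the sketch below takes a few gradient steps on a squared error between the last-frame latent and the encoded keyframe. This is a swapped-in simplification: the squared-error objective, the plain gradient-descent solver, the function name, and the step size are all assumptions for illustration and should not be read as the DDS procedure itself.

```python
def align_last_frame(denoised, target_last, step_size=0.5, num_steps=3):
    """Gradient steps on 0.5 * (denoised[-1] - target_last) ** 2 (assumed objective).

    Only the last frame is updated, because only it enters the objective.
    """
    x = list(denoised)
    for _ in range(num_steps):
        residual = x[-1] - target_last   # gradient of the objective w.r.t. x[-1]
        x[-1] -= step_size * residual
    return x


aligned = align_last_frame([0.2, 0.4, 4.0], target_last=0.0)
```

With a step size below 1 the last frame moves only part of the way toward the target on each step, which keeps the correction gentle; a step size of 1 would overwrite the last frame with the target in a single step.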

4. Implementation and Algorithmic Pipeline

ViBiDSampler is implemented on top of the Stable Video Diffusion (SVD) backbone in the EDM framework, utilizing a U-Net with temporal attention. Sampling uses 25 Euler steps per direction (50 NFE in total), a guidance scale $\omega$, and a frame-rate micro-conditioning of 4. All modifications are made at inference time; no fine-tuning of the SVD model is required.

The pipeline repeats the five-step cycle of Section 2 at every time step. On a single NVIDIA RTX 3090 GPU, ViBiDSampler interpolates 25 frames at 1024×576 resolution in 195 seconds.
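The reported settings imply some simple per-unit costs, worked out below. The averages assume the 195 seconds cover the whole run (including any encoding and decoding overhead), so they are rough upper bounds on the per-evaluation cost rather than measured figures.

```python
steps_per_direction = 25      # Euler steps per direction, from the text
directions = 2                # forward and backward
total_nfe = steps_per_direction * directions

num_frames = 25               # frames interpolated, from the text
wall_clock_s = 195.0          # reported time on one RTX 3090

seconds_per_frame = wall_clock_s / num_frames
seconds_per_nfe = wall_clock_s / total_nfe

print(f"total NFE: {total_nfe}")
print(f"average seconds per frame: {seconds_per_frame:.1f}")
print(f"average seconds per NFE:   {seconds_per_nfe:.1f}")
```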

5. Empirical Results

Quantitative and qualitative evaluations demonstrate ViBiDSampler’s efficacy. On the DAVIS and Pexels benchmarks, the full method attains the lowest FVD on both datasets and the lowest LPIPS and FID on DAVIS; on Pexels, FILM retains lower LPIPS and FID. Integrating CFG++ and DDS yields substantial improvements over vanilla ViBiDSampler on every metric (lower is better for all three):

Method | DAVIS LPIPS | DAVIS FID | DAVIS FVD | Pexels LPIPS | Pexels FID | Pexels FVD
FILM | 0.2697 | 40.24 | 833.8 | 0.0821 | 25.62 | 559.2
TRF | 0.3102 | 60.28 | 622.2 | 0.2222 | 80.62 | 881.0
DynamiCrafter | 0.3274 | 46.85 | 538.4 | 0.1922 | 49.48 | 604.2
Generative Inbetweening | 0.2823 | 36.27 | 490.3 | 0.1523 | 40.47 | 746.3
Ours (Vanilla) | 0.3031 | 52.45 | 543.3 | 0.2074 | 63.24 | 717.4
Ours (CFG++ only) | 0.2571 | 41.96 | 434.4 | 0.1524 | 41.35 | 478.4
Ours (Full: +DDS) | 0.2355 | 35.66 | 399.2 | 0.1366 | 37.34 | 452.3
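To put the ablation rows in relative terms, the short script below computes the percentage reduction of each DAVIS metric from the vanilla sampler to the full method. The numbers are copied from the table; the helper name is a hypothetical choice.

```python
# (LPIPS, FID, FVD) on DAVIS, copied from the table; lower is better.
davis = {
    "Ours (Vanilla)": (0.3031, 52.45, 543.3),
    "Ours (Full: +DDS)": (0.2355, 35.66, 399.2),
}


def relative_reduction(before, after):
    """Percentage by which `after` is lower than `before`."""
    return 100.0 * (before - after) / before


metrics = ("LPIPS", "FID", "FVD")
vanilla = davis["Ours (Vanilla)"]
full = davis["Ours (Full: +DDS)"]
for name, before, after in zip(metrics, vanilla, full):
    print(f"DAVIS {name}: {relative_reduction(before, after):.1f}% lower than vanilla")
```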

Inference throughput and resource usage are competitive, with ViBiDSampler matching or surpassing contemporaries such as TRF and Generative Inbetweening in both efficiency and output quality at 1024×576 resolution.

6. Limitations and Future Prospects

ViBiDSampler's effectiveness depends on the choice of CFG++ scale and frame-rate micro-conditioning; suboptimal parameters can degrade performance. Occasional failures remain in cases of extreme occlusion or very large inter-frame motion. The current framework is limited to interpolation between two boundary keyframes; extension to longer or unbounded sequences is proposed for future research. There is potential for improved robustness via adaptive step schedules, learned re-noising strengths, or applications to text-to-video diffusion tasks.

7. Significance and Comparative Analysis

ViBiDSampler establishes a new paradigm for diffusion-based video interpolation under two-frame constraints. By eschewing parallel fusion (which causes off-manifold artifacts) in favor of a strictly sequential, bidirectional approach, the method demonstrates empirically superior fidelity, temporal coherence, and practical inference efficiency. Its training-free, plug-and-play operation with strong U-Net backbones (e.g., Stable Video Diffusion) is readily extensible to a broad range of video synthesis contexts (Yang et al., 2024).
