ViBiDSampler: Bidirectional Diffusion Interpolation
- ViBiDSampler is a bidirectional diffusion sampling strategy that generates temporally coherent video interpolations by leveraging sequential, bidirectional denoising paths and advanced guidance mechanisms.
- It employs Classifier-Free Guidance++ and Decomposed Diffusion Sampling to mitigate off-manifold artifacts and improve frame fidelity, as evidenced by lower LPIPS, FID, and FVD scores on challenging benchmarks.
- Implemented on the Stable Video Diffusion backbone, ViBiDSampler requires no fine-tuning and efficiently interpolates high-resolution videos, offering state-of-the-art performance in bounded video interpolation.
ViBiDSampler is a bidirectional diffusion sampling strategy designed to overcome the limitations of previous image-to-video diffusion models for bounded video interpolation, that is, generating intermediate frames between two given keyframes. By leveraging sequential bidirectional paths, together with the guidance mechanisms Classifier-Free Guidance++ (CFG++) and Decomposed Diffusion Sampling (DDS), ViBiDSampler achieves state-of-the-art results in synthesizing high-quality, temporally coherent videos without requiring model fine-tuning or extensive iterative re-noising (Yang et al., 2024).
1. Video Interpolation in Diffusion Models
The video keyframe interpolation problem is defined as generating a temporally smooth and coherent sequence of intermediate frames given two boundary keyframes, $c_{\text{start}}$ (start) and $c_{\text{end}}$ (end). Modern diffusion-based image-to-video models formulate the process via a stochastic forward noising transition, $q(x_t \mid x_0)$, corrupting a clean latent $x_0$ to nearly Gaussian noise $x_T$. The reverse process, $p_\theta(x_{t-1} \mid x_t)$, restores $x_0$ from $x_T$ (e.g., using DDPM or Euler methods).
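Conditioning the denoising process on a single frame is well handled by Classifier-Free Guidance (CFG), which modifies the predicted noise as
$$\hat{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing) + \omega\,\big[\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\big],$$
where $\omega$ denotes the guidance scale, and $\epsilon_\theta(x_t, c)$, $\epsilon_\theta(x_t, \varnothing)$ are the U-Net’s predictions with and without conditional input.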
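As a minimal sketch of the guidance rule above, the NumPy snippet below combines a conditional and an unconditional prediction. The function name `cfg_combine` and the toy arrays are illustrative assumptions; they stand in for real U-Net outputs and are not taken from the original implementation.

```python
import numpy as np

def cfg_combine(eps_cond: np.ndarray, eps_uncond: np.ndarray, omega: float) -> np.ndarray:
    """Classifier-free guidance: start from the unconditional prediction and
    move toward (or past) the conditional one by a factor omega."""
    return eps_uncond + omega * (eps_cond - eps_uncond)

# Toy predictions standing in for U-Net outputs (hypothetical values).
eps_c = np.array([1.0, 2.0])
eps_u = np.array([0.0, 1.0])
guided = cfg_combine(eps_c, eps_u, omega=2.0)
```

With `omega = 1` the rule returns the conditional prediction, with `omega = 0` the unconditional one, and values above 1 extrapolate beyond the conditional prediction.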
For two-frame conditioned generation, naive approaches fuse independent forward (from the start keyframe $c_{\text{start}}$) and backward (from the end keyframe $c_{\text{end}}$) passes. Common fusion strategies, such as linear latent interpolation, often yield samples that lie off the learned diffusion data manifold, which produces artifacts and calls for additional re-noising or fine-tuning.
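One way to see why linear fusion can leave the noise level the model expects: if the two latents carry independent Gaussian noise with standard deviation σ, their midpoint carries noise with standard deviation σ/√2. The sketch below checks this numerically. It models only the noise component under that independence assumption; it is an illustration, not the fusion rule of any specific method, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0
n = 200_000

# Independent noise components of the forward and backward latents.
noise_fwd = sigma * rng.standard_normal(n)
noise_bwd = sigma * rng.standard_normal(n)

# Linear latent interpolation at the midpoint.
fused = 0.5 * noise_fwd + 0.5 * noise_bwd

fused_std = float(fused.std())
expected_std = sigma / np.sqrt(2.0)  # variance halves when averaging two independent samples
```

The fused latent is therefore less noisy than a latent at level σ should be, which is consistent with the need for re-noising noted above.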
2. Sequential Bidirectional Sampling
ViBiDSampler introduces a strictly sequential, bidirectional denoising strategy to maintain on-manifold sampling throughout the interpolation process. At each time step $t$, the algorithm alternates between forward and backward conditioned paths:
- Forward Denoising: $x_t \mapsto x_{t-1}^{\text{fwd}}$ takes one denoising step conditioned on the start keyframe.
- Re-noising: $x_{t-1}^{\text{fwd}} \mapsto x_t^{\text{fwd}}$ adds fresh noise to lift the forward latent back to the noise level of step $t$.
- Time-Flip: $x_t^{\text{fwd}} \mapsto \mathrm{flip}(x_t^{\text{fwd}})$ reverses the temporal axis for alignment with the backward process.
- Backward Denoising: $\mathrm{flip}(x_t^{\text{fwd}}) \mapsto x_{t-1}^{\text{bwd}}$ takes one denoising step conditioned on the end keyframe.
- Reverse Time-Flip: $x_{t-1} = \mathrm{flip}(x_{t-1}^{\text{bwd}})$ returns to the forward orientation.
This strategy ensures that each intermediate latent remains on the diffusion manifold, addressing the manifold mismatch issue observed in fusion-based approaches.
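The five steps can be sketched as a single update function. The snippet below is a toy illustration, not the authors' code: it assumes a variance-exploding (EDM-style) parameterization in which a latent at level σ is the clean latent plus σ times Gaussian noise, it treats axis 0 as the frame axis, and `toy_denoiser` is a hypothetical stand-in for the video U-Net.

```python
import numpy as np

def euler_step(x, x0_hat, sigma_t, sigma_next):
    """Deterministic Euler step from noise level sigma_t down to sigma_next."""
    return x0_hat + (sigma_next / sigma_t) * (x - x0_hat)

def renoise(x, sigma_t, sigma_next, rng):
    """Add Gaussian noise so the level rises from sigma_next back to sigma_t."""
    extra = np.sqrt(sigma_t**2 - sigma_next**2)
    return x + extra * rng.standard_normal(x.shape)

def bidirectional_step(x, sigma_t, sigma_next, denoiser, c_start, c_end, rng):
    """One sequential forward/backward update; axis 0 is the frame axis."""
    # 1. Forward denoising, conditioned on the start keyframe.
    x_fwd = euler_step(x, denoiser(x, sigma_t, c_start), sigma_t, sigma_next)
    # 2. Re-noising back to level sigma_t.
    x_fwd = renoise(x_fwd, sigma_t, sigma_next, rng)
    # 3. Time-flip: reverse the frame order so the end keyframe comes first.
    x_bwd = np.flip(x_fwd, axis=0)
    # 4. Backward denoising, conditioned on the end keyframe.
    x_bwd = euler_step(x_bwd, denoiser(x_bwd, sigma_t, c_end), sigma_t, sigma_next)
    # 5. Reverse time-flip: restore the forward frame order.
    return np.flip(x_bwd, axis=0)

def toy_denoiser(x, sigma, cond):
    """Hypothetical stand-in for the video U-Net: predicts a static clip
    in which every frame equals the conditioning frame."""
    return np.broadcast_to(cond, x.shape).copy()

rng = np.random.default_rng(0)
frames, dim = 5, 4
c_start, c_end = np.zeros(dim), np.ones(dim)
x = 10.0 * rng.standard_normal((frames, dim))
x_next = bidirectional_step(x, 10.0, 5.0, toy_denoiser, c_start, c_end, rng)
```

Because an image-to-video model conditions on the first frame, flipping the frame axis lets the same network treat the end keyframe as its first frame. Only one latent is carried between steps, so no averaging of two separately denoised latents takes place.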
3. Guidance Enhancements: CFG++ and DDS
ViBiDSampler enhances denoising via two mechanisms:
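- Classifier-Free Guidance++ (CFG++): CFG++ modifies the Euler denoising step to compute
$$x_{t-1} = \hat{x}_0^{\omega}(x_t) + \sigma_{t-1}\,\epsilon_\theta(x_t, \varnothing),$$
where $\hat{x}_0^{\omega}(x_t)$ is the clean estimate obtained with guidance scale $\omega$ and $\sigma_{t-1}$ is the next noise level, using the unconditional score $\epsilon_\theta(x_t, \varnothing)$ in the correction term to mitigate off-manifold drift. Empirical ablations indicate that increasing the guidance scale $\omega$ up to 1.0 improves frame fidelity, as seen in reduced LPIPS, FID, and FVD.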
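- Decomposed Diffusion Sampling (DDS): DDS aligns the last-frame latent of the forward path to the target endpoint by solving
$$\bar{x}_0 = \arg\min_{x \in \hat{x}_0 + \mathcal{K}_l} \lVert A x - c_{\text{end}} \rVert_2^2,$$
where $A$ extracts the last-frame latent, $c_{\text{end}}$ is the encoded end keyframe, $\hat{x}_0$ is the current denoised estimate, and $\mathcal{K}_l$ is the $l$-th order Krylov subspace that restricts the search. DDS can be symmetrically applied to both paths, further reducing divergence at the sequence endpoints.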
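Both mechanisms can be sketched in a few lines of NumPy. The snippet assumes a variance-exploding parameterization (latent equals clean latent plus σ times noise) and works on denoised estimates rather than noise predictions. `cfgpp_euler_step` and `match_last_frame` are hypothetical names, and the second function is a deliberately simplified stand-in for the DDS solve (a single gradient step for a frame-selection operator), not the procedure used in the original work.

```python
import numpy as np

def cfgpp_euler_step(x, x0_cond, x0_uncond, sigma_t, sigma_next, omega):
    """CFG++-style Euler step (assumed form): guide the clean estimate,
    but re-noise with the unconditional noise estimate."""
    x0_guided = x0_uncond + omega * (x0_cond - x0_uncond)
    eps_uncond = (x - x0_uncond) / sigma_t
    return x0_guided + sigma_next * eps_uncond

def match_last_frame(x0_hat, target, step=1.0):
    """One gradient step on 0.5 * ||A x - target||^2, where A selects the
    last frame along axis 0. Simplified stand-in for the DDS solve."""
    out = x0_hat.copy()
    residual = target - out[-1]          # target - A x
    out[-1] = out[-1] + step * residual  # x + step * A^T (target - A x)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 4))
x0_c = rng.standard_normal((5, 4))
x0_u = rng.standard_normal((5, 4))
x_next = cfgpp_euler_step(x, x0_c, x0_u, sigma_t=2.0, sigma_next=1.0, omega=1.0)
aligned = match_last_frame(x0_c, target=np.ones(4))
```

Setting `omega = 0` recovers a plain unconditional Euler step. For a pure frame-selection operator, a unit step matches the last frame exactly and leaves the other frames untouched.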
4. Implementation and Algorithmic Pipeline
ViBiDSampler is implemented on top of the Stable Video Diffusion (SVD) backbone in the EDM framework, utilizing a U-Net with temporal attention. Sampling uses 25 Euler steps per direction (50 NFE in total), a fixed CFG++ guidance scale $\omega$, and a frame-rate micro-conditioning of 4. All modifications are made at inference time; the SVD model is not fine-tuned.
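The step budget can be made concrete with a short sketch. The schedule below is the noise-level formula from the EDM paper (Karras et al.); whether the original implementation uses exactly this schedule, and the `sigma_min` and `sigma_max` values shown, are assumptions made for illustration.

```python
import numpy as np

def karras_sigmas(n_steps, sigma_min, sigma_max, rho=7.0):
    """Noise levels from the EDM (Karras et al.) schedule, largest first."""
    ramp = np.linspace(0.0, 1.0, n_steps)
    min_inv_rho = sigma_min ** (1.0 / rho)
    max_inv_rho = sigma_max ** (1.0 / rho)
    return (max_inv_rho + ramp * (min_inv_rho - max_inv_rho)) ** rho

steps_per_direction = 25   # from the text
directions = 2             # forward and backward
nfe = steps_per_direction * directions

# sigma_min / sigma_max are placeholder values, not taken from the source.
sigmas = karras_sigmas(steps_per_direction, sigma_min=0.002, sigma_max=700.0)
```

Each noise level is visited once per direction, which is where the factor of two in the function-evaluation count comes from.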
The pipeline repeats the five-step sequence of Section 2 at every noise level, with CFG++ and DDS applied inside the denoising steps. On a single NVIDIA RTX 3090 GPU, ViBiDSampler interpolates 25 frames at 1024×576 resolution in 195 seconds.
5. Empirical Results
Quantitative and qualitative evaluations demonstrate ViBiDSampler’s efficacy. On DAVIS, the full method attains the best LPIPS, FID, and FVD among the compared methods. On Pexels, it attains the best FVD, while FILM scores lower on LPIPS and FID. Lower is better for all three metrics. Integration of CFG++ and DDS yields substantial improvements over vanilla ViBiDSampler:
| Method | DAVIS (LPIPS/FID/FVD) | Pexels (LPIPS/FID/FVD) |
|---|---|---|
| FILM | 0.2697 / 40.24 / 833.8 | 0.0821 / 25.62 / 559.2 |
| TRF | 0.3102 / 60.28 / 622.2 | 0.2222 / 80.62 / 881.0 |
| DynamiCrafter | 0.3274 / 46.85 / 538.4 | 0.1922 / 49.48 / 604.2 |
| Generative Inbetweening | 0.2823 / 36.27 / 490.3 | 0.1523 / 40.47 / 746.3 |
| Ours (Vanilla) | 0.3031 / 52.45 / 543.3 | 0.2074 / 63.24 / 717.4 |
| Ours (CFG++ only) | 0.2571 / 41.96 / 434.4 | 0.1524 / 41.35 / 478.4 |
| Ours (Full: +DDS) | 0.2355 / 35.66 / 399.2 | 0.1366 / 37.34 / 452.3 |
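To quantify the effect of the guidance enhancements, the relative reduction of each metric from the vanilla to the full configuration can be computed directly from the table. The snippet is a small arithmetic sketch; the helper name `relative_reduction` is hypothetical.

```python
# (LPIPS, FID, FVD) copied from the table; lower is better for all three.
davis = {"vanilla": (0.3031, 52.45, 543.3), "full": (0.2355, 35.66, 399.2)}
pexels = {"vanilla": (0.2074, 63.24, 717.4), "full": (0.1366, 37.34, 452.3)}

def relative_reduction(before: float, after: float) -> float:
    """Fractional reduction of a lower-is-better metric."""
    return (before - after) / before

davis_gain = [relative_reduction(b, a) for b, a in zip(davis["vanilla"], davis["full"])]
pexels_gain = [relative_reduction(b, a) for b, a in zip(pexels["vanilla"], pexels["full"])]

for name, gains in (("DAVIS", davis_gain), ("Pexels", pexels_gain)):
    print(name, [f"{100 * g:.1f}%" for g in gains])
```

On these numbers, FVD falls by roughly 27% on DAVIS and 37% on Pexels when CFG++ and DDS are both enabled.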
Inference throughput and resource usage are competitive: ViBiDSampler matches or surpasses contemporaries such as TRF and Generative Inbetweening in both efficiency and output quality at 1024×576 resolution.
6. Limitations and Future Prospects
ViBiDSampler's effectiveness depends on the choice of CFG++ scale and frame-rate micro-conditioning; suboptimal parameters can degrade performance. Occasional failures remain in cases of extreme occlusion or very large inter-frame motion. The current framework is limited to interpolation between two boundary keyframes; extension to longer or unbounded sequences is proposed for future research. There is potential for improved robustness via adaptive step schedules, learned re-noising strengths, or applications to text-to-video diffusion tasks.
7. Significance and Comparative Analysis
ViBiDSampler establishes a new paradigm for diffusion-based video interpolation under two-frame constraints. By eschewing parallel fusion (which causes off-manifold artifacts) in favor of a strictly sequential, bidirectional approach, the method demonstrates empirically superior fidelity, temporal coherence, and practical inference efficiency. Its training-free, plug-and-play operation with strong U-Net backbones (e.g., Stable Video Diffusion) is readily extensible to a broad range of video synthesis contexts (Yang et al., 2024).