Emergent Mind

Video Interpolation with Diffusion Models

(2404.01203)
Published Apr 1, 2024 in cs.CV

Abstract

We present VIDIM, a generative model for video interpolation, which creates short videos given a start and end frame. In order to achieve high fidelity and generate motions unseen in the input data, VIDIM uses cascaded diffusion models to first generate the target video at low resolution, and then generate the high-resolution video conditioned on the low-resolution generated video. We compare VIDIM to previous state-of-the-art methods on video interpolation, and demonstrate how such works fail in most settings where the underlying motion is complex, nonlinear, or ambiguous while VIDIM can easily handle such cases. We additionally demonstrate how classifier-free guidance on the start and end frame and conditioning the super-resolution model on the original high-resolution frames without additional parameters unlocks high-fidelity results. VIDIM is fast to sample from as it jointly denoises all the frames to be generated, requires less than a billion parameters per diffusion model to produce compelling results, and still enjoys scalability and improved quality at larger parameter counts.
[Figure: Comparison of FID scores between VIDIM and the baseline model at varying guidance weights.]

Overview

  • Video Interpolation with Diffusion Models (VIDIM) introduces a novel approach using cascaded diffusion models to create intermediate frames for videos, aimed at increasing frame rate or generating slow-motion effects.

  • VIDIM leverages a two-step generative process combining a base diffusion model for initial low-resolution video generation and a super-resolution model for final high-quality output.

  • It outperforms existing methods in scenarios with complex and ambiguous motion, producing high-quality, plausible videos with the help of a UNet adapted for video and classifier-free guidance.

  • Empirical evaluations and user studies confirm VIDIM's superiority in producing realistic and temporally consistent videos, promising advancements in video interpolation and processing.

Introduction

Video interpolation creates intermediate frames between two consecutive frames of a video, either to increase the frame rate or to produce slow-motion footage. Traditional methods have largely relied on linear motion estimates or optical flow, which often struggle with complex, nonlinear, or ambiguous motion. The paper introduces Video Interpolation with Diffusion Models (VIDIM), a generative approach that uses cascaded diffusion models to address these cases. According to the authors, VIDIM significantly outperforms existing state-of-the-art methods on complex and ambiguous motion, generating high-quality, plausible videos even in difficult scenarios.

Methodology

Cascaded Diffusion Models for Video Generation

VIDIM uses a two-step generative process. First, a base diffusion model conditioned on the start and end frames generates the target video at low resolution. Then a super-resolution model, conditioned on this low-resolution video and on the original high-resolution start and end frames, synthesizes the final high-resolution video. This cascaded design, which follows earlier cascaded diffusion work, is intended to capture fine detail while keeping the frames temporally consistent.
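The data flow of the two stages can be sketched as follows. This is only an illustration of how the inputs and outputs connect, not the paper's implementation: `base_model_sample` and `super_resolution_sample` are hypothetical stand-ins (a linear blend and a nearest-neighbour upsample) for what would be diffusion samplers, and the frame count, resolutions, and downsampling factor are assumed values.

```python
import numpy as np

def downsample(frame, factor):
    """Average-pool an (H, W, C) frame by an integer factor."""
    h, w, c = frame.shape
    return frame.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))

def base_model_sample(start_lr, end_lr, num_frames):
    """Hypothetical stand-in for the base diffusion sampler.

    A real model would jointly denoise all middle frames conditioned on the
    low-resolution start and end frames; here a linear blend is a placeholder.
    """
    weights = np.linspace(0.0, 1.0, num_frames + 2)[1:-1]
    return np.stack([(1.0 - w) * start_lr + w * end_lr for w in weights])

def super_resolution_sample(low_res_video, start_hr, end_hr):
    """Hypothetical stand-in for the super-resolution diffusion sampler.

    A real model would condition on the low-resolution video and on both
    high-resolution end frames; this placeholder only uses start_hr for the
    target size and upsamples by nearest neighbour.
    """
    scale = start_hr.shape[0] // low_res_video.shape[1]
    return low_res_video.repeat(scale, axis=1).repeat(scale, axis=2)

def interpolate(start_hr, end_hr, num_frames=7, factor=4):
    """Stage 1: low-res video from low-res end frames. Stage 2: upsample it."""
    start_lr = downsample(start_hr, factor)
    end_lr = downsample(end_hr, factor)
    low_res_video = base_model_sample(start_lr, end_lr, num_frames)
    high_res_video = super_resolution_sample(low_res_video, start_hr, end_hr)
    return low_res_video, high_res_video

start = np.zeros((64, 64, 3))
end = np.ones((64, 64, 3))
low, high = interpolate(start, end)
print(low.shape, high.shape)
```

The point of the sketch is the interface: the second stage receives both the generated low-resolution video and the original high-resolution frames.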

Architectural Innovations and Training Regimen

The study introduces several changes to the model's architecture and training process. VIDIM uses a UNet architecture adapted for video, in which temporal attention blocks mix feature maps across frames. It also conditions on the start and end frames by assigning them fake noise levels, which lets information from these frames propagate through the network without extra parameters. Finally, the models use classifier-free guidance, which the authors report substantially improves sample quality and is critical for realistic interpolation results.
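As a rough illustration of classifier-free guidance, the sketch below combines a conditional and an unconditional noise prediction in the form commonly used in the classifier-free guidance literature. The function name, the argument names, and the weight convention (a weight of 0 meaning no guidance) are assumptions made for this example; the summary does not state VIDIM's exact parameterization.

```python
import numpy as np

def guided_prediction(eps_cond, eps_uncond, guidance_weight):
    """Combine two denoiser outputs with classifier-free guidance.

    eps_cond:   prediction made with the start/end frame conditioning.
    eps_uncond: prediction made with that conditioning dropped.
    guidance_weight: 0 returns eps_cond unchanged (assumed convention);
        larger values push the result further away from eps_uncond.
    """
    return (1.0 + guidance_weight) * eps_cond - guidance_weight * eps_uncond

# Toy arrays shaped like (frames, height, width, channels).
eps_c = np.full((7, 16, 16, 3), 2.0)
eps_u = np.full((7, 16, 16, 3), 1.0)
print(guided_prediction(eps_c, eps_u, 1.0).mean())
```

In a sampler this combination would be applied at every denoising step, which requires two forward passes of the network per step.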

During training, VIDIM models are optimized using a continuous-time objective based on the evidence lower bound (ELBO), with adjustments for video-specific dynamics. Training leverages large-scale video datasets, with procedures in place to filter out undesirable examples, such as those with rapid scene cuts, ensuring that the models learn from relevant data.

Empirical Evaluation

Benchmarking Against State-of-the-Art

VIDIM was evaluated against several state-of-the-art video interpolation methods on challenging datasets derived from the DAVIS and UCF101 collections. The evaluation covered both generative metrics, such as Fréchet Video Distance (FVD), and traditional reconstruction-based metrics. VIDIM consistently outperformed the baseline models, especially on examples with large and ambiguous motion, supporting the claim that it generates more plausible and temporally consistent videos.
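Fréchet-style metrics such as FVD compare Gaussian fits to feature embeddings of real and generated samples. The sketch below computes only that Fréchet distance from two feature arrays. The feature extractor (a pretrained video network in the case of FVD) is not shown, and the function name and the random features used here are assumptions for illustration.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two (N, D) feature sets.

    d^2 = ||mu_a - mu_b||^2 + Tr(C_a + C_b - 2 (C_a C_b)^(1/2))
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # The trace of the matrix square root equals the sum of the square roots
    # of the eigenvalues of C_a C_b, which are real and non-negative in exact
    # arithmetic; clip small negative values caused by rounding.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

rng = np.random.default_rng(0)
real_feats = rng.normal(size=(200, 4))          # placeholder "real" features
fake_feats = rng.normal(loc=0.5, size=(200, 4))  # placeholder "generated" features
print(frechet_distance(real_feats, fake_feats))
```

Lower values mean the two feature distributions are closer; the distance is zero when the fitted means and covariances match.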

User Study

A user study, in which participants compared video quadruplets generated from the same input frame pairs, reinforced these results. Participants overwhelmingly preferred VIDIM-generated videos over those produced by baseline models, indicating that it produces high-quality, realistic videos even under difficult conditions.

Ablations and Further Insights

The study carried out ablations to dissect the contributions of various components, particularly highlighting the importance of explicit frame conditioning and classifier-free guidance in achieving optimal results. Scalability tests further demonstrated VIDIM's capacity to improve with larger models, though balancing the parameter count in both base and super-resolution models was crucial for maximizing quality.

Conclusion and Future Directions

VIDIM represents a significant advancement in video interpolation, notably for scenarios that have historically posed challenges for generative models. By leveraging cascaded diffusion models and novel architectural tweaks, VIDIM sets new standards for video interpolation quality. Future work might explore its application to other video generation tasks, extend its capabilities to arbitrary aspect ratios, or further refine super-resolution models to enhance quality. The findings promise exciting developments in video processing and generative modeling, paving the way for more realistic and complex video generation tasks.

