
LDMVFI: Video Frame Interpolation with Latent Diffusion Models (2303.09508v3)

Published 16 Mar 2023 in eess.IV and cs.CV

Abstract: Existing works on video frame interpolation (VFI) mostly employ deep neural networks that are trained by minimizing the L1, L2, or deep feature space distance (e.g. VGG loss) between their outputs and ground-truth frames. However, recent works have shown that these metrics are poor indicators of perceptual VFI quality. Towards developing perceptually-oriented VFI methods, in this work we propose latent diffusion model-based VFI, LDMVFI. This approaches the VFI problem from a generative perspective by formulating it as a conditional generation problem. As the first effort to address VFI using latent diffusion models, we rigorously benchmark our method on common test sets used in the existing VFI literature. Our quantitative experiments and user study indicate that LDMVFI is able to interpolate video content with favorable perceptual quality compared to the state of the art, even in the high-resolution regime. Our code is available at https://github.com/danier97/LDMVFI.


Summary

  • The paper presents LDMVFI, a novel approach that reframes video frame interpolation as a conditional generation problem using latent diffusion models and a custom VQ-FIGAN autoencoder.
  • It employs VQ-FIGAN with deformable kernel synthesis and MaxViT-based self-attention to integrate neighboring frame features and improve latent representations.
  • Evaluations on common VFI benchmarks, including high-resolution content, together with a user study show that LDMVFI achieves favorable perceptual quality compared with state-of-the-art VFI methods.

Overview of "LDMVFI: Video Frame Interpolation with Latent Diffusion Models"

The paper "LDMVFI: Video Frame Interpolation with Latent Diffusion Models" by Duolikun Danier, Fan Zhang, and David Bull introduces a novel approach to video frame interpolation (VFI) by employing latent diffusion models (LDMs). Traditional methods in VFI generally utilize deep neural networks optimized based on loss metrics such as L1, L2, and VGG feature space distances. However, these metrics often fail to accurately assess perceptual quality as perceived by human observers. In contrast, the approach proposed in this paper reformulates VFI as a conditional generation problem, leveraging the generative capabilities of latent diffusion models.

Key Contributions

  1. Introduction of LDMVFI:
    • The research presents LDMVFI, a method that reframes the VFI task using LDMs, a class of diffusion models that operate in a compact latent space rather than directly in pixel space.
    • It employs a purpose-built autoencoding model, VQ-FIGAN, which is tailored to the VFI task and defines the latent space in which the diffusion process operates.
  2. VFI-Specific Innovations:
    • The paper proposes a vector-quantized autoencoding model (VQ-FIGAN) that replaces the generic autoencoder used in standard LDMs with VFI-specific components, such as deformable kernel-based frame synthesis and MaxViT-based attention mechanisms.
  3. Benchmarking and Evaluation:
    • LDMVFI is rigorously benchmarked on several common VFI test sets, including high-resolution (up to 4K) content, showing favorable performance relative to state-of-the-art models according to perceptual metrics such as LPIPS, FloLPIPS, and FID (an illustrative metric computation is sketched after this list).
    • A user study further corroborates these results, highlighting the perceptual quality of LDMVFI's outputs.
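As an illustration of how such a perceptual metric is computed (this is not the authors' evaluation code, and the file names are hypothetical), the LPIPS distance between an interpolated frame and its ground truth can be obtained with the open-source lpips package:

```python
# Illustrative only: computing LPIPS between an interpolated frame and the
# ground-truth middle frame. Requires: pip install lpips torch torchvision pillow
import torch
import lpips
from PIL import Image
from torchvision import transforms

to_tensor = transforms.ToTensor()

def load_frame(path: str) -> torch.Tensor:
    """Load an image and map it to the [-1, 1] range expected by LPIPS."""
    img = to_tensor(Image.open(path).convert("RGB"))  # [0, 1], shape (3, H, W)
    return (img * 2.0 - 1.0).unsqueeze(0)             # [-1, 1], shape (1, 3, H, W)

loss_fn = lpips.LPIPS(net="alex")  # AlexNet backbone, a common choice for LPIPS

pred = load_frame("interpolated_frame.png")  # hypothetical VFI output
gt = load_frame("ground_truth_frame.png")    # hypothetical ground-truth frame

with torch.no_grad():
    distance = loss_fn(pred, gt)  # lower means perceptually closer

print(f"LPIPS: {distance.item():.4f}")
```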

Methodological Approach

LDMVFI introduces a novel perspective on VFI by recasting it as a generative modeling task. Traditional pixel-space formulations are replaced by operations in a learned latent space via a two-component system: a VQ-FIGAN encoder-decoder network and a denoising U-Net that performs the reverse diffusion (a schematic of this pipeline is sketched below).
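The following sketch illustrates this two-stage data flow under toy assumptions: TinyEncoder, TinyDenoiser, and TinyDecoder are hypothetical stand-ins (they are not VQ-FIGAN or the paper's U-Net, and the reverse-diffusion update is deliberately simplified), but the encode-denoise-decode structure mirrors the pipeline described above. Note that the real VQ-FIGAN decoder additionally consumes neighboring-frame features via cross-attention and deformable kernels, which this toy version omits.

```python
# Schematic sketch of an LDMVFI-style inference pipeline with toy modules.
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Maps a frame (3, H, W) to a compact latent (C, H/4, W/4)."""
    def __init__(self, latent_channels: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, latent_channels, 3, stride=2, padding=1),
        )
    def forward(self, x):
        return self.net(x)

class TinyDenoiser(nn.Module):
    """Predicts noise from the noisy target latent plus the two neighbor latents."""
    def __init__(self, latent_channels: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(latent_channels * 3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, latent_channels, 3, padding=1),
        )
    def forward(self, z_t, cond):
        return self.net(torch.cat([z_t, cond], dim=1))

class TinyDecoder(nn.Module):
    """Maps a latent back to a frame (3, H, W)."""
    def __init__(self, latent_channels: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(latent_channels, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),
        )
    def forward(self, z):
        return self.net(z)

def interpolate(frame0, frame1, encoder, denoiser, decoder, steps: int = 10):
    """Generate an intermediate frame by reverse diffusion in latent space."""
    cond = torch.cat([encoder(frame0), encoder(frame1)], dim=1)  # conditioning latents
    z = torch.randn_like(encoder(frame0))                        # start from pure noise
    for _ in range(steps):
        eps = denoiser(z, cond)
        z = z - eps / steps        # toy update, not a real DDPM/DDIM step
    return decoder(z)

if __name__ == "__main__":
    f0, f1 = torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64)  # dummy neighboring frames
    out = interpolate(f0, f1, TinyEncoder(), TinyDenoiser(), TinyDecoder())
    print(out.shape)  # torch.Size([1, 3, 64, 64])
```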

  • VQ-FIGAN Model:
    • This model improves upon typical autoencoders by incorporating neighboring frames' features into the decoding process through cross-attention mechanisms.
    • It adopts a vector quantization approach to improve the perceptual quality and representation capability of the latent features.
  • Diffusion Process:
    • The reverse diffusion process incrementally denoises a latent representation of the intermediate frame, conditioned on the two adjacent video frames.
    • This generative formulation targets perceptual quality directly, sidestepping the limitations of distortion-oriented training losses such as L1, L2, and VGG distances (a toy training-step sketch follows this list).
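To make the conditioning and the noise-prediction objective concrete, here is a toy, self-contained training-step sketch in the generic DDPM style. This is not the paper's code; the tensor shapes and the placeholder denoiser are assumptions for illustration only.

```python
# Toy illustration of the noise-prediction training step used by diffusion-based
# VFI models (generic DDPM-style objective; not the paper's exact code).
import torch
import torch.nn.functional as F

def ddpm_training_step(denoiser, z0, cond, alpha_bar):
    """One training step: noise the clean latent z0 at a random timestep,
    then regress the denoiser's output onto the injected noise.

    denoiser:  callable (z_t, cond, t) -> predicted noise
    z0:        clean latent of the ground-truth middle frame, shape (B, C, h, w)
    cond:      conditioning latents derived from the two neighboring frames
    alpha_bar: 1-D tensor of cumulative noise-schedule products, shape (T,)
    """
    B = z0.shape[0]
    t = torch.randint(0, alpha_bar.shape[0], (B,), device=z0.device)
    a = alpha_bar[t].view(B, 1, 1, 1)
    eps = torch.randn_like(z0)
    z_t = a.sqrt() * z0 + (1.0 - a).sqrt() * eps   # forward (noising) process
    eps_pred = denoiser(z_t, cond, t)
    return F.mse_loss(eps_pred, eps)               # simple noise-prediction loss

if __name__ == "__main__":
    toy_denoiser = lambda z_t, cond, t: torch.zeros_like(z_t)  # placeholder denoiser
    z0 = torch.randn(2, 8, 16, 16)
    cond = torch.randn(2, 16, 16, 16)
    alpha_bar = torch.linspace(0.99, 0.01, 1000)
    print(ddpm_training_step(toy_denoiser, z0, cond, alpha_bar).item())
```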

Implications and Future Developments

The introduction of diffusion models in video frame interpolation opens new avenues for improving perceptual quality in video processing. This work suggests that integrating sophisticated generative techniques with VFI can offer substantial improvements over traditional methods, especially in complex motion scenarios like dynamic textures.

For future developments, optimizing LDM-based architectures for efficiency could mitigate the computational demands observed in LDMVFI. Exploring faster sampling techniques and model distillation could significantly improve inference speeds, making LDMVFI more suitable for real-time applications.

Conclusion

Overall, this paper marks a notable advance in the VFI field, demonstrating the efficacy of latent diffusion models for generating high-quality video frames with enhanced perceptual fidelity. By adapting techniques from generative modeling, this research provides a robust framework for future studies focused on perception-oriented video synthesis and processing tasks.