LDMVFI: Video Frame Interpolation with Latent Diffusion Models (2303.09508v3)
Abstract: Existing works on video frame interpolation (VFI) mostly employ deep neural networks that are trained by minimizing the L1, L2, or deep feature space distance (e.g., VGG loss) between their outputs and ground-truth frames. However, recent works have shown that these metrics are poor indicators of perceptual VFI quality. Towards developing perceptually-oriented VFI methods, in this work we propose latent diffusion model-based VFI, LDMVFI. It approaches the VFI problem from a generative perspective, formulating it as a conditional generation problem. As the first effort to address VFI using latent diffusion models, we rigorously benchmark our method on common test sets used in the existing VFI literature. Our quantitative experiments and user study indicate that LDMVFI is able to interpolate video content with favorable perceptual quality compared to the state of the art, even in the high-resolution regime. Our code is available at https://github.com/danier97/LDMVFI.
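To make the conditional-generation formulation concrete, the sketch below shows one way an intermediate frame could be sampled with a latent diffusion model: encode the two neighbouring frames into a latent space, run a conditional reverse diffusion process to sample the latent of the in-between frame, and decode it back to pixels. This is a minimal PyTorch sketch under those assumptions only; `ToyAutoencoder`, `ToyDenoiser`, and `interpolate` are illustrative placeholders and do not reproduce the actual LDMVFI architecture or its API (see the linked repository for the real implementation).

```python
# Illustrative sketch of VFI as conditional latent diffusion
# (toy modules; NOT the LDMVFI architecture or its actual API).
import torch
import torch.nn as nn


class ToyAutoencoder(nn.Module):
    """Stand-in latent autoencoder: 3 x H x W frame <-> C x (H/8) x (W/8) latent."""
    def __init__(self, latent_ch=4):
        super().__init__()
        self.enc = nn.Conv2d(3, latent_ch, kernel_size=8, stride=8)
        self.dec = nn.ConvTranspose2d(latent_ch, 3, kernel_size=8, stride=8)

    def encode(self, x):
        return self.enc(x)

    def decode(self, z):
        return self.dec(z)


class ToyDenoiser(nn.Module):
    """Predicts the noise in z_t, conditioned on the latents of the two input frames."""
    def __init__(self, latent_ch=4):
        super().__init__()
        self.net = nn.Conv2d(3 * latent_ch, latent_ch, kernel_size=3, padding=1)

    def forward(self, z_t, z_prev, z_next, t):
        # Timestep conditioning omitted for brevity; a real model embeds t.
        return self.net(torch.cat([z_t, z_prev, z_next], dim=1))


@torch.no_grad()
def interpolate(frame0, frame1, autoencoder, denoiser, num_steps=50):
    """DDPM-style ancestral sampling of the intermediate frame's latent."""
    z_prev, z_next = autoencoder.encode(frame0), autoencoder.encode(frame1)
    betas = torch.linspace(1e-4, 0.02, num_steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    z_t = torch.randn_like(z_prev)                      # start from pure noise
    for t in reversed(range(num_steps)):
        eps = denoiser(z_t, z_prev, z_next, t)          # predicted noise
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (z_t - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(z_t) if t > 0 else torch.zeros_like(z_t)
        z_t = mean + torch.sqrt(betas[t]) * noise
    return autoencoder.decode(z_t)                      # back to pixel space


if __name__ == "__main__":
    # Dummy usage: two 256x256 RGB frames -> one interpolated frame.
    ae, eps_model = ToyAutoencoder(), ToyDenoiser()
    f0, f1 = torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256)
    mid = interpolate(f0, f1, ae, eps_model)
    print(mid.shape)  # torch.Size([1, 3, 256, 256])
```

The key design point the abstract relies on is that diffusion runs in a compact latent space rather than on raw pixels, which is what keeps sampling tractable in the high-resolution regime.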