VFIMamba: Video Frame Interpolation with State Space Models (2407.02315v2)

Published 2 Jul 2024 in cs.CV and cs.AI

Abstract: Inter-frame modeling is pivotal in generating intermediate frames for video frame interpolation (VFI). Current approaches predominantly rely on convolution or attention-based models, which often either lack sufficient receptive fields or entail significant computational overheads. Recently, Selective State Space Models (S6) have emerged, tailored specifically for long sequence modeling, offering both linear complexity and data-dependent modeling capabilities. In this paper, we propose VFIMamba, a novel frame interpolation method for efficient and dynamic inter-frame modeling by harnessing the S6 model. Our approach introduces the Mixed-SSM Block (MSB), which initially rearranges tokens from adjacent frames in an interleaved fashion and subsequently applies multi-directional S6 modeling. This design facilitates the efficient transmission of information across frames while upholding linear complexity. Furthermore, we introduce a novel curriculum learning strategy that progressively cultivates proficiency in modeling inter-frame dynamics across varying motion magnitudes, fully unleashing the potential of the S6 model. Experimental findings showcase that our method attains state-of-the-art performance across diverse benchmarks, particularly excelling in high-resolution scenarios. In particular, on the X-TEST dataset, VFIMamba demonstrates a noteworthy improvement of 0.80 dB for 4K frames and 0.96 dB for 2K frames.
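The core mechanism described in the abstract, interleaving tokens from the two input frames and then running directional S6-style scans over the combined sequence, can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's implementation: the function names, the scalar per-channel recurrence, and the simple forward/backward averaging are assumptions for clarity, whereas the actual Mixed-SSM Block uses learned, data-dependent S6 parameters and additional scan directions over 2D feature maps.

```python
import numpy as np

def interleave_tokens(f0, f1):
    """Interleave the token sequences of two adjacent frames:
    [f0_0, f1_0, f0_1, f1_1, ...], so neighboring tokens in the
    scan come from different frames (illustrative rearrangement)."""
    assert f0.shape == f1.shape
    n, c = f0.shape
    out = np.empty((2 * n, c), dtype=f0.dtype)
    out[0::2] = f0
    out[1::2] = f1
    return out

def selective_scan_1d(x, a, b):
    """Toy linear state-space recurrence h_t = a_t * h_{t-1} + b_t * x_t,
    with one scalar state per channel. In S6, a and b would be
    data-dependent functions of the input; here they are given."""
    h = np.zeros(x.shape[1], dtype=x.dtype)
    ys = []
    for t in range(x.shape[0]):
        h = a[t] * h + b[t] * x[t]
        ys.append(h.copy())
    return np.stack(ys)

def multi_directional_scan(x, a, b):
    """Combine a forward and a backward pass over the interleaved
    sequence (a simplified stand-in for multi-directional S6 modeling)."""
    fwd = selective_scan_1d(x, a, b)
    bwd = selective_scan_1d(x[::-1], a[::-1], b[::-1])[::-1]
    return 0.5 * (fwd + bwd)
```

Because the recurrence visits each token once per direction, the cost grows linearly with sequence length, which is the property that makes this style of inter-frame modeling attractive at 2K/4K resolutions compared with quadratic attention.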
