
SportsSloMo: A New Benchmark and Baselines for Human-centric Video Frame Interpolation (2308.16876v2)

Published 31 Aug 2023 in cs.CV

Abstract: Human-centric video frame interpolation has great potential for improving people's entertainment experiences and finding commercial applications in the sports analysis industry, e.g., synthesizing slow-motion videos. Although there are multiple benchmark datasets available in the community, none of them is dedicated for human-centric scenarios. To bridge this gap, we introduce SportsSloMo, a benchmark consisting of more than 130K video clips and 1M video frames of high-resolution ($\geq$720p) slow-motion sports videos crawled from YouTube. We re-train several state-of-the-art methods on our benchmark, and the results show a decrease in their accuracy compared to other datasets. It highlights the difficulty of our benchmark and suggests that it poses significant challenges even for the best-performing methods, as human bodies are highly deformable and occlusions are frequent in sports videos. To improve the accuracy, we introduce two loss terms considering the human-aware priors, where we add auxiliary supervision to panoptic segmentation and human keypoints detection, respectively. The loss terms are model agnostic and can be easily plugged into any video frame interpolation approaches. Experimental results validate the effectiveness of our proposed loss terms, leading to consistent performance improvement over 5 existing models, which establish strong baseline models on our benchmark. The dataset and code can be found at: https://neu-vi.github.io/SportsSlomo/.


Summary

  • The paper introduces the SportsSloMo dataset with over 130K high-resolution sports clips, providing a rich resource for human-centric video frame interpolation.
  • The paper demonstrates that existing VFI techniques struggle with complex human movements and occlusions in sports, leading to notable performance drops.
  • The paper proposes innovative human-aware loss terms that consistently improve interpolation performance as measured by PSNR, SSIM, and IE metrics.
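The metrics named above can be computed as follows. This is a minimal NumPy sketch of the standard PSNR and interpolation-error (IE, taken here as RMSE) formulas, not code from the paper; the exact IE definition the authors use may differ.

```python
import numpy as np

def psnr(pred: np.ndarray, gt: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio between an interpolated frame and ground truth."""
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames: PSNR is unbounded
    return 10.0 * np.log10(max_val ** 2 / mse)

def interpolation_error(pred: np.ndarray, gt: np.ndarray) -> float:
    """IE sketched as root-mean-squared pixel difference (lower is better)."""
    diff = pred.astype(np.float64) - gt.astype(np.float64)
    return float(np.sqrt(np.mean(diff ** 2)))

# Toy frames: a perfect prediction gives infinite PSNR and zero IE.
gt = np.full((4, 4, 3), 128, dtype=np.uint8)
pred = gt.copy()
print(psnr(pred, gt))                 # inf
print(interpolation_error(pred, gt))  # 0.0
```

SSIM, the third metric, is more involved (local windows, luminance/contrast/structure terms) and is typically taken from an existing implementation rather than re-derived.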

An Expert Review of "SportsSloMo: A New Benchmark and Baselines for Human-centric Video Frame Interpolation"

The paper "SportsSloMo: A New Benchmark and Baselines for Human-centric Video Frame Interpolation" presents a significant contribution to the field of computer vision and specifically to the domain of video frame interpolation (VFI) by introducing a novel benchmark dataset named SportsSloMo. This work focuses on enhancing human-centric video frame interpolation by addressing the complex challenges posed by sports videos, which involve deformable human bodies and frequent occlusions.

Summary of Contributions

  1. Introduction of SportsSloMo Dataset: The authors have curated SportsSloMo, a large benchmark featuring over 130,000 high-resolution (≥720p) slow-motion sports video clips sourced from YouTube. This dataset comprises more than 1 million frames, providing a diverse range of sports scenarios that are absent in existing benchmarks. The dataset's primary focus is human-centric video content, which is crucial given the increasing consumer interest in slow-motion videos for enhanced entertainment and sports analysis.
  2. Challenges in Human-centric Scenarios: The paper highlights the difficulties faced by state-of-the-art VFI techniques when applied to this dataset. Human-centric scenarios in sports inherently involve highly deformable bodies and frequent occlusions, making it hard for existing methods to maintain the performance levels they achieve on general-purpose datasets. The authors demonstrate a noticeable decrease in accuracy for several re-trained models on SportsSloMo.
  3. Proposed Human-aware Loss Terms: To address the intricacies of human motion and occlusion, the authors introduce novel human-aware loss terms. These terms add auxiliary supervision from panoptic segmentation and human keypoint detection. The human-aware losses are model-agnostic and can be integrated into existing video frame interpolation methods with minimal changes.
  4. Empirical Validation and Enhancement of VFI Models: Extensive experiments validate that these loss terms lead to consistent performance improvement across various models. Notably, they employ these loss terms on several flow-based and flow-agnostic VFI models, demonstrating enhancements in PSNR, SSIM, and IE metrics on the SportsSloMo dataset.
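As a concrete illustration of the model-agnostic design described above, the overall training objective could combine a reconstruction loss with the two auxiliary terms as in the following NumPy sketch. The specific loss forms (L1 reconstruction, L2 auxiliary terms) and the weighting constants are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

# Hypothetical weights for the auxiliary terms; the paper's values may differ.
LAMBDA_SEG = 0.5
LAMBDA_KPT = 0.5

def l1_loss(pred: np.ndarray, gt: np.ndarray) -> float:
    """Standard per-pixel reconstruction loss on the interpolated frame."""
    return float(np.mean(np.abs(pred - gt)))

def seg_consistency_loss(pred_masks: np.ndarray, gt_masks: np.ndarray) -> float:
    """Auxiliary term: penalize disagreement between segmentation masks
    predicted on the interpolated frame and pseudo-ground-truth masks."""
    return float(np.mean((pred_masks - gt_masks) ** 2))

def keypoint_loss(pred_heatmaps: np.ndarray, gt_heatmaps: np.ndarray) -> float:
    """Auxiliary term on human keypoint heatmaps (L2 over heatmap pixels)."""
    return float(np.mean((pred_heatmaps - gt_heatmaps) ** 2))

def total_loss(pred_frame, gt_frame, pred_masks, gt_masks,
               pred_kpts, gt_kpts) -> float:
    """Reconstruction loss plus weighted human-aware auxiliary terms.
    Because the auxiliary terms depend only on the model's output frame,
    they can be attached to any VFI model's training objective."""
    return (l1_loss(pred_frame, gt_frame)
            + LAMBDA_SEG * seg_consistency_loss(pred_masks, gt_masks)
            + LAMBDA_KPT * keypoint_loss(pred_kpts, gt_kpts))
```

The key design point is that the auxiliary supervision acts on the synthesized frame itself, which is why the terms plug into flow-based and flow-agnostic models alike.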

Implications and Future Directions

Practically, this research advances the capabilities of video frame interpolation systems, which can now better handle sporting scenes involving humans. This enhancement has implications in areas such as sports broadcasting and coaching, where detailed frame-by-frame analysis of athletic performance is valuable.

Theoretically, the introduction of human-aware loss functions invites future exploration of more sophisticated context-aware loss designs. Additionally, the nuances captured by the dataset could encourage the development of more robust motion representation strategies, potentially leveraging 3D scene understanding. The paper also implicitly points toward the need for higher-order motion prediction models that account for complex occlusions and non-linear motion in future work.

The SportsSloMo benchmark and associated findings offer a solid foundation for subsequent innovation in video processing for human-centric applications. As the dataset is publicly available, it can be expected to stimulate research in human-centric frame interpolation and in adjacent domains such as video super-resolution and activity recognition in crowded sports scenes.
