SportsSloMo: A New Benchmark and Baselines for Human-centric Video Frame Interpolation (2308.16876v2)
Abstract: Human-centric video frame interpolation has great potential for improving people's entertainment experiences and for commercial applications in the sports analysis industry, e.g., synthesizing slow-motion videos. Although multiple benchmark datasets are available in the community, none of them is dedicated to human-centric scenarios. To bridge this gap, we introduce SportsSloMo, a benchmark consisting of more than 130K video clips and 1M video frames of high-resolution ($\geq$720p) slow-motion sports videos crawled from YouTube. We re-train several state-of-the-art methods on our benchmark, and the results show a decrease in their accuracy compared to other datasets. This highlights the difficulty of our benchmark and suggests that it poses significant challenges even for the best-performing methods, as human bodies are highly deformable and occlusions are frequent in sports videos. To improve accuracy, we introduce two loss terms that incorporate human-aware priors, adding auxiliary supervision from panoptic segmentation and human keypoint detection, respectively. The loss terms are model-agnostic and can be easily plugged into any video frame interpolation approach. Experimental results validate the effectiveness of the proposed loss terms, which lead to consistent performance improvements over 5 existing models and establish strong baseline models on our benchmark. The dataset and code can be found at: https://neu-vi.github.io/SportsSlomo/.
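
The abstract describes the two human-aware loss terms only at a high level, so the following is a minimal sketch of how model-agnostic auxiliary terms of this kind could be added to a reconstruction loss, not the authors' implementation. It assumes that the segmentation and keypoint models are applied to both the interpolated and the ground-truth frame and that their outputs are compared. The function names (`l1`, `human_aware_loss`), the L1 comparison, the weights `w_seg` and `w_kpt`, and the toy stand-in models are all hypothetical and chosen for illustration.

```python
from typing import Callable, Sequence

Frame = Sequence[float]  # flattened frame; a stand-in for an image tensor


def l1(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def human_aware_loss(
    pred: Frame,
    target: Frame,
    segmenter: Callable[[Frame], Sequence[float]],
    keypoint_detector: Callable[[Frame], Sequence[float]],
    w_seg: float = 0.1,  # assumed weight, not taken from the paper
    w_kpt: float = 0.1,  # assumed weight, not taken from the paper
) -> float:
    """Reconstruction loss plus two auxiliary, human-aware terms."""
    # Standard reconstruction term between interpolated and true frame.
    recon = l1(pred, target)
    # Auxiliary term 1: segmentation outputs of the two frames should agree.
    seg = l1(segmenter(pred), segmenter(target))
    # Auxiliary term 2: keypoint outputs of the two frames should agree.
    kpt = l1(keypoint_detector(pred), keypoint_detector(target))
    return recon + w_seg * seg + w_kpt * kpt


# Toy stand-ins for the auxiliary models (hypothetical, for illustration).
def toy_segmenter(frame: Frame) -> list:
    return [1.0 if v > 0.5 else 0.0 for v in frame]


def toy_keypoints(frame: Frame) -> list:
    return [sum(frame) / len(frame)]


loss = human_aware_loss(
    [0.0, 0.0, 1.0, 1.0], [0.0, 1.0, 1.0, 1.0], toy_segmenter, toy_keypoints
)
```

Because the interpolation network itself never appears in `human_aware_loss`, the same function can wrap the output of any interpolation model, which is the sense in which such terms are model-agnostic.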