ViSTec: Video Modeling for Sports Technique Recognition and Tactical Analysis (2402.15952v1)
Abstract: The immense popularity of racket sports has fueled substantial demand for tactical analysis of broadcast videos. However, existing manual methods require laborious annotation, and recent attempts that leverage video perception models are limited to low-level annotations such as ball trajectories, overlooking tactics that require an understanding of stroke techniques. State-of-the-art action segmentation models also struggle with technique recognition because of frequent occlusions and motion-induced blurring in racket sports videos. To address these challenges, we propose ViSTec, a Video-based Sports Technique recognition model inspired by human cognition that synergizes sparse visual data with rich contextual insights. Our approach integrates a graph to explicitly model strategic knowledge in stroke sequences and to enhance technique recognition with a contextual inductive bias. A two-stage action perception model is jointly trained to align with the contextual knowledge in the graph. Experiments demonstrate that our method outperforms existing models by a significant margin. Case studies with experts from the Chinese national table tennis team validate our model's capacity to automate analysis of technical actions and tactical strategies. More details are available at: https://ViSTec2024.github.io/.
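To make the idea of a contextual graph over stroke sequences concrete, the following is a minimal illustrative sketch and not the authors' implementation. It assumes a first-order transition graph estimated from labeled rallies, which is then used as a prior to rescore ambiguous per-stroke visual predictions. The label set (`serve`, `push`, `topspin`, `block`), the function names, and the additive smoothing are all hypothetical choices made for illustration; the abstract does not specify how the graph is built or how it is combined with the perception model.

```python
from collections import Counter, defaultdict


def build_transition_graph(sequences, smoothing=1.0):
    """Estimate P(next technique | previous technique) from labeled rallies.

    Hypothetical helper: uses additive smoothing so unseen transitions
    keep a small non-zero probability.
    """
    labels = sorted({s for seq in sequences for s in seq})
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    graph = {}
    for prev in labels:
        total = sum(counts[prev].values()) + smoothing * len(labels)
        graph[prev] = {
            nxt: (counts[prev][nxt] + smoothing) / total for nxt in labels
        }
    return graph


def rescore(visual_probs, prev_label, graph):
    """Multiply visual class probabilities by the contextual prior and renormalize."""
    prior = graph[prev_label]
    scores = {k: visual_probs.get(k, 0.0) * prior[k] for k in prior}
    z = sum(scores.values())
    if z == 0.0:
        # No visual evidence for any known label: fall back to the prior alone.
        return dict(prior)
    return {k: v / z for k, v in scores.items()}


# Toy rallies with a hypothetical technique vocabulary.
rallies = [
    ["serve", "push", "topspin"],
    ["serve", "topspin", "block"],
    ["serve", "push", "push", "topspin"],
]
graph = build_transition_graph(rallies)

# A blurred frame leaves the visual model undecided between two techniques.
ambiguous = {"push": 0.5, "topspin": 0.5}
rescored = rescore(ambiguous, prev_label="serve", graph=graph)
best = max(rescored, key=rescored.get)
```

In this toy data a push follows a serve more often than a topspin does, so the prior tips the undecided visual prediction toward `push`. A jointly trained model, as described in the abstract, would learn this alignment end to end rather than apply a fixed post hoc rescoring step.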