ASTRA: An Action Spotting TRAnsformer for Soccer Videos (2404.01891v1)
Abstract: In this paper, we introduce ASTRA, a Transformer-based model designed for the task of Action Spotting in soccer matches. ASTRA addresses several challenges inherent in the task and dataset, including the requirement for precise action localization, the presence of a long-tail data distribution, the non-visibility of certain actions, and inherent label noise. To do so, ASTRA incorporates (a) a Transformer encoder-decoder architecture to achieve the desired output temporal resolution and produce precise predictions, (b) a balanced mixup strategy to handle the long-tail distribution of the data, (c) an uncertainty-aware displacement head to capture label variability, and (d) an input audio signal to enhance the detection of non-visible actions. Results demonstrate the effectiveness of ASTRA, which achieves a tight Average-mAP of 66.82 on the test set. Moreover, in the SoccerNet 2023 Action Spotting challenge, we secure 3rd place with an Average-mAP of 70.21 on the challenge set.
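The balanced mixup strategy mentioned in (b) can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes standard mixup (a convex combination of two sample/label pairs with a Beta-distributed mixing weight), where one sample comes from the usual head-biased data distribution and the other from a class-balanced sampler, so tail classes appear in more mixed examples. The function name `balanced_mixup` and its signature are hypothetical.

```python
import numpy as np

def balanced_mixup(x_head, y_head, x_tail, y_tail, alpha=0.2, rng=None):
    """Mix one sample drawn from the full (head-biased) distribution with one
    drawn from a class-balanced sampler. Standard mixup takes the convex
    combination lam*x1 + (1-lam)*x2 of both inputs and (one-hot) labels,
    with lam ~ Beta(alpha, alpha)."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    x = lam * x_head + (1.0 - lam) * x_tail
    y = lam * y_head + (1.0 - lam) * y_tail
    return x, y
```

Because the labels are mixed with the same weight as the inputs, the result is trained against soft targets, which also gives some robustness to label noise.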