T-DEED: Temporal-Discriminability Enhancer Encoder-Decoder for Precise Event Spotting in Sports Videos (2404.05392v2)

Published 8 Apr 2024 in cs.CV

Abstract: In this paper, we introduce T-DEED, a Temporal-Discriminability Enhancer Encoder-Decoder for Precise Event Spotting in sports videos. T-DEED addresses multiple challenges in the task, including the need for discriminability among frame representations, high output temporal resolution to maintain prediction precision, and the necessity to capture information at different temporal scales to handle events with varying dynamics. It tackles these challenges through its specifically designed architecture, featuring an encoder-decoder for leveraging multiple temporal scales and achieving high output temporal resolution, along with temporal modules designed to increase token discriminability. Leveraging these characteristics, T-DEED achieves state-of-the-art (SOTA) performance on the FigureSkating and FineDiving datasets. Code is available at https://github.com/arturxe2/T-DEED.
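The abstract's central architectural idea, an encoder that pools frame features over multiple temporal scales and a decoder that restores the full frame rate for temporally precise per-frame predictions, can be sketched as follows. This is a hypothetical NumPy simplification for intuition only, not the authors' implementation (which adds learned temporal modules for token discriminability; see the linked repository):

```python
import numpy as np

def temporal_encoder_decoder(features, num_scales=2):
    """Illustrative sketch of an encoder-decoder over frame features.

    The encoder halves the temporal resolution at each scale (capturing
    coarser dynamics); the decoder upsamples back and fuses skip features,
    so the output keeps the original per-frame temporal resolution.

    features: (T, D) array of per-frame embeddings.
    Returns: (T, D) array at the input temporal resolution.
    """
    skips = []
    x = features
    # Encoder: mean-pool adjacent frame pairs at each scale.
    for _ in range(num_scales):
        skips.append(x)
        if x.shape[0] % 2:                 # pad to an even length
            x = np.vstack([x, x[-1:]])
        x = x.reshape(-1, 2, x.shape[1]).mean(axis=1)
    # Decoder: nearest-neighbour upsample and add the skip connection,
    # restoring high output temporal resolution scale by scale.
    for skip in reversed(skips):
        x = np.repeat(x, 2, axis=0)[: skip.shape[0]]
        x = x + skip
    return x

frames = np.random.default_rng(0).normal(size=(50, 16))
out = temporal_encoder_decoder(frames)
print(out.shape)  # (50, 16)
```

In the actual model, the pooling and upsampling steps would be learned layers rather than fixed averaging and repetition; the sketch only shows why the encoder-decoder shape lets the network mix temporal scales while still emitting one prediction per input frame.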
