
Sports-QA: A Large-Scale Video Question Answering Benchmark for Complex and Professional Sports (2401.01505v3)

Published 3 Jan 2024 in cs.CV

Abstract: Reasoning over sports videos for question answering is an important task with numerous applications, such as player training and information retrieval. However, this task remains unexplored, owing to the lack of relevant datasets and its challenging nature. Most datasets for video question answering (VideoQA) focus mainly on general, coarse-grained understanding of daily-life videos, which does not transfer to sports scenarios requiring professional action understanding and fine-grained motion analysis. In this paper, we introduce the first dataset, named Sports-QA, specifically designed for the sports VideoQA task. The Sports-QA dataset includes various types of questions, such as descriptions, chronologies, causalities, and counterfactual conditions, covering multiple sports. Furthermore, to address the characteristics of the sports VideoQA task, we propose a new Auto-Focus Transformer (AFT) capable of automatically focusing on particular scales of temporal information for question answering. We conduct extensive experiments on Sports-QA, including baseline studies and the evaluation of different methods. The results demonstrate that our AFT achieves state-of-the-art performance.
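The abstract only sketches the Auto-Focus Transformer at a high level: a module that lets the question decide which temporal scale of the video to emphasize. The snippet below is a minimal, hypothetical PyTorch sketch of that general idea, assuming average-pooled frame features at several temporal scales and a question-conditioned softmax gate over scales; the module name, scale choices, and gating scheme are illustrative assumptions, not the paper's actual AFT design.

```python
# Hypothetical sketch of question-conditioned multi-scale temporal attention.
# NOT the paper's Auto-Focus Transformer; it only illustrates the idea of
# weighting different temporal granularities of video features per question.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleFocus(nn.Module):
    def __init__(self, dim: int, scales=(1, 2, 4), num_heads: int = 4):
        super().__init__()
        self.scales = scales
        # One cross-attention block per temporal scale; coarser scales see
        # average-pooled frame features.
        self.attn = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True)
             for _ in scales]
        )
        # Question-conditioned gate that scores each scale.
        self.gate = nn.Linear(dim, len(scales))

    def forward(self, frames: torch.Tensor, question: torch.Tensor):
        # frames:   (B, T, D) per-frame video features
        # question: (B, D)    pooled question embedding
        outputs = []
        for scale, attn in zip(self.scales, self.attn):
            # Downsample along time to obtain a coarser temporal scale.
            x = frames.transpose(1, 2)                              # (B, D, T)
            x = F.avg_pool1d(x, kernel_size=scale, stride=scale,
                             ceil_mode=True)
            x = x.transpose(1, 2)                                   # (B, T', D)
            # The question attends to the video tokens at this scale.
            q = question.unsqueeze(1)                               # (B, 1, D)
            out, _ = attn(q, x, x)                                  # (B, 1, D)
            outputs.append(out.squeeze(1))                          # (B, D)
        stacked = torch.stack(outputs, dim=1)                       # (B, S, D)
        # Soft "auto-focus": weight each scale by a question-dependent score.
        weights = torch.softmax(self.gate(question), dim=-1)        # (B, S)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)         # (B, D)


if __name__ == "__main__":
    video = torch.randn(2, 32, 256)   # 2 clips, 32 frames, 256-d features
    quest = torch.randn(2, 256)       # pooled question embeddings
    fused = MultiScaleFocus(256)(video, quest)
    print(fused.shape)                # torch.Size([2, 256])
```

The fused vector could then feed an answer classifier; whether the paper uses soft gating, hard scale selection, or per-layer focus is not stated in this excerpt.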

