X-VARS: Introducing Explainability in Football Refereeing with Multi-Modal Large Language Model
Abstract: The rapid advancement of artificial intelligence has led to significant improvements in automated decision-making. However, the increased performance of models often comes at the cost of explainability and transparency of their decision-making processes. In this paper, we investigate the capabilities of LLMs to explain decisions, using football refereeing as a testing ground, given its decision complexity and subjectivity. We introduce the Explainable Video Assistant Referee System, X-VARS, a multi-modal LLM designed for understanding football videos from the point of view of a referee. X-VARS can perform a multitude of tasks, including video description, question answering, action recognition, and conducting meaningful conversations based on video content and in accordance with the Laws of the Game for football referees. We validate X-VARS on our novel dataset, SoccerNet-XFoul, which consists of more than 22k video-question-answer triplets annotated by over 70 experienced football referees. Our experiments and human study illustrate the impressive capabilities of X-VARS in interpreting complex football clips. Furthermore, we highlight the potential of X-VARS to reach human performance and support football referees in the future.