SportQA: A Benchmark for Sports Understanding in Large Language Models (2402.15862v2)
Abstract: A deep understanding of sports, a field rich in strategic and dynamic content, is crucial for advancing NLP. This holds particular significance in the context of evaluating and advancing LLMs, given the existing gap in specialized benchmarks. To bridge this gap, we introduce SportQA, a novel benchmark specifically designed for evaluating LLMs in the context of sports understanding. SportQA encompasses over 70,000 multiple-choice questions across three distinct difficulty levels, each targeting different aspects of sports knowledge from basic historical facts to intricate, scenario-based reasoning tasks. We conducted a thorough evaluation of prevalent LLMs, mainly utilizing few-shot learning paradigms supplemented by chain-of-thought (CoT) prompting. Our results reveal that while LLMs exhibit competent performance in basic sports knowledge, they struggle with more complex, scenario-based sports reasoning, lagging behind human expertise. The introduction of SportQA marks a significant step forward in NLP, offering a tool for assessing and enhancing sports understanding in LLMs.
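The evaluation paradigm described above, few-shot prompting supplemented by chain-of-thought (CoT), can be illustrated with a minimal sketch. The example question, options, and rationale below are invented for illustration and are not drawn from the SportQA benchmark itself; the prompt layout is one plausible format, not necessarily the authors' exact template.

```python
# Minimal sketch of a few-shot chain-of-thought (CoT) prompt for a
# SportQA-style multiple-choice question. The worked example is shown
# first so the model imitates its reasoning-then-answer pattern.

FEW_SHOT_EXAMPLE = (
    "Question: In basketball, how many points is a free throw worth?\n"
    "Options: (A) 1  (B) 2  (C) 3  (D) 4\n"
    "Reasoning: A free throw is an unopposed shot awarded after a foul; "
    "each successful free throw scores a single point.\n"
    "Answer: A\n"
)

def build_cot_prompt(question: str, options: list[str]) -> str:
    """Assemble a few-shot CoT prompt: worked example, then the target question."""
    # Label options (A), (B), ... in order.
    option_line = "  ".join(f"({chr(65 + i)}) {opt}" for i, opt in enumerate(options))
    return (
        FEW_SHOT_EXAMPLE
        + "\n"
        + f"Question: {question}\n"
        + f"Options: {option_line}\n"
        + "Reasoning:"  # the model continues with its own chain of thought
    )

prompt = build_cot_prompt(
    "How many players per side are on the court in indoor volleyball?",
    ["5", "6", "7", "11"],
)
print(prompt)
```

The resulting string would then be sent to the LLM under evaluation, and the letter following "Answer:" in its completion scored against the gold option.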