Can Large Language Models do Analytical Reasoning? (2403.04031v1)

Published 6 Mar 2024 in cs.CL and cs.AI

Abstract: This paper examines cutting-edge LLMs on analytical reasoning over sports data. The analytical reasoning task asks LLMs to count how many points each team scores in each quarter of NBA and NFL games. Our major findings are twofold. First, among all the models we evaluate, GPT-4 stands out in effectiveness, followed by Claude-2.1, with GPT-3.5, Gemini-Pro, and Llama-2-70b lagging behind. Comparing three different prompting techniques and a divide-and-conquer approach, we find the latter to be the most effective: it breaks the play-by-play data into smaller, more manageable segments, solves each segment individually, and then aggregates the results. Beyond divide-and-conquer, we also explore the Chain of Thought (CoT) strategy, which markedly improves outcomes for certain models, notably GPT-4 and Claude-2.1, whose accuracy rates increase significantly; however, CoT has negligible or even detrimental effects on other models such as GPT-3.5 and Gemini-Pro. Second, to our surprise, most models, including GPT-4, struggle to accurately count total scores for NBA quarters despite performing strongly on NFL quarter scores. This leads us to investigate, through extensive experiments, the factors that determine the complexity of analytical reasoning tasks; we conclude that task complexity depends on context length, information density, and the presence of related information. Our research provides valuable insights into the complexity of analytical reasoning tasks and potential directions for developing future LLMs.
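The divide-and-conquer procedure described in the abstract can be illustrated with a minimal sketch. The chunk size, prompt wording, and the `call_llm` callable below are illustrative assumptions, not the authors' actual prompts or evaluation code; in practice `call_llm` would wrap whichever chat model is under test (GPT-4, Claude-2.1, Gemini-Pro, etc.).

```python
# Minimal sketch of the divide-and-conquer idea: split the play-by-play log
# into segments, ask the model for each segment's points, then sum the parts.
# `call_llm` is a hypothetical stand-in for a chat-completion call.
from typing import Callable, List


def chunk_plays(plays: List[str], chunk_size: int = 20) -> List[List[str]]:
    """Split the play-by-play log into fixed-size segments (size is an assumption)."""
    return [plays[i:i + chunk_size] for i in range(0, len(plays), chunk_size)]


def score_segment(segment: List[str], team: str,
                  call_llm: Callable[[str], str]) -> int:
    """Ask the model to count one team's points within a single segment."""
    prompt = (
        "Here is part of a game's play-by-play log:\n"
        + "\n".join(segment)
        + f"\n\nHow many points did {team} score in these plays? "
          "Answer with a single integer."
    )
    return int(call_llm(prompt).strip())


def quarter_score(plays: List[str], team: str,
                  call_llm: Callable[[str], str]) -> int:
    """Aggregate the per-segment counts into the quarter total."""
    return sum(score_segment(seg, team, call_llm)
               for seg in chunk_plays(plays))
```

Because per-segment answers are simply summed, each model call only has to reason over a short, information-dense excerpt rather than the full quarter's log, which is the intuition behind why this decomposition outperforms single-pass prompting in the paper.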

Authors (7)
  1. Yebowen Hu (9 papers)
  2. Kaiqiang Song (32 papers)
  3. Sangwoo Cho (22 papers)
  4. Xiaoyang Wang (134 papers)
  5. Hassan Foroosh (48 papers)
  6. Dong Yu (328 papers)
  7. Fei Liu (232 papers)
Citations (2)
