
GPT-3.5, GPT-4, or BARD? Evaluating LLMs Reasoning Ability in Zero-Shot Setting and Performance Boosting Through Prompts (2305.12477v2)

Published 21 May 2023 in cs.CL and cs.AI

Abstract: LLMs have exhibited remarkable performance on various NLP tasks. However, there is a current hot debate regarding their reasoning capacity. In this paper, we examine the performance of GPT-3.5, GPT-4, and BARD models, by performing a thorough technical evaluation on different reasoning tasks across eleven distinct datasets. Our paper provides empirical evidence showcasing the superior performance of ChatGPT-4 in comparison to both ChatGPT-3.5 and BARD in zero-shot setting throughout almost all evaluated tasks. While the superiority of GPT-4 compared to GPT-3.5 might be explained by its larger size and NLP efficiency, this was not evident for BARD. We also demonstrate that the three models show limited proficiency in Inductive, Mathematical, and Multi-hop Reasoning Tasks. To bolster our findings, we present a detailed and comprehensive analysis of the results from these three models. Furthermore, we propose a set of engineered prompts that enhances the zero-shot setting performance of all three models.

Evaluation of Reasoning Abilities of LLMs in Zero-Shot Settings

The paper "GPT-3.5, GPT-4, or BARD? Evaluating LLMs Reasoning Ability in Zero-Shot Setting and Performance Boosting Through Prompts" addresses a pivotal concern in the current landscape of NLP research: the reasoning capabilities of LLMs. As LLMs like GPT-3.5, GPT-4, and Google's BARD continue to outperform in traditional NLP tasks, the ability of these models to perform reasoning tasks remains contentious. This paper rigorously evaluates the reasoning capabilities of these models in a zero-shot setting using a broad array of reasoning tasks spanning deductive, inductive, abductive, commonsense, causal, and multi-hop reasoning through evaluations across eleven distinct datasets.

Summary of Findings

  1. Evaluation Across Reasoning Tasks: The paper employs a comprehensive methodological framework to assess GPT-3.5, GPT-4, and BARD on a suite of eleven datasets designed to exercise different types of reasoning. The results show that GPT-4 consistently outperforms both GPT-3.5 and BARD in most reasoning categories. However, all three models share a common limitation: performance on Inductive, Mathematical, and Multi-hop Reasoning tasks remains constrained, improving only marginally at best.
  2. Prompt Engineering: The authors propose a set of engineered prompts tailored to boost the models' performance in the zero-shot setting. Empirical results indicate that these prompts significantly improve reasoning accuracy for all three models, suggesting that strategic prompting can unlock latent reasoning capabilities in LLMs (a minimal evaluation sketch follows this list).
  3. Reproducibility and Public Availability: Unlike prior studies, this research emphasizes transparency and reproducibility by making samples publicly available and ensuring that the test suite can be fully reproduced on all three evaluated models. This openness facilitates further exploration and model comparison within the research community.
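
The summary does not reproduce the paper's actual prompt templates, so the following is only an illustrative sketch of the kind of zero-shot evaluation loop described above: an engineered instruction is prepended to each question, and extracted answers are scored by exact match. The `query_model` callable, the prompt wording, and the toy dataset are hypothetical placeholders, not the authors' setup.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical engineered zero-shot instruction; the paper's actual templates may differ.
ENGINEERED_PREFIX = (
    "You are a careful reasoner. Read the question, reason step by step, "
    "and finish with a single line of the form 'Answer: <final answer>'.\n\n"
)

def build_prompt(question: str, engineered: bool = True) -> str:
    """Prepend the engineered instruction in the boosted setting; plain question otherwise."""
    return (ENGINEERED_PREFIX + question) if engineered else question

def extract_answer(completion: str) -> str:
    """Take the text after the last 'Answer:' marker, if present."""
    marker = "Answer:"
    return completion.rsplit(marker, 1)[-1].strip() if marker in completion else completion.strip()

def zero_shot_accuracy(
    query_model: Callable[[str], str],   # wrapper around an LLM API call
    dataset: Iterable[Tuple[str, str]],  # (question, gold answer) pairs
    engineered: bool = True,
) -> float:
    """Exact-match accuracy on a QA-style reasoning dataset in the zero-shot setting."""
    items = list(dataset)
    correct = sum(
        extract_answer(query_model(build_prompt(q, engineered))).lower() == gold.lower()
        for q, gold in items
    )
    return correct / max(len(items), 1)

if __name__ == "__main__":
    # Dummy stand-in for a model, just to show the harness runs end to end.
    toy_model = lambda prompt: "Some reasoning... Answer: 4"
    toy_data = [("What is 2 + 2?", "4"), ("What is 3 + 5?", "8")]
    print(zero_shot_accuracy(toy_model, toy_data))
```

Comparing `engineered=True` against `engineered=False` on the same dataset gives the kind of before/after measurement the paper uses to quantify the benefit of its prompts.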

Implications and Future Directions

  • Theoretical Implications: The findings show how reasoning ability is stratified across different LLMs and how model size and architecture correlate with performance. This nuanced understanding helps refine theories of model scaling, data-driven learning, and reasoning proficiency.
  • Practical Applications: Given the limitations exhibited in tasks requiring nuanced multi-step logic or abstract inference, future endeavors should focus on integrating reasoning-enhancing architectures or specialized training datasets aimed at addressing these deficits.
  • Speculative Future of AI: The paper points toward better reasoning through improved prompting techniques. Consequently, the research community might explore hybrid approaches that combine enhanced chain-of-thought (CoT) prompting, rationale engineering, and rationale-verification strategies to enable more coherent logical processing within models.
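
As one concrete illustration of that direction, the sketch below implements self-consistency in its simplest form: sample several chain-of-thought completions for the same prompt and take a majority vote over the extracted final answers. This is a generic technique from the CoT literature, not a method the paper itself implements; the `sample_completion` callable and the answer-parsing convention are assumed placeholders.

```python
from collections import Counter
from typing import Callable, List

def final_answer(completion: str) -> str:
    """Assumes the completion ends with 'Answer: <value>'; take the last such span."""
    return completion.rsplit("Answer:", 1)[-1].strip()

def self_consistent_answer(
    sample_completion: Callable[[str], str],  # stochastic LLM call (temperature > 0)
    prompt: str,
    n_samples: int = 5,
) -> str:
    """Majority vote over final answers from independently sampled CoT rationales."""
    answers: List[str] = [final_answer(sample_completion(prompt)) for _ in range(n_samples)]
    most_common, _count = Counter(answers).most_common(1)[0]
    return most_common

if __name__ == "__main__":
    import random
    # Dummy sampler that is occasionally wrong, to show the vote smoothing it out.
    dummy = lambda p: random.choice(["... Answer: 42"] * 4 + ["... Answer: 41"])
    print(self_consistent_answer(dummy, "What is 6 * 7? Let's think step by step."))
```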

The paper provides an empirical benchmark and an insightful exploration of the reasoning capabilities of LLMs. As AI continues its trajectory toward autonomous reasoning, the work underscores the importance of interdisciplinary research aimed at bridging the gap between symbolic and statistical reasoning paradigms in AI systems.
