Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey (2404.01869v2)

Published 2 Apr 2024 in cs.CL and cs.AI

Abstract: LLMs have recently shown impressive performance on tasks involving reasoning, leading to a lively debate on whether these models possess reasoning capabilities similar to humans. However, despite these successes, the depth of LLMs' reasoning abilities remains uncertain. This uncertainty partly stems from the predominant focus on task performance, measured through shallow accuracy metrics, rather than a thorough investigation of the models' reasoning behavior. This paper seeks to address this gap by providing a comprehensive review of studies that go beyond task accuracy, offering deeper insights into the models' reasoning processes. Furthermore, we survey prevalent methodologies to evaluate the reasoning behavior of LLMs, emphasizing current trends and efforts towards more nuanced reasoning analyses. Our review suggests that LLMs tend to rely on surface-level patterns and correlations in their training data, rather than on sophisticated reasoning abilities. Additionally, we identify the need for further research that delineates the key differences between human and LLM-based reasoning. Through this survey, we aim to shed light on the complex reasoning processes within LLMs.

Evaluating the Reasoning Behavior of LLMs: An In-Depth Analysis

The paper "Beyond Accuracy: Evaluating the Reasoning Behavior of LLMs - A Survey" presents a comprehensive review of the evaluation methodologies targeted towards understanding the reasoning capabilities of LLMs. As advances in these models continue to provoke debate regarding their reasoning prowess, this paper shifts the focus from conventional accuracy metrics to a deeper examination of reasoning processes intrinsic to LLMs.

Central Themes and Findings

The authors highlight the need to move beyond the superficial accuracy measures traditionally used to assess model performance on reasoning tasks. Instead, they call for nuanced evaluations that scrutinize the internal reasoning mechanisms employed by LLMs. The review is organized around two principal questions: how LLMs behave across diverse reasoning tasks, and which evaluation methods are most effective for understanding their reasoning behavior.

The investigation reveals that LLMs often exploit surface-level patterns and correlations in their training data rather than engaging in genuine reasoning. This observation echoes the "castle in the air" metaphor: observed performance may rest not on a solid foundation of reasoning capability but on extensive memorization of training data.

Evaluation Frameworks

To dissect these reasoning abilities, the authors propose a taxonomy of evaluation methodologies comprising four primary categories:

  1. Conclusion-Based Evaluation - Focuses on the conclusions produced by the models, analyzing output distributions and errors to infer reasoning behavior. Methods such as model confidence assessments gauge how well the probabilities assigned to conclusions track the model's actual reliability (a minimal calibration sketch follows this list).
  2. Rationale-Based Evaluation - Investigates the logical structures within the reasoning pathways of models. Techniques like first-order logic conversions and computation graph analyses are employed to dissect and evaluate reasoning traces.
  3. Interactive Evaluations - Involve engaging LLMs in dynamic scenarios to assess adaptivity and resilience. This includes methods that adaptively choose questions based on model responses and dialectic techniques where models defend their beliefs in a conversational setup.
  4. Mechanistic Evaluations - Delve into the internal processing mechanisms of LLMs, analyzing elements like attention patterns and neuron activations to uncover the underlying cognitive pathways in reasoning tasks.
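
To make the conclusion-based category concrete, the sketch below computes a simple expected calibration error over (confidence, correctness) pairs, the kind of signal a model-confidence assessment inspects. This is a minimal illustration rather than a method from the survey; the sample data and bin count are hypothetical placeholders.

```python
# Minimal sketch of a conclusion-based confidence check: does a model's stated
# confidence in its conclusions match how often those conclusions are correct?

def expected_calibration_error(predictions, num_bins=5):
    """predictions: list of (confidence in [0, 1], correct as bool) pairs."""
    bins = [[] for _ in range(num_bins)]
    for confidence, correct in predictions:
        index = min(int(confidence * num_bins), num_bins - 1)
        bins[index].append((confidence, correct))

    ece, total = 0.0, len(predictions)
    for bucket in bins:
        if not bucket:
            continue
        avg_confidence = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        # Weight each bin's confidence-accuracy gap by its share of samples.
        ece += (len(bucket) / total) * abs(avg_confidence - accuracy)
    return ece

# Hypothetical (confidence, correct) outcomes on ten reasoning questions.
results = [(0.95, True), (0.90, False), (0.80, True), (0.75, True), (0.70, False),
           (0.65, True), (0.60, False), (0.55, True), (0.50, False), (0.45, False)]
print(f"Expected calibration error: {expected_calibration_error(results):.3f}")
```

A well-calibrated reasoner would yield a value near zero, while systematic overconfidence inflates it.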

Implications and Future Directions

The paper identifies significant challenges in current LLM reasoning, particularly in out-of-distribution scenarios, where models exhibit notable conceptual errors. These failures suggest that the models' robust linguistic capabilities are not matched by deep understanding or reasoning. Mechanistic analyses further indicate that current LLMs, trained mainly on language pattern recognition, lack components that appear essential for human-like reasoning.
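
One way to probe the out-of-distribution brittleness described above is to re-instantiate the same reasoning template with fresh surface details and check whether accuracy holds. The sketch below is a hedged illustration in that spirit; `query_model` is a hypothetical callable standing in for an LLM API, and the word-problem template is invented for this example.

```python
import random
from typing import Callable

def make_variant(rng: random.Random) -> tuple[str, int]:
    """Instantiate the same two-step word problem with fresh numbers."""
    start, eaten, bought = rng.randint(5, 50), rng.randint(1, 4), rng.randint(1, 20)
    prompt = (f"Ana has {start} apples, eats {eaten}, then buys {bought} more. "
              "How many apples does she have now? Answer with a number only.")
    return prompt, start - eaten + bought

def robustness_probe(query_model: Callable[[str], int],
                     num_variants: int = 20, seed: int = 0) -> float:
    """Accuracy over re-instantiated variants of one problem template.

    A large drop relative to the canonical benchmark item would point to
    pattern matching rather than reasoning; `query_model` must be wired to a
    real model (e.g. an API client that returns a parsed integer) before use.
    """
    rng = random.Random(seed)
    correct = 0
    for _ in range(num_variants):
        prompt, expected = make_variant(rng)
        correct += int(query_model(prompt) == expected)
    return correct / num_variants
```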

These findings carry implications for both the practical use of LLMs and the theoretical understanding of AI capabilities. Practically, they underscore the need for refined evaluation metrics and methodologies that extend beyond static accuracy measures; theoretically, they motivate hybrid models that incorporate structured reasoning frameworks.
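
As one example of an evaluation that goes beyond a single static answer, the dialectic setups surveyed earlier can be approximated by challenging a model's answer and recording whether it abandons a correct conclusion. The sketch below assumes a hypothetical `chat` callable that maps a list of role/content messages to a reply; the challenge wording is illustrative, not taken from the survey.

```python
from typing import Callable

def belief_stability(chat: Callable[[list[dict]], str],
                     question: str, reference_answer: str) -> dict:
    """Ask, push back, re-ask: does the model hold a correct answer under challenge?

    `chat` is a hypothetical stand-in for a chat-completion call; it receives
    the running message history and returns the assistant's reply text.
    """
    history = [{"role": "user", "content": question}]
    first = chat(history)
    history += [
        {"role": "assistant", "content": first},
        {"role": "user", "content": "I don't think that's right. Are you sure? "
                                     "Please reconsider and state your final answer."},
    ]
    second = chat(history)
    return {
        "initially_correct": reference_answer in first,
        "held_after_challenge": reference_answer in second,
    }
```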

Conclusion

The paper "Beyond Accuracy: Evaluating the Reasoning Behavior of LLMs - A Survey" serves as a vital resource for AI researchers aiming to deepen their understanding of LLM reasoning abilities. By advocating for a shift toward more sophisticated evaluation frameworks, it sets the stage for developing LLMs that can emulate higher levels of cognitive processing, pivotal for achieving true artificial general intelligence. Future work should concentrate on the integration of various evaluation approaches to create more comprehensive tools for analysis, thus enhancing our understanding of LLM reasoning capabilities.

Authors (2)
  1. Philipp Mondorf (9 papers)
  2. Barbara Plank (130 papers)