LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models (2404.15522v2)

Published 23 Apr 2024 in cs.CL and cs.AI

Abstract: Recently developed LLMs have been shown to perform remarkably well on a wide range of language understanding tasks. But, can they really "reason" over the natural language? This question has been receiving significant research attention and many reasoning skills such as commonsense, numerical, and qualitative have been studied. However, the crucial skill pertaining to 'logical reasoning' has remained underexplored. Existing work investigating this reasoning ability of LLMs has focused only on a couple of inference rules (such as modus ponens and modus tollens) of propositional and first-order logic. Addressing the above limitation, we comprehensively evaluate the logical reasoning ability of LLMs on 25 different reasoning patterns spanning over propositional, first-order, and non-monotonic logics. To enable systematic evaluation, we introduce LogicBench, a natural language question-answering dataset focusing on the use of a single inference rule. We conduct detailed analysis with a range of LLMs such as GPT-4, ChatGPT, Gemini, Llama-2, and Mistral using chain-of-thought prompting. Experimental results show that existing LLMs do not fare well on LogicBench; especially, they struggle with instances involving complex reasoning and negations. Furthermore, they sometimes overlook contextual information necessary for reasoning to arrive at the correct conclusion. We believe that our work and findings facilitate future research for evaluating and enhancing the logical reasoning ability of LLMs. Data and code are available at https://github.com/Mihir3009/LogicBench.

Logical Reasoning Evaluation of LLMs through LogicBench

Logical reasoning has been a focal area in the development and assessment of artificial intelligence, particularly for LLMs such as GPT-4, ChatGPT, and Google Gemini. The paper presents “LogicBench,” a dataset designed to rigorously assess the logical reasoning capabilities of these models. The evaluation covers 25 distinct inference rules spanning propositional, first-order, and non-monotonic logics, addressing the narrow rule coverage of earlier evaluations of logical reasoning in LLMs.
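
For concreteness, the schemas below illustrate two rule families of the kind LogicBench draws on: modus ponens from propositional logic and a universally quantified analogue from first-order logic. These are illustrative schemas, not the dataset's exact rule list, and the natural-language reading is a paraphrase rather than a dataset instance.

```latex
% Modus ponens (propositional) and a universally quantified analogue
% (first-order); illustrative schemas, not LogicBench's exact rule list.
\[
\frac{p \rightarrow q, \quad p}{q}
\qquad\qquad
\frac{\forall x\,\big(P(x) \rightarrow Q(x)\big), \quad P(a)}{Q(a)}
\]
% Reading: "If it rains, the street gets wet. It rains. So the street gets wet."
```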

Dataset and Methodology

LogicBench was created to fill the need for a systematic and diverse evaluation dataset dedicated to logical reasoning. The paper gives a detailed description of a three-stage data generation process: sentence generation, natural language conversion, and task instance formulation. LogicBench offers two task formats, Binary Question-Answering (BQA) and Multiple-Choice Question-Answering (MCQA), allowing for a nuanced analysis of logical reasoning across varying contexts and logical complexities.
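
As a rough illustration of the two formats, the sketch below shows what a BQA and an MCQA instance might look like. The field names and example text are assumptions made for this sketch and do not reproduce the released schema; the actual format is documented in the GitHub repository linked in the abstract.

```python
# Hypothetical illustration of LogicBench-style task instances.
# Field names and text are assumed for this sketch, not the released schema.

bqa_instance = {
    "logic_type": "propositional",
    "rule": "modus_ponens",
    "context": (
        "If Ava finishes her report, she will email her manager. "
        "Ava finished her report."
    ),
    "question": "Does this imply that Ava emailed her manager?",
    "answer": "yes",  # binary yes/no label
}

mcqa_instance = {
    "logic_type": "propositional",
    "rule": "modus_ponens",
    "context": bqa_instance["context"],
    "question": "Which conclusion follows from the context?",
    "choices": [
        "Ava emailed her manager.",
        "Ava did not email her manager.",
        "Ava's manager finished the report.",
        "No conclusion can be drawn.",
    ],
    "answer_index": 0,  # index of the logically entailed choice
}
```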

The authors evaluated several prominent LLMs using LogicBench, employing chain-of-thought prompting to measure the accuracy of model predictions. This approach provided a detailed view of each model's strengths and limitations in handling logical reasoning tasks.
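
A minimal sketch of such an evaluation loop is given below, assuming a generic `query_model` placeholder in place of a specific LLM API; the prompt template and the yes/no parsing are simplifications for illustration, not the paper's exact chain-of-thought prompts.

```python
import re


def query_model(prompt: str) -> str:
    """Placeholder for an LLM API call (e.g., an OpenAI or Gemini client)."""
    raise NotImplementedError


COT_TEMPLATE = (
    "Context: {context}\n"
    "Question: {question}\n"
    "Let's think step by step, then answer with 'yes' or 'no'."
)


def evaluate_bqa(instances):
    """Accuracy of chain-of-thought answers on BQA-style instances."""
    correct = 0
    for ex in instances:
        reply = query_model(COT_TEMPLATE.format(**ex))
        # Treat the last yes/no mentioned in the reasoning as the final answer.
        matches = re.findall(r"\b(yes|no)\b", reply.lower())
        prediction = matches[-1] if matches else None
        correct += int(prediction == ex["answer"])
    return correct / len(instances)
```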

Main Findings and Performance Analysis

The experimental results reveal that current LLMs show significant room for improvement in logical reasoning tasks, particularly when handling complex reasoning and negations. For example, the models grapple with inference rules containing negations, such as Modus Tollens, indicating a need for enhanced understanding of logical constructions involving negative premises.
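
Modus tollens is the canonical negation-bearing rule; the schema and the natural-language reading below are illustrative rather than drawn from the dataset.

```latex
% Modus tollens: from an implication and the negation of its consequent,
% infer the negation of its antecedent.
\[
\frac{p \rightarrow q, \qquad \neg q}{\neg p}
\]
% Illustrative reading: "If the alarm was set, it rang. It did not ring.
% Therefore, the alarm was not set." A model must track the negation in
% both the premise and the conclusion to answer correctly.
```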

Notably, the paper found disparities in performance across logic types, with LLMs generally performing better on non-monotonic logic tasks than on propositional and first-order ones. This is attributed, in part, to non-monotonic reasoning patterns reading more like everyday natural language and therefore being more prevalent in LLM pre-training data, which makes them less challenging for the models.
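
Default reasoning, the classic non-monotonic pattern, shows why such instances read like everyday language: a conclusion is drawn by default and withdrawn when an exception is learned. The schema below uses Reiter-style default notation with an illustrative reading; it is not quoted from the dataset.

```latex
% Default reasoning (non-monotonic): conclude flies(x) from bird(x) by default,
% unless information blocking the default is present.
\[
\frac{\mathrm{bird}(x) \; : \; \mathrm{flies}(x)}{\mathrm{flies}(x)}
\]
% Illustrative reading: "Birds typically fly. Tweety is a bird." By default,
% conclude that Tweety flies; learning "Tweety is a penguin" withdraws the
% conclusion, something no monotonic (propositional or first-order) rule allows.
```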

Implications and Future Directions

LogicBench sets a new standard for systematically evaluating logical reasoning in LLMs. Its diverse inference rules and reasoning patterns support a more thorough understanding of model capabilities and limitations. The insights from this research point to future work on closing these gaps, especially on improving LLMs' comprehension of complex logical constructs and operations.

The paper also highlights the potential benefits of fine-tuning LLMs on LogicBench: fine-tuned models showed improved performance on other logical reasoning datasets such as LogiQA and LogicNLI. Further research could extend the evaluation to richer combinations of inference rules and multi-step reasoning tasks, deepening the assessment of logical reasoning capabilities.
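
A minimal sketch of such fine-tuning is shown below, assuming a HuggingFace causal LM and LogicBench-style question-answer pairs serialized into plain text; the model name, data, and hyperparameters are placeholders rather than the paper's actual training setup.

```python
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments)

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM works

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers define no pad token
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# LogicBench-style BQA pairs serialized as prompt + gold answer (placeholder data).
examples = [
    {"text": "Context: If Ava finishes her report, she will email her manager. "
             "Ava finished her report.\nQuestion: Did Ava email her manager?\n"
             "Answer: yes"},
]

def tokenize(batch):
    enc = tokenizer(batch["text"], truncation=True, padding="max_length",
                    max_length=256)
    enc["labels"] = [ids.copy() for ids in enc["input_ids"]]  # causal-LM objective
    return enc

train_data = Dataset.from_list(examples).map(
    tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="logicbench-ft", num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=train_data,
)
trainer.train()
```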

Conclusion

This paper contributes significantly to the ongoing exploration of LLM capabilities by focusing on the nuanced field of logical reasoning. Through LogicBench, the authors provide a critical tool for benchmarking and advancing LLMs' logical reasoning abilities. Their work highlights crucial areas for development and enrichment, paving the way for next-generation AI systems with more robust and reliable logical reasoning skills.

Authors (8)
  1. Mihir Parmar
  2. Nisarg Patel
  3. Neeraj Varshney
  4. Mutsumi Nakamura
  5. Man Luo
  6. Santosh Mashetty
  7. Arindam Mitra
  8. Chitta Baral