LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models (2404.15522v2)

Published 23 Apr 2024 in cs.CL and cs.AI

Abstract: Recently developed LLMs have been shown to perform remarkably well on a wide range of language understanding tasks. But, can they really "reason" over the natural language? This question has been receiving significant research attention and many reasoning skills such as commonsense, numerical, and qualitative have been studied. However, the crucial skill pertaining to 'logical reasoning' has remained underexplored. Existing work investigating this reasoning ability of LLMs has focused only on a couple of inference rules (such as modus ponens and modus tollens) of propositional and first-order logic. Addressing the above limitation, we comprehensively evaluate the logical reasoning ability of LLMs on 25 different reasoning patterns spanning over propositional, first-order, and non-monotonic logics. To enable systematic evaluation, we introduce LogicBench, a natural language question-answering dataset focusing on the use of a single inference rule. We conduct detailed analysis with a range of LLMs such as GPT-4, ChatGPT, Gemini, Llama-2, and Mistral using chain-of-thought prompting. Experimental results show that existing LLMs do not fare well on LogicBench; especially, they struggle with instances involving complex reasoning and negations. Furthermore, they sometimes overlook contextual information necessary for reasoning to arrive at the correct conclusion. We believe that our work and findings facilitate future research for evaluating and enhancing the logical reasoning ability of LLMs. Data and code are available at https://github.com/Mihir3009/LogicBench.

Logical Reasoning Evaluation of LLMs through LogicBench

Logical reasoning has been a focal area in the development and assessment of artificial intelligence, particularly for LLMs such as GPT-4, ChatGPT, and Google Gemini. The paper presents “LogicBench,” a dataset designed to rigorously assess the logical reasoning capabilities of these models. The evaluation covers 25 distinct inference rules spanning propositional, first-order, and non-monotonic logics, addressing the narrow rule coverage of earlier evaluations of logical reasoning in LLMs.
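
For concreteness, the schemas below illustrate two rule families of the kind LogicBench draws on: modus ponens from propositional logic and a universally quantified analogue from first-order logic. These are illustrative schemas, not the dataset's exact rule list, and the natural-language reading is a paraphrase rather than a dataset instance.

```latex
% Modus ponens (propositional) and a universally quantified analogue
% (first-order); illustrative schemas, not LogicBench's exact rule list.
\[
\frac{p \rightarrow q, \quad p}{q}
\qquad\qquad
\frac{\forall x\,\big(P(x) \rightarrow Q(x)\big), \quad P(a)}{Q(a)}
\]
% Reading: "If it rains, the street gets wet. It rains. So the street gets wet."
```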

Dataset and Methodology

LogicBench was created to fill the need for a systematic and diverse evaluation dataset dedicated to logical reasoning. The paper gives a detailed description of a three-stage data generation process: sentence generation, natural language conversion, and task instance formulation. LogicBench offers two task formats, Binary Question-Answering (BQA) and Multiple-Choice Question-Answering (MCQA), allowing for a nuanced analysis of logical reasoning across varying contexts and logical complexities.
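
As a rough illustration of the two formats, the sketch below shows what a BQA and an MCQA instance might look like. The field names and example text are assumptions made for this sketch and do not reproduce the released schema; the actual format is documented in the GitHub repository linked in the abstract.

```python
# Hypothetical illustration of LogicBench-style task instances.
# Field names and text are assumed for this sketch, not the released schema.

bqa_instance = {
    "logic_type": "propositional",
    "rule": "modus_ponens",
    "context": (
        "If Ava finishes her report, she will email her manager. "
        "Ava finished her report."
    ),
    "question": "Does this imply that Ava emailed her manager?",
    "answer": "yes",  # binary yes/no label
}

mcqa_instance = {
    "logic_type": "propositional",
    "rule": "modus_ponens",
    "context": bqa_instance["context"],
    "question": "Which conclusion follows from the context?",
    "choices": [
        "Ava emailed her manager.",
        "Ava did not email her manager.",
        "Ava's manager finished the report.",
        "No conclusion can be drawn.",
    ],
    "answer_index": 0,  # index of the logically entailed choice
}
```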

The authors evaluated several prominent LLMs using LogicBench, employing chain-of-thought prompting to measure the accuracy of model predictions. This approach provided a detailed view of each model's strengths and limitations in handling logical reasoning tasks.
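
A minimal sketch of such an evaluation loop is given below, assuming a generic `query_model` placeholder in place of a specific LLM API; the prompt template and the yes/no parsing are simplifications for illustration, not the paper's exact chain-of-thought prompts.

```python
import re


def query_model(prompt: str) -> str:
    """Placeholder for an LLM API call (e.g., an OpenAI or Gemini client)."""
    raise NotImplementedError


COT_TEMPLATE = (
    "Context: {context}\n"
    "Question: {question}\n"
    "Let's think step by step, then answer with 'yes' or 'no'."
)


def evaluate_bqa(instances):
    """Accuracy of chain-of-thought answers on BQA-style instances."""
    correct = 0
    for ex in instances:
        reply = query_model(COT_TEMPLATE.format(**ex))
        # Treat the last yes/no mentioned in the reasoning as the final answer.
        matches = re.findall(r"\b(yes|no)\b", reply.lower())
        prediction = matches[-1] if matches else None
        correct += int(prediction == ex["answer"])
    return correct / len(instances)
```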

Main Findings and Performance Analysis

The experimental results reveal that current LLMs show significant room for improvement in logical reasoning tasks, particularly when handling complex reasoning and negations. For example, the models grapple with inference rules containing negations, such as Modus Tollens, indicating a need for enhanced understanding of logical constructions involving negative premises.
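
Modus tollens is the canonical negation-bearing rule; the schema and the natural-language reading below are illustrative rather than drawn from the dataset.

```latex
% Modus tollens: from an implication and the negation of its consequent,
% infer the negation of its antecedent.
\[
\frac{p \rightarrow q, \qquad \neg q}{\neg p}
\]
% Illustrative reading: "If the alarm was set, it rang. It did not ring.
% Therefore, the alarm was not set." A model must track the negation in
% both the premise and the conclusion to answer correctly.
```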

Notably, the paper found disparities in performance across logic types, with LLMs generally performing better on non-monotonic logic tasks than on propositional and first-order ones. This is attributed, in part, to non-monotonic reasoning patterns reading more like everyday natural language and therefore being more prevalent in LLM pre-training data, which makes them less challenging for the models.
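
Default reasoning, the classic non-monotonic pattern, shows why such instances read like everyday language: a conclusion is drawn by default and withdrawn when an exception is learned. The schema below uses Reiter-style default notation with an illustrative reading; it is not quoted from the dataset.

```latex
% Default reasoning (non-monotonic): conclude flies(x) from bird(x) by default,
% unless information blocking the default is present.
\[
\frac{\mathrm{bird}(x) \; : \; \mathrm{flies}(x)}{\mathrm{flies}(x)}
\]
% Illustrative reading: "Birds typically fly. Tweety is a bird." By default,
% conclude that Tweety flies; learning "Tweety is a penguin" withdraws the
% conclusion, something no monotonic (propositional or first-order) rule allows.
```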

Implications and Future Directions

LogicBench sets a new standard for systematically evaluating logical reasoning in LLMs. Its diverse inference rules and reasoning patterns support a more thorough understanding of model capabilities and limitations. The insights from this research point to future work on closing these gaps, especially on improving LLMs' comprehension of complex logical constructs and operations.

The paper also highlights the potential benefits of fine-tuning LLMs on LogicBench: fine-tuned models showed improved performance on other logical reasoning datasets such as LogiQA and LogicNLI. Further research could extend the evaluation to richer combinations of inference rules and multi-step reasoning tasks, deepening the assessment of logical reasoning capabilities.
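
A minimal sketch of such fine-tuning is shown below, assuming a HuggingFace causal LM and LogicBench-style question-answer pairs serialized into plain text; the model name, data, and hyperparameters are placeholders rather than the paper's actual training setup.

```python
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments)

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM works

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers define no pad token
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# LogicBench-style BQA pairs serialized as prompt + gold answer (placeholder data).
examples = [
    {"text": "Context: If Ava finishes her report, she will email her manager. "
             "Ava finished her report.\nQuestion: Did Ava email her manager?\n"
             "Answer: yes"},
]

def tokenize(batch):
    enc = tokenizer(batch["text"], truncation=True, padding="max_length",
                    max_length=256)
    enc["labels"] = [ids.copy() for ids in enc["input_ids"]]  # causal-LM objective
    return enc

train_data = Dataset.from_list(examples).map(
    tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="logicbench-ft", num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=train_data,
)
trainer.train()
```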

Conclusion

This paper contributes significantly to the ongoing exploration of LLM capabilities by focusing on the nuanced field of logical reasoning. Through LogicBench, the authors provide a critical tool for benchmarking and advancing LLMs' logical reasoning abilities. Their work highlights crucial areas for development and enrichment, paving the way for next-generation AI systems with more robust and reliable logical reasoning skills.

Authors (8)
  1. Mihir Parmar
  2. Nisarg Patel
  3. Neeraj Varshney
  4. Mutsumi Nakamura
  5. Man Luo
  6. Santosh Mashetty
  7. Arindam Mitra
  8. Chitta Baral