
Do Large Language Models Excel in Complex Logical Reasoning with Formal Language? (2505.16998v1)

Published 22 May 2025 in cs.CL and cs.AI

Abstract: LLMs have been shown to achieve breakthrough performance on complex logical reasoning tasks. Nevertheless, most existing research focuses on employing formal language to guide LLMs to derive reliable reasoning paths, while systematic evaluations of these capabilities are still limited. In this paper, we aim to conduct a comprehensive evaluation of LLMs across various logical reasoning problems utilizing formal languages. From the perspective of three dimensions, i.e., spectrum of LLMs, taxonomy of tasks, and format of trajectories, our key findings are: 1) Thinking models significantly outperform Instruct models, especially when formal language is employed; 2) All LLMs exhibit limitations in inductive reasoning capability, irrespective of whether they use a formal language; 3) Data with PoT format achieves the best generalization performance across other languages. Additionally, we also curate the formal-relative training data to further enhance the small LLMs, and the experimental results indicate that a simple rejected fine-tuning method can better enable LLMs to generalize across formal languages and achieve the best overall performance. Our codes and reports are available at https://github.com/jiangjin1999/FormalEval.

Evaluating Logical Reasoning in LLMs

The research paper "Do Large Language Models Excel in Complex Logical Reasoning with Formal Language?" presents a comprehensive framework for evaluating the effectiveness of LLMs on logical reasoning tasks. The evaluation spans three dimensions: the spectrum of LLMs (Instruct versus Thinking variants), a taxonomy of logical reasoning tasks, and the format of reasoning trajectories. It aims to offer a systematic examination of LLM performance, particularly when reasoning is expressed in formal languages.

Logical reasoning remains a vital area in AI, essential for achieving human-like decision-making and problem-solving abilities. Unlike conventional NLP tasks, logical reasoning requires well-defined reasoning paths and derivation chains. The investigation asks whether LLMs, renowned for their natural-language prowess, retain comparable capabilities when their reasoning is expressed in formal languages.

The paper identifies a clear performance distinction between "Thinking" and "Instruct" model variants, providing empirical evidence that the former significantly outperform the latter, especially when formal languages are employed. Separately, all evaluated models exhibit limited inductive reasoning ability, regardless of the trajectory format used. The paper also finds that different reasoning tasks favor specific trajectory formats: complex numerical and symbolic tasks benefit from Program-of-Thought (PoT) trajectories, i.e., executable Python programs, due to their structured nature, whereas first-order logic tasks align more closely with trajectories written for the Z3 SMT solver. Both formats are illustrated in the sketches below.
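To make the two trajectory formats concrete, here is a hedged illustration of each on toy problems; the problems and code are hypothetical examples, not taken from the paper's benchmarks. A PoT trajectory is an executable Python program whose printed output is the answer:

```python
# Illustrative PoT trajectory: the model writes a Python program whose
# execution yields the answer to a (hypothetical) numerical word problem:
# "Alice has 3 boxes of 12 apples and gives 7 apples away. How many remain?"
boxes, apples_per_box, given_away = 3, 12, 7
remaining = boxes * apples_per_box - given_away
print(remaining)  # 29
```

A Z3 trajectory instead encodes the premises as first-order formulas and checks entailment with the SMT solver (using the z3-solver Python package):

```python
# Illustrative Z3 trajectory for a first-order logic problem:
# premises "all humans are mortal" and "Socrates is a human";
# query "Socrates is mortal", checked by refuting its negation.
from z3 import (BoolSort, Const, DeclareSort, ForAll, Function,
                Implies, Not, Solver, unsat)

Entity = DeclareSort('Entity')
Human = Function('Human', Entity, BoolSort())
Mortal = Function('Mortal', Entity, BoolSort())
socrates = Const('socrates', Entity)
x = Const('x', Entity)

solver = Solver()
solver.add(ForAll([x], Implies(Human(x), Mortal(x))))  # all humans are mortal
solver.add(Human(socrates))                            # Socrates is a human
solver.add(Not(Mortal(socrates)))                      # negate the query
# If the negated query is unsatisfiable, the premises entail the query.
print("entailed" if solver.check() == unsat else "not entailed")  # entailed
```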

To enhance smaller LLMs, the authors curate formal-language-related training data and apply a rejected fine-tuning methodology, sketched below. This approach noticeably improves generalization, enabling models to transfer effectively across formal-language frameworks. The experimental results show significant gains in generalization performance, particularly when the training data aligns with task-specific trajectory requirements.
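As a rough sketch of how rejected fine-tuning could proceed, the following Python outlines the usual rejection-sampling recipe: sample several trajectories per training problem, keep only those whose final answer matches the gold label, and fine-tune on the survivors. The helpers sample_trajectories, extract_answer, and fine_tune are hypothetical placeholders, not the authors' API.

```python
# Hypothetical sketch of rejected fine-tuning (rejection sampling + SFT).
# sample_trajectories, extract_answer, and fine_tune are assumed helpers,
# standing in for whatever generation and training stack is actually used.

def build_rft_dataset(model, problems, k=8):
    """Keep only trajectories whose final answer matches the gold label."""
    kept = []
    for problem in problems:
        candidates = sample_trajectories(model, problem.prompt, n=k, temperature=0.8)
        for trajectory in candidates:
            if extract_answer(trajectory) == problem.gold_answer:  # rejection step
                kept.append((problem.prompt, trajectory))
    return kept

rft_data = build_rft_dataset(small_model, train_problems)
fine_tune(small_model, rft_data)  # plain supervised fine-tuning on accepted pairs
```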

The distinctiveness of this paper lies in its broad evaluation framework, extensive dataset collection, and the introduction of robust evaluation metrics for both formal and informal languages (one plausible scoring scheme is sketched below). This structured evaluation addresses a crucial gap in understanding logical reasoning capabilities across different LLM architectures and reasoning tasks.
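The paper summary does not spell out the scorers, but a plausible pattern for grading both formats is execution-based checking for formal trajectories and answer extraction for natural-language ones. The sketch below assumes numeric answers and is illustrative only, not the paper's exact metric.

```python
# Hedged sketch of dual-format answer scoring (an assumption, not the
# paper's exact metric): execute PoT programs and compare stdout; for
# natural-language responses, take the last numeric token as the answer.
import contextlib
import io
import re

def score_pot(program: str, gold: str) -> bool:
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(program, {})  # run the generated program in a fresh namespace
    except Exception:
        return False  # non-executable programs are scored as wrong
    return buf.getvalue().strip() == gold.strip()

def score_nl(response: str, gold: str) -> bool:
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    return bool(numbers) and numbers[-1] == gold.strip()
```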

The implications of this research extend to future model training strategies, emphasizing the need for comprehensive datasets that incorporate diverse logical structures. Additionally, the observed preferences for certain trajectory formats could guide future trajectory-aware architecture designs, optimizing LLM performance for specific types of reasoning tasks.

In conclusion, this paper contributes to a deeper understanding of LLM capabilities in logical reasoning, providing a nuanced view of their strengths and limitations when employing formal languages. The findings underscore the potential for improvement through targeted data curation and suggest avenues for advancing LLM architectures to handle a wider array of logical reasoning tasks. Future research can investigate these architectural adjustments further and expand dataset diversity to encompass emerging symbolic and logic-based challenges in AI reasoning.

Authors (8)
  1. Jin Jiang (17 papers)
  2. Jianing Wang (50 papers)
  3. Yuchen Yan (44 papers)
  4. Yang Liu (2253 papers)
  5. Jianhua Zhu (12 papers)
  6. Mengdi Zhang (37 papers)
  7. Xunliang Cai (63 papers)
  8. Liangcai Gao (34 papers)