A Systematic Analysis of Large Language Models as Soft Reasoners: The Case of Syllogistic Inferences

Published 17 Jun 2024 in cs.CL | (2406.11341v3)

Abstract: The reasoning abilities of LLMs are becoming a central focus of study in NLP. In this paper, we consider the case of syllogistic reasoning, an area of deductive reasoning studied extensively in logic and cognitive psychology. Previous research has shown that pre-trained LLMs exhibit reasoning biases, such as $\textit{content effects}$, avoid answering that $\textit{no conclusion follows}$, display human-like difficulties, and struggle with multi-step reasoning. We contribute to this research line by systematically investigating the effects of chain-of-thought reasoning, in-context learning (ICL), and supervised fine-tuning (SFT) on syllogistic reasoning, considering syllogisms with conclusions that support or violate world knowledge, as well as ones with multiple premises. Crucially, we go beyond the standard focus on accuracy, with an in-depth analysis of the conclusions generated by the models. Our results suggest that the behavior of pre-trained LLMs can be explained by heuristics studied in cognitive science and that both ICL and SFT improve model performance on valid inferences, although only the latter mitigates most reasoning biases without harming model consistency.

Summary

The paper finds that LLMs, when using zero-shot chain-of-thought, mimic human biases and struggle with invalid syllogisms.
The study shows that in-context learning enhances performance via schema diversity, yet fails to fully overcome content bias.
Supervised fine-tuning significantly improves reasoning accuracy by emphasizing deductive logic over lexical content.

A Systematic Analysis of LLMs as Soft Reasoners in Syllogistic Inferences

Introduction

The paper "A Systematic Analysis of LLMs as Soft Reasoners: The Case of Syllogistic Inferences" (2406.11341) examines the reasoning capabilities of LLMs on syllogistic reasoning, a classical domain of deductive reasoning in logic and psychology. Despite rapid advances in natural language processing, LLMs still show reasoning biases, notably content effects and difficulty with multi-step reasoning, that undermine their reliability as reasoning agents. The paper studies how chain-of-thought reasoning, in-context learning (ICL), and supervised fine-tuning (SFT) affect these biases and whether they can reduce them, with the ultimate aim of determining whether LLMs possess latent reasoning abilities.

Figure 1: LLMs have difficulty with invalid inferences (Top); suffer from content effects (Middle); and struggle with longer chains of premises (Bottom).

Methodological Framework

The study uses a systematic experimental design with LLMs such as the Pythia and LLaMA models to evaluate syllogistic reasoning. The methodology is structured around three core settings:

Zero-Shot Chain-of-Thought (ZS-CoT) Prompting: Examines models' performance without prior task examples, leveraging prompts to encourage deductive reasoning processes.
In-Context Learning (ICL): Analyzes the impact of in-context demonstrations that are either schema-aligned (ICL$_{in}$) or schema-diverse (ICL$_{out}$).
Supervised Fine-Tuning (SFT): Evaluates models fine-tuned on syllogistic tasks with pseudo-word data, focusing on consistency and transferability of learned reasoning skills.

Figure 2: Multiple-choice task. The model is given the premises and nine possible conclusions, and has to generate the correct one(s). $\textrm{ICL}_{in}$ and $\textrm{ICL}_{out}$ show schema-based examples and diverse examples, respectively.
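
To make the multiple-choice task concrete, the following is a minimal sketch of how such a prompt could be assembled. It assumes that the nine options are the four classical moods (A, I, E, O) in both term orders plus "Nothing follows". The template wording, the pseudo-words, and the helper names `candidate_conclusions` and `build_prompt` are hypothetical and are not taken from the paper.

```python
from itertools import product

# The four classical moods. The sentence templates are illustrative wording only.
MOODS = {
    "A": "All {x} are {y}",
    "I": "Some {x} are {y}",
    "E": "No {x} are {y}",
    "O": "Some {x} are not {y}",
}
NOTHING_FOLLOWS = "Nothing follows"


def candidate_conclusions(a, c):
    """Return nine options: 4 moods x 2 term orders, plus 'Nothing follows'."""
    options = [
        template.format(x=x, y=y)
        for template, (x, y) in product(MOODS.values(), [(a, c), (c, a)])
    ]
    return options + [NOTHING_FOLLOWS]


def build_prompt(premises, a, c):
    """Assemble a plain-text multiple-choice prompt (hypothetical layout)."""
    lines = ["Premises:"]
    lines += [f"- {p}" for p in premises]
    lines.append("Options:")
    lines += [f"{i}. {opt}" for i, opt in enumerate(candidate_conclusions(a, c), 1)]
    lines.append("Which conclusion(s) follow?")
    return "\n".join(lines)


if __name__ == "__main__":
    # Pseudo-words stand in for real terms, as in the SFT data described above.
    print(build_prompt(["All wugs are blickets", "All blickets are daxes"], "wugs", "daxes"))
```

Using pseudo-words as terms keeps world knowledge out of the prompt, so any preference among the options cannot come from the believability of the conclusion.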

Results and Analysis

The paper reveals significant disparities in reasoning performance among different training strategies. Key findings include:

Zero-Shot CoT Setting: Pre-trained models demonstrate reasoning patterns analogous to human biases but struggle notably with invalid syllogisms and content effects, with a marked reluctance to conclude "nothing follows" even when appropriate.
ICL Performance: In the ICL$_{out}$ setting, schema-diverse demonstrations yield some performance gains, yet they do not decisively mitigate content effects, and what mitigation they achieve is accompanied by more inconsistent conclusions.
Supervised Fine-Tuning: Fine-tuned models achieve near-ceiling performance, showcasing robustness in handling both valid inferences and invalid syllogisms consistently. SFT emerges as notably effective in eliminating reliance on lexical content, focusing instead on deductive reasoning forms.

Figure 3: Results of the $\chi^2$ test on content effect. The test results indicate that SFT mitigates content bias.
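
As an illustration of how a $\chi^2$ test can probe for a content effect, the sketch below computes Pearson's statistic for a 2x2 table that crosses conclusion believability with answer correctness. The table layout is an assumption about how such a test might be set up, not the paper's exact procedure, and the counts are made up purely for demonstration.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square statistic (no continuity correction) and p-value for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, the survival function is erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical counts, NOT results from the paper.
# Rows: believable / unbelievable conclusions. Columns: correct / incorrect answers.
made_up_counts = [[80, 20], [50, 50]]

if __name__ == "__main__":
    stat, p = chi_square_2x2(made_up_counts)
    print(f"chi2 = {stat:.2f}, p = {p:.4g}")
```

Under this setup, a small p-value would suggest that accuracy depends on believability, that is, a content effect, whereas a model that relies only on logical form should show no such dependence.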

Heuristic Analysis

The study examines heuristic models from cognitive science literature, such as the Atmosphere Theory, to interpret LLM behavior. The findings suggest that, in pre-trained settings, LLMs often default to mood-based heuristic reasoning, thereby explaining their hesitance to affirm "nothing follows." Supervised Fine-Tuning presents a promising route to overcome these heuristic biases.

Figure 4: Heuristics predictions. Proportion of mistakes and correct conclusions generated by models that are predicted by heuristic theories.
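
The atmosphere heuristic mentioned above can be written down in a few lines. This is a minimal sketch of its classical formulation (a negative premise yields a negative conclusion, a particular premise yields a particular conclusion); the function name `atmosphere_prediction` is hypothetical, and the paper's own implementation of the heuristic theories may differ.

```python
# Mood codes: A = "All", I = "Some", E = "No", O = "Some ... are not".
NEGATIVE = {"E", "O"}    # premises that deny
PARTICULAR = {"I", "O"}  # premises quantified with "some"


def atmosphere_prediction(mood1, mood2):
    """Predict the conclusion mood from the moods of the two premises."""
    moods = {mood1, mood2}
    negative = bool(moods & NEGATIVE)
    particular = bool(moods & PARTICULAR)
    if negative and particular:
        return "O"
    if negative:
        return "E"
    if particular:
        return "I"
    return "A"


if __name__ == "__main__":
    for m1 in "AIEO":
        for m2 in "AIEO":
            print(m1, m2, "->", atmosphere_prediction(m1, m2))
```

The heuristic always returns one of the four moods and never "nothing follows", which is consistent with the reluctance to give that answer described above.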

Conclusion

This research characterizes LLMs' reasoning capabilities on syllogisms and identifies content bias and reasoning consistency as the main challenges. It points to supervised fine-tuning as a way to make LLM reasoning more faithful to logical form rather than content. Future studies are encouraged to cover more model families and to incorporate adversarial testing to further refine syllogistic reasoning metrics. The findings are relevant both to NLP and to the broader design of AI systems intended to reason reliably.
