Socrates or Smartypants: Testing Logic Reasoning Capabilities of Large Language Models with Logic Programming-based Test Oracles

Published 9 Apr 2025 in cs.CL | (2504.12312v3)

Abstract: LLMs have achieved significant progress in language understanding and reasoning. Evaluating and analyzing their logical reasoning abilities has therefore become essential. However, existing datasets and benchmarks are often limited to overly simplistic, unnatural, or contextually constrained examples. In response to the growing demand, we introduce SmartyPat-Bench, a challenging, naturally expressed, and systematically labeled benchmark derived from real-world high-quality Reddit posts containing subtle logical fallacies. Unlike existing datasets and benchmarks, it provides more detailed annotations of logical fallacies and features more diverse data. To further scale up the study and address the limitations of manual data collection and labeling - such as fallacy-type imbalance and labor-intensive annotation - we introduce SmartyPat, an automated framework powered by logic programming-based oracles. SmartyPat utilizes Prolog rules to systematically generate logically fallacious statements, which are then refined into fluent natural-language sentences by LLMs, ensuring precise fallacy representation. Extensive evaluation demonstrates that SmartyPat produces fallacies comparable in subtlety and quality to human-generated content and significantly outperforms baseline methods. Finally, experiments reveal nuanced insights into LLM capabilities, highlighting that while excessive reasoning steps hinder fallacy detection accuracy, structured reasoning enhances fallacy categorization performance.


Authors (6)

Summary

The paper introduces SmartyPat-Bench, a dataset for benchmarking LLM logical reasoning, and SmartyPat, a neural-symbolic framework that generates fallacious statements through Prolog-based test oracles.
The authors detail a three-stage process (predicate design, LLM-guided knowledge generation, and natural language realization) to ensure both syntactic and semantic fidelity.
Experimental results reveal significant improvements in fallacy generation quality and categorization, emphasizing the method's potential for enhancing AI robustness and interpretability.

Logic Reasoning Evaluation in LLMs via Neural-Symbolic Oracle Benchmarking

Motivation and Dataset Construction

Evaluating LLMs on logic reasoning tasks remains challenging due to shortcomings in existing benchmarks, which are typically constrained by synthetic, formalistic constructions and trivial or linguistically unnatural examples. Previous datasets (FOLIO, LogicBench, LOGIC, COIG-CQIA, RuozhiBench) either rely on symbolic logic operators or direct translations from non-English sources, resulting in semantic misalignment and poor coverage of complex, real-world logical fallacies.

To address these deficiencies, the paper introduces SmartyPat-Bench, a dataset consisting of 502 natively English, real-world posts collected from Reddit (specifically r/ShittyAskScience). Each example was manually selected, filtered, and precisely annotated with fine-grained fallacy types. The annotation taxonomy covers 14 distinct logical fallacies, constructing a resource that enables challenging evaluation and detailed categorization.

A notable challenge in assembling such datasets is class imbalance: user-generated content disproportionately represents certain types (e.g., False Premise, Equivocation, False Analogy), leaving more nuanced or rare fallacies underrepresented. Additionally, the manual annotation process is labor-intensive, requiring over 60 hours for labeling and transformation.

Neural-Symbolic Oracle Generation: The SmartyPat Framework

The proposed SmartyPat framework automates the synthesis of logically fallacious statements via logic programming-based test oracles, leveraging Prolog as the symbolic backbone. The process follows three stages:

Prolog Predicate Design: For each fallacy type's schema, Prolog predicates and rules are constructed to encode characteristic reasoning errors. These predicates systematically specify the core structural properties underlying each fallacy, enabling formal verification and compositional knowledge construction.
Knowledge Generation via LLMs: Predicate templates, facts, and rules are presented to an LLM (e.g., Claude 3.7) as few-shot prompts, guiding the generation of instantiations that inhabit the logic space of each fallacy. Predicate-grouped prompting with inline annotations optimizes semantic alignment in LLM outputs.
Natural Language Realization: Prolog queries extract valid combinations, which are then reformulated by LLMs into fluent, implication-constrained sentences. This step guarantees both syntactic and semantic fidelity with the original fallacy definitions.
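The three stages above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the predicate names (`precedes`, `causes`), the facts, and the template-based `realize` function are hypothetical stand-ins, with a set lookup playing the role of a Prolog query and a fixed template playing the role of the LLM rewriting step.

```python
# Stages 1 and 2 (assumed): a tiny knowledge base for a False Cause schema.
# In the described pipeline an LLM would propose such facts; here they are
# hard-coded, and every predicate and constant name is hypothetical.
FACTS = {
    "precedes": {
        ("rooster_crows", "sunrise"),
        ("umbrella_sales_rise", "rain"),
        ("flip_switch", "light_on"),
    },
    "causes": {
        ("flip_switch", "light_on"),
    },
}


def false_cause_candidates(facts):
    # Mimics a Prolog rule of the form:
    #   false_cause(A, B) :- precedes(A, B), \+ causes(A, B).
    # A pair qualifies when A precedes B but no causal link is recorded.
    return sorted(
        pair for pair in facts["precedes"] if pair not in facts["causes"]
    )


def realize(a, b):
    # Stage 3 stand-in: a fixed template instead of an LLM rewrite.
    a_text = a.replace("_", " ")
    b_text = b.replace("_", " ")
    return f"{a_text} happens before {b_text}, so it must cause it."


if __name__ == "__main__":
    for a, b in false_cause_candidates(FACTS):
        print(realize(a, b))
```

Because the fallacious structure is fixed by the rule rather than by the wording, every generated sentence carries its fallacy label by construction, which is the property the oracle-based design relies on.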

The result is a synthetic dataset, SmartyPat-Bench-Augmented, exhibiting high demographic diversity, subtlety, and structural accuracy. Semantic similarity analysis confirms that the augmented samples are not paraphrastic variants of the original content but are genuinely novel.

Experimental Evaluation and Analysis

Fallacy Generation Quality

Three generation methods are benchmarked: FallacyGen-Direct (LLM direct prompt), FallacyGen-Prolog (LLM instructed to produce Prolog + NL), and SmartyPat. Sentence quality is evaluated by cross-model scoring (Claude 3.7 generates, GPT-4o scores; temperature fixed at 0), mapped to a 0–3 scale.

SmartyPat consistently achieves the highest scores across all fallacy types, with large percentage improvements over baselines (e.g., IE: +58.88%, IT: +62.16%, FC: +52.54%, FS: +45.08%). The approach reduces low-quality outputs and excels especially in structurally complex categories (e.g., Fallacy of Composition). LLMs perform well on trivial, linguistically superficial fallacies (Equivocation, Nominal, False Dilemma) with direct prompting, but fail at semantically demanding ones without symbolic scaffolding.
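The percentage improvements quoted above are relative gains over a baseline. The sketch below shows one way such a figure could be computed, assuming it is taken over mean quality scores on the 0–3 scale; the function names and the sample scores are illustrative only and are not taken from the paper.

```python
def mean_score(scores):
    # Average of per-sentence quality scores, each expected in the range 0..3.
    if not scores:
        raise ValueError("scores must be non-empty")
    if any(s < 0 or s > 3 for s in scores):
        raise ValueError("scores must lie on the 0-3 scale")
    return sum(scores) / len(scores)


def relative_improvement(baseline, improved):
    # Percentage gain of `improved` over `baseline`.
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline * 100.0


if __name__ == "__main__":
    # Made-up scores for one fallacy type, for illustration only.
    baseline_scores = [1, 1, 2, 2]
    framework_scores = [3, 3, 3, 3]
    gain = relative_improvement(
        mean_score(baseline_scores), mean_score(framework_scores)
    )
    print(f"relative improvement: {gain:.2f}%")
```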

LLM Detection and Categorization Performance

Nine state-of-the-art models are evaluated in detection (fallacy existence) and categorization (label assignment) tasks on both real and synthetic benchmarks, using full prompt definitions and structured scoring.

Detection: Contrary to expectation, non-reasoning models (DeepSeek V3, Grok-2) outperform reasoning models (Claude 3.7, GPT-o3-mini) in F1 score. Reasoning models exhibit high false positive rates (FPR), often because overanalysis and confirmation bias lead them to mislabel benign statements as fallacious. Across all models, FPR is high and the false negative rate (FNR) is near zero, and models find the synthetic dataset about as challenging as the real-world benchmark.
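For reference, the detection metrics mentioned above follow from the standard confusion-matrix definitions. The sketch below treats "fallacious" as the positive class; the function name and the sample labels are illustrative and not drawn from the paper.

```python
def detection_metrics(y_true, y_pred):
    # y_true / y_pred: sequences of booleans, True meaning "fallacious".
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)

    # Each ratio falls back to 0.0 when its denominator is empty.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return {"f1": f1, "fpr": fpr, "fnr": fnr}


if __name__ == "__main__":
    # A model that flags almost everything: no misses, many false alarms.
    truth = [True, True, True, False, False, False]
    preds = [True, True, True, True, True, False]
    print(detection_metrics(truth, preds))
```

A model that over-flags, as in the usage example, keeps FNR at zero while FPR climbs, which matches the pattern reported for the reasoning models.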

Categorization: Reasoning models dominate this task. The scoring function accounts for both the correctness and the rank of each predicted label. Scores are higher on synthetic data because of its explicit structural clues. GPT-o3-mini displays superior balance, and reasoning models reliably map structure to fallacy types. Models that predict conservatively (Grok-2) incur smaller penalties because they output fewer labels.
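The summary does not state the scoring formula, so the following is an invented illustration rather than the paper's metric. It assumes a model returns a ranked list of labels, gives credit of 1/rank for each correct label, and subtracts a fixed penalty for each incorrect one, which is enough to show why emitting fewer labels limits the penalty.

```python
def ranked_label_score(predicted, gold, wrong_penalty=0.25):
    # predicted: labels ordered from most to least confident.
    # gold: set of correct labels for the sample.
    # Hypothetical rule: +1/rank per correct label, -wrong_penalty per wrong one.
    score = 0.0
    for rank, label in enumerate(predicted, start=1):
        if label in gold:
            score += 1.0 / rank
        else:
            score -= wrong_penalty
    return score


if __name__ == "__main__":
    gold = {"false_cause"}
    confident = ["false_cause"]
    scattershot = ["false_analogy", "false_cause", "equivocation"]
    print(ranked_label_score(confident, gold))
    print(ranked_label_score(scattershot, gold))
```

Under this assumed rule, a single correct top-ranked label beats a longer list that contains the right answer at a lower rank alongside wrong guesses.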

Fallacy-wise analysis shows surface-level causal and analogical fallacies (False Cause, False Analogy) are easier for all models, while context-dependent, compositionally ambiguous types (Contextomy, Improper Transposition, Improper Distribution) pose greater difficulty.

Semantic Diversity and Reliability

Embedding-based cosine similarity analysis with text-embedding-3-large confirms the uniqueness of generated augmented fallacies ($\approx 0.16$ cross-set similarity, $< 0.20$ internal). Sentence quality evaluators achieve substantial agreement with human annotation ($\kappa > 0.75$).
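Both statistics above can be reproduced with a few lines of standard-library Python. The sketch below assumes the embeddings are already available as plain lists of floats and that the agreement statistic is Cohen's kappa for two raters; the summary does not say which kappa variant was used, so that choice is an assumption.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    # Cosine of the angle between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("zero vector has no direction")
    return dot / (norm_u * norm_v)


def mean_cross_similarity(set_a, set_b):
    # Average similarity over every pair drawn from the two sets.
    sims = [cosine_similarity(a, b) for a in set_a for b in set_b]
    return sum(sims) / len(sims)


def cohen_kappa(rater1, rater2):
    # Agreement between two raters, corrected for chance agreement.
    n = len(rater1)
    observed = sum(1 for a, b in zip(rater1, rater2) if a == b) / n
    counts1, counts2 = Counter(rater1), Counter(rater2)
    expected = sum(counts1[k] * counts2[k] for k in counts1) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)
```

A low mean cross-set similarity indicates that the generated sentences are not near-duplicates of the original posts, while kappa measures how far the automatic quality scores agree with human ratings beyond chance.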

Practical and Theoretical Implications

The neural-symbolic test oracle methodology demonstrated here represents a scalable approach for robust, nuanced benchmarking of LLM logical reasoning. The integration of Prolog predicates enables controllable generation and precise structural guarantees, while LLMs provide linguistic fluency and diversity.

This paradigm supports verifiable and explainable test cases, critical for trustworthy LLM deployment in settings requiring reasoning transparency and reliability. It also enables future research into explanation-based evaluation, automated oracle soundness validation, and expanded logic schema taxonomies.

Reasoning models show a sensitivity bias in detection, over-flagging benign statements as fallacious, while their structured reasoning improves categorization. Overanalysis and confirmation bias in advanced models highlight the need for explicit presentation of task context and negative class balancing.

Conclusion

This work introduces SmartyPat-Bench and SmartyPat, combining native English, real-world annotated data with LLM-guided symbolic generation for evaluating LLM logic reasoning. Automated oracle generation is shown to produce high-quality, diverse, semantically novel fallacies that outperform baseline methods. Evaluation of nine LLMs indicates that structured reasoning enhances categorization, while excessive reasoning hinders detection, where non-reasoning models achieve higher F1 scores. These results underscore the utility of neural-symbolic oracles for advancing AI robustness, interpretability, and practical deployment.
