- The paper introduces the SmartyPat-Bench dataset and neural-symbolic framework to benchmark LLM logical reasoning by generating fallacious statements through Prolog-based test oracles.
- The authors detail a three-stage process (predicate design, LLM-guided knowledge generation, and natural language realization) intended to preserve both syntactic and semantic fidelity.
- Experiments show that SmartyPat produces higher-rated fallacies than direct-prompting baselines, and an evaluation of nine LLMs finds that reasoning models lead on fallacy categorization while non-reasoning models lead on detection.
Logic Reasoning Evaluation in LLMs via Neural-Symbolic Oracle Benchmarking
Motivation and Dataset Construction
Evaluating LLMs on logic reasoning tasks remains challenging due to shortcomings in existing benchmarks, which are typically constrained by synthetic, formalistic constructions and trivial or linguistically unnatural examples. Previous datasets (FOLIO, LogicBench, LOGIC, COIG-CQIA, RuozhiBench) either rely on symbolic logic operators or direct translations from non-English sources, resulting in semantic misalignment and poor coverage of complex, real-world logical fallacies.
To address these deficiencies, the paper introduces SmartyPat-Bench, a dataset consisting of 502 natively English, real-world posts collected from Reddit (specifically r/ShittyAskScience). Each example was manually selected, filtered, and precisely annotated with fine-grained fallacy types. The annotation taxonomy covers 14 distinct logical fallacies, constructing a resource that enables challenging evaluation and detailed categorization.
A notable challenge in assembling such datasets is class imbalance: user-generated content disproportionately represents certain fallacy types (e.g., False Premise, Equivocation, False Analogy), leaving more nuanced or rare fallacies underrepresented. Additionally, the manual annotation process is labor-intensive, requiring over 60 hours for labeling and transformation.
Neural-Symbolic Oracle Generation: The SmartyPat Framework
The proposed SmartyPat framework automates the synthesis of logically fallacious statements via logic programming-based test oracles, leveraging Prolog as the symbolic backbone. The process follows three stages:
- Prolog Predicate Design: For each fallacy type's schema, Prolog predicates and rules are constructed to encode characteristic reasoning errors. These predicates systematically specify the core structural properties underlying each fallacy, enabling formal verification and compositional knowledge construction.
- Knowledge Generation via LLMs: Predicate templates, facts, and rules are presented to an LLM (e.g., Claude 3.7) as few-shot prompts, guiding the generation of instantiations that inhabit the logic space of each fallacy. Predicate-grouped prompting with inline annotations optimizes semantic alignment in LLM outputs.
- Natural Language Realization: Prolog queries extract valid combinations, which are then reformulated by LLMs into fluent, implication-constrained sentences. This step guarantees both syntactic and semantic fidelity with the original fallacy definitions.
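The three stages above can be sketched in miniature. The paper's oracles are written in Prolog; the block below is only a Python analogue under stated assumptions. The facts, the `false_cause_pairs` rule, and the `realization_prompt` helper are all hypothetical names invented for illustration, and no LLM is actually called.

```python
# Hypothetical facts, playing the role of Prolog facts such as precedes(a, b).
PRECEDES = {
    ("rooster_crowing", "sunrise"),
    ("wearing_lucky_socks", "winning_the_match"),
    ("studying", "passing_the_exam"),
}
# Hypothetical genuine causal links, playing the role of causes(a, b) facts.
CAUSES = {("studying", "passing_the_exam")}


def false_cause_pairs(precedes, causes):
    # Mirrors a rule of the form:
    #   false_cause(A, B) :- precedes(A, B), \+ causes(A, B).
    # A came before B, but no causal link from A to B is known.
    return sorted(pair for pair in precedes if pair not in causes)


def realization_prompt(cause, effect):
    # Builds the text that would be sent to an LLM; no model is called here.
    return (
        "Write one fluent English sentence claiming that "
        f"'{effect}' happened because of '{cause}'."
    )


pairs = false_cause_pairs(PRECEDES, CAUSES)
prompts = [realization_prompt(a, b) for a, b in pairs]
for p in prompts:
    print(p)
```

The point of the design is the separation of concerns: the rule decides which combinations are structurally fallacious, and the language model is only asked to phrase combinations that already passed the symbolic check.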
The result is a synthetic dataset, SmartyPat-Bench-Augmented, exhibiting high demographic diversity, subtlety, and structural accuracy. Semantic similarity analysis confirms that the augmented samples are not paraphrastic variants of the original content but are genuinely novel.
Experimental Evaluation and Analysis
Fallacy Generation Quality
Three generation methods are benchmarked: FallacyGen-Direct (LLM direct prompt), FallacyGen-Prolog (LLM instructed to produce Prolog + NL), and SmartyPat. Sentence quality is evaluated by cross-model scoring (Claude 3.7 generates, GPT-4o scores; temperature fixed at 0), mapped to a 0–3 scale.
SmartyPat consistently achieves the highest scores across all fallacy types, with large percentage improvements over baselines (e.g., IE: +58.88%, IT: +62.16%, FC: +52.54%, FS: +45.08%). The approach reduces low-quality outputs and excels especially in structurally complex categories (e.g., Fallacy of Composition). LLMs perform well on trivial, linguistically superficial fallacies (Equivocation, Nominal, False Dilemma) with direct prompting, but fail at semantically demanding ones without symbolic scaffolding.
Fallacy Detection and Categorization
Nine state-of-the-art models are evaluated in detection (fallacy existence) and categorization (label assignment) tasks on both real and synthetic benchmarks, using full prompt definitions and structured scoring.
Detection: Contrary to expectation, non-reasoning models (DeepSeek V3, Grok-2) outperform reasoning models (Claude 3.7, GPT-o3-mini) in F1 score. Reasoning models exhibit high false positive rates (FPR), often because overanalysis and confirmation bias lead them to mislabel benign statements as fallacious. FPR is high across all models while FNR is near zero, and the synthetic dataset proves comparably challenging to the real-world benchmark.
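To make the detection metrics concrete, the following minimal sketch computes F1, FPR, and FNR from confusion-matrix counts. The counts in the usage lines are invented for illustration and are not results from the paper; they are chosen only to show how a near-zero FNR can coexist with a high FPR.

```python
def detection_metrics(tp, fp, tn, fn):
    # Standard binary-classification metrics; "positive" means
    # "the statement contains a fallacy".
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    fpr = fp / (fp + tn) if (fp + tn) else 0.0  # benign flagged as fallacious
    fnr = fn / (fn + tp) if (fn + tp) else 0.0  # fallacies that were missed
    return {"precision": precision, "recall": recall,
            "f1": f1, "fpr": fpr, "fnr": fnr}


# Hypothetical counts: almost every fallacy is caught (low FNR),
# but most benign statements are also flagged (high FPR).
m = detection_metrics(tp=98, fp=60, tn=40, fn=2)
print({k: round(v, 3) for k, v in m.items()})
```

With these assumed counts, recall stays high while precision drops, which is how a model that over-flags benign statements loses F1.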
Categorization: Reasoning models dominate this task. The scoring function accounts for both the correctness and the rank of each predicted label. Scores are higher on synthetic data because of its explicit structural cues. GPT-o3-mini displays the best balance, and reasoning models reliably map structure to fallacy types. Models that predict conservatively (e.g., Grok-2) incur fewer penalties because they output fewer labels.
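The exact scoring function is not given in this summary, so the block below is a purely hypothetical rank-aware score, not the paper's metric. It assumes a reciprocal-rank reward for each correct label and a fixed penalty (`penalty`, an invented parameter) for each incorrect one, which is enough to show why emitting fewer labels can limit penalties.

```python
def ranked_label_score(predicted, gold, penalty=0.25):
    # predicted: labels in the order the model ranked them (best first).
    # gold: set of annotated fallacy labels for the statement.
    # Hypothetical scheme: a correct label at rank r earns 1/r,
    # and every incorrect label costs a fixed penalty.
    score = 0.0
    for rank, label in enumerate(predicted, start=1):
        if label in gold:
            score += 1.0 / rank
        else:
            score -= penalty
    return score


gold = {"false_cause"}
conservative = ranked_label_score(["false_cause"], gold)
verbose = ranked_label_score(
    ["false_cause", "false_analogy", "equivocation"], gold
)
misranked = ranked_label_score(["false_analogy", "false_cause"], gold)
print(conservative, verbose, misranked)
```

Under this assumed scheme, the single-label prediction scores highest, the three-label prediction loses points to its two wrong guesses, and the misranked prediction is hit by both the lower rank and the penalty.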
Fallacy-wise analysis shows surface-level causal and analogical fallacies (False Cause, False Analogy) are easier for all models, while context-dependent, compositionally ambiguous types (Contextomy, Improper Transposition, Improper Distribution) pose greater difficulty.
Semantic Diversity and Reliability
Embedding-based cosine similarity analysis with text-embedding-3-large confirms the uniqueness of generated augmented fallacies (≈0.16 cross-set similarity, <0.20 internal). Sentence quality evaluators achieve substantial agreement with human annotation (κ>0.75).
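Both checks in this paragraph reduce to short formulas, sketched below in plain Python. The vectors and rating lists are toy values standing in for real embeddings and annotations. The summary does not say which kappa statistic was used, so Cohen's kappa for two raters is an assumption here.

```python
import math
from collections import Counter


def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_cross_similarity(set_a, set_b):
    # Average similarity over every (a, b) pair drawn across the two sets.
    sims = [cosine(u, v) for u in set_a for v in set_b]
    return sum(sims) / len(sims)


def cohens_kappa(rater_a, rater_b):
    # Agreement between two raters, corrected for chance agreement.
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1.0 - expected)


# Toy stand-ins for embeddings of original and augmented sentences.
original = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
augmented = [[0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
cross = mean_cross_similarity(original, augmented)

# Toy 0-3 quality scores from an automatic evaluator and a human annotator.
evaluator = [0, 1, 2, 3, 3, 2]
human = [0, 1, 2, 3, 2, 2]
kappa = cohens_kappa(evaluator, human)
print(round(cross, 4), round(kappa, 4))
```

A low mean cross-set similarity indicates that the augmented sentences are not close paraphrases of the originals, and a high kappa indicates that the automatic scorer agrees with human judgment beyond what chance would produce.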
Practical and Theoretical Implications
The neural-symbolic test oracle methodology demonstrated here represents a scalable approach for robust, nuanced benchmarking of LLM logical reasoning. The integration of Prolog predicates enables controllable generation and precise structural guarantees, while LLMs provide linguistic fluency and diversity.
This paradigm supports verifiable and explainable test cases, critical for trustworthy LLM deployment in settings requiring reasoning transparency and reliability. It also enables future research into explanation-based evaluation, automated oracle soundness validation, and expanded logic schema taxonomies.
Reasoning behavior in LLMs reveals structural misalignment and sensitivity bias in detection, while symbolically scaffolded generation markedly improves categorization capabilities. Overanalysis and confirmation bias in advanced models highlight the need for explicit presentation of task context and negative class balancing.
Conclusion
This work introduces SmartyPat-Bench and SmartyPat, combining native English, real-world annotated data with LLM-guided symbolic generation for evaluating LLM logical reasoning. Automated oracle generation is shown to produce high-quality, diverse, semantically novel fallacies that outscore both baseline generation methods across fallacy types. Evaluation of nine LLMs shows that reasoning models lead on categorization, aided by the explicit structural cues in symbolically generated data, while non-reasoning models achieve higher F1 on detection. These results underscore the utility of neural-symbolic oracles for advancing AI robustness, interpretability, and practical deployment.