- The paper introduces a three-stage pipeline that separates label prediction from formal verification, enabling machine-checkable evaluation of LLM-generated fallacy reasoning.
- It employs Lean4 to formalize candidate explanations and uses an iterative repair mechanism to enhance logical consistency and derivability.
- Empirical results show that over 90% of outputs pass formal verification while only about 20% match human annotations, exposing a gap between formal derivability and label agreement in fallacy detection evaluation.
Introduction
Evaluating the logical reasoning capabilities of LLMs remains an open problem in assessing argument understanding and logical robustness. Traditional approaches to logical fallacy detection focus on label prediction and often disregard the consistency or formal validity of the model-generated rationale behind the label. "ForEx: A Formal Verification Framework for Explainable Reasoning in Logical Fallacy Detection and Annotation" (2606.21867) introduces a systematic evaluation methodology that decouples prediction correctness from the formal derivability of reasoning chains, enabling machine-verifiable assessment using Lean4 together with a structured analysis of annotation consensus.
The ForEx Framework
ForEx operationalizes a three-stage pipeline: generation of fallacy reasoning candidates, formalization and verification via Lean4, and structured label-consistency evaluation. LLMs generate multiple candidate tuples, each comprising a predicted fallacy label, a supporting natural language (NL) explanation, and an initial translation of that explanation into Lean4 theorem syntax.
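As a purely illustrative sketch (not taken from the paper), the Lean4 component of a candidate tuple might resemble the toy theorem below. The theorem name, the type `Winter`, the predicate `cold`, and both hypotheses are hypothetical; the only point is that the explanation's premises become hypotheses and its conclusion becomes the goal.

```lean
-- Toy encoding of a hasty-generalization style explanation.
-- All names here are invented for illustration.
theorem hasty_generalization_sketch
    (Winter : Type) (cold : Winter → Prop) (w : Winter)
    -- Premise: one observed winter was cold.
    (h_one : cold w)
    -- Premise: the generalizing step, encoded as an assumption.
    (h_gen : cold w → ∀ x : Winter, cold x) :
    -- Conclusion: every winter is cold.
    ∀ x : Winter, cold x :=
  h_gen h_one
```

Lean accepts this proof because the generalizing step is itself supplied as a hypothesis (`h_gen`), so a successful check establishes derivability from the encoded premises and nothing more.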
An execution-feedback loop iteratively compiles and repairs Lean4 code, constrained to the original explanation to prevent semantic drift, terminating when compilation succeeds or the repair budget is exhausted. Only the logical derivability from encoded premises within Lean4 is checked, not the full faithfulness to the initial NL argument.
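The loop described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: `compile_fn` and `repair_fn` are hypothetical callables (in practice they would presumably wrap the Lean4 compiler and an LLM call), and the default budget of four follows the repair cap reported in the experiments. Whether the initial compilation counts toward that budget is not specified; here it does not.

```python
from typing import Callable, Tuple

# Hypothetical interfaces (assumptions, not the paper's API):
#   compile_fn(code) -> (ok, error_log)
#   repair_fn(code, error_log, explanation) -> repaired code
CompileFn = Callable[[str], Tuple[bool, str]]
RepairFn = Callable[[str, str, str], str]


def verify_with_repair(lean_code: str, explanation: str,
                       compile_fn: CompileFn, repair_fn: RepairFn,
                       max_repairs: int = 4) -> Tuple[bool, str, int]:
    """Compile; on failure, repair and retry until success or budget exhaustion.

    Returns (verified, final_code, repairs_used).
    """
    code = lean_code
    repairs_used = 0
    while True:
        ok, error_log = compile_fn(code)
        if ok:
            return True, code, repairs_used
        if repairs_used == max_repairs:
            return False, code, repairs_used
        # The original explanation is passed on every repair so the repairer
        # stays anchored to it instead of drifting to a different argument.
        code = repair_fn(code, error_log, explanation)
        repairs_used += 1


# Toy stand-ins so the sketch runs without Lean or an LLM.
def fake_compile(code: str) -> Tuple[bool, str]:
    return ("sorry" not in code, "error: proof contains 'sorry'")


def fake_repair(code: str, error_log: str, explanation: str) -> str:
    # Replaces one placeholder per repair step.
    return code.replace("sorry", "trivial", 1)


print(verify_with_repair("sorry sorry", "toy explanation", fake_compile, fake_repair))
```

Passing the compiler and repairer in as arguments keeps the control flow testable without a Lean toolchain; the stubs above exist only to exercise the termination conditions.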
This explicit separation between label prediction and formal validity is encoded in the LLM Argument Verification Matrix, which categorizes each output along two axes: Lean4 verification (pass/fail) and label consistency with human annotation (match/mismatch). Formally verified but annotation-mismatched outputs (Compilable-Alternative) are not collapsed into errors, which allows the framework to capture plausible but non-canonical model interpretations.
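A minimal sketch of the two-axis categorization might look like the following. The function and enum names are hypothetical, the fallacy labels in the usage line are illustrative, and two assumptions are made: a label "matches" when it appears in the instance's human annotation set, and the fourth cell is given a placeholder name because it is not named above.

```python
from enum import Enum
from typing import Iterable


class MatrixCell(Enum):
    COMPILABLE_CORRECT = "Compilable-Correct"
    COMPILABLE_ALTERNATIVE = "Compilable-Alternative"
    UNCOMPILABLE_CORRECT = "Uncompilable-Correct"
    # Placeholder name (assumption): the text above does not name this cell.
    UNCOMPILABLE_MISMATCH = "Uncompilable-Mismatch"


def classify(lean_verified: bool, predicted_label: str,
             human_labels: Iterable[str]) -> MatrixCell:
    """Place one model output into the 2x2 matrix."""
    # Assumption: membership in the human annotation set counts as a match.
    label_match = predicted_label in set(human_labels)
    if lean_verified:
        return (MatrixCell.COMPILABLE_CORRECT if label_match
                else MatrixCell.COMPILABLE_ALTERNATIVE)
    return (MatrixCell.UNCOMPILABLE_CORRECT if label_match
            else MatrixCell.UNCOMPILABLE_MISMATCH)


# A verified proof with a label the annotators did not choose.
print(classify(True, "false dilemma", {"ad hominem"}).value)
```

Keeping the two checks as independent booleans means a verified but mismatched output lands in its own cell rather than being scored as a plain error.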
Empirical Results
Experiments utilize the LOGIC-Climate dataset, sampling 107 instances across 13 annotated fallacy categories. ForEx is evaluated using 15 models within the reasoning pipeline, with repair iterations for Lean4 code capped at four steps. The results provide several key findings:
- Over 90% of LLM outputs across models can be translated and verified as valid Lean4 reasoning chains (Compilable-Correct + Compilable-Alternative).
- However, model agreement with human-annotated labels remains consistently low (around 20%), indicating a substantial systematic gap between formal derivability and label matching.
- Thinking model variants, which yield more complex Lean4 structures, have lower initial compilation rates but benefit substantially from a single repair step.
- The Compilable-Alternative (formally verified reasoning with annotation mismatch) category is prevalent, indicating that many model rationales are internally consistent and formally valid yet deviate from annotator label assignments.
- Uncompilable-Correct cases, where a model agrees with the annotation but fails formal verification, are rare (<2%), which supports the view that annotation-matching reasoning rarely lacks a derivable proof structure.
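The headline rates in the list above are simple aggregates over per-output outcomes. The sketch below shows one way to compute them, assuming each output is reduced to a `(lean_verified, label_match)` pair; the function name, the dictionary keys, and the toy outcomes are made up for illustration and are not the paper's data.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def summarize(outcomes: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
    """Aggregate (lean_verified, label_match) pairs into summary rates."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("need at least one outcome")
    counts = Counter(outcomes)
    n = len(outcomes)
    return {
        # Compilable-Correct + Compilable-Alternative
        "verified_rate": (counts[(True, True)] + counts[(True, False)]) / n,
        # Agreement with human labels, regardless of verification
        "label_match_rate": (counts[(True, True)] + counts[(False, True)]) / n,
        # Label matches but the Lean4 check fails
        "uncompilable_correct_rate": counts[(False, True)] / n,
    }


# Made-up toy outcomes, chosen only to exercise the function.
toy = ([(True, True)] * 3 + [(True, False)] * 5
       + [(False, True)] * 1 + [(False, False)] * 1)
print(summarize(toy))
```

Reporting the verified rate and the label-match rate separately is what makes the gap between the two visible.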
Consensus-Guided Annotation and Behavioral Analysis
ForEx incorporates a consensus-guided annotation augmentation pipeline. Multi-LLM predictions are aggregated, and their formally verified outputs are compared against the human annotation sets using a Jaccard-based distance to isolate instances on which the models agree strongly. For these high-consensus examples, any label that receives verified assignments from at least 50% of models is added, and the resulting labels achieve a recall of 0.77 against the original human annotations. The workflow is deliberately conservative: it is designed to flag plausible alternative labels while controlling annotation quality drift.
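The pieces of this workflow can be sketched as below. Several details are assumptions, not facts from the paper: the 50% share is measured against all models, with only verified assignments counted in the numerator; recall is computed per instance as the fraction of human labels recovered; and the selection of high-consensus instances is not reproduced. Function names and the example labels are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def jaccard_distance(a: Iterable[str], b: Iterable[str]) -> float:
    """1 - |A ∩ B| / |A ∪ B| for two label sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention: two empty label sets are identical
    return 1.0 - len(a & b) / len(union)


def consensus_labels(model_outputs: List[Tuple[str, bool]],
                     threshold: float = 0.5) -> Set[str]:
    """model_outputs holds one (label, lean_verified) pair per model.

    Assumption: a label qualifies when its verified assignments make up
    at least `threshold` of all models.
    """
    if not model_outputs:
        return set()
    counts = Counter(label for label, verified in model_outputs if verified)
    n_models = len(model_outputs)
    return {label for label, c in counts.items() if c / n_models >= threshold}


def recall(proposed: Iterable[str], human: Iterable[str]) -> float:
    """Fraction of human labels that also appear among the proposed labels."""
    human = set(human)
    if not human:
        raise ValueError("human label set is empty")
    return len(set(proposed) & human) / len(human)


# Illustrative instance with four models; labels are invented examples.
outputs = [("false cause", True), ("false cause", True),
           ("ad hominem", True), ("false cause", False)]
human = {"false cause", "appeal to emotion"}
proposed = consensus_labels(outputs)
print(proposed, jaccard_distance(proposed, human), recall(proposed, human))
```

Counting only verified assignments in the numerator means an unverified majority cannot push a label over the threshold, which is consistent with the conservative design described above.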
Principal Component Analysis of model behavioral signatures reveals that most models cluster near the human annotator distributions, with very few outliers, which supports the robustness of the pipeline. Disparities in output clustering and label assignment highlight the tension between LLM interpretation under limited context and human annotators' reliance on broader discourse.
Implications and Limitations
ForEx rigorously demonstrates that label-based evaluation and machine-checked formal reasoning probe distinct facets of LLM behavior. In particular, the predominance of formally verified but annotation-mismatched outputs implies existing annotation schemes may under-capture plausible logical readings, especially in tasks susceptible to multiple interpretive paths. This is further compounded by evidence that many annotation mismatches arise from lack of contextual information in model inputs or inherent ambiguities in fallacy identification.
ForEx does not assert semantic correctness: formal verifiability in Lean4 certifies only derivability under the encoded premises, not alignment with natural language meaning. The risk therefore remains that surface-consistent, formally valid chains fail to capture the subtlety or intent of the original argument. Additionally, annotation augmentation cannot be equated with annotation improvement without stronger semantic guarantees.
Future Directions
The findings motivate extending formal verification approaches to incorporate richer argument structures, enforce context-sensitive semantic consistency checks, and further dissect the sources of annotation-model mismatches. There is a clear need to advance both annotation practices and formal tools that can bridge the persistent gap between natural language, where ambiguity and contextual dependence are endemic, and formal systems, which enforce explicit reasoning chains.
Conclusion
ForEx advances the evaluation of LLMs in logical fallacy detection by decoupling label prediction from the formal status of reasoning, leveraging Lean4 as a machine-checkable target for model explanations. The systematic exposure of interpretive gaps and limitations in annotation-driven evaluation sharpens the discourse on robust, explainable argument assessment. While formal verification introduces new axes for analyzing LLM reasoning reliability, continued work is required to integrate formal soundness with genuine semantic validity and to refine multi-annotator consensus methodologies.