Taxonomy of Reasoning Failures in LLMs
- Taxonomy of Reasoning Failures in LLMs is a structured framework categorizing logical, procedural, and cognitive error modes observed in large language models.
- The framework is built upon empirical evaluations and diagnostic approaches that quantify performance drops in logical inference, procedural diagnosis, and representational consistency.
- These classifications inform actionable mitigation strategies, including symbolic reasoning modules and meta-cognitive scaffolding, to enhance LLM reliability.
LLMs have demonstrated remarkable performance on a range of reasoning tasks but also exhibit well-documented, systematic reasoning failures. Understanding these failures requires precise taxonomies grounded in empirical evaluations, formal consistency definitions, domain-specific diagnostics, and cognitive frameworks. Recent research delineates distinct, reproducible reasoning failure modes in LLMs, spanning both generic logical inference and application-specific settings. Taxonomies described in the literature are increasingly fine-grained, covering logical, cognitive, procedural, linguistic, and representational dimensions, and reveal persistent limitations that are robust to changes in model scale or architecture.
1. Logical Reasoning Failures: Formal Taxonomies and Empirical Patterns
LLMs' failures in logical reasoning are rigorously analyzed via controlled stress-testing frameworks. In "Less Is More for Multi-Step Logical Reasoning...", two primary logical failure classes are isolated:
- Essential Rule Deletion: Removal of a rule on which the inference chain depends, so that the correct conclusion can no longer be propagated. All tested models drop catastrophically to chance accuracy (25%) when essential links are removed, highlighting sensitivity to missing evidence.
- Contradictory Evidence Injection: Introduction of explicit contradictions to the premises. All models collapse to 0% accuracy, demonstrating an inability to prioritize or reconcile conflicting information (Bao et al., 6 Dec 2025).
This dichotomy is summarized as invariance to semantics-preserving logical transformations (paraphrasing, compression, and single- or multi-law equivalence rewrites are all handled robustly) versus brittleness to structural disruptions (missing or conflicting evidence).
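The two structural perturbations can be illustrated on a toy rule base. The following is a minimal sketch, not the benchmark's actual construction: the rule encoding, the `~` prefix for negation, and the helper names (`forward_chain`, `delete_rule`, `inject_contradiction`) are all assumptions made for illustration.

```python
from typing import FrozenSet, List, Set, Tuple

# A rule is (premises, conclusion); atoms are plain strings, "~x" negates "x".
Rule = Tuple[FrozenSet[str], str]


def forward_chain(facts: Set[str], rules: List[Rule]) -> Set[str]:
    """Derive every atom reachable from the facts by repeatedly applying rules."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= derived and conclusion not in derived:
                derived.add(conclusion)
                changed = True
    return derived


def negate(atom: str) -> str:
    """Return the negation of an atom under the '~' convention."""
    return atom[1:] if atom.startswith("~") else "~" + atom


def has_contradiction(atoms: Set[str]) -> bool:
    """True if some atom and its negation are both present."""
    return any(negate(a) in atoms for a in atoms)


def delete_rule(rules: List[Rule], index: int) -> List[Rule]:
    """Essential rule deletion: drop one link of the chain."""
    return rules[:index] + rules[index + 1:]


def inject_contradiction(facts: Set[str], atom: str) -> Set[str]:
    """Contradictory evidence injection: assert the negation of a derivable atom."""
    return set(facts) | {negate(atom)}


facts = {"a"}
rules = [
    (frozenset({"a"}), "b"),
    (frozenset({"b"}), "c"),
    (frozenset({"c"}), "d"),
]

baseline = forward_chain(facts, rules)                      # contains "d"
deleted = forward_chain(facts, delete_rule(rules, 1))       # chain broken, no "d"
contradicted = forward_chain(inject_contradiction(facts, "c"), rules)
```

In this sketch a symbolic reasoner can flag both conditions explicitly: the target is simply underivable after deletion, and `has_contradiction(contradicted)` detects the conflicting premises. The reported failures concern models that do neither.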
Complementary comprehensive taxonomies extend these observations:
| Failure Type | Definition | Example |
|---|---|---|
| Deductive, Inductive, Abductive Failure | Inability to draw conclusion, generalize or abduce plausible cause | Incorrect or “unknown” answers on logical chains |
| Negation/Implication/Transitivity Failure | Violation of basic logical consistency across queries | Asserts both a statement and its negation; violates implication or transitivity rules |
| Compositional-Consistency Failure | Inconsistent joint assignments when combining facts | Fails conjunctions; output not stable to perturbation |
| Factuality-Consistency Failure | Answers contradict external knowledge or KBs | Factual error on established knowledge |
| Compositional Reasoning Failure | Fails to chain multiple facts or steps; only isolated premises handled | Two-hop tasks drop to random or below chance |
Such evaluations are augmented by ATP-based frameworks that label reasoning chains into ideal (all steps sound, valid path), robust-but-impure (spurious steps), incomplete, invalid, or errant (Zheng et al., 29 Dec 2025).
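Negation and transitivity failures from the table above can be checked mechanically once a model's answers are collected. The sketch below assumes answers have already been reduced to booleans; the function names and the input layout are hypothetical and are not taken from the cited frameworks.

```python
from itertools import permutations
from typing import Dict, List, Tuple


def negation_violations(
    beliefs: Dict[str, bool],
    negation_pairs: List[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Return (p, not_p) pairs to which the model assigned the same truth value."""
    return [
        (p, q)
        for p, q in negation_pairs
        if p in beliefs and q in beliefs and beliefs[p] == beliefs[q]
    ]


def transitivity_violations(
    relation: Dict[Tuple[str, str], bool],
) -> List[Tuple[str, str, str]]:
    """Return triples (a, b, c) where R(a,b) and R(b,c) are affirmed but R(a,c) is denied."""
    items = sorted({x for pair in relation for x in pair})
    violations = []
    for a, b, c in permutations(items, 3):
        if (
            relation.get((a, b)) is True
            and relation.get((b, c)) is True
            and relation.get((a, c)) is False
        ):
            violations.append((a, b, c))
    return violations


# Toy answers: the model affirms a statement and its negation ...
beliefs = {"Paris is in France": True, "Paris is not in France": True}
pairs = [("Paris is in France", "Paris is not in France")]

# ... and affirms x>y and y>z while denying x>z.
larger_than = {("x", "y"): True, ("y", "z"): True, ("x", "z"): False}

neg_bad = negation_violations(beliefs, pairs)
trans_bad = transitivity_violations(larger_than)
```

Pairs that were never queried are skipped rather than counted, so the check reports only violations that the collected answers actually exhibit.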
2. Procedural and Meta-Reasoning Failure Modes
Diagnostic frameworks in application domains (e.g., root-cause analysis, code repair, legal reasoning) reveal a rich spectrum of procedural reasoning failures. In cloud-based RCA, a taxonomy of 16 unique failure codes includes:
- Procedural: Anchoring bias (RF-13), repetition/failure to resume (RF-12), arbitrary evidence selection (RF-07), failure to update belief (RF-09), simulation/role confusion (RF-10), excessive speculation (RF-11), internal contradictions (RF-15), arithmetic/aggregation error (RF-16).
- Domain-Specific: Metric interpretation error (RF-02), confused provenance (RF-03), unjustified specificity (RF-06), arbitrary evidence selection (RF-07).
- General: Fabricated evidence (RF-01), temporal misordering (RF-04), spurious causality (RF-05), evidential insufficiency (RF-08) (Riddell et al., 29 Jan 2026).
Cross-domain analyses show that anchoring, chain stalling, and arbitrary evidence selection strongly predict incorrect diagnoses, a pattern echoed in legal and step-intensive domains (Mishra et al., 8 Feb 2025). Notably, meta-cognitive controls such as self-awareness, context awareness, and goal management remain underutilized, with real impact on robust problem-solving (Kargupta et al., 20 Nov 2025).
3. Consistency and Representation-Level Failures
Research on self-consistency identifies two critical axes:
- Hypothetical Consistency: Failure to recognize or predict own outputs in alternate, semantically equivalent prompt contexts; empirical hypothetical consistency rates remain far below 100%.
- Compositional Consistency: Instability under substitution of correct intermediate results into composed prompts, with consistency rates lagging behind correctness by 5–10 percentage points even in best models (Chen et al., 2023).
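The gap between correctness and compositional consistency can be made concrete with a small scoring routine. This is a sketch under stated assumptions: `Item`, `score`, and the lookup-table "model" are hypothetical, and consistency is operationalized here as agreement between the answer to the full prompt and the answer to the prompt with the intermediate result substituted.

```python
from typing import Callable, List, NamedTuple, Tuple


class Item(NamedTuple):
    full_prompt: str          # composed question
    substituted_prompt: str   # same question with a correct sub-result plugged in
    gold: str                 # reference answer


def score(model: Callable[[str], str], items: List[Item]) -> Tuple[float, float]:
    """Return (correctness rate, compositional-consistency rate) over the items."""
    correct = 0
    consistent = 0
    for item in items:
        full_answer = model(item.full_prompt)
        sub_answer = model(item.substituted_prompt)
        correct += full_answer == item.gold
        consistent += full_answer == sub_answer
    n = len(items)
    return correct / n, consistent / n


# Stand-in "model": a fixed lookup table, purely for illustration.
canned = {
    "(2+3)*4": "20", "5*4": "20",
    "(1+1)*7": "14", "2*7": "15",   # right on the full prompt, unstable after substitution
}
items = [
    Item("(2+3)*4", "5*4", "20"),
    Item("(1+1)*7", "2*7", "14"),
]

correctness, consistency = score(canned.__getitem__, items)
```

In this toy run every full prompt is answered correctly while only half of the substituted prompts agree, which mirrors the pattern of consistency lagging behind correctness.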
Tokenization-level artifacts are separately identified as a source of "phantom edit" failures, in which distinct token sequences detokenize to the same output, so that the model "thinks" it performed a substitution although the surface text is unchanged. Eight artifact classes (e.g., whitespace-boundary shift, intra-word resegmentation, acronym splits) account for roughly 10% of observed replacement errors in controlled probing, a share that does not vary with model scale (Ayoobi et al., 21 Jan 2026).
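A phantom edit can be detected by comparing token sequences before and after an attempted replacement. The sketch below uses a made-up six-entry vocabulary (`TOY_VOCAB`) rather than any real tokenizer, and `is_phantom_edit` is a hypothetical helper name.

```python
from typing import Dict, List

# Toy vocabulary: some strings can be segmented in more than one way.
TOY_VOCAB: Dict[int, str] = {
    0: "un",
    1: "happy",
    2: "unhappy",
    3: " very",
    4: " ",
    5: "very",
}


def decode(ids: List[int]) -> str:
    """Concatenate token strings, as a byte-level detokenizer would."""
    return "".join(TOY_VOCAB[i] for i in ids)


def is_phantom_edit(before: List[int], after: List[int]) -> bool:
    """True when the token sequence changed but the surface text did not."""
    return before != after and decode(before) == decode(after)


# Intra-word resegmentation: "un" + "happy" vs. "unhappy".
resegmented = is_phantom_edit([0, 1], [2])

# Whitespace-boundary shift: " " + "very" vs. " very".
boundary_shift = is_phantom_edit([4, 5], [3])

# A genuine edit: "unhappy" -> "happy" changes the surface text.
real_edit = is_phantom_edit([2], [1])
```

With a real tokenizer the same comparison applies: encode nothing, and instead compare the emitted token ids against the decoded strings.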
4. Cognitive and Exploratory Failures: State-Tracking and Strategic Planning
Analyses through a cognitive lens decompose LLM reasoning into 28 foundational elements across four super-categories:
- Computational Constraints: Logical coherence, compositionality, productivity.
- Meta-Cognitive Controls: Self-awareness, context integration, dynamic strategy selection, goal management, evaluation.
- Representational Schemas: Sequential, hierarchical, network, causal, temporal, spatial structures.
- Transformation Operations: Verification, selective attention, decomposition/integration, forward/backward chaining, backtracking (Kargupta et al., 20 Nov 2025).
LLMs display heavy over-reliance on shallow, forward-only chaining and linear sequential organization, while strategic skills like hierarchical nesting, meta-evaluation, and adaptive representation are conspicuously rare—particularly in ill-structured tasks.
Additionally, in complex search or problem-exploration settings, LLMs are shown to be "wandering solution explorers" rather than systematic problem solvers. Failure taxa include:
- Invalid explorations: Boundary violations, omitted procedures, incorrect backtracking.
- Unnecessary explorations: State revisitation, infinite self-loops.
- Evaluation errors: State staleness, execution (arithmetic) error, unfaithful conclusions (Lu et al., 26 May 2025).
These error modes manifest as collapsed coverage and necessity metrics, especially with increasing problem depth or solution-space cardinality.
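Several of the exploration failures listed above are visible directly in a trace of visited states. The following sketch audits such a trace; `audit_trace` is a hypothetical helper, and the coverage definition used here (fraction of reachable states visited at least once) is an assumption for illustration, not necessarily the metric used in the cited work.

```python
from typing import Dict, List, Set, Union


def audit_trace(
    trace: List[str],
    reachable: Set[str],
) -> Dict[str, Union[int, float, List[str]]]:
    """Count self-loops, non-adjacent revisits, and out-of-bounds states in a trace."""
    # Self-loop: the same state appears twice in a row.
    self_loops = sum(1 for a, b in zip(trace, trace[1:]) if a == b)

    # Revisit: returning to an already-seen state after leaving it.
    seen: Set[str] = set()
    revisits = 0
    previous = None
    for state in trace:
        if state in seen and state != previous:
            revisits += 1
        seen.add(state)
        previous = state

    # Boundary violation: a state outside the valid search space.
    boundary_violations = [s for s in trace if s not in reachable]

    # Assumed coverage metric: share of reachable states visited at least once.
    coverage = len(seen & reachable) / len(reachable) if reachable else 0.0

    return {
        "self_loops": self_loops,
        "revisits": revisits,
        "boundary_violations": boundary_violations,
        "coverage": coverage,
    }


report = audit_trace(
    trace=["A", "B", "B", "A", "C", "Z"],
    reachable={"A", "B", "C", "D"},
)
```

Not every revisit is wasteful, since deliberate backtracking also returns to earlier states; the counts are a screening signal that still needs inspection of the trace.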
5. Domain-Specific Reasoning Failures: Programming, Math, and Multilinguality
Reasoning failures in code execution simulation are rigorously annotated as follows:
| Category | Share (approx.) | Characteristic Failure |
|---|---|---|
| Computation Errors | 39–54% | Arithmetic/misc. miscalculation |
| Indexing Errors | 3–12% | Off-by-one/mis-slicing |
| Control Flow Errors | 2–9% | Loop/conditional predicate mistakes |
| Skipping Statements | 3–12% | Missing an update in loops/blocks |
| Misreporting Final Output | 4–13% | Correct reasoning but mis-prints final result |
| Input Misread | 5–8% | Incorrect parsing of input, ignoring constraints/types |
| Native API Misevaluation | <5% | Misuse or wrong assumption about a built-in function |
| Hallucination | ~3–4% | Fabricating values/behavior ungrounded in the code |
| Lack of Verification/Logic | ~5–10% | Skipped checks, unverified assumptions, or contradictory steps |
External tool support corrects almost 60% of computation errors, but integration and logic-following errors are more resistant to this remedy (Abdollahi et al., 28 Nov 2025).
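Tool support for computation errors amounts to recomputing each arithmetic step outside the model. The sketch below is one hedged way to do that for integer expressions, using Python's `ast` module to avoid `eval`; `safe_eval` and `check_step` are hypothetical names, and the supported operator set is deliberately small.

```python
import ast
import operator
from typing import Tuple

# Whitelisted binary operators; anything else is rejected.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
}


def safe_eval(expr: str) -> int:
    """Evaluate an integer arithmetic expression without calling eval()."""

    def _eval(node: ast.AST) -> int:
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, int)
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        raise ValueError(f"unsupported expression: {expr!r}")

    return _eval(ast.parse(expr, mode="eval"))


def check_step(expr: str, claimed: int) -> Tuple[bool, int]:
    """Compare a model's claimed value for one step against the recomputed value."""
    actual = safe_eval(expr)
    return actual == claimed, actual


# A model's trace claims 17 * 23 + 4 = 385; the tool recomputes it.
ok, actual = check_step("17 * 23 + 4", 385)
ok_div, actual_div = check_step("7 // 2", 3)
```

A checker of this kind addresses only the arithmetic itself, which is consistent with the observation that skipped statements and control-flow mistakes are less amenable to tool support.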
In mathematical reasoning across low-resource languages, error patterns diverge by problem type: basic arithmetic is robust, but unit-conversion, logical deduction, and optimization tasks degrade sharply in Sinhala/Tamil, revealing latent translation-dependence and corpus limitations (Kishanthan et al., 16 Feb 2026).
6. Structural Taxonomy Synthesis and Mitigation Directions
Survey work synthesizes failure typologies along orthogonal axes of reasoning type (embodied, informal, formal) and error nature (fundamental, application-specific, robustness):
| Reasoning Type / Failure Mode | Fundamental Failures | Application-Specific Limitations | Robustness Issues |
|---|---|---|---|
| Informal (Intuitive) | Working memory collapse, cognitive bias | Theory-of-Mind, moral/norm instability | Order/framing effects, superficial pattern |
| Formal (Logical) | Reversal curse, compositional failure | Domain-specific logical/coding errors | MCQ order collapse, tokenization artifacts |
| Embodied | Commonsense grounding gaps | Spatial/affordance errors | Visual or code distractors, scene edits |
Recommended mitigations include architectural innovations (symbolic/neuro-symbolic modules, explicit rule/provenance tracing, paraconsistent logic), training on adversarial/counterfactual scenarios, tool-augmentation, and meta-cognitive reward shaping. Explicit scaffolding (prompted plans, agent-based feedback, or reflective checkpoints) and diverse representational curricula (graph/temporal/spatial reasoning tasks) are advocated to move LLMs beyond shallow pattern completion (Song et al., 5 Feb 2026, Kargupta et al., 20 Nov 2025).
7. Implications and Future Research
Research consistently identifies a sharp dichotomy: LLMs generalize smoothly across superficial semantic variants but remain brittle under logical, evidential, and representational perturbations that disrupt the information or reasoning structure. The deficiencies uncovered are not artifacts of particular architectures, data scales, or prompt templates; they arise from fundamental constraints in the underlying learning and representation regimes. Taxonomies have matured toward actionable frameworks capable of supporting comparative evaluation, error-driven model refinement, and domain transfer.
Open directions include extending fine-grained diagnostic frameworks to nested quantification, probabilistic and modal logics, causal graph reasoning, and temporally grounded scenarios; integrating multi-agent or cross-modal scaffolding; and unifying multiple consistency constraints within efficient learning and inference schemes (Bao et al., 6 Dec 2025, Cheng et al., 21 Feb 2025, Song et al., 5 Feb 2026).
This comprehensive view bridges formal, procedural, representational, and cognitive perspectives, enabling the LLM community to design, evaluate, and repair reasoning—addressing both performance at scale and reliability under stress.