
Large Language Model Reasoning Failures

Published 5 Feb 2026 in cs.AI, cs.CL, and cs.LG (arXiv:2602.06176v1)

Abstract: LLMs have exhibited remarkable reasoning capabilities, achieving impressive results across a wide range of tasks. Despite these advances, significant reasoning failures persist, occurring even in seemingly simple scenarios. To systematically understand and address these shortcomings, we present the first comprehensive survey dedicated to reasoning failures in LLMs. We introduce a novel categorization framework that distinguishes reasoning into embodied and non-embodied types, with the latter further subdivided into informal (intuitive) and formal (logical) reasoning. In parallel, we classify reasoning failures along a complementary axis into three types: fundamental failures intrinsic to LLM architectures that broadly affect downstream tasks; application-specific limitations that manifest in particular domains; and robustness issues characterized by inconsistent performance across minor variations. For each reasoning failure, we provide a clear definition, analyze existing studies, explore root causes, and present mitigation strategies. By unifying fragmented research efforts, our survey provides a structured perspective on systemic weaknesses in LLM reasoning, offering valuable insights and guiding future research towards building stronger, more reliable, and robust reasoning capabilities. We additionally release a comprehensive collection of research works on LLM reasoning failures, as a GitHub repository at https://github.com/Peiyang-Song/Awesome-LLM-Reasoning-Failures, to provide an easy entry point to this area.

Summary

  • The paper provides a taxonomy categorizing non-embodied vs. embodied reasoning failures in LLMs.
  • It details cognitive, formal, and multimodal deficiencies through benchmark tests and controlled perturbations.
  • The work calls for unified evaluation frameworks and architectural innovations to enhance AI reliability.

Reasoning Failures in LLMs: A Comprehensive Technical Survey

Introduction and Scope

The paper "LLM Reasoning Failures" (2602.06176) delivers a systematic and granular survey of reasoning deficiencies in LLMs, unifying previously fragmented research into a multifaceted taxonomy. The authors construct a two-dimensional framework, categorizing reasoning as non-embodied (informal/intuitive vs. formal/logical) and embodied (requiring physical or multimodal interaction); failure types are classified as fundamental (architecture-rooted), application-specific (domain-constrained), or robustness (lack of invariance to minor perturbations). This structure anchors a broad review that integrates behavioral findings, mechanistic analyses, and mitigation strategies, setting a reference point for future research in robust LLM reasoning (Figure 1).

Figure 1: A taxonomy of LLM reasoning failures along reasoning and failure-type axes; each cell groups failure modes by reasoning/robustness category.

Non-Embodied Informal Reasoning Failures

Cognitive Deficit and Bias Replication

The survey identifies the replication and, in many cases, amplification of human-like cognitive limitations in LLMs. Fundamental failures include limited working memory, weak inhibitory control, and poor cognitive flexibility—systematic phenomena observable in controlled psychological probes such as n-back tasks, A-not-B error, and card sorting [han2024incontextlearningelicittrustworthy, gong2024working]. LLMs also underperform in abstract reasoning, failing to generalize or conceptualize beyond pattern matching.

On cognitive biases, the models systematically exhibit confirmation bias, content effects, anchoring, order, and framing biases, as well as group-attribution and negativity biases, frequently exceeding levels observed in human judgment [lampinen2024language, itzhak2025plantedpretrainingswayedfinetuning]. These biases derive from pretraining corpora, the Transformer's causal masking, and RLHF-driven bias reinforcement. Interventions such as data curation, adversarial training, and post-hoc prompt adjustment yield only limited, context-sensitive improvements; the biases persist under prompt or domain shift.

Explicit and Implicit Social and Moral Reasoning

Research indicates that LLMs are inconsistent and fragile on explicit and implicit social reasoning, notably in Theory of Mind (ToM) tasks, emotional/affective cognition, and social/moral norm inference. For example, even state-of-the-art models exhibit marked performance drops under trivial perturbations to ToM challenges (false belief, perspective-taking, higher-order ToM) [ullman2023large, gu2024simpletom], with model-generated social or moral judgments fluctuating across cultures, prompt phrasings, or instructions [ji2024moralbench, bonagiri2024measuring]. Application-aligned tuning, prompt engineering, and model personality modulation remain ad hoc, domain-specific mitigations.

In explicit multi-agent collaborative settings, failures become more acute: LLM agents struggle with long-horizon planning, joint belief tracking, reliable communication, and safety verification, often due to limited working memory and deficient internal role/belief modeling [li2023theory, pan2025multiagent]. System-level approaches (static inspector/challenger agents, structured role assignment, robust context engineering) provide partial, engineering-heavy solutions, but generalization remains elusive.

Formal and Logical Reasoning Deficiencies

Logic in Language and Benchmarks

Core logical reasoning remains an area where LLMs suffer foundational breakdowns. The "reversal curse" (e.g., failing to infer "B is A" after training on "A is B") is robustly observed in unidirectionally trained transformers and does not diminish with model scale [berglund2023reversal]. Compositional reasoning, integrating multi-hop or multi-fact inferences, remains shallow even in state-of-the-art models; adding distractors or increasing compositional depth degrades performance substantially [xu2024largelanguagemodelscompositional, zhao2024exploringlimitationslargelanguage].
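The reversal curse lends itself to a simple directional probe: score the same set of facts queried forward and backward and compare accuracies. A sketch under assumed prompt templates and a deliberately crude substring-match scorer (neither is prescribed by the survey):

```python
from typing import Callable

def reversal_gap(facts: list[tuple[str, str]],
                 model: Callable[[str], str]) -> tuple[float, float]:
    """Score a model on forward ("A is B": ask about A) versus backward
    (ask about B) queries over (A, B) fact pairs.

    Returns (forward_accuracy, backward_accuracy); a large gap between
    the two is the signature of the reversal curse.
    """
    fwd = bwd = 0
    for a, b in facts:
        # Forward: prompt with A, expect B in the completion.
        if b.lower() in model(f"Complete the sentence: {a} is").lower():
            fwd += 1
        # Backward: prompt with B, expect A in the answer.
        if a.lower() in model(f"Who or what is {b}?").lower():
            bwd += 1
    n = len(facts)
    return fwd / n, bwd / n
```

Plugging in any prompt-to-answer function (an API call, a local model, or a stub) yields the forward/backward accuracy pair directly.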

Targeted analyses using synthetic and controlled natural logic datasets reveal brittle competence in converse, syllogistic, and causal relations and highlight overfitting and superficial pattern-matching [qi2023investigationllmsinefficacyunderstanding, ando2023evaluatinglargelanguagemodels, joshi2024llmspronefallaciescausal]. These deficiencies are amplified when benchmark structures are perturbed with option reorderings, semantic paraphrasing, or entity substitutions: performance drops sharply, exposing critical robustness failures [zheng2023large, pezeshkpour2023large]. The same trend appears in code generation, where minor syntactic or semantic modifications of docstrings or variable names elicit failure [micelibarone2023largerareharderfail, hooda2024largecodemodelsunderstand].

Arithmetic and Mathematical Reasoning

Fundamental arithmetic operations—counting, addition/multiplication, and compositional math word problems—remain unstable. For example, LLMs can miscount single-character tokens and falter on backward reasoning or compositional tasks [yehudai2024transformerscountn, shin2024largelanguagemodelslack], with basic errors persisting across application domains and masked by dataset leakage and overfitting [patel2021nlpmodelsreallyable]. Proposed architectural and training modifications (e.g., digit-order reversal, symbolic augmentation, process supervision) offer only incremental improvements; robust internalization appears out of reach for conventional transformer models.
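The digit-order-reversal mitigation mentioned above rewrites training targets so the least-significant digit comes first, matching the direction in which carries propagate during addition. A minimal sketch of the data transform (the exact prompt format is an illustrative assumption):

```python
def reverse_digits_example(a: int, b: int) -> tuple[str, str]:
    """Format an addition example with the answer's digits reversed,
    so a left-to-right decoder emits least-significant digits first,
    i.e., in the order carries are actually resolved."""
    prompt = f"{a}+{b}="
    reversed_answer = str(a + b)[::-1]
    return prompt, reversed_answer

def decode(reversed_answer: str) -> int:
    """Undo the reversal when reading a model's output."""
    return int(reversed_answer[::-1])
```

For example, 57+68 yields the target "521" (125 reversed); the decoder restores the conventional digit order at evaluation time.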

Embodied and Multimodal Reasoning Weaknesses

1D (Text-Only) Physical Commonsense/Scientific Reasoning

LLMs lack grounded physical world models. They fail simple object attribute and spatial relation questions, confuse basic physics principles, and cannot reason about changes or causality in physical scientific language settings [wang2023newtonlargelanguagemodels, Gregorcic_Pendrill_2023]. Integration with external simulation tools, structured knowledge finetuning, and explicit chain-of-thought prompting can reduce but not eliminate these failures.

2D (Perception-Based) and 3D (Embodied) Reasoning

In VLMs and embodied agents, even current state-of-the-art vision-language models are highly unreliable on basic anomaly detection, spatial arrangement, and physical commonsense in visual scenes [bitton2023breaking, ghaffari2024exploringfailurecasesmultimodal]. Models fail on object overlap, spatial attribute interpretation, and temporal sequence prediction, often over-relying on text tokens or training set frequency. In embodied (robotic/planning) contexts, models generate infeasible, unsafe, or nonsensical plans due to poor affordance estimation or spatial memory limitations—compounded by prompt vulnerability and the inability to recover from execution errors [dao2025alphamaze, li2025embodied, zhang2024badrobot].

Cross-Cutting Robustness and Failure Analysis

A key contribution of this work is the unification of robustness evaluation techniques across domains. Applying invariant-preserving perturbations (prompt structure, minor entity changes, context reordering) reliably exposes brittleness across cognitive, logical, and embodied domains. The survey emphasizes that robustness failures are not superficial but reflect deep architectural and objective misalignments; even with perturbation-based robustness training, overfitting and generalization failures persist.
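Brittleness under such perturbations can be summarized with a simple consistency statistic. A sketch using modal agreement across variants, which is one of several reasonable metrics rather than the one the survey prescribes:

```python
from collections import Counter
from typing import Callable

def robustness_score(variants: list[str],
                     model: Callable[[str], str]) -> float:
    """Fraction of meaning-preserving prompt variants on which the
    model's answer agrees with its modal (most common) answer.

    1.0 means the model is fully invariant to the perturbations;
    lower values quantify brittleness, independent of correctness."""
    answers = [model(v).strip().lower() for v in variants]
    _, modal_count = Counter(answers).most_common(1)[0]
    return modal_count / len(answers)
```

Because the score measures self-consistency rather than accuracy, it separates robustness failures from plain task failures, which is exactly the distinction the survey's failure-type axis draws.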

The absence of unified, evolving failure benchmarks and the lack of persistent tracking of open reasoning challenges inhibit longitudinal progress. The survey advocates for (i) dynamic, private, and event-driven benchmark suites, (ii) adversarial failure case injection for established tasks, and (iii) root-cause analysis that bridges empirical task failures with model internals.

Implications, Theoretical and Practical

The findings forcefully demonstrate that scaling LLMs and augmenting them with external tools or data curation cannot by themselves address intrinsic reasoning limitations. Many reasoning failures are rooted in pretraining objectives, architectural bottlenecks (notably causal attention and autoregressive decoding), and the absence of world-grounded learning. Attempts to align or debias models at the output level are limited by persistent and context-shifting biases inherited from both data and reinforcement learning procedures. While models can outperform humans on some static benchmarks, this paper presents strong evidence that LLMs are not robust, deliberative, or reliable reasoners, particularly in compositional, social, formal, and embodied settings.

These weaknesses have clear impact on AI safety, fairness, and deployment reliability. Critical applications—scientific inference, medical reasoning, robotic autonomy, ethical decision support—are especially threatened by brittle generalization and unrecognized robustness holes.

Future Directions

The systematic taxonomy and analysis in this survey suggest several clear research directions:

  • Mechanistic grounding: Connect behavioral errors to specific network mechanisms (e.g., attention head specialization, intermediate latent structure).
  • Persistent and evolving benchmarks: Develop and maintain longitudinal evaluation suites capturing current and emerging reasoning flaws.
  • Unified robustness evaluation: Standardize perturbation-based approaches across domains, encourage automatic generation and adversarial case synthesis.
  • Architectural innovation: Move beyond standard transformers; explore memory, recurrent, and closed-loop architectures; deepen symbolic/neuro-symbolic integrations.
  • Formal and verifiable reasoning: Expand the use of formal verification, automated theorem proving, and external formal systems for correctness guarantees.

Conclusion

"LLM Reasoning Failures" (2602.06176) establishes a comprehensive, multi-axis taxonomy of LLM reasoning breakdowns, substantiated by evidence across cognitive, logical, and embodied domains. The survey's integration of behavioral, mechanistic, and mitigation analyses highlights persistent, architecture-rooted, and benchmark-sensitive vulnerabilities that challenge both the reliability and generalizability of current models. Progress toward robust, general, and trustworthy reasoning models demands not only scaling and data curation but also fundamental changes in pretraining objectives, model architectures, and the development of unified, evolving evaluation ecosystems.


Explain it Like I'm 14

What is this paper about?

This paper looks at how LLMs — like the AI that writes, answers questions, or helps with homework — still make surprising reasoning mistakes, even on problems that seem easy to people. The authors collect and organize many studies about these failures, explain why they happen, and discuss what might help fix them. They also share a public list of related research on GitHub to help others explore this topic.

What questions does it ask?

The paper tackles three simple questions:

  • What kinds of reasoning do LLMs struggle with?
  • What kinds of failures do they have, and how are these failures different from each other?
  • Why do these failures happen, and what can we do to reduce them?

To answer these, the authors propose two helpful “maps”:

  • A reasoning map: reasoning can be non-embodied (done in the head, like logic and intuition) or embodied (about the physical world and actions). Non-embodied splits into informal (everyday intuition) and formal (logic, math, code).
  • A failure map: failures can be fundamental (built-in limits of the model), application‑specific (problems in certain domains like social or moral questions), or robustness issues (answers change when small, irrelevant details change).

How did the authors study it?

This is a survey paper. That means the authors:

  • Read and compared many research papers about LLM reasoning errors.
  • Built a clear categorization framework (the “maps” above) to organize all the different failure types.
  • For each failure type, they describe examples, likely causes, and known fixes.
  • Collected links to papers and benchmarks in a GitHub repository for easy access.

Think of it like a guidebook that pulls together scattered evidence and shows the bigger picture.

What did they find?

Here are the main takeaways, grouped by the type of reasoning.

1) Informal reasoning (everyday thinking)

  • Core mental skills are shaky:
    • Limited “working memory”: models lose track of details over long problems or when the situation changes midway.
    • Weak “inhibitory control”: they stick to obvious patterns even when those patterns don’t fit the new context.
    • Low “cognitive flexibility”: they struggle to switch rules quickly or adapt to new instructions.
    • Abstract reasoning is inconsistent: finding hidden rules or dealing with time and symbols can go wrong.
  • Human-like cognitive biases show up:
    • Order and anchoring effects: the first thing you tell the model can overly influence its answer.
    • Framing effects: the same question phrased differently can lead to different answers.
    • Confirmation and popularity biases: they prefer info that matches earlier context or what’s common online.
    • These biases come from training data, model design (like how Transformers read sequences), and human feedback processes.
  • Social intuition (without interaction) is brittle:
    • Theory of Mind (imagining what someone else knows, believes, or feels) is inconsistent. Even small wording changes can make models fail basic tests that young children pass.
    • Moral and social norms reasoning is unstable: answers can be contradictory across similar questions, languages, or cultures. This makes safety-sensitive uses risky.
  • Social reasoning with multiple agents (teams of AIs):
    • Long-term planning breaks down; communication can be unclear; models misread other agents’ goals.
    • Systems are vulnerable to small disruptions or bad actors unless carefully designed with checks, roles, and verification.

Why this happens: Text-only training (no real-world experience), next-token prediction goals (which reward pattern completion over deep thinking), attention limits, biased data, and alignment that can reinforce human biases.

What helps (partially): Chain-of-Thought prompting, better retrieval, fine-tuning, curated data, adversarial training, personality steering, stronger multi-agent protocols, and “context engineering.” These help but don’t fully solve the problems.

2) Formal reasoning (logic, math, and code)

  • Logic in natural language:
    • Reversal curse: a model that learns “A is B” may fail to infer “B is A” (for example, “Paris is the capital of France” vs. “The capital of France is Paris”). This reflects directional training and asymmetries in how models learn.
    • Compositional reasoning: models can handle simple parts but fail when combining them (even just two steps). This shows up in two-hop reasoning, multi-fact checks, and compound math tasks.
  • Logic hidden in benchmarks:
    • Small, logic-preserving changes should not affect answers, but they do. Examples: reordering multiple-choice answers, renaming variables in code, or swapping harmless details in math problems. Big score drops after such changes reveal robustness issues and possible overfitting to public datasets.
  • Arithmetic and mathematics:
    • Counting is harder than it seems for LLMs; tokenization and position encoding play a role.
    • Basic arithmetic breaks under more digits, especially in multiplication. Models often use shortcuts instead of real algorithms.
    • Math Word Problems (MWPs) combine reading with math; models can be thrown off by tiny edits, swapped known/unknowns, or unsolvable trick questions.

Why this happens: The model’s job is to predict the next word, not to “think” in steps like humans do. This encourages pattern-matching, not reliable rules. Architecture choices and training data design matter a lot.

What helps (partially): Training with step-by-step solutions, reversing digit order to focus attention where the math actually happens, editing model internals, adding graph-like reasoning paths, using external tools or neuro-symbolic modules, and improving benchmarks to be contamination-free and transformation-resistant.

3) Embodied reasoning (the physical world)

  • Text-only physical commonsense:
    • Models often misunderstand object properties (size, weight), spatial relations (above, inside), and simple physical laws (gravity, force).
    • Formal physics or scientific problem-solving shows gaps in multi-step reasoning and correct law use.
  • Why this is hard: LLMs learn from text, not from touching or seeing the world. Without real grounding, physical intuition is fragile.
  • What helps: Multimodal inputs (vision, audio), simulations, better scientific training data, and tools can help, but full physical understanding remains a major open challenge.

What is the impact of this work?

  • It gives researchers a shared language and structure to talk about LLM reasoning failures. That makes it easier to spot common patterns, compare results, and build stronger fixes.
  • It warns users and builders not to trust single benchmark scores. Robustness — staying right when the problem is phrased differently — matters.
  • It highlights where today’s models are least reliable: long, multi-step plans; subtle social or moral questions; small logic changes; counting and arithmetic; and anything needing physical intuition.
  • It points to promising directions: better data, smarter training, model architecture changes, multimodal grounding, system-level safeguards, and stronger, evolving benchmarks.

If you want to explore the cited works, the authors provide a curated GitHub list: https://github.com/Peiyang-Song/Awesome-LLM-Reasoning-Failures

Final takeaway

LLMs can do amazing things, but they still make predictable reasoning mistakes — sometimes on very simple tasks, and sometimes just because a question was reworded. This paper maps out those weaknesses, explains why they happen, and gathers the best ideas for fixing them. The big message: to build AI you can rely on, you must test for robustness, understand failure modes, and design models and systems that think more carefully, not just more loudly.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a single, consolidated list of concrete gaps and open questions the paper leaves unresolved, organized via tags to aid actionability and prioritization.

  • [Methodology] Lack of standardized, quantitative definitions and metrics for “reasoning failure” types (fundamental vs application-specific vs robustness), making cross-paper comparisons and meta-analysis difficult.
  • [Methodology] No unified evaluation protocol to disentangle causality of failures (data artifacts vs architecture vs alignment/RLHF vs inference-time prompt/context), limiting targeted mitigation.
  • [Methodology] Absence of longitudinal tracking across model updates (e.g., model/version churn) to determine which failures persist, attenuate, or emerge over time.
  • [Methodology] Insufficient contamination auditing and reporting standards for benchmarks used to diagnose failures, leaving robustness and “true generalization” ambiguous.
  • [Methodology] No community-maintained, perturbation-based evaluation suite with standardized logic-preserving edits across tasks (math, code, commonsense, ToM, etc.) to stress-test stability.
  • [Methodology] Limited mechanistic/causal evidence linking specific internal components (layers, heads, circuits) to each failure mode; most claims are correlational.
  • [Methodology] Missing analyses of trade-offs (e.g., debiasing vs helpfulness, safety vs robustness, CoT vs latency/cost) when mitigating one failure potentially worsens others.
  • [Cognitive] No agreed-upon, task-agnostic probes for working memory, inhibitory control, and cognitive flexibility in LLMs that allow consistent scoring across model families and sizes.
  • [Cognitive] Under-specified links between architectural choices (e.g., attention variants, recurrence, external memory) and improvements in executive functions; few controlled ablations.
  • [Cognitive] Lack of curricula or training signals explicitly designed to grow abstract reasoning and temporal abstraction in a measurable, transferable way.
  • [Bias] Insufficient disentanglement of bias sources: how much arises from pretraining corpus vs positional encodings/causal masking vs RLHF/finetuning vs decoding strategies?
  • [Bias] Poor coverage of cross-lingual/cross-domain cognitive biases (e.g., anchoring, framing) and how they differ with script/tokenization schemes.
  • [Bias] No standardized stress-tests for bias persistence and “rebound” under context shifts, jailbreaks, or adversarially framed prompts.
  • [Bias] Minimal mechanistic interpretability on how and where framing/order/anchoring biases are represented and can be reliably edited without collateral damage.
  • [ToM] Limited evidence that ToM-like abilities survive small perturbations and transfer to interactive, dynamic multi-turn settings; need for benchmarks that jointly test stability and generalization.
  • [ToM] Open question whether improved ToM from prompting (e.g., CoT) reflects genuine internal modeling or surface-level decomposition; few tests isolate “latent” ToM.
  • [Emotion/EI] Sparse evaluation of emotional intelligence across cultures and modalities (text, audio, vision), and poor grounding of affect recognition to behaviorally consistent outputs.
  • [Norms/Morals] No principled framework for learning/calibrating culturally sensitive moral norms that remain consistent across tasks and languages without exacerbating bias.
  • [Norms/Morals] Lack of mitigation approaches that yield robust, generalizable moral reasoning beyond task- or dataset-specific gains (RLHF-like “surface compliance” remains brittle).
  • [Norms/Morals] Need for adversarial robustness evaluations specifically tailored to moral/ethical queries (e.g., norm ambiguities, trap questions, oblique jailbreaks).
  • [MAS] Missing standardized, reproducible long-horizon multi-agent benchmarks with clear termination, verification, and adversarial-agent protocols.
  • [MAS] Unclear best practices for internal belief-tracking and hypothesis-testing in agents (what representations/mechanisms actually improve coordination vs noise?).
  • [MAS] Insufficient evidence that context-engineering, inspectors/critics, or role-structuring generalize across domains without heavy hand-engineering.
  • [MAS] No formal safety guarantees or theoretical analyses for failure cascades in MAS (e.g., bounds on escalation, convergence, or recovery under perturbations).
  • [Logic-NL] Reversal curse: open questions around universality across families (decoder-only vs encoder-decoder vs recurrent hybrids) and whether bidirectional objectives fully resolve it at scale.
  • [Logic-NL] Limited theory on when and why reversal-sensitive facts remain asymmetric after anti-causal/bidirectional pretraining or post-hoc finetuning.
  • [Logic-NL] Compositional reasoning: lack of validated measures for “latent compositionality” (vs CoT-style explicit steps) and how to train it reliably without large compute.
  • [Logic-NL] Few studies map specific attention/circuit patterns to composition failures and demonstrate stable, cross-dataset patching/editing that endures distribution shift.
  • [Logic-Specific] Limited, comprehensive coverage of targeted logical relations (e.g., quantifiers, modal logic, counterfactuals) with contamination-free datasets and cross-lingual variants.
  • [Benchmarks] No widely adopted “meta-benchmark” that applies uniform, logic-preserving transformations to popular tasks and reports a consolidated robustness score.
  • [Benchmarks] Sparse evaluation of sensitivity to prompt verbosity, narrative perspective, and distractors under controlled, factorial designs.
  • [Benchmarks] Limited assessment frameworks to separate real reasoning improvements from test-taking artifacts or memorization under evolving public benchmarks.
  • [Arithmetic] Counting and digit-level manipulations: need architectures and positional encodings with provable inductive biases for discrete operations; tokenization confounds remain unresolved.
  • [Arithmetic] Lack of diagnostics that distinguish algorithmic learning from pattern-matching (e.g., last-digit vs first-digit inconsistencies) across scales and languages.
  • [Arithmetic] Overreliance on tool-use masks core internal arithmetic deficits; need tool-free evaluations and training that grow durable internal algorithms.
  • [Temporal/Units] Persistent failures in temporal/units reasoning (e.g., calendars, time zones, unit conversions) lack unified tests and targeted architectural/representation fixes.
  • [MWPs] Insufficient error decomposition for MWPs (parsing vs planning vs computation vs verification) to pinpoint where reasoning breaks and match mitigations accordingly.
  • [MWPs] Limited, reliable detection and handling of unsolvable or ill-posed MWPs; models still “force” solutions—robust abstention criteria and proofs-of-unsolvability are underdeveloped.
  • [MWPs] Automated verification of reasoning chains remains fragile; need benchmarks with trustworthy step-level labels and methods to evaluate faithfulness vs fluency.
  • [Embodied-1D] Physical commonsense: lack of grounding signals and standardized text-only tasks that require real-world regularities beyond language priors (with contamination checks).
  • [Embodied-2D/3D] Sparse, standardized 2D/3D embodied benchmarks spanning spatial relations, affordances, causal dynamics, and planning with sim-to-real validation.
  • [Embodied-2D/3D] No consensus on how to attribute failure to language vs perception vs control in multimodal systems; need modular, ablated pipelines and shared diagnostics.
  • [Embodied-2D/3D] Open questions on data requirements and safety for physical grounding: how much multimodal/RL experience is necessary and how to collect it ethically at scale?
  • [Embodied-2D/3D] Few architectural proposals with formal properties for spatial reasoning (e.g., equivariance, object-centricity) evaluated against the failure taxonomy in this survey.
  • [Mitigation] CoT/RAG/steering methods: limited evidence of durability under perturbations and task transfer; need principled evaluations of “how much” robustness they actually add.
  • [Mitigation] Data augmentation for symmetry/compositionality remains expensive and domain-specific; open need for general, model-agnostic procedures with proven generalization.
  • [Mitigation] Activation steering and circuit editing lack standardized safety checks (distributional side-effects, degradation on non-targeted abilities) and reproducibility guidelines.
  • [Mitigation] Few formal guarantees or bounds linking mitigation strategies to reductions in specific failure probabilities under defined perturbation classes.
  • [Taxonomy] The proposed 2-axis taxonomy (reasoning type × failure type) is not empirically validated across a broad set of models/tasks; need inter-annotator studies and predictive utility tests (e.g., does category predict effective mitigation?).
  • [Synthesis] No quantitative meta-analysis estimating prevalence/severity of each failure mode across models/benchmarks, preventing evidence-based prioritization.
  • [Repository] The curated GitHub list lacks a documented coverage protocol (inclusion/exclusion, periodic audits, redundancy checks), impeding completeness and bias assessment.

Glossary

  • Abstract reasoning: The ability to identify patterns and relationships among concepts to draw generalizable conclusions beyond specific instances. "Another key aspect is abstract reasoning \citep{guinungco2020abstract}, the cognitive ability to recognize patterns and relationships in intangible concepts."
  • Activation steering: A technique that modifies internal neural activations to guide model behavior toward desired outcomes. "Current efforts to address these limitations mainly include prompt-based interventions \citep{chakraborty2025structuredmoralreasoninglanguage, ma2023letsthoughtexperimentusing}, internal activation steering \citep{tlaie2024exploring, turner2023steering}, and direct fine-tuning on curated moral reasoning benchmarks \citep{senthilkumar2024finetuninglanguagemodelsethical, karpov2024inducinghumanlikebiasesmoral}."
  • Affective bias: Systematic influence of emotional content on judgments and reasoning, often leading to skewed outputs. "Furthermore, current models exhibit deficits in emotional reasoning. This includes difficulties in emotional intelligence (EI) \citep{sabour2024emobench, hu2025emobench, amirizaniani2024llms, vzorinab2024emotional}, susceptibility to affective bias \citep{chochlakis2024strong}, and limited understanding of cultural variations in emotional expression and interpretation \citep{havaldar2023multilingual}."
  • Anchoring bias: The tendency to rely heavily on initial information (the “anchor”) when making decisions, even when subsequent information suggests adjustment. "LLMs are highly sensitive to the order in which information is given, exhibiting order bias \citep{koo2023benchmarking, pezeshkpour2023large, jayaram2024mitigating, guan2025order, cobbina2025show}, and show anchoring bias \citep{lieder2018anchoring, rastogi2022deciding}, where early inputs disproportionately shape their reasoning \citep{lou2024anchoring, o2025anchoring, huang2025empirical, wang2025assessing}."
  • Application-specific limitations: Failures tied to particular domains where models underperform relative to human expectations, despite general competence elsewhere. "In contrast, application-specific limitations reflect shortcomings tied to particular domains of importance, where models underperform despite human expectations of competence."
  • Belief tracking: Maintaining and updating representations of other agents’ beliefs or states during interaction or reasoning. "Mitigation research thus targets (i) richer internal models like belief tracking and hypothesis testing \citep{li2023theory,cross2024hypothetical}..."
  • Causal inference: Determining cause-effect relationships from data and reasoning, beyond mere correlation. "Using this approach, studies reveal LLM weaknesses in specific types of logic such as converse binary relations \citep{qi2023investigationllmsinefficacyunderstanding}, syllogistic reasoning \citep{ando2023evaluatinglargelanguagemodels}, causal inference \citep{joshi2024llmspronefallaciescausal}, and even shallow yes/no questions \citep{clark2019boolqexploringsurprisingdifficulty}."
  • Causal masking: The Transformer’s mechanism that restricts attention to prior tokens to ensure autoregressive generation, potentially inducing order-based biases. "Second, architectural features of the model -- such as the Transformer’s causal masking -- introduce predispositions toward order-based biases independent of data \citep{wu2025emergence, dufter2022position}."
  • Chain-of-Thought (CoT): A prompting strategy that elicits step-by-step reasoning in the output to improve problem-solving. "Efforts to enhance these skills correspondingly include advanced prompting like Chain-of-Thought (CoT) \citep{wei2022chain}, retrieval augmentation \citep{xu2023retrieval}, fine-tuning with deliberately injected interference \citep{li2022largelanguagemodelscontrollable}, multimodality \citep{hao2025visual}, and architectural innovations to mimic human attention mechanisms \citep{wu2024cognitive}."
  • Cognitive biases: Systematic deviations from rational judgment caused by heuristics, limited resources, or contextual influences. "Cognitive biases -- systematic deviations from rational judgment -- are well-studied in human reasoning \citep{tversky1974judgment, tversky1981framing}."
  • Cognitive flexibility: The ability to switch between tasks or adapt to new rules efficiently. "Lastly, cognitive flexibility, the skill of adapting to new rules or switching tasks efficiently, remains a challenge, especially in rapid task switching and adaptation to new instructions \citep{kennedy2024cognitive}."
  • Compositional reasoning: Combining multiple pieces of information or sub-results into a coherent overall inference. "Compositional reasoning requires combining multiple pieces of knowledge or arguments into a coherent inference."
  • Confirmation bias: Favoring information that confirms prior beliefs or context, while discounting contradictory evidence. "Regarding content, LLMs struggle more with abstract or unfamiliar topics -- a phenomenon known as the “content effect” \citep{lampinen2024language} -- and tend to favor information that aligns with prior context or assumptions, reflecting human-like confirmation bias \citep{o2025confirmation, shi2024argumentative, malberg2024comprehensive, wan2025unveilingconfirmationbiaschainofthought, zhu2024conformity}."
  • Context engineering: Systematically optimizing the entire input payload (context) provided to an LLM during inference for robustness and performance. "In parallel, the recent rise of context engineering \citep{mei2025survey} -- which focuses on a systematic optimization of the entire information payload fed to an LLM during inference -- is increasingly seen as a more robust alternative to traditional prompt engineering in MAS."
  • Counting: Discrete enumeration of elements; a basic numerical capability that current LLMs often fail to perform reliably. "Despite its simplicity, counting poses a notable fundamental challenge for LLMs \citep{xu2024llmgeniusparadoxlinguistic, chang2024languagemodelsneedinductive, zhang-he-2024-large, fu2024largelanguagemodelsllms, conde2025can, yehudai2024transformerscountn}, even the reasoning ones \citep{malek2025frontierllmsstrugglesimple}..."
  • Embodied reasoning: Reasoning grounded in physical interaction with environments, relying on spatial intelligence and real-time feedback. "Embodied reasoning depends on physical interaction with environments, fundamentally relying on spatial intelligence and real-time feedback \citep{shapiro2019embodied, barsalou2008grounded}."
  • Emotional Intelligence (EI): The capability to perceive, understand, and manage emotions in oneself and others. "Furthermore, current models exhibit deficits in emotional reasoning. This includes difficulties in emotional intelligence (EI) \citep{sabour2024emobench, hu2025emobench, amirizaniani2024llms, vzorinab2024emotional}..."
  • Executive functions: Core cognitive processes—working memory, inhibitory control, and cognitive flexibility—essential for goal-directed reasoning. "A prominent example is the set of core executive functions -- working memory \citep{baddeley2020working}, inhibitory control \citep{diamond2013executive,williams1999development}, and cognitive flexibility \citep{canas2006cognitive} -- essential in human reasoning \citep{diamond2013executive}."
  • False belief tasks: Assessments of Theory of Mind where one must reason about others holding beliefs that are incorrect. "Typically emerging in early childhood with milestones like passing false belief tasks (understand that others' beliefs may be incorrect or different) \citep{wimmer1983beliefs}..."
  • Framing effects: Variations in decisions or outputs caused by differences in how equivalent information is presented. "Framing effects further influence outputs: logically equivalent but differently phrased prompts can lead to different results \citep{jones2022capturing, suri2024large, nguyen2024human, lior2025wildframe, robinson2025framing, shafiei2025more}."
  • Fundamental failures: Model shortcomings intrinsic to architectures or training that broadly affect diverse tasks. "Fundamental failures are usually intrinsic to LLM architectures, manifesting broadly and universally across diverse downstream tasks."
  • Graph-structured reasoning path data: Training data that encodes reasoning steps as graph structures to guide logical inference. "while \citet{zhou2024enhancinglogicalreasoninglarge} enhances training with graph-structured reasoning path data, similar to distilling CoT reasoning process into training data \citep{yu2024distilling21}."
  • Group attribution bias: The tendency to ascribe characteristics or outcomes to group membership rather than individual actions. "Social cognitive biases also influence LLM outputs, including group attribution bias \citep{hamilton1976illusory, allison1985group, raj2025talent} and negativity bias \citep{rozin2001negativity}..."
  • Inhibitory control: The ability to suppress impulsive or default responses when inappropriate. "Inhibitory control -- the ability to suppress impulsive or default responses when contexts demand -- is also weak in LLMs, with them often sticking to previously learned patterns even when contexts shift \citep{han2024incontextlearningelicittrustworthy, patel2025deficient}..."
  • Large Reasoning Models (LRMs): Advanced LLMs optimized or specialized for complex reasoning tasks. "Beyond perturbations, a rising approach utilizes meta-theorems such as the Monadic Second-Order logic from CS theory to synthesize algorithmic coding problems at scale \citep{beniamini2025formulaonemeasuringdepthalgorithmic}, posing a significant challenge even for state-of-the-art large reasoning models (LRMs) \citep{xu2025largereasoningmodelssurvey}."
  • Logic-preserving transformations: Modifications to benchmark items that change surface form while keeping the underlying logical task identical. "these studies introduce logic-preserving transformations based on particular task structures, such as reordering options in multiple-choice questions (MCQs) \citep{zheng2023large, pezeshkpour2023large, alzahrani2024benchmarks, gupta2024changinganswerorderdecrease, ni2024trainingbenchmarkneed}..."
  • Math Word Problems (MWPs): Natural-language math tasks requiring translation into formal operations and multi-step reasoning. "MWPs exemplify this, as their logic can be readily abstracted into reusable templates."
  • Monadic Second-Order logic: A powerful logical framework allowing quantification over sets, used to formally describe and synthesize algorithmic problems. "Beyond perturbations, a rising approach utilizes meta-theorems such as the Monadic Second-Order logic from CS theory to synthesize algorithmic coding problems at scale \citep{beniamini2025formulaonemeasuringdepthalgorithmic}..."
  • Multi-Agent Systems (MAS): Systems composed of interacting agents that coordinate to solve tasks via communication and planning. "In Multi-Agent Systems (MAS), explicit social reasoning is the capacity of AI systems to collaboratively plan and solve complex tasks, an area challenging for current LLMs."
  • Multi-Head Self-Attention (MHSA): A Transformer mechanism with multiple attention heads enabling parallel focus on different token relations. "Toward this, \citet{li2024understandingpatchingcompositionalreasoning} identifies faulty implicit reasoning in mid-layer multi-head self-attention (MHSA) modules and edits them..."
  • Negativity bias: The tendency to weigh negative information more heavily than positive information in judgment. "Social cognitive biases also influence LLM outputs, including group attribution bias \citep{hamilton1976illusory, allison1985group, raj2025talent} and negativity bias \citep{rozin2001negativity}..."
  • Neuro-symbolic augmentations: Methods that integrate neural networks with symbolic components to enable exact or structured reasoning. "and neuro-symbolic augmentations that enable internal arithmetic reasoning \citep{dugan2024occamllmfastexactlanguage}."
  • Non-embodied reasoning: Cognitive processes that do not require physical interaction with environments. "Non-embodied reasoning comprises cognitive processes not requiring physical interaction with environments."
  • Object affordance: The possible actions or interactions that an object allows, given its properties and context. "Failures of LLMs include lack of knowledge about object attributes (e.g., size, weight, softness) \citep{wang2023newtonlargelanguagemodels, liu2022things, shu2023posqa, Kondo_Sugawara_Aizawa_2023}, spatial relationships (e.g., above, inside, next to) \citep{liu2022things, shu2023posqa, Kondo_Sugawara_Aizawa_2023}, simple physical laws (e.g., gravity, motion, and force) \citep{Gregorcic_Pendrill_2023}, and object affordance (possible actions/reactions an object can make) \citep{Aroca-Ouellette_Paik_Roncone_Kann_2021, adak2024text2afford, pensa2024multi}."
  • Order bias: Sensitivity of outputs to the sequence in which information or options are presented. "LLMs are highly sensitive to the order in which information is given, exhibiting order bias \citep{koo2023benchmarking, pezeshkpour2023large, jayaram2024mitigating, guan2025order, cobbina2025show}..."
  • Positional encoding: A technique in Transformers that injects token position information into embeddings to enable order-aware attention. "Although the failures manifest at the application level, much work suggests that they originate primarily from architectural and representational limits, including tokenization \citep{zhang2024countingabilitylargelanguage, shin2024largelanguagemodelslack}, positional encoding \citep{chang2024languagemodelsneedinductive}, and training data composition \citep{allenzhu2024physicslanguagemodels31}..."
  • Proactive interference: The phenomenon where earlier information disrupts the learning or retrieval of newer information. "In particular, LLMs suffer from ``proactive interference'' to a much larger extent than humans, where earlier information significantly disrupts retrieval of newer updates \citep{wang2025unableforgetproactivelnterference}."
  • Reinforcement Learning from Human Feedback (RLHF): A training process where models are optimized to produce outputs preferred by human raters. "Third, alignment processes like Reinforcement Learning from Human Feedback (RLHF) amplify biases by aligning model behavior with human raters who are themselves biased \citep{sumita2025cognitive, perez2023discovering}."
  • Retrieval augmentation: Enhancing model reasoning by fetching and incorporating relevant external information during inference. "Efforts to enhance these skills correspondingly include advanced prompting like Chain-of-Thought (CoT) \citep{wei2022chain}, retrieval augmentation \citep{xu2023retrieval}..."
  • Reversal curse: A failure where models trained on “A is B” do not infer the logically equivalent “B is A.” "A representative failure of LLMs is the reversal curse: despite being trained on ``A is B,'' models often fail to infer the equivalent ``B is A'' -- a trivial bidirectional equivalence for humans."
  • Robustness vulnerabilities: Weaknesses where model performance degrades under minor, logic-preserving variations in inputs or tasks. "While there has been clear progress from early models like GPT-3 -- which largely failed at ToM tasks -- to newer models such as GPT-4o and reasoning models like o1-mini, which can solve many standard ToM tests, their underlying reasoning remains brittle under simple perturbations \citep{gu2024simpletom, zhou2023far}... such failures often result in significant robustness vulnerabilities."
  • Syllogistic reasoning: Deductive reasoning using structured premises (syllogisms) to derive logical conclusions. "Using this approach, studies reveal LLM weaknesses in specific types of logic such as converse binary relations \citep{qi2023investigationllmsinefficacyunderstanding}, syllogistic reasoning \citep{ando2023evaluatinglargelanguagemodels}..."
  • Theory of Mind (ToM): The capacity to attribute and reason about beliefs, intentions, and emotions of oneself and others. "ToM is the cognitive ability to attribute mental states -- beliefs, intentions, emotions -- to oneself and others, and to understand that others' mental states may differ from one's own \citep{frith2005theory}."
  • Tokenization: Segmenting text into tokens for model processing, which influences representation and downstream abilities. "Although the failures manifest at the application level, much work suggests that they originate primarily from architectural and representational limits, including tokenization \citep{zhang2024countingabilitylargelanguage, shin2024largelanguagemodelslack}, positional encoding \citep{chang2024languagemodelsneedinductive}, and training data composition \citep{allenzhu2024physicslanguagemodels31}..."
  • Two-hop reasoning: Inferring conclusions by combining exactly two related facts, often across documents or contexts. "Studies show systematic failures in basic two-hop reasoning -- combining only two facts across documents -- and even worsening performance with increased compositional depth and the addition of distractors \citep{zhao2024exploringlimitationslargelanguage, xu2024largelanguagemodelscompositional, guo2025llmsperformtwohopreasoning}."
  • Zipf's law: A statistical distribution where the frequency of items is inversely proportional to their rank, impacting data coverage and learning. "\citet{golovneva2024reversetrainingnursereversal} further argues that scaling alone cannot resolve the issue due to Zipf's law \citep{Newman_2005}."
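
The causal-masking entry above can be sketched in a few lines of NumPy; this is an illustrative toy, not any specific model's implementation. Attention logits are masked so each position attends only to itself and earlier positions, which is the mechanism the survey links to order-based biases:

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Lower-triangular mask: position i may attend only to positions <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_attention_weights(logits: np.ndarray) -> np.ndarray:
    """Apply the causal mask to raw attention logits, then softmax each row."""
    mask = causal_mask(logits.shape[-1])
    masked = np.where(mask, logits, -np.inf)       # future positions get -inf
    exp = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

weights = masked_attention_weights(np.zeros((4, 4)))
# Row 0 attends only to token 0; row 3 spreads attention over tokens 0..3.
```

With uniform (zero) logits, later rows distribute attention evenly over their visible prefix, making the asymmetry between early and late tokens explicit.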
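
The positional-encoding entry above is commonly instantiated as the sinusoidal scheme from the original Transformer paper; a minimal NumPy sketch of that one scheme (the works cited in the survey analyze several encodings, not only this one):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal encoding: even dimensions use sin, odd dimensions use cos."""
    positions = np.arange(seq_len)[:, None]               # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]              # (1, d_model / 2)
    angle_rates = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle_rates)
    pe[:, 1::2] = np.cos(angle_rates)
    return pe

pe = sinusoidal_positional_encoding(seq_len=16, d_model=8)
```

Each position receives a distinct vector of phases, which is the only order signal available to an otherwise permutation-invariant attention layer.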
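
The reversal-curse entry above suggests a simple evaluation probe: derive forward and backward questions from the same fact and compare accuracy in each direction. A hypothetical sketch; `reversal_probe_pairs` and the phrasing templates are illustrative, not from the survey, and the example fact is the one commonly used in the reversal-curse literature:

```python
def reversal_probe_pairs(facts):
    """Build forward/backward question pairs from (entity_a, relation, entity_b) triples.

    A model that has truly learned "A is B" should answer both directions;
    the reversal curse predicts much lower accuracy on the backward probes.
    """
    pairs = []
    for a, relation, b in facts:
        forward = (f"{a} {relation} whom?", b)    # seen direction: "A is B"
        backward = (f"Who {relation} {b}?", a)    # reversed direction: "B is A"
        pairs.append({"forward": forward, "backward": backward})
    return pairs

probes = reversal_probe_pairs(
    [("Mary Lee Pfeiffer", "is the mother of", "Tom Cruise")]
)
```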
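
Zipf's law, invoked above in the argument that scaling alone cannot fix the reversal curse, can be observed directly by building a rank-frequency table: most entities sit in a long tail with very few mentions. A toy sketch:

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, count) pairs sorted by descending frequency."""
    counts = Counter(tokens).most_common()
    return [(rank + 1, count) for rank, (_, count) in enumerate(counts)]

# Under Zipf's law, count(rank) ~ count(1) / rank, so tail items stay rare
# no matter how much text is collected.
corpus = ["the"] * 60 + ["of"] * 30 + ["model"] * 20 + ["reversal"] * 2 + ["curse"]
rf = rank_frequency(corpus)
```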

Practical Applications

Immediate Applications

  • Robustness-driven LLM audits for deployment (software, healthcare, finance, legal, customer support)
    • Use the survey’s taxonomy and logic-preserving perturbations (e.g., option reordering, premise/event rearrangement, harmless entity swaps) to stress-test models before release, catching fundamental, application-specific, and robustness failures; Tools/Workflows: metamorphic test generators for math/coding/QA, CI/CD gates that fail on instability, private/live internal benchmarks; Assumptions/Dependencies: access to representative task data, ability to run large test suites, guarding against benchmark contamination.
  • Prompt and context engineering “linters” to reduce cognitive biases (software, education, enterprise productivity)
    • Encode known bias risks (anchoring, order, framing, verbosity, distractions) into automated checks and recommendations that restructure prompts/contexts and enforce verification steps (e.g., Chain-of-Thought, self-critique); Tools/Workflows: prompt-linting plugins, context optimizers, structured templates with rationale and cross-checks; Assumptions/Dependencies: model supports longer contexts/rationales, cost/latency budget for deliberation.
  • Human-in-the-loop routing for weak-reasoning regimes (science, education, enterprise analytics)
    • Based on known deficits (counting, arithmetic, physics reasoning, ToM, moral consistency), route sub-tasks to tools/humans: calculators/CAS for math, retrieval for factual gaps, policy checklists for ethics; Tools/Workflows: orchestration layer that classifies queries by failure category and routes to tools/approvers; Assumptions/Dependencies: tool availability, integration infrastructure, clear escalation criteria.
  • Multi-agent safety scaffolding (software automation, operations)
    • Reduce MAS failures via structured roles, belief-tracking, mandatory verification rounds, inspector/challenger agents, and explicit termination checks; Tools/Workflows: reusable MAS protocols, “watchdog” agents, conversation state graphs; Assumptions/Dependencies: tolerance for added latency/cost, careful system design to avoid error cascades.
  • Moral/ethics and social-norm consistency checks (platform policy, HR, healthcare triage, legal)
    • Apply standardized ToM/social-norm/moral benchmarks with paraphrase/language variations to detect inconsistencies and jailbreak susceptibility; Tools/Workflows: compliance checklists mapped to the survey’s taxonomy, red-teaming suites seeded with known failure patterns; Assumptions/Dependencies: organizational policy clarity, multilingual evaluation capacity, reviewer training.
  • Safer model procurement and vendor evaluation (policy, enterprise IT governance)
    • Require vendors to report robustness under perturbations, reversal curse tests, compositional reasoning checks, and bias audits; Tools/Workflows: RFP templates and model cards that include failure profiles and mitigation plans; Assumptions/Dependencies: willingness of vendors to disclose, standardized test protocols.
  • Education and AI literacy modules (academia, K–12, professional training)
    • Use curated examples (e.g., reversal curse, framing effects, counting errors) to teach limitations and safe usage patterns; Tools/Workflows: classroom kits, lab assignments employing perturbation-based evaluations and CoT; Assumptions/Dependencies: access to models or recorded outputs, instructor preparedness.
  • Content moderation and communications hygiene (trust & safety, marketing, newsrooms)
    • Counteract negativity/popularity biases and confirmation bias via prompt steering and two-pass generation (neutral reframe + fact-check pass); Tools/Workflows: activation/post-processing steering, policy-tuned prompts, reviewer dashboards that flag bias-inducing frames; Assumptions/Dependencies: model supports steerability, clear editorial policies.
  • Coding and math evaluation hardening (software engineering, online judges, edtech)
    • Adopt docstring/identifier perturbations, control-flow swaps, numeric resampling, and solvability checks to prevent overfitting and assess true reasoning; Tools/Workflows: robustness harnesses for code (renaming, syntax edits, adversarial mutations) and math (entity swaps, blank-inversions, variable roles); Assumptions/Dependencies: execution sandboxes, contamination-free datasets.
  • Product UX for “deliberate mode” and uncertainty surfacing (software products, consumer apps)
    • Add UI options for explicit reasoning steps, alternative framings, and automatic rechecks under perturbed prompts; Tools/Workflows: CoT toggles, multi-sample agreement, result stability meters; Assumptions/Dependencies: user tolerance for latency, support for rationale generation.
  • Domain-specific data curation for debiasing and symmetry (model devs, MLOps)
    • Integrate bidirectional fact variants to mitigate reversal curse, graph-structured reasoning paths for compositionality, and adversarial/perturbed samples for robustness; Tools/Workflows: data pipelines that auto-generate reversals, semantic unit permutations, and graph-labeled examples; Assumptions/Dependencies: data rights, compute for fine-tuning, evaluation against leakage.
  • Operations checklists for daily use of chatbots (daily life, SMBs)
    • Practical habits: avoid leading questions, test answer stability via minor rephrasings, verify math with calculators, and ask the model to list assumptions; Tools/Workflows: lightweight browser extensions that auto-perturb prompts and compare answers; Assumptions/Dependencies: user consent to extra steps, minimal UI friction.

Long-Term Applications

  • Architecture-level improvements for core cognitive skills (AI R&D, foundational model labs)
    • Design models with enhanced working memory, inhibitory control, and cognitive flexibility (e.g., new attention mechanisms, memory modules) and improved positional/token representations to fix counting/arithmetic; Tools/Workflows: hybrid attention/memory blocks, alternative positional encoding, curriculum learning; Assumptions/Dependencies: significant research cycles, training compute, rigorous ablation frameworks.
  • Bidirectional/contrastive training regimes to overcome directional asymmetries (model training)
    • Systematic objectives that enforce logical symmetry (A→B and B→A) and compositional consistency across tasks; Tools/Workflows: contrastive data pairs, loss terms for symmetry/compositional constraints, verification-by-edit objectives; Assumptions/Dependencies: large-scale curated data, risk of catastrophic interference, careful evaluation to avoid shortcut learning.
  • Neuro-symbolic reasoning modules “under the hood” (software, finance, engineering, science)
    • Integrate exact solvers (arithmetic engines, theorem provers, program analyzers) as internal components rather than external calls, enabling reliable reasoning where LLMs are weak; Tools/Workflows: differentiable interfaces to solvers, internal tool-selection policies, caching of verified sub-results; Assumptions/Dependencies: latency/complexity overheads, compatibility with training/inference stacks.
  • Embodied and multimodal grounding for physical commonsense (robotics, AR/VR, logistics)
    • Train models with sensorimotor traces, physics simulators, and vision-language-robotics curricula to reduce physical reasoning gaps; Tools/Workflows: simulation-to-real pipelines, multimodal datasets with affordance labels, RL with language supervision; Assumptions/Dependencies: high-fidelity simulators, safety constraints, domain transfer challenges.
  • Standardized societal alignment batteries and governance (policy, regulators, standards bodies)
    • Sector-wide certifications that test ToM robustness, social norm diversity, and moral consistency under paraphrase/multilingual shifts; Tools/Workflows: certification suites maintained as private/live benchmarks, third-party auditors, reporting standards; Assumptions/Dependencies: consensus on norms/metrics, handling cultural variance, legal frameworks.
  • Resilient multi-agent infrastructures with formal guarantees (enterprise automation, defense, autonomous systems)
    • MAS architectures with formalized protocols, belief/goal modeling, verification phases, and termination proofs; Tools/Workflows: protocol DSLs, runtime monitors, theorem-proved coordination templates, “defense-in-depth” agents; Assumptions/Dependencies: formal methods expertise, performance trade-offs, adversarial robustness requirements.
  • Continually evolving evaluation ecosystems (academia, benchmarking firms)
    • Live/private benchmark infrastructures that auto-generate logic-preserving variants and harder instances (e.g., algorithmic problem synthesis via meta-theorems); Tools/Workflows: dynamic test generation services, contamination tracking, longitudinal scorecards; Assumptions/Dependencies: funding and stewardship, secure data handling, community adoption.
  • Cultural and affective competence modeling (global platforms, healthcare, education)
    • Richer models of emotion and culture-aware interpretations to mitigate EI deficits and cross-cultural misalignments; Tools/Workflows: multilingual/cultural datasets, persona-conditioned training, affective grounding from multimodal signals; Assumptions/Dependencies: ethical data sourcing, bias and privacy safeguards, risk of stereotyping.
  • Organizational MLOps for failure-aware lifecycles (enterprise AI)
    • End-to-end pipelines that tag features by failure category, enforce pre-deployment robustness gates, monitor post-deployment drift in reasoning stability, and auto-trigger retraining with targeted perturbations; Tools/Workflows: taxonomy-aware feature registries, drift detectors for instability, automated dataset augmentation; Assumptions/Dependencies: mature MLOps stack, cross-functional governance.
  • Public AI literacy and consumer protections (policy, civil society)
    • Long-term programs that codify best practices for end-users, require plain-language disclosure of model limitations, and provide recourse when reasoning failures cause harm; Tools/Workflows: consumer labels, complaint handling standards, educational curricula; Assumptions/Dependencies: legislative support, coordination with industry, measurement of harm/benefit.
  • Domain-specific “reasoning skeletons” and toolchains (engineering, finance, medicine, law)
    • Pre-built, auditable reasoning workflows that decompose tasks into validated steps, bind critical steps to tools/humans, and log rationales for compliance; Tools/Workflows: sectoral templates (e.g., clinical differential diagnosis with calculator and guideline checks; compliance reviews with rule libraries); Assumptions/Dependencies: regulatory acceptance, interoperability with legacy systems, traceability requirements.
  • Research and teaching infrastructure built on the survey’s repository (academia)
    • Expand the “Awesome LLM Reasoning Failures” resource into course modules, shared datasets, and open leaderboards focusing on failure analysis and mitigation; Tools/Workflows: curated corpora of failure cases, standardized reporting formats, reproducible baselines; Assumptions/Dependencies: sustained community contributions, funding, and maintenance.

Notes on feasibility across applications

  • Many immediate applications rely on black-box access to models and can be added as wrappers, evaluators, or orchestration layers; costs are primarily engineering and compute for testing.
  • Training-centric mitigations (debiasing, symmetry data, architectural changes) depend on data rights, large compute budgets, and rigorous evaluation to avoid regressions.
  • Multi-agent and embodied applications introduce latency, complexity, and safety concerns; formal verification and simulation fidelity are key dependencies.
  • Ethical, cultural, and policy-aligned applications require consensus on norms, multilingual coverage, and safeguards against reinforcing biases.

Open Problems

We found no open problems mentioned in this paper.

