Large Language Model Reasoning Failures
Abstract: LLMs have exhibited remarkable reasoning capabilities, achieving impressive results across a wide range of tasks. Despite these advances, significant reasoning failures persist, occurring even in seemingly simple scenarios. To systematically understand and address these shortcomings, we present the first comprehensive survey dedicated to reasoning failures in LLMs. We introduce a novel categorization framework that distinguishes reasoning into embodied and non-embodied types, with the latter further subdivided into informal (intuitive) and formal (logical) reasoning. In parallel, we classify reasoning failures along a complementary axis into three types: fundamental failures intrinsic to LLM architectures that broadly affect downstream tasks; application-specific limitations that manifest in particular domains; and robustness issues characterized by inconsistent performance across minor variations. For each reasoning failure, we provide a clear definition, analyze existing studies, explore root causes, and present mitigation strategies. By unifying fragmented research efforts, our survey provides a structured perspective on systemic weaknesses in LLM reasoning, offering valuable insights and guiding future research towards building stronger, more reliable, and robust reasoning capabilities. We additionally release a comprehensive collection of research works on LLM reasoning failures, as a GitHub repository at https://github.com/Peiyang-Song/Awesome-LLM-Reasoning-Failures, to provide an easy entry point to this area.
Explain it Like I'm 14
What is this paper about?
This paper looks at how LLMs — like the AI that writes, answers questions, or helps with homework — still make surprising reasoning mistakes, even on problems that seem easy to people. The authors collect and organize many studies about these failures, explain why they happen, and discuss what might help fix them. They also share a public list of related research on GitHub to help others explore this topic.
What questions does it ask?
The paper tackles three simple questions:
- What kinds of reasoning do LLMs struggle with?
- What kinds of failures do they have, and how are these failures different from each other?
- Why do these failures happen, and what can we do to reduce them?
To answer these, the authors propose two helpful “maps”:
- A reasoning map: reasoning can be non-embodied (done in the head, like logic and intuition) or embodied (about the physical world and actions). Non-embodied splits into informal (everyday intuition) and formal (logic, math, code).
- A failure map: failures can be fundamental (built-in limits of the model), application‑specific (problems in certain domains like social or moral questions), or robustness issues (answers change when small, irrelevant details change).
How did the authors study it?
This is a survey paper. That means the authors:
- Read and compared many research papers about LLM reasoning errors.
- Built a clear categorization framework (the “maps” above) to organize all the different failure types.
- Described examples, likely causes, and known fixes for each failure type.
- Collected links to papers and benchmarks in a GitHub repository for easy access.
Think of it like a guidebook that pulls together scattered evidence and shows the bigger picture.
What did they find?
Here are the main takeaways, grouped by the type of reasoning.
1) Informal reasoning (everyday thinking)
- Core mental skills are shaky:
- Limited “working memory”: models lose track of details over long problems or when the situation changes midway.
- Weak “inhibitory control”: they stick to obvious patterns even when those patterns don’t fit the new context.
- Low “cognitive flexibility”: they struggle to switch rules quickly or adapt to new instructions.
- Abstract reasoning is inconsistent: finding hidden rules or dealing with time and symbols can go wrong.
- Human-like cognitive biases show up:
- Order and anchoring effects: the first thing you tell the model can overly influence its answer.
- Framing effects: the same question phrased differently can lead to different answers.
- Confirmation and popularity biases: they prefer info that matches earlier context or what’s common online.
- These biases come from training data, model design (like how Transformers read sequences), and human feedback processes.
- Social intuition (without interaction) is brittle:
- Theory of Mind (imagining what someone else knows, believes, or feels) is inconsistent. Even small wording changes can make models fail basic tests that young children pass.
- Moral and social norms reasoning is unstable: answers can be contradictory across similar questions, languages, or cultures. This makes safety-sensitive uses risky.
- Social reasoning with multiple agents (teams of AIs):
- Long-term planning breaks down; communication can be unclear; models misread other agents’ goals.
- Systems are vulnerable to small disruptions or bad actors unless carefully designed with checks, roles, and verification.
Why this happens: Text-only training (no real-world experience), next-token prediction goals (which reward pattern completion over deep thinking), attention limits, biased data, and alignment that can reinforce human biases.
What helps (partially): Chain-of-Thought prompting, better retrieval, fine-tuning, curated data, adversarial training, personality steering, stronger multi-agent protocols, and “context engineering.” These help but don’t fully solve the problems.
2) Formal reasoning (logic, math, and code)
- Logic in natural language:
- Reversal curse: a model that learns “A is B” may fail to infer “B is A” (for example, “Paris is the capital of France” vs. “The capital of France is Paris”). This reflects directional training and asymmetries in how models learn.
- Compositional reasoning: models can handle simple parts but fail when combining them (even just two steps). This shows up in two-hop reasoning, multi-fact checks, and compound math tasks.
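Both failure modes above can be probed with tiny synthetic checks. The facts and prompt formats here are our illustration, not the paper's benchmarks:

```python
# Reversal probe: the model sees "A is B" during training, then is queried
# in the reversed "B is ..." direction, where it often fails.
def reversal_probe(a: str, b: str):
    trained = f"{a} is {b}."
    query = b[0].upper() + b[1:] + " is"  # expected completion: a
    return trained, query

# Two-hop (compositional) probe: each hop is a fact the model may know in
# isolation, but the composed question requires chaining them.
HOP1 = {"the Eiffel Tower": "Paris"}   # landmark -> city
HOP2 = {"Paris": "France"}             # city -> country

def two_hop_answer(landmark: str) -> str:
    """'In which country is <landmark>?' requires composing both hops."""
    return HOP2[HOP1[landmark]]
```

A model can answer both single hops correctly yet still miss the composed question, which is exactly the failure the survey describes.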
- Logic hidden in benchmarks:
- Small, logic-preserving changes should not affect answers, but they do. Examples: reordering multiple-choice answers, renaming variables in code, or swapping harmless details in math problems. Big score drops after such changes reveal robustness issues and possible overfitting to public datasets.
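One of the logic-preserving edits mentioned above, reordering multiple-choice options, can be sketched in a few lines. The function names and structure are our illustration of the general idea:

```python
import random

# Shuffle the options of a multiple-choice question while tracking where
# the correct answer moved. The underlying question is unchanged, so a
# robust model's accuracy should not drop after this transformation.
def shuffle_mcq(options: list[str], answer_idx: int, seed: int = 0):
    rng = random.Random(seed)
    order = list(range(len(options)))
    rng.shuffle(order)
    new_options = [options[i] for i in order]
    new_answer_idx = order.index(answer_idx)
    return new_options, new_answer_idx
```

Running a benchmark twice, once with originals and once with shuffled variants, and comparing scores is the basic recipe behind the robustness studies the survey covers.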
- Arithmetic and mathematics:
- Counting is harder than it seems for LLMs; tokenization and position encoding play a role.
- Basic arithmetic breaks under more digits, especially in multiplication. Models often use shortcuts instead of real algorithms.
- Math Word Problems (MWPs) combine reading with math; models can be thrown off by tiny edits, swapped known/unknowns, or unsolvable trick questions.
Why this happens: The model’s job is to predict the next word, not to “think” in steps like humans do. This encourages pattern-matching, not reliable rules. Architecture choices and training data design matter a lot.
What helps (partially): Training with step-by-step solutions, reversing digit order to focus attention where the math actually happens, editing model internals, adding graph-like reasoning paths, using external tools or neuro-symbolic modules, and improving benchmarks to be contamination-free and transformation-resistant.
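The digit-reversal idea can be shown concretely: writing numbers least-significant digit first makes the carry in addition flow in the same direction an autoregressive model emits tokens. The exact formatting below is illustrative, not a specific paper's scheme:

```python
# Reversed-digit representation: "57 + 68 = 125" becomes "75 + 86 = 521",
# so the first digit the model must produce depends only on the first
# digits it has already read, aligning the arithmetic with generation order.
def rev(n: int) -> str:
    return str(n)[::-1]

def reversed_addition_line(a: int, b: int) -> str:
    return f"{rev(a)} + {rev(b)} = {rev(a + b)}"
```

Training on lines in this format has been reported to make small models learn addition far more reliably than the standard digit order.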
3) Embodied reasoning (the physical world)
- Text-only physical commonsense:
- Models often misunderstand object properties (size, weight), spatial relations (above, inside), and simple physical laws (gravity, force).
- Formal physics or scientific problem-solving shows gaps in multi-step reasoning and correct law use.
- Why this is hard: LLMs learn from text, not from touching or seeing the world. Without real grounding, physical intuition is fragile.
- What helps: Multimodal inputs (vision, audio), simulations, better scientific training data, and tools can help, but full physical understanding remains a major open challenge.
What is the impact of this work?
- It gives researchers a shared language and structure to talk about LLM reasoning failures. That makes it easier to spot common patterns, compare results, and build stronger fixes.
- It warns users and builders not to trust single benchmark scores. Robustness — staying right when the problem is phrased differently — matters.
- It highlights where today’s models are least reliable: long, multi-step plans; subtle social or moral questions; small logic changes; counting and arithmetic; and anything needing physical intuition.
- It points to promising directions: better data, smarter training, model architecture changes, multimodal grounding, system-level safeguards, and stronger, evolving benchmarks.
If you want to explore the cited works, the authors provide a curated GitHub list: https://github.com/Peiyang-Song/Awesome-LLM-Reasoning-Failures
Final takeaway
LLMs can do amazing things, but they still make predictable reasoning mistakes — sometimes on very simple tasks, and sometimes just because a question was reworded. This paper maps out those weaknesses, explains why they happen, and gathers the best ideas for fixing them. The big message: to build AI you can rely on, you must test for robustness, understand failure modes, and design models and systems that think more carefully, not just more loudly.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a single, consolidated list of concrete gaps and open questions the paper leaves unresolved, organized with tags so they are easier to prioritize and act on.
- [Methodology] Lack of standardized, quantitative definitions and metrics for “reasoning failure” types (fundamental vs application-specific vs robustness), making cross-paper comparisons and meta-analysis difficult.
- [Methodology] No unified evaluation protocol to disentangle causality of failures (data artifacts vs architecture vs alignment/RLHF vs inference-time prompt/context), limiting targeted mitigation.
- [Methodology] Absence of longitudinal tracking across model updates (e.g., model/version churn) to determine which failures persist, attenuate, or emerge over time.
- [Methodology] Insufficient contamination auditing and reporting standards for benchmarks used to diagnose failures, leaving robustness and “true generalization” ambiguous.
- [Methodology] No community-maintained, perturbation-based evaluation suite with standardized logic-preserving edits across tasks (math, code, commonsense, ToM, etc.) to stress-test stability.
- [Methodology] Limited mechanistic/causal evidence linking specific internal components (layers, heads, circuits) to each failure mode; most claims are correlational.
- [Methodology] Missing analyses of trade-offs (e.g., debiasing vs helpfulness, safety vs robustness, CoT vs latency/cost) when mitigating one failure potentially worsens others.
- [Cognitive] No agreed-upon, task-agnostic probes for working memory, inhibitory control, and cognitive flexibility in LLMs that allow consistent scoring across model families and sizes.
- [Cognitive] Under-specified links between architectural choices (e.g., attention variants, recurrence, external memory) and improvements in executive functions; few controlled ablations.
- [Cognitive] Lack of curricula or training signals explicitly designed to grow abstract reasoning and temporal abstraction in a measurable, transferable way.
- [Bias] Insufficient disentanglement of bias sources: how much arises from pretraining corpus vs positional encodings/causal masking vs RLHF/finetuning vs decoding strategies?
- [Bias] Poor coverage of cross-lingual/cross-domain cognitive biases (e.g., anchoring, framing) and how they differ with script/tokenization schemes.
- [Bias] No standardized stress-tests for bias persistence and “rebound” under context shifts, jailbreaks, or adversarially framed prompts.
- [Bias] Minimal mechanistic interpretability on how and where framing/order/anchoring biases are represented and can be reliably edited without collateral damage.
- [ToM] Limited evidence that ToM-like abilities survive small perturbations and transfer to interactive, dynamic multi-turn settings; need for benchmarks that jointly test stability and generalization.
- [ToM] Open question whether improved ToM from prompting (e.g., CoT) reflects genuine internal modeling or surface-level decomposition; few tests isolate “latent” ToM.
- [Emotion/EI] Sparse evaluation of emotional intelligence across cultures and modalities (text, audio, vision), and poor grounding of affect recognition to behaviorally consistent outputs.
- [Norms/Morals] No principled framework for learning/calibrating culturally sensitive moral norms that remain consistent across tasks and languages without exacerbating bias.
- [Norms/Morals] Lack of mitigation approaches that yield robust, generalizable moral reasoning beyond task- or dataset-specific gains (RLHF-like “surface compliance” remains brittle).
- [Norms/Morals] Need for adversarial robustness evaluations specifically tailored to moral/ethical queries (e.g., norm ambiguities, trap questions, oblique jailbreaks).
- [MAS] Missing standardized, reproducible long-horizon multi-agent benchmarks with clear termination, verification, and adversarial-agent protocols.
- [MAS] Unclear best practices for internal belief-tracking and hypothesis-testing in agents (what representations/mechanisms actually improve coordination vs noise?).
- [MAS] Insufficient evidence that context-engineering, inspectors/critics, or role-structuring generalize across domains without heavy hand-engineering.
- [MAS] No formal safety guarantees or theoretical analyses for failure cascades in MAS (e.g., bounds on escalation, convergence, or recovery under perturbations).
- [Logic-NL] Reversal curse: open questions around universality across families (decoder-only vs encoder-decoder vs recurrent hybrids) and whether bidirectional objectives fully resolve it at scale.
- [Logic-NL] Limited theory on when and why reversal-sensitive facts remain asymmetric after anti-causal/bidirectional pretraining or post-hoc finetuning.
- [Logic-NL] Compositional reasoning: lack of validated measures for “latent compositionality” (vs CoT-style explicit steps) and how to train it reliably without large compute.
- [Logic-NL] Few studies map specific attention/circuit patterns to composition failures and demonstrate stable, cross-dataset patching/editing that endures distribution shift.
- [Logic-Specific] Limited, comprehensive coverage of targeted logical relations (e.g., quantifiers, modal logic, counterfactuals) with contamination-free datasets and cross-lingual variants.
- [Benchmarks] No widely adopted “meta-benchmark” that applies uniform, logic-preserving transformations to popular tasks and reports a consolidated robustness score.
- [Benchmarks] Sparse evaluation of sensitivity to prompt verbosity, narrative perspective, and distractors under controlled, factorial designs.
- [Benchmarks] Limited assessment frameworks to separate real reasoning improvements from test-taking artifacts or memorization under evolving public benchmarks.
- [Arithmetic] Counting and digit-level manipulations: need architectures and positional encodings with provable inductive biases for discrete operations; tokenization confounds remain unresolved.
- [Arithmetic] Lack of diagnostics that distinguish algorithmic learning from pattern-matching (e.g., last-digit vs first-digit inconsistencies) across scales and languages.
- [Arithmetic] Overreliance on tool-use masks core internal arithmetic deficits; need tool-free evaluations and training that grow durable internal algorithms.
- [Temporal/Units] Persistent failures in temporal/units reasoning (e.g., calendars, time zones, unit conversions) lack unified tests and targeted architectural/representation fixes.
- [MWPs] Insufficient error decomposition for MWPs (parsing vs planning vs computation vs verification) to pinpoint where reasoning breaks and match mitigations accordingly.
- [MWPs] Limited, reliable detection and handling of unsolvable or ill-posed MWPs; models still “force” solutions—robust abstention criteria and proofs-of-unsolvability are underdeveloped.
- [MWPs] Automated verification of reasoning chains remains fragile; need benchmarks with trustworthy step-level labels and methods to evaluate faithfulness vs fluency.
- [Embodied-1D] Physical commonsense: lack of grounding signals and standardized text-only tasks that require real-world regularities beyond language priors (with contamination checks).
- [Embodied-2D/3D] Sparse, standardized 2D/3D embodied benchmarks spanning spatial relations, affordances, causal dynamics, and planning with sim-to-real validation.
- [Embodied-2D/3D] No consensus on how to attribute failure to language vs perception vs control in multimodal systems; need modular, ablated pipelines and shared diagnostics.
- [Embodied-2D/3D] Open questions on data requirements and safety for physical grounding: how much multimodal/RL experience is necessary and how to collect it ethically at scale?
- [Embodied-2D/3D] Few architectural proposals with formal properties for spatial reasoning (e.g., equivariance, object-centricity) evaluated against the failure taxonomy in this survey.
- [Mitigation] CoT/RAG/steering methods: limited evidence of durability under perturbations and task transfer; need principled evaluations of “how much” robustness they actually add.
- [Mitigation] Data augmentation for symmetry/compositionality remains expensive and domain-specific; open need for general, model-agnostic procedures with proven generalization.
- [Mitigation] Activation steering and circuit editing lack standardized safety checks (distributional side-effects, degradation on non-targeted abilities) and reproducibility guidelines.
- [Mitigation] Few formal guarantees or bounds linking mitigation strategies to reductions in specific failure probabilities under defined perturbation classes.
- [Taxonomy] The proposed 2-axis taxonomy (reasoning type × failure type) is not empirically validated across a broad set of models/tasks; need inter-annotator studies and predictive utility tests (e.g., does category predict effective mitigation?).
- [Synthesis] No quantitative meta-analysis estimating prevalence/severity of each failure mode across models/benchmarks, preventing evidence-based prioritization.
- [Repository] The curated GitHub list lacks a documented coverage protocol (inclusion/exclusion, periodic audits, redundancy checks), impeding completeness and bias assessment.
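One gap above, the missing consolidated robustness score, admits a simple candidate definition: the ratio of accuracy on logic-preserving variants to accuracy on the originals. This metric is our illustration, not one the paper proposes:

```python
# results: one (correct_on_original, correct_on_perturbed) pair per item.
# A score of 1.0 means perturbations never flip a correct answer; lower
# values quantify the robustness gap.
def robustness_score(results: list[tuple[bool, bool]]) -> float:
    orig_acc = sum(o for o, _ in results) / len(results)
    pert_acc = sum(p for _, p in results) / len(results)
    return pert_acc / orig_acc if orig_acc else 0.0
```

A meta-benchmark could report this single number alongside raw accuracy, making robustness regressions visible across model releases.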
Glossary
- Abstract reasoning: The ability to identify patterns and relationships among concepts to draw generalizable conclusions beyond specific instances. "Another key aspect is abstract reasoning \citep{guinungco2020abstract}, the cognitive ability to recognize patterns and relationships in intangible concepts."
- Activation steering: A technique that modifies internal neural activations to guide model behavior toward desired outcomes. "Current efforts to address these limitations mainly include prompt-based interventions \citep{chakraborty2025structuredmoralreasoninglanguage, ma2023letsthoughtexperimentusing}, internal activation steering \citep{tlaie2024exploring, turner2023steering}, and direct fine-tuning on curated moral reasoning benchmarks \citep{senthilkumar2024finetuninglanguagemodelsethical, karpov2024inducinghumanlikebiasesmoral}."
- Affective bias: Systematic influence of emotional content on judgments and reasoning, often leading to skewed outputs. "Furthermore, current models exhibit deficits in emotional reasoning. This includes difficulties in emotional intelligence (EI) \citep{sabour2024emobench, hu2025emobench, amirizaniani2024llms, vzorinab2024emotional}, susceptibility to affective bias \citep{chochlakis2024strong}, and limited understanding of cultural variations in emotional expression and interpretation \citep{havaldar2023multilingual}."
- Anchoring bias: The tendency to rely heavily on initial information (the “anchor”) when making decisions, even when subsequent information suggests adjustment. "LLMs are highly sensitive to the order in which information is given, exhibiting order bias \citep{koo2023benchmarking, pezeshkpour2023large, jayaram2024mitigating, guan2025order, cobbina2025show}, and show anchoring bias \citep{lieder2018anchoring, rastogi2022deciding}, where early inputs disproportionately shape their reasoning \citep{lou2024anchoring, o2025anchoring, huang2025empirical, wang2025assessing}."
- Application-specific limitations: Failures tied to particular domains where models underperform relative to human expectations, despite general competence elsewhere. "In contrast, application-specific limitations reflect shortcomings tied to particular domains of importance, where models underperform despite human expectations of competence."
- Belief tracking: Maintaining and updating representations of other agents’ beliefs or states during interaction or reasoning. "Mitigation research thus targets (i) richer internal models like belief tracking and hypothesis testing \citep{li2023theory,cross2024hypothetical}..."
- Causal inference: Determining cause-effect relationships from data and reasoning, beyond mere correlation. "Using this approach, studies reveal LLM weaknesses in specific types of logic such as converse binary relations \citep{qi2023investigationllmsinefficacyunderstanding}, syllogistic reasoning \citep{ando2023evaluatinglargelanguagemodels}, causal inference \citep{joshi2024llmspronefallaciescausal}, and even shallow yes/no questions \citep{clark2019boolqexploringsurprisingdifficulty}."
- Causal masking: The Transformer’s mechanism that restricts attention to prior tokens to ensure autoregressive generation, potentially inducing order-based biases. "Second, architectural features of the model -- such as the Transformer’s causal masking -- introduce predispositions toward order-based biases independent of data \citep{wu2025emergence, dufter2022position}."
- Chain-of-Thought (CoT): A prompting strategy that elicits step-by-step reasoning in the output to improve problem-solving. "Efforts to enhance these skills correspondingly include advanced prompting like Chain-of-Thought (CoT) \citep{wei2022chain}, retrieval augmentation \citep{xu2023retrieval}, fine-tuning with deliberately injected interference \citep{li2022largelanguagemodelscontrollable}, multimodality \citep{hao2025visual}, and architectural innovations to mimic human attention mechanisms \citep{wu2024cognitive}."
- Cognitive biases: Systematic deviations from rational judgment caused by heuristics, limited resources, or contextual influences. "Cognitive biases -- systematic deviations from rational judgment -- are well-studied in human reasoning \citep{tversky1974judgment, tversky1981framing}."
- Cognitive flexibility: The ability to switch between tasks or adapt to new rules efficiently. "Lastly, cognitive flexibility, the skill of adapting to new rules or switching tasks efficiently, remains a challenge, especially in rapid task switching and adaptation to new instructions \citep{kennedy2024cognitive}."
- Compositional reasoning: Combining multiple pieces of information or sub-results into a coherent overall inference. "Compositional reasoning requires combining multiple pieces of knowledge or arguments into a coherent inference."
- Confirmation bias: Favoring information that confirms prior beliefs or context, while discounting contradictory evidence. "Regarding content, LLMs struggle more with abstract or unfamiliar topics -- a phenomenon known as the “content effect” \citep{lampinen2024language} -- and tend to favor information that aligns with prior context or assumptions, reflecting human-like confirmation bias \citep{o2025confirmation, shi2024argumentative, malberg2024comprehensive, wan2025unveilingconfirmationbiaschainofthought, zhu2024conformity}."
- Context engineering: Systematically optimizing the entire input payload (context) provided to an LLM during inference for robustness and performance. "In parallel, the recent rise of context engineering \citep{mei2025survey} -- which focuses on a systematic optimization of the entire information payload fed to an LLM during inference -- is increasingly seen as a more robust alternative to traditional prompt engineering in MAS."
- Counting: Discrete enumeration of elements; a basic numerical capability that current LLMs often fail to perform reliably. "Despite its simplicity, counting poses a notable fundamental challenge for LLMs \citep{xu2024llmgeniusparadoxlinguistic, chang2024languagemodelsneedinductive, zhang-he-2024-large, fu2024largelanguagemodelsllms, conde2025can, yehudai2024transformerscountn}, even the reasoning ones \citep{malek2025frontierllmsstrugglesimple}..."
- Embodied reasoning: Reasoning grounded in physical interaction with environments, relying on spatial intelligence and real-time feedback. "Embodied reasoning depends on physical interaction with environments, fundamentally relying on spatial intelligence and real-time feedback \citep{shapiro2019embodied, barsalou2008grounded}."
- Emotional Intelligence (EI): The capability to perceive, understand, and manage emotions in oneself and others. "Furthermore, current models exhibit deficits in emotional reasoning. This includes difficulties in emotional intelligence (EI) \citep{sabour2024emobench, hu2025emobench, amirizaniani2024llms, vzorinab2024emotional}..."
- Executive functions: Core cognitive processes—working memory, inhibitory control, and cognitive flexibility—essential for goal-directed reasoning. "A prominent example is the set of core executive functions -- working memory \citep{baddeley2020working}, inhibitory control \citep{diamond2013executive,williams1999development}, and cognitive flexibility \citep{canas2006cognitive} -- essential in human reasoning \citep{diamond2013executive}."
- False belief tasks: Assessments of Theory of Mind where one must reason about others holding beliefs that are incorrect. "Typically emerging in early childhood with milestones like passing false belief tasks (understand that others' beliefs may be incorrect or different) \citep{wimmer1983beliefs}..."
- Framing effects: Variations in decisions or outputs caused by differences in how equivalent information is presented. "Framing effects further influence outputs: logically equivalent but differently phrased prompts can lead to different results \citep{jones2022capturing, suri2024large, nguyen2024human, lior2025wildframe, robinson2025framing, shafiei2025more}."
- Fundamental failures: Model shortcomings intrinsic to architectures or training that broadly affect diverse tasks. "Fundamental failures are usually intrinsic to LLM architectures, manifesting broadly and universally across diverse downstream tasks."
- Graph-structured reasoning path data: Training data that encodes reasoning steps as graph structures to guide logical inference. "while \citet{zhou2024enhancinglogicalreasoninglarge} enhances training with graph-structured reasoning path data, similar to distilling CoT reasoning process into training data \citep{yu2024distilling21}."
- Group attribution bias: The tendency to ascribe characteristics or outcomes to group membership rather than individual actions. "Social cognitive biases also influence LLM outputs, including group attribution bias \citep{hamilton1976illusory, allison1985group, raj2025talent} and negativity bias \citep{rozin2001negativity}..."
- Inhibitory control: The ability to suppress impulsive or default responses when inappropriate. "Inhibitory control -- the ability to suppress impulsive or default responses when contexts demand -- is also weak in LLMs, with them often sticking to previously learned patterns even when contexts shift \citep{han2024incontextlearningelicittrustworthy, patel2025deficient}..."
- Large Reasoning Models (LRMs): Advanced LLMs optimized or specialized for complex reasoning tasks. "Beyond perturbations, a rising approach utilizes meta-theorems such as the Monadic Second-Order logic from CS theory to synthesize algorithmic coding problems at scale \citep{beniamini2025formulaonemeasuringdepthalgorithmic}, posing a significant challenge even for state-of-the-art large reasoning models (LRMs) \citep{xu2025largereasoningmodelssurvey}."
- Logic-preserving transformations: Modifications to benchmark items that change surface form while keeping the underlying logical task identical. "these studies introduce logic-preserving transformations based on particular task structures, such as reordering options in multiple-choice questions (MCQs) \citep{zheng2023large, pezeshkpour2023large, alzahrani2024benchmarks, gupta2024changinganswerorderdecrease, ni2024trainingbenchmarkneed}..."
- Math Word Problems (MWPs): Natural-language math tasks requiring translation into formal operations and multi-step reasoning. "MWPs exemplify this, as their logic can be readily abstracted into reusable templates."
- Monadic Second-Order logic: A powerful logical framework allowing quantification over sets, used to formally describe and synthesize algorithmic problems. "Beyond perturbations, a rising approach utilizes meta-theorems such as the Monadic Second-Order logic from CS theory to synthesize algorithmic coding problems at scale \citep{beniamini2025formulaonemeasuringdepthalgorithmic}..."
- Multi-Agent Systems (MAS): Systems composed of interacting agents that coordinate to solve tasks via communication and planning. "In Multi-Agent Systems (MAS), explicit social reasoning is the capacity of AI systems to collaboratively plan and solve complex tasks, an area challenging for current LLMs."
- Multi-Head Self-Attention (MHSA): A Transformer mechanism with multiple attention heads enabling parallel focus on different token relations. "Toward this, \citet{li2024understandingpatchingcompositionalreasoning} identifies faulty implicit reasoning in mid-layer multi-head self-attention (MHSA) modules and edit them..."
- Negativity bias: The tendency to weigh negative information more heavily than positive information in judgment. "Social cognitive biases also influence LLM outputs, including group attribution bias \citep{hamilton1976illusory, allison1985group, raj2025talent} and negativity bias \citep{rozin2001negativity}..."
- Neuro-symbolic augmentations: Methods that integrate neural networks with symbolic components to enable exact or structured reasoning. "and neuro-symbolic augmentations that enable internal arithmetic reasoning \citep{dugan2024occamllmfastexactlanguage}."
- Non-embodied reasoning: Cognitive processes that do not require physical interaction with environments. "Non-embodied reasoning comprises cognitive processes not requiring physical interaction with environments."
- Object affordance: The possible actions or interactions that an object allows, given its properties and context. "Failures of LLMs include lack of knowledge about object attributes (e.g., size, weight, softness) \citep{wang2023newtonlargelanguagemodels, liu2022things, shu2023posqa, Kondo_Sugawara_Aizawa_2023}, spatial relationships (e.g., above, inside, next to) \citep{liu2022things, shu2023posqa, Kondo_Sugawara_Aizawa_2023}, simple physical laws (e.g., gravity, motion, and force) \citep{Gregorcic_Pendrill_2023}, and object affordance (possible actions/reactions an object can make) \citep{Aroca-Ouellette_Paik_Roncone_Kann_2021, adak2024text2afford, pensa2024multi}."
- Order bias: Sensitivity of outputs to the sequence in which information or options are presented. "LLMs are highly sensitive to the order in which information is given, exhibiting order bias \citep{koo2023benchmarking, pezeshkpour2023large, jayaram2024mitigating, guan2025order, cobbina2025show}..."
- Positional encoding: A technique in Transformers that injects token position information into embeddings to enable order-aware attention. "Although the failures manifest at the application level, much work suggest that they originate primarily from architectural and representational limits, including tokenization \citep{zhang2024countingabilitylargelanguage, shin2024largelanguagemodelslack}, positional encoding \citep{chang2024languagemodelsneedinductive}, and training data composition \citep{allenzhu2024physicslanguagemodels31}..."
- Proactive interference: The phenomenon where earlier information disrupts the learning or retrieval of newer information. "In particular, LLMs suffer from ``proactive interference'' to a much larger extent than humans, where earlier information significantly disrupts retrieval of newer updates \citep{wang2025unableforgetproactivelnterference}."
- Reinforcement Learning from Human Feedback (RLHF): A training process where models are optimized to produce outputs preferred by human raters. "Third, alignment processes like Reinforcement Learning from Human Feedback (RLHF) amplify biases by aligning model behavior with human raters who are themselves biased \citep{sumita2025cognitive, perez2023discovering}."
- Retrieval augmentation: Enhancing model reasoning by fetching and incorporating relevant external information during inference. "Efforts to enhance these skills correspondingly include advanced prompting like Chain-of-Thought (CoT) \citep{wei2022chain}, retrieval augmentation \citep{xu2023retrieval}..."
- Reversal curse: A failure where models trained on "A is B" do not infer the logically equivalent "B is A." "A representative failure of LLMs is reversal curse: despite being trained on ``A is B,'' models often fail to infer the equivalent ``B is A'' -- a trivial bidirectional equivalence for humans."
- Robustness vulnerabilities: Weaknesses where model performance degrades under minor, logic-preserving variations in inputs or tasks. "While there has been clear progress from early models like GPT-3 -- which largely failed at ToM tasks -- to newer models such as GPT-4o and reasoning models like o1-mini, which can solve many standard ToM tests, their underlying reasoning remains brittle under simple perturbations \citep{gu2024simpletom, zhou2023far}... such failures often result in significant robustness vulnerabilities."
- Syllogistic reasoning: Deductive reasoning using structured premises (syllogisms) to derive logical conclusions. "Using this approach, studies reveal LLM weaknesses in specific types of logic such as converse binary relations \citep{qi2023investigationllmsinefficacyunderstanding}, syllogistic reasoning \citep{ando2023evaluatinglargelanguagemodels}..."
- Theory of Mind (ToM): The capacity to attribute and reason about beliefs, intentions, and emotions of oneself and others. "ToM is the cognitive ability to attribute mental states -- beliefs, intentions, emotions -- to oneself and others, and to understand that others' mental states may differ from one's own \citep{frith2005theory}."
- Tokenization: Segmenting text into tokens for model processing, which influences representation and downstream abilities. "Although the failures manifest at the application level, much work suggests that they originate primarily from architectural and representational limits, including tokenization \citep{zhang2024countingabilitylargelanguage, shin2024largelanguagemodelslack}, positional encoding \citep{chang2024languagemodelsneedinductive}, and training data composition \citep{allenzhu2024physicslanguagemodels31}..."
- Two-hop reasoning: Inferring conclusions by combining exactly two related facts, often across documents or contexts. "Studies show systematic failures in basic two-hop reasoning -- combining only two facts across documents -- and even worsening performance with increased compositional depth and the addition of distractors \citep{zhao2024exploringlimitationslargelanguage, xu2024largelanguagemodelscompositional, guo2025llmsperformtwohopreasoning}."
- Zipf's law: A statistical distribution where the frequency of items is inversely proportional to their rank, impacting data coverage and learning. "\citet{golovneva2024reversetrainingnursereversal} further argues that scaling alone cannot resolve the issue due to Zipf's law \citep{Newman_2005}."
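The Zipf's law argument above can be made concrete with a few lines of arithmetic: under a frequency-inversely-proportional-to-rank distribution, a small head of items absorbs most of the probability mass, so rare facts (and their reversals) are seen very few times no matter how much data is added. A minimal sketch:

```python
# Sketch: item frequencies under Zipf's law, f(r) ∝ 1/r^s.
# Illustrates why scaling data alone leaves tail facts undertrained:
# a tiny head of items dominates the total mass.

def zipf_frequencies(n_items: int, s: float = 1.0) -> list[float]:
    """Normalized frequencies f(r) = (1/r^s) / H for ranks 1..n_items."""
    weights = [1.0 / (rank ** s) for rank in range(1, n_items + 1)]
    total = sum(weights)
    return [w / total for w in weights]

freqs = zipf_frequencies(10_000)
head_share = sum(freqs[:100])   # mass on the 100 most frequent items
tail_share = sum(freqs[5000:])  # mass on the rarer half of all items
print(f"top-100 share: {head_share:.2f}, bottom-5000 share: {tail_share:.2f}")
```

With 10,000 items, the top 100 ranks carry roughly half the mass while the bottom 5,000 carry only a few percent, which is the intuition behind the scaling argument.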
Practical Applications
Immediate Applications
- Robustness-driven LLM audits for deployment (software, healthcare, finance, legal, customer support)
- Use the survey’s taxonomy and logic-preserving perturbations (e.g., option reordering, premise/event rearrangement, harmless entity swaps) to stress-test models before release, catching fundamental, application-specific, and robustness failures; Tools/Workflows: metamorphic test generators for math/coding/QA, CI/CD gates that fail on instability, private/live internal benchmarks; Assumptions/Dependencies: access to representative task data, ability to run large test suites, guarding against benchmark contamination.
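A metamorphic stress test of the kind described above can be very small. The sketch below checks order robustness for a multiple-choice task: the model's chosen *content* must not change when only the presentation order of the options changes. `ask_model` is a hypothetical callable standing in for a real model API.

```python
import random

def reorder_options(options: list[str], seed: int) -> tuple[list[str], dict[int, int]]:
    """Shuffle answer options; return the new order and an old->new index map."""
    rng = random.Random(seed)
    perm = list(range(len(options)))
    rng.shuffle(perm)
    shuffled = [options[i] for i in perm]
    mapping = {old: new for new, old in enumerate(perm)}
    return shuffled, mapping

def is_order_robust(ask_model, question: str, options: list[str],
                    n_trials: int = 5) -> bool:
    """Metamorphic check: the selected answer content must survive reordering."""
    baseline = options[ask_model(question, options)]
    for seed in range(n_trials):
        shuffled, _ = reorder_options(options, seed)
        if shuffled[ask_model(question, shuffled)] != baseline:
            return False
    return True

# Demo with a stub "model" that answers by content, so it passes the check.
def stub_model(question, options):
    return options.index("4")

print(is_order_robust(stub_model, "What is 2+2?", ["3", "4", "5"]))  # True
```

A CI/CD gate would run checks like this over a held-out task suite and fail the build when the instability rate exceeds a threshold.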
- Prompt and context engineering “linters” to reduce cognitive biases (software, education, enterprise productivity)
- Encode known bias risks (anchoring, order, framing, verbosity, distractions) into automated checks and recommendations that restructure prompts/contexts and enforce verification steps (e.g., Chain-of-Thought, self-critique); Tools/Workflows: prompt-linting plugins, context optimizers, structured templates with rationale and cross-checks; Assumptions/Dependencies: model supports longer contexts/rationales, cost/latency budget for deliberation.
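A prompt "linter" of this kind can start as a simple pattern scan. The rules and messages below are illustrative assumptions, not an exhaustive policy; a production linter would pair each warning with an automated rewrite suggestion.

```python
import re

# Minimal sketch of a prompt linter: flag phrasings known to induce
# cognitive biases (anchoring, leading questions, framing).
BIAS_RULES = [
    (r"\b(obviously|clearly|everyone knows)\b",
     "anchoring/confirmation: drop presupposing language"),
    (r"\b(isn't it|don't you think|surely)\b",
     "leading question: ask neutrally instead"),
    (r"\b(best|worst) option is\b",
     "framing: present options symmetrically"),
]

def lint_prompt(prompt: str) -> list[str]:
    """Return a list of bias warnings for the given prompt."""
    warnings = []
    for pattern, message in BIAS_RULES:
        if re.search(pattern, prompt, flags=re.IGNORECASE):
            warnings.append(message)
    return warnings

issues = lint_prompt("Obviously the best option is A, don't you think?")
for issue in issues:
    print("-", issue)
```

Such checks slot naturally into an IDE plugin or a pre-submission hook for prompt templates.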
- Human-in-the-loop routing for weak-reasoning regimes (science, education, enterprise analytics)
- Based on known deficits (counting, arithmetic, physics reasoning, ToM, moral consistency), route sub-tasks to tools/humans: calculators/CAS for math, retrieval for factual gaps, policy checklists for ethics; Tools/Workflows: orchestration layer that classifies queries by failure category and routes to tools/approvers; Assumptions/Dependencies: tool availability, integration infrastructure, clear escalation criteria.
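The orchestration layer described above can be prototyped as a keyword/regex classifier that maps queries to failure categories and then to handlers. The category names and route targets below are illustrative assumptions.

```python
import re

# Sketch of a failure-aware router: classify a query by its likely
# weak-reasoning regime and dispatch to a tool or human reviewer.
ROUTES = {
    "arithmetic": "calculator",   # LLMs are unreliable at exact arithmetic
    "counting":   "calculator",
    "factual":    "retrieval",    # fill factual gaps via retrieval
    "ethics":     "human_review", # policy checklists + human approver
    "default":    "llm",
}

def classify(query: str) -> str:
    if re.search(r"\d+\s*[-+*/^]\s*\d+", query):
        return "arithmetic"
    if re.search(r"\bhow many\b", query, re.IGNORECASE):
        return "counting"
    if re.search(r"\b(should we|is it ethical|moral)\b", query, re.IGNORECASE):
        return "ethics"
    if re.search(r"\b(who|when|where)\b", query, re.IGNORECASE):
        return "factual"
    return "default"

def route(query: str) -> str:
    return ROUTES[classify(query)]

print(route("What is 127 * 48?"))           # calculator
print(route("Should we deny this claim?"))  # human_review
```

A real system would replace the regexes with a trained classifier, but the routing table and escalation structure stay the same.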
- Multi-agent safety scaffolding (software automation, operations)
- Reduce MAS failures via structured roles, belief-tracking, mandatory verification rounds, inspector/challenger agents, and explicit termination checks; Tools/Workflows: reusable MAS protocols, “watchdog” agents, conversation state graphs; Assumptions/Dependencies: tolerance for added latency/cost, careful system design to avoid error cascades.
- Moral/ethics and social-norm consistency checks (platform policy, HR, healthcare triage, legal)
- Apply standardized ToM/social-norm/moral benchmarks with paraphrase/language variations to detect inconsistencies and jailbreak susceptibility; Tools/Workflows: compliance checklists mapped to the survey’s taxonomy, red-teaming suites seeded with known failure patterns; Assumptions/Dependencies: organizational policy clarity, multilingual evaluation capacity, reviewer training.
- Safer model procurement and vendor evaluation (policy, enterprise IT governance)
- Require vendors to report robustness under perturbations, reversal curse tests, compositional reasoning checks, and bias audits; Tools/Workflows: RFP templates and model cards that include failure profiles and mitigation plans; Assumptions/Dependencies: willingness of vendors to disclose, standardized test protocols.
- Education and AI literacy modules (academia, K–12, professional training)
- Use curated examples (e.g., reversal curse, framing effects, counting errors) to teach limitations and safe usage patterns; Tools/Workflows: classroom kits, lab assignments employing perturbation-based evaluations and CoT; Assumptions/Dependencies: access to models or recorded outputs, instructor preparedness.
- Content moderation and communications hygiene (trust & safety, marketing, newsrooms)
- Counteract negativity/popularity biases and confirmation bias via prompt steering and two-pass generation (neutral reframe + fact-check pass); Tools/Workflows: activation/post-processing steering, policy-tuned prompts, reviewer dashboards that flag bias-inducing frames; Assumptions/Dependencies: model supports steerability, clear editorial policies.
- Coding and math evaluation hardening (software engineering, online judges, edtech)
- Adopt docstring/identifier perturbations, control-flow swaps, numeric resampling, and solvability checks to prevent overfitting and assess true reasoning; Tools/Workflows: robustness harnesses for code (renaming, syntax edits, adversarial mutations) and math (entity swaps, blank-inversions, variable roles); Assumptions/Dependencies: execution sandboxes, contamination-free datasets.
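An identifier-renaming perturbation, one of the logic-preserving code mutations mentioned above, can be built on Python's `ast` module. If a model's score drops under such renames, it was likely pattern-matching on surface names rather than reasoning about the program. A minimal sketch:

```python
import ast

# Sketch: logic-preserving identifier rename for code-benchmark hardening.
class Renamer(ast.NodeTransformer):
    def __init__(self, mapping: dict[str, str]):
        self.mapping = mapping

    def visit_Name(self, node: ast.Name) -> ast.Name:
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node: ast.arg) -> ast.arg:
        node.arg = self.mapping.get(node.arg, node.arg)
        return node

    def visit_FunctionDef(self, node: ast.FunctionDef) -> ast.FunctionDef:
        node.name = self.mapping.get(node.name, node.name)
        self.generic_visit(node)  # also rename args and body names
        return node

def rename_identifiers(source: str, mapping: dict[str, str]) -> str:
    tree = Renamer(mapping).visit(ast.parse(source))
    return ast.unparse(tree)

original = "def add(a, b):\n    return a + b\n"
perturbed = rename_identifiers(original, {"add": "f", "a": "x", "b": "y"})
print(perturbed)  # behavior unchanged, surface form differs
```

The same harness extends to docstring edits and control-flow-equivalent rewrites; execution sandboxes then verify that perturbed and original variants remain behaviorally identical.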
- Product UX for “deliberate mode” and uncertainty surfacing (software products, consumer apps)
- Add UI options for explicit reasoning steps, alternative framings, and automatic rechecks under perturbed prompts; Tools/Workflows: CoT toggles, multi-sample agreement, result stability meters; Assumptions/Dependencies: user tolerance for latency, support for rationale generation.
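A "result stability meter" like the one mentioned above reduces to measuring answer agreement across lightly rephrased prompts. In the sketch below, `sample_answer` is a hypothetical callable standing in for a model call; here it is stubbed with a lookup table.

```python
from collections import Counter

# Sketch: sample the model under several rephrasings and report the
# fraction agreeing with the modal answer. Low scores signal instability
# worth surfacing in the UI.
def stability_score(sample_answer, prompt: str, rephrasings: list[str]) -> float:
    """Fraction of phrasings whose answer matches the modal answer."""
    answers = [sample_answer(p) for p in [prompt, *rephrasings]]
    _, count = Counter(answers).most_common(1)[0]
    return count / len(answers)

# Demo with a stub that is stable on 3 of 4 phrasings.
canned = {"q": "A", "q1": "A", "q2": "A", "q3": "B"}
score = stability_score(canned.get, "q", ["q1", "q2", "q3"])
print(f"{score:.2f}")  # 0.75
```

A score threshold can drive the UI: above it, show the answer normally; below it, surface the disagreement and offer a deliberate-mode rerun.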
- Domain-specific data curation for debiasing and symmetry (model devs, MLOps)
- Integrate bidirectional fact variants to mitigate reversal curse, graph-structured reasoning paths for compositionality, and adversarial/perturbed samples for robustness; Tools/Workflows: data pipelines that auto-generate reversals, semantic unit permutations, and graph-labeled examples; Assumptions/Dependencies: data rights, compute for fine-tuning, evaluation against leakage.
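The auto-generation of bidirectional fact variants can start from simple templates. The sketch below only handles flat "X is Y" identity statements; a real pipeline would need parsing and paraphrase models to cover richer constructions, and the example fact is just an illustration.

```python
import re

# Sketch: emit both directions of an "A is B" fact to mitigate the
# reversal curse during fine-tuning data curation.
def reversal_variants(fact: str) -> list[str]:
    """Given 'A is B', also emit 'B is A' (identity facts only)."""
    match = re.fullmatch(r"(.+?) is (.+?)\.?", fact.strip())
    if not match:
        return [fact]  # leave non-matching facts untouched
    a, b = match.group(1), match.group(2)
    return [f"{a} is {b}.", f"{b} is {a}."]

pairs = reversal_variants("Valentina Tereshkova is the first woman in space")
for p in pairs:
    print(p)
```

Downstream, both variants are added to the training mix so the model sees each relation in both directions.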
- Operations checklists for daily use of chatbots (daily life, SMBs)
- Practical habits: avoid leading questions, test answer stability via minor rephrasings, verify math with calculators, and ask the model to list assumptions; Tools/Workflows: lightweight browser extensions that auto-perturb prompts and compare answers; Assumptions/Dependencies: user consent to extra steps, minimal UI friction.
Long-Term Applications
- Architecture-level improvements for core cognitive skills (AI R&D, foundational model labs)
- Design models with enhanced working memory, inhibitory control, and cognitive flexibility (e.g., new attention mechanisms, memory modules) and improved positional/token representations to fix counting/arithmetic; Tools/Workflows: hybrid attention/memory blocks, alternative positional encoding, curriculum learning; Assumptions/Dependencies: significant research cycles, training compute, rigorous ablation frameworks.
- Bidirectional/contrastive training regimes to overcome directional asymmetries (model training)
- Systematic objectives that enforce logical symmetry (A→B and B→A) and compositional consistency across tasks; Tools/Workflows: contrastive data pairs, loss terms for symmetry/compositional constraints, verification-by-edit objectives; Assumptions/Dependencies: large-scale curated data, risk of catastrophic interference, careful evaluation to avoid shortcut learning.
- Neuro-symbolic reasoning modules “under the hood” (software, finance, engineering, science)
- Integrate exact solvers (arithmetic engines, theorem provers, program analyzers) as internal components rather than external calls, enabling reliable reasoning where LLMs are weak; Tools/Workflows: differentiable interfaces to solvers, internal tool-selection policies, caching of verified sub-results; Assumptions/Dependencies: latency/complexity overheads, compatibility with training/inference stacks.
- Embodied and multimodal grounding for physical commonsense (robotics, AR/VR, logistics)
- Train models with sensorimotor traces, physics simulators, and vision-language-robotics curricula to reduce physical reasoning gaps; Tools/Workflows: simulation-to-real pipelines, multimodal datasets with affordance labels, RL with language supervision; Assumptions/Dependencies: high-fidelity simulators, safety constraints, domain transfer challenges.
- Standardized societal alignment batteries and governance (policy, regulators, standards bodies)
- Sector-wide certifications that test ToM robustness, social norm diversity, and moral consistency under paraphrase/multilingual shifts; Tools/Workflows: certification suites maintained as private/live benchmarks, third-party auditors, reporting standards; Assumptions/Dependencies: consensus on norms/metrics, handling cultural variance, legal frameworks.
- Resilient multi-agent infrastructures with formal guarantees (enterprise automation, defense, autonomous systems)
- MAS architectures with formalized protocols, belief/goal modeling, verification phases, and termination proofs; Tools/Workflows: protocol DSLs, runtime monitors, theorem-proved coordination templates, “defense-in-depth” agents; Assumptions/Dependencies: formal methods expertise, performance trade-offs, adversarial robustness requirements.
- Continually evolving evaluation ecosystems (academia, benchmarking firms)
- Live/private benchmark infrastructures that auto-generate logic-preserving variants and harder instances (e.g., algorithmic problem synthesis via meta-theorems); Tools/Workflows: dynamic test generation services, contamination tracking, longitudinal scorecards; Assumptions/Dependencies: funding and stewardship, secure data handling, community adoption.
- Cultural and affective competence modeling (global platforms, healthcare, education)
- Richer models of emotion and culture-aware interpretations to mitigate EI deficits and cross-cultural misalignments; Tools/Workflows: multilingual/cultural datasets, persona-conditioned training, affective grounding from multimodal signals; Assumptions/Dependencies: ethical data sourcing, bias and privacy safeguards, risk of stereotyping.
- Organizational MLOps for failure-aware lifecycles (enterprise AI)
- End-to-end pipelines that tag features by failure category, enforce pre-deployment robustness gates, monitor post-deployment drift in reasoning stability, and auto-trigger retraining with targeted perturbations; Tools/Workflows: taxonomy-aware feature registries, drift detectors for instability, automated dataset augmentation; Assumptions/Dependencies: mature MLOps stack, cross-functional governance.
- Public AI literacy and consumer protections (policy, civil society)
- Long-term programs that codify best practices for end-users, require plain-language disclosure of model limitations, and provide recourse when reasoning failures cause harm; Tools/Workflows: consumer labels, complaint handling standards, educational curricula; Assumptions/Dependencies: legislative support, coordination with industry, measurement of harm/benefit.
- Domain-specific “reasoning skeletons” and toolchains (engineering, finance, medicine, law)
- Pre-built, auditable reasoning workflows that decompose tasks into validated steps, bind critical steps to tools/humans, and log rationales for compliance; Tools/Workflows: sectoral templates (e.g., clinical differential diagnosis with calculator and guideline checks; compliance reviews with rule libraries); Assumptions/Dependencies: regulatory acceptance, interoperability with legacy systems, traceability requirements.
- Research and teaching infrastructure built on the survey’s repository (academia)
- Expand the “Awesome LLM Reasoning Failures” resource into course modules, shared datasets, and open leaderboards focusing on failure analysis and mitigation; Tools/Workflows: curated corpora of failure cases, standardized reporting formats, reproducible baselines; Assumptions/Dependencies: sustained community contributions, funding, and maintenance.
Notes on feasibility across applications
- Many immediate applications rely on black-box access to models and can be added as wrappers, evaluators, or orchestration layers; costs are primarily engineering and compute for testing.
- Training-centric mitigations (debiasing, symmetry data, architectural changes) depend on data rights, large compute budgets, and rigorous evaluation to avoid regressions.
- Multi-agent and embodied applications introduce latency, complexity, and safety concerns; formal verification and simulation fidelity are key dependencies.
- Ethical, cultural, and policy-aligned applications require consensus on norms, multilingual coverage, and safeguards against reinforcing biases.