Hallucinations in LLMs
- Hallucinations in large language models are outputs that diverge from factually or contextually accurate responses and, owing to inherent computational limits, cannot be eliminated entirely.
- The taxonomy categorizes hallucinations by intrinsic vs. extrinsic and factuality vs. faithfulness, outlining specific manifestations and evaluation criteria.
- Mitigation strategies involve architectural improvements, advanced detection methods, and calibrated human oversight for safer LLM deployments.
LLMs are generative neural architectures that, despite their fluency and general versatility, are intrinsically prone to producing hallucinations: outputs that are factually incorrect, fabricated, or unfaithful to either the input or established knowledge. A comprehensive understanding of LLM hallucinations spans formal definitions anchored in computability theory, multi-dimensional taxonomies distinguishing core categories, detailed characterization of causes and manifestations, methodological advances in detection and benchmarking, as well as architectural and systemic mitigation strategies. Hallucinations are established as theoretically inevitable for any computable LLM, setting an upper bound on reliability even with ideal training and architecture. Effective risk management requires robust methods for detection, mitigation, and calibrated human oversight, particularly for critical or sensitive deployments.
1. Formal Definition and Theoretical Unavoidability
The foundational formalism models hallucination in an LLM as the emission of an output $h^{[i]}(s)$ for an input string $s$ that diverges from the ideal oracle function $f$ capturing the true mapping from inputs to outputs:

$$\exists\, s \in S : \quad h^{[i]}(s) \neq f(s),$$

where $h^{[i]}$ is the computational process instantiated by the LLM after $i$ stages of training and $S$ is the space of input prompts. Using diagonalization and related results from computability theory, the report demonstrates that hallucinations are an inescapable property of all computable LLMs, regardless of training data quality, architecture, or scaling; every possible training state $i$ will always leave some prompts for which $h^{[i]}$ fails to reproduce $f$ (Cossio, 3 Aug 2025).
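The divergence condition can be read operationally: given access to the ideal mapping on a finite probe set, hallucinations are exactly the prompts on which the model's computed function disagrees with the oracle. The following minimal Python sketch uses toy stand-ins for both functions (they are illustrative assumptions, not part of the cited formalism) to make the definition concrete.

```python
from typing import Callable, Iterable

def hallucinated_prompts(
    h: Callable[[str], str],   # h: the (computable) function the trained LLM implements
    f: Callable[[str], str],   # f: the ideal ground-truth oracle
    prompts: Iterable[str],
) -> list[str]:
    """Return the prompts s in S for which h(s) != f(s), i.e. hallucinations in the
    formal sense above. The computability argument says this set is non-empty for
    some prompts, no matter how h was trained."""
    return [s for s in prompts if h(s) != f(s)]

if __name__ == "__main__":
    f = lambda s: s.upper()                        # pretend this is the true mapping
    h = lambda s: s.upper() if len(s) < 5 else s   # an imperfect learned approximation
    print(hallucinated_prompts(h, f, ["ok", "fine", "longer prompt"]))
    # ['longer prompt'] : the input on which the model diverges from the oracle
```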
2. Taxonomy: Core Dichotomies and Manifestations
The taxonomy distinguishes hallucinations along several orthogonal axes:
A. Intrinsic vs. Extrinsic Hallucinations
- Intrinsic hallucinations directly contradict provided input or contextual prompts; they originate from the model’s internal processing. E.g., a summary misreporting values present in the source text.
- Extrinsic hallucinations are unsupported by training data or external reality; they reference facts or entities that are fabricated or unverifiable (e.g., inventing events or persons) (Cossio, 3 Aug 2025).
B. Factuality vs. Faithfulness
- Factuality hallucinations contravene real-world, externally verifiable facts (e.g., “Lindbergh walked on the moon”).
- Faithfulness hallucinations diverge from the provided prompt, source, or context, potentially being internally plausible but unfaithful to instructions or data, such as summarizing a text inaccurately (Cossio, 3 Aug 2025, Hong et al., 8 Apr 2024).
C. Specific Manifestations
- Factual errors, fabrications, and invented details.
- Contextual inconsistencies (contradictions relative to the source).
- Instruction and logical inconsistencies (violating explicit directives or logical coherence).
- Temporal disorientation (incorrect chronology).
- Ethical/professional violations (libel, bias).
- Amalgamated or nonsensical outputs (merging unrelated facts or incoherence).
These axes are not mutually exclusive; a given output may be categorized differently depending on the reference set (input context, training data, world knowledge) and the evaluation task.
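For annotation or benchmarking pipelines, these axes can be recorded as orthogonal labels on each flagged span. The sketch below is an illustrative data structure; the class and field names are assumptions for exposition, not a schema from the cited taxonomies.

```python
from dataclasses import dataclass
from enum import Enum

class Grounding(Enum):
    INTRINSIC = "contradicts the provided input or context"
    EXTRINSIC = "unsupported by training data or external reality"

class Target(Enum):
    FACTUALITY = "contradicts verifiable world knowledge"
    FAITHFULNESS = "diverges from the prompt, source, or instructions"

@dataclass
class HallucinationLabel:
    """One annotated span; the two dichotomies are orthogonal, so both are recorded."""
    span: str
    grounding: Grounding
    target: Target
    manifestation: str  # e.g. "factual error", "temporal disorientation"

example = HallucinationLabel(
    span="Lindbergh walked on the moon",
    grounding=Grounding.EXTRINSIC,
    target=Target.FACTUALITY,
    manifestation="factual error",
)
```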
3. Underlying Causes and Contributing Factors
Data-Related
- Training Data Quality: Incompleteness, low quality, or dataset bias lead to imitative falsehoods and omissions.
- Representation Gaps: Lack of rare, contemporary, or balanced data propagates blind spots and biases.
- Temporal Drift: Static data causes outdated outputs, a form of hallucination when current facts are required.
Model-Related
- Auto-Regressive Generation: Probabilistic next-token prediction amplifies minor errors, compounding hallucinations.
- Architectural Limitations: Restricted attention, context window, or internal state consistency affect factual reliability.
- Exposure Bias: Mismatch between teacher-forced training and free-running sampling at inference time.
- Decoding Strategies: High-temperature and other stochastic decoding settings inject randomness and are linked to increased hallucination; see the temperature-sampling sketch after this list.
- Overconfidence/Calibration: Miscalibrated output probabilities may reinforce erroneous generation (Cossio, 3 Aug 2025, Yao et al., 2023).
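To illustrate the decoding-strategy point above, the following sketch applies temperature scaling to toy logits: raising the temperature flattens the distribution and shifts probability mass onto unlikely tokens, the mechanism associated with increased hallucination. The logit values are arbitrary.

```python
import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float,
                      rng=np.random.default_rng()) -> int:
    """Sample a next-token id from logits softened by `temperature`.
    Temperature near zero approaches greedy decoding; large temperatures flatten
    the distribution and make low-probability tokens far more likely."""
    z = logits / max(temperature, 1e-6)
    z = z - z.max()                       # numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(probs), p=probs))

logits = np.array([4.0, 2.0, 0.5, 0.1])   # toy vocabulary of 4 tokens
for t in (0.2, 1.0, 2.0):
    counts = np.bincount([sample_next_token(logits, t) for _ in range(1000)], minlength=4)
    print(t, counts)   # higher t spreads samples onto the low-probability tokens
```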
Prompt-Related
- Ambiguous/Adversarial Prompts: Poorly specified, adversarial, or hypothesis-confirming prompts induce hallucinations by limiting available grounding or pushing the model toward speculative outputs (Yao et al., 2023, Jiang et al., 29 Mar 2024).
4. Detection, Evaluation, and Benchmarking
Detection of LLM hallucinations is operationalized along several methodologies:
| Detection Category | Method Description | Examples/Benchmarks |
|---|---|---|
| Inference Classifier | Trained classifiers assess hallucination likelihood given the generated output | FIB, ExHalder, HaluEval, Fact-checking (Ye et al., 2023) |
| Uncertainty Metric | Measurement of model uncertainty (e.g., entropy, KL-divergence, BARTScore) | BARTScore, KLD, POLAR |
| Self-Evaluation | LLM prompted to self-assess output consistency or generate multiple outputs | SelfCheckGPT, LM-know, LM-vs-LM |
| Evidence Retrieval | Verification against external factual sources or retrieval-augmented evaluations | FActScore, CCV, RSE, FactKB, Wikipedia comparison (Chataigner et al., 23 Oct 2024) |
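As an illustration of the self-evaluation row above, the sketch below approximates SelfCheckGPT-style consistency checking: sentences poorly supported by other stochastic samples from the same model are flagged as likely hallucinations. Plain string similarity stands in for the NLI- or QA-based scorers used in practice, and the threshold and function names are assumptions.

```python
from difflib import SequenceMatcher

def consistency_score(claim: str, samples: list[str]) -> float:
    """Crude self-consistency signal: how well is `claim` supported by other
    stochastic samples from the same model? Low agreement across samples is
    treated as evidence of hallucination."""
    sims = [SequenceMatcher(None, claim.lower(), s.lower()).ratio() for s in samples]
    return sum(sims) / len(sims) if sims else 0.0

def flag_hallucinations(sentences: list[str], samples: list[str],
                        threshold: float = 0.4) -> list[str]:
    """Return the sentences whose average agreement with the samples falls below threshold."""
    return [s for s in sentences if consistency_score(s, samples) < threshold]

samples = ["The Eiffel Tower is in Paris.", "The Eiffel Tower stands in Paris, France."]
claims = ["The Eiffel Tower is in Paris.", "It was completed in 1975."]
print(flag_hallucinations(claims, samples))  # the unsupported date claim scores lowest
```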
Key metrics and benchmarks:
- FActScore: Assesses factuality by decomposing a response $y$ into atomic facts $A_y$, each validated against a knowledge source $C$ such as Wikipedia; formally, $\mathrm{FActScore}(y) = \frac{1}{|A_y|} \sum_{a \in A_y} \mathbb{1}[a \text{ is supported by } C]$ (Chataigner et al., 23 Oct 2024); a minimal computation sketch follows this list.
- TruthfulQA: Measures the propensity for truthfulness versus imitative falsehood in open-domain Q&A.
- CHAIR, POPE: Metrics designed to evaluate object hallucinations in vision-language and embodied agent settings (Chakraborty et al., 18 Jun 2025).
- Human/Expert Annotation: Remains critical for edge cases and nuanced faithfulness assessments.
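A minimal sketch of the FActScore computation defined above, assuming a hypothetical `is_supported` verifier that stands in for the retrieval-plus-entailment step against the knowledge source:

```python
from typing import Callable

def factscore(atomic_facts: list[str], is_supported: Callable[[str], bool]) -> float:
    """FActScore for one response: the fraction of its atomic facts that the
    verifier judges supported by the knowledge source (e.g. Wikipedia)."""
    if not atomic_facts:
        return 0.0
    return sum(1 for a in atomic_facts if is_supported(a)) / len(atomic_facts)

# Hypothetical verifier backed by a tiny in-memory "knowledge source":
known = {"Paris is the capital of France", "Lindbergh flew across the Atlantic"}
score = factscore(
    ["Paris is the capital of France", "Lindbergh walked on the moon"],
    is_supported=lambda a: a in known,
)
print(score)  # 0.5 : one of the two atomic facts is supported
```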
The “Hallucinations Leaderboard” aggregates factuality and faithfulness scores across multiple tasks and models for comparative assessment (Hong et al., 8 Apr 2024).
5. Architectural and Systemic Mitigation Strategies
Robust mitigation requires interventions at multiple layers:
Architectural
- Retrieval-Augmented Generation (RAG): Augments LLMs with external document search to ground outputs (Xu et al., 9 Mar 2025).
- Toolformer-style Augmentation: Fine-tunes LLMs to utilize APIs or symbolic calculators for external verification.
- Preference Optimization: Direct Preference Optimization (and variants such as FDPO, HSA-DPO) guides models away from hallucinations using fine-grained human or AI feedback (Gunjal et al., 2023, Xiao et al., 22 Apr 2024).
- Decoding-Time Contrastive Methods: Approaches such as ICD, Delta, and LCD suppress hallucinations by penalizing tokens favored by an "anti-expert" or masked-input model, thereby favoring outputs grounded in the available context (Zhang et al., 2023, Huang et al., 9 Feb 2025, Manevich et al., 6 Aug 2024); a simplified sketch follows this list.
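The shared skeleton of these decoding-time contrastive methods can be sketched as a logit-level contrast between a grounded "expert" pass and an "anti-expert" pass. The weighting scheme and the choice of anti-expert below are simplified assumptions, not the exact formulations of ICD, Delta, or LCD.

```python
import numpy as np

def contrastive_logits(expert_logits: np.ndarray,
                       anti_expert_logits: np.ndarray,
                       alpha: float = 1.0) -> np.ndarray:
    """Generic decoding-time contrast: boost tokens the grounded 'expert' pass prefers
    and penalize tokens the 'anti-expert' (e.g. a context-masked or hallucination-prone
    variant) prefers. The specific weighting and anti-expert differ across methods."""
    return expert_logits - alpha * anti_expert_logits

expert = np.array([3.0, 1.0, 0.2])   # model conditioned on the full context
anti   = np.array([0.5, 2.5, 0.2])   # same model with the context masked
print(np.argmax(contrastive_logits(expert, anti)))  # token 0 survives the contrast
```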
Systemic
- Guardrails and Symbolic Integration: Factual filters, rule-based validators, and post-generation logic checks.
- Ensembling and Fallbacks: Leveraging heterogeneous models or external systems to cross-check and correct outputs (Guerreiro et al., 2023).
- User-Centered and Participatory Interfaces: Confidence aggregation (e.g., HILL), source display, and uncertainty indicators to prompt human recalibration of trust (Leiser et al., 11 Mar 2024).
- Multi-Agent Debate: Voting among independently generated answers to reduce high-confidence errors ("delusions"); ensemble debate has been shown empirically to yield dramatic reductions in delusional output (Xu et al., 9 Mar 2025). A minimal voting sketch follows this list.
- Data Curation and Training: Filtering, augmenting, and balancing datasets to reduce bias and improve factual grounding.
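As a minimal illustration of the ensembling and debate items above, the sketch below aggregates independently sampled answers by majority vote and uses the agreement ratio as a rough confidence signal; real multi-agent debate adds iterative critique rounds that this sketch omits.

```python
from collections import Counter

def majority_vote(answers: list[str]) -> tuple[str, float]:
    """Simplest ensemble/debate aggregation: independently sampled answers vote,
    and the agreement ratio doubles as a rough confidence signal. Low agreement
    is a cue to fall back to retrieval, a human reviewer, or abstention."""
    normalized = [a.strip().lower() for a in answers]
    (winner, count), = Counter(normalized).most_common(1)
    return winner, count / len(normalized)

answer, agreement = majority_vote(["Paris", "Paris", "Lyon", "paris"])
print(answer, agreement)  # 'paris' 0.75 -- flag for review if agreement is low
```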
6. Special Topics: Multimodal, Multilingual, and Creative Hallucinations
- Multimodal LLMs (LVLMs): Hallucinations can manifest as object, attribute, or relationship errors not supported by the visual context. Specialized detection and mitigation (e.g., sentence-level labeling, LCD, severity-aware optimization) yield reductions in visual hallucinations (Gunjal et al., 2023, Manevich et al., 6 Aug 2024, Xiao et al., 22 Apr 2024).
- Multilingual Gaps: Hallucination rates are systematically higher in low-resource languages, with lower FActScore values and greater variance, exposing fairness concerns in global deployment (Chataigner et al., 23 Oct 2024, Guerreiro et al., 2023).
- Creative Domains: In certain knowledge-poor or creative domains (e.g., drug discovery), hallucinated context (imaginative molecular descriptions) can improve performance on classification tasks, suggesting that harnessing creative hallucinations is domain-dependent and invites calibrated risk-benefit analysis (Yuan et al., 23 Jan 2025).
7. Cognitive, Human-Interaction, and Perceptual Factors
Users frequently under-detect hallucinations due to fluency or confidence heuristics, automation bias, or confirmation bias. The illusion of explanatory depth further exacerbates overtrust. User interface interventions such as uncertainty displays, source-grounding, and human-in-the-loop calibration are recommended as part of system-level mitigation (Cossio, 3 Aug 2025, Leiser et al., 11 Mar 2024). Studies confirm that even “correct” model outputs may contain residual subjective or minor errors that still mislead end users if undetected.
8. Conclusion and Future Directions
Given their computability-theoretic inevitability, hallucinations represent a foundational limitation that cannot be entirely eradicated by improved training or architecture (Cossio, 3 Aug 2025). Future research must focus on:
- Robust detection and continuous monitoring across modalities and languages.
- Modular and composable mitigation pipelines (ensemble fallbacks, symbolic filters, confidence metrics).
- Enhanced fine-grained evaluation via standardized human and automated benchmarks.
- Human-centered system design to foster critical trust and explicit uncertainty reasoning.
- Exploration of introspective LLMs equipped with self-skepticism tokens or explicit self-reflection mechanisms to flag potential hallucinations before exposure (Wu et al., 10 Sep 2024).
Ongoing advances in these directions remain essential for deploying LLMs safely and responsibly in mission-critical and high-stakes applications.