Language Hallucination in LLMs
- Language hallucination is the phenomenon where LLMs produce fluent output that deviates from factual accuracy due to intrinsic and extrinsic errors.
- It is measured using token-level metrics and evaluated with benchmarks like TruthfulQA and FactScore, revealing significant discrepancies across languages and modalities.
- Mitigation strategies such as retrieval-augmented generation, prompt interventions, and fine-tuning are employed to manage and reduce hallucination risks.
LLMs frequently generate outputs that are fluent and syntactically correct but factually inaccurate, unsupported, or inconsistent with their inputs or external realities. This phenomenon, termed language hallucination, arises whenever model-generated content diverges from ground truth, the provided context, or established external knowledge. Language hallucination is recognized as both an engineering defect and an inevitable structural feature of statistical language modeling, especially under open-world conditions where models must extrapolate to unseen or poorly specified inputs (Xu, 29 Sep 2025, Alansari et al., 5 Oct 2025).
1. Formal Foundations and Taxonomy
Language hallucination is formally equated with generalization error in the learning-theoretic sense. For an LLM implementing a function $f$ that maps input contexts $x$ to outputs, a hallucination occurs if $f(x) \neq y$ for ground truth $y$. The true risk $R(f) = \mathbb{E}_{(x,y) \sim \mathcal{D}}\left[\mathbb{1}\{f(x) \neq y\}\right]$ reflects the model's hallucination propensity over the test distribution $\mathcal{D}$; hallucination is present if $R(f) > 0$. Under the closed-world assumption ($\operatorname{supp}(\mathcal{D}) \subseteq \operatorname{supp}(\mathcal{D}_{\text{train}})$), hallucinations can be minimized with sufficient data and model constraint. Under the open-world assumption ($\mathcal{D}$ admits support outside $\operatorname{supp}(\mathcal{D}_{\text{train}})$), hallucinations are theoretically inevitable regardless of data scale (Xu, 29 Sep 2025).
Taxonomies distinguish hallucinations by source, manifestation, and context:
- Intrinsic: Outputs contradict provided input or context.
- Extrinsic: Outputs add plausible but unverified information.
- Factuality error: Contradicts or fabricates real-world facts.
- Faithfulness error: Diverges from the source or instruction, producing irrelevant, incomplete, or illogical content.
- Type-I (False Memorization): Contradicts information present in training data—corrigible via retraining.
- Type-II (False Generalization): Errors on truly novel or never-seen inputs—irreducible under open-world settings (Xu, 29 Sep 2025, Alansari et al., 5 Oct 2025).
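The Type-I / Type-II distinction can be illustrated with a toy classifier. This is a minimal sketch, not the formal construction from the cited work: it assumes the training data can be represented as an exact-match lookup table, and the names `classify_error` and `training_facts` are hypothetical.

```python
from typing import Dict, Optional


def classify_error(
    x: str,
    prediction: str,
    ground_truth: str,
    training_facts: Dict[str, str],
) -> Optional[str]:
    """Label an error as Type-I or Type-II; return None when the output is correct.

    Assumption: `training_facts` stands in for "information present in the
    training data" as a simple input -> answer table with exact matching.
    """
    if prediction == ground_truth:
        return None  # no hallucination on this input
    if x in training_facts:
        # The input was covered by training data, yet the output is wrong:
        # false memorization (Type-I), corrigible via retraining.
        return "type_I"
    # The input lies outside the training data: false generalization (Type-II).
    return "type_II"


# Illustrative data only.
training_facts = {"capital of France": "Paris"}
seen_error = classify_error("capital of France", "Lyon", "Paris", training_facts)
unseen_error = classify_error("capital of Freedonia", "Sylvania", "unknown", training_facts)
```

Real systems cannot test training-set membership by exact lookup, so the membership check here is only a stand-in for whatever coverage estimate is available.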
Hallucinations are further subclassified in task- and modality-specific contexts (e.g., object-existence, attribute, relational, counting, parametric, or logical forms in vision-language models) (Lan et al., 2024, Fu et al., 2024).
2. Quantification and Benchmarks
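Language hallucination is quantified at various granularities, from tokens to sentences to complete documents. The prototypical metric is the token-level hallucination rate:
$$\mathrm{HR}_{\text{token}} = \frac{N_{\text{hal}}}{N_{\text{total}}}$$
where $N_{\text{hal}}$ is the number of hallucinated tokens and $N_{\text{total}}$ is the total number of generated tokens. At the answer level, the rate is $\mathrm{HR}_{\text{answer}} = A_{\text{hal}} / A_{\text{total}}$, the share of answers flagged as hallucinated (Thoresen et al., 4 May 2026, Alansari et al., 5 Oct 2025).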
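The token-level and answer-level rates can be computed from per-token labels. The sketch below assumes each answer is a list of 0/1 labels (1 = hallucinated token) and that an answer counts as hallucinated if it contains at least one such token; that answer-level criterion is an assumption, not something the cited benchmarks are confirmed to use.

```python
from typing import Sequence


def token_level_rate(labels: Sequence[Sequence[int]]) -> float:
    """Hallucinated tokens divided by all generated tokens, pooled over answers."""
    total = sum(len(answer) for answer in labels)
    if total == 0:
        return 0.0
    hallucinated = sum(sum(answer) for answer in labels)
    return hallucinated / total


def answer_level_rate(labels: Sequence[Sequence[int]]) -> float:
    """Share of answers containing at least one hallucinated token (assumed criterion)."""
    if not labels:
        return 0.0
    flagged = sum(1 for answer in labels if any(answer))
    return flagged / len(labels)


# Illustrative labels for three answers; only the second contains hallucinated tokens.
example_labels = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0],
]
token_rate = token_level_rate(example_labels)    # 2 of 10 tokens
answer_rate = answer_level_rate(example_labels)  # 1 of 3 answers
```

Because one hallucinated token is enough to flag a whole answer under this criterion, the answer-level rate is never lower than the token-level rate on the same data.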
Empirical studies use both automatic and human benchmarks:
- TruthfulQA, HaluEval, HaluQA, Absinth for QA and summarization (Alansari et al., 5 Oct 2025).
- MultiWikiQHalluA: token-level faithfulness hallucinations across 306 languages (Thoresen et al., 4 May 2026).
- AuthenHallu: hallucinations in authentic LLM-human interactions, revealing rates of ~31% in general dialogues and 60% in math (Ren et al., 12 Oct 2025).
- Medical QA: the textbook-grounded hallucination rate of LLaMA-70B-Instruct is 19.7% (95% CI: 18.6–20.7), despite 98.8% plausibility (Colelough et al., 12 Feb 2026).
- FactScore: atomic-fact-based scoring for cross-lingual hallucination analysis (Chataigner et al., 2024).
A broad spectrum of auxiliary metrics is also used, such as ROUGE-L, BLEU, BERTScore, BLEURT, self-consistency, human faithfulness ratings, and agreement statistics (e.g., quadratic-weighted Cohen's $\kappa$, Kendall's $\tau$) (Colelough et al., 12 Feb 2026, Ren et al., 12 Oct 2025).
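As an illustration of the agreement statistics, quadratic-weighted Cohen's kappa between two annotators can be computed directly from its textbook definition. This is a minimal sketch that assumes ratings are integers in `0..num_categories-1` on an ordinal scale; the function name is illustrative and no particular annotation protocol from the cited studies is implied.

```python
from typing import Sequence


def quadratic_weighted_kappa(
    rater_a: Sequence[int], rater_b: Sequence[int], num_categories: int
) -> float:
    """Cohen's kappa with quadratic disagreement weights (i - j)^2 / (k - 1)^2."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("rating lists must be non-empty and of equal length")
    if num_categories < 2:
        raise ValueError("need at least two categories")
    n, k = len(rater_a), num_categories

    # Observed confusion matrix between the two raters.
    observed = [[0] * k for _ in range(k)]
    for i, j in zip(rater_a, rater_b):
        observed[i][j] += 1
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count under independence of the two raters.
            weighted_expected += weight * row[i] * col[j] / n
    if weighted_expected == 0.0:
        raise ValueError("kappa is undefined when both raters use a single category")
    return 1.0 - weighted_observed / weighted_expected


perfect = quadratic_weighted_kappa([0, 1, 2, 1], [0, 1, 2, 1], num_categories=3)
opposite = quadratic_weighted_kappa([0, 0, 2, 2], [2, 2, 0, 0], num_categories=3)
```

The quadratic weights penalize a two-step disagreement four times as heavily as a one-step disagreement, which suits graded faithfulness ratings better than unweighted kappa.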
3. Multilingual and Modality-Specific Phenomena
Hallucination rates are strongly modulated by language resource level and modality:
- Multilingual gaps: Hallucination rates are systematically higher in low-resource languages. For instance, token-level hallucination rates on Icelandic reach 0.36 (0.60 at answer level), compared to 0.03/0.07 for English (Thoresen et al., 4 May 2026). FactScore analyses show median factuality drops by over 35 points from English to Javanese in free-form biography tasks (Chataigner et al., 2024).
- Root causes: Lower-resource languages are underrepresented in pretraining data, suffer from tokenization artifacts, and lack task-aligned ground-truth corpora (Chataigner et al., 2024, Das et al., 30 Jul 2025).
- Vision-language models (VLMs): Hallucinations manifest as object hallucination (generating entities not present in the image), attribute errors, and multimodal conflicts. Rate and severity are benchmarked by CHAIR, POPE, and AMBER metrics (Lan et al., 2024, Fu et al., 2024).
- Modality gap: Weak visual-text alignment, parametric knowledge leakage, and overconfident text decoders induce large instance- and type-specific hallucination rates (Lan et al., 2024).
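Object hallucination metrics of the CHAIR family reduce to simple set arithmetic once objects have been extracted. The sketch below follows the commonly cited definition (an instance-level rate over mentioned objects and a sentence-level rate over captions); it assumes object mentions and ground-truth objects are already available as sets of normalized names, and `chair_scores` is an illustrative name.

```python
from typing import List, Set, Tuple


def chair_scores(
    mentioned: List[Set[str]], present: List[Set[str]]
) -> Tuple[float, float]:
    """Return (instance-level, sentence-level) object hallucination rates.

    mentioned[i]: objects named in caption i (assumed pre-extracted).
    present[i]:   objects actually in image i (assumed ground-truth annotations).
    """
    if len(mentioned) != len(present):
        raise ValueError("need one ground-truth set per caption")
    total_mentions = 0
    hallucinated_mentions = 0
    captions_with_hallucination = 0
    for caption_objects, image_objects in zip(mentioned, present):
        hallucinated = caption_objects - image_objects  # named but not in the image
        total_mentions += len(caption_objects)
        hallucinated_mentions += len(hallucinated)
        if hallucinated:
            captions_with_hallucination += 1
    chair_i = hallucinated_mentions / total_mentions if total_mentions else 0.0
    chair_s = captions_with_hallucination / len(mentioned) if mentioned else 0.0
    return chair_i, chair_s


# Illustrative data: the second caption invents a lamp.
captions = [{"dog", "frisbee"}, {"cat", "sofa", "lamp"}]
images = [{"dog", "frisbee", "grass"}, {"cat", "sofa"}]
chair_i, chair_s = chair_scores(captions, images)  # 1 of 5 mentions, 1 of 2 captions
```

Objects that are present but never mentioned (such as the grass above) do not affect either score, so these rates measure fabrication rather than coverage.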
4. Detection and Analysis Methodologies
Detection approaches are categorized by their requirement for external references, granularity, and supervision:
- Reference-based: Retrieval-augmented checks compare generation to trusted sources using entailment or fact-verification models; span-level classifiers highlight unsupported regions (Alansari et al., 5 Oct 2025, Wang et al., 2024).
- Uncertainty-based: High Shannon entropy or predictive uncertainty (especially epistemic) correlates with hallucination risk. Penalizing epistemic uncertainty during decoding can reduce hallucinations (Xiao et al., 2021, Alansari et al., 5 Oct 2025).
- Self-consistency: Diversity or contradictions among multiple sampled generations flag hallucination (e.g., AutoHall's self-contradiction check) (Cao et al., 2023).
- Learning-based: Supervised token- or span-level classifiers; iterative, EM-style self-labeling (ANAH-v2) yields state-of-the-art annotators surpassing GPT-4 on fine-grained detection (Gu et al., 2024).
- Embedding/geometry-based: Response and reference embeddings cluster in semantic space; hallucinations are reliably distinguished by centroid or distance-based rules (Zavhorodnii et al., 6 Oct 2025).
- Hybrid production systems: Cascaded NER, NLI, and span-based detectors integrated with LLM-based rewriting achieve offline F1≈0.87, dynamically balancing cost, latency, and accuracy (Wang et al., 2024).
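Two of the reference-free signals above, token-level entropy and agreement among sampled generations, are easy to sketch. The code assumes access to the model's per-step probability distributions and to several sampled answers for the same prompt; the threshold value and the exact-match normalization are illustrative assumptions, not settings taken from the cited work.

```python
import math
from collections import Counter
from typing import List, Sequence


def shannon_entropy(probs: Sequence[float]) -> float:
    """Entropy in nats of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def flag_uncertain_steps(
    step_distributions: Sequence[Sequence[float]], threshold: float
) -> List[int]:
    """Indices of decoding steps whose entropy exceeds the (assumed) threshold."""
    return [
        i for i, dist in enumerate(step_distributions)
        if shannon_entropy(dist) > threshold
    ]


def self_consistency(samples: List[str]) -> float:
    """Share of sampled answers that match the most frequent answer.

    Assumption: answers are compared after lowercasing and trimming whitespace;
    practical systems typically use semantic or NLI-based comparison.
    """
    if not samples:
        raise ValueError("need at least one sample")
    normalized = [s.strip().lower() for s in samples]
    most_common_count = Counter(normalized).most_common(1)[0][1]
    return most_common_count / len(samples)


# Illustrative inputs: a confident step followed by a maximally uncertain one.
distributions = [[0.97, 0.01, 0.01, 0.01], [0.25, 0.25, 0.25, 0.25]]
uncertain_steps = flag_uncertain_steps(distributions, threshold=1.0)
agreement = self_consistency(["Paris", "paris ", "Lyon"])
```

Low agreement or many high-entropy steps is a risk signal rather than proof of hallucination, since a model can also be confidently and consistently wrong.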
5. Root Causes and Theoretical Inevitability
Hallucination is underpinned by the generalization structure of statistical learning: in the open world, unseen or out-of-support test instances guarantee the existence of inputs for which model predictions are unconstrained by data. No conceivable amount of supervised data can eliminate Type-II hallucinations (false generalization) in real-world settings (Xu, 29 Sep 2025). Data noise, domain gaps, architectural artifacts (e.g., unidirectional context, exposure bias), and objective misalignment (MLE, RLHF with insufficient negative signal) amplify error rates (Alansari et al., 5 Oct 2025).
In multilingual and multimodal models, "modality gap," poor instruction following, and pretraining biases further increase both baseline and adversarial hallucinations. For lower-resource settings, architectural and tokenization mismatches exacerbate error rates (Chataigner et al., 2024, Thoresen et al., 4 May 2026).
6. Mitigation Strategies and Engineering Implications
Mitigation approaches span all stages of the modeling pipeline:
- Prompt and decoding interventions: Chain-of-thought, retrieval-conditioned prompts, control tokens, and uncertainty-aware decoding reduce risk without retraining (Xiao et al., 2021, Alansari et al., 5 Oct 2025).
- Retrieval-based generation: Retrieval-augmented generation and knowledge-graph integration robustly lower hallucination, especially for factual tasks (Alansari et al., 5 Oct 2025, Wang et al., 2024).
- Data-centric fine-tuning: Multilingual supervised fine-tuning (SFT), cross-lingual alignment for hallucination-aware data pairs, and fine-grained contrastive data generation enhance robustness in both monolingual and multilingual models (Qu et al., 2024, Fu et al., 2024).
- Preference optimization: Direct preference optimization (DPO) and hallucination-targeted variants (HDPO, HA-DPO) enforce faithfulness by explicitly penalizing hallucination-rich outputs during training. Reported gains include an absolute reduction of more than 50% in captioning hallucinations and up to +19 percentage points in accuracy across languages (Fu et al., 2024, Qu et al., 2024).
- Downstream filtering and calibration: Lightweight plug-and-play annotators, reranking by hallucination likelihood (ANAH-v2), and calibrated token-level scores enable efficient filtering, human-in-the-loop annotation, and adaptive rerouting for high-stakes deployments (Gu et al., 2024, Das et al., 30 Jul 2025, Colelough et al., 12 Feb 2026).
- Interpretability and error tracing: Concept-level, causal-graph, and logical-form representations, as well as explainable NLI-based classifiers, make error sources and inductive steps intelligible to practitioners (Xu, 29 Sep 2025, Zavhorodnii et al., 6 Oct 2025).
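To make the preference-optimization item concrete, the standard DPO objective for a single preference pair can be written in a few lines. This sketch shows plain DPO only, not the HDPO or HA-DPO variants, and it assumes the summed log-probabilities of the faithful (chosen) and hallucinated (rejected) responses are already available from the policy and a frozen reference model; `beta=0.1` is an illustrative default.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair: -log sigmoid(beta * margin)."""
    # How much more the policy favors each response than the reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_gain - rejected_gain)
    # Numerically stable -log(sigmoid(z)) = log(1 + exp(-z)).
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Illustrative log-probabilities (hypothetical values).
neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)   # policy identical to reference
improved = dpo_loss(-2.0, -8.0, -5.0, -5.0)  # policy prefers the faithful response
worsened = dpo_loss(-8.0, -2.0, -5.0, -5.0)  # policy prefers the hallucinated response
```

The loss falls as the policy shifts probability toward the faithful response relative to the reference model, which is the mechanism by which hallucination-rich outputs are penalized.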
AGI engineering must shift from "hallucination elimination" to structuring, tolerating, and managing errors of generalization. Approaches that make uncertainty explicit, rather than hiding it, will be essential for safe and adaptive deployment in dynamic, unbounded environments (Xu, 29 Sep 2025).
7. Open Challenges and Research Directions
Key open problems include:
- Building universally reference-free, zero-shot hallucination detectors that generalize across domains and model families (Alansari et al., 5 Oct 2025).
- Closing multilingual hallucination gaps through tailored data augmentation, language-specific tokenization, and balanced pretraining (Chataigner et al., 2024, Thoresen et al., 4 May 2026).
- Theorizing and diagnosing cross-modal and chain-of-reasoning hallucinations, especially in vision–language and multi-step tasks (Lan et al., 2024, Fu et al., 2024).
- Developing scalable, cost-effective, and explainable benchmarks and detection pipelines that can operate at web scale and with human-parity reliability (Gu et al., 2024, Ren et al., 12 Oct 2025).
- Achieving automatic, fine-grained calibration of model confidence and dynamically integrating retrieval/inference modules for robust hallucination mitigation in production systems (Wang et al., 2024).
Hallucination remains a fundamentally unsolved and theoretically irreducible challenge under open-world assumptions, yet an expanding methodological ecosystem offers tangible avenues for rigorously characterizing, detecting, and managing its various manifestations in leading LLM and VLM architectures. Ongoing research aims at more truthful, calibrated, and interpretable generative models (Xu, 29 Sep 2025, Alansari et al., 5 Oct 2025, Thoresen et al., 4 May 2026, Wang et al., 2024).