Distinguishing Ignorance from Error in LLM Hallucinations (2410.22071v2)
Abstract: LLMs are susceptible to hallucinations -- factually incorrect outputs -- leading to a large body of work on detecting and mitigating such cases. We argue that it is important to distinguish between two types of hallucinations: ones where the model does not hold the correct answer in its parameters, which we term HK-, and ones where the model answers incorrectly despite having the required knowledge, termed HK+. We first find that HK+ hallucinations are prevalent and occur across models and datasets. Then, we demonstrate that distinguishing between these two cases is beneficial for mitigating hallucinations. Importantly, we show that different models hallucinate on different examples, which motivates constructing model-specific hallucination datasets for training detectors. Overall, our findings draw attention to classifying types of hallucinations and provide means to handle them more effectively. The code is available at https://github.com/technion-cs-nlp/hallucination-mitigation.
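The HK+/HK- distinction hinges on whether the model holds the correct answer in its parameters. As an illustration only (the paper's exact labeling protocol may differ), the sketch below labels a wrong greedy answer as HK+ when the model can still recover a gold answer among sampled generations, and as HK- otherwise; `generate_greedy` and `generate_samples` are hypothetical user-supplied wrappers around a model's generation call.

```python
# Illustrative sketch, not the paper's exact protocol: classify a hallucinated
# example as HK+ (model knows the answer but errs) or HK- (model lacks the
# knowledge), by probing whether sampled generations ever contain a gold answer.

from typing import Callable, List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so surface variation does not affect matching."""
    return " ".join(text.lower().strip().split())


def contains_answer(prediction: str, gold_answers: List[str]) -> bool:
    """Simple containment check: does the prediction include any gold answer string?"""
    pred = normalize(prediction)
    return any(normalize(gold) in pred for gold in gold_answers)


def label_example(
    question: str,
    gold_answers: List[str],
    generate_greedy: Callable[[str], str],            # hypothetical: greedy decoding wrapper
    generate_samples: Callable[[str, int], List[str]],  # hypothetical: sampling wrapper
    num_samples: int = 10,
) -> str:
    """Return 'correct', 'HK+' (knows but answers wrongly), or 'HK-' (lacks the knowledge)."""
    greedy = generate_greedy(question)
    if contains_answer(greedy, gold_answers):
        return "correct"
    # Greedy answer is wrong: check whether the knowledge is present at all,
    # here approximated by whether any sampled generation recovers a gold answer.
    samples = generate_samples(question, num_samples)
    knows = any(contains_answer(s, gold_answers) for s in samples)
    return "HK+" if knows else "HK-"
```

Such per-model labels can then be aggregated into a model-specific hallucination dataset for training a detector, since, as the abstract notes, different models hallucinate on different examples.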