Constructing Benchmarks and Interventions for Combating Hallucinations in LLMs (2404.09971v2)
Abstract: LLMs are prone to hallucinations, which has sparked widespread efforts to detect and prevent them. Recent work attempts to mitigate hallucinations by intervening in the model's generation, typically by computing representative vectors of hallucinated vs. grounded generations and steering the model's hidden states away from a hallucinatory state. However, prior studies employ different setups and do not properly separate the possible causes of hallucinations, leaving interventions misguided. In this work, we introduce a method for categorizing examples based on the model's prior knowledge, named WACK. We construct WACK benchmarks that support interventions in two settings: open-book and closed-book question answering. Using these benchmarks, we perform an extensive investigation of the effect of different intervention choices, such as which components to intervene on and how often and how strongly to intervene. We find that intervention success varies by component: the attention blocks perform well, while intervening on the residual stream proves detrimental to language-modeling capabilities. We also show that interventions can benefit from representative vectors collected before, rather than after, a hallucination occurs. Finally, we introduce a new dynamic intervention, which intervenes only when needed and is thus more robust than standard static interventions. The code is available at https://github.com/technion-cs-nlp/hallucination-mitigation .
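The core ideas described in the abstract, a mean-difference steering vector computed from hallucinated vs. grounded activations, and a dynamic variant that intervenes only when a detector flags risk, can be sketched roughly as below. This is a minimal illustration, not the authors' implementation; the scaling factor `alpha`, the probe, the risk threshold, and the module path in the usage comment are hypothetical placeholders.

```python
# Minimal sketch of a steering-vector intervention (not the paper's code).
# Assumes activations were already collected at a chosen component
# (e.g., an attention block's output) for grounded and hallucinated examples.
import torch

def steering_vector(grounded_acts: torch.Tensor,
                    hallucinated_acts: torch.Tensor) -> torch.Tensor:
    """Difference-of-means direction pointing from hallucinated to grounded."""
    v = grounded_acts.mean(dim=0) - hallucinated_acts.mean(dim=0)
    return v / v.norm()

def make_static_hook(v: torch.Tensor, alpha: float = 5.0):
    """Static intervention: always add alpha * v to the component's output."""
    def hook(module, inputs, output):
        # Note: some modules return tuples; then only the hidden states
        # (usually the first element) should be modified.
        return output + alpha * v.to(output.dtype)
    return hook

def make_dynamic_hook(v: torch.Tensor, probe: torch.nn.Module,
                      threshold: float = 0.5, alpha: float = 5.0):
    """Dynamic intervention: steer only if a probe predicts a hallucination."""
    def hook(module, inputs, output):
        # Probe scores the last-token activation (assumes batch size 1).
        risk = torch.sigmoid(probe(output[:, -1, :])).item()
        if risk > threshold:
            return output + alpha * v.to(output.dtype)
        return output
    return hook

# Usage (hypothetical attribute path; real names depend on the model):
# layer = model.model.layers[15].self_attn
# handle = layer.register_forward_hook(make_static_hook(v))
# ... generate ...
# handle.remove()
```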
Authors: Adi Simhi, Jonathan Herzig, Idan Szpektor, Yonatan Belinkov