Constructing Benchmarks and Interventions for Combating Hallucinations in LLMs (2404.09971v2)

Published 15 Apr 2024 in cs.CL

Abstract: LLMs are prone to hallucinations, which sparked a widespread effort to detect and prevent them. Recent work attempts to mitigate hallucinations by intervening in the model's generation, typically computing representative vectors of hallucinations vs. grounded generations, for steering the model's hidden states away from a hallucinatory state. However, common studies employ different setups and do not properly separate different possible causes of hallucinations, making interventions misguided. In this work, we introduce a method for categorizing examples based on the model's prior knowledge, named WACK. We construct WACK benchmarks that support interventions in two settings: open-book and closed-book question answering. Using the benchmarks, we perform an extensive investigation of the effect of different choices for intervention, such as the intervened components, and how often and how strongly to intervene. We find that intervention success varies depending on the component, with the attention blocks performing well and the residual stream proving detrimental to language modeling capabilities. We also show that interventions can benefit from representative vectors collected before, rather than after, a hallucination occurs. Finally, we introduce a new dynamic intervention, which intervenes only if needed, and thus is more robust than standard static interventions. The code is available at https://github.com/technion-cs-nlp/hallucination-mitigation.

Comprehensive Analysis of White-Box Intervention Techniques for Mitigating Hallucinations in LLMs

Introduction to the Study

In the field of LLMs, a persistent issue is their tendency to produce incorrect or ungrounded statements, commonly referred to as hallucinations. These inaccuracies stem from a variety of causes, ranging from the model's failure to properly integrate its input to discrepancies with real-world knowledge. While black-box solutions, which adjust the model's output after generation, have been explored to some extent, there is growing interest in white-box approaches, which intervene in the model's internal computation to prevent hallucinations at their source. This paper presents an in-depth study of white-box intervention techniques, offering new insights into their application and effectiveness.

Hallucination Types and Dataset Construction

The authors distinguish between three types of knowledge-related hallucinations in LLMs. They focus on what they term "type-3" hallucinations, where the model possesses the correct response within its parameters but fails to generate it. Adopting this nuanced classification allows for a more targeted approach to mitigating hallucinations. The methodology for constructing hallucination-laden datasets tailored to specific models is particularly noteworthy, facilitating a more accurate evaluation of intervention techniques in both open-book and closed-book settings.
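
To make the model-specific construction concrete, the following is a minimal sketch (not the authors' exact WACK pipeline) of how examples could be categorized by a model's prior knowledge: a question counts as "known" if the model answers it correctly in a closed-book greedy pass, and a known question becomes a candidate type-3 example if a perturbed prompt then elicits a wrong answer. The model name, prompt formats, and substring check are illustrative assumptions.

```python
# Minimal sketch, assuming a HuggingFace causal LM as a stand-in for the
# models studied in the paper; prompts and the substring check are
# illustrative, not the paper's exact WACK procedure.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # stand-in; the paper works with larger open LLMs
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

@torch.no_grad()
def greedy_answer(prompt: str, max_new_tokens: int = 10) -> str:
    """Closed-book greedy completion of a prompt."""
    ids = tok(prompt, return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=max_new_tokens,
                         do_sample=False, pad_token_id=tok.eos_token_id)
    return tok.decode(out[0, ids["input_ids"].shape[1]:]).strip()

def categorize(question: str, gold: str, perturbed_prompt: str) -> str:
    """Label an example by the model's prior knowledge."""
    if gold.lower() not in greedy_answer(f"Q: {question}\nA:").lower():
        return "no-knowledge"            # model lacks the fact
    if gold.lower() in greedy_answer(perturbed_prompt).lower():
        return "grounded"                # knows it and still answers correctly
    return "type-3 hallucination"        # knows the fact but fails to produce it

# Hypothetical hallucination-inducing context: a misleading few-shot prefix.
print(categorize(
    "What is the capital of France?", "Paris",
    "Q: What is the capital of Italy?\nA: Madrid\n"
    "Q: What is the capital of France?\nA:",
))
```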

Intervention Analysis

The intervention strategies explored in this work are comprehensive, covering different model components such as MLPs, attention blocks, attention heads, and the residual stream. The authors investigate the efficacy of interventions along several axes: the component of the architecture being modified, whether representative vectors are collected from states before or after a hallucination occurs, and the use of static versus dynamic interventions. Their findings reveal several key insights:

  • Different intervention components exhibit varying degrees of effectiveness, with attention components generally providing the best balance across metrics.
  • Collecting representative (steering) vectors from hidden states before a hallucination occurs, rather than after, tends to produce more effective interventions that are less detrimental to model performance.
  • Dynamic intervention, which decides per example whether to intervene based on the model's likelihood of hallucinating, shows promise and is more robust than static intervention, particularly when targeting the residual stream (a minimal sketch of static and dynamic steering follows this list).
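
The sketch below illustrates the basic steering mechanism under explicit assumptions: a small HuggingFace model stands in for the LLMs studied in the paper, the steering direction is a mean difference of attention-block activations on a handful of contrasting prompts, and the dynamic variant gates the edit with a simple cosine-similarity threshold rather than a trained detector. This is not the authors' released code (see the linked repository for that).

```python
# Minimal sketch of static vs. dynamic steering on attention-block outputs,
# assuming a gpt2 stand-in model; the layer index, strength ALPHA, contrast
# prompts, and cosine threshold are illustrative choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
LAYER, ALPHA, THRESHOLD = 6, 4.0, 0.1
attn = model.transformer.h[LAYER].attn      # attention block to intervene on

_cache = {}
def _capture(_m, _i, out):
    # Record the attention-block output at the last token position.
    h = out[0] if isinstance(out, tuple) else out
    _cache["h"] = h[0, -1].detach()

@torch.no_grad()
def attn_state(prompt):
    handle = attn.register_forward_hook(_capture)
    model(**tok(prompt, return_tensors="pt"))
    handle.remove()
    return _cache["h"]

# Steering direction: mean hallucinatory state minus mean grounded state.
grounded = ["The capital of France is Paris."]
hallucinated = ["The capital of France is Rome."]
steer = (torch.stack([attn_state(p) for p in hallucinated]).mean(0)
         - torch.stack([attn_state(p) for p in grounded]).mean(0))
steer = steer / steer.norm()

def intervene(_m, _i, out, dynamic=False):
    h = out[0] if isinstance(out, tuple) else out
    if dynamic:
        # Dynamic variant: only steer tokens whose activation already leans
        # toward the hallucination direction (a trained probe would be used
        # in practice; cosine similarity is a stand-in).
        sim = torch.nn.functional.cosine_similarity(h, steer, dim=-1)
        mask = (sim > THRESHOLD).unsqueeze(-1).float()
    else:
        mask = 1.0                        # static variant: always steer
    h = h - ALPHA * mask * steer
    return (h,) + out[1:] if isinstance(out, tuple) else h

# Generate with the static intervention active (pass dynamic=True to compare).
handle = attn.register_forward_hook(lambda m, i, o: intervene(m, i, o))
ids = tok("The capital of France is", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=5,
                                pad_token_id=tok.eos_token_id)[0]))
handle.remove()
```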

Theoretical and Practical Implications

The paper's rigorous analysis sheds light on the intricacies of deploying steering vectors for hallucination mitigation in LLMs. The observed distinction between classification and generation accuracy underscores the need for a multifaceted approach to evaluating intervention success. Furthermore, the recognition of perplexity as an essential metric highlights the delicate balance between reducing hallucinations and maintaining the model's overall linguistic capabilities. The exploration of intervention strategies in both pre-trained and fine-tuned models opens up new avenues for refining LLM outputs in application-specific contexts.
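
As a concrete example of the perplexity check discussed above, one could compare the intervened and unmodified models' perplexity on held-out text; the snippet below is a minimal sketch using the same gpt2 stand-in as the earlier sketches, not the paper's evaluation harness.

```python
# Minimal perplexity sketch: average next-token cross-entropy on a text,
# exponentiated. In practice this would be run over a held-out corpus
# (e.g., WikiText) with and without the intervention hooks attached.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt")["input_ids"]
    # Passing labels=ids makes the model return mean cross-entropy over tokens.
    loss = model(ids, labels=ids).loss
    return math.exp(loss.item())

print(perplexity("The quick brown fox jumps over the lazy dog."))
```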

Future Directions

The work sets the stage for further exploration into the potential of dynamic intervention strategies and the role of model fine-tuning in enhancing intervention outcomes. Additionally, the novel categorization of hallucinations invites future research to explore personalized intervention techniques, tailored not only to specific models but also to individual generation instances.

Concluding Remarks

This comprehensive paper on white-box intervention techniques offers valuable insights into mitigating hallucinations in LLMs, marking a significant step toward more reliable and accurate natural language generation. By dissecting the factors contributing to intervention success and highlighting the importance of context-sensitive approaches, this research contributes to the ongoing development of more robust and trustworthy AI language capabilities.

Authors (4)
  1. Adi Simhi (5 papers)
  2. Jonathan Herzig (34 papers)
  3. Idan Szpektor (47 papers)
  4. Yonatan Belinkov (111 papers)