TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space (2402.17811v2)

Published 27 Feb 2024 in cs.CL, cs.AI, and cs.LG

Abstract: LLMs sometimes suffer from producing hallucinations; in particular, they may generate untruthful responses despite possessing the correct knowledge. Activating the truthfulness within an LLM is the key to fully unlocking its knowledge potential. In this paper, we propose TruthX, an inference-time intervention method that activates the truthfulness of an LLM by identifying and editing the features within its internal representations that govern truthfulness. TruthX employs an auto-encoder to map the LLM's representations into semantic and truthful latent spaces respectively, and applies contrastive learning to identify a truthful editing direction within the truthful space. During inference, by editing the LLM's internal representations in the truthful space, TruthX effectively enhances the truthfulness of the LLM. Experiments show that TruthX improves the truthfulness of 13 advanced LLMs by an average of 20% on the TruthfulQA benchmark. Further analyses suggest that TruthX can control an LLM to produce truthful or hallucinatory responses by editing only a single vector in its internal representations.

Enhancing LLM Truthfulness with TruthX: Editing Internal Representations in Truthful Space

Introduction to TruthX

LLMs have grown significantly in prominence, performing a wide array of tasks with notable fluency and comprehension. Despite these advances, LLMs are prone to generating responses that are not anchored in truth, a phenomenon commonly known as "hallucination." To address this challenge, the paper introduces TruthX, a method designed to enhance the truthfulness of LLMs. TruthX operates by editing an LLM's internal representations in a learned "truthful space," a space constructed to distinguish truthful from hallucinatory content and thereby nudge the model's responses toward accuracy.

TruthX: Mechanisms and Techniques

TruthX employs an auto-encoder that decouples an LLM's internal representations into "truthful" and "semantic" latent spaces. Contrastive learning is then used to identify, within the truthful space, an editing direction that enhances truthfulness without compromising the model's inherent generative capabilities. During inference, TruthX applies directional edits in the truthful space to steer the LLM's responses toward factual correctness.
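
The decoupled auto-encoder and the contrastive objective can be pictured with a short sketch. The code below is a minimal illustration in a PyTorch style; the linear encoders, the dimensions, the InfoNCE-style loss, and the centroid-difference editing direction are all illustrative assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TruthfulAutoEncoder(nn.Module):
    """Maps an LLM hidden state into separate semantic and truthful latent
    spaces and reconstructs the hidden state from their concatenation."""

    def __init__(self, hidden_dim=4096, latent_dim=1024):
        super().__init__()
        self.semantic_encoder = nn.Linear(hidden_dim, latent_dim)
        self.truthful_encoder = nn.Linear(hidden_dim, latent_dim)
        self.decoder = nn.Linear(2 * latent_dim, hidden_dim)

    def encode(self, h):
        return self.semantic_encoder(h), self.truthful_encoder(h)

    def forward(self, h):
        z_sem, z_truth = self.encode(h)
        return self.decoder(torch.cat([z_sem, z_truth], dim=-1))


def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss in the truthful space: pull two truthful
    representations of the same question together and push them away
    from hallucinatory representations of that question."""
    anchor = F.normalize(anchor, dim=-1)         # (B, D)
    positive = F.normalize(positive, dim=-1)     # (B, D)
    negatives = F.normalize(negatives, dim=-1)   # (B, K, D)
    pos_sim = (anchor * positive).sum(-1, keepdim=True)           # (B, 1)
    neg_sim = torch.einsum("bd,bkd->bk", anchor, negatives)       # (B, K)
    logits = torch.cat([pos_sim, neg_sim], dim=-1) / temperature  # (B, 1+K)
    labels = torch.zeros(anchor.size(0), dtype=torch.long)        # positive sits at index 0
    return F.cross_entropy(logits, labels)


def truthful_direction(z_truthful, z_halluc):
    """One simple way to obtain an editing direction: the difference between
    truthful and hallucinatory centroids in the truthful space (an
    illustrative choice, not necessarily the paper's exact procedure)."""
    return F.normalize(z_truthful.mean(0) - z_halluc.mean(0), dim=-1)


# Toy usage on random tensors standing in for LLM hidden states.
B, K, H = 8, 4, 4096
ae = TruthfulAutoEncoder(hidden_dim=H, latent_dim=1024)
h_true_a = torch.randn(B, H)    # hidden states of truthful answers
h_true_b = torch.randn(B, H)    # a second truthful view of the same questions
h_false = torch.randn(B, K, H)  # hidden states of hallucinatory answers
_, z_a = ae.encode(h_true_a)
_, z_b = ae.encode(h_true_b)
_, z_neg = ae.encode(h_false)
loss = F.mse_loss(ae(h_true_a), h_true_a) + contrastive_loss(z_a, z_b, z_neg)
direction = truthful_direction(z_a.detach(), z_neg.reshape(-1, z_neg.size(-1)).detach())
```

In this setup, the reconstruction term keeps edited representations decodable back into the LLM's hidden-state space, while the contrastive term shapes only the truthful latent space, mirroring the separation of semantic and truthful information described above.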

Experimental Validation

Extensive experiments demonstrate that TruthX significantly improves the truthfulness of responses from various LLMs. On the TruthfulQA benchmark, TruthX exhibited an average enhancement of 20% in truthfulness across thirteen advanced LLMs. Additionally, analyses indicate that TruthX preserves the generative capabilities of LLMs, addressing concerns that enhancing truthfulness might lead to diminished linguistic fluency or relevance.

Comparative Advantages and Innovations

Compared to existing truthfulness-enhancement techniques, such as contrastive decoding and representation editing, TruthX stands out by:

  • Offering a holistic approach that modifies both the attention and feed-forward network (FFN) modules within LLMs (see the sketch after this list).
  • Introducing the concept of a "truthful space," explicitly separated from the semantic space, so that editing targets truthfulness alone.
  • Demonstrating superior performance in truthfulness enhancement without negatively impacting the LLM's ability to generate coherent and contextually appropriate responses.
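
To illustrate the inference-time side, the sketch below shows one way such edits could be attached to both the attention and FFN sub-modules of a LLaMA-style Hugging Face model via forward hooks. The module paths (model.model.layers[i].self_attn / .mlp), the choice of layers, and the edit strength alpha are hypothetical assumptions, not taken from the paper's implementation.

```python
import torch


def make_truthful_edit_hook(ae, direction, alpha=1.0):
    """Returns a forward hook that shifts a module's output along the
    truthful direction in latent space and decodes it back."""
    @torch.no_grad()
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output  # (B, T, H)
        z_sem, z_truth = ae.encode(hidden)
        z_truth = z_truth + alpha * direction  # alpha > 0: more truthful; alpha < 0: more hallucinatory
        edited = ae.decoder(torch.cat([z_sem, z_truth], dim=-1))
        return (edited,) + output[1:] if isinstance(output, tuple) else edited
    return hook


def attach_truthful_hooks(model, ae, direction, layer_ids, alpha=1.0):
    """Registers the editing hook on the attention and FFN sub-modules of the
    selected decoder layers; returns handles so the hooks can be removed."""
    handles = []
    for i in layer_ids:
        block = model.model.layers[i]             # LLaMA-style decoder block (assumption)
        for sub in (block.self_attn, block.mlp):  # edit both attention and FFN outputs
            handles.append(sub.register_forward_hook(
                make_truthful_edit_hook(ae, direction, alpha)))
    return handles
```

Reversing the sign of alpha pushes the representations the other way, which matches the paper's observation that editing a single vector can steer the model toward either truthful or hallucinatory responses.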

Implications and Future Directions

The implications of TruthX are multifaceted, extending beyond improving the reliability of LLM outputs to contributing foundational insights into the workings and optimizations of LLMs. The concept of editing in a domain-specific latent space opens new avenues for AI research, particularly in areas where accuracy and factuality are paramount.

Furthermore, the cross-LLM generalizability of TruthX, especially among sequentially-trained models, demonstrates its broad applicability, potentially paving the way for universal truthfulness-enhancement solutions adaptable across different architectures and applications. Future work will further explore integrating external knowledge sources with internal representation editing to amplify LLM reliability and usefulness across even more diverse scenarios.

In conclusion, TruthX represents a significant step forward in refining the truthfulness of LLM outputs, ensuring that these models not only generate human-like text but also adhere closely to factual accuracy. This advancement holds promise for a wide range of applications, from enhancing information veracity in real-time interactions to improving the quality of generated content across digital platforms.

Authors (3)
  1. Shaolei Zhang (36 papers)
  2. Tian Yu (24 papers)
  3. Yang Feng (230 papers)