Inference-Time Intervention: Eliciting Truthful Answers from a Language Model (2306.03341v6)

Published 6 Jun 2023 in cs.LG, cs.AI, and cs.CL

Abstract: We introduce Inference-Time Intervention (ITI), a technique designed to enhance the "truthfulness" of LLMs. ITI operates by shifting model activations during inference, following a set of directions across a limited number of attention heads. This intervention significantly improves the performance of LLaMA models on the TruthfulQA benchmark. On an instruction-finetuned LLaMA called Alpaca, ITI improves its truthfulness from 32.5% to 65.1%. We identify a tradeoff between truthfulness and helpfulness and demonstrate how to balance it by tuning the intervention strength. ITI is minimally invasive and computationally inexpensive. Moreover, the technique is data efficient: while approaches like RLHF require extensive annotations, ITI locates truthful directions using only few hundred examples. Our findings suggest that LLMs may have an internal representation of the likelihood of something being true, even as they produce falsehoods on the surface.

Inference-Time Intervention: Eliciting Truthful Answers from a Language Model

The paper presents a method called Inference-Time Intervention (ITI), aimed at enhancing the truthfulness of LLMs. The technique shifts the model's activations during inference to improve the accuracy of its responses on truth-oriented benchmarks. Specifically, ITI yields substantial gains for LLaMA models on TruthfulQA, a challenging benchmark designed to test truthfulness, raising the accuracy of the instruction-finetuned Alpaca variant from 32.5% to 65.1%.

Methodology

ITI aligns the model's activations during inference with a selected direction correlated with truthfulness. The authors leverage the inherent latent knowledge within LLMs, hypothesizing that while models may produce inaccurate outputs, they possess a nuanced internal representation of truth. To implement ITI, the authors identify and adjust a limited number of attention heads within the LLMs that are determined to be highly indicative of truthful responses.
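
In outline, the recipe has two stages: probe each attention head's activations for a truthful-versus-false signal, then shift the top-scoring heads along that direction at generation time. The following is a minimal sketch of that recipe using synthetic activations in place of a real LLaMA model; the array sizes, the planted signal, the number of heads selected, and the value of `alpha` are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch of the ITI recipe, using synthetic per-head activations instead of a
# real LLaMA model. Array sizes, head indices, and `alpha` are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, n_heads, head_dim = 600, 32, 64          # hypothetical sizes

# Stand-in for attention-head outputs at the answer's last token, labelled
# 1 = truthful answer, 0 = false answer (e.g. from TruthfulQA QA pairs).
acts = rng.normal(size=(n_samples, n_heads, head_dim)).astype(np.float32)
labels = rng.integers(0, 2, size=n_samples)
truth_signal = rng.normal(size=head_dim)
for h in (3, 7, 11):                                 # plant a weak signal so the probes find something
    acts[labels == 1, h] += 0.8 * truth_signal

# Stage 1: train a linear probe per head and rank heads by validation accuracy.
val_acc = np.zeros(n_heads)
directions = np.zeros((n_heads, head_dim))
for h in range(n_heads):
    X_tr, X_va, y_tr, y_va = train_test_split(acts[:, h], labels, test_size=0.3, random_state=0)
    val_acc[h] = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_va, y_va)
    # "Mass-mean" direction: difference of class means in this head's activation space.
    directions[h] = acts[labels == 1, h].mean(axis=0) - acts[labels == 0, h].mean(axis=0)

top_k = 4
top_heads = np.argsort(-val_acc)[:top_k]             # intervene only on the most truth-indicative heads

# Stage 2: at decoding time, shift the selected heads along their truthful direction,
# scaled by the activations' standard deviation along that direction and by alpha.
alpha = 15.0                                         # intervention strength (truthfulness vs. helpfulness knob)

def intervene(head_outputs: np.ndarray) -> np.ndarray:
    """head_outputs: (n_heads, head_dim) activations for the token being generated."""
    out = head_outputs.copy()
    for h in top_heads:
        d = directions[h] / np.linalg.norm(directions[h])
        sigma = (acts[:, h] @ d).std()               # spread of the probe data along d
        out[h] += alpha * sigma * d
    return out

print("top heads:", top_heads, "val acc:", val_acc[top_heads].round(2))
print("shift applied?", not np.allclose(intervene(acts[0]), acts[0]))
```

In the paper, the shift is added to the selected heads' outputs at every autoregressively generated token, and the truthful direction can be taken either from the probe weights or, as sketched here, from the difference of class means.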

Key Findings

Applying ITI narrows the gap between the model's probing accuracy and its generation accuracy, yielding roughly a 40% improvement in truthfulness. This performance boost is achieved without the extensive resource demands typical of methods such as Reinforcement Learning from Human Feedback (RLHF): ITI requires only a minimally invasive modification at inference time and uses far fewer training samples.

Comparison with Other Methods

The ITI method is contrasted with established techniques such as supervised fine-tuning and few-shot prompting, as well as more computationally intensive approaches like RLHF. ITI shows substantial improvements over these baselines on the TruthfulQA benchmark. Where RLHF demands enormous computational resources and annotation effort, ITI offers a far less resource-intensive alternative without sacrificing efficacy.

Generalization and Implications

The paper explores ITI’s generalization potential across datasets with results extending to Natural Questions, TriviaQA, and MMLU. These findings suggest that ITI’s benefits may not be restricted to its original benchmark and imply a broader applicability to other truth-based tasks.

Future Directions

While ITI presents a noteworthy improvement in truthfulness, the paper notes challenges in balancing truthfulness with informativeness. The choice of intervention strength is crucial, impacting the model's overall utility. Future work might focus on refining these interventions to optimize this trade-off and potentially automate the discovery of truthful directions without supervised data.
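
As a rough illustration of how that tuning might be operationalized, the sketch below sweeps the intervention strength and keeps the largest value whose informativeness stays above a chosen floor. The generator and both scorers are dummy stand-ins (the paper evaluates with fine-tuned GPT-judge and GPT-info models on TruthfulQA); only the selection logic for the trade-off is the point here.

```python
# Hypothetical sweep over the intervention strength `alpha`. The generator and scorers
# below are dummy stand-ins, not the paper's evaluation models; only the selection
# logic for the truthfulness/informativeness trade-off is illustrated.
def generate_with_iti(question: str, alpha: float) -> str:
    return f"[ITI-steered answer to {question!r} at alpha={alpha}]"   # stand-in for steered decoding

def truthfulness_score(answers, alpha):          # dummy: rises with alpha, then saturates
    return min(1.0, 0.33 + 0.025 * alpha)

def informativeness_score(answers, alpha):       # dummy: degrades as alpha grows
    return max(0.0, 1.0 - 0.015 * alpha)

def pick_alpha(questions, alphas=(0.0, 5.0, 10.0, 15.0, 20.0, 25.0), info_floor=0.7):
    """Return (alpha, truthfulness, informativeness) for the strongest acceptable intervention."""
    best = None
    for alpha in alphas:
        answers = [generate_with_iti(q, alpha) for q in questions]
        t = truthfulness_score(answers, alpha)
        i = informativeness_score(answers, alpha)
        if i >= info_floor and (best is None or t > best[1]):
            best = (alpha, t, i)
    return best

print(pick_alpha(["What happens if you crack your knuckles a lot?"]))
```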

Moreover, the exploration of mechanistic interpretability within the framework of ITI could shed light on the internal processes of LLMs during inference. Understanding the causal implications of these interventions provides an exciting avenue for future research.

This research offers a compelling strategy for enhancing the reliability of LLM outputs. By addressing the inherent tension between truthfulness and helpfulness, ITI contributes significantly to the ongoing discourse on LLM alignment and controllability, presenting methodology that could be incorporated into larger systems aimed at ensuring model reliability.

Authors (5)
  1. Kenneth Li (11 papers)
  2. Oam Patel (6 papers)
  3. Fernanda Viégas (23 papers)
  4. Hanspeter Pfister (131 papers)
  5. Martin Wattenberg (39 papers)
Citations (344)