Inference-Time Intervention: Eliciting Truthful Answers from a Language Model (2306.03341v6)
Abstract: We introduce Inference-Time Intervention (ITI), a technique designed to enhance the "truthfulness" of large language models (LLMs). ITI operates by shifting model activations during inference, along a set of directions across a limited number of attention heads. This intervention significantly improves the performance of LLaMA models on the TruthfulQA benchmark. On an instruction-finetuned LLaMA called Alpaca, ITI improves truthfulness from 32.5% to 65.1%. We identify a tradeoff between truthfulness and helpfulness and demonstrate how to balance it by tuning the intervention strength. ITI is minimally invasive and computationally inexpensive. Moreover, the technique is data efficient: while approaches like RLHF require extensive annotations, ITI locates truthful directions using only a few hundred examples. Our findings suggest that LLMs may have an internal representation of the likelihood of something being true, even as they produce falsehoods on the surface.
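The core mechanism (shifting a head's activation along a learned "truthful" direction, scaled by an intervention strength) can be sketched in a few lines. This is a minimal toy illustration, not the paper's implementation: the activations are fabricated NumPy arrays rather than LLaMA attention-head outputs, and the direction-finding step is a simple difference-of-means stand-in for the probing described in the paper. The names `intervene`, `alpha`, and `sigma` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for per-head activations: rows are examples, columns are the
# head's feature dimensions. In ITI these would be attention-head outputs on
# truthful vs. false answers; here we fabricate two separated clusters.
truthful_acts = rng.normal(loc=+1.0, scale=1.0, size=(200, 8))
false_acts = rng.normal(loc=-1.0, scale=1.0, size=(200, 8))

# A simple "truthful direction": difference of class means, unit-normalized.
direction = truthful_acts.mean(axis=0) - false_acts.mean(axis=0)
direction /= np.linalg.norm(direction)

# Std. dev. of activation projections onto the direction, so the shift is
# scaled to the head's natural variation along that axis.
all_acts = np.vstack([truthful_acts, false_acts])
sigma = (all_acts @ direction).std()

def intervene(head_activation, direction, alpha, sigma):
    """Shift a head's activation along the truthful direction.

    alpha controls intervention strength; tuning it trades off
    truthfulness against helpfulness, as the abstract notes.
    """
    return head_activation + alpha * sigma * direction

# Apply the shift to a new activation at inference time.
x = rng.normal(size=8)
x_shifted = intervene(x, direction, alpha=15.0, sigma=sigma)

# Since `direction` is unit-norm, the projection onto it grows by
# exactly alpha * sigma.
delta = (x_shifted - x) @ direction  # ~ 15.0 * sigma
```

In the real method this shift would be added to the outputs of a selected subset of attention heads at every generated token, leaving model weights untouched, which is why the intervention is both minimally invasive and cheap at inference time.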
Authors: Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, Martin Wattenberg