Inference-Time Intervention: Eliciting Truthful Answers from a Language Model (2306.03341v6)

Published 6 Jun 2023 in cs.LG, cs.AI, and cs.CL

Abstract: We introduce Inference-Time Intervention (ITI), a technique designed to enhance the "truthfulness" of LLMs. ITI operates by shifting model activations during inference, following a set of directions across a limited number of attention heads. This intervention significantly improves the performance of LLaMA models on the TruthfulQA benchmark. On an instruction-finetuned LLaMA called Alpaca, ITI improves its truthfulness from 32.5% to 65.1%. We identify a tradeoff between truthfulness and helpfulness and demonstrate how to balance it by tuning the intervention strength. ITI is minimally invasive and computationally inexpensive. Moreover, the technique is data efficient: while approaches like RLHF require extensive annotations, ITI locates truthful directions using only few hundred examples. Our findings suggest that LLMs may have an internal representation of the likelihood of something being true, even as they produce falsehoods on the surface.

Inference-Time Intervention: Eliciting Truthful Answers from a Language Model

The paper presents a method called Inference-Time Intervention (ITI), aimed at enhancing the truthfulness of LLMs. The technique shifts the model's activations during inference to improve the accuracy of its responses on truth-oriented benchmarks. Specifically, ITI yields substantial gains for LLaMA models on TruthfulQA, a challenging benchmark designed to test truthfulness, raising the accuracy of the instruction-finetuned Alpaca variant from 32.5% to 65.1%.

Methodology

ITI aligns the model's activations during inference with a selected direction correlated with truthfulness. The authors leverage the inherent latent knowledge within LLMs, hypothesizing that while models may produce inaccurate outputs, they possess a nuanced internal representation of truth. To implement ITI, the authors identify and adjust a limited number of attention heads within the LLMs that are determined to be highly indicative of truthful responses.
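
In outline, the recipe has two stages: probe each attention head's activations for a truthful-versus-false signal, then shift the top-scoring heads along that direction at generation time. The following is a minimal sketch of that recipe using synthetic activations in place of a real LLaMA model; the array sizes, the planted signal, the number of heads selected, and the value of `alpha` are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch of the ITI recipe, using synthetic per-head activations instead of a
# real LLaMA model. Array sizes, head indices, and `alpha` are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, n_heads, head_dim = 600, 32, 64          # hypothetical sizes

# Stand-in for attention-head outputs at the answer's last token, labelled
# 1 = truthful answer, 0 = false answer (e.g. from TruthfulQA QA pairs).
acts = rng.normal(size=(n_samples, n_heads, head_dim)).astype(np.float32)
labels = rng.integers(0, 2, size=n_samples)
truth_signal = rng.normal(size=head_dim)
for h in (3, 7, 11):                                 # plant a weak signal so the probes find something
    acts[labels == 1, h] += 0.8 * truth_signal

# Stage 1: train a linear probe per head and rank heads by validation accuracy.
val_acc = np.zeros(n_heads)
directions = np.zeros((n_heads, head_dim))
for h in range(n_heads):
    X_tr, X_va, y_tr, y_va = train_test_split(acts[:, h], labels, test_size=0.3, random_state=0)
    val_acc[h] = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_va, y_va)
    # "Mass-mean" direction: difference of class means in this head's activation space.
    directions[h] = acts[labels == 1, h].mean(axis=0) - acts[labels == 0, h].mean(axis=0)

top_k = 4
top_heads = np.argsort(-val_acc)[:top_k]             # intervene only on the most truth-indicative heads

# Stage 2: at decoding time, shift the selected heads along their truthful direction,
# scaled by the activations' standard deviation along that direction and by alpha.
alpha = 15.0                                         # intervention strength (truthfulness vs. helpfulness knob)

def intervene(head_outputs: np.ndarray) -> np.ndarray:
    """head_outputs: (n_heads, head_dim) activations for the token being generated."""
    out = head_outputs.copy()
    for h in top_heads:
        d = directions[h] / np.linalg.norm(directions[h])
        sigma = (acts[:, h] @ d).std()               # spread of the probe data along d
        out[h] += alpha * sigma * d
    return out

print("top heads:", top_heads, "val acc:", val_acc[top_heads].round(2))
print("shift applied?", not np.allclose(intervene(acts[0]), acts[0]))
```

In the paper, the shift is added to the selected heads' outputs at every autoregressively generated token, and the truthful direction can be taken either from the probe weights or, as sketched here, from the difference of class means.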

Key Findings

Applying ITI narrows the gap between the model's probing accuracy and its generation accuracy, yielding roughly a 40% improvement in truthfulness. This performance boost is achieved without the extensive resource demands typical of methods such as Reinforcement Learning from Human Feedback (RLHF): ITI requires only a minimally invasive modification at inference time and uses far fewer training samples.

Comparison with Other Methods

The ITI method is contrasted with established techniques such as supervised fine-tuning and few-shot prompting, as well as more computationally intensive approaches like RLHF. ITI shows substantial improvements over these baselines on the TruthfulQA benchmark. Where RLHF demands enormous computational resources and annotation effort, ITI offers a far less resource-intensive alternative without sacrificing efficacy.

Generalization and Implications

The paper explores ITI’s generalization potential across datasets with results extending to Natural Questions, TriviaQA, and MMLU. These findings suggest that ITI’s benefits may not be restricted to its original benchmark and imply a broader applicability to other truth-based tasks.

Future Directions

While ITI presents a noteworthy improvement in truthfulness, the paper notes challenges in balancing truthfulness with informativeness. The choice of intervention strength is crucial, impacting the model's overall utility. Future work might focus on refining these interventions to optimize this trade-off and potentially automate the discovery of truthful directions without supervised data.
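
As a rough illustration of how that tuning might be operationalized, the sketch below sweeps the intervention strength and keeps the largest value whose informativeness stays above a chosen floor. The generator and both scorers are dummy stand-ins (the paper evaluates with fine-tuned GPT-judge and GPT-info models on TruthfulQA); only the selection logic for the trade-off is the point here.

```python
# Hypothetical sweep over the intervention strength `alpha`. The generator and scorers
# below are dummy stand-ins, not the paper's evaluation models; only the selection
# logic for the truthfulness/informativeness trade-off is illustrated.
def generate_with_iti(question: str, alpha: float) -> str:
    return f"[ITI-steered answer to {question!r} at alpha={alpha}]"   # stand-in for steered decoding

def truthfulness_score(answers, alpha):          # dummy: rises with alpha, then saturates
    return min(1.0, 0.33 + 0.025 * alpha)

def informativeness_score(answers, alpha):       # dummy: degrades as alpha grows
    return max(0.0, 1.0 - 0.015 * alpha)

def pick_alpha(questions, alphas=(0.0, 5.0, 10.0, 15.0, 20.0, 25.0), info_floor=0.7):
    """Return (alpha, truthfulness, informativeness) for the strongest acceptable intervention."""
    best = None
    for alpha in alphas:
        answers = [generate_with_iti(q, alpha) for q in questions]
        t = truthfulness_score(answers, alpha)
        i = informativeness_score(answers, alpha)
        if i >= info_floor and (best is None or t > best[1]):
            best = (alpha, t, i)
    return best

print(pick_alpha(["What happens if you crack your knuckles a lot?"]))
```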

Moreover, the exploration of mechanistic interpretability within the framework of ITI could shed light on the internal processes of LLMs during inference. Understanding the causal implications of these interventions provides an exciting avenue for future research.

This research offers a compelling strategy for enhancing the reliability of LLM outputs. By addressing the inherent tension between truthfulness and helpfulness, ITI contributes significantly to the ongoing discourse on LLM alignment and controllability, presenting methodology that could be incorporated into larger systems aimed at ensuring model reliability.

Authors (5)
  1. Kenneth Li (11 papers)
  2. Oam Patel (6 papers)
  3. Fernanda Viégas (23 papers)
  4. Hanspeter Pfister (131 papers)
  5. Martin Wattenberg (39 papers)
Citations (344)