- The paper introduces a novel Contrastive Activation Addition method to steer Llama 2’s outputs toward desired behaviors.
- Empirical results demonstrate significant modulation of hallucination and sycophancy using steering vectors across diverse evaluation settings.
- The technique derives steering vectors from contrastive residual-stream activations, enabling fine-grained behavioral control that improves alignment while largely preserving the model's core capabilities.
An Overview of "Steering Llama 2 via Contrastive Activation Addition"
The paper "Steering Llama 2 via Contrastive Activation Addition" introduces a novel technique for precise control of LLM behaviors using a method termed Contrastive Activation Addition (CAA). The method stands out as it contributes to the ongoing efforts in the research community to improve the alignment of LLMs with human intentions, specifically through targeted steering of model outputs during inference.
CAA Methodology and Application
At its core, CAA computes "steering vectors" from contrastive residual-stream activations associated with a target behavior, such as producing factual versus hallucinatory responses. Each steering vector is the mean difference between the activations on behavior-positive and behavior-negative examples. At inference time, the vector is added to the residual stream at every token position after the user's prompt, scaled by a coefficient whose sign and magnitude control the direction and intensity of the steering.
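As a concrete illustration, the sketch below shows one way the extraction step could look in PyTorch with the Hugging Face transformers library. It is a minimal approximation rather than the authors' implementation: the model checkpoint, the choice of layer 13, the `residual_at_last_token` helper, and the `contrastive_pairs` list are all illustrative assumptions, and the activation is read at the final token of each contrastive completion.

```python
# Minimal sketch of CAA steering-vector extraction, not the authors' code.
# Assumptions: each behavior is posed as a pair of texts ending in the
# behavior-positive vs. behavior-negative answer; MODEL_NAME, LAYER, and
# `contrastive_pairs` are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # assumed checkpoint
LAYER = 13                                    # assumed intermediate layer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16)
model.eval()

def residual_at_last_token(text: str, layer: int) -> torch.Tensor:
    """Residual-stream activation at the final token position of `text`.

    hidden_states[0] is the embedding output, so hidden_states[layer]
    is the output of decoder block layer - 1.
    """
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer][0, -1, :]  # shape: (d_model,)

def compute_steering_vector(pairs, layer: int) -> torch.Tensor:
    """Mean activation difference between behavior-positive and
    behavior-negative completions of the same prompts."""
    diffs = [residual_at_last_token(pos, layer) - residual_at_last_token(neg, layer)
             for pos, neg in pairs]
    return torch.stack(diffs).mean(dim=0)

# `contrastive_pairs`: hypothetical list of (positive_text, negative_text) strings,
# e.g. the same question followed by the sycophantic vs. non-sycophantic answer.
# steering_vector = compute_steering_vector(contrastive_pairs, LAYER)
```

Averaging the differences over many pairs cancels prompt-specific variation, leaving a direction in activation space associated with the behavior itself.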
Importantly, the paper evaluates CAA extensively on datasets of multiple-choice questions as well as open-ended text generation. The results show that the method steers Llama 2 Chat models to either increase or decrease behaviors of interest, including hallucination, sycophancy, and corrigibility. Notably, CAA's effects stack on top of established techniques such as finetuning and system-prompt design, without significantly degrading the models' core capabilities.
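The application step can be sketched in a similarly hedged way: a forward hook adds the steering vector from the previous sketch to a decoder block's output during generation. For simplicity, this illustration steers only tokens generated after the prompt, distinguishing the prompt pass from incremental decoding by sequence length; `MULTIPLIER` and the example prompt are assumptions, not values from the paper.

```python
# Minimal sketch of applying a CAA steering vector at inference time
# (an illustration under the assumptions above, not the paper's code).
# `steering_vector`, LAYER, `model`, and `tokenizer` come from the previous sketch.
MULTIPLIER = 2.0  # positive values amplify the behavior, negative values suppress it

def steering_hook(module, inputs, output):
    hidden = output[0]  # residual stream: (batch, seq_len, d_model)
    if hidden.shape[1] > 1:
        return output   # prompt pass: leave the user's prompt tokens unsteered
    # Incremental decoding: every new token comes after the prompt, so steer it.
    hidden = hidden + MULTIPLIER * steering_vector.to(hidden)
    return (hidden,) + output[1:]

# Hook decoder block LAYER - 1 so the layer index matches the extraction sketch.
handle = model.model.layers[LAYER - 1].register_forward_hook(steering_hook)
try:
    inputs = tokenizer("How do you feel about my brilliant plan?", return_tensors="pt")
    out_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    print(tokenizer.decode(out_ids[0], skip_special_tokens=True))
finally:
    handle.remove()  # always detach the hook so later calls are unsteered
```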
Insights from Numerical Results
Empirical validation shows robust steering effects, with CAA modulating behavior across a range of alignment-relevant tasks. In experiments with Llama 2 models, CAA reliably alters the incidence of the targeted behaviors as judged by an external metric, GPT-4 scores of the generated text. Furthermore, the method works on models already finetuned with reinforcement learning from human feedback (RLHF), which opens the door to integrating CAA with existing alignment strategies.
The researchers also probe the activation space to better understand CAA's mechanism, using techniques such as PCA to project model activations. The projections show clear separability of the contrasted behaviors at intermediate model layers, indicating that high-level concept representations emerge naturally in LLMs and can be targeted effectively by CAA.
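The sketch below shows what such a projection might look like, reusing the hypothetical `residual_at_last_token` helper and `contrastive_pairs` list from the extraction sketch; scikit-learn's PCA and matplotlib handle the projection and plot.

```python
# Illustrative sketch of the activation-space analysis described above:
# project residual-stream activations for behavior-positive and
# behavior-negative examples onto their top two principal components
# and inspect whether the two classes separate at a given layer.
import matplotlib.pyplot as plt
import torch
from sklearn.decomposition import PCA

def plot_separability(pairs, layer: int) -> None:
    pos_acts = torch.stack([residual_at_last_token(p, layer) for p, _ in pairs])
    neg_acts = torch.stack([residual_at_last_token(n, layer) for _, n in pairs])
    acts = torch.cat([pos_acts, neg_acts]).float().cpu().numpy()

    projected = PCA(n_components=2).fit_transform(acts)
    n = len(pairs)
    plt.scatter(projected[:n, 0], projected[:n, 1], label="behavior-positive")
    plt.scatter(projected[n:, 0], projected[n:, 1], label="behavior-negative")
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.legend()
    plt.title(f"Residual-stream activations at layer {layer}")
    plt.show()
```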
Implications and Future Directions
The implications of CAA are manifold, adding value both to theoretical work on model alignment and to practical applications. The ability to modify LLM behavior through controlled activation adjustments may open new avenues in AI safety, offering a potentially less resource-intensive alternative to data- and compute-heavy methods such as finetuning.
Looking forward, the research community could explore expanding CAA to target a broader range of behaviors and generalize across different models and architectures. Additionally, understanding the interaction effects between CAA and other alignment methodologies could refine the precision of AI behavior control further.
Conclusion
In summary, "Steering Llama 2 via Contrastive Activation Addition" makes a significant contribution to the landscape of AI alignment research, bridging theoretical insights with practical steering techniques. CAA offers a scalable and effective solution for the nuanced control of LLM behaviors, presenting a pathway towards achieving more aligned, safe, and reliable AI systems. Through continued exploration and validation, CAA could play a pivotal role in advancing the alignment capabilities of future AI technologies.