Steering Language Models With Activation Engineering (2308.10248v5)

Published 20 Aug 2023 in cs.CL and cs.LG

Abstract: Prompt engineering and finetuning aim to maximize LLM performance on a given metric (like toxicity reduction). However, these methods do not fully elicit a model's capabilities. To reduce this gap, we introduce activation engineering: the inference-time modification of activations in order to control (or steer) model outputs. Specifically, we introduce the Activation Addition (ActAdd) technique, which contrasts the intermediate activations on prompt pairs (such as "Love" versus "Hate") to compute a steering vector (Subramani et al. 2022). By tactically adding in e.g. the "Love" - "Hate" steering vector during the forward pass, we achieve SOTA on negative-to-positive sentiment shift and detoxification using models including LLaMA-3 and OPT. ActAdd yields inference-time control over high-level output properties (like topic and sentiment) while preserving performance on off-target tasks. ActAdd is lightweight: it does not require any machine optimization and works with a single pair of data points, which enables rapid iteration over steering. ActAdd demonstrates the power of activation engineering.

Summary

The paper introduces Activation Addition (ActAdd), a method for steering LLM outputs by injecting activation adjustments at inference without modifying model weights.
The method is validated on multiple LLMs including GPT-2, Llama-13B, and GPT-J-6B, demonstrating precise steering and minimal impact on general performance.
ActAdd offers a scalable, low-overhead approach for real-time model control, enhancing user interaction and safety without the computational costs of traditional optimization.

Review of "Activation Addition: Steering LLMs Without Optimization"

The paper by Turner et al. introduces Activation Addition (ActAdd), a method for controlling the behavior of LLMs such as GPT-2, Llama-13B, and GPT-J-6B. Unlike optimization-based techniques, it modifies model activations at inference time, changing the output without changing the underlying model weights.

The paper addresses the critical challenge of effectively steering LLMs with minimal computational cost. Existing control methods like supervised finetuning, reinforcement learning from human feedback, and prompt engineering involve substantial computational overhead. In contrast, ActAdd operates at inference time through activation engineering, which is both efficient and scalable with model size.

ActAdd Methodology

ActAdd builds a "steering vector" from the difference between the model activations produced by a pair of prompts that represent opposite properties (e.g., "Love" minus "Hate"). Computing it requires only forward passes, with no gradient descent or backward passes, which keeps the method computationally lightweight and easy to use.
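
In symbols, the construction can be sketched as follows. The notation is illustrative rather than quoted from the paper: $h_{+}^{l}$ and $h_{-}^{l}$ are assumed to be the activations at some layer $l$ for the two prompts, $c$ is an assumed scaling coefficient, and $h^{l}$ is the activation of the prompt being steered.

$$
h_A^{l} = h_{+}^{l} - h_{-}^{l}, \qquad h^{l} \leftarrow h^{l} + c \, h_A^{l}
$$

Both $h_{+}^{l}$ and $h_{-}^{l}$ come from ordinary forward passes, so no gradients are needed at any point.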

The method relies on the structure of transformer models, which process inputs as sequences of high-dimensional activation vectors. Adding the steering vector to these activations at inference time directs the model toward a chosen property, such as a sentiment or topic, without degrading general performance.
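
The procedure described in this section can be illustrated with a minimal, self-contained sketch. It is not the authors' implementation and uses no real LLM: the toy "model" (random weight matrices with a causal-mean mixing step), the token ids, the layer index, the coefficient, and every function name are assumptions made for illustration. The choices to pad both prompts to the same length and to add the vectors at the first positions of the user prompt are also assumptions of this sketch.

```python
import math
import random

D_MODEL = 8   # toy residual-stream width (assumption)
N_LAYERS = 4  # toy depth (assumption)


def make_toy_model(seed=0):
    """Random weight matrices standing in for transformer blocks."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 0.3) for _ in range(D_MODEL)] for _ in range(D_MODEL)]
        for _ in range(N_LAYERS)
    ]


def embed(token_ids):
    """Hypothetical embedding: a fixed pseudo-random vector per token id."""
    vectors = []
    for tok in token_ids:
        rng = random.Random(1000 + tok)
        vectors.append([rng.gauss(0.0, 1.0) for _ in range(D_MODEL)])
    return vectors


def causal_mean(stream, pos):
    """Average of positions 0..pos, a crude stand-in for causal attention."""
    return [sum(x[i] for x in stream[: pos + 1]) / (pos + 1) for i in range(D_MODEL)]


def forward(model, token_ids, inject_layer=None, steering=None):
    """Run the toy model and cache the stream entering each layer.

    If `steering` is given, its vectors are added to the first positions
    of the stream entering `inject_layer`.
    """
    stream = embed(token_ids)
    cache = []
    for layer, weights in enumerate(model):
        if steering is not None and layer == inject_layer:
            for pos in range(min(len(steering), len(stream))):
                stream[pos] = [s + v for s, v in zip(stream[pos], steering[pos])]
        cache.append([list(x) for x in stream])
        mixed = [causal_mean(stream, pos) for pos in range(len(stream))]
        stream = [
            [xi + math.tanh(sum(w * m for w, m in zip(row, mix)))
             for xi, row in zip(x, weights)]
            for x, mix in zip(stream, mixed)
        ]
    return stream, cache


def steering_vectors(model, plus_ids, minus_ids, layer, coeff):
    """coeff * (activations on the plus prompt - activations on the minus prompt)."""
    if len(plus_ids) != len(minus_ids):
        raise ValueError("pad the two prompts to the same token length")
    _, plus_cache = forward(model, plus_ids)    # forward pass only, no gradients
    _, minus_cache = forward(model, minus_ids)
    return [
        [coeff * (p - m) for p, m in zip(hp, hm)]
        for hp, hm in zip(plus_cache[layer], minus_cache[layer])
    ]


model = make_toy_model()
love_ids, hate_ids = [1, 2], [3, 4]  # made-up token ids for the prompt pair
prompt_ids = [5, 6, 7]               # made-up token ids for the user prompt

vectors = steering_vectors(model, love_ids, hate_ids, layer=2, coeff=5.0)
unsteered, _ = forward(model, prompt_ids)
steered, _ = forward(model, prompt_ids, inject_layer=2, steering=vectors)
```

The weights in `model` are never modified. The only difference between the two runs is the vector added to the stream at one layer, which is why the approach needs no training step. With a real model, the same idea would typically be implemented with a framework's activation hooks rather than a hand-written forward loop.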

Experimental Validation

Turner et al. demonstrate the effectiveness of ActAdd across several LLMs. Notably, they establish:

Effectiveness and Specificity: ActAdd successfully steers GPT-2-XL to alter its textual output on topics such as weddings, anger, and conspiracies, showcasing precision in activation-derived transformations.
Performance Preservation: Through experiments with the ConceptNet dataset, the team illustrates that ActAdd minimally impacts the model's general knowledge outputs, ensuring that functional capabilities remain intact even when steered.
Scalability: The computational overhead introduced by ActAdd is minimal and consistent across models of varying sizes, emphasizing its applicability to larger, more contemporary models.

Theoretical Insights

The findings support the hypothesis that LLMs represent features linearly in activation space, and that meaningful directions are causally involved in text generation. This aligns with and extends prior work in mechanistic interpretability: linear probes indicate that such features are represented, and steering with activation vectors shows that they causally affect outputs.
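
One way to make the linearity claim concrete is a short worked identity. It is an illustration, not a result from the paper, and it assumes that a feature is read off an activation $h$ by a fixed direction $w$ (a hypothetical probe vector), with $v$ a steering vector and $c$ a scalar coefficient:

$$
w^{\top}(h + c\,v) = w^{\top}h + c\,w^{\top}v
$$

Under that assumption, adding $c\,v$ shifts the readout by $c\,w^{\top}v$ regardless of the prompt-specific $h$, so a steering vector aligned with $w$ moves the feature in proportion to $c$.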

Implications and Future Work

ActAdd represents a promising direction for user-level interaction with LLMs, where inference-time steering can supplement or replace finetuning and prompt engineering methods. Its immediate utility lies in settings requiring rapid behavioral adjustments, as well as in scenarios where model interpretability and safety are crucial.

Looking forward, this work may significantly influence AI safety by providing efficient mechanisms for adjusting model behavior moment-to-moment, enhancing alignment without the performance costs associated with conventional finetuning. Future research should explore the application of ActAdd to models engaged in complex reasoning tasks and consider its potential in circumventing superficial alignment solutions.

In conclusion, Activation Addition stands as a robust method for modifying LLM behavior with an elegant and computationally economical approach, meriting continued exploration and development within contemporary AI research.
