- The paper introduces ActivScalar, a method that steers a model by rescaling the signed magnitudes of selected activation vectors, flipping the prediction between competing tokens while leaving unrelated tokens unaffected.
- It utilizes gradient-based optimization to achieve interventions that are effective, faithful, and minimal by targeting specific model components.
- Experimental results with GPT-2 and Pythia-1.4b demonstrate that ActivScalar offers parameter-efficient, interpretable steering for both static and dynamic prompts.
Activation Scaling for Steering and Interpreting LLMs
In "Activation Scaling for Steering and Interpreting LLMs," the authors introduce a method called activation scaling (ActivScalar) to steer the behavior of transformer-based LLMs while maintaining interpretability. This paper addresses a fundamental challenge in mechanistic interpretability: understanding which components of a LLM play influential roles in determining specific outputs and using this understanding for model steering.
Conceptual Framework
The authors propose a three-term objective governing effective steering interventions in LLMs (a sketch of the combined loss follows the list). These interventions should be:
- Effective: Capable of flipping a model's prediction between competing tokens.
- Faithful: Leaving unrelated tokens unaffected.
- Minimal: Sparse, touching only the components necessary for the change.
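To make the three terms concrete, here is a minimal sketch of how such a combined objective could look in PyTorch. The log-odds effectiveness term, the squared-error faithfulness term over unrelated tokens, the L1-style minimality penalty, and the weighting coefficients are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def steering_objective(logits_steered, logits_clean, target_id, competing_id,
                       unrelated_ids, scalars, lambda_faith=1.0, lambda_min=0.1):
    """Illustrative three-term steering loss (assumed form, not the paper's exact one).

    logits_steered: next-token logits of the intervened model, shape (vocab,)
    logits_clean:   next-token logits of the unmodified model, shape (vocab,)
    scalars:        the learnable intervention parameters (e.g. activation scalars)
    """
    # Effectiveness: push the target token above the competing token
    # (negative log-odds between the two candidates).
    effectiveness = -(logits_steered[target_id] - logits_steered[competing_id])

    # Faithfulness: unrelated tokens should keep the logits they had
    # before the intervention.
    faithfulness = torch.mean(
        (logits_steered[unrelated_ids] - logits_clean[unrelated_ids]) ** 2
    )

    # Minimality: encourage sparsity so that only a few sites are touched
    # (for a multiplicative scalar, 1.0 means "no change", so penalise |alpha - 1|).
    minimality = torch.sum(torch.abs(scalars - 1.0))

    return effectiveness + lambda_faith * faithfulness + lambda_min * minimality
```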
Using gradient-based optimization, the intervention parameters are tuned directly against this three-term objective while the model itself is left unchanged. ActivScalar intervenes by scaling the signed magnitude of specific activation vectors, leaving their direction untouched and thereby working with the structure the model already has rather than introducing new directions.
Methodology
The authors detail two types of interventions (contrasted in the sketch after this list):
- Steering Vectors (SteerVec): An additive approach that alters the direction and magnitude of activation vectors.
- Activation Scalars (ActivScalar): A multiplicative approach that only rescales existing activation vectors, making it considerably more parameter-efficient (one scalar per site instead of a full vector).
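The difference between the two parametrizations can be made concrete with PyTorch forward hooks at a chosen activation site. The hook functions, the layer choice, and the initialization below are assumptions for illustration; what matters is that SteerVec learns a full d-dimensional vector per site, whereas ActivScalar learns a single scalar that rescales the activation without changing its direction.

```python
import torch
import torch.nn as nn

d_model = 768  # hidden size of e.g. GPT-2 small (assumption for illustration)

# SteerVec: additive intervention, d_model learnable parameters per site.
steer_vec = nn.Parameter(torch.zeros(d_model))

def steervec_hook(module, inputs, output):
    # Shift the activation; this can change both direction and magnitude.
    return output + steer_vec

# ActivScalar: multiplicative intervention, a single learnable scalar per site.
activ_scalar = nn.Parameter(torch.ones(1))

def activscalar_hook(module, inputs, output):
    # Rescale the signed magnitude of the existing activation vector;
    # its direction is left untouched. (Here applied at every position for
    # simplicity; in practice the site can also be position-specific.)
    return activ_scalar * output

# Attaching either hook to, say, an MLP output of a HuggingFace GPT-2 model
# (illustrative; the layer index and site choice depend on the task):
#   handle = model.transformer.h[5].mlp.register_forward_hook(activscalar_hook)
# The scalar (or vector) is then optimised with gradients against the
# effectiveness/faithfulness/minimality objective while the model stays frozen.
```

Because the multiplicative form reuses the model's own direction, its single parameter per site is also directly readable as a measure of how much that site matters for the flip.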
The authors evaluate these methods on two synthetic tasks, handling conflicts in factual knowledge and indirect object identification, using Pythia-1.4b and GPT-2 as test beds.
Experimental Results
ActivScalar matches SteerVec in effectiveness while being more parameter-efficient and interpretable, and it typically pulls ahead when minimality matters, as shown by effectiveness-faithfulness and effectiveness-minimality trade-off curves. The authors also highlight the importance of choosing the right activation sites within the transformer architecture, such as MLP outputs, to obtain interventions that are meaningful yet concise.
Analyses further suggest that merely rescaling activation magnitudes can redirect a model's prediction without drastically altering its internal pathways, giving finer control over the intervention.
Interpretability and Generalization
A notable contribution of this work is the interpretability that scaling interventions provide out of the box. For ActivScalar, visualizing the learned scalars yields a clear map of which model components matter for the task, highlighting critical layers and token positions (a plotting sketch follows below).
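Assuming the learned scalars have been collected into a layers-by-positions array, one way such a map could be rendered is a simple heatmap; the array below is a random placeholder rather than values from the paper.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder: learned activation scalars per (layer, token position).
# In practice these would come from the trained intervention, one scalar per site.
rng = np.random.default_rng(0)
scalars = 1.0 + 0.05 * rng.standard_normal((12, 10))  # 12 layers, 10 positions
scalars[8, 6] = -0.8  # a hypothetical site whose activation the intervention suppresses

fig, ax = plt.subplots(figsize=(6, 4))
im = ax.imshow(scalars, cmap="coolwarm", vmin=-1.5, vmax=1.5, aspect="auto")
ax.set_xlabel("token position")
ax.set_ylabel("layer")
ax.set_title("Learned activation scalars (1.0 = no change)")
fig.colorbar(im, ax=ax, label="scalar value")
plt.show()
```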
Extending activation scalars to dynamic activation scalars (DynActivScalar) lets the intervention generalize across prompts of varying length, making the method applicable beyond a single fixed prompt; a sketch of one possible parametrization follows below.
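A minimal sketch of one way to realize such a dynamic scalar, assuming it is predicted from the activation vector itself by a small learned map; the class name, the linear parametrization, and the dimensions are illustrative, not the paper's exact design.

```python
import torch
import torch.nn as nn

class DynamicActivationScaler(nn.Module):
    """Illustrative dynamic activation scalar (assumed parametrization).

    Instead of learning one fixed scalar per (layer, position) site, the
    scalar is predicted from the activation vector itself, so the same
    intervention can be applied to prompts of any length.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.to_scalar = nn.Linear(d_model, 1)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, d_model) activations at the chosen site.
        alpha = self.to_scalar(hidden)  # (batch, seq_len, 1): one scalar per token
        return alpha * hidden           # rescale each activation, keep its direction


# Usage: wrap the scaler in a forward hook at the chosen site, exactly as with
# the static ActivScalar, and optimise its parameters with the same objective.
scaler = DynamicActivationScaler(d_model=768)
dummy = torch.randn(1, 13, 768)  # a prompt of arbitrary length
steered = scaler(dummy)
```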
Implications and Future Directions
This research offers practical insight into how transformer models can be both steered and interpreted through activation-level interventions. By balancing interpretability with parameter efficiency, ActivScalar emerges as a promising tool for both deployment-time steering and mechanistic analysis.
The work also speaks to the broader landscape of AI steering and interpretability, suggesting a path for combining mechanistic insight with targeted behavioral adjustments in real-world applications. Future work could extend ActivScalar to larger model architectures and more complex datasets, and examine how it combines with other interpretability tools such as activation patching and direct logit attribution. Rigorous testing on real-world tasks would further strengthen the practical case for the approach.