Spectral Editing of Activations for Large Language Model Alignment (2405.09719v3)

Published 15 May 2024 in cs.CL, cs.AI, and cs.LG

Abstract: LLMs often exhibit undesirable behaviours, such as generating untruthful or biased content. Editing their internal representations has been shown to be effective in mitigating such behaviours on top of the existing alignment methods. We propose a novel inference-time editing method, namely spectral editing of activations (SEA), to project the input representations into directions with maximal covariance with the positive demonstrations (e.g., truthful) while minimising covariance with the negative demonstrations (e.g., hallucinated). We also extend our method to non-linear editing using feature functions. We run extensive experiments on benchmarks concerning truthfulness and bias with six open-source LLMs of different sizes and model families. The results demonstrate the superiority of SEA in effectiveness, generalisation to similar tasks, as well as computation and data efficiency. We also show that SEA editing only has a limited negative impact on other model capabilities.

Summary

The paper introduces Spectral Editing of Activations (SEA), a method that improves LLM truthfulness by steering activations toward positive demonstration directions.
It employs Singular Value Decomposition on covariance matrices of activations to develop editing matrices that mitigate hallucinated and biased outputs.
Experimental results on TruthfulQA and BBQ benchmarks demonstrate SEA’s potential for effective, real-time behavior correction in LLMs.

Inference-Time Activation Editing in LLMs: The Spectral Approach

The paper presents Spectral Editing of Activations (SEA), a novel method for editing LLM activations at inference time. The method aims to mitigate undesirable behaviors such as generating untruthful or biased content. The central strategy is to project the model’s internal representations onto directions that have maximal covariance with positive demonstrations (e.g., truthful responses) while minimizing covariance with negative demonstrations (e.g., hallucinated responses).

Methodology

Spectral Editing of Activations (SEA)

The SEA framework first records the LLM activations during inference for a set of demonstrations. These demonstrations consist of positive and negative examples that delineate the desired and undesired behaviors, respectively. The core idea is to perform Singular Value Decomposition (SVD) on the covariance matrices derived from these activations:

Positive and neutral activations are used to compute the covariance matrix $\Omega^+$.
Negative and neutral activations are used to compute the covariance matrix $\Omega^-$.
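As a minimal sketch of how these two matrices might be estimated, the NumPy snippet below computes an empirical cross-covariance between paired activation matrices. The function name `cross_covariance`, the mean-centring step, the row-per-demonstration layout, and the toy data are illustrative assumptions, not details confirmed by the paper.

```python
import numpy as np

def cross_covariance(neutral, demos):
    """Empirical cross-covariance of paired activations.

    neutral, demos: arrays of shape (n, d), one row per demonstration.
    Returns a (d, d) matrix. Mean-centring is an assumption of this sketch.
    """
    n = neutral.shape[0]
    neutral_c = neutral - neutral.mean(axis=0)
    demos_c = demos - demos.mean(axis=0)
    return neutral_c.T @ demos_c / n

# Toy data standing in for recorded activations (hypothetical values).
rng = np.random.default_rng(0)
h_neutral = rng.normal(size=(64, 8))
h_pos = h_neutral + 0.1 * rng.normal(size=(64, 8))
h_neg = rng.normal(size=(64, 8))

omega_pos = cross_covariance(h_neutral, h_pos)  # plays the role of Omega^+
omega_neg = cross_covariance(h_neutral, h_neg)  # plays the role of Omega^-
```

In this layout each matrix is square with side equal to the hidden size, so its cost depends on the activation dimension rather than on the number of demonstrations.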

SVD is then applied to these covariance matrices to extract the editing projections. The authors build editing matrices from the singular vectors: directions with maximal covariance with the positive demonstrations are retained, while directions with maximal covariance with the negative demonstrations are suppressed. For non-linear editing, the method applies an invertible non-linear feature function to map the activations into a richer space, performs the edit there, and then maps the edited activations back into the original space.
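The following is a rough sketch of this step, not the authors' implementation. It assumes the projections are built from left singular vectors, that a fixed count `k` of directions is kept on the positive side and dropped on the negative side, and that the two edited activations are combined by summation. The names `editing_matrices`, `edit_linear`, and `edit_nonlinear`, and the choice of `arcsinh`/`sinh` as an invertible feature function, are hypothetical.

```python
import numpy as np

def editing_matrices(omega_pos, omega_neg, k):
    """Build projection matrices from the SVD of each covariance matrix."""
    u_pos, _, _ = np.linalg.svd(omega_pos)  # singular values come sorted descending
    u_neg, _, _ = np.linalg.svd(omega_neg)
    keep_pos = u_pos[:, :k]  # top-k directions on the positive side
    keep_neg = u_neg[:, k:]  # all but the top-k directions on the negative side
    return keep_pos @ keep_pos.T, keep_neg @ keep_neg.T

def edit_linear(h, p_pos, p_neg):
    """Apply both projections to row-vector activations and sum the results."""
    return h @ p_pos + h @ p_neg

def edit_nonlinear(h, p_pos, p_neg, phi=np.arcsinh, phi_inv=np.sinh):
    """Edit in feature space, then map back with the inverse feature function.

    For a faithful non-linear edit, p_pos and p_neg would be estimated from
    phi-transformed activations; phi here is only an illustrative choice.
    """
    return phi_inv(edit_linear(phi(h), p_pos, p_neg))

# Toy usage with random stand-ins for the covariance matrices and activations.
rng = np.random.default_rng(0)
omega_pos = rng.normal(size=(8, 8))
omega_neg = rng.normal(size=(8, 8))
p_pos, p_neg = editing_matrices(omega_pos, omega_neg, k=3)
h = rng.normal(size=(4, 8))
h_edited = edit_linear(h, p_pos, p_neg)
```

Because both editing matrices are fixed once the demonstrations have been processed, the edit at inference time reduces to matrix multiplications on the activations.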

Implementation and Experimentation

Two primary attributes, truthfulness and fairness, are the focal points for demonstrating SEA’s efficacy. Extensive experiments were conducted using datasets like TruthfulQA and BBQ to evaluate the improvements in model outputs post-editing. A key observation is that SEA enhances performance on these benchmarks with both linear and non-linear editing approaches, demonstrating significant gains in reducing inaccuracies and biases.

Results

The SEA method was compared against several baselines, including In-Context Learning (ICL), LoRA Fine-tuning (LoRA-FT), ITI, DoLA, CD, and ICD. The results reveal the superiority of SEA:

Truthfulness: SEA applied to the 7B LLaMA-2-chat model improves the TruthfulQA MC1 score from 36.96 to 39.41 (a gain of 2.45 points) with minimal impact on inference time.
Bias: Non-linear SEA significantly enhances BBQ accuracy, reducing unknown-answer rates and stereotypical response rates.

Additionally, the ablation study demonstrates that positive and negative editing projections contribute complementary information, and the feature normalization technique is crucial for maintaining coherence in the edited activations.
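The summary does not state the exact form of the normalization. As one plausible illustration only, the sketch below rescales each edited activation vector so that its L2 norm matches that of the original vector; the function name `renormalise` and the `eps` guard are assumptions.

```python
import numpy as np

def renormalise(original, edited, eps=1e-8):
    """Rescale each edited row to the L2 norm of the matching original row."""
    orig_norm = np.linalg.norm(original, axis=-1, keepdims=True)
    edit_norm = np.linalg.norm(edited, axis=-1, keepdims=True)
    return edited * orig_norm / (edit_norm + eps)  # eps avoids division by zero

# Toy check: a shrunken copy is scaled back to the original magnitude.
rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))
h_shrunk = 0.3 * h
h_restored = renormalise(h, h_shrunk)
```

A rescaling of this kind keeps the magnitude of the edited activations close to what downstream layers saw before the edit, which is one way to read the coherence claim above.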

Analysis and Implications

The spectral analysis uncovers that the top layers of LLMs contain critical information related to truthfulness, making them the optimal targets for editing. The generalization of the proposed method to various LLMs, such as LLaMA-2, Gemma, and Mistral, substantiates its robustness and versatility. However, the decline in performance on control tasks like commonsense reasoning and mathematical tasks underlines the need for further refinement in non-linear editing to avoid fidelity loss.

Practical and Theoretical Insights

Practically, SEA provides a lightweight yet effective method to improve LLM behaviors in real-time without the necessity of full model retraining. This inference-time intervention can be integrated into existing AI systems to enhance output reliability dynamically. Theoretically, the approach sheds light on the internal structure of LLMs, emphasizing the roles of specific layers and activation patterns in generating biased or hallucinated content and proposing targeted interventions to correct such behaviors.
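To illustrate why such an intervention can be lightweight, the toy sketch below wraps a layer function so that its output is multiplied by a fixed, precomputed editing matrix. Everything here is hypothetical: `with_activation_edit`, the stand-in `toy_layer`, and the identity placeholder for the editing matrix are not taken from the paper or from any real model API.

```python
import numpy as np

def with_activation_edit(layer_fn, edit_matrix):
    """Wrap a layer so its output is multiplied by a fixed editing matrix."""
    def edited_layer(x):
        return layer_fn(x) @ edit_matrix
    return edited_layer

# Hypothetical stand-in for one transformer layer.
rng = np.random.default_rng(0)
weights = rng.normal(size=(8, 8))

def toy_layer(x):
    return np.tanh(x @ weights)

# Placeholder editing matrix; a real one would come from the SVD step.
edit_matrix = np.eye(8)
edited_layer = with_activation_edit(toy_layer, edit_matrix)
out = edited_layer(rng.normal(size=(2, 8)))
```

The model weights are untouched in this pattern; the only per-token overhead is one extra matrix multiplication at each edited layer.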

Future Directions

Future research could explore more sophisticated non-linear transformations and their invertibility to preserve more detailed characteristics of the edited activations. Investigating the application of SEA in other domains, such as sentiment analysis or conversational AI, could extend its efficacy and utility. Additionally, integrating SEA with reinforcement learning frameworks to dynamically adapt and optimize the editing process based on feedback could offer further enhancements.

In conclusion, the proposed spectral editing methodology provides a structured, efficient, and generalizable way to steer LLMs towards more accurate and unbiased content generation, marking a significant advancement in the domain of AI behavior correction at inference time.