- The paper introduces Affine Concept Editing (ACE) to control and standardize refusal behaviors in large language models.
- It demonstrates that incorporating a bias term via affine functions yields more predictable steering than purely linear methods.
- ACE shows cross-model effectiveness, yielding coherent refusal responses in models like Llama 3 70B and RWKV v5.
Refusal in LLMs is an Affine Function
The paper "Refusal in LLMs is an Affine Function" introduces Affine Concept Editing (ACE) as a novel approach to modify LLM behaviors, specifically focusing on refusal behaviors in certain contexts. The methodology developed is based on the hypothesis that concepts in neural networks can be represented as linear or affine functions within the network's activation space. Building on existing techniques such as directional ablation and activation additions, the authors present ACE as a more generalized and potentially accurate method for steering model behavior.
Overview
The authors begin by examining previous methods that manipulate LLM behavior by intervening on the models' activation vectors. A novel contribution of this paper is its critique of the linear representation hypothesis, which the authors argue is limited by its implicit assumption that concept representations default to a zero origin. They argue instead for an affine perspective, which admits a constant (bias) term and thereby addresses this shortcoming of linear-only models; the toy example below illustrates the difference.
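The distinction is the familiar one between f(x) = Wx and f(x) = Wx + b: a linear map pins the concept's neutral point to the origin, while an affine map can place it anywhere. A toy numpy example, where the reference point x_ref is a hypothetical stand-in for a learned non-zero default:

```python
import numpy as np

r_hat = np.array([1.0, 0.0])      # unit concept direction
x = np.array([3.0, 2.0])          # an activation to interpret
x_ref = np.array([5.0, -1.0])     # hypothetical non-zero default point

linear_coord = np.dot(x, r_hat)          # measured from the origin:  3.0
affine_coord = np.dot(x - x_ref, r_hat)  # measured from x_ref:      -2.0
```

The same activation reads as strongly expressing the concept under the linear reading but as suppressing it under the affine one; this is precisely the ambiguity a bias term resolves.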
In this paper, ACE is derived and then used to control refusal behavior across a variety of architectures, including Llama 3 70B and RWKV v5. By combining affine subspace projection with activation addition, ACE offers more deterministic control over refusal across diverse prompt types and, the authors report, generalizes refusal behavior better than existing methods. A hedged sketch of such an intervention follows.
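As one plausible reading of that combination, the concept coordinate of an activation could be replaced by an interpolation between the projections of two reference means. The sketch below follows that reading; the names mu_refuse and mu_comply and the exact interpolation form are illustrative assumptions, not the paper's verbatim formulation:

```python
import numpy as np

def ace_edit(x: np.ndarray, r_hat: np.ndarray,
             mu_refuse: np.ndarray, mu_comply: np.ndarray,
             t: float) -> np.ndarray:
    """Illustrative affine concept edit on a single activation x.

    Step 1 (affine subspace projection): remove x's component along
    the unit concept direction r_hat, as in directional ablation.
    Step 2 (activation addition): add back an interpolation between
    the projected reference means, so t = 0 targets compliance,
    t = 1 targets refusal, and values outside [0, 1] extrapolate.
    """
    proj = lambda v: np.dot(v, r_hat) * r_hat   # projection onto span(r_hat)
    target = (1.0 - t) * proj(mu_comply) + t * proj(mu_refuse)
    return x - proj(x) + target
```

Because the edited coordinate is measured against non-zero reference points rather than the origin, the edit is affine rather than linear, matching the paper's framing.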
Key Findings
- Affine Decomposition: The paper underscores the importance of distinguishing linear from affine representations. Where purely linear methods implicitly fix a model's default activation at the origin, ACE accounts for both a concept direction and a bias term, yielding higher fidelity to the desired steering outcome.
- Standardized Steering: ACE demonstrates a higher degree of behavior standardization than Contrastive Activation Addition (CAA) alone or directional ablation. By framing the intervention within an affine structure, ACE produces more predictable refusal responses.
- Model Generalization: A critical element of the methodology is ACE's cross-model applicability. Where directional ablation tends to produce degenerate outputs on some architectures, such as RWKV v5, ACE maintains coherent generations.
- Threshold Adjustments: The authors observe that the most precise steering often requires interpolation parameters slightly outside the nominal range of zero to one, pointing to a tuning step that is crucial for optimizing behavior control (see the sweep sketched after this list).
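Reusing the illustrative ace_edit sketch above with toy stand-in vectors, such a tuning step could be a simple sweep over the interpolation parameter that deliberately extends past [0, 1]:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                  # toy hidden size
r_hat = np.eye(d)[0]                    # toy unit concept direction
mu_refuse = rng.normal(size=d)          # stand-ins for class mean activations
mu_comply = rng.normal(size=d)
x = rng.normal(size=d)                  # one activation to edit

# Sweep t, including values just outside [0, 1]: per the finding above,
# the best steering point need not coincide with either class mean.
for t in np.linspace(-0.25, 1.25, 7):
    x_edited = ace_edit(x, r_hat, mu_refuse, mu_comply, t)
    print(f"t = {t:+.2f} -> concept coordinate {np.dot(x_edited, r_hat):+.3f}")
```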
Implications and Future Directions
The implications of this research are notable for ethical AI applications, where deterministic refusal behavior can prevent models from producing harmful outputs. ACE's ability to finely tune these behaviors represents a step toward more reliable and predictable AI systems.
This paper opens several avenues for future research. For one, exploring nonlinear modifications could further refine control over LLMs and potentially improve on the results obtained with ACE. Moreover, expanding the range of behaviors influenced by ACE could contribute to a more holistic understanding of behavior modification in LLMs.
In conclusion, Affine Concept Editing represents a meaningful conceptual advance over purely linear manipulations of neural network activations, offering substantial improvements in control use cases such as refusal behavior in LLMs. The approach addresses significant limitations of preceding methods and lays a foundation for future work on more complex, task-oriented model steering techniques.