- The paper demonstrates that targeted edge ablation can reduce toxicity by disabling only 12 out of 11.6K edges in a model's graph.
- It employs a graph-based method with a learned ablation mask to selectively disable undesirable pathways without affecting other tasks.
- The approach outperforms fine-tuning baselines, offering a nuanced framework for mitigating unwanted behaviors in language models.
Circuit Breaking: Removing Model Behaviors with Targeted Ablation
Introduction
Recent advances in language models (LMs) have highlighted a critical challenge: undesirable behaviors that persist after training and cannot easily be removed through traditional fine-tuning. The paper by Li, Davies, and Nadeau addresses this issue by introducing a novel technique called "Targeted Edge Ablation," which disables specific undesirable behaviors by ablating a small number of causal pathways in the model's computational graph.
Methodology
The core of their approach is to identify the specific edges in the model's computational graph that are responsible for an undesirable behavior, and then to disable them.
- Circuit Analysis and Model Representation: The first step involves representing the model as a computational graph where nodes correspond to computational units (e.g., attention heads in transformers), and edges represent dependencies between these units.
- Targeted Edge Ablation: The process is threefold:
- Graph Granularity Selection: Deciding the granularity level for representing the model’s computation, affecting the specificity and efficacy of ablation.
- Ablation Mask Learning: A binary mask is learned over the graph edges, optimized to disable undesirable behaviors while causing minimal collateral damage to model performance on other tasks.
- Inference Time Ablation: Applying the learned ablation mask during model inference to ensure the undesirable behavior is mitigated.
This technique supports both zero ablation (replacing an edge's contribution with zeros) and mean ablation (replacing it with its average value over a reference dataset), with an emphasis on preserving the model's original performance on unrelated tasks. A minimal sketch of the mask-learning setup is shown below.
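To make the mask-learning step concrete, here is a minimal PyTorch sketch, not the paper's implementation: it assumes a toy block in which each upstream component's contribution to a downstream component is an explicit edge, relaxes the binary mask with a sigmoid over learnable logits, and uses mean ablation for disabled edges. All names (`EdgeMaskedBlock`, `mask_objective`, `sparsity_weight`) and the exact objective are illustrative assumptions.

```python
# Minimal sketch of learned edge ablation (not the paper's code). Each
# upstream -> downstream connection is an explicit "edge" with a learnable
# mask logit; a disabled edge contributes its dataset-mean activation
# (mean ablation) instead of its actual output.
import torch
import torch.nn as nn
import torch.nn.functional as F


class EdgeMaskedBlock(nn.Module):
    def __init__(self, d_model: int, n_upstream: int = 2):
        super().__init__()
        self.upstream = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(n_upstream)]
        )
        self.downstream = nn.Linear(d_model, d_model)
        # One logit per edge; sigmoid(logit) near 1 keeps the edge, near 0 ablates it.
        self.edge_logits = nn.Parameter(torch.full((n_upstream,), 4.0))
        # Mean activations used for mean ablation (filled from a reference dataset).
        self.register_buffer("mean_acts", torch.zeros(n_upstream, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gates = torch.sigmoid(self.edge_logits)  # soft relaxation of the binary mask
        total = torch.zeros_like(x)
        for i, component in enumerate(self.upstream):
            act = component(x)
            # Interpolate between the real activation and its dataset mean.
            total = total + gates[i] * act + (1 - gates[i]) * self.mean_acts[i]
        return self.downstream(F.relu(total))


def mask_objective(model, bad_x, bad_y, clean_x, clean_y, sparsity_weight=1e-2):
    """Increase loss on the undesired behavior, keep clean loss low,
    and ablate as few edges as possible (illustrative objective)."""
    bad_loss = F.mse_loss(model(bad_x), bad_y)
    clean_loss = F.mse_loss(model(clean_x), clean_y)
    gates = torch.sigmoid(model.edge_logits)
    n_ablated = (1 - gates).sum()  # soft count of disabled edges
    return -bad_loss + clean_loss + sparsity_weight * n_ablated


if __name__ == "__main__":
    torch.manual_seed(0)
    d = 16
    model = EdgeMaskedBlock(d)
    # Freeze all weights: only the ablation mask is learned.
    for p in model.parameters():
        p.requires_grad_(False)
    model.edge_logits.requires_grad_(True)
    opt = torch.optim.Adam([model.edge_logits], lr=0.1)

    bad_x, bad_y = torch.randn(32, d), torch.randn(32, d)      # "bad behavior" data
    clean_x, clean_y = torch.randn(32, d), torch.randn(32, d)  # behavior to preserve
    for _ in range(100):
        opt.zero_grad()
        mask_objective(model, bad_x, bad_y, clean_x, clean_y).backward()
        opt.step()
    print("edge gates:", torch.sigmoid(model.edge_logits).detach())
```

At inference time, the learned gates would be thresholded to a hard binary mask, matching the paper's description of applying the ablation mask during inference.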
Experiments and Results
The authors apply their approach to reducing toxicity in GPT-2's outputs, demonstrating the method's practical efficacy. By selectively ablating only 12 of the roughly 11.6K edges in GPT-2's computational graph, they significantly reduce the model's propensity to generate toxic content while causing minimal degradation on non-toxic generation tasks.
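As a rough illustration of what an inference-time intervention on GPT-2 can look like in practice, the sketch below uses TransformerLens hooks to mean-ablate a single attention head's output. This is a simplification of the paper's setup: the method ablates edges between components rather than whole heads, and the layer, head index, and reference text here are arbitrary placeholders, not values from the paper.

```python
# Illustrative inference-time intervention on GPT-2 using TransformerLens
# hooks (a simplification of edge ablation: here one attention head's output
# is replaced with its mean, i.e. head-level mean ablation).
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
LAYER, HEAD = 10, 7  # hypothetical target, not taken from the paper
hook_name = f"blocks.{LAYER}.attn.hook_z"

# Mean activation of the target head over a small reference batch.
ref_tokens = model.to_tokens(["The weather today is mild and pleasant."])
_, cache = model.run_with_cache(ref_tokens)
mean_head = cache[hook_name][:, :, HEAD, :].mean(dim=(0, 1)).detach()  # (d_head,)

def mean_ablate_head(z, hook):
    # z has shape (batch, seq, n_heads, d_head); overwrite one head's output.
    z[:, :, HEAD, :] = mean_head
    return z

prompt_tokens = model.to_tokens("You are such a")
with torch.no_grad():
    ablated_logits = model.run_with_hooks(
        prompt_tokens, fwd_hooks=[(hook_name, mean_ablate_head)]
    )
print(ablated_logits.shape)  # (batch, seq, vocab) computed with the ablation
```

A full reproduction of the paper's method would instead intervene on individual edges, i.e., on one component's contribution to another component's input, and would apply the learned 12-edge mask rather than a hand-picked head.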
Discussion
The introduction of targeted edge ablation presents a compelling alternative to conventional LM behavior modification techniques like fine-tuning. By maintaining the model's learned knowledge structure and only modifying the causal pathways that lead to undesirable outputs, this method offers a more nuanced approach to editing LMs.
- Conceptual Advantages: Restricting edits to a small set of edges limits the expressivity of the solution, which helps avoid overfitting to the specific examples of undesired behavior used for training, and it preserves the model's structure and mechanistic interpretability.
- Practical Implications: Beyond addressing toxicity in LLMs, the framework has broader applicability in removing various types of undesirable behaviors from complex models without significant retraining.
Future Directions
While the results are promising, the approach opens several avenues for future research, such as exploring different graph granularities, improving the efficiency of edge mask learning, and extending the methodology to other model architectures beyond LMs.
Conclusion
The paper introduces a pioneering method for mitigating undesirable behaviors in LLMs through targeted edge ablation. Its ability to edit model behavior at a granular level, without compromising overall model utility, marks a significant step toward more reliable and socially responsible AI systems.