
Improving Alignment and Robustness with Circuit Breakers (2406.04313v4)

Published 6 Jun 2024 in cs.LG, cs.AI, cs.CL, cs.CV, and cs.CY

Abstract: AI systems can take harmful actions and are highly vulnerable to adversarial attacks. We present an approach, inspired by recent advances in representation engineering, that interrupts the models as they respond with harmful outputs with "circuit breakers." Existing techniques aimed at improving alignment, such as refusal training, are often bypassed. Techniques such as adversarial training try to plug these holes by countering specific attacks. As an alternative to refusal training and adversarial training, circuit-breaking directly controls the representations that are responsible for harmful outputs in the first place. Our technique can be applied to both text-only and multimodal LLMs to prevent the generation of harmful outputs without sacrificing utility -- even in the presence of powerful unseen attacks. Notably, while adversarial robustness in standalone image recognition remains an open challenge, circuit breakers allow the larger multimodal system to reliably withstand image "hijacks" that aim to produce harmful content. Finally, we extend our approach to AI agents, demonstrating considerable reductions in the rate of harmful actions when they are under attack. Our approach represents a significant step forward in the development of reliable safeguards to harmful behavior and adversarial attacks.

Citations (24)

Summary

  • The paper introduces a novel short-circuiting method that remaps harmful representations to improve AI alignment and robustness.
  • The methodology employs a rerouting loss on designated harmful representations to significantly reduce adversarial attack success rates in LLMs and multimodal models.
  • Experiments on models like Mistral-7B and Llama-3-8B demonstrate marked performance gains under extreme adversarial scenarios, validating the approach’s practical impact.

Improving Alignment and Robustness with Short Circuiting

The paper "Improving Alignment and Robustness with Short Circuiting" introduces an innovative method termed "short-circuiting," aiming to enhance the alignment and robustness of AI models, particularly in LLMs and multimodal systems. Developed by Andy Zou, Long Phan, Justin Wang, Derek Duenas, Maxwell Lin, Maksym Andriushchenko, Rowan Wang, Zico Kolter, Matt Fredrikson, and Dan Hendrycks, the approach addresses inherent vulnerabilities in AI systems to adversarial attacks and seeks to prevent the generation of harmful outputs without compromising the models' utility.

Short-Circuiting Concept

The core idea of short-circuiting derives from representation engineering. Traditional alignment methods such as refusal training and adversarial training are often bypassed by sophisticated attacks, raising doubts about whether AI systems can be deployed to a high standard of safety and reliability. Instead of countering specific attacks, short-circuiting directly controls the internal representations that give rise to harmful outputs.

Methodology

Short-circuiting remaps the representations behind harmful processes so the model cannot complete undesirable outputs. The training procedure distinguishes a "Short Circuit Set," containing harmful representations to be disrupted, from a "Retain Set" of benign representations to be preserved. A rerouting loss pushes the harmful representations toward a direction orthogonal to their originals, while the Retain Set keeps benign representations close to those of the unmodified model, interrupting harmful behaviors even under strong adversarial pressure; a sketch of the combined objective follows.
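
The PyTorch-style sketch below is a minimal illustration of this objective, assuming a Hugging Face-style model interface. The helper `get_hidden_states`, the batch format, and the loss coefficients are hypothetical conveniences for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def get_hidden_states(model, batch, layer_ids):
    # Hypothetical helper: forward pass returning selected layers' hidden
    # states, stacked (assumes an HF-style `output_hidden_states` flag).
    out = model(**batch, output_hidden_states=True)
    return torch.stack([out.hidden_states[i] for i in layer_ids])

def short_circuit_loss(model, frozen_model, cb_batch, retain_batch,
                       layer_ids, c_cb=1.0, c_retain=1.0):
    """Rerouting loss on the Short Circuit Set plus a retain loss on the
    Retain Set. `frozen_model` is an unmodified copy of the original model
    supplying reference representations; coefficients are illustrative."""
    h_cb = get_hidden_states(model, cb_batch, layer_ids)
    h_retain = get_hidden_states(model, retain_batch, layer_ids)
    with torch.no_grad():
        h_cb_ref = get_hidden_states(frozen_model, cb_batch, layer_ids)
        h_retain_ref = get_hidden_states(frozen_model, retain_batch, layer_ids)

    # Rerouting: penalize positive cosine similarity between updated and
    # original harmful representations; the term vanishes once they become
    # orthogonal (the "short circuit").
    reroute = F.relu(F.cosine_similarity(h_cb, h_cb_ref, dim=-1)).mean()
    # Retain: keep benign representations close to the original model's,
    # preserving general capability.
    retain = (h_retain - h_retain_ref).norm(dim=-1).mean()
    return c_cb * reroute + c_retain * retain
```

The two coefficients trade off robustness against capability retention: weighting the rerouting term more aggressively disrupts harmful behavior at greater risk to benign performance.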

Experimental Results

The authors validate the efficacy of short-circuiting through extensive experiments on LLMs and multimodal models, using Mistral-7B-Instruct-v2 and Llama-3-8B-Instruct as test models. After applying short-circuiting, both models showed significant improvements in alignment and robustness across a variety of adversarial attack scenarios.

LLMs

For LLMs, the research demonstrated that short-circuiting markedly lowers attack success rates (ASR) across unseen attacks while maintaining the model's capabilities. For instance, the short-circuited Mistral-7B model achieved an average ASR of 7.0% compared to much higher rates for refusal-trained and adversarial-trained baselines. Similarly, the Llama-3-8B model, when equipped with short-circuiting, showed robust performance with minimal reductions in benchmarks such as MT-Bench and OpenLLM scores.
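
Here ASR is the standard metric: the fraction of attack attempts that elicit the targeted harmful output, i.e. ASR = (successful attacks) / (total attack attempts), so lower is better.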

Multimodal Models

In multimodal settings, short-circuiting was tested with the LLaVA-NeXT-Mistral-7B model. The results showed a notable enhancement in robustness to adversarial image-text attacks, particularly against the Projected Gradient Descent (PGD) attack. The ASR dropped from 91% to 14.3% under PGD attack while maintaining performance on key evaluations.
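
For context, a PGD image "hijack" iteratively ascends the model's loss within an L-infinity ball around the clean image. Below is a minimal, generic sketch of that attack loop; `loss_fn` (e.g., the likelihood of a harmful target string under the multimodal model), the bound `eps`, and the step settings are illustrative assumptions rather than the paper's exact attack configuration.

```python
import torch

def pgd_image_attack(loss_fn, image, eps=8/255, step=2/255, iters=100):
    """Generic L-infinity PGD: maximize `loss_fn` within an eps-ball
    around the clean image; all settings here are illustrative."""
    x_adv = image.clone().detach()
    for _ in range(iters):
        x_adv.requires_grad_(True)
        loss = loss_fn(x_adv)  # e.g., likelihood of a harmful target string
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + step * grad.sign()                # gradient ascent step
            x_adv = image + (x_adv - image).clamp(-eps, eps)  # project into eps-ball
            x_adv = x_adv.clamp(0.0, 1.0)                     # keep pixels valid
        x_adv = x_adv.detach()
    return x_adv
```

A short-circuited multimodal model resists this attack not by detecting the perturbed image but because the internal representations the attack tries to steer toward harmful content have been rerouted.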

Practical and Theoretical Implications

The introduction of short-circuiting has substantial implications for the development and deployment of AI systems. Practically, it offers a reliable method for mitigating harmful outputs in AI applications, enhancing trustworthiness in real-world scenarios. Theoretically, this approach highlights the importance of internal representation control over traditional output supervision methods. It establishes a new paradigm where AI models can be intrinsically designed to avoid harmful behaviors, thereby shifting the focus from reactive to proactive defense mechanisms.

Future Developments

This research opens multiple avenues for future exploration. Notably, further refinement of the short-circuiting technique to handle more complex scenarios and sustained real-time applications could be significant. Additionally, expanding the applicability of short-circuiting to other AI paradigms, such as reinforcement learning agents and autonomous systems, is a promising direction. Continued improvement in computational efficiency and ease of integration with existing AI models would also be beneficial.

Conclusion

The paper "Improving Alignment and Robustness with Short Circuiting" presents a promising advancement in the field of AI safety and robustness. Through meticulous experimentation and a novel approach to representation control, the authors have demonstrated how AI systems can be fundamentally enhanced to resist adversarial attacks and prevent harmful outputs, paving the way for more reliable and trusted AI applications in various domains.
