SecAlign: Defending Against Prompt Injection with Preference Optimization

Published 7 Oct 2024 in cs.CR and cs.LG | (2410.05451v3)

Abstract: LLMs are becoming increasingly prevalent in modern software systems, interfacing between the user and the Internet to assist with tasks that require advanced language understanding. To accomplish these tasks, the LLM often uses external data sources such as user documents, web retrieval, results from API calls, etc. This opens up new avenues for attackers to manipulate the LLM via prompt injection. Adversarial prompts can be injected into external data sources to override the system's intended instruction and instead execute a malicious instruction. To mitigate this vulnerability, we propose a new defense called SecAlign based on the technique of preference optimization. Our defense first constructs a preference dataset with prompt-injected inputs, secure outputs (ones that respond to the legitimate instruction), and insecure outputs (ones that respond to the injection). We then perform preference optimization on this dataset to teach the LLM to prefer the secure output over the insecure one. This provides the first known method that reduces the success rates of various prompt injections to <10%, even against attacks much more sophisticated than ones seen during training. This indicates our defense generalizes well against unknown and yet-to-come attacks. Also, SecAlign models are still practical with similar utility to the one before defensive training in our evaluations. Our code is at https://github.com/facebookresearch/SecAlign

Abstract PDF HTML Upgrade to Chat

Summary

The paper proposes StruQ, an innovative alignment method that trains LLMs to favor desirable outputs and reject adversarial injections.
It demonstrates that StruQ reduces GCG-based prompt injection success from 56% to 2% on Mistral-7B, maintaining overall model utility.
Collaboration between UC Berkeley and Meta underpins the work, bridging alignment techniques with AI security for safer large-scale deployments.

Aligning LLMs to Be Robust Against Prompt Injection

The paper "Aligning LLMs to Be Robust Against Prompt Injection" presents an intricate exploration of the vulnerabilities associated with LLMs concerning prompt injection attacks and introduces an advanced method, termed StruQ, aiming to fortify LLMs against such adversarial inputs. The research is conducted by Sizhe Chen et al., and involves collaboration between UC Berkeley and Meta, FAIR.

Overview

LLMs are increasingly integrated into comprehensive software systems, acting as interfaces that facilitate complex tasks involving user data, the internet, and external APIs. While these systems benefit from enhanced capabilities, they are also susceptible to prompt injection attacks, where adversarial prompts within the input data can manipulate the model to execute unintended instructions.

Prompt injection poses a significant security threat, prompting the necessity for effective countermeasures. This paper proposes leveraging LLM alignment as a robust defense mechanism. StruQ employs an alignment strategy involving the creation of a preference dataset comprising simulations of prompt injections and corresponding desirable and undesirable responses. By utilizing existing alignment techniques, StruQ fine-tunes LLMs to withstand these attacks while maintaining operational efficiency.

Key Methodology

The core innovation in StruQ lies in its use of alignment training, a method typically employed to align LLM outputs with human preferences. StruQ adapts this approach to mitigate prompt injections by treating the problem as a preference optimization issue. The authors build a preference dataset using:

Desirable Outputs: Responses to the original, benign instruction.
Undesirable Outputs: Responses to an injected instruction crafted from the alignment dataset.

These components allow the alignment process to optimize the model's response, increasing its robustness against unseen attacks, including complex optimization-based ones like Greedy Coordinate Gradient (GCG).

Experimental Insights

Extensive experiments demonstrate StruQ's efficacy across multiple state-of-the-art LLM architectures, such as Llama-7B, Mistral-7B, and Llama3-8B. Results indicate that StruQ significantly reduces the success rate of prompt injections, particularly against strong attacks, without compromising the model's utility.

For example, StruQ reduces the success rate of GCG-based prompt injections from 56% to a mere 2% in Mistral-7B. The alignment strategy also excels against optimization-free attacks, achieving a 0% success rate in the majority of cases. Utility tests, leveraging AlpacaEval2, confirm that the StruQ methodology preserves model effectiveness when compared to the baseline model performance.

Implications and Future Directions

The implications of this research are substantial for both practical and theoretical advancements in AI security. Practically, StruQ provides a scalable and efficient solution to the widespread issue of prompt injections, enabling safer deployment of LLM-integrated applications. Theoretically, this work bridges the gap between LLM alignment and security, opening up novel pathways for future research.

Potential areas of exploration include refining StruQ's adaptability to other forms of attacks, such as multi-modal prompt injections, and exploring alternate alignment strategies that might further enhance robustness without additional computational costs.

Conclusion

This study represents a significant stride in the domain of AI security, offering both a practical tool for immediate deployment and a framework for ongoing research. By aligning LLMs to human preferences, StruQ emerges as a formidable defense against prompt injections, reflecting a nuanced application of alignment principles to address contemporary security challenges in AI systems.

Markdown