A Comprehensive Survey of Attack Techniques, Implementation, and Mitigation Strategies in LLMs
Introduction
The ubiquity and complexity of LLMs have made them a central target for various cyber-attacks. As these models become integrated into a growing number of applications, ensuring their security is paramount. The exploration of both direct attacks on the models and their applications reveals a nuanced cybersecurity landscape. This paper meticulously reviews over 100 pieces of research, exploring different attack vectors, their implementation strategies, and the current state of mitigation techniques. It highlights the ongoing battle between evolving attacker methodologies and the development of robust defenses.
Types of Attacks and Mitigations
Attacks on LLM Applications
The paper categorizes attacks on LLM applications into two primary types: direct and indirect prompt injection attacks. Direct Prompt Injection attacks fool LLMs into generating outputs that contravene their training and intended functionality. This category features attacks like Jailbreak Prompts, Prefix Injection, and Obfuscation. Indirect Prompt Injection attacks manipulate LLM-integrated applications to achieve malicious ends without direct interaction with the LLM itself—an example being URL manipulation to conduct phishing attacks.
Mitigation strategies against these attacks emphasize the need for advanced safety training, data anonymization, strict input-output filtering, and the development of auxiliary safety models. Reinforcement Learning from Human Feedback (RLHF) is cited as a significant method in enhancing model alignment with human values, crucial for negating prompt injection attacks.
Attacks on LLMs Themselves
The paper details four significant attacks targeting LLMs directly: Model Theft, Data Reconstruction, Data Poisoning, and Model Hijacking. Model Theft, a threat to the confidentiality of ML models, involves creating a copy of the model's architecture and parameters. Techniques like Proof of Work (PoW) challenges are presented as potential defenses, aiming to increase the resource costs for attackers. Data Reconstruction and Poisoning exemplify privacy and integrity threats, offering attackers avenues to access or corrupt training data. Here, mitigation can include data sanitization, deduplication, the application of Differential Privacy during the training phase, and robust filtering mechanisms for outputs.
Data Poisoning and Model Hijacking particularly highlight the vulnerability of LLLMs during the training process, where malicious data insertion or manipulation can fundamentally alter a model's behavior. Defensive strategies here are more exploratory, with suggestions around employing "friendly noise" to counteract adversarial perturbations and exploring regularized training methods to enhance resistance against injected poisons.
Future Directions and Conclusions
The paper advocates for a framework to assess the resilience of LLM-integrated applications against both direct and indirect attacks. It also suggests further exploration into the feasibility of novel attack vectors on system messages within LLM-integrated virtual assistants. Significantly, it underscores the importance of viewing cybersecurity in LLMs not as a static goal but as a continuously evolving challenge that requires proactive and innovative defense mechanisms.
In summary, the collective findings and analyses present a detailed examination of the cybersecurity threats facing LLMs and delineate a pathway towards developing more resilient systems. This ongoing research area is crucial for the secure advancement of LLM technologies and their applications across various sectors.