Vocabulary Attack to Hijack Large Language Model Applications (2404.02637v2)
Abstract: The rapid advancement of LLMs is driving an ever-increasing number of applications. Alongside the growing number of users, we also see a growing number of attackers who try to outsmart these systems. They want the model to reveal confidential information, produce specific false information, or exhibit offensive behavior. To this end, they manipulate their instructions to the LLM by inserting separators or rephrasing them systematically until they reach their goal. Our approach is different: it inserts words from the model vocabulary. We find these words using an optimization procedure and embeddings from another LLM (the attacker LLM). We demonstrate our approach by goal hijacking two popular open-source LLMs, one from the Llama2 family and one from the Flan-T5 family. We present two main findings. First, our approach creates inconspicuous instructions and is therefore hard to detect; in many attack cases, even a single word insertion is sufficient. Second, we demonstrate that we can conduct the attack using a model different from the target model.
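The abstract describes finding insertion words via an optimization procedure guided by embeddings from an attacker LLM. Below is a minimal sketch of what such a greedy vocabulary-insertion search could look like, under the assumption of a simple hill-climbing loop; the paper's exact optimization procedure is not specified here, and `query_target_model` and `embed` are hypothetical placeholders for a call to the target model and for an embedding computed with the attacker LLM.

```python
# Hypothetical sketch of a greedy vocabulary-insertion (goal-hijacking) search.
# The stubs below stand in for the target LLM and the attacker LLM's embeddings;
# they are illustrative placeholders, not the paper's actual implementation.

import numpy as np


def query_target_model(prompt: str) -> str:
    """Placeholder: query the target LLM (e.g. a Llama2 or Flan-T5 model)."""
    raise NotImplementedError


def embed(text: str) -> np.ndarray:
    """Placeholder: text embedding obtained from the attacker LLM."""
    raise NotImplementedError


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))


def vocabulary_attack(instruction: str,
                      target_output: str,
                      vocabulary: list[str],
                      max_insertions: int = 3) -> str:
    """Greedily insert vocabulary words into `instruction` so that the target
    model's answer moves toward `target_output` (the hijacking goal)."""
    prompt = instruction
    goal_emb = embed(target_output)
    best_score = cosine(embed(query_target_model(prompt)), goal_emb)

    for _ in range(max_insertions):
        best_candidate, best_candidate_score = None, best_score
        words = prompt.split()
        # Try every vocabulary word at every insertion position.
        for token in vocabulary:
            for pos in range(len(words) + 1):
                candidate = " ".join(words[:pos] + [token] + words[pos:])
                score = cosine(embed(query_target_model(candidate)), goal_emb)
                if score > best_candidate_score:
                    best_candidate, best_candidate_score = candidate, score
        if best_candidate is None:  # no single insertion improves the score
            break
        prompt, best_score = best_candidate, best_candidate_score

    return prompt
```

In this sketch, `vocabulary` could be drawn from the attacker LLM's token vocabulary, and the target model is treated as a black box that is only queried for outputs; as the abstract notes, even a single inserted word can be sufficient, which corresponds to the loop terminating after one round.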