Sandwich attack: Multi-language Mixture Adaptive Attack on LLMs (2404.07242v1)
Abstract: LLMs are increasingly being developed and deployed, but their widespread use faces challenges. One challenge is aligning LLMs' responses with human values to prevent harmful outputs, which is addressed through safety training. Even so, malicious actors have succeeded in manipulating LLMs into generating misaligned responses to harmful questions, such as instructions for building a bomb in a school lab, recipes for harmful drugs, and ways to circumvent privacy protections. Another challenge stems from the multilingual capabilities of LLMs, which enable the models to understand and respond in multiple languages. Attackers exploit the unbalanced language distribution in LLMs' pre-training data and the models' comparatively weaker performance in low-resource languages than in high-resource ones: by prompting in a low-resource language, they can intentionally manipulate a model into producing harmful responses. Many similar attack vectors have since been patched by model providers, making LLMs more robust against language-based manipulation. In this paper, we introduce a new black-box attack vector called the *Sandwich attack*: a multi-language mixture attack that manipulates state-of-the-art LLMs into generating harmful and misaligned responses. Our experiments with six models, namely Google's Bard, Gemini Pro, LLaMA-2-70B-Chat, GPT-3.5-Turbo, GPT-4, and Claude-3-OPUS, show that adversaries can use this attack vector to elicit harmful and misaligned responses from these models. By detailing both the mechanism and the impact of the Sandwich attack, this paper aims to guide future research and development toward more secure and resilient LLMs that serve the public good while minimizing the potential for misuse.
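To make the mechanism concrete, the sketch below shows one way a multi-language "sandwich" prompt could be assembled: a target question is placed in the middle of benign questions, each written in a different low-resource language. This is a minimal illustration inferred from the abstract's description, not the paper's verbatim template; the function name `build_sandwich_prompt`, the instruction wording, the choice of languages, and the middle placement of the target question are all assumptions, and only benign placeholders are used.

```python
# Hedged sketch of a Sandwich-attack prompt builder, inferred from the
# abstract's description of a multi-language mixture attack. The template
# wording, language choices, and question placement are assumptions;
# consult the paper for the actual prompt format. Placeholders stand in
# for all question text.

from typing import List, Tuple


def build_sandwich_prompt(
    benign_questions: List[Tuple[str, str]],  # (language name, question text)
    target_question: Tuple[str, str],         # the question being "sandwiched"
) -> str:
    """Interleave a target question among benign questions written in
    different (typically low-resource) languages, so the language mixture
    obscures it from safety behavior tuned on high-resource languages."""
    mid = len(benign_questions) // 2
    # Place the target question in the middle of the benign ones.
    ordered = benign_questions[:mid] + [target_question] + benign_questions[mid:]
    numbered = [f"{i + 1}. ({lang}) {text}" for i, (lang, text) in enumerate(ordered)]
    return (
        "Please answer each of the following questions in the language "
        "it is asked in:\n" + "\n".join(numbered)
    )


# Usage with benign placeholders only -- no harmful content.
prompt = build_sandwich_prompt(
    benign_questions=[
        ("Nepali", "<benign question 1>"),
        ("Swahili", "<benign question 2>"),
        ("Basque", "<benign question 3>"),
        ("Uzbek", "<benign question 4>"),
    ],
    target_question=("Welsh", "<target question>"),
)
print(prompt)
```

The design choice of burying the target question mid-list mirrors the "sandwich" framing: the surrounding benign questions in assorted low-resource languages dilute the signal a safety filter would otherwise key on.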
- Bibek Upadhayay
- Vahid Behzadan