Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts Against Open-source LLMs (2402.14872v2)
Abstract: LLMs, used for creative writing, code generation, and translation, generate text from input sequences but are vulnerable to jailbreak attacks, in which crafted prompts induce harmful outputs. Most jailbreak methods construct prompts by prepending a jailbreak template to the question being asked. However, such prompts typically diverge semantically from the original question far more than the question itself does, so they cannot resist defenses that use a simple semantic-similarity threshold. In this paper, we introduce Semantic Mirror Jailbreak (SMJ), an approach that bypasses LLM safeguards by generating jailbreak prompts that remain semantically similar to the original question. We model the search for prompts that satisfy both semantic similarity and jailbreak validity as a multi-objective optimization problem and employ a standard genetic algorithm to generate eligible prompts. Compared to the baseline AutoDAN-GA, SMJ achieves attack success rates (ASR) up to 35.4% higher without the ONION defense and 85.2% higher with it. SMJ also outperforms the baseline on all three semantic-meaningfulness metrics (Jailbreak Prompt, Similarity, and Outlier), which means SMJ is resistant to defenses that use those metrics as thresholds.
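The abstract frames SMJ as a genetic search over prompts with a two-part objective: semantic similarity to the original question and jailbreak validity. The sketch below is not the authors' implementation; it is a minimal illustration under stated assumptions: a scalarized fitness that weights SBERT cosine similarity (computed as in the sbert.net references below) against a placeholder `jailbreak_score` that stands in for querying the target LLM, plus crude word-swap mutation and single-point crossover in place of the paper's operators.

```python
# Minimal sketch of a multi-objective genetic prompt search (illustrative only).
# Assumptions: all-MiniLM-L6-v2 as the similarity model, a random stand-in for
# jailbreak validity, and simple word-level mutation/crossover operators.
import random
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity(prompt: str, question: str) -> float:
    """Cosine similarity between SBERT embeddings, in [-1, 1]."""
    emb = model.encode([prompt, question], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

def jailbreak_score(prompt: str) -> float:
    """Placeholder: in a real attack this would query the target LLM and
    check whether the response is a refusal. Returns a value in [0, 1]."""
    return random.random()  # stand-in so the sketch runs end to end

def mutate(prompt: str, vocabulary: list[str]) -> str:
    """Replace one random word: a crude stand-in for paraphrase mutation."""
    words = prompt.split()
    words[random.randrange(len(words))] = random.choice(vocabulary)
    return " ".join(words)

def crossover(a: str, b: str) -> str:
    """Single-point crossover on word sequences."""
    wa, wb = a.split(), b.split()
    if min(len(wa), len(wb)) < 2:
        return a
    cut = random.randrange(1, min(len(wa), len(wb)))
    return " ".join(wa[:cut] + wb[cut:])

def fitness(prompt: str, question: str, alpha: float = 0.5) -> float:
    """Scalarized two-objective fitness: similarity vs. jailbreak validity."""
    return alpha * semantic_similarity(prompt, question) \
        + (1 - alpha) * jailbreak_score(prompt)

def smj_search(question: str, seeds: list[str], vocabulary: list[str],
               generations: int = 20, pop_size: int = 8) -> str:
    """Evolve seed prompts toward high combined fitness."""
    population = list(seeds)
    for _ in range(generations):
        ranked = sorted(population, key=lambda p: fitness(p, question),
                        reverse=True)
        parents = ranked[: pop_size // 2]  # elitist selection
        children = [mutate(crossover(random.choice(parents),
                                     random.choice(parents)), vocabulary)
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return max(population, key=lambda p: fitness(p, question))
```

Swapping the random `jailbreak_score` for a real refusal check against the target model, and the word-swap `mutate` for a paraphrase generator, turns this scaffold into the kind of similarity-preserving search the abstract describes; the `alpha` weighting is one simple way to scalarize the two objectives.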
- A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. arXiv preprint arXiv:2302.04023, 2023.
- Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.
- Qlora: Efficient finetuning of quantized llms. arXiv preprint arXiv:2305.14314, 2023.
- Generative language models and automated influence operations: Emerging threats and potential mitigations. arXiv preprint arXiv:2301.04246, 2023.
- Hazell, J. Large language models can be used to effectively scale spear phishing campaigns. arXiv preprint arXiv:2305.06972, 2023.
- Generating transferable 3d adversarial point cloud via random perturbation factorization. In Proceedings of the AAAI Conference on Artificial Intelligence, 2023.
- Catastrophic jailbreak of open-source llms via exploiting generation. arXiv preprint arXiv:2310.06987, 2023.
- Adversarial example generation with syntactically controlled paraphrase networks. arXiv preprint arXiv:1804.06059, 2018.
- Baseline defenses for adversarial attacks against aligned language models. arXiv preprint arXiv:2309.00614, 2023.
- Exploiting programmatic behavior of llms: Dual-use through standard security attacks. arXiv preprint arXiv:2302.05733, 2023.
- Laiyer.ai. Fine-tuned deberta-v3 for prompt injection detection, 2023. URL https://huggingface.co/laiyer/deberta-v3-base-prompt-injection.
- Multi-step jailbreaking privacy attacks on chatgpt. arXiv preprint arXiv:2304.05197, 2023a.
- Privacy-enhancing face obfuscation guided by semantic-aware attribution maps. IEEE Transactions on Information Forensics and Security, 2023b.
- Exploring inconsistent knowledge distillation for object detection with data augmentation. In Proceedings of the 31st ACM International Conference on Multimedia, pp. 768–778, 2023a.
- Efficient adversarial attacks for visual object tracking. In European Conference on Computer Vision, 2020.
- Generate more imperceptible adversarial examples for object detection. In ICML 2021 Workshop on Adversarial Machine Learning, 2021.
- A large-scale multiple-objective method for black-box attack against object detection. In European Conference on Computer Vision, 2022a.
- Imitated detectors: Stealing knowledge of black-box object detectors. In Proceedings of the 30th ACM International Conference on Multimedia, 2022b.
- Parallel rectangle flip attack: A query-based black-box attack against object detection. arXiv preprint arXiv:2201.08970, 2022c.
- Badclip: Dual-embedding guided backdoor attack on multimodal contrastive learning. arXiv preprint arXiv:2311.12075, 2023b.
- Perceptual-sensitive gan for generating adversarial patches. In AAAI, 2019.
- Spatiotemporal attacks for embodied agents. In ECCV, 2020a.
- Bias-based universal adversarial patch attack for automatic check-out. In ECCV, 2020b.
- X-adv: Physical adversarial object attacks against x-ray prohibited item detection. In USENIX Security Symposium, 2023a.
- Exploring the relationship between architectural design and adversarially robust generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4096–4107, 2023b.
- Pre-trained trojan attacks for visual recognition. arXiv preprint arXiv:2312.15172, 2023c.
- Improving adversarial transferability by stable diffusion. arXiv preprint arXiv:2311.11017, 2023d.
- Autodan: Generating stealthy jailbreak prompts on aligned large language models. arXiv preprint arXiv:2310.04451, 2023e.
- Meta. Llama 2 - acceptable use policy - meta ai. https://ai.meta.com/llama/use-policy/, 2024. Accessed: 2024-01-18.
- OpenAI. Introducing chatgpt. https://openai.com/blog/chatgpt, 2024a. Accessed: 2024-01-18.
- OpenAI. Usage policies. https://openai.com/policies/usage-policies, 2024b. Accessed: 2024-01-18.
- OpenAI. Moderation. https://platform.openai.com/docs/guides/moderation/overview, 2024c. Accessed: 2024-01-18.
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- To chatgpt, or not to chatgpt: That is the question! arXiv preprint arXiv:2304.01487, 2023.
- Onion: A simple and effective defense against textual backdoor attacks. arXiv preprint arXiv:2011.10369, 2020.
- Turn the combination lock: Learnable textual backdoor attacks via word substitution. arXiv preprint arXiv:2106.06361, 2021.
- Semantic textual similarity. https://www.sbert.net/docs/usage/semantic_textual_similarity.html#semantic-textual-similarity, 2022a. Accessed: 2024-01-20.
- Pretrained models. https://www.sbert.net/docs/pretrained_models.html, 2022b. Accessed: 2024-01-20.
- "Do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. arXiv preprint arXiv:2308.03825, 2023.
- Improving robust fairness via balance adversarial training. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 15161–15169, 2023.
- Llama 2: Open foundation and fine-tuned chat models, 2023. URL https://arxiv.org/abs/2307.09288.
- Dual attention suppression attack: Generate adversarial camouflage in physical world. In CVPR, 2021.
- Adaptive perturbation generation for multiple backdoors detection. arXiv preprint arXiv:2209.05244, 2022.
- Transferable adversarial attacks for image and video object detection. arXiv preprint arXiv:1811.12641, 2018.
- Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45, Online, October 2020. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/2020.emnlp-demos.6.
- Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts. arXiv preprint arXiv:2309.10253, 2023.
- Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685, 2023.
- Synthetic lies: Understanding ai-generated misinformation and evaluating algorithmic and human solutions. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, pp. 1–20, 2023.
- Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023.
Authors: Xiaoxia Li, Siyuan Liang, Jiyi Zhang, Han Fang, Aishan Liu, Ee-Chien Chang