Plentiful Jailbreaks with String Compositions (2411.01084v3)
Abstract: LLMs remain vulnerable to a slew of adversarial attacks and jailbreaking methods. One common approach employed by white-hat attackers, or red-teamers, is to process model inputs and outputs using string-level obfuscations, which can include leetspeak, rotary ciphers, Base64, ASCII, and more. Our work extends these encoding-based attacks by unifying them in a framework of invertible string transformations. With invertibility, we can devise arbitrary string compositions, defined as sequences of transformations, that we can encode and decode end-to-end programmatically. We devise an automated best-of-n attack that samples from a combinatorially large number of string compositions. Our jailbreaks obtain competitive attack success rates on several leading frontier models when evaluated on HarmBench, highlighting that encoding-based attacks remain a persistent vulnerability even in advanced LLMs.
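To make the core idea concrete, below is a minimal sketch of how invertible string transformations can be composed and sampled in a best-of-n loop. This is not the paper's implementation; the transform set, the `query_model` and `judge` callables, and all names are illustrative assumptions.

```python
# Sketch: composing invertible string transformations and sampling
# compositions for a best-of-n style attack loop. Illustrative only.
import base64
import codecs
import random

# Each transform is an (encode, decode) pair of str -> str functions.
TRANSFORMS = {
    "base64": (
        lambda s: base64.b64encode(s.encode()).decode(),
        lambda s: base64.b64decode(s.encode()).decode(),
    ),
    "rot13": (
        lambda s: codecs.encode(s, "rot13"),
        lambda s: codecs.decode(s, "rot13"),
    ),
    # Leetspeak is invertible only on text without the digits 4, 3, 1, 0.
    "leetspeak": (
        lambda s: s.translate(str.maketrans("aeio", "4310")),
        lambda s: s.translate(str.maketrans("4310", "aeio")),
    ),
    "reverse": (
        lambda s: s[::-1],
        lambda s: s[::-1],
    ),
}

def encode(text, composition):
    """Apply a sequence of transforms left to right."""
    for name in composition:
        enc, _ = TRANSFORMS[name]
        text = enc(text)
    return text

def decode(text, composition):
    """Invert the composition by applying decoders in reverse order."""
    for name in reversed(composition):
        _, dec = TRANSFORMS[name]
        text = dec(text)
    return text

def sample_composition(max_len=3, rng=random):
    """Sample a random composition of transforms (with replacement)."""
    length = rng.randint(1, max_len)
    return [rng.choice(list(TRANSFORMS)) for _ in range(length)]

def best_of_n_attack(prompt, query_model, judge, n=20):
    """Try n random compositions; return the first one the judge flags as a success.
    `query_model` is a placeholder for a target-LLM call that is instructed to
    respond in the same encoding; `judge` is a placeholder success classifier
    (e.g., a HarmBench-style judge). Real responses may not decode cleanly and
    would need error handling."""
    for _ in range(n):
        comp = sample_composition()
        encoded_prompt = encode(prompt, comp)
        response = query_model(encoded_prompt, comp)
        decoded = decode(response, comp)
        if judge(prompt, decoded):
            return comp, decoded
    return None, None
```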
- Does refusal training in LLMs generalize to the past tense? arXiv preprint arXiv:2407.11969, 2024.
- Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks. arXiv preprint arXiv:2404.02151, 2024.
- Many-shot jailbreaking. https://www.anthropic.com/research/many-shot-jailbreaking, 2024. Online; accessed September 13, 2024.
- Boaz Barak. Another jailbreak for GPT4: Talk to it in Morse code, 2023. URL https://x.com/boazbaraktcs/status/1637657623100096513.
- Jailbreaking Black Box Large Language Models in Twenty Queries. arXiv preprint arXiv:2310.08419, 2023.
- HotFlip: White-box adversarial examples for text classification. ACL, 2017.
- Attacking large language models with projected gradient descent. arXiv preprint arXiv:2402.09154, 2024.
- Gradient-based adversarial attacks against text transformers. EMNLP, 2021.
- COLD-Attack: Jailbreaking LLMs with stealthiness and controllability. arXiv preprint arXiv:2402.08679, 2024.
- Endless jailbreaks with bijection learning. arXiv preprint, 2024.
- ArtPrompt: ASCII art-based jailbreak attacks against aligned LLMs. arXiv preprint arXiv:2402.11753, 2024.
- Exploiting programmatic behavior of LLMs: Dual-use through standard security attacks. ICML AdvML-Frontiers Workshop, 2023.
- LLM defenses are not robust to multi-turn human jailbreaks yet. arXiv preprint arXiv:2408.15221, 2024.
- Accelerated Coordinate Gradient, 2024. URL https://blog.haizelabs.com/posts/acg/.
- HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. arXiv preprint arXiv:2402.04249, 2024.
- Tree of Attacks: Jailbreaking Black-Box LLMs Automatically. arXiv preprint arXiv:2312.02119, 2023.
- Great, now write an article about that: The Crescendo multi-turn LLM jailbreak attack. arXiv preprint arXiv:2404.01833, 2024.
- AutoPrompt: Eliciting knowledge from language models with automatically generated prompts. EMNLP, 2020.
- Red-teaming language models with DSPy, 2024. URL https://blog.haizelabs.com/posts/dspy/.
- Pliny the Prompter. L1B3RT45: JAILBREAKS FOR ALL FLAGSHIP AI MODELS, 2024. URL https://github.com/elder-plinius/L1B3RT45.
- Fluent student-teacher redteaming, 2024. URL https://confirmlabs.org/papers/flrt.pdf.
- Jailbroken: How does LLM safety training fail? Advances in Neural Information Processing Systems, 36, 2024.
- Low-Resource Languages Jailbreak GPT-4. NeurIPS Workshop on Socially Responsible Language Modelling Research (SoLaR), 2023.
- GPT-4 is too smart to be safe: Stealthy chat with LLMs via cipher. ICLR, 2024.
- Improved few-shot jailbreaking can circumvent aligned language models and their defenses. arXiv preprint arXiv:2406.01288, 2024.
- AutoDAN: Automatic and interpretable adversarial attacks on large language models. arXiv preprint arXiv:2310.15140, 2023.
- Universal and Transferable Adversarial Attacks on Aligned Language Models. arXiv preprint arXiv:2307.15043, 2023.