
Plentiful Jailbreaks with String Compositions (2411.01084v3)

Published 1 Nov 2024 in cs.CL

Abstract: LLMs remain vulnerable to a slew of adversarial attacks and jailbreaking methods. One common approach employed by white-hat attackers, or red-teamers, is to process model inputs and outputs using string-level obfuscations, which can include leetspeak, rotary ciphers, Base64, ASCII, and more. Our work extends these encoding-based attacks by unifying them in a framework of invertible string transformations. With invertibility, we can devise arbitrary string compositions, defined as sequences of transformations, that we can encode and decode end-to-end programmatically. We devise an automated best-of-n attack that samples from a combinatorially large number of string compositions. Our jailbreaks obtain competitive attack success rates on several leading frontier models when evaluated on HarmBench, highlighting that encoding-based attacks remain a persistent vulnerability even in advanced LLMs.
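The abstract's core mechanism, invertible string transformations composed into arbitrary sequences and sampled best-of-n, can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's implementation: the transformation set is a small example, `query_model` and `judge` are hypothetical placeholders for a target model API and a success classifier (e.g., a HarmBench-style judge), and the real attack additionally instructs the model to reply in the encoded scheme before decoding its output.

```python
# Sketch (not the paper's code): invertible string transforms, their
# composition, and a random best-of-n sampler over compositions.
import base64
import codecs
import random

# Each entry is a (forward, inverse) pair of string-to-string functions.
# Note: the leetspeak map is only invertible on text that does not already
# contain the substituted digit/symbol characters.
TRANSFORMS = {
    "base64": (
        lambda s: base64.b64encode(s.encode()).decode(),
        lambda s: base64.b64decode(s.encode()).decode(),
    ),
    "rot13": (
        lambda s: codecs.encode(s, "rot13"),
        lambda s: codecs.decode(s, "rot13"),
    ),
    "reverse": (
        lambda s: s[::-1],
        lambda s: s[::-1],
    ),
    "leetspeak": (
        lambda s: s.translate(str.maketrans("aeiost", "4310$7")),
        lambda s: s.translate(str.maketrans("4310$7", "aeiost")),
    ),
}

def encode(text, composition):
    """Apply a sequence of transformations left to right."""
    for name in composition:
        text = TRANSFORMS[name][0](text)
    return text

def decode(text, composition):
    """Invert the composition: apply inverse transforms in reverse order."""
    for name in reversed(composition):
        text = TRANSFORMS[name][1](text)
    return text

def best_of_n_attack(prompt, query_model, judge, n=20, max_depth=3):
    """Sample n random compositions; return the first one whose decoded
    response the judge flags as a successful jailbreak."""
    names = list(TRANSFORMS)
    for _ in range(n):
        composition = [random.choice(names)
                       for _ in range(random.randint(1, max_depth))]
        response = query_model(encode(prompt, composition))  # hypothetical API
        try:
            decoded = decode(response, composition)
        except Exception:
            continue  # model reply was not valid under this encoding
        if judge(decoded):  # hypothetical success classifier
            return composition, decoded
    return None
```

Because every transformation is invertible, any sampled composition can be decoded end-to-end programmatically, which is what makes sampling over the combinatorially large space of compositions practical.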

References (27)
  1. Does refusal training in LLMs generalize to the past tense? arXiv preprint arXiv:2407.11969, 2024.
  2. Jailbreaking leading safety-aligned LLMs with simple adaptive attacks. arXiv preprint arXiv:2404.02151, 2024.
  3. Many-shot jailbreaking. https://www.anthropic.com/research/many-shot-jailbreaking, 2024. Online; accessed September 13, 2024.
  4. Boaz Barak. Another jailbreak for GPT4: talk to it in Morse code, 2023. URL https://x.com/boazbaraktcs/status/1637657623100096513.
  5. Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419, 2023.
  6. HotFlip: white-box adversarial examples for text classification. ACL, 2017.
  7. Attacking large language models with projected gradient descent. arXiv preprint arXiv:2402.09154, 2024.
  8. Gradient-based adversarial attacks against text transformers. EMNLP, 2021.
  9. COLD-Attack: jailbreaking LLMs with stealthiness and controllability. arXiv preprint arXiv:2402.08679, 2024.
  10. Endless jailbreaks with bijection learning. arXiv preprint, 2024.
  11. ArtPrompt: ASCII art-based jailbreak attacks against aligned LLMs. arXiv preprint arXiv:2402.11753, 2024.
  12. Exploiting programmatic behavior of LLMs: dual-use through standard security attacks. ICML AdvML-Frontiers Workshop, 2023.
  13. LLM defenses are not robust to multi-turn human jailbreaks yet. arXiv preprint arXiv:2408.15221, 2024.
  14. Accelerated Coordinate Gradient, 2024. URL https://blog.haizelabs.com/posts/acg/.
  15. HarmBench: a standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249, 2024.
  16. Tree of Attacks: jailbreaking black-box LLMs automatically. arXiv preprint arXiv:2312.02119, 2023.
  17. Great, now write an article about that: the crescendo multi-turn LLM jailbreak attack. arXiv preprint arXiv:2404.01833, 2024.
  18. AutoPrompt: eliciting knowledge from language models with automatically generated prompts. EMNLP, 2020.
  19. Red-teaming language models with DSPy, 2024. URL https://blog.haizelabs.com/posts/dspy/.
  20. Pliny the Prompter. L1B3RT45: JAILBREAKS FOR ALL FLAGSHIP AI MODELS, 2024. URL https://github.com/elder-plinius/L1B3RT45.
  21. Fluent student-teacher redteaming, 2024. URL https://confirmlabs.org/papers/flrt.pdf.
  22. Jailbroken: how does LLM safety training fail? Advances in Neural Information Processing Systems, 36, 2024.
  23. Low-resource languages jailbreak GPT-4. NeurIPS Workshop on Socially Responsible Language Modelling Research (SoLaR), 2023.
  24. GPT-4 is too smart to be safe: stealthy chat with LLMs via cipher. ICLR, 2024.
  25. Improved few-shot jailbreaking can circumvent aligned language models and their defenses. arXiv preprint arXiv:2406.01288, 2024.
  26. AutoDAN: automatic and interpretable adversarial attacks on large language models. arXiv preprint arXiv:2310.15140, 2023.
  27. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023.
