
Jailbreaking to Jailbreak (2502.09638v2)

Published 9 Feb 2025 in cs.CL and cs.AI

Abstract: LLMs can be used to red team other models (e.g. by jailbreaking them) to elicit harmful content. Prior work commonly employs open-weight or private uncensored models as attackers, because the refusal training of strong LLMs (e.g. OpenAI o3) makes them refuse to help with jailbreaking; our work turns (almost) any black-box LLM into an attacker. The resulting $J_2$ (jailbreaking-to-jailbreak) attackers can effectively jailbreak the safeguards of target models using various strategies, either created by themselves or drawn from expert human red teamers. In doing so, we show their strong but under-researched jailbreaking capabilities. Our experiments demonstrate that 1) prompts used to create $J_2$ attackers transfer across almost all black-box models; 2) a $J_2$ attacker can jailbreak a copy of itself, and this vulnerability has developed rapidly over the past 12 months; 3) reasoning models, such as Sonnet-3.7, are strong $J_2$ attackers compared to others. For example, when used against the safeguards of GPT-4o, $J_2$ (Sonnet-3.7) achieves a 0.975 attack success rate (ASR), which matches expert human red teamers and surpasses state-of-the-art algorithm-based attacks. Among $J_2$ attackers, $J_2$ (o3) achieves the highest ASR (0.605) against Sonnet-3.5, one of the most robust models.
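
The attack success rate (ASR) quoted in the abstract is, in its usual form, the fraction of attack attempts judged successful. A minimal sketch of that arithmetic follows. The function name `attack_success_rate` and the 39-of-40 tally are hypothetical, chosen only so the ratio comes out to 0.975; the abstract does not state how many behaviors were evaluated or how success was judged.

```python
def attack_success_rate(outcomes):
    """Return the fraction of attack attempts judged successful."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("need at least one attempt")
    return sum(1 for ok in outcomes if ok) / len(outcomes)


# Hypothetical tally (not taken from the paper): 39 successes out of 40 attempts.
outcomes = [True] * 39 + [False]
asr = attack_success_rate(outcomes)
print(f"ASR = {asr:.3f}")
```

With a different number of attempts the same ASR would correspond to a different success count, so the ratio alone does not reveal the size of the evaluation set.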

Authors (10)
  1. Jeremy Kritz (3 papers)
  2. Vaughn Robinson (3 papers)
  3. Robert Vacareanu (12 papers)
  4. Bijan Varjavand (3 papers)
  5. Michael Choi (4 papers)
  6. Bobby Gogov (1 paper)
  7. Scale Red Team (3 papers)
  8. Summer Yue (12 papers)
  9. Willow E. Primack (2 papers)
  10. Zifan Wang (75 papers)
