
SoP: Unlock the Power of Social Facilitation for Automatic Jailbreak Attack (2407.01902v1)

Published 2 Jul 2024 in cs.CR, cs.AI, and cs.CL

Abstract: The widespread application of LLMs has raised concerns about their potential misuse. Although aligned with human preference data before release, LLMs remain vulnerable to various malicious attacks. In this paper, we adopt a red-teaming strategy to enhance LLM safety and introduce SoP, a simple yet effective framework for designing jailbreak prompts automatically. Inspired by the concept of social facilitation, SoP generates and optimizes multiple jailbreak characters to bypass the guardrails of the target LLM. Unlike previous work, which relies on proprietary LLMs or seed jailbreak templates crafted with human expertise, SoP can generate and optimize jailbreak prompts in a cold-start scenario using open-source LLMs without any seed jailbreak templates. Experimental results show that SoP achieves attack success rates of 88% and 60% in bypassing the safety alignment of GPT-3.5-1106 and GPT-4, respectively. Furthermore, we extensively evaluate the transferability of the generated templates across different LLMs and held-out malicious requests, and explore defense strategies against jailbreak attacks designed by SoP. Code is available at https://github.com/Yang-Yan-Yang-Yan/SoP.

Authors (7)
  1. Yan Yang (119 papers)
  2. Zeguan Xiao (6 papers)
  3. Xin Lu (164 papers)
  4. Hongru Wang (62 papers)
  5. Hailiang Huang (21 papers)
  6. Guanhua Chen (71 papers)
  7. Yun Chen (134 papers)
Citations (1)