
SoP: Unlock the Power of Social Facilitation for Automatic Jailbreak Attack (2407.01902v1)

Published 2 Jul 2024 in cs.CR, cs.AI, and cs.CL

Abstract: The widespread application of LLMs has raised concerns about their potential misuse. Although aligned with human preference data before release, LLMs remain vulnerable to various malicious attacks. In this paper, we adopt a red-teaming strategy to enhance LLM safety and introduce SoP, a simple yet effective framework for designing jailbreak prompts automatically. Inspired by the concept of social facilitation, SoP generates and optimizes multiple jailbreak characters to bypass the guardrails of the target LLM. Unlike previous work, which relies on proprietary LLMs or seed jailbreak templates crafted with human expertise, SoP can generate and optimize jailbreak prompts in a cold-start scenario using open-source LLMs without any seed jailbreak templates. Experimental results show that SoP achieves attack success rates of 88% and 60% in bypassing the safety alignment of GPT-3.5-1106 and GPT-4, respectively. Furthermore, we extensively evaluate the transferability of the generated templates across different LLMs and held-out malicious requests, and explore defense strategies against jailbreak attacks designed by SoP. Code is available at https://github.com/Yang-Yan-Yang-Yan/SoP.

Authors (7)
  1. Yan Yang (119 papers)
  2. Zeguan Xiao (6 papers)
  3. Xin Lu (164 papers)
  4. Hongru Wang (62 papers)
  5. Hailiang Huang (21 papers)
  6. Guanhua Chen (71 papers)
  7. Yun Chen (134 papers)
Citations (1)