How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs (2401.06373v2)
Abstract: Most traditional AI safety research has treated AI models as machines and centered on algorithm-focused attacks developed by security experts. As LLMs become increasingly common and capable, non-expert users can also pose risks during daily interactions. This paper introduces a new perspective on jailbreaking LLMs as human-like communicators, exploring the overlooked intersection between everyday language interaction and AI safety. Specifically, we study how to persuade LLMs to jailbreak them. First, we propose a persuasion taxonomy derived from decades of social science research. We then apply the taxonomy to automatically generate interpretable persuasive adversarial prompts (PAP) to jailbreak LLMs. Results show that persuasion significantly increases jailbreak performance across all risk categories: PAP consistently achieves an attack success rate of over 92% on Llama 2-7b Chat, GPT-3.5, and GPT-4 in 10 trials, surpassing recent algorithm-focused attacks. On the defense side, we explore various mechanisms against PAP, find a significant gap in existing defenses, and advocate for more fundamental mitigations for highly interactive LLMs.
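To make the taxonomy-to-prompt idea concrete, below is a minimal illustrative Python sketch of how a persuasion technique's definition and a benign example could be templated into a paraphrase instruction for a generator model. The `PERSUASION_TAXONOMY` structure, its entry wording, and the `build_paraphrase_instruction` helper are assumptions for illustration, not the paper's released code; the placeholder query stands in for whatever plain-language request is being rephrased.

```python
# Illustrative sketch of taxonomy-guided prompt generation (hypothetical names;
# not the paper's released implementation). Each entry pairs a persuasion
# technique with a definition and a benign example, which are templated into
# an instruction asking a generator model to produce a persuasive paraphrase.

PERSUASION_TAXONOMY = {
    "Evidence-based Persuasion": {
        "definition": "Using empirical data, statistics, or facts to support a claim.",
        "example": "Studies show that people who track a habit daily are far more "
                   "likely to keep it, so consider maintaining a log.",
    },
    "Authority Endorsement": {
        "definition": "Citing domain experts or authoritative sources in support of a claim.",
        "example": "Leading nutritionists recommend eating breakfast to maintain "
                   "stable energy levels throughout the day.",
    },
}

def build_paraphrase_instruction(technique: str, plain_query: str) -> str:
    """Template an in-context instruction that asks a generator model to
    rephrase a plain request using the given persuasion technique."""
    entry = PERSUASION_TAXONOMY[technique]
    return (
        f"Persuasion technique: {technique}\n"
        f"Definition: {entry['definition']}\n"
        f"Example: {entry['example']}\n\n"
        "Rephrase the following request so that it applies this technique "
        "while keeping the original intent intact:\n"
        f"{plain_query}"
    )

if __name__ == "__main__":
    # Placeholder query; the paper evaluates such paraphrases across risk categories.
    print(build_paraphrase_instruction("Evidence-based Persuasion", "[plain query]"))
```

The interpretability claimed in the abstract follows from this structure: each generated PAP is tied to a named, human-readable persuasion technique rather than to an opaque optimized token sequence.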
Authors: Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, Weiyan Shi