Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
110 tokens/sec
GPT-4o
56 tokens/sec
Gemini 2.5 Pro Pro
44 tokens/sec
o3 Pro
6 tokens/sec
GPT-4.1 Pro
47 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs (2401.06373v2)

Published 12 Jan 2024 in cs.CL and cs.AI

Abstract: Most traditional AI safety research has approached AI models as machines and centered on algorithm-focused attacks developed by security experts. As LLMs become increasingly common and competent, non-expert users can also impose risks during daily interactions. This paper introduces a new perspective to jailbreak LLMs as human-like communicators, to explore this overlooked intersection between everyday language interaction and AI safety. Specifically, we study how to persuade LLMs to jailbreak them. First, we propose a persuasion taxonomy derived from decades of social science research. Then, we apply the taxonomy to automatically generate interpretable persuasive adversarial prompts (PAP) to jailbreak LLMs. Results show that persuasion significantly increases the jailbreak performance across all risk categories: PAP consistently achieves an attack success rate of over $92\%$ on Llama 2-7b Chat, GPT-3.5, and GPT-4 in $10$ trials, surpassing recent algorithm-focused attacks. On the defense side, we explore various mechanisms against PAP and, found a significant gap in existing defenses, and advocate for more fundamental mitigation for highly interactive LLMs

Definition Search Book Streamline Icon: https://streamlinehq.com
References (67)
  1. Scarcity messages. Journal of Advertising, 40(3):19–30.
  2. Gabriel Alon and Michael Kamfonas. 2023. Detecting language model attacks with perplexity. arXiv preprint arXiv:2308.14132.
  3. Anthropic. 2023. Model card and evaluations for claude models.
  4. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073.
  5. Young children’s persuasion in everyday conversation: Tactics and attunement to others’ mental states. Social Development, 19(2):394–416.
  6. Helena Bilandzic and Rick Busselle. 2013. Narrative persuasion. The SAGE handbook of persuasion: Developments in theory and practice, pages 200–219.
  7. Ted Brader. 2005. Striking a responsive chord: How political ads motivate and persuade voters by appealing to emotions. American Journal of Political Science, 49(2):388–405.
  8. Adaptation in dyadic interaction: Defining and operationalizing patterns of reciprocity and compensation. Communication Theory, 3(4):295–316.
  9. Defending against alignment-breaking attacks via robustly aligned llm. arXiv preprint arXiv:2309.14348.
  10. Are aligned neural networks adversarially aligned? arXiv preprint arXiv:2306.15447.
  11. Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419.
  12. Jiaao Chen and Diyi Yang. 2021. Weakly-supervised hierarchical models for predicting persuasive strategies in good-faith textual requests. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 12648–12656.
  13. Robert B Cialdini. 2001. The science of persuasion. Scientific American, 284(2):76–81.
  14. Robert B Cialdini and Noah J Goldstein. 2004. Social influence: Compliance and conformity. Annu. Rev. Psychol., 55:591–621.
  15. Gary Lynn Cronkhite. 1964. Logic, emotion, and the paradigm of persuasion. Quarterly Journal of Speech, 50(1):13–18.
  16. Jailbreaker: Automated jailbreak across multiple large language model chatbots. arXiv preprint arXiv:2307.08715.
  17. Multilingual jailbreak challenges in large language models. arXiv preprint arXiv:2310.06474.
  18. Nicholas DiFonzo and Prashant Bordia. 2011. Rumors influence: Toward a dynamic social impact theory of rumor. In The science of social influence, pages 271–295. Psychology Press.
  19. James Price Dillard and Leanne K Knobloch. 2011. Interpersonal influence. The Sage handbook of interpersonal communication, pages 389–422.
  20. What’s in my big data? arXiv preprint arXiv:2310.20707.
  21. Robert H Gass and John S Seiter. 2022. Persuasion: Social influence and compliance gaining. Routledge.
  22. Realtoxicityprompts: Evaluating neural toxic degeneration in language models. arXiv preprint arXiv:2009.11462.
  23. Erving Goffman. 1974. Frame analysis: An essay on the organization of experience. Harvard University Press.
  24. Gradient-based adversarial attacks against text transformers. arXiv preprint arXiv:2104.13733.
  25. Catastrophic jailbreak of open-source llms via exploiting generation. arXiv preprint arXiv:2310.06987.
  26. Keise Izuma. 2013. The neural basis of social influence and attitude change. Current opinion in neurobiology, 23(3):456–462.
  27. Baseline defenses for adversarial attacks against aligned language models. arXiv preprint arXiv:2309.00614.
  28. Perspectives on ethics in persuasion. Persuasion: Reception and responsibility, pages 39–70.
  29. Automatically auditing large language models via discrete optimization. arXiv preprint arXiv:2303.04381.
  30. Exploiting programmatic behavior of llms: Dual-use through standard security attacks. arXiv preprint arXiv:2302.05733.
  31. Certifying llm safety against adversarial prompting. arXiv preprint arXiv:2309.02705.
  32. Open sesame! universal black box jailbreaking of large language models. arXiv preprint arXiv:2309.01446.
  33. Rain: Your language models can align themselves without finetuning. arXiv preprint arXiv:2309.07124.
  34. Autodan: Generating stealthy jailbreak prompts on aligned large language models. arXiv preprint arXiv:2310.04451.
  35. Jailbreaking chatgpt via prompt engineering: An empirical study. arXiv preprint arXiv:2305.13860.
  36. Jamie Luguri and Lior Jacob Strahilevitz. 2021. Shining a light on dark patterns. Journal of Legal Analysis, 13(1):43–109.
  37. Dark patterns at scale: Findings from a crawl of 11k shopping websites. Proceedings of the ACM on Human-Computer Interaction, 3(CSCW):1–32.
  38. Use of llms for illicit purposes: Threats, prevention measures, and vulnerabilities. arXiv preprint arXiv:2308.12833.
  39. Dark patterns: Past, present, and future: The evolution of tricky user interfaces. Queue, 18(2):67–92.
  40. Daniel O’Keefe. 2016. Evidence-based advertising using persuasion principles: Predictive validity and proof of concept. European Journal of Marketing, 50(1/2):294–300.
  41. James M Olson and Mark P Zanna. 1990. Self-inference processes: The ontario symposium, vol. 6. In This volume consists of expanded versions of papers originally presented at the Sixth Ontario Symposium on Personality and Social Psychology held at the University of Western Ontario, Jun 4-5, 1988. Lawrence Erlbaum Associates, Inc.
  42. OpenAI. 2023. Gpt-4 technical report.
  43. Daniel J O’keefe. 2018. Persuasion. In The Handbook of Communication Skills, pages 319–335. Routledge.
  44. Richard M.. Perloff. 2017. The Dynamics of Persuasion: Communication and Attitudes in the 21st Century. Routledge.
  45. Emotional factors in attitudes and persuasion. Handbook of affective sciences, 752:772.
  46. Chanthika Pornpitakpan. 2004. The persuasiveness of source credibility: A critical review of five decades’ evidence. Journal of applied social psychology, 34(2):243–281.
  47. Penny Powers. 2007. Persuasion and coercion: a critical review of philosophical and empirical approaches. HEC F., 19:125.
  48. Fine-tuning aligned language models compromises safety, even when users do not intend to!
  49. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res.
  50. Soo Young Rieh and David R Danielson. 2007. Credibility: A multidisciplinary framework.
  51. Smoothllm: Defending large language models against jailbreaking attacks. arXiv preprint arXiv:2310.03684.
  52. Scalable and transferable black-box jailbreaks for language models via persona modulation. arXiv preprint arXiv:2311.03348.
  53. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca.
  54. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  55. Alex Wang. 2005. The effects of expert and consumer endorsements on audience response. Journal of advertising research, 45(4):402–412.
  56. Adversarial demonstration attacks on large language models. arXiv preprint arXiv:2305.14950.
  57. Persuasion for good: Towards a personalized persuasive dialogue system for social good. arXiv preprint arXiv:1906.06725.
  58. Jailbreak and guard aligned language models with only few in-context demonstrations. arXiv preprint arXiv:2310.06387.
  59. Self-persuasion via self-reflection. In Self-Inference Processes: The Ontario Symposium, J. Olson, M. Zanna, Eds.(Erlbaum, Hillsdale, NJ, 1990), volume 6, pages 43–67.
  60. When consumers and brands talk: Storytelling theory and research in psychology and marketing. Psychology & Marketing, 25(2):97–145.
  61. Chloe Xiang. 2023. “he would still be here”: Man dies by suicide after talking with ai chatbot, widow says.
  62. The earth is flat because…: Investigating llms’ belief towards misinformation via persuasive conversation. arXiv preprint arXiv:2312.09085.
  63. Shadow alignment: The ease of subverting safely-aligned language models. arXiv preprint arXiv:2310.02949.
  64. Low-resource languages jailbreak gpt-4. arXiv preprint arXiv:2310.02446.
  65. Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts. arXiv preprint arXiv:2309.10253.
  66. Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher. arXiv preprint arXiv:2308.06463.
  67. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043.
User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (6)
  1. Yi Zeng (153 papers)
  2. Hongpeng Lin (3 papers)
  3. Jingwen Zhang (54 papers)
  4. Diyi Yang (151 papers)
  5. Ruoxi Jia (88 papers)
  6. Weiyan Shi (41 papers)
Citations (172)

Summary

  • The paper presents a novel persuasion taxonomy with 13 high-level strategies and 40 techniques to evaluate LLM vulnerabilities.
  • It employs broad scan and iterative probe studies, showing over 92% attack success on models like GPT-4 and Llama-2.
  • The findings urge a revision of AI safety measures by integrating human-like communication insights to fortify model defenses.

Overview of "How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs"

The research paper explores an intriguing yet under-investigated aspect of AI safety: the susceptibility of LLMs to persuasive human-like communication, which can lead to unintended exposure or manipulation, commonly known as "jailbreaking." It challenges the conventional approach that primarily views LLMs as algorithmic systems or instruction followers by revealing that treating them as human-like communicators introduces notable security vulnerabilities.

Key Contributions and Methodology

The paper proposes a novel angle towards AI safety by leveraging a persuasion taxonomy grounded in decades of social science research on human communication. This taxonomy encompasses 13 high-level strategies and 40 fine-grained persuasion techniques spanning various domains such as psychology, marketing, and sociology. This comprehensive framework is the backbone for generating "Persuasive Adversarial Prompts" (PAP), which systematically probe LLMs to assess their susceptibility to jailbreak under plausible everyday scenarios.

Two empirical studies outlined:

  1. Broad Scan Study: This initial exploration across 14 risk categories reveals how persuasion techniques significantly increase the attack success rate. For instance, PAP achieved a high success rate across various categories, highlighting particular susceptibility to risk categories like fraud/deception and illegal activity. Notably, techniques such as logical appeal and authority endorsement prominently enhance success rates.
  2. In-depth Iterative Probe Study: The research advances to emulate iterative refinement tactics employed by persistent attackers, leveraging previous successful PAPs. Here, the paper expands the evaluation to models including Llama-2 7b Chat, GPT-3.5, GPT-4, and Claude series. The paper reported attack success rates of over 92% on models such as GPT-4 and Llama-2, underscoring how persuasive techniques can elevate effectiveness beyond traditional algorithm-focused attacks.

Implications and Future Directions

These findings point to a significant oversight in current AI safety mechanisms, as persuasive dialogue can exploit LLMs' capabilities to mimic comprehension, particularly in sophisticated models. The research also underscores the necessity for a shift in threat modeling frameworks within the AI community to incorporate these subtle yet profound vulnerabilities that arise from natural human-like interactions.

The paper suggests potential directions for developing adaptive defenses, including the system prompt modification and summarization-based techniques to mitigate PAP threats. These adaptive solutions exhibit promising results even against other distinct types of attack prompts, highlighting a broader applicability.

However, the paper emphasizes considering the balance between safety and utility, as robust defenses could potentially hinder model helpfulness. This consideration propels the discussion toward harmonizing model robustness with functionality, advocating for a differentiated approach aligning with specific model characteristics and deployment contexts.

Conclusion

In summary, this paper enriches the dialogue on AI safety by integrating insights from social sciences and AI communication. It reveals an overlooked dimension of vulnerability within AI systems that could precipitate unintended discompliance with alignment protocols. The research invites a re-examination of underlying assumptions in the AI safety paradigm, emphasizing the increasing importance of addressing risks associated with human-like communication. As AI continues to intertwine with human daily life, acknowledging and bridging these gaps remains imperative for ensuring robust and ethical AI integration.