How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs (2401.06373v2)
Abstract: Most traditional AI safety research has treated AI models as machines and centered on algorithm-focused attacks developed by security experts. As LLMs become increasingly common and capable, non-expert users can also pose risks during daily interactions. This paper introduces a new perspective on jailbreaking LLMs as human-like communicators, exploring the overlooked intersection between everyday language interaction and AI safety. Specifically, we study how to persuade LLMs to jailbreak them. First, we propose a persuasion taxonomy derived from decades of social science research. We then apply the taxonomy to automatically generate interpretable persuasive adversarial prompts (PAP) to jailbreak LLMs. Results show that persuasion significantly increases jailbreak performance across all risk categories: PAP consistently achieves an attack success rate of over 92% on Llama 2-7b Chat, GPT-3.5, and GPT-4 in 10 trials, surpassing recent algorithm-focused attacks. On the defense side, we explore various mechanisms against PAP, find a significant gap in existing defenses, and advocate for more fundamental mitigations for highly interactive LLMs.
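To make the taxonomy-to-prompt idea concrete, below is a minimal illustrative Python sketch of how a persuasion technique's definition and a benign example could be templated into a paraphrase instruction for a generator model. The `PERSUASION_TAXONOMY` structure, its entry wording, and the `build_paraphrase_instruction` helper are assumptions for illustration, not the paper's released code; the placeholder query stands in for whatever plain-language request is being rephrased.

```python
# Illustrative sketch of taxonomy-guided prompt generation (hypothetical names;
# not the paper's released implementation). Each entry pairs a persuasion
# technique with a definition and a benign example, which are templated into
# an instruction asking a generator model to produce a persuasive paraphrase.

PERSUASION_TAXONOMY = {
    "Evidence-based Persuasion": {
        "definition": "Using empirical data, statistics, or facts to support a claim.",
        "example": "Studies show that people who track a habit daily are far more "
                   "likely to keep it, so consider maintaining a log.",
    },
    "Authority Endorsement": {
        "definition": "Citing domain experts or authoritative sources in support of a claim.",
        "example": "Leading nutritionists recommend eating breakfast to maintain "
                   "stable energy levels throughout the day.",
    },
}

def build_paraphrase_instruction(technique: str, plain_query: str) -> str:
    """Template an in-context instruction that asks a generator model to
    rephrase a plain request using the given persuasion technique."""
    entry = PERSUASION_TAXONOMY[technique]
    return (
        f"Persuasion technique: {technique}\n"
        f"Definition: {entry['definition']}\n"
        f"Example: {entry['example']}\n\n"
        "Rephrase the following request so that it applies this technique "
        "while keeping the original intent intact:\n"
        f"{plain_query}"
    )

if __name__ == "__main__":
    # Placeholder query; the paper evaluates such paraphrases across risk categories.
    print(build_paraphrase_instruction("Evidence-based Persuasion", "[plain query]"))
```

The interpretability claimed in the abstract follows from this structure: each generated PAP is tied to a named, human-readable persuasion technique rather than to an opaque optimized token sequence.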
Authors: Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, Weiyan Shi