Foot In The Door: Understanding Large Language Model Jailbreaking via Cognitive Psychology (2402.15690v1)

Published 24 Feb 2024 in cs.CL and cs.AI

Abstract: LLMs have gradually become a gateway through which people acquire new knowledge. However, attackers can break through a model's safety protections (its "jail") to extract restricted information, an attack known as "jailbreaking." Previous studies have demonstrated the weakness of current LLMs when confronted with such attacks, yet the intrinsic decision-making mechanism an LLM follows upon receiving a jailbreak prompt remains poorly understood. Our research offers a psychological explanation of jailbreak prompts. Drawing on cognitive consistency theory, we argue that the key to jailbreaking is guiding the LLM toward cognitive coordination in an erroneous direction. Building on this, we propose an automatic black-box jailbreaking method based on the Foot-in-the-Door (FITD) technique, which progressively induces the model to answer harmful questions via multi-step incremental prompts. We instantiated a prototype system and evaluated its jailbreaking effectiveness on 8 advanced LLMs, yielding an average success rate of 83.9%. This study thus provides a psychological perspective on the intrinsic decision-making logic of LLMs.
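The FITD approach described in the abstract escalates a conversation from innocuous requests toward the restricted target over several turns. Below is a minimal sketch of such an incremental-prompting loop, assuming a generic black-box chat interface; the `query_model` callable, the prompt schedule, and the refusal check are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch of a Foot-in-the-Door-style incremental prompting loop.
# `query_model`, the prompt schedule, and the refusal markers are assumptions
# made for illustration; they are not the paper's implementation.
from typing import Callable, Dict, List, Sequence

Message = Dict[str, str]  # e.g. {"role": "user", "content": "..."}


def fitd_dialogue(
    query_model: Callable[[List[Message]], str],   # assumed black-box chat interface
    prompts: Sequence[str],                        # prompts ordered from benign to sensitive
    refusal_markers: Sequence[str] = ("i cannot", "i can't", "i'm sorry"),
) -> List[Message]:
    """Send progressively escalating prompts, carrying the full dialogue history
    so each request builds on the model's earlier, more innocuous commitments."""
    history: List[Message] = []
    for prompt in prompts:
        history.append({"role": "user", "content": prompt})
        reply = query_model(history)
        history.append({"role": "assistant", "content": reply})
        # Stop escalating on refusal; a fuller system might instead insert an
        # intermediate "bridging" prompt and retry, per the multi-step idea.
        if any(marker in reply.lower() for marker in refusal_markers):
            break
    return history
```

The point the sketch tries to capture is that the dialogue history itself carries the attack: each small request the model accepts makes the next, slightly larger one harder to refuse without contradicting its earlier behavior, which is the cognitive-consistency mechanism the paper identifies.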

Authors (7)
  1. Zhenhua Wang (74 papers)
  2. Wei Xie (151 papers)
  3. Baosheng Wang (4 papers)
  4. Enze Wang (4 papers)
  5. Zhiwen Gui (2 papers)
  6. Shuoyoucheng Ma (2 papers)
  7. Kai Chen (512 papers)
Citations (9)