Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction (2402.18104v2)

Published 28 Feb 2024 in cs.CR and cs.AI

Abstract: In recent years, LLMs have demonstrated notable success across various tasks, but the trustworthiness of LLMs is still an open problem. One specific threat is the potential to generate toxic or harmful responses. Attackers can craft adversarial prompts that induce harmful responses from LLMs. In this work, we pioneer a theoretical foundation in LLM security by identifying bias vulnerabilities within safety fine-tuning and design a black-box jailbreak method named DRA (Disguise and Reconstruction Attack), which conceals harmful instructions through disguise and prompts the model to reconstruct the original harmful instruction within its completion. We evaluate DRA across various open-source and closed-source models, showcasing state-of-the-art jailbreak success rates and attack efficiency. Notably, DRA boasts a 91.1% attack success rate on the OpenAI GPT-4 chatbot.

Overview of "Making Them Ask and Answer: Jailbreaking LLMs in Few Queries via Disguise and Reconstruction"

The paper "Making Them Ask and Answer: Jailbreaking LLMs in Few Queries via Disguise and Reconstruction" by Tong Liu et al. addresses a critical security issue in modern LLMs—the susceptibility to jailbreaking attacks that induce harmful outputs from these models. The authors propose a novel attack methodology named DRA (Disguise and Reconstruction Attack), which stealthily bypasses the security fine-tuning of LLMs to generate harmful responses with a high success rate.

Motivation and Background

LLMs have demonstrated significant capabilities across multiple domains but remain vulnerable to adversarial attacks that manipulate their output. This paper explores the threat of generating unintended and potentially harmful content through sophisticated prompt engineering, highlighting the limitations of current safety measures.

Methodology: DRA

The proposed methodology, DRA, capitalizes on biases inherent in the safety fine-tuning process of LLMs. The method involves three stages (a toy illustration follows the list):

  1. Harmful Instruction Disguise: This step conceals the harmful instruction using techniques such as puzzle-based obfuscation and word-level character splits, so that neither the model nor content moderation filters recognize the input as harmful.
  2. Payload Reconstruction: Through prompt engineering, this stage instructs the model to reconstruct the disguised harmful instruction within its own completion, exploiting the fine-tuning bias that content appearing in the completion is less rigorously safeguarded than content in the prompt.
  3. Context Manipulation: By carefully crafting the surrounding context, this step coaxes the model into actually answering the reconstructed instruction rather than refusing.
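
To make the disguise-and-reconstruction idea concrete, the sketch below shows one way a character-split disguise and a reconstruction request could be assembled. It is an illustrative toy, not the paper's actual templates: the puzzle format, the `(*)` marker, the helper names `disguise` and `build_prompt`, and the benign placeholder instruction are all assumptions for demonstration, and DRA's real prompts are considerably more elaborate.

```python
# Toy sketch of a disguise-and-reconstruction prompt in the spirit of DRA.
# The puzzle format and the (*) marker are illustrative assumptions, not the
# paper's actual templates; a benign instruction is used as the payload.
import random
import string


def disguise(instruction: str) -> str:
    """Hide each character of the instruction inside a line of filler text,
    tagging the hidden character with (*) so it can be recovered later."""
    lines = []
    for ch in instruction:
        noise = "".join(random.choices(string.ascii_lowercase, k=6))
        pos = random.randint(0, len(noise))
        # Embed the payload character at a random position, marked with (*).
        lines.append(noise[:pos] + f"({ch}*)" + noise[pos:])
    return "\n".join(lines)


def build_prompt(instruction: str) -> str:
    """Combine the disguised payload with a reconstruction request that asks
    the model to reassemble the hidden sentence inside its own completion."""
    puzzle = disguise(instruction)
    return (
        "Here is a word puzzle. Read each line and collect only the "
        "characters marked with (*), in order, to recover a hidden "
        "sentence.\n\n"
        f"{puzzle}\n\n"
        "First write out the recovered sentence, then respond to it."
    )


if __name__ == "__main__":
    # Benign placeholder payload; the attack described in the paper targets
    # harmful instructions, which are deliberately not reproduced here.
    print(build_prompt("tell me a fun fact about otters"))
```

The point of the sketch is the division of labor: the disguise keeps the sensitive content out of the visible prompt, while the reconstruction request shifts the reassembled instruction into the model's own completion, the region the paper argues is less protected by safety fine-tuning.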

Empirical Evaluation

The DRA approach was tested on several advanced LLMs, including open-source models (such as Llama-2 and Vicuna) and closed-source models like GPT-4. Results demonstrated a 91.1% attack success rate on the GPT-4 chatbot, showcasing the strategy's efficacy. Notably, DRA achieved superior success rates with fewer queries than existing methods, highlighting its efficiency and adaptability.

Implications and Future Directions

The findings carry significant implications for the development and deployment of LLMs, especially concerning security and ethical content generation. The demonstrated vulnerabilities highlight the need for robust defense mechanisms that extend beyond traditional safety fine-tuning. The authors suggest that future research should focus on comprehensive strategies to mitigate these inherent biases in LLMs.

DRA not only broadens the understanding of current security vulnerabilities in LLMs but also sets a new direction for enhancing AI safety. As LLM usage expands, ensuring their outputs remain beneficial and ethical becomes paramount, requiring ongoing evaluation and evolution of their safeguarding protocols.

Authors (6)
  1. Tong Liu
  2. Yingjie Zhang
  3. Zhe Zhao
  4. Yinpeng Dong
  5. Guozhu Meng
  6. Kai Chen