JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models (2407.01599v2)

Published 26 Jun 2024 in cs.CL, cs.CR, cs.CV, and cs.LG

Abstract: The rapid evolution of AI through developments in LLMs and Vision-LLMs (VLMs) has brought significant advancements across various technological domains. While these models enhance capabilities in natural language processing and visual interactive tasks, their growing adoption raises critical concerns regarding security and ethical alignment. This survey provides an extensive review of the emerging field of jailbreaking--deliberately circumventing the ethical and operational boundaries of LLMs and VLMs--and the consequent development of defense mechanisms. Our study categorizes jailbreaks into seven distinct types and elaborates on defense strategies that address these vulnerabilities. Through this comprehensive examination, we identify research gaps and propose directions for future studies to enhance the security frameworks of LLMs and VLMs. Our findings underscore the necessity for a unified perspective that integrates both jailbreak strategies and defensive solutions to foster a robust, secure, and reliable environment for the next generation of LLMs. More details can be found on our website: \url{https://chonghan-chen.com/LLM-jailbreak-zoo-survey/}.

PDF Abstract

JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-LLMs

The paper "JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-LLMs" offers a comprehensive examination of the phenomena known as jailbreaking in the context of LLMs and Vision-LLMs (VLMs). The paper systematically categorizes various jailbreaking strategies and discusses corresponding defense mechanisms while identifying potential future directions in the field.

Overview of Jailbreak Strategies

The paper categorizes jailbreaking LLMs into five primary types, each exploiting different aspects of the models:

Gradient-based Jailbreaks: These attacks utilize gradients to adjust inputs and generate adversarial prompts compelling LLMs to produce harmful responses. Techniques such as Greedy Coordinate Gradient (GCG) and AutoDAN exemplify this approach.
Evolutionary-based Jailbreaks: These methods employ genetic algorithms and evolutionary strategies to optimize adversarial prompts. Tools like FuzzLLM and GPTFUZZER fall into this category.
Demonstration-based Jailbreaks: Here, specific static prompts are crafted to elicit desired responses, exemplified by the DAN and MJP methods.
Rule-based Jailbreaks: These involve decomposing and redirecting malicious prompts through predefined rules to evade detection, as seen in ReNeLLM and CodeAttack.
Multi-agent-based Jailbreaks: This strategy leverages the cooperation of multiple models to iteratively refine jailbreak prompts, illustrated by methods such as PAIR and GUARD.

For VLMs, the paper identifies three main types of jailbreaks:

Prompt-to-Image Injection: This method converts textual prompts to visual formats, deceiving the model via typographic inputs, showcased by FigStep.
Prompt-Image Perturbation Injection: This strategy involves altering both visual inputs and associated textual prompts to induce harmful responses. Techniques like OT-Attack and SGA are notable examples.
Proxy Model Transfer Jailbreaks: By using alternative VLMs to generate perturbed images, this method exploits the transferability of adversarial examples, as illustrated by Shayegani et al.’s work.

Overview of Defense Mechanisms

In response to the identified jailbreak strategies, the paper categorizes defense mechanisms for LLMs into six types:

Prompt Detection-based Defenses: Techniques such as perplexity analysis are employed to detect potentially malicious prompts.
Prompt Perturbation-based Defenses: Methods like paraphrasing and BPE-dropout retokenization aim to neutralize adversarial prompts.
Demonstration-based Defenses: Incorporating safety prompts to guide LLMs towards safe responses, such as self-reminders.
Generation Intervention-based Defenses: Approaches like Rain and SafeDecoding intervene in the response generation process to ensure safety.
Response Evaluation-based Defenses: Evaluating and refining responses iteratively to filter out harmful outputs.
Model Fine-tuning-based Defenses: Techniques like adversarial training and knowledge editing modify the model to enhance safety inherently.

For VLMs, the defenses are categorized as follows:

Model Fine-tuning-based Defenses: Leveraging methods like adversarial training and natural language feedback to enhance model safety.
Response Evaluation-based Defenses: Assessing and refining VLM responses during inference to ensure safety, exemplified by ECSO.
Prompt Perturbation-based Defenses: Altering input prompts to detect potential jailbreak attempts by evaluating response consistency, as shown with JailGuard.

Evaluation Methods

The paper highlights various methodologies for evaluating the effectiveness of jailbreak strategies and defense mechanisms. These comprehensive evaluations play a crucial role in understanding and improving the security frameworks of LLMs and VLMs.

Implications and Future Directions

Theoretical and practical implications of this research are vast. By providing a structured overview of jailbreak strategies and defenses, this paper lays the groundwork for future research in developing more robust models. The identification of research gaps and proposed future directions, such as multilingual safety alignment and adaptive defense mechanisms, are critical for advancing the field.

Conclusion

The paper "JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-LLMs" offers an extensive survey of jailbreak tactics and defense mechanisms for LLMs and VLMs. By categorizing attack and defense strategies, the paper provides a unified perspective on the security landscape, essential for developing robust and secure models. The findings and proposed future directions underscore the importance of continuous research and innovation in enhancing the safety and reliability of AI models.

PDF Markdown Bookmark Chat (Pro)

Authors (7)

Haibo Jin (35 papers)
Leyang Hu (6 papers)
Xinuo Li (1 paper)
Peiyan Zhang (21 papers)
Chonghan Chen (3 papers)
Jun Zhuang (34 papers)
Haohan Wang (96 papers)

Citations (15)

View on Semantic Scholar

Related Papers

Find Related Papers

Tweets

https://twitter.com/HaohanWang/status/1809770461334471042

https://twitter.com/_reachsumit/status/1808528649756540937

https://twitter.com/skim71/status/1809857173838864400

https://twitter.com/junzhuang_/status/1809688362485444907