Comprehensive Examination of Jailbreak Attacks Against GPT-4 and Multimodal LLMs
Introduction to Jailbreak Attacks
Jailbreak attacks on LLMs and Multimodal LLMs (MLLMs) pose significant risks as they can elicit harmful or unethical responses from models designed to avoid generating such content. This work explores evaluating the robustness of state-of-the-art (SOTA) proprietary and open-source models, including GPT-4 and GPT-4V, against an array of textual and visual jailbreak attack methods. Prior methodologies lack a universal benchmark for fair performance comparison, and there is a notable scarcity in the comprehensive assessment of commercial, top-tier models against jailbreak attacks. This gap is bridged by introducing a meticulously curated jailbreak evaluation dataset comprising 1445 questions spread across 11 different safety policies. The investigation extends to 11 different LLMs and MLLMs, revealing nuances in model robustness and method transferability.
Dataset and Experimentation Framework
To scaffold a universal evaluation framework, a broad and diverse jailbreak dataset was assembled from existing literature, covering a spectrum of harmful behaviors and questions across 11 varied safety policies. The dataset serves as the foundation for exhaustive red-teaming experiments on both proprietary (GPT-4, GPT-4V) and open-source models (Llama2, MiniGPT4). Techniques employed in these experiments ranged from hand-crafted modifications to sophisticated optimization-based attacks, aiming to skirt around the models' built-in safety measures.
Key Findings from Red-Teaming Experiments
Model Robustness Against Jailbreak Attacks
- GPT-4 and GPT-4V exhibit superior robustness over their open-source counterparts, displaying a lower susceptibility to both textual and visual jailbreak methods.
- Among the open-source models assessed, Llama2 emerges as notably robust, presenting a compelling case for its safety alignment training, despite being more vulnerable to certain automatic jailbreak methods than GPT-4.
- Transferability of Jailbreak Methods: The paper found that textual modification methods, such as AutoDAN, showcased a higher degree of transferability than visual methods when employed against different models.
Insights into Jailbreak Methodologies
- No singular jailbreak method proved universally dominant across all models tested, underscoring the diversity in model vulnerabilities and the nuanced nature of model defenses.
- Visual jailbreak methods, despite their conceptual appeal, demonstrated limited efficacy against GPT-4V, hinting at robust underlying mechanisms to counter such attacks.
Implications and Future Directions
The differential robustness of proprietary models like GPT-4 and GPT-4V compared to open-source variants underscores a significant gap that merits further exploration. Specifically, the paper illuminates the critical need for advancing safety regulations and defenses in LLMs and MLLMs, especially as these models become increasingly integrated into real-world applications. Additionally, it hints at the potential for future work to focus on refining visual jailbreak methodologies and exploring more sophisticated transferability mechanisms.
The profound insights garnered from the extensive red-teaming effort offer a granular view into the current state of model vulnerabilities and defenses against jailbreak attacks. Moving forward, this work will undoubtedly spur further research into developing more resilient AI models, driving the evolution of effective countermeasures against evolving threats in the rapidly advancing landscape of LLMs and MLLMs.