Securing Vision-Language Models with a Robust Encoder Against Jailbreak and Adversarial Attacks

Published 11 Sep 2024 in cs.CV and cs.AI | (2409.07353v1)

Abstract: Large Vision-Language Models (LVLMs), trained on large multimodal datasets, have significantly advanced AI by excelling in vision-language tasks. However, these models remain vulnerable to adversarial attacks, particularly jailbreak attacks, which bypass safety protocols and cause the model to generate misleading or harmful responses. This vulnerability stems from both the inherent susceptibilities of LLMs and the expanded attack surface introduced by the visual modality. We propose Sim-CLIP+, a novel defense mechanism that adversarially fine-tunes the CLIP vision encoder by leveraging a Siamese architecture. This approach maximizes cosine similarity between perturbed and clean samples, facilitating resilience against adversarial manipulations. Sim-CLIP+ offers a plug-and-play solution, allowing seamless integration into existing LVLM architectures as a robust vision encoder. Unlike previous defenses, our method requires no structural modifications to the LVLM and incurs minimal computational overhead. Sim-CLIP+ demonstrates effectiveness against both gradient-based adversarial attacks and various jailbreak techniques. We evaluate Sim-CLIP+ against three distinct jailbreak attack strategies and perform clean evaluations using standard downstream datasets, including COCO for image captioning and OKVQA for visual question answering. Extensive experiments demonstrate that Sim-CLIP+ maintains high clean accuracy while substantially improving robustness against both gradient-based adversarial attacks and jailbreak techniques. Our code and robust vision encoders are available at https://github.com/speedlab-git/Robust-Encoder-against-Jailbreak-attack.git.

Summary

The paper presents Sim-CLIP+, a robust adversarially fine-tuned encoder that significantly mitigates jailbreak and adversarial attacks on LVLMs.
It employs a Siamese architecture with a cosine similarity loss and a stop-gradient mechanism, which keeps the loss from collapsing to a trivial solution at low computational cost.
Sim-CLIP+ improves defense performance against jailbreak attacks such as ImgJP, VisualAdv, and HADES, supporting safer multimodal AI applications.

Securing Vision-Language Models Against Attacks

The paper "Securing Vision-Language Models with a Robust Encoder Against Jailbreak and Adversarial Attacks" (2409.07353) proposes Sim-CLIP+, an adversarially fine-tuned robust encoder designed to protect Large Vision-Language Models (LVLMs) against adversarial and jailbreak attacks. This summary focuses on how the method can be implemented, its advantages over existing approaches, and its implications for AI security.

Introduction to Vulnerabilities in LVLMs

LVLMs inherit the susceptibilities of their underlying LLMs and add a further one: the visual input. Jailbreak attacks exploit this expanded attack surface to bypass safety protocols and make the model generate harmful content.

Figure 1: Jailbreak attack on an LVLM: an adversarial image paired with harmful instructions is used as input. The adversarial image bypasses the LVLM's safety guardrails, causing it to generate harmful output.

Theoretical Foundations and Methodology

Sim-CLIP+ is based on adversarially fine-tuning the CLIP encoder within a Siamese architecture to fortify LVLMs. The methodology maximizes cosine similarity between clean and perturbed samples, promoting robustness against adversarial inputs:

$L_{\text{cos}}(R_p, R_c) = -\frac{R_p \cdot R_c}{\|R_p\|_2 \, \|R_c\|_2}$
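The loss above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function name `negative_cosine_similarity`, the `eps` guard against zero-length vectors, and the toy two-dimensional embeddings are all assumptions made for the example. Here `r_p` stands for the representation of the perturbed image and `r_c` for that of the clean image.

```python
import math
from typing import Sequence


def negative_cosine_similarity(
    r_p: Sequence[float], r_c: Sequence[float], eps: float = 1e-8
) -> float:
    """L_cos(R_p, R_c) = -(R_p . R_c) / (||R_p||_2 * ||R_c||_2).

    `eps` is an illustrative guard against division by zero; it is not
    part of the formula in the text.
    """
    if len(r_p) != len(r_c):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(p * c for p, c in zip(r_p, r_c))
    norm_p = math.sqrt(sum(p * p for p in r_p))
    norm_c = math.sqrt(sum(c * c for c in r_c))
    return -dot / max(norm_p * norm_c, eps)


# Toy embeddings (made up for illustration, not taken from the paper).
clean = [1.0, 0.0]
aligned = [2.0, 0.0]    # same direction as `clean`, different magnitude
opposite = [-1.0, 0.0]  # points the opposite way

print(negative_cosine_similarity(aligned, clean))   # -1.0 (lowest possible loss)
print(negative_cosine_similarity(opposite, clean))  # 1.0 (highest possible loss)
```

Because the loss depends only on direction, minimizing it pulls the perturbed embedding toward the clean one regardless of magnitude, which matches the stated goal of maximizing cosine similarity.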

Training a Siamese network on this loss alone risks collapse to a trivial solution, such as producing the same representation for every input. Sim-CLIP+ prevents this by using a symmetric loss with a stop-gradient mechanism, which keeps the computational overhead low.
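For orientation, a symmetric stop-gradient objective of the kind used in SimSiam-style Siamese training can be written as follows. The exact form is an assumption inferred from the description above, not a formula quoted from this summary; the name $L_{\text{sym}}$ is illustrative, and $\text{stopgrad}(\cdot)$ means its argument is treated as a constant during backpropagation.

$L_{\text{sym}}(R_p, R_c) = \frac{1}{2}\left( L_{\text{cos}}(R_p, \text{stopgrad}(R_c)) + L_{\text{cos}}(R_c, \text{stopgrad}(R_p)) \right)$

In each term, only one branch receives gradients while the other acts as a fixed target, so the two representations are pulled together without the optimizer being able to move both at once toward a degenerate constant output.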

Figure 2: Workflow and overview of the proposed Sim-CLIP+: (a) CLIP is adversarially fine-tuned on the ImageNet dataset using the proposed methodology, and (b) the robust Sim-CLIP+ encoder processes adversarial images alongside harmful text prompts, effectively mitigating jailbreak attempts within the LVLM.

Experimental Results and Analysis

Sim-CLIP+ demonstrates significant resilience against various types of jailbreak attacks, including ImgJP, VisualAdv, and HADES, showcasing the encoder's robustness in both gradient-based and generation-based contexts.

ImgJP Attack: Sim-CLIP+ achieves a significantly lower Attack Success Rate (ASR), outperforming models using the original CLIP encoder and other robust encoders like FARE$^4$.
VisualAdv Attack: Sim-CLIP+ provides superior defense without external defenses like JailGuard, reducing average toxicity across multiple evaluated attributes.
HADES Attack: Sim-CLIP+ maintains competitive performance, close to state-of-the-art defenses, against generation-based jailbreak attacks.
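The Attack Success Rate used in these comparisons is the share of attack attempts that succeed in jailbreaking the model. The sketch below shows that arithmetic under stated assumptions: the function name `attack_success_rate` is hypothetical, each attempt is assumed to have already been judged as a boolean, and the verdict lists are invented for illustration and are not results from the paper.

```python
from typing import Sequence


def attack_success_rate(jailbroken: Sequence[bool]) -> float:
    """Return the percentage of attack attempts judged successful."""
    if not jailbroken:
        raise ValueError("need at least one attack attempt")
    return 100.0 * sum(jailbroken) / len(jailbroken)


# Invented verdicts for four attack attempts (True = model was jailbroken).
baseline_verdicts = [True, True, False, True]   # e.g. an undefended encoder
robust_verdicts = [False, True, False, False]   # e.g. a robust encoder

print(attack_success_rate(baseline_verdicts))  # 75.0
print(attack_success_rate(robust_verdicts))    # 25.0
```

A lower ASR indicates a stronger defense. How each attempt is judged successful (for example by a toxicity classifier or human review) is outside this sketch.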

Figure 3: Qualitative examples of jailbreak attacks on LLaVA (Llama-2-13B) models with original CLIP and Sim-CLIP+ as vision encoders. In both cases, LLaVA with CLIP vision encoder is compromised and outputs malicious content, while LLaVA with Sim-CLIP+ remains robust.

Practical Implications and Future Directions

Sim-CLIP+ enhances LVLM security while maintaining clean performance in tasks such as image captioning and Visual Question Answering (VQA). Because it is plug-and-play, it can replace the vision encoder in an existing LVLM without structural modifications or extensive retraining.

In terms of future work, further refinement of Sim-CLIP+ could involve:

Extending defenses to multi-modal architectures integrating additional sensory data.
Exploring automated methods to recognize and mitigate emerging types of adversarial threats.
Enhancing the interpretability of adversarial defenses to foster better trust and reliability in AI deployments.

Conclusion

Sim-CLIP+ significantly mitigates adversarial threats in LVLMs, providing robust, scalable defenses that can be seamlessly integrated into existing architectures. Its deployment holds promise in securing a wide array of applications reliant on safe and reliable multi-modal AI systems.
