Efficient LLM-Jailbreaking by Introducing Visual Modality (2405.20015v1)
Abstract: This paper focuses on jailbreaking attacks against LLMs, inducing them to generate objectionable content in response to harmful user queries. Unlike previous LLM-jailbreaks that directly target the LLM, our approach begins by constructing a multimodal LLM (MLLM) through the incorporation of a visual module into the target LLM. We then conduct an efficient MLLM-jailbreak to generate jailbreaking embeddings (embJS). Finally, we convert embJS into the text space to facilitate jailbreaking of the target LLM. Compared with direct LLM-jailbreaking, our approach is more efficient, as MLLMs are more vulnerable to jailbreaking than pure LLMs. Additionally, to improve the attack success rate (ASR) of jailbreaking, we propose an image-text semantic matching scheme to identify a suitable initial input. Extensive experiments demonstrate that our approach surpasses current state-of-the-art methods in terms of both efficiency and effectiveness. Moreover, our approach exhibits superior cross-class jailbreaking capabilities.
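The pipeline the abstract describes has three steps: optimize jailbreaking embeddings (embJS) against the MLLM in the continuous visual-embedding space, then project those embeddings back into the discrete text space of the target LLM. Below is a minimal, self-contained sketch of that loop in PyTorch. Everything here is a hypothetical stand-in, not the authors' code: the embedding table and `jailbreak_loss` are toy placeholders (in the paper, the objective would be the likelihood of a target affirmative response under the MLLM), and the text-space conversion is approximated by a nearest-token projection onto the LLM's embedding table.

```python
# Sketch of: (1)+(2) optimize continuous jailbreaking embeddings embJS,
# (3) convert embJS to text space via nearest-token projection.
# All names and the loss are hypothetical stand-ins so the sketch runs.

import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy stand-in for the target LLM's token embedding table.
vocab_size, d_model = 1000, 64
token_embeddings = torch.randn(vocab_size, d_model)

def jailbreak_loss(emb_js: torch.Tensor) -> torch.Tensor:
    """Placeholder for the MLLM-jailbreak objective. The real objective
    would score how strongly the visual embedding elicits the target
    response from the MLLM; here a toy regression target keeps it runnable."""
    target = token_embeddings[:8]  # pretend target prefix embeddings
    return F.mse_loss(emb_js, target)

# Steps 1+2: optimize embJS (here, 8 "visual tokens") by gradient descent.
emb_js = torch.randn(8, d_model, requires_grad=True)
opt = torch.optim.Adam([emb_js], lr=0.1)
for _ in range(200):
    opt.zero_grad()
    loss = jailbreak_loss(emb_js)
    loss.backward()
    opt.step()

# Step 3: convert embJS to text space. Map each continuous embedding to
# its nearest row of the LLM embedding table (cosine similarity); the
# resulting token ids form a textual jailbreaking suffix for the LLM.
sims = F.normalize(emb_js, dim=-1) @ F.normalize(token_embeddings, dim=-1).T
txt_js_ids = sims.argmax(dim=-1)
print("textual jailbreak token ids:", txt_js_ids.tolist())
```

The key design point this illustrates is why the detour through the MLLM helps: the visual embedding space is continuous, so gradient-based optimization is cheap there, and discretization back to tokens is deferred to a single projection step at the end.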
- Zhenxing Niu (21 papers)
- Yuyao Sun (8 papers)
- Haodong Ren (2 papers)
- Haoxuan Ji (2 papers)
- Quan Wang (130 papers)
- Xiaoke Ma (9 papers)
- Gang Hua (101 papers)
- Rong Jin (164 papers)