Unveiling Typographic Deceptions: Insights of the Typographic Vulnerability in Large Vision-Language Model (2402.19150v3)
Abstract: Large Vision-Language Models (LVLMs) combine vision encoders with Large Language Models (LLMs) to achieve remarkable capabilities across a wide range of multi-modal tasks in the joint vision-language space. However, typographic attacks, which are known to disrupt Vision-Language Models (VLMs) such as Contrastive Language-Image Pretraining (CLIP), are also expected to pose a security threat to LVLMs. First, we verify typographic attacks on current well-known commercial and open-source LVLMs and show that this threat is widespread. Second, to better assess this vulnerability, we propose the most comprehensive and largest-scale Typographic Dataset to date. The dataset not only evaluates typographic attacks across various multi-modal tasks but also measures how the attacks are influenced by text rendered with diverse factors. Based on the evaluation results, we investigate why typographic attacks affect VLMs and LVLMs, leading to three highly insightful discoveries. By further validating these discoveries, we reduce the performance degradation caused by typographic attacks from 42.07% to 13.90%. Code and dataset are available at https://github.com/ChaduCheng/TypoDeceptions
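To make the attack setting concrete, below is a minimal sketch (not from the paper) of a typographic attack against CLIP's zero-shot classification: a misleading class name is rendered onto the image, and the prediction is compared before and after. It assumes the HuggingFace transformers CLIP interface; the file name dog.jpg, the label set, and the text position are illustrative placeholders.

```python
from PIL import Image, ImageDraw
import torch
from transformers import CLIPModel, CLIPProcessor

# Publicly available CLIP checkpoint; the paper's exact models may differ.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def add_typographic_text(image: Image.Image, text: str, position=(10, 10)) -> Image.Image:
    """Overlay a misleading word on the image (a basic typographic attack)."""
    attacked = image.copy()
    draw = ImageDraw.Draw(attacked)
    draw.text(position, text, fill="white")
    return attacked

def clip_zero_shot(image: Image.Image, labels):
    """Return CLIP's zero-shot probabilities over the candidate labels."""
    inputs = processor(text=[f"a photo of a {label}" for label in labels],
                       images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image
    return logits.softmax(dim=-1).squeeze().tolist()

labels = ["dog", "cat"]
clean = Image.open("dog.jpg")                   # hypothetical image of a dog
attacked = add_typographic_text(clean, "cat")   # superimpose the wrong class name
print("clean:   ", clip_zero_shot(clean, labels))
print("attacked:", clip_zero_shot(attacked, labels))
```

In this setup, the rendered word alone is often enough to shift probability mass toward the wrong class, which is the vulnerability the paper's Typographic Dataset is designed to measure at scale across tasks and text-rendering factors.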
Authors: Hao Cheng, Erjia Xiao, Renjing Xu, Jindong Gu, Le Yang, Jinhao Duan, Jize Zhang, Jiahang Cao, Kaidi Xu