Improving Adversarial Transferability of Vision-Language Pre-training Models through Collaborative Multimodal Interaction (2403.10883v2)
Abstract: Despite the substantial advancements in Vision-Language Pre-training (VLP) models, their susceptibility to adversarial attacks poses a significant challenge. Existing work rarely studies the transferability of attacks on VLP models, resulting in a substantial performance gap from white-box attacks. We observe that prior work overlooks the interaction mechanisms between modalities, which play a crucial role in understanding the intricacies of VLP models. In response, we propose a novel attack, called Collaborative Multimodal Interaction Attack (CMI-Attack), that leverages modality interaction through embedding guidance and interaction enhancement. Specifically, CMI-Attack perturbs text at the embedding level while preserving semantics, and uses interaction image gradients to strengthen the constraints on perturbations of texts and images. Notably, in the image-text retrieval task on the Flickr30K dataset, CMI-Attack raises the transfer success rates from ALBEF to TCL, $\text{CLIP}_{\text{ViT}}$, and $\text{CLIP}_{\text{CNN}}$ by 8.11%-16.75% over state-of-the-art methods. Moreover, CMI-Attack also demonstrates superior performance in cross-task generalization scenarios. Our work addresses the underexplored realm of transfer attacks on VLP models, shedding light on the importance of modality interaction for enhanced adversarial robustness.
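To make the attack setup concrete, below is a minimal sketch of the PGD-style image perturbation that such multimodal transfer attacks build on: it lowers the image-text similarity of a surrogate encoder within an $L_\infty$ ball, so the perturbed image no longer matches its captions. The stand-in encoders, the `similarity_loss` helper, and all hyperparameters are hypothetical placeholders chosen to keep the example self-contained; the full CMI-Attack additionally attacks text at the embedding level and couples the two modalities through interaction gradients, which this sketch does not implement.

```python
# Minimal sketch (not the authors' released code) of a PGD-style image attack
# driven by image-text similarity, as used in VLP transfer-attack setups.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-in encoder; a real attack would use an ALBEF/CLIP surrogate.
image_encoder = torch.nn.Sequential(
    torch.nn.Flatten(),
    torch.nn.Linear(3 * 32 * 32, 64),
)
text_embeds = torch.randn(5, 64)  # placeholder embeddings of 5 matching captions

def similarity_loss(images: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
    """Mean cosine similarity between image features and caption embeddings."""
    img_feat = F.normalize(image_encoder(images), dim=-1)
    txt_feat = F.normalize(text_embeds, dim=-1)
    return (img_feat @ txt_feat.t()).mean()

def pgd_image_attack(images, text_embeds, eps=8 / 255, alpha=2 / 255, steps=10):
    """Untargeted PGD that decreases image-text similarity inside an
    L_inf ball of radius eps around the clean images."""
    adv = images.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = similarity_loss(adv, text_embeds)
        grad, = torch.autograd.grad(loss, adv)
        with torch.no_grad():
            adv = adv - alpha * grad.sign()                 # push similarity down
            adv = images + (adv - images).clamp(-eps, eps)  # project onto eps-ball
            adv = adv.clamp(0.0, 1.0)                       # keep a valid image
    return adv.detach()

images = torch.rand(2, 3, 32, 32)
adv_images = pgd_image_attack(images, text_embeds)
print("similarity before:", similarity_loss(images, text_embeds).item())
print("similarity after: ", similarity_loss(adv_images, text_embeds).item())
```

In the paper's setting, the gradient step would come from a joint loss over both modalities rather than the image branch alone; the sign step and epsilon-ball projection shown here are the standard $L_\infty$ constraints that transfer attacks on VLP models inherit from PGD.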
- Jiyuan Fu
- Zhaoyu Chen
- Kaixun Jiang
- Haijing Guo
- Jiafeng Wang
- Shuyong Gao
- Wenqiang Zhang