Iterative Adversarial Attack on Image-guided Story Ending Generation (2305.13208v2)
Abstract: Multimodal learning develops models that integrate information from multiple sources, such as images and text. Within this field, multimodal text generation is a crucial task that consumes data from several modalities and outputs text. Image-guided story ending generation (IgSEG) is a particularly significant instance: given a story context and an ending-related image, the model must capture the complex relationships between the text and the image and generate a complete story ending. Unfortunately, the deep neural networks that form the backbone of recent IgSEG models are vulnerable to adversarial examples. Existing adversarial attack methods focus mainly on single-modality data and do not analyze attacks on multimodal text generation tasks that exploit cross-modal information. To this end, we propose an iterative adversarial attack method (Iterative-attack) that fuses image- and text-modality attacks, searching for adversarial text and images in a more effective, iterative way. Experimental results demonstrate that the proposed method outperforms existing single-modal and non-iterative multimodal attack methods, indicating its potential for improving the adversarial robustness of multimodal text generation models, such as those for multimodal machine translation and multimodal question answering.
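To make the alternating-attack idea concrete, below is a minimal, self-contained PyTorch sketch of an iterative multimodal attack loop. It is not the authors' implementation: the model (`ToyIgSEGModel`), the loss (`generation_loss`), and the text step (random positions and random candidate words rather than a saliency-guided substitution search) are hypothetical simplifications, and the image step is a standard PGD-style signed-gradient update.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyIgSEGModel(nn.Module):
    """Hypothetical stand-in for an IgSEG model: scores candidate ending tokens
    from (story tokens, image). It only exists to make the attack loop runnable."""

    def __init__(self, vocab_size=100, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.img_proj = nn.Linear(3 * 8 * 8, dim)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, tokens, image):
        txt = self.embed(tokens).mean(dim=1)    # (B, dim) pooled story representation
        img = self.img_proj(image.flatten(1))   # (B, dim) flattened image features
        return self.out(txt + img)              # (B, vocab) logits for the ending token


def generation_loss(model, tokens, image, target):
    """Cross-entropy of the toy model's ending prediction against the reference ending."""
    return F.cross_entropy(model(tokens, image), target)


def iterative_attack(model, tokens, image, target, steps=10, eps=8 / 255, alpha=2 / 255):
    """Alternate an image perturbation step and a text substitution step,
    keeping changes that increase the generation loss."""
    adv_tokens, adv_image = tokens.clone(), image.clone()
    vocab_size = model.embed.num_embeddings

    for _ in range(steps):
        # Image step: one PGD-style signed-gradient ascent step, clipped to an eps-ball.
        adv_image.requires_grad_(True)
        loss = generation_loss(model, adv_tokens, adv_image, target)
        grad = torch.autograd.grad(loss, adv_image)[0]
        with torch.no_grad():
            adv_image = adv_image + alpha * grad.sign()
            adv_image = (image + (adv_image - image).clamp(-eps, eps)).clamp(0, 1)

        # Text step: try a handful of single-token substitutions at a random position
        # (the paper uses a guided search); keep the swap that most increases the loss.
        with torch.no_grad():
            best_loss = generation_loss(model, adv_tokens, adv_image, target).item()
            best_swap = None
            pos = torch.randint(adv_tokens.size(1), (1,)).item()
            for cand in torch.randint(vocab_size, (20,)).tolist():
                trial = adv_tokens.clone()
                trial[0, pos] = cand
                trial_loss = generation_loss(model, trial, adv_image, target).item()
                if trial_loss > best_loss:
                    best_loss, best_swap = trial_loss, (pos, cand)
            if best_swap is not None:
                adv_tokens[0, best_swap[0]] = best_swap[1]

    return adv_tokens, adv_image.detach()


if __name__ == "__main__":
    torch.manual_seed(0)
    model = ToyIgSEGModel()
    tokens = torch.randint(100, (1, 12))   # story-context token ids
    image = torch.rand(1, 3, 8, 8)         # ending-related image in [0, 1]
    target = torch.randint(100, (1,))      # reference ending token id
    adv_tokens, adv_image = iterative_attack(model, tokens, image, target)
    print("loss before:", generation_loss(model, tokens, image, target).item())
    print("loss after :", generation_loss(model, adv_tokens, adv_image, target).item())
```

The part that corresponds to the iterative fusion described in the abstract is the loop structure itself, re-evaluating the generation loss after each modality-specific step so that text and image perturbations reinforce each other; the model, loss, and search heuristics above are placeholders under those assumptions.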