Adversarial Testing for Visual Grounding via Image-Aware Property Reduction (2403.01118v1)
Abstract: Because it fuses information from multiple modalities, multimodal learning is attracting increasing attention. Visual Grounding (VG), a fundamental multimodal task, aims to locate objects in images described by natural language expressions. Ensuring the quality of VG models is challenging because of the complex nature of the task. In the black-box scenario, existing adversarial testing techniques often fail to exploit the potential of both modalities: they typically perturb only the image or only the text, disregarding the crucial correlation between the two, which leads to broken test oracles or tests that cannot effectively challenge VG models. To this end, we propose PEELING, a text perturbation approach via image-aware property reduction for adversarial testing of VG models. The core idea is to remove property-related information from the original expression while ensuring that the reduced expression still uniquely describes the original object in the image. To achieve this, PEELING first extracts the object and its properties from the expression and recombines them to generate candidate property-reduction expressions. It then selects the expressions that accurately describe the original object while no other object in the image satisfies them, by querying the image with a visual understanding technique. We evaluate PEELING on the state-of-the-art VG model OFA-VG over three commonly used datasets. Results show that the adversarial tests generated by PEELING achieve a 21.4% MultiModal Impact (MMI) score and outperform state-of-the-art image and text baselines by 8.2%--15.1%.
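
The extract-recombine-verify pipeline described in the abstract can be illustrated with a small sketch. This is not the actual PEELING implementation: the property extraction is stubbed out as simple string splitting (a real system would use an information-extraction model or parser), and `vqa_count` is an assumed callback standing in for the visual understanding query that checks whether a reduced expression still matches exactly one object in the image.

```python
from itertools import combinations
from typing import Callable, List, Tuple


def extract_object_and_properties(expression: str) -> Tuple[str, List[str]]:
    """Hypothetical extraction step: split a referring expression into its
    head object and property phrases. Assumes a "property ... object" word
    order for illustration only."""
    *properties, head = expression.split()
    return head, properties


def candidate_reductions(head: str, properties: List[str]) -> List[str]:
    """Recombine the head object with every proper subset of its properties,
    from the fewest properties kept upward, so the sparsest candidate
    expressions are tried first."""
    candidates = []
    for k in range(len(properties)):              # keep 0 .. n-1 properties
        for kept in combinations(properties, k):
            candidates.append(" ".join([*kept, head]))
    return candidates


def select_unique_expressions(
    candidates: List[str],
    image_path: str,
    vqa_count: Callable[[str, str], int],
) -> List[str]:
    """Keep only reduced expressions that still pick out exactly one object
    in the image. `vqa_count` is an assumed interface for a visual
    understanding query such as "How many <expression> are in the image?"."""
    return [c for c in candidates if vqa_count(image_path, c) == 1]


if __name__ == "__main__":
    # Toy example: the original expression carries two properties.
    head, props = extract_object_and_properties("red striped umbrella")

    # Pretend the image contains several umbrellas but only one striped one,
    # so "striped umbrella" stays unique while "umbrella" alone does not.
    def fake_vqa_count(image: str, expr: str) -> int:
        return 1 if "striped" in expr else 3

    reduced = select_unique_expressions(
        candidate_reductions(head, props), "image.jpg", fake_vqa_count
    )
    print(reduced)  # ['striped umbrella']
```

Enumerating the sparsest property subsets first mirrors the stated goal: drop as much property information as possible while the image-aware check guarantees the reduced expression still refers to a single object.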
Authors: Zhiyuan Chang, Mingyang Li, Junjie Wang, Cheng Li, Boyu Wu, Fanjiang Xu, Qing Wang