
How do LLMs Support Deep Learning Testing? A Comprehensive Study Through the Lens of Image Mutation (2404.13945v2)

Published 22 Apr 2024 in cs.SE

Abstract: Visual deep learning (VDL) systems have shown significant success in real-world applications such as image recognition, object detection, and autonomous driving. To evaluate the reliability of VDL, a mainstream approach is software testing, which requires diverse and controllable mutations over image semantics. The rapid development of multi-modal LLMs (MLLMs) has introduced revolutionary potential for image mutation through instruction-driven methods: users can freely describe desired mutations and let MLLMs generate the mutated images. However, the quality of MLLM-produced test inputs in VDL testing remains largely unexplored. We present the first study assessing MLLMs' adequacy along four dimensions: 1) the semantic validity of MLLM-mutated images, 2) the alignment of MLLM-mutated images with their text instructions (prompts), 3) the faithfulness with which different mutations preserve semantics that ought to remain unchanged, and 4) the effectiveness of detecting VDL faults. Through large-scale human studies and quantitative evaluations, we identify MLLMs' promising potential to expand the semantics covered by image mutations. Notably, while SoTA MLLMs (e.g., GPT-4V) fail to support, or perform poorly at, editing existing semantics in images (as in traditional mutations like rotation), they generate high-quality test inputs using "semantic-additive" mutations (e.g., "dress a dog with clothes"), which bring extra semantics to images; such mutations were infeasible for past approaches. Hence, we view MLLM-based mutations as a vital complement to traditional mutations, and advocate that future VDL testing combine MLLM-based methods with traditional image mutations for comprehensive and reliable testing.
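The fault-detection setting the abstract describes follows a metamorphic-testing pattern: apply a mutation that should preserve (or only add to) an image's semantics, then flag a fault when the VDL model's prediction changes. A minimal, hypothetical sketch of that loop is below; the `classify` function and the mutations are toy stand-ins (an "image" is modeled as a dict of semantic attributes), not the paper's actual pipeline.

```python
# Hypothetical sketch of metamorphic fault detection for VDL testing.
# An "image" is a dict of semantic attributes; `classify` stands in for
# a VDL model, and each mutation stands in for an MLLM-generated edit.

def detect_faults(classify, mutations, images):
    """Return (image, mutation_name) pairs whose mutation flipped the prediction."""
    faults = []
    for img in images:
        baseline = classify(img)
        for name, mutate in mutations.items():
            if classify(mutate(img)) != baseline:
                faults.append((img, name))
    return faults

# Toy classifier: keys only on the `subject` attribute.
def classify(img):
    return img["subject"]

mutations = {
    # Semantic-additive mutation ("dress a dog with clothes"): adds an
    # attribute, so the subject -- and the prediction -- should not change.
    "add_clothes": lambda img: {**img, "clothing": "coat"},
    # A faulty edit that alters the semantics that ought to be preserved.
    "buggy_edit": lambda img: {**img, "subject": "cat"},
}

print(detect_faults(classify, mutations, [{"subject": "dog"}]))
# -> [({'subject': 'dog'}, 'buggy_edit')]
```

In the paper's setting, a prediction flip under a semantics-preserving mutation may indicate either a VDL fault or an invalid mutation, which is why the study also measures validity, alignment, and faithfulness of the mutated images themselves.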

Authors (7)
  1. Liwen Wang (18 papers)
  2. Yuanyuan Yuan (15 papers)
  3. Ao Sun (53 papers)
  4. Zongjie Li (29 papers)
  5. Pingchuan Ma (91 papers)
  6. Daoyuan Wu (39 papers)
  7. Shuai Wang (466 papers)
