VRPTEST: Evaluating Visual Referring Prompting in Large Multimodal Models (2312.04087v1)

Published 7 Dec 2023 in cs.CV and cs.AI

Abstract: With recent advancements in Large Multimodal Models (LMMs) across various domains, a novel prompting method called visual referring prompting has emerged, showing significant potential in enhancing human-computer interaction within multimodal systems. This method offers a more natural and flexible approach to human interaction with these systems compared to traditional text descriptions or coordinates. However, the categorization of visual referring prompting remains undefined, and its impact on the performance of LMMs has yet to be formally examined. In this study, we conduct the first comprehensive analysis of LMMs using a variety of visual referring prompting strategies. We introduce a benchmark dataset called VRPTEST, comprising 3 different visual tasks and 2,275 images, spanning diverse combinations of prompt strategies. Using VRPTEST, we conduct a comprehensive evaluation of eight versions of prominent open-source and proprietary foundation models, including two early versions of GPT-4V. We develop an automated assessment framework based on software metamorphic testing techniques to evaluate the accuracy of LMMs without the need for human intervention or manual labeling. We find that the current proprietary models generally outperform the open-source ones, showing an average accuracy improvement of 22.70%; however, there is still potential for improvement. Moreover, our quantitative analysis shows that the choice of prompt strategy significantly affects the accuracy of LMMs, with variations ranging from -17.5% to +7.3%. Further case studies indicate that an appropriate visual referring prompting strategy can improve LMMs' understanding of context and location information, while an unsuitable one might lead to answer rejection. We also provide insights on minimizing the negative impact of visual referring prompting on LMMs.
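The label-free evaluation described in the abstract rests on metamorphic testing: instead of comparing answers to a gold label, the framework applies a semantics-preserving change to the input and checks that the model's answer stays consistent. The sketch below illustrates that idea in Python; the VisualPrompt structure, the query_model stub, and the marker-swap relation are illustrative assumptions for exposition, not the paper's actual framework or its taxonomy of visual referring prompts.

```python
"""Minimal sketch of metamorphic, label-free evaluation of an LMM.

Assumption: swapping the style of the visual marker (box, circle,
arrow) that points at the same object should not change the answer.
A disagreement between the original and the perturbed prompt is
counted as a metamorphic violation, with no human labeling needed.
"""

import random
from dataclasses import dataclass


@dataclass
class VisualPrompt:
    image_id: str
    marker: str      # visual referring prompt style, e.g. "red_box"
    question: str


EQUIVALENT_MARKERS = ["red_box", "blue_box", "circle", "arrow"]


def perturb(prompt: VisualPrompt) -> VisualPrompt:
    """Semantics-preserving transformation: change the marker style
    while still referring to the same object in the same image."""
    alternatives = [m for m in EQUIVALENT_MARKERS if m != prompt.marker]
    return VisualPrompt(prompt.image_id, random.choice(alternatives),
                        prompt.question)


def query_model(prompt: VisualPrompt) -> str:
    """Placeholder for a real LMM API call (e.g., GPT-4V). This toy
    stub is deliberately sensitive to the marker style so that the
    demo below surfaces at least one violation."""
    if prompt.marker == "arrow":
        return "I cannot determine the referred object."  # simulated rejection
    return f"The marked object in {prompt.image_id} is a cat."


def metamorphic_check(prompt: VisualPrompt) -> bool:
    """True iff the model answers consistently under the relation."""
    return query_model(prompt) == query_model(perturb(prompt))


if __name__ == "__main__":
    random.seed(0)
    tests = [VisualPrompt(f"img_{i:03d}", "red_box", "What is marked?")
             for i in range(5)]
    violations = [t for t in tests if not metamorphic_check(t)]
    print(f"{len(violations)}/{len(tests)} metamorphic violations")
```

The design point is that the oracle is consistency rather than a ground-truth label: any semantics-preserving edit to the referring marker that flips the answer flags a failure. This is what would let such a framework score thousands of images (2,275 in VRPTEST) without manual annotation, and it also surfaces the answer-rejection behavior the abstract mentions for unsuitable prompt strategies.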

Authors (7)
  1. Zongjie Li (29 papers)
  2. Chaozheng Wang (28 papers)
  3. Chaowei Liu (3 papers)
  4. Pingchuan Ma (90 papers)
  5. Daoyuan Wu (39 papers)
  6. Shuai Wang (466 papers)
  7. Cuiyun Gao (97 papers)
Citations (5)