How to Determine the Preferred Image Distribution of a Black-Box Vision-Language Model? (2409.02253v3)

Published 3 Sep 2024 in cs.CV

Abstract: Large foundation models have revolutionized the field, yet challenges remain in optimizing multi-modal models for specialized visual tasks. We propose a novel, generalizable methodology to identify preferred image distributions for black-box Vision-Language Models (VLMs) by measuring output consistency across varied input prompts. Applying this to different rendering types of 3D objects, we demonstrate its efficacy across various domains requiring precise interpretation of complex structures, with a focus on Computer-Aided Design (CAD) as an exemplar field. We further refine VLM outputs using in-context learning with human feedback, significantly enhancing explanation quality. To address the lack of benchmarks in specialized domains, we introduce CAD-VQA, a new dataset for evaluating VLMs on CAD-related visual question answering tasks. Our evaluation of state-of-the-art VLMs on CAD-VQA establishes baseline performance levels, providing a framework for advancing VLM capabilities in complex visual reasoning tasks across various fields requiring expert-level visual interpretation. We release the dataset and evaluation code at https://github.com/asgsaeid/cad_vqa.
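The core of the proposed methodology is a consistency probe: for each candidate image distribution (for example, different rendering styles of the same 3D object), the black-box VLM is queried with several paraphrased prompts, and the agreement among its answers serves as a proxy for how well the model handles that distribution. The sketch below is a minimal illustration of that idea under stated assumptions, not the authors' released implementation; query_vlm and embed_text are hypothetical placeholders standing in for an arbitrary VLM client and any sentence-embedding model.

# Minimal sketch of consistency-based probing of a black-box VLM.
# Assumptions: `query_vlm` and `embed_text` are hypothetical stubs, not part of
# any specific API; replace them with your VLM client and text-embedding model.

from itertools import combinations
import numpy as np

def query_vlm(image_path: str, prompt: str) -> str:
    """Placeholder: send an image plus a prompt to a black-box VLM and return its answer."""
    raise NotImplementedError("plug in your VLM client here")

def embed_text(text: str) -> np.ndarray:
    """Placeholder: map an answer to a vector with any sentence-embedding model."""
    raise NotImplementedError("plug in your embedding model here")

def consistency_score(image_path: str, prompts: list[str]) -> float:
    """Mean pairwise cosine similarity of the VLM's answers to paraphrased prompts."""
    answers = [query_vlm(image_path, p) for p in prompts]
    vecs = [embed_text(a) for a in answers]
    sims = [
        float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
        for u, v in combinations(vecs, 2)
    ]
    return float(np.mean(sims))

def rank_renderings(renderings: dict[str, list[str]], prompts: list[str]) -> list[tuple[str, float]]:
    """Rank rendering types (e.g., shaded, wireframe, point cloud) by average answer consistency."""
    scores = {
        name: float(np.mean([consistency_score(path, prompts) for path in paths]))
        for name, paths in renderings.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

In this reading, the rendering type whose answers agree most across paraphrased prompts is treated as the model's preferred image distribution; the human-feedback and CAD-VQA components described in the abstract are separate from this probing step.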

