AutoBench-V: Can Large Vision-Language Models Benchmark Themselves? (2410.21259v4)
Abstract: Large Vision-Language Models (LVLMs) have become essential for advancing the integration of visual and linguistic information. However, evaluating LVLMs is challenging: constructing an evaluation benchmark typically demands substantial human effort, and once built, a benchmark remains static and inflexible. Although automatic evaluation has been explored for the textual modality, the visual modality remains under-explored. In this work, we therefore address the question: "Can LVLMs be used to benchmark each other automatically in the visual domain?" We introduce AutoBench-V, an automated framework for evaluation on demand, i.e., benchmarking LVLMs on user-specified aspects of model capability. AutoBench-V leverages text-to-image models to generate relevant image samples and then uses LVLMs to orchestrate visual question-answering (VQA) tasks, completing the evaluation process efficiently and flexibly. Through an extensive evaluation of nine popular LVLMs across five user-specified inputs (i.e., evaluation capabilities), the framework demonstrates effectiveness and reliability.
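The evaluation loop the abstract describes can be sketched at a high level. Everything below — function names, data shapes, and the stub "model" — is a hypothetical illustration of the on-demand pipeline (capability input → image generation → LVLM-orchestrated VQA → scoring), not the authors' actual implementation.

```python
# Hypothetical sketch of an AutoBench-V-style evaluation loop.
# All names and structures here are illustrative assumptions.

def generate_aspects(capability):
    # An examiner LVLM would expand a user-specified capability
    # (e.g. "spatial understanding") into concrete test aspects.
    return [f"{capability}:aspect_{i}" for i in range(3)]

def text_to_image(aspect):
    # A text-to-image model would render an image for the aspect;
    # here we return a placeholder token standing in for the image.
    return f"image_for({aspect})"

def make_vqa_item(aspect, image):
    # The examiner LVLM would write a question and reference answer
    # grounded in the generated image.
    return {"image": image, "question": f"Q about {aspect}", "answer": "ref"}

def evaluate_model(model_answer_fn, capability):
    # Build the on-demand benchmark, query the model under test,
    # and score its answers against the references.
    items = [make_vqa_item(a, text_to_image(a)) for a in generate_aspects(capability)]
    correct = sum(
        model_answer_fn(it["image"], it["question"]) == it["answer"] for it in items
    )
    return correct / len(items)  # accuracy on this capability

# A trivial stand-in "model" that always returns the reference answer:
score = evaluate_model(lambda img, q: "ref", "spatial understanding")
```

In the real framework each stub would be a call to a generative model (the examiner LVLM, a text-to-image model, and the LVLM under test); the key point is that the benchmark is constructed per request rather than fixed in advance.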