MLLM-CompBench: A Comparative Reasoning Benchmark for Multimodal LLMs (2407.16837v2)
Abstract: The ability to compare objects, scenes, or situations is crucial for effective decision-making and problem-solving in everyday life. For instance, comparing the freshness of apples enables better choices during grocery shopping, while comparing sofa designs helps optimize the aesthetics of our living space. Despite its significance, the comparative capability is largely unexplored in artificial general intelligence (AGI). In this paper, we introduce MLLM-CompBench, a benchmark designed to evaluate the comparative reasoning capability of multimodal LLMs (MLLMs). MLLM-CompBench mines and pairs images through visually oriented questions covering eight dimensions of relative comparison: visual attribute, existence, state, emotion, temporality, spatiality, quantity, and quality. We curate a collection of around 40K image pairs using metadata from diverse vision datasets and CLIP similarity scores. These image pairs span a broad array of visual domains, including animals, fashion, sports, and both outdoor and indoor scenes. The questions are carefully crafted to discern relative characteristics between two images and are labeled by human annotators for accuracy and relevance. We use MLLM-CompBench to evaluate recent MLLMs, including GPT-4V(ision), Gemini-Pro, and LLaVA-1.6. Our results reveal notable shortcomings in their comparative abilities. We believe MLLM-CompBench not only sheds light on these limitations but also establishes a solid foundation for future enhancements in the comparative capability of MLLMs.
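The abstract notes that image pairs are mined using dataset metadata and CLIP similarity scores. As a rough illustration of the CLIP-based pairing step only, the sketch below embeds candidate images and proposes pairs whose cosine similarity exceeds a threshold; the specific checkpoint (`openai/clip-vit-base-patch32`), the threshold value, and the metadata filtering are assumptions for illustration, not details taken from the paper.

```python
# Hypothetical sketch of CLIP-similarity-based image pairing.
# Checkpoint choice and threshold are illustrative assumptions.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(paths):
    """Return L2-normalized CLIP image embeddings for a list of image paths."""
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def propose_pairs(paths, threshold=0.85):
    """Yield (i, j, similarity) for image pairs above the similarity threshold."""
    feats = embed_images(paths)
    sims = feats @ feats.T  # cosine similarity (embeddings are normalized)
    n = len(paths)
    for i in range(n):
        for j in range(i + 1, n):
            if sims[i, j] >= threshold:
                yield i, j, sims[i, j].item()
```

In practice, such candidate pairs would still need the human annotation pass described in the abstract to confirm that each comparative question is accurate and relevant.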