A Cognitive Evaluation Benchmark of Image Reasoning and Description for Large Vision-Language Models (2402.18409v4)
Published 28 Feb 2024 in cs.AI, cs.CL, and cs.CV
Abstract: Large Vision-Language Models (LVLMs), despite their recent success, have rarely been comprehensively tested for their cognitive abilities. Inspired by the prevalent use of the Cookie Theft task in human cognitive tests, we propose a novel evaluation benchmark for assessing the high-level cognitive abilities of LVLMs using images with rich semantics. The benchmark consists of 251 images along with comprehensive annotations. It defines eight reasoning capabilities and comprises an image description task and a visual question answering task. Our evaluation of well-known LVLMs shows that there is still a significant gap in cognitive abilities between LVLMs and humans.
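To make the two-task setup concrete, below is a minimal Python sketch of how one might organize a benchmark entry and its evaluation loop under the setup the abstract describes (251 annotated images, an image description task, and a VQA task). The field names, the capability placeholders, and the `model.describe` / `model.answer` methods are illustrative assumptions, not the paper's actual data schema or evaluation code.

```python
from dataclasses import dataclass, field

# Placeholder names for the eight reasoning capabilities the benchmark defines;
# the actual capability names come from the paper and are not reproduced here.
REASONING_CAPABILITIES = [f"capability_{i}" for i in range(1, 9)]

@dataclass
class BenchmarkEntry:
    image_path: str                    # one of the 251 semantically rich images
    reference_description: str         # human-written description from the annotations
    qa_pairs: list[tuple[str, str]] = field(default_factory=list)  # (question, gold answer)

def evaluate(model, entries: list[BenchmarkEntry]) -> dict:
    """Run both tasks and collect (prediction, reference) pairs for later scoring.

    `model` is any object exposing hypothetical describe() and answer() methods;
    scoring (e.g., with reference-based text metrics) would happen downstream.
    """
    results = {"description": [], "vqa": []}
    for entry in entries:
        # Task 1: image description against the annotated reference.
        results["description"].append(
            (model.describe(entry.image_path), entry.reference_description)
        )
        # Task 2: visual question answering over the same image.
        for question, gold in entry.qa_pairs:
            results["vqa"].append((model.answer(entry.image_path, question), gold))
    return results
```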