LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models (2306.09265v1)

Published 15 Jun 2023 in cs.CV and cs.AI

Abstract: Large Vision-Language Models (LVLMs) have recently played a dominant role in multimodal vision-language learning. Despite their great success, a holistic evaluation of their efficacy is lacking. This paper presents a comprehensive evaluation of publicly available large multimodal models by building an LVLM evaluation Hub (LVLM-eHub). Our LVLM-eHub consists of 8 representative LVLMs such as InstructBLIP and MiniGPT-4, which are thoroughly evaluated by a quantitative capability evaluation and an online arena platform. The former evaluates 6 categories of multimodal capabilities of LVLMs, such as visual question answering and embodied artificial intelligence, on 47 standard text-related visual benchmarks, while the latter provides a user-level evaluation of LVLMs in an open-world question-answering scenario. The study reveals several innovative findings. First, instruction-tuned LVLMs with massive in-domain data, such as InstructBLIP, heavily overfit many existing tasks and generalize poorly in the open-world scenario. Second, instruction-tuned LVLMs with moderate instruction-following data may suffer from object hallucination (i.e., generating objects in the descriptions that are inconsistent with the target images), which either renders current evaluation metrics such as CIDEr for image captioning ineffective or produces wrong answers. Third, employing a multi-turn reasoning evaluation framework can mitigate object hallucination, shedding light on developing an effective pipeline for LVLM evaluation. These findings provide a foundational framework for the conception and assessment of innovative strategies aimed at enhancing zero-shot multimodal techniques. Our LVLM-eHub will be available at https://github.com/OpenGVLab/Multi-Modality-Arena

LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models

The paper introduces LVLM-eHub, a robust benchmarking framework designed to systematically evaluate Large Vision-Language Models (LVLMs). LVLMs have made significant progress in integrating visual and textual data across diverse multimodal tasks, yet a comprehensive evaluation of their full range of capabilities has been lacking. This paper addresses that gap with LVLM-eHub, which combines quantitative performance measurement with qualitative human feedback.

The LVLM-eHub evaluates eight representative models, such as InstructBLIP and MiniGPT-4, focusing on six categories of capabilities: visual perception, visual knowledge acquisition, visual reasoning, visual commonsense, object hallucination, and embodied intelligence. Evaluation is performed across 47 text-related visual benchmarks, offering a multifaceted understanding of LVLMs' strengths and challenges.
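
To make the protocol concrete, the following is a minimal sketch of how such a zero-shot capability evaluation is typically organized: each model answers each benchmark's prompts without task-specific fine-tuning and is scored with that benchmark's metric. The model wrapper, the benchmark loader, and the `generate` interface are illustrative assumptions, not LVLM-eHub's actual code.

```python
# Minimal sketch of a zero-shot capability-evaluation loop (illustrative only;
# model wrappers, benchmark loaders, and metric functions are assumed).

def evaluate_zero_shot(models, benchmarks):
    """Run every model on every benchmark and average each benchmark's metric."""
    results = {}
    for model_name, model in models.items():
        for bench_name, bench in benchmarks.items():
            scores = []
            for sample in bench["data"]:
                # Each sample pairs an image with a text prompt (question or instruction).
                prediction = model.generate(image=sample["image"], prompt=sample["prompt"])
                scores.append(bench["metric"](prediction, sample["answer"]))
            results[(model_name, bench_name)] = sum(scores) / len(scores)
    return results
```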

Key Findings

  1. Visual Perception: LVLMs were assessed on tasks such as image classification, object counting, and multi-class identification. Results indicate that models like InstructBLIP, which have undergone extensive fine-tuning on domain-specific data, excel in these tasks, although they risk overfitting.
  2. Visual Knowledge Acquisition: In tasks like OCR and image captioning, models utilizing large visual encoders and substantial instruction-tuning data, such as InstructBLIP, achieved superior performance, highlighting the impact of robust visual-textual alignment.
  3. Visual Reasoning and Commonsense: For reasoning tasks, instruction-tuned models demonstrated success with multi-turn reasoning frameworks, underscoring the importance of effective evaluation schemes to reduce object hallucination.
  4. Object Hallucination: The paper identifies a tendency among LVLMs to generate descriptions that mention objects inconsistent with the target images. Standard metrics such as CIDEr may fail to penalize these outputs, highlighting the need for improved evaluation methodologies.
  5. Embodied Intelligence: The evaluation covered embodied tasks requiring interactive engagement with an environment. Models like LLaMA-Adapter V2 outperformed others, owing to comprehensive vision-language instruction tuning.
  6. Open-world Evaluation: The LVLM Arena component of LVLM-eHub enables human-feedback-driven evaluation, capturing LVLMs' performance in real-world scenarios. Models with extensive instruction-following data, such as mPLUG-Owl, ranked highly under this criterion; a sketch of the pairwise-vote rating scheme commonly used by such arenas follows this list.
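
Arena-style evaluations of this kind usually convert pairwise human votes into a leaderboard with an Elo-style rating. The exact scoring rule used by the LVLM Arena is not specified in this summary, so the sketch below, including the K-factor and the initial rating of 1000, is an illustrative assumption.

```python
# Elo-style rating update from pairwise human votes, as commonly used by
# arena-style leaderboards. K-factor and starting ratings are illustrative
# choices, not values reported by the paper.

def elo_update(ratings, winner, loser, k=32):
    """Update two models' ratings after a user prefers `winner` over `loser`."""
    expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
    ratings[winner] += k * (1.0 - expected_win)
    ratings[loser] -= k * (1.0 - expected_win)
    return ratings

ratings = {"model_a": 1000.0, "model_b": 1000.0}
# Three user votes: model_a preferred twice, model_b once.
for w, l in [("model_a", "model_b"), ("model_a", "model_b"), ("model_b", "model_a")]:
    ratings = elo_update(ratings, w, l)
print(ratings)
```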

Implications and Future Directions

The LVLM-eHub framework provides a foundational platform for comparing LVLMs, offering insights that can guide their development. The findings emphasize the vital role of diverse training data and refined instruction tuning in improving LVLMs' adaptability and generalization. The paper also challenges traditional evaluation metrics such as CIDEr, advocating for more nuanced evaluation strategies.
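
One common complement to caption-overlap metrics, in the spirit of the object-hallucination probes the paper discusses, is a CHAIR-style check that compares the objects a caption mentions against ground-truth object annotations. The sketch below uses deliberately naive keyword matching for illustration; real implementations rely on synonym lists and lemmatization, and this is not the paper's own metric.

```python
# Hedged sketch of a CHAIR-style hallucination check: the fraction of objects a
# caption mentions that are absent from the image's annotated objects.

def hallucination_rate(caption: str, annotated_objects: set, vocabulary: set) -> float:
    """Fraction of mentioned vocabulary objects not present in the annotations."""
    words = set(caption.lower().split())
    mentioned = words & vocabulary            # objects the caption claims to see
    if not mentioned:
        return 0.0
    hallucinated = mentioned - annotated_objects
    return len(hallucinated) / len(mentioned)

# Example: a caption that invents a "dog" for an image annotated with {cat, sofa}.
rate = hallucination_rate(
    "A cat and a dog sit on the sofa",
    annotated_objects={"cat", "sofa"},
    vocabulary={"cat", "dog", "sofa", "person"},
)
print(rate)  # one of the three mentioned objects is hallucinated -> 0.33...
```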

Looking ahead, the paper posits that innovations in multi-turn reasoning techniques and more sophisticated human-centered evaluations can further elucidate LVLMs' capabilities, particularly in open-ended tasks; the general idea behind multi-turn verification is sketched below. Expanding the scope of LVLM-eHub with newer models and tasks will likewise progressively improve our understanding and benchmarking of LVLM efficacy.
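
The paper's exact multi-turn evaluation pipeline is not detailed in this summary, but the general idea can be sketched as follows: rather than trusting a single caption, the evaluator re-queries the model about each object it claims to see and keeps only the objects it consistently confirms. The `lvlm.ask` chat interface here is an assumed placeholder, not the paper's framework.

```python
# Hedged sketch of a multi-turn verification loop. `lvlm.ask` is an assumed
# chat interface that returns a text answer for an (image, prompt) pair.

def verify_caption_objects(lvlm, image, mentioned_objects, rounds=2):
    """Return only the objects the model confirms across repeated follow-up turns."""
    confirmed = []
    for obj in mentioned_objects:
        answers = [
            lvlm.ask(image=image, prompt=f"Is there a {obj} in the image? Answer yes or no.")
            for _ in range(rounds)
        ]
        if all(a.strip().lower().startswith("yes") for a in answers):
            confirmed.append(obj)
    return confirmed
```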

In conclusion, LVLM-eHub represents a significant step toward comprehensively evaluating the rapidly evolving LVLM landscape. By integrating robust metric-driven assessments with qualitative evaluations, it provides an invaluable resource for researchers aiming to enhance multimodal machine learning technologies.

Authors (10)
  1. Peng Xu
  2. Wenqi Shao
  3. Kaipeng Zhang
  4. Peng Gao
  5. Shuo Liu
  6. Meng Lei
  7. Fanqing Meng
  8. Siyuan Huang
  9. Yu Qiao
  10. Ping Luo