
HEMM: Holistic Evaluation of Multimodal Foundation Models (2407.03418v1)

Published 3 Jul 2024 in cs.LG, cs.AI, cs.CL, and cs.CV

Abstract: Multimodal foundation models that can holistically process text alongside images, video, audio, and other sensory modalities are increasingly used in a variety of real-world applications. However, it is challenging to characterize and study progress in multimodal foundation models, given the range of possible modeling decisions, tasks, and domains. In this paper, we introduce Holistic Evaluation of Multimodal Models (HEMM) to systematically evaluate the capabilities of multimodal foundation models across a set of 3 dimensions: basic skills, information flow, and real-world use cases. Basic multimodal skills are internal abilities required to solve problems, such as learning interactions across modalities, fine-grained alignment, multi-step reasoning, and the ability to handle external knowledge. Information flow studies how multimodal content changes during a task through querying, translation, editing, and fusion. Use cases span domain-specific challenges introduced in real-world multimedia, affective computing, natural sciences, healthcare, and human-computer interaction applications. Through comprehensive experiments across the 30 tasks in HEMM, we (1) identify key dataset dimensions (e.g., basic skills, information flows, and use cases) that pose challenges to today's models, and (2) distill performance trends regarding how different modeling dimensions (e.g., scale, pre-training data, multimodal alignment, pre-training, and instruction tuning objectives) influence performance. Our conclusions regarding challenging multimodal interactions, use cases, and tasks requiring reasoning and external knowledge, the benefits of data and model scale, and the impacts of instruction tuning yield actionable insights for future work in multimodal foundation models.

Holistic Evaluation of Multimodal Models (HEMM)

The proliferation of multimodal foundation models capable of processing heterogeneous data types, such as text, images, video, and audio, necessitates rigorous and comprehensive evaluation standards. The paper "HEMM: Holistic Evaluation of Multimodal Foundation Models" by Liang et al. addresses this need by introducing a structured framework for evaluating these models, moving beyond earlier benchmarks that focused narrowly on specific datasets or tasks.

Evaluation Framework

The HEMM framework evaluates multimodal models along three dimensions: basic multimodal skills, information flow, and real-world use cases. This three-part taxonomy provides a clear structure for analyzing models comprehensively; a schematic tagging of example datasets along these axes is sketched after the list below.

  1. Basic Multimodal Skills: These foundational abilities cover:
    • Multimodal interactions: Redundant, unique, and synergistic interactions between different modalities.
    • Granularity of alignment: Identification and alignment of elements across modalities at varying granularity levels.
    • Reasoning and external knowledge: Skills necessary for more advanced tasks requiring multi-step inference and integration of external domain-specific knowledge.
  2. Multimodal Information Flow: This dimension assesses how information is transformed in the context of tasks:
    • Translation: Mapping data from one modality to another.
    • Editing: Semantic editing of content across modalities.
    • Querying: Answering questions about multimodal inputs.
    • Fusion: Integration of information from multiple modalities to generate insights.
  3. Real-world Use Cases: Covering a breadth of domains such as multimedia, affective computing, healthcare, natural sciences, and human-computer interaction, this dimension evaluates the practical application of these models.
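
To make the taxonomy concrete, the sketch below tags two of the paper's datasets along these three dimensions. This is an illustrative data structure, not HEMM's actual code; the field names and the specific skill labels are assumptions.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DatasetProfile:
    """Illustrative tagging of one HEMM task along the three dimensions.
    Field names and labels are assumptions, not HEMM's actual schema."""
    name: str
    skills: List[str]        # e.g. interaction type, alignment granularity, reasoning, knowledge
    information_flow: str    # one of: querying, translation, editing, fusion
    use_case: str            # e.g. multimedia, affective computing, healthcare, ...

# Hypothetical profiles for two datasets mentioned in the paper
pathvqa = DatasetProfile(
    name="PathVQA",
    skills=["fine-grained alignment", "external knowledge"],
    information_flow="querying",
    use_case="healthcare",
)
memecap = DatasetProfile(
    name="MemeCap",
    skills=["synergistic interactions", "external knowledge", "reasoning"],
    information_flow="translation",
    use_case="multimedia",
)
```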

HEMM Evaluation Protocol

To implement this evaluation, HEMM assembles a collection of 30 datasets, each categorized by the skills, information flow, and use cases it exercises. The datasets span diverse tasks such as visual question answering (VQA), image captioning, medical image analysis, and meme understanding, ensuring that the evaluation suite captures a wide spectrum of real-world challenges.
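
The protocol is simple in structure: each model is prompted on every example and its free-form output is scored against the reference. The sketch below captures that loop under assumed interfaces (`model.generate`, a `score_text` callable, and a `datasets` mapping); none of these names come from the HEMM codebase.

```python
def evaluate(model, datasets, score_text):
    """Zero-shot evaluation loop (hypothetical interfaces).

    datasets:   dict mapping task name -> iterable of (image, prompt, reference) triples
    model:      object exposing generate(image, prompt) -> str
    score_text: callable (prediction, reference) -> float, e.g. a BARTScore wrapper
    """
    results = {}
    for name, examples in datasets.items():
        scores = [score_text(model.generate(image, prompt), reference)
                  for image, prompt, reference in examples]
        results[name] = sum(scores) / len(scores)  # mean raw score per dataset
    return results
```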

A notable feature of HEMM is its use of normalized BARTScore to aggregate performance across datasets. BARTScore has been shown to align well with human judgment, making it a robust metric for text generation tasks.
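
Because raw BARTScore values are log-likelihoods whose scale varies by dataset, some per-dataset normalization is needed before averaging across tasks. The sketch below uses min-max scaling across models within each dataset, which is one plausible scheme; the paper's exact normalization may differ.

```python
def normalize_per_dataset(raw_scores):
    """Min-max scale scores within each dataset so they are comparable across tasks.

    raw_scores: {dataset: {model: bartscore}} -> same shape, scaled to [0, 1]
    """
    normalized = {}
    for dataset, model_scores in raw_scores.items():
        lo, hi = min(model_scores.values()), max(model_scores.values())
        span = (hi - lo) or 1.0  # guard against all models tying on a dataset
        normalized[dataset] = {m: (s - lo) / span for m, s in model_scores.items()}
    return normalized

def aggregate_by_model(normalized):
    """Mean normalized score per model, assuming every model ran on every dataset."""
    models = next(iter(normalized.values())).keys()
    return {m: sum(scores[m] for scores in normalized.values()) / len(normalized)
            for m in models}
```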

Findings and Implications

Through extensive experimentation, the paper presents several key insights:

  1. Challenging Domains: The evaluation highlights that healthcare, natural sciences, and HCI pose significant challenges for current models. For example, datasets like Decimer (chemical structure recognition) and PathVQA (medical image analysis) consistently rank among the hardest, indicating substantial room for improvement in these domains.
  2. Reasoning and Knowledge: Models perform significantly worse on tasks requiring external knowledge and complex reasoning. This is evident in datasets like iNaturalist and MemeCap, where fine-grained identification and understanding of cultural context are essential.
  3. Model Scale and Data: Larger models and more diverse pre-training data notably improve performance, though the gains diminish beyond a certain scale.
  4. Instruction Tuning: Instruction-tuned models perform best, especially on translation tasks that require generating meaningful text from visual input. This suggests that an additional tuning phase aligns model outputs more closely with human expectations.

Future Directions

These findings have several implications for multimodal research. Future work can target the areas HEMM identifies as challenging, particularly healthcare and the natural sciences, to develop more robust and contextually aware models. Improving reasoning and external knowledge integration appears paramount, suggesting the need for richer datasets and pre-training methods that better capture human-like reasoning.

Moreover, while instruction tuning has shown promise, the paper points to the need for broader and more diverse instruction datasets. Fine-tuning on more varied instructions could improve generalizability and adherence to task-specific nuances.

Conclusion

HEMM sets a new standard for the evaluation of multimodal foundation models by focusing on a holistic approach that encompasses fundamental skills, information flow, and real-world applications. Liang et al. provide a comprehensive and structured evaluation framework that not only identifies the current shortcomings of multimodal models but also offers actionable insights for future improvement. As multimodal models become increasingly integral to AI, frameworks like HEMM will be indispensable in guiding their development and deployment across diverse domains.

Authors
  1. Paul Pu Liang
  2. Akshay Goindani
  3. Talha Chafekar
  4. Leena Mathur
  5. Haofei Yu
  6. Ruslan Salakhutdinov
  7. Louis-Philippe Morency