An Examination of MMDU: A Benchmark and Dataset for Multi-Turn Multi-Image Dialog Understanding in LVLMs
The paper under discussion introduces MMDU, a benchmark tailored for evaluating the proficiency of Large Vision-Language Models (LVLMs) in handling multi-turn dialogs that involve multiple images. By mirroring complex real-world application scenarios more closely than existing benchmarks, MMDU addresses substantial gaps in the current evaluation landscape for LVLMs.
The paper highlights that current open-source LVLMs predominantly focus on single-turn, single-image interactions, which fall short of the intricacies of practical human-AI communication. Notably, MMDU is structured to evaluate long-context interactions involving up to 20 images and 27 dialog turns, requiring LVLMs to sustain conversational coherence over extensive input sequences. The benchmark significantly surpasses its predecessors in length and complexity, testing models along six dimensions: creativity, richness, visual perception, logical coherence, answer accuracy, and understanding of image relationships.
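To make this evaluation setup concrete, the following is a minimal sketch of how an MMDU-style sample and its per-dimension scoring might be represented. The dataclass names, the 0-10 judge scale, and the simple per-turn averaging are illustrative assumptions for this sketch, not the paper's exact protocol.

```python
from dataclasses import dataclass, field
from statistics import mean

# Hypothetical representation of one MMDU-style sample: a multi-image,
# multi-turn dialog with a reference answer for each turn.
@dataclass
class DialogTurn:
    question: str
    reference_answer: str
    model_answer: str = ""

@dataclass
class MMDUSample:
    image_paths: list[str]                      # up to ~20 images per dialog
    turns: list[DialogTurn] = field(default_factory=list)

# The six dimensions MMDU evaluates; the 0-10 scale and the averaging below
# are assumptions made for illustration.
DIMENSIONS = [
    "creativity", "richness", "visual_perception",
    "logical_coherence", "answer_accuracy", "image_relationship_understanding",
]

def aggregate_scores(per_turn_scores: list[dict[str, float]]) -> dict[str, float]:
    """Average judge scores per dimension across turns, plus an overall mean."""
    result = {d: mean(s[d] for s in per_turn_scores) for d in DIMENSIONS}
    result["overall"] = mean(result.values())
    return result
```

In the paper's setup, an LLM judge (GPT-4o) assigns such per-dimension scores by comparing each model answer against a curated reference answer.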
Additionally, the paper introduces MMDU-45k, an extensive dataset that serves as an instruction-tuning resource for LVLMs. Dataset generation involves clustering techniques applied to Wikipedia data, ensuring the collection of related multi-image inputs that form the basis for multi-turn questions. The authors employ human annotators, assisted by GPT-4o, to generate and refine these questions and their corresponding answers, following a careful curation process that mitigates hallucinations and errors.
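As a rough illustration of the first stage of such a pipeline, the sketch below groups Wikipedia entries by topical similarity so that images drawn from one cluster can seed a coherent multi-image dialog. The choice of embedding model, the use of k-means, and the entry schema are assumptions made for this example rather than details taken from the paper.

```python
# Illustrative sketch: cluster Wikipedia entries so related images can seed
# one multi-image dialog. Embedding model, clustering method, and the entry
# schema {"title", "text", "image_urls"} are assumptions for illustration.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def cluster_wiki_entries(entries: list[dict], n_clusters: int = 100) -> dict[int, list[dict]]:
    """Group entries by topical similarity of their title and text."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
    embeddings = model.encode([e["title"] + " " + e["text"] for e in entries])
    labels = KMeans(n_clusters=n_clusters, n_init="auto").fit_predict(embeddings)

    clusters: dict[int, list[dict]] = {}
    for entry, label in zip(entries, labels):
        clusters.setdefault(int(label), []).append(entry)
    return clusters
```

Grouping by topic keeps the images within one dialog mutually relevant, which is what allows later turns to probe relationships across images rather than treating each image in isolation.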
The experimental evaluation across a range of LVLMs offers an intriguing insight into the current landscape: open-source models lag significantly behind proprietary counterparts like GPT-4o, primarily due to limited access to rich instruction-tuning data. Fine-tuning on MMDU-45k yields notable improvements across multiple existing benchmarks beyond MMDU itself; for instance, the paper reports gains of up to 1.5% on MathVista and 1.2% on ChartQA, showcasing the dataset's potential for enhancing model capabilities.
The authors speculate that the disparity in performance between open and closed models could be reduced by leveraging comprehensive datasets like MMDU-45k in training open-source models, thereby enhancing their capabilities in understanding and generating responses in contextually rich and complex dialog scenarios.
The implications of this research are noteworthy. MMDU sets a new standard for evaluating LVLMs in scenarios that require processing of detailed, prolonged multi-modal dialogs. Furthermore, the structured format of MMDU-45k paves the way for future developments in AI, providing a valuable resource that can be expanded upon to further bridge the gap between theoretical model capabilities and the demands of practical real-world applications.
However, while the paper takes considerable steps toward addressing existing evaluation deficiencies, it acknowledges constraints such as its English-only coverage, with no multilingual support. The paper also notes that LVLMs fine-tuned on MMDU-45k may propagate biases inherent in the underlying data, urging caution and further research into more equitable and inclusive AI systems.
Overall, the paper offers a robust contribution to the field of AI and LVLMs, providing comprehensive resources and insights that support advancements towards more sophisticated AI interactions.