An Examination of MMDU: A Benchmark and Dataset for Multi-Turn Multi-Image Dialog Understanding in LVLMs
The paper under discussion introduces MMDU, a benchmark tailored for evaluating the proficiency of Large Vision-Language Models (LVLMs) in handling multi-turn dialogs that involve multiple images. By mirroring complex real-world application scenarios more closely than existing benchmarks, MMDU addresses substantial gaps in the current evaluation landscape for LVLMs.
The paper highlights that current open-source LVLMs predominantly focus on single-turn, single-image interactions, which fall short of the intricacies of practical human-AI communication. Notably, MMDU is structured to evaluate long-context interactions involving up to 20 images and 27 dialog turns, requiring LVLMs to sustain conversational coherence over extensive input sequences. The benchmark significantly surpasses its predecessors in length and complexity, testing models along six dimensions: creativity, richness, visual perception, logical coherence, answer accuracy, and understanding of image relationships.
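To make this evaluation setup concrete, the following is a minimal sketch of how an MMDU-style sample and its per-dimension scoring might be represented. The dataclass names, the 0-10 judge scale, and the simple per-turn averaging are illustrative assumptions for this sketch, not the paper's exact protocol.

```python
from dataclasses import dataclass, field
from statistics import mean

# Hypothetical representation of one MMDU-style sample: a multi-image,
# multi-turn dialog with a reference answer for each turn.
@dataclass
class DialogTurn:
    question: str
    reference_answer: str
    model_answer: str = ""

@dataclass
class MMDUSample:
    image_paths: list[str]                      # up to ~20 images per dialog
    turns: list[DialogTurn] = field(default_factory=list)

# The six dimensions MMDU evaluates; the 0-10 scale and the averaging below
# are assumptions made for illustration.
DIMENSIONS = [
    "creativity", "richness", "visual_perception",
    "logical_coherence", "answer_accuracy", "image_relationship_understanding",
]

def aggregate_scores(per_turn_scores: list[dict[str, float]]) -> dict[str, float]:
    """Average judge scores per dimension across turns, plus an overall mean."""
    result = {d: mean(s[d] for s in per_turn_scores) for d in DIMENSIONS}
    result["overall"] = mean(result.values())
    return result
```

In the paper's setup, an LLM judge (GPT-4o) assigns such per-dimension scores by comparing each model answer against a curated reference answer.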
Additionally, the paper introduces MMDU-45k, an extensive dataset that serves as an instruction-tuning resource for LVLMs. Dataset generation involves clustering techniques applied to Wikipedia data, ensuring the collection of related multi-image inputs that form the basis for multi-turn questions. The authors employ human annotators, assisted by GPT-4o, to generate and refine these questions and their corresponding answers, following a careful curation process that mitigates hallucinations and errors.
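As a rough illustration of the first stage of such a pipeline, the sketch below groups Wikipedia entries by topical similarity so that images drawn from one cluster can seed a coherent multi-image dialog. The choice of embedding model, the use of k-means, and the entry schema are assumptions made for this example rather than details taken from the paper.

```python
# Illustrative sketch: cluster Wikipedia entries so related images can seed
# one multi-image dialog. Embedding model, clustering method, and the entry
# schema {"title", "text", "image_urls"} are assumptions for illustration.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def cluster_wiki_entries(entries: list[dict], n_clusters: int = 100) -> dict[int, list[dict]]:
    """Group entries by topical similarity of their title and text."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
    embeddings = model.encode([e["title"] + " " + e["text"] for e in entries])
    labels = KMeans(n_clusters=n_clusters, n_init="auto").fit_predict(embeddings)

    clusters: dict[int, list[dict]] = {}
    for entry, label in zip(entries, labels):
        clusters.setdefault(int(label), []).append(entry)
    return clusters
```

Grouping by topic keeps the images within one dialog mutually relevant, which is what allows later turns to probe relationships across images rather than treating each image in isolation.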
The experimental evaluation across a range of LVLMs offers an intriguing insight into the current landscape: open-source models lag significantly behind proprietary counterparts like GPT-4o, primarily due to limited access to rich instruction-tuning data. Fine-tuning on MMDU-45k yields notable improvements across multiple existing benchmarks beyond MMDU itself; for instance, the paper reports gains of up to 1.5% on MathVista and 1.2% on ChartQA, showcasing the dataset's potential for enhancing model capabilities.
The authors speculate that the disparity in performance between open and closed models could be reduced by leveraging comprehensive datasets like MMDU-45k in training open-source models, thereby enhancing their capabilities in understanding and generating responses in contextually rich and complex dialog scenarios.
The implications of this research are noteworthy. MMDU sets a new standard for evaluating LVLMs in scenarios that require processing of detailed, prolonged multi-modal dialogs. Furthermore, the structured format of MMDU-45k paves the way for future developments in AI, providing a valuable resource that can be expanded upon to further bridge the gap between theoretical model capabilities and the demands of practical real-world applications.
However, while the paper takes considerable steps toward addressing existing evaluation deficiencies, it acknowledges constraints such as its English-only coverage, with no multilingual support. The paper also notes that LVLMs fine-tuned on MMDU-45k may propagate biases inherent in the underlying data, urging caution and further research into more equitable and inclusive AI systems.
Overall, the paper offers a robust contribution to the field of AI and LVLMs, providing comprehensive resources and insights that support advancements towards more sophisticated AI interactions.