The paper introduces MMIE (Massive Multimodal Interleaved Comprehension Evaluation), a novel benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs). The benchmark comprises 20K multimodal queries across 3 categories, 12 fields, and 102 subfields, spanning mathematics, coding, physics, literature, health, and arts. MMIE supports interleaved inputs and outputs in both multiple-choice and open-ended question formats. The authors propose a reliable automated evaluation metric based on a scoring model fine-tuned with human-annotated data and systematic evaluation criteria.
The primary contributions of the paper are:
- The introduction of MMIE, a large-scale interleaved multimodal benchmark for evaluating LVLMs.
- Empirical demonstration of MMIE's difficulty, where the best-performing model (GPT-4o + SDXL) achieves a score of 65.47%, indicating significant room for improvement.
- A proposed automated scoring model shown to be reliable and closely aligned with human evaluation.
The paper addresses two key challenges in the evaluation of interleaved multimodal generation:
- The difficulty in constructing modality-coherent benchmarks.
- The lack of automated evaluation metrics.
To address these challenges, the MMIE benchmark was curated from four multimodal datasets and organized into three categories: situational analysis, project-based learning, and multi-step reasoning. The data curation process involved collecting and restructuring existing datasets to align with the interleaved image-and-text format, and a multi-step quality control process was applied to ensure the integrity and consistency of the dataset.
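For illustration, an interleaved query in this format could be represented roughly as in the Python sketch below; the JSON-like layout, field names, and values are hypothetical assumptions for exposition, not MMIE's actual schema.

```python
# Hypothetical sketch of a single interleaved multimodal query record.
# The layout and field names are illustrative, not MMIE's actual schema.
example_query = {
    "category": "situational analysis",        # one of the 3 categories
    "field": "arts",                           # one of the 12 fields
    "subfield": "photography",                 # illustrative subfield
    "question_format": "open-ended",           # or "multiple-choice"
    "context": [                               # interleaved text and image segments
        {"type": "text", "value": "Compare the lighting in the two shots below."},
        {"type": "image", "value": "images/shot_1.jpg"},
        {"type": "image", "value": "images/shot_2.jpg"},
    ],
    "reference_answer": [                      # ground truth may also interleave modalities
        {"type": "text", "value": "The second shot uses backlighting, which ..."},
    ],
}
```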
The automated evaluation metric is built by fine-tuning InternVL-2-4B on a high-quality multimodal scoring dataset that includes detailed scoring criteria and reference answers; the fine-tuned model then serves as the scoring model.
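A minimal sketch of how such a judge might be invoked is shown below, assuming a generic `judge` callable that wraps the fine-tuned model; the prompt template, 0-10 score range, and function names are illustrative assumptions rather than the paper's exact design.

```python
import re

def build_scoring_prompt(question, candidate_answer, reference_answer, criteria):
    """Assemble a grading prompt from the question, the model's answer, a
    reference answer, and explicit scoring criteria. The template and the
    0-10 scale below are illustrative, not the paper's exact design."""
    return (
        "You are grading a model's answer to a multimodal question.\n\n"
        f"Scoring criteria:\n{criteria}\n\n"
        f"Question:\n{question}\n\n"
        f"Reference answer:\n{reference_answer}\n\n"
        f"Candidate answer:\n{candidate_answer}\n\n"
        "Return a single integer score from 0 to 10."
    )

def score_response(judge, question, candidate_answer, reference_answer, criteria):
    """`judge` is any callable wrapping the fine-tuned scoring model
    (e.g., an InternVL-2-4B checkpoint) that maps a prompt string to generated text."""
    prompt = build_scoring_prompt(question, candidate_answer, reference_answer, criteria)
    raw_output = judge(prompt)
    match = re.search(r"\d+", raw_output)      # take the first integer the judge emits
    return int(match.group()) if match else None
```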
The experimental setup benchmarks four open-source interleaved LVLMs (MiniGPT-5, EMU-2, GILL, and Anole) alongside integrated pipelines that pair an LVLM with a text-to-image generator (e.g., GPT-4o + SDXL). The models were evaluated with the proposed metric, and the automated scores were compared against human annotations using cosine similarity, mean squared error (MSE), mean absolute error (MAE), and the Pearson correlation coefficient.
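These agreement statistics are standard; a minimal Python sketch of how they could be computed between automated and human scores is given below (toy inputs only, not results from the paper).

```python
import numpy as np

def agreement_metrics(auto_scores, human_scores):
    """Agreement statistics between automated and human scores, as named in the
    paper: cosine similarity, MSE, MAE, and Pearson correlation (toy sketch)."""
    a = np.asarray(auto_scores, dtype=float)
    h = np.asarray(human_scores, dtype=float)
    return {
        "cosine": float(a @ h / (np.linalg.norm(a) * np.linalg.norm(h))),
        "MSE": float(np.mean((a - h) ** 2)),
        "MAE": float(np.mean(np.abs(a - h))),
        "Pearson": float(np.corrcoef(a, h)[0, 1]),
    }

# Toy usage with made-up score vectors:
print(agreement_metrics([7, 5, 8, 6], [7, 4, 8, 6]))
```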
Key findings from the experiments include:
- The evaluated interleaved LVLMs achieve an average score of 50.80%, highlighting the difficulty of the benchmark.
- Integrated LVLMs outperform open-source interleaved LVLMs by an average of 25.2%.
- The integrated models outperform the best performance of the interleaved model by 14.6%, 26.3%, and 16.1% in situational analysis, project-based learning, and multi-step reasoning, respectively.
- The fine-tuned scoring model demonstrates the closest alignment with human evaluation results, proving to be the most reliable.
Error analysis revealed two broad challenges: temporal understanding and reasoning ability. The authors attribute temporal-understanding failures to errors in multimodal information comprehension and cross-modality coherence, and reasoning failures to errors in complex reasoning and generation adaptability.
The paper concludes by highlighting the challenges and opportunities in interleaved multimodal tasks and argues that the proposed metric provides robust, human-like evaluation, significantly reducing errors and biases.