Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
This paper introduces Video-MME, a novel evaluation benchmark aimed at comprehensively assessing the performance of Multi-modal Large Language Models (MLLMs) in video analysis. The benchmark addresses the insufficient exploration and assessment of MLLM capabilities on sequential visual data, a gap that has limited our understanding of these models' true potential in dynamic, real-world scenarios.
Key Features of Video-MME
Video-MME is distinguished from existing benchmarks through several critical features:
- Diversity in Video Types: The benchmark encompasses six primary visual domains and 30 subfields, covering Knowledge, Film & Television, Sports Competitions, Artistic Performances, Life Recordings, and Multilingual videos. This breadth ensures generalizability across scenarios.
- Duration in Temporal Dimension: Videos range from 11 seconds to 1 hour, capturing short-, medium-, and long-term dynamics. This robust temporal diversity facilitates the evaluation of MLLMs' ability to understand varying contextual dynamics.
- Breadth in Data Modalities: Video-MME integrates data modalities beyond video frames, including subtitles and audio tracks, broadening the evaluation of MLLMs' all-round capabilities.
- Quality in Annotations: The dataset comprises 900 manually selected videos with 2,700 question-answer (QA) pairs labeled by expert annotators. This rigorous manual annotation supports precise and reliable model assessment (a minimal sketch of such a QA item and its scoring follows this list).
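To make the evaluation format concrete, here is a minimal sketch of how a multiple-choice QA item and its accuracy scoring might be represented. The field names (`video_id`, `duration_class`, `domain`, `options`, `answer`) and the option-letter convention are assumptions for illustration, not the benchmark's published schema.

```python
from dataclasses import dataclass

@dataclass
class VideoQAItem:
    """One multiple-choice question attached to a single video.
    Field names are illustrative assumptions, not Video-MME's actual schema."""
    video_id: str
    duration_class: str   # assumed labels: "short", "medium", "long"
    domain: str           # e.g. "Sports Competitions"
    question: str
    options: list[str]    # candidate answers, mapped to letters A, B, C, ...
    answer: str           # ground-truth option letter, e.g. "C"

def accuracy(items: list[VideoQAItem], predicted_letters: list[str]) -> float:
    """Fraction of questions whose predicted option letter matches the label;
    predicted_letters[i] is the model's choice for items[i]."""
    correct = sum(pred == item.answer for item, pred in zip(items, predicted_letters))
    return correct / len(items)
```

Grouping the same score by `duration_class` or `domain` would reproduce the kind of per-category breakdowns reported in the experiments.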
Experimental Results
The experiments conducted with Video-MME provide a comprehensive evaluation of state-of-the-art MLLMs, covering both commercial models (e.g., the GPT-4 series and Gemini 1.5 Pro) and open-source models (e.g., InternVL-Chat-V1.5 and LLaVA-NeXT-Video).
Performance of Commercial Models
- Gemini 1.5 Pro demonstrated the strongest performance, achieving an average accuracy of 75.7% and significantly outperforming the best open-source model (LLaVA-NeXT-Video), which reached 52.5%.
- Adding subtitles and audio, evaluated with Gemini 1.5 Pro, yielded substantial accuracy gains (up to +13.3% in some subcategories), particularly on longer videos and on tasks requiring substantial domain knowledge.
Performance of Open-Source Models
- Among the open-source models, LLaVA-NeXT-Video showed the best performance with an overall accuracy of 52.5%, indicating a considerable gap between commercial and open-source models.
- Image-based models such as Qwen-VL-Max and InternVL-Chat-V1.5 achieved accuracies comparable to those of video-specific models, highlighting robust image understanding as a foundation for video analysis (a frame-sampling sketch follows this list).
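As a rough illustration of how an image-based MLLM can be applied to video, the sketch below samples a fixed number of frames uniformly across a clip with OpenCV. The `image_mllm.chat(...)` call at the end is a hypothetical placeholder for whatever multi-image chat interface a given model exposes; it is not an API from the paper.

```python
import cv2  # pip install opencv-python

def sample_frames_uniformly(video_path: str, num_frames: int = 8):
    """Return up to `num_frames` BGR frames spread evenly across the video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    if total <= 0:
        cap.release()
        return []
    indices = [int(i * (total - 1) / max(num_frames - 1, 1)) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames

# Hypothetical usage with an image MLLM that accepts multiple images per query:
# frames = sample_frames_uniformly("clip.mp4", num_frames=8)
# answer = image_mllm.chat(images=frames, prompt=question_text)
```

Because only a handful of frames are passed, such a pipeline inherits the image model's per-frame understanding but sees little of the video's temporal structure, which is consistent with the long-video weaknesses discussed below.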
Implications and Future Directions
The results using Video-MME reveal several critical insights into the current state of MLLMs and their future development:
- Temporal Dynamics and Long-Context Modeling: Both commercial and open-source models decline in performance as video length increases, indicating challenges in long-context understanding. Future research should focus on architectural innovations, such as temporal Q-Formers and context-extension techniques, to better handle long-range dependencies in video data.
- Subtitles and Auditory Information: Incorporating subtitles and audio tracks significantly enhances video understanding, underscoring the importance of multi-modal data. Developing models that seamlessly integrate these additional modalities will be crucial for comprehension in complex, real-world scenarios (a sketch of folding subtitle text into an evaluation prompt follows this list).
- Diverse and High-Quality Datasets: Building high-quality, diverse datasets with complex temporal reasoning tasks is essential. This will require novel approaches to data collection and annotation, potentially including human-in-the-loop frameworks and automatic data synthesis methods to address the long-tailed nature of multi-modal video data.
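One way to exercise the subtitle modality, assuming a text-plus-images chat interface, is to fold the subtitle transcript into the textual prompt that accompanies the sampled frames. The template below is an illustrative assumption, not the paper's exact prompt.

```python
def build_prompt(question: str, options: list[str], subtitles: str | None = None) -> str:
    """Assemble the text half of a multimodal query; the sampled frames are sent
    alongside it as images. When subtitle text is available it is prepended so the
    model can ground its answer in the spoken content as well as the visuals."""
    parts = []
    if subtitles:
        parts.append("Video subtitles:\n" + subtitles)
    parts.append("Question: " + question)
    parts.append("Options:\n" + "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options)))
    parts.append("Answer with the option letter only.")
    return "\n\n".join(parts)
```

Scoring a model with and without the `subtitles` argument gives a direct measure of how much the extra modality contributes, mirroring the with- and without-subtitle comparison reported for Gemini 1.5 Pro.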
Conclusion
Video-MME represents a significant step forward in the evaluation of MLLMs for video analysis, offering a benchmark whose comprehensive scope addresses the limitations of prior efforts. By exposing critical areas for improvement and highlighting the value of multi-modal data integration, Video-MME sets the stage for future work on more capable and robust multi-modal models with sophisticated, nuanced video understanding.