Video-MME Benchmark for Multi-modal Video Analysis
- Video-MME is a comprehensive benchmark that evaluates multi-modal LLMs on video analysis with expert annotations and diverse real-world scenarios, covering six visual domains and various temporal ranges.
- It integrates video frames, subtitles, and audio, employing rigorous question-answering protocols to measure models’ temporal reasoning, cross-modal understanding, and long-context memory.
- The benchmark’s findings illustrate a notable gap between commercial and open-source models, highlighting challenges in scaling multi-hop and long-horizon video reasoning for future AI systems.
Video-MME is a large-scale, expert-annotated benchmark designed to comprehensively evaluate the video analysis and multi-modal reasoning capabilities of state-of-the-art Multi-modal LLMs (MLLMs). Developed in response to the predominance of image-focused benchmarks, Video-MME systematically assesses video-LLMs across diverse real-world scenarios, multimodal input conditions, and a broad spectrum of temporal and reasoning challenges. Its construction, evaluation protocols, and subsequent findings have rapidly positioned it as a critical reference for measuring progress in multi-modal artificial intelligence.
1. Benchmark Design and Scope
Video-MME distinguishes itself from previous video QA and multi-modal benchmarks through a level of completeness and rigor previously lacking in the evaluation of MLLMs on video. Key aspects of its design include:
- Diversity of Domains: The dataset encompasses six primary visual domains with 30 fine-grained subfields, including areas such as knowledge, film & television, sports, artistic performance, daily life, and multilingual contexts. This ensures generalizability and scenario breadth.
- Temporal Range: Video-MME covers the full temporal spectrum: short videos (11s–2min, avg. 80.8s), medium (4–15min, avg. 520.2s), and long videos (30–60min, avg. 2471.0s). This supports robust testing of temporal reasoning and long-context memory.
- Data Modalities: Inputs include video frames, subtitles, and—where available—audio streams, thereby enabling assessment of cross-modal understanding and integration.
- Annotation Quality: The benchmark consists of 900 videos (254 hours total) and 2,700 multiple-choice question-answer pairs, with each QA pair manually crafted by expert annotators following strict criteria. All questions are constructed so that they cannot be answered without access to the video and its associated modalities.
- Certificate Length: A distinctive feature is the explicit measurement of "certificate length" (the temporal window required to answer a question), which reaches a median of 890.7s for the long-video subset. This enforces the necessity of genuinely broad context modeling.
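As an illustration only, the sketch below shows how a certificate-length statistic could be computed. It assumes a hypothetical annotation schema in which each QA pair carries annotator-marked evidence time spans, and it takes the summed duration of those spans as the certificate; the released Video-MME format and exact definition may differ.

```python
from statistics import median

# Hypothetical records: annotator-marked evidence segments (in seconds) that a
# viewer must watch to answer each question. The schema is illustrative.
qa_annotations = [
    {"question_id": "q1", "evidence_spans": [(30.0, 95.0), (410.0, 470.0)]},
    {"question_id": "q2", "evidence_spans": [(0.0, 45.0)]},
]

def certificate_length(spans: list[tuple[float, float]]) -> float:
    """One plausible operationalization: total duration of the evidence
    segments needed to answer (merging of overlapping spans omitted for brevity)."""
    return sum(end - start for start, end in spans)

lengths = [certificate_length(a["evidence_spans"]) for a in qa_annotations]
print(f"median certificate length: {median(lengths):.1f}s")
```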
2. Construction and Evaluation Procedure
Dataset Development
Video selection is balanced across domains and durations. Annotators watch each video in full, submit candidate questions, and conduct multi-stage peer review to ensure coverage, answer uniqueness, and an even distribution of answer choices (A/B/C/D, roughly 25% each). To preclude language-prior biases, strong models (e.g., Gemini 1.5 Pro) are asked to answer each question from its text alone; blind accuracy above 15% triggers revision.
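A minimal sketch of such a language-prior check follows. The `ask_text_only` helper is a placeholder for a real model API call (e.g., to Gemini 1.5 Pro), and treating the 15% figure as a pool-level threshold is an assumption; the benchmark's exact revision protocol may differ.

```python
import random

def ask_text_only(question: str, options: dict[str, str]) -> str:
    """Stand-in for querying a strong model with only the question and answer
    options -- no frames, subtitles, or audio are provided."""
    return random.choice(sorted(options))  # replace with a real model call

def language_prior_check(qa_items, threshold: float = 0.15):
    """Return (blind_accuracy, items_to_revise). Questions a video-blind model
    answers correctly are sent back for rewriting whenever pool-level blind
    accuracy exceeds the documented threshold."""
    answered_blind = [
        item for item in qa_items
        if ask_text_only(item["question"], item["options"]) == item["answer"]
    ]
    blind_acc = len(answered_blind) / len(qa_items)
    to_revise = answered_blind if blind_acc > threshold else []
    return blind_acc, to_revise
```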
Evaluation Protocol
- Each model is evaluated under its maximal supported context (e.g. number of frames, subtitle inclusion, audio track), with input sampling controlled and documented.
- The evaluation metric is multiple-choice accuracy; random guessing yields 25% (a minimal scoring sketch follows this list).
- Twelve distinct task categories are represented, such as action recognition, attribute perception, counting, temporal reasoning, and event localization.
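The sketch below illustrates, under stated assumptions, how uniform frame sampling within a model's frame budget and multiple-choice scoring might be implemented. The function names and the answer-extraction heuristic are illustrative and are not the benchmark's official evaluation scripts.

```python
import re

def uniform_frame_indices(num_total_frames: int, max_frames: int) -> list[int]:
    """Uniformly sample frame indices up to the model's supported budget,
    mirroring the 'maximal supported context' rule described above."""
    if num_total_frames <= max_frames:
        return list(range(num_total_frames))
    step = num_total_frames / max_frames
    return [int(i * step) for i in range(max_frames)]

def extract_choice(model_output: str) -> str | None:
    """Pull the first standalone A/B/C/D letter out of a free-form reply."""
    match = re.search(r"\b([ABCD])\b", model_output.upper())
    return match.group(1) if match else None

def accuracy(predictions: list[str], answers: list[str]) -> float:
    """Multiple-choice accuracy; random guessing over four options is 25%."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

# Example: a 90,000-frame long video evaluated under a 256-frame budget.
indices = uniform_frame_indices(90_000, 256)
print(len(indices), indices[:3], indices[-1])
```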
Model Lineup
Benchmarked models include proprietary/commercial MLLMs (GPT-4V, GPT-4o, Gemini 1.5 Pro), as well as recent open-source video models (LLaVA-NeXT-Video, VideoChat2), image models applied to multi-frame video (InternVL-Chat-V1.5), and others.
3. Results and Model Analysis
Quantitative Summary
| Model | Accuracy (without subtitles) | Accuracy (with subtitles) |
|---|---|---|
| Gemini 1.5 Pro | 75.7% | 81.6% |
| GPT-4V | 60.7% | 63.7% |
| GPT-4o | 66.2% | 65.8% |
| LLaVA-NeXT-Video (34B) | 52.5% | 56.0% |
| InternVL-Chat-V1.5 (20B) | 51.5% | 53.2% |
| Qwen-VL-Max | 51.8% | 51.7% |
| Chat-UniVi-V1.5 (7B) | 41.2% | 46.3% |
| Random chance | 25% | — |
- Commercial models decisively outperform open-source baselines. The Gemini 1.5 Pro model, especially with subtitles and audio, leads by a large margin.
- Subtitles provide significant gains (up to +9%), especially for long videos and under-resourced languages; audio inputs may offer further benefits, particularly for non-speech cues.
- Sequence length challenge: All models experience accuracy degradation as video duration grows, attributed to context dilution, sampling sparsity, and the increased complexity of long-horizon reasoning.
Detailed Observations
- Image-based MLLMs given multi-frame input remain competitive with specialized video models on short-to-medium video durations.
- Certificate length analysis confirms that most questions require cross-temporal synthesis, not static snapshot inference.
- Multi-modal integration—joint vision, subtitle, and audio utilization—correlates with the largest performance improvements.
- Fine-grained performance breakdowns across skill types (e.g., action recognition, attribute, temporal reasoning) are visualized as radar plots, showing clear strengths and weaknesses by model class.
4. Key Insights and Implications
- There exists a substantial performance gap between open-source and commercial MLLMs at the time of assessment, with the strongest commercial models scoring 15–25 points higher in accuracy on challenging, real-world video understanding tasks.
- The gap widens for longer, more complex videos, and for tasks requiring temporal reasoning and aggregation across modalities.
- Subtitles and audio, when leveraged, can offset information sparseness for long-form video and support reasoning in multilingual or noisy environments.
- A plausible implication is that strong image-language pretraining is a solid baseline for video, but genuine multi-hop and temporal modeling techniques are required for further advances.
5. Lessons for Model Development and Future Research
- Architectural innovations such as advanced attention mechanisms, long-context modeling, token compression, and adaptive frame sampling are priority research directions for improving long-horizon video reasoning (a toy sampling sketch follows this list).
- The lack of large-scale instruction-tuning datasets spanning long, complex, multimodal videos with human-aligned QA emerges as a pressing bottleneck for open-source progress.
- Video-MME demonstrates the necessity of designing benchmarks that explicitly preclude shortcut solutions, language-prior biases, and synthetic frame-only cues.
- The method of certificate length annotation surfaces as an important best practice for future dataset development, ensuring genuine temporal coverage.
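As a toy illustration of one direction named above, the sketch below implements a simple adaptive frame sampler that allocates a fixed frame budget in proportion to visual change between frames, so static stretches of a long video consume fewer tokens. It is not taken from any system evaluated on Video-MME.

```python
import numpy as np

def adaptive_frame_indices(frames: np.ndarray, budget: int) -> list[int]:
    """Toy adaptive sampler: distribute the frame budget according to
    inter-frame change so that dynamic segments get denser coverage.

    frames: array of shape (T, H, W, C); budget: number of frames to keep.
    """
    T = len(frames)
    if T <= budget:
        return list(range(T))
    # Per-frame change score: mean absolute difference to the previous frame.
    diffs = np.abs(frames[1:].astype(np.float32) - frames[:-1].astype(np.float32))
    change = diffs.mean(axis=(1, 2, 3))
    change = np.concatenate([[change.mean()], change])  # score for frame 0
    # Pick indices where cumulative change crosses evenly spaced levels.
    cum = np.cumsum(change + 1e-6)
    targets = np.linspace(cum[0], cum[-1], budget)
    indices = np.searchsorted(cum, targets)
    return sorted(set(int(i) for i in indices))

# Example with random frames: 1,000 frames compressed to (at most) 64 indices.
rng = np.random.default_rng(0)
video = rng.integers(0, 255, size=(1000, 32, 32, 3), dtype=np.uint8)
print(len(adaptive_frame_indices(video, 64)))
```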
6. Resources and Community Access
Video-MME is available at https://video-mme.github.io, providing:
- Downloadable videos, curated subtitles/audio, and annotated QA pairs.
- Extended results, leaderboards, domain- and task-specific analyses.
- Evaluation scripts supporting standardization and reproducibility.
- Illustrated case analyses for complex multi-hop and modality-based QA.
7. Impact and Trajectory
Video-MME rapidly established itself as a reference standard for the evaluation of video-based MLLMs. It not only revealed upper performance bounds and critical failure cases in both closed- and open-source models, but also set a demanding benchmark for models aspiring towards artificial general intelligence with robust, cross-modal, and temporally coherent reasoning. Its construction principles—in-depth human annotation, diversity, certificate length, and multi-modal integration—serve as models for future datasets aimed at the next generation of multi-modal AI research.