
MINERVA: Evaluating Complex Video Reasoning (2505.00681v1)

Published 1 May 2025 in cs.LG and cs.CV

Abstract: Multimodal LLMs are turning their focus to video benchmarks, however most video benchmarks only provide outcome supervision, with no intermediate or interpretable reasoning steps. This makes it challenging to assess if models are truly able to combine perceptual and temporal information to reason about videos, or simply get the correct answer by chance or by exploiting linguistic biases. To remedy this, we provide a new video reasoning dataset called MINERVA for modern multimodal models. Each question in the dataset comes with 5 answer choices, as well as detailed, hand-crafted reasoning traces. Our dataset is multimodal, diverse in terms of video domain and length, and consists of complex multi-step questions. Extensive benchmarking shows that our dataset provides a challenge for frontier open-source and proprietary models. We perform fine-grained error analysis to identify common failure modes across various models, and create a taxonomy of reasoning errors. We use this to explore both human and LLM-as-a-judge methods for scoring video reasoning traces, and find that failure modes are primarily related to temporal localization, followed by visual perception errors, as opposed to logical or completeness errors. The dataset, along with questions, answer candidates and reasoning traces will be publicly available under https://github.com/google-deepmind/neptune?tab=readme-ov-file#minerva.

The paper "MINERVA: Evaluating Complex Video Reasoning" (Nagrani et al., 1 May 2025 ) introduces a new benchmark dataset designed to evaluate the complex video reasoning capabilities of modern multimodal LLMs (MLLMs). Unlike many existing video question answering (VideoQA) datasets that only provide the final correct answer, Minerva includes detailed, hand-crafted reasoning traces alongside questions and multiple-choice answers. This allows researchers to not only assess whether a model arrives at the correct answer but also how it attempts to do so, providing crucial insights into its reasoning process and failure modes.

The core problem addressed by Minerva is the limitation of current VideoQA benchmarks, which primarily offer outcome supervision. This makes it difficult to distinguish models that genuinely understand and reason about video content from those that guess, exploit linguistic biases, or rely on superficial cues. Complex video understanding inherently requires a multi-step process involving temporal localization, visual/auditory perception, and logical reasoning. Minerva aims to capture this complexity by providing explicit, step-by-step ground-truth reasoning traces.

The Minerva dataset consists of 1,515 challenging questions across 223 videos. Key characteristics of the dataset include:

  • Complex, Multi-step Questions: Each question is designed to require multiple reasoning steps and often combines two or more skills, such as temporal reasoning, counting, cause and effect, spatial perception, OCR, or listening.
  • Multimodality: Questions often necessitate combining information from both video frames and audio transcripts (ASR).
  • Diverse Domains and Lengths: Videos are sourced from various domains like short films, sports, educational content, and lifestyle vlogs, with lengths ranging from under 2 minutes to over 1.5 hours, reflecting real-world video consumption.
  • High Quality and Manual Annotation: The entire dataset, including questions, answers, decoys, and reasoning traces, was meticulously hand-annotated by experienced raters.
  • Detailed Reasoning Traces: This is the defining feature, providing a step-by-step breakdown of how to arrive at the correct answer, including necessary temporal references (timestamps) and key perceptual observations. A schematic example record is sketched after this list.
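
The summary does not specify a file format for these annotations; the record below is a purely illustrative sketch of how a Minerva-style question, its five answer choices, and its timestamped reasoning trace might be bundled together. All field names and values are hypothetical.

```python
# Purely illustrative sketch of a Minerva-style record; field names and values
# are hypothetical and do not reflect the dataset's actual schema.
example_record = {
    "video_id": "yt_abc123",                      # hypothetical source-video identifier
    "duration_sec": 1420,                         # videos range from <2 min to >1.5 h
    "question": "How many times does the presenter restart the derivation after the first mistake?",
    "answer_choices": ["1", "2", "3", "4", "5"],  # every question has 5 options
    "correct_index": 2,                           # index of the correct choice ("3")
    "skills": ["counting", "temporal reasoning"], # questions combine two or more skills
    "asr": "... okay, let's start that again from the top ...",
    "reasoning_trace": [
        {"timestamp": "03:15", "step": "The presenter makes an arithmetic error on the board."},
        {"timestamp": "04:02", "step": "First restart: the derivation is erased and begun again."},
        {"timestamp": "07:48", "step": "Second restart after a sign mistake."},
        {"timestamp": "11:30", "step": "Third restart; this attempt succeeds, so the count is 3."},
    ],
}
```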

The dataset construction involved several steps: selecting diverse video domains suitable for complex questions; manual annotation of questions, answers, and reasoning traces; quality review by other raters; and adversarial filtering to mitigate textual biases, i.e., removing or revising questions answerable from the answer choices or ASR transcript alone.
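
The adversarial filtering step is described only at a high level; the sketch below illustrates the general idea, assuming a hypothetical `text_only_model` callable that predicts an answer index from the question, answer choices, and ASR transcript alone. This is a reconstruction of the concept, not the authors' actual pipeline.

```python
from typing import Callable, Sequence

def adversarial_filter(
    questions: Sequence[dict],
    text_only_model: Callable[[str, Sequence[str], str], int],  # hypothetical blind QA model
    n_trials: int = 3,
) -> list[dict]:
    """Keep only questions that a text-only (blind) model cannot answer reliably.

    Illustrative reconstruction of the filtering idea; the paper's actual
    procedure may differ.
    """
    kept = []
    for q in questions:
        correct = sum(
            int(text_only_model(q["question"], q["answer_choices"], q.get("asr", ""))
                == q["correct_index"])
            for _ in range(n_trials)
        )
        # If a blind model answers correctly most of the time, the text alone
        # likely leaks the answer, so the question should be revised or dropped.
        if correct / n_trials < 0.5:
            kept.append(q)
    return kept
```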

The paper benchmarks a range of frontier open-source and proprietary MLLMs on Minerva. Experiments with blind text-only baselines (using only question/answer choices and ASR) show performance close to random chance, highlighting the necessity of visual information. Ablations on the number of frames provided indicate that performance generally improves with more visual context, although the optimal number varies by model.
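
The summary does not state how frames are selected for the frame-count ablation; a common choice is uniform sampling over the video, sketched below with standard-library code only.

```python
def uniform_frame_indices(num_video_frames: int, budget: int) -> list[int]:
    """Pick `budget` frame indices spread evenly across the video.

    Generic sampling for a frame-budget ablation; the paper's exact frame
    selection strategy is not specified in this summary.
    """
    if budget >= num_video_frames:
        return list(range(num_video_frames))
    step = num_video_frames / budget
    return [int(i * step) for i in range(budget)]

# Example sweep over visual-context budgets (~1 hour of video at 25 fps).
for budget in (16, 64, 256):
    indices = uniform_frame_indices(num_video_frames=90_000, budget=budget)
    print(budget, indices[:3], "...")
```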

Benchmarking results show that even the best-performing models, such as Gemini 2.5 Pro Thinking, achieve an MCQ accuracy of 66.2%, significantly trailing human performance (92.5%). This gap demonstrates that Minerva is a challenging benchmark for current MLLMs. The paper also reveals that "thinking" models, capable of generating intermediate thoughts, achieve higher accuracy, especially with more frames, suggesting the importance of explicit reasoning for this task. Analysis by skill and domain shows that models struggle most with counting, counterfactual reasoning, state changes, and educational videos (particularly math), while performing relatively better on short films. Performance consistently degrades as video length increases.

A notable finding from the benchmarking is the impact of prompting. Asking models to reason step by step, and even more so providing the Minerva reasoning rubric within the prompt, improves final MCQ accuracy. This suggests that aligning the model's reasoning process with structured criteria can lead to better outcomes.
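
The exact prompt text and rubric are not reproduced in this summary; the function below sketches the prompting styles discussed (plain MCQ, step-by-step reasoning, and rubric-in-prompt) with hypothetical wording.

```python
def build_prompt(question: str, choices: list[str], mode: str = "plain") -> str:
    """Assemble an MCQ prompt in one of the prompting styles discussed above.

    The wording, including the rubric paraphrase, is illustrative and not the
    paper's actual prompt.
    """
    options = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(choices))
    base = f"Question: {question}\n{options}\n"
    if mode == "step_by_step":
        return base + "Reason step by step about the video, then give your final answer letter."
    if mode == "rubric":
        rubric = (
            "When reasoning: (1) localize the relevant time segments, "
            "(2) describe the key visual and audio evidence, "
            "(3) deduce the answer logically from that evidence, and "
            "(4) include every step needed to reach the answer."
        )
        return base + rubric + "\nThen give your final answer letter."
    return base + "Answer with a single letter."

# Example usage with a hypothetical question and answer choices.
print(build_prompt("What causes the second fall?",
                   ["Wet floor", "Loose cable", "A push", "A trip", "Dizziness"],
                   mode="rubric"))
```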

Beyond final-answer accuracy, the paper introduces a taxonomy of video reasoning errors identified through analysis of model outputs (a minimal encoding of these axes is sketched after the list):

  1. Perceptual Correctness: Errors in identifying objects, actions, events, or interpreting modalities like ASR/OCR.
  2. Temporal Localization: Errors in pinpointing the correct time segments in the video.
  3. Logical Reasoning: Errors in the logical deduction process, given the perceived information.
  4. Completeness: Missing necessary steps in the reasoning trace.
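
For scoring and analysis, these four axes can be represented as a simple enumeration; the encoding below is a convenience sketch rather than anything prescribed by the paper.

```python
from enum import Enum

class ReasoningErrorAxis(Enum):
    """The four axes of Minerva's reasoning-error taxonomy (names paraphrased)."""
    PERCEPTUAL_CORRECTNESS = "perceptual_correctness"  # misidentified objects, actions, ASR/OCR
    TEMPORAL_LOCALIZATION = "temporal_localization"    # wrong or missing time segments
    LOGICAL_REASONING = "logical_reasoning"            # faulty deduction from perceived facts
    COMPLETENESS = "completeness"                      # missing steps in the trace

# Example: tagging the errors found in one model-generated trace.
trace_errors = [ReasoningErrorAxis.TEMPORAL_LOCALIZATION, ReasoningErrorAxis.PERCEPTUAL_CORRECTNESS]
print([e.value for e in trace_errors])
```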

Using this taxonomy, the authors devise the Minerva Reasoning Assessment (MiRA), an LLM-based approach for scoring model-generated reasoning traces. Evaluation comparing MiRA (both reference-based, using ground truth traces, and reference-free) with human judgments reveals that reference-based MiRA correlates better with humans, particularly for temporal and perceptual errors. This reference-based MiRA is then applied to analyze reasoning traces across all benchmarked models on the full dataset.
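
MiRA's prompt wording, scoring scale, and output format are not given in this summary; the function below is a minimal sketch of a reference-based LLM-as-judge call in that spirit, assuming a hypothetical `call_llm` client and a 1-5 score per axis returned as JSON.

```python
import json
from typing import Callable

AXES = ("perceptual_correctness", "temporal_localization",
        "logical_reasoning", "completeness")

def judge_reasoning_trace(
    question: str,
    model_trace: str,
    reference_trace: str,
    call_llm: Callable[[str], str],  # hypothetical LLM client returning raw text
) -> dict[str, int]:
    """Reference-based reasoning assessment in the spirit of MiRA (illustrative only)."""
    prompt = (
        "Grade a model's video reasoning trace against the ground-truth trace.\n"
        f"Question: {question}\n"
        f"Ground-truth reasoning: {reference_trace}\n"
        f"Model reasoning: {model_trace}\n"
        "Score each axis from 1 (poor) to 5 (excellent) and reply as JSON with keys: "
        + ", ".join(AXES)
    )
    raw = call_llm(prompt)                      # assumed to return a JSON string
    scores = json.loads(raw)
    return {axis: int(scores[axis]) for axis in AXES}
```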

The MiRA analysis confirms that models struggle most with Temporal Localization and Perceptual Correctness compared to Logical Reasoning and Completeness. This suggests that while current MLLMs, often equipped with powerful text LMs, may produce plausible reasoning structures, they still face significant challenges in accurately perceiving visual details and grounding them correctly in time within long videos. The analysis using MiRA also indicates variations in these specific capabilities even among models with similar overall MCQ scores.

In conclusion, Minerva provides a valuable resource for the video understanding community by offering a challenging benchmark that evaluates not just the final answer but also the intermediate reasoning process. The included reasoning traces and the proposed error taxonomy and evaluation method (MiRA) facilitate fine-grained analysis, highlighting current limitations of MLLMs primarily in temporal grounding and perceptual accuracy, and pointing towards key areas for future research and development in video reasoning. The dataset, questions, answers, and reasoning traces are publicly available at https://github.com/google-deepmind/neptune?tab=readme-ov-file#minerva.

Authors (12)
  1. Arsha Nagrani (62 papers)
  2. Sachit Menon (12 papers)
  3. Ahmet Iscen (29 papers)
  4. Shyamal Buch (11 papers)
  5. Ramin Mehran (4 papers)
  6. Nilpa Jha (3 papers)
  7. Anja Hauth (6 papers)
  8. Yukun Zhu (33 papers)
  9. Carl Vondrick (93 papers)
  10. Mikhail Sirotenko (10 papers)
  11. Cordelia Schmid (206 papers)
  12. Tobias Weyand (14 papers)