LongVideoBench: Long-Context Video QA
- LongVideoBench is a benchmark designed to evaluate long-context video-language understanding by integrating multimodal cues across extended video sequences.
- It overcomes single-frame bias by requiring models to retrieve and reason over interleaved video frames and subtitles for detailed temporal comprehension.
- Evaluation shows that both proprietary and open-source models struggle with extended temporal reasoning, emphasizing the need for advanced memory and retrieval architectures.
LongVideoBench is a comprehensive question-answering benchmark explicitly designed to evaluate the long-context video–language understanding capabilities of large multimodal models (LMMs). It addresses the "single-frame bias" and limited context lengths of previous datasets by requiring detailed multimodal retrieval and reasoning over temporally extended, interleaved video-language sequences spanning up to one hour. The benchmark is grounded in the referring reasoning paradigm, where the model must accurately locate and reason over precise, often noncontiguous segments of the input video based on explicit references embedded in the query.
1. Benchmark Motivation and Design Principles
LongVideoBench is constructed to overcome the limitations of prior video understanding benchmarks, which emphasize either short clips or questions answerable using a handful of keyframes. Such practices lead to models that excel at summary-level reasoning but perform poorly on tasks requiring long-range temporal associations or multi-turn entity tracking.
The benchmark enforces a design where model performance only improves if the model can process many frames and integrate temporally distributed cues, thus demanding genuine long-context reasoning rather than shortcut solutions. This is achieved by interleaving video frames and subtitles and by annotating questions that explicitly require the integration of disparate information from within prolonged and diverse video streams.
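To make the interleaved input format concrete, the following minimal Python sketch merges sampled frames and subtitles into a single temporally ordered stream. The `Frame`/`Subtitle` containers and their timestamp fields are illustrative assumptions for this sketch, not the benchmark's official data loader.

```python
# A minimal sketch of building an interleaved frame/subtitle sequence.
# The container classes and timestamp alignment rule are assumptions
# for illustration; they are not the benchmark's official loader.
from dataclasses import dataclass
from typing import List, Union

@dataclass
class Frame:
    timestamp: float   # seconds from video start
    image_path: str    # path to the decoded frame

@dataclass
class Subtitle:
    start: float       # subtitle display start, in seconds
    end: float         # subtitle display end, in seconds
    text: str

def interleave(frames: List[Frame], subtitles: List[Subtitle]) -> List[Union[Frame, Subtitle]]:
    """Merge frames and subtitles into one temporally ordered stream."""
    events = [(f.timestamp, 0, f) for f in frames] + [(s.start, 1, s) for s in subtitles]
    events.sort(key=lambda e: (e[0], e[1]))   # order by time; frames precede co-timed subtitles
    return [item for _, _, item in events]
```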
2. Dataset Composition and Structure
LongVideoBench comprises 3,763 web-sourced videos collected from a broad array of content domains, including movie recaps, daily activities, educational material, news programs, and specialized knowledge domains (e.g., STEM). The dataset spans video durations from short clips (8–15 seconds) to extended sequences reaching 60 minutes, organized into four duration groups:
- 8–15 seconds
- 15–60 seconds
- 180–600 seconds
- 900–3,600 seconds
Each video is furnished with original or transcribed subtitles, temporally aligned with frames to create truly interleaved multimodal input scenarios. Both landscape and portrait aspect ratios are represented to mirror real-world video consumption formats.
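As a small illustration of how videos map onto these groups, the sketch below assigns a duration (in seconds) to one of the four buckets listed above. The string labels and the out-of-range fallback are my own choices, not part of the released metadata.

```python
# A minimal sketch of assigning a video to one of the four duration groups.
# Boundary values follow the list above; labels are illustrative.
def duration_group(duration_s: float) -> str:
    if 8 <= duration_s <= 15:
        return "8-15s"
    if 15 < duration_s <= 60:
        return "15-60s"
    if 180 <= duration_s <= 600:
        return "180-600s"
    if 900 <= duration_s <= 3600:
        return "900-3600s"
    return "out-of-range"   # durations outside the four defined groups
```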
3. Referring Reasoning Task and Question Annotation
The core innovation of LongVideoBench is its referring reasoning task. Each question is presented in two parts:
- Referring Query: Selects a specific context within the video (the "referred context"). This is visually marked (e.g., in purple) and can reference a scene, event, object, or text snippet.
- Question Body: Requires the model to reason about what happens at, before, or after the referred context, or to synthesize information from multiple referenced segments.
Questions span two principal reasoning levels:
- Perception (L1): Requires local analysis, such as identifying specific objects, attributes, or immediate events at a given moment.
- Relation (L2): Requires relational or temporal reasoning, such as tracking attribute changes, sequencing scenes, understanding event transitions, or comparing noncontiguous segments.
A total of 6,678 multiple-choice questions are meticulously human-annotated and distributed across 17 categories, with an average question length of approximately 43.5 words. A multi-stage review process (primary annotation, examination, and revision) ensures question clarity and eliminates ambiguity.
Table: Summary of Reasoning Categories

| Level (Type) | Description | Example Categories |
|---|---|---|
| Perception (L1) | Analyze a single moment or attribute | S2E, O2E, T2E |
| Relation (L2) | Temporal or multi-segment logic | O3O, SAA, SSS, TOS |
(See the task-definition table in the original paper for the precise taxonomy.)
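One way to picture a referring-reasoning item is as a small record combining the referring query, the question body, and the multiple-choice options. The field names below (`referred_context`, `answer_idx`, and so on) are assumptions for illustration rather than the released annotation schema.

```python
# A minimal sketch of how a referring-reasoning item could be represented.
# Field names are illustrative assumptions, not the official schema.
from dataclasses import dataclass
from typing import List

@dataclass
class ReferringQuestion:
    video_id: str
    duration_group: str        # e.g., "900-3600s"
    referred_context: str      # referring query that localizes a segment
    question: str              # question body reasoning about that segment
    options: List[str]         # multiple-choice candidates
    answer_idx: int            # index of the correct option
    level: str                 # "L1" (perception) or "L2" (relation)
    category: str              # one of the 17 fine-grained categories
```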
4. Evaluation Findings and Model Performance
Extensive benchmarking reveals that LongVideoBench is highly challenging even for state-of-the-art proprietary models, such as GPT-4o, Gemini-1.5-Pro, and GPT-4-Turbo. Key findings include:
- Even with optimal settings, proprietary models struggle particularly on tasks requiring long-context and relational reasoning; performance improves significantly as more frames are processed, directly validating the benchmark’s design intent.
- Open-source LMMs show a wider performance gap relative to proprietary models and do not reliably improve with increased input length unless supported by appropriate architecture and training modifications.
- Difficulties are amplified for questions that reference earlier or mid-sequence intervals, indicating that temporal position and memory persistence are critical factors.
Leaderboard results (the paper's max-frames and leaderboard tables) show that frame count and temporal coverage are limiting factors: models unable to process more than roughly 100 frames plateau in performance, while those integrating hundreds of frames achieve significantly higher accuracy.
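A simple way to probe this dependence on frame count is to sweep the frame budget and measure accuracy at each setting. In the sketch below, `model.answer` and the per-item fields are hypothetical stand-ins for a real LMM interface and dataset loader, and uniform sampling is only one baseline strategy.

```python
# A minimal sketch of measuring accuracy as a function of the frame budget.
# `model.answer`, `item.frames`, `item.subtitles`, etc. are hypothetical
# stand-ins for a concrete LMM interface and dataset loader.
def sample_uniform_frames(frames, budget):
    """Pick `budget` frames spread evenly across the full video."""
    if len(frames) <= budget:
        return frames
    step = len(frames) / budget
    return [frames[int(i * step)] for i in range(budget)]

def accuracy_vs_budget(model, dataset, budgets=(8, 32, 128, 256)):
    """Report multiple-choice accuracy at several frame budgets."""
    results = {}
    for budget in budgets:
        correct = 0
        for item in dataset:   # item: frames, subtitles, question, options, answer_idx
            frames = sample_uniform_frames(item.frames, budget)
            pred = model.answer(frames, item.subtitles, item.question, item.options)
            correct += int(pred == item.answer_idx)
        results[budget] = correct / len(dataset)
    return results
```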
5. Technical and Methodological Considerations
LongVideoBench enforces a rigorous evaluation protocol:
- Multiple-choice questions are balanced and unambiguous, requiring recall of visual details and their alignment with subtitles.
- Annotation quality is maintained through multi-stage review.
- Difficulty is stratified by both video duration and reasoning category (L1/L2), so that performance can be analyzed as a function of both scale and cognitive demand (see the scoring sketch after this list).
- Preliminary analyses (the paper's depth-of-reference figure) show that questions whose referred regions fall earlier in the video are uniquely challenging, exposing weaknesses in current model memory architectures.
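The stratified reporting described above can be reproduced with a few lines of bookkeeping. The sketch below assumes each per-question result record carries a `duration_group`, a `level`, and a boolean `correct`; these field names are illustrative.

```python
# A minimal sketch of stratifying accuracy by duration group and reasoning level.
# The per-record fields (duration_group, level, correct) are illustrative assumptions.
from collections import defaultdict

def stratified_accuracy(results):
    """results: iterable of dicts with 'duration_group', 'level', and boolean 'correct'."""
    buckets = defaultdict(lambda: [0, 0])   # (duration_group, level) -> [n_correct, n_total]
    for r in results:
        key = (r["duration_group"], r["level"])
        buckets[key][0] += int(r["correct"])
        buckets[key][1] += 1
    return {key: n_correct / n_total for key, (n_correct, n_total) in buckets.items()}
```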
6. Implications for Model Architecture and Future Research
The sustained underperformance of even the best models, particularly on long-context, interleaved video-language reasoning, signals the need for next-generation multimodal architectures with enhanced temporal aggregation, richer memory mechanisms, and improved retrieval of cross-modal cues over long input streams.
Key areas likely to benefit from LongVideoBench include:
- Investigation of advanced memory modules (e.g., attention over extended token windows, external memory buffers).
- Frame compression or advanced sampling algorithms that maximize information density without sacrificing temporal coverage (a simple selection sketch follows this list).
- Training regimens or architectural modifications explicitly designed to reward integration of co-referential cues across video and language.
- Scaling up LLM backbones within LMMs, as increased capacity is correlated with improved benchmark performance.
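As one illustrative (and deliberately simple) example of frame compression, the sketch below drops frames that are nearly identical to the last kept frame, so that a fixed frame budget covers more distinct moments. The greyscale mean-difference criterion and the threshold value are arbitrary choices, not a method proposed in the paper.

```python
# A minimal sketch of one possible frame-compression strategy: keep a frame
# only if it differs noticeably from the previously kept frame. The greyscale
# mean-difference test and threshold are illustrative, not from the paper.
import numpy as np

def keep_distinct_frames(frames, threshold=12.0):
    """frames: list of HxW greyscale numpy arrays in temporal order."""
    if not frames:
        return []
    kept = [frames[0]]
    for frame in frames[1:]:
        # mean absolute pixel difference against the last kept frame
        diff = float(np.mean(np.abs(frame.astype(np.float32) - kept[-1].astype(np.float32))))
        if diff > threshold:
            kept.append(frame)
    return kept
```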
A plausible implication is that advances made and validated on LongVideoBench are likely to generalize well to real-world applications such as detailed video analytics, surveillance, media understanding, and complex instructional content parsing.
7. Broader Impact and Benchmark Availability
LongVideoBench serves as a critical resource and catalyst for long-context multimodal research. It is expected to influence benchmarking standards, model architecture development, and the broader direction of long-video and multimodal understanding, offering:
- Systematic assessment tools for detailed temporal and cross-modal reasoning.
- A publicly documented protocol for annotation and evaluation, supporting reproducibility and benchmarking transparency.
- A scalable template for future dataset extensions involving longer durations, diversified modalities (e.g., audio), or additional reasoning paradigms.
In practice, improvements on LongVideoBench serve as a meaningful proxy for progress toward robust, human-like long-form video-language comprehension. The dataset, evaluation scripts, and supporting documentation are accessible for broad academic and industrial adoption, providing a foundation for the ongoing evolution of multimodal LLMs.