LVBench: Long Video Understanding Benchmark
- LVBench is a benchmark for evaluating long video understanding that challenges multimodal models to perform temporal grounding, causal reasoning, and entity recognition over extended video content.
- It comprises 103 curated YouTube videos across six diverse domains, offering raw frames, metadata, and over 1,500 annotated QA pairs for precise task evaluation.
- Baseline results reveal a large performance gap between current models and human annotators, highlighting the need for advanced memory architectures and improved vision-language integration.
LVBench is a benchmark specifically designed for evaluating long video understanding, probing the limits of multimodal LLMs (MLLMs) and vision-language systems on extended temporal horizons. Unlike prior datasets restricted to short clips, LVBench targets the comprehension, reasoning, and extraction of information from videos of 30 minutes to several hours, directly addressing demands in real-world scenarios such as robotic autonomy, sports commentary, and deep film analysis. LVBench systematically challenges automated systems to exhibit long-term memory, causal tracking, and multi-entity reasoning across a spectrum of narrative structures, with all data and protocols available at https://lvbench.github.io (Wang et al., 2024).
1. Dataset Structure and Modalities
LVBench comprises 103 publicly sourced YouTube videos, totaling approximately 117 hours of content, with each video exceeding 30 minutes (average: 4,101 seconds, or 68.4 minutes). The video corpus spans six primary domains (Sports, Documentary, Event Record, Lifestyle, TV Shows, Cartoons) and 21 subcategories, guaranteeing diversity in visual and narrative complexity.
For each video, the dataset provides:
- Raw video frames at 1 fps (evaluation subset) with higher framerates archived.
- Metadata in JSON Lines format (under a video_info key): duration, category, and resolution (see the loading sketch after this list).
- Annotated questions (~24 per hour of video) paired with four candidate answers (1 correct, 3 distractors). Questions may reference temporal windows (“What happened at 29:30?”), but annotators were instructed to limit over-reliance on explicit timestamps.
Audio is not released, as current models lack robust multimodal audio-language understanding.
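The following is a minimal loading sketch in Python, assuming the metadata and QA annotations are distributed as JSON Lines; the file name and field names (video_info, duration, category, qa) are illustrative assumptions and may differ from the actual release schema.

```python
import json

# Illustrative file name; substitute the JSONL file shipped with the LVBench release.
META_PATH = "lvbench_video_info.jsonl"

def load_jsonl(path: str) -> list[dict]:
    """Read one JSON object per non-empty line."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records

records = load_jsonl(META_PATH)

# Field names below are assumptions for illustration, not the guaranteed schema.
for rec in records[:3]:
    info = rec.get("video_info", {})
    print(info.get("duration"), info.get("category"), len(rec.get("qa", [])))
```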
2. Task Design and Annotation Protocols
LVBench operationalizes long video understanding through six core capabilities, each mapped to a dedicated annotation template (an illustrative data structure follows the list):
- Temporal Grounding (TG): Detect and recognize events at specific timestamps. Input: “What happened at T?” Options: {A, B, C, D}.
- Summarization (Sum): Generate abstractive, temporally ordered, free-form summaries of entire videos.
- Reasoning (Rea): Covers causal (“Why did X happen?”), emotional, intentional, and prospective (prediction) reasoning; all items use the multiple-choice format.
- Entity Recognition (ER): Detect, track, and relate entities (object/person/action associations).
- Event Understanding (EU): Classify genre, detect scenes/events, or identify scene changes.
- Key Information Retrieval (KIR): Extract precise details (on-screen text, numeric labels).
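For concreteness, the sketch below gives one hypothetical in-memory representation of a multiple-choice item, with the six capabilities as an enumeration; the class and field names are illustrative and do not reproduce the release schema.

```python
from dataclasses import dataclass
from enum import Enum

class Capability(Enum):
    TG = "temporal_grounding"
    SUM = "summarization"
    REA = "reasoning"
    ER = "entity_recognition"
    EU = "event_understanding"
    KIR = "key_information_retrieval"

@dataclass
class MCQuestion:
    video_id: str
    capability: Capability
    question: str             # e.g. "What happened at 29:30?"
    options: dict[str, str]   # keys "A".."D"; exactly one option is correct
    answer: str               # gold option letter

item = MCQuestion(
    video_id="example_video",
    capability=Capability.TG,
    question="What happened at 29:30?",
    options={"A": "...", "B": "...", "C": "...", "D": "..."},
    answer="B",
)
```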
Professional annotators created 1,549 total QA pairs, with three staged quality filters:
- Video curation reduced an initial pool of 500 YouTube videos to 103 by demanding protagonist presence, narrative coherence, and visual completeness.
- GLM-4 and GPT-4 filtered out questions answerable purely from language priors, ensuring each retained item requires genuine visual evidence (a filtering sketch follows this list).
- Distractor answers were matched to the correct option in style and plausibility.
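A sketch of the language-prior filter under stated assumptions: query_llm is a hypothetical callable that returns an option letter from a text-only language model given the question and options but no video, and items the blind model answers correctly in most trials are discarded. The exact GLM-4/GPT-4 prompts and thresholds used by the authors are not reproduced here.

```python
def language_prior_filter(items, query_llm, trials: int = 3):
    """Keep only questions a text-only LLM cannot reliably answer.

    `items` are MCQuestion objects (see the sketch above); `query_llm(prompt)`
    is a hypothetical callable returning an option letter such as "A".
    """
    kept = []
    for item in items:
        prompt = item.question + "\n" + "\n".join(
            f"{letter}. {text}" for letter, text in sorted(item.options.items())
        )
        correct = sum(
            query_llm(prompt).strip().upper().startswith(item.answer)
            for _ in range(trials)
        )
        # Discard items the blind model gets right in a majority of trials.
        if correct <= trials // 2:
            kept.append(item)
    return kept
```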
3. Evaluation Metrics and Analysis
For multiple-choice tasks (TG, Rea, ER, EU, KIR), accuracy is the principal metric: the fraction of questions for which the predicted option matches the gold answer.
Summarization can optionally be scored with ROUGE-L or BLEU, though the initial evaluation prioritizes human judgment of overall consistency; F1, precision, and recall can also be reported but are not central to the benchmark.
Capability-wise radar charts (e.g., Figure 1, right panel) are used to profile model strengths and weaknesses across tasks.
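A minimal scoring sketch for the multiple-choice tasks, computing overall and per-capability accuracy; the input shapes are assumptions chosen for illustration.

```python
from collections import defaultdict

def score(predictions: dict[str, str], items) -> dict[str, float]:
    """Overall and per-capability accuracy.

    `predictions` maps a question id to the predicted option letter;
    `items` yields (question_id, capability_name, gold_letter) triples.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for qid, capability, gold in items:
        pred = predictions.get(qid, "").strip().upper()
        totals["overall"] += 1
        totals[capability] += 1
        if pred == gold.upper():
            hits["overall"] += 1
            hits[capability] += 1
    return {key: hits[key] / totals[key] for key in totals}

# Example: two temporal-grounding items, one answered correctly.
print(score({"q1": "B", "q2": "a"}, [("q1", "TG", "B"), ("q2", "TG", "C")]))
```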
4. Baseline Model Performance and Gaps
Table 3 in the original study presents benchmark results for eight models; a representative subset is reproduced below (dashes mark scores not broken out here):
| Model | Overall Acc (%) | ER (%) | KIR (%) | TG (%) | Sum (%) | EU (%) | Rea (%) |
|---|---|---|---|---|---|---|---|
| Gemini 1.5 Pro | 33.1 | 32.1 | 39.3 | 31.8 | 32.8 | — | — |
| LLaVA-NeXT | 32.2 | — | — | — | — | 31.2 | 35.0 |
| GPT-4o | 27.0 | — | — | — | — | — | — |
| MovieChat | 22.5 | — | — | — | — | — | — |
| LLaMA-VID | 23.9 | — | — | — | — | — | — |
| LWM | 25.5 | — | — | — | — | — | — |
| Human Annotators | 94.4 | — | — | — | — | — | — |
Key observations:
- Model accuracy remains far below the human-annotator level of 94.4%.
- Entity recognition (ER) and event-relation extraction scores sit in the mid-20 percent range for most models.
- Temporal grounding accuracy hovers around 20–30%.
- Summarization and reasoning remain below 35%, revealing a persistent deficit in integrating long-range causal and narrative information.
- Instruction adherence is imperfect: Gemini 1.5 Pro produced 20.9% non-compliant answers, MovieChat defaulted heavily to choice A, and LLaVA-NeXT showed the best compliance (an answer-extraction sketch follows this list).
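Compliance is typically judged by whether an option letter can be parsed from the raw model response at all; the sketch below is a naive extraction heuristic for illustration, not the authors' exact parser.

```python
import re
from typing import Optional

# Matches a standalone option letter; deliberately simple and easy to fool
# (e.g. a reply starting with the article "A" would be misread as a choice).
OPTION_PATTERN = re.compile(r"\b([ABCD])\b")

def extract_choice(response: str) -> Optional[str]:
    """Return the first standalone option letter in a model response,
    or None if the reply does not comply with the A/B/C/D format."""
    match = OPTION_PATTERN.search(response.strip().upper())
    return match.group(1) if match else None

print(extract_choice("The answer is B, because the goal was scored at 29:30."))  # -> B
print(extract_choice("I cannot determine this from the frames."))                # -> None
```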
5. Dataset Release, Licensing, and Usage
LVBench is released under a CC-BY-NC-SA-4.0 license without train/validation/test splits, enabling unrestricted evaluation across the entire 103-video, 1,549-question set. Data and code repositories are openly accessible at the LVBench website and mirrored on Hugging Face in JSONL format.
Recommended usage protocols include:
- Frame sampling at 1 fps or a uniformly spaced subset (32–96 frames per question).
- Strict prompt design adhering to the multiple-choice template (“Please select A, B, C, or D.”); see the sampling-and-prompting sketch after this list.
- LLM filtering for future question additions, maintaining dataset integrity.
- Reporting overall and per-capability accuracy with optional ROUGE-L for summarization.
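A sketch combining two of the protocol points above, uniform frame-index sampling and the multiple-choice prompt; the function names and exact prompt layout are illustrative assumptions.

```python
def uniform_frame_indices(total_frames: int, num_samples: int = 64) -> list[int]:
    """Pick `num_samples` uniformly spaced frame indices (32-96 is the recommended range)."""
    num_samples = min(num_samples, total_frames)
    step = total_frames / num_samples
    return [int(i * step) for i in range(num_samples)]

def build_prompt(question: str, options: dict[str, str]) -> str:
    """Assemble a multiple-choice prompt; layout beyond the final instruction line is illustrative."""
    lines = [question]
    lines += [f"{letter}. {text}" for letter, text in sorted(options.items())]
    lines.append("Please select A, B, C, or D.")
    return "\n".join(lines)

# A 68-minute video decoded at 1 fps yields roughly 4,100 frames.
indices = uniform_frame_indices(total_frames=4_100, num_samples=64)
prompt = build_prompt(
    "What happened at 29:30?",
    {"A": "...", "B": "...", "C": "...", "D": "..."},
)
```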
6. Research Implications and Future Directions
LVBench highlights fundamental open problems in automated long video understanding, notably deficits in entity tracking, temporal grounding, and causal reasoning over extended time-spans. Performance gaps motivate innovations in persistent memory architectures, specialized vision-language encoders, and training protocols better suited for multi-hour context aggregation.
The multi-capability, multi-domain scope of LVBench, combined with rigorous filtering and annotation pipelines, positions it as the reference benchmark for measuring and advancing the state of multimodal AI in scenarios demanding not just momentary perception, but sustained, integrated comprehension and reasoning over hours of video (Wang et al., 2024). Researchers are encouraged to develop novel modeling paradigms, extend evaluations to additional modalities, and contribute further data to benchmark the next generation of long-horizon systems.