
LVBench: Long Video Understanding Benchmark

Updated 31 March 2026
  • LVBench is an extreme long-video benchmark featuring 103 videos (~117 hours) to evaluate temporal reasoning, event localization, and information extraction.
  • It spans diverse domains such as sports, documentaries, and TV shows with 1,549 multiple-choice questions emphasizing detailed content analysis.
  • LVBench stresses models with long-range dependencies and memory bottlenecks, driving innovations such as modular reasoning and dynamic retrieval; the best reported methods reach accuracies up to 84.1%.

LVBench is an extreme long video understanding benchmark designed to evaluate the capacity of multimodal LLMs (MLLMs) and video-LLMs (VLMs) to perform temporal reasoning, event localization, and information extraction on hour-scale visual content. LVBench, introduced by Wang et al. in 2024, catalyzed a wave of research targeting the scalability and depth of video-language understanding by imposing requirements far beyond traditional short-clip datasets (Wang et al., 2024).

1. Benchmark Definition, Dataset Construction, and Task Structure

LVBench comprises 103 publicly sourced videos with a total duration of ≈117 hours, each video averaging 4,101 seconds (approximately 68 minutes) (Wang et al., 2024). The videos span six domains: sports, documentary, event record, lifestyle, TV shows, and cartoons, ensuring considerable content diversity. After a rigorous manual curation process for narrative coherence, visual clarity, and the absence of heavy dependence on audio cues, each selected video was annotated with an average of 24 multiple-choice questions per video hour.

In total, LVBench provides 1,549 multiple-choice question–answer pairs. Each question is paired with one correct response and three distractors, constructed specifically to require detailed long-range understanding and to ensure broad coverage across several capabilities:

  1. Temporal Grounding (TG): Locating specific moments or intervals.
  2. Summarization (Sum): Abstractive summarization of the entire video.
  3. Reasoning (Rea): Causal, emotional, intentional, or prospective inference.
  4. Entity Recognition (ER): Identification and association of entities, actions, or relationships.
  5. Event Understanding (EU): Scene, event, or category classification.
  6. Key Information Retrieval (KIR): Extraction of specific facts from the visual stream.

The principal evaluation metric is top-1 answer accuracy, defined for the multiple-choice protocol as \(\text{Accuracy} = \frac{\#\{\text{correct answers}\}}{\#\{\text{total questions}\}} \times 100\%\). Summarization, when present, is assessed via human judgment, but the core of the benchmark remains discriminative multiple-choice QA over extended video timescales.
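Under this protocol, scoring reduces to exact-match counting over predicted option letters. A minimal sketch (the function name and letter encoding are illustrative, not the official evaluation script):

```python
def lvbench_accuracy(predictions, answers):
    """Top-1 accuracy for multiple-choice QA: the percentage of questions
    where the predicted option letter matches the gold answer."""
    assert len(predictions) == len(answers)
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)

# e.g. 3 of 4 questions answered correctly -> 75.0
print(lvbench_accuracy(["A", "C", "B", "D"], ["A", "C", "B", "A"]))
```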

2. Motivations and Benchmarking Challenges

LVBench addresses the limitations of existing short-video datasets (e.g., TGIF-QA, MSRVTT-QA), which typically have clips under one minute and are often restricted in domain or event diversity (Wang et al., 2024). Real-world applications such as embodied intelligence, multi-hour commentary, and long-term decision-making require models to maintain persistent, updatable memory representations and to reason about temporally distributed events and entities.

The significant challenges posed by LVBench include:

  • Long-range temporal dependencies: Queries frequently span events or causal chains separated by thousands of frames.
  • Sparse clue retrieval: Many questions refer to single, short events embedded in a large volume of irrelevant or distracting information.
  • Catastrophic forgetting and information dilution: Models processing the entire sequence or employing naive downsampling tend to hallucinate, overload, or ignore key evidence.
  • Limitations of context length: Even state-of-the-art MLLMs/VLMs are often bounded by several thousand tokens, far less than needed to cover hour-long video at even moderate frame rates.

This context has made LVBench the canonical benchmark for evidence of robust long-video reasoning, event localization, and persistent entity tracking in the face of severe temporal and memory bottlenecks (Wang et al., 2024).

3. Baseline Methodologies and Model Performance

Early evaluations of LVBench demonstrated that off-the-shelf VLMs trained for short-clip QA perform poorly, with most fixed-frame models (e.g., LLaVA-Mini, LongLLaVA, PLLaVA, TimeChat, LLaMA-VID) achieving overall accuracy of roughly 27–32% or lower; LLaVA-Video (7B) reaches 23.9% and InternVL2.5 (8B) 41.8% (Wang et al., 2024, Pan et al., 2 Apr 2025).

More adaptive models and agentic architectures demonstrated substantial improvements. Later methods such as the following yielded new performance tiers on LVBench:

| Model / Method | Accuracy (%) | Reference |
|---|---|---|
| LLaVA-Video-7B | 42.0 | (Yamao et al., 16 Mar 2026) |
| Flash-VStream | 42.0 | (Yamao et al., 16 Mar 2026) |
| QViC-MF (1 fps) | 50.3 (+8.3) | (Yamao et al., 16 Mar 2026) |
| AdaReTaKe (Qwen2.5-VL-7B) | 51.2 (+5.9) | (Wang et al., 16 Mar 2025) |
| TimeSearch (InternVL2.5 8B) | 51.5 (+9.7) | (Pan et al., 2 Apr 2025) |
| video-SALMONN S (prompt-dep.) | 52.8 | (Sun et al., 13 Oct 2025) |
| ChronoForge-RL (7B) | 52.7 (+7.4–9.3) | (Chen, 19 Sep 2025) |
| VideoDeepResearch (Qwen2.5VL-7B) | 50.7 (+5.9) | (Yuan et al., 12 Jun 2025) |
| AVAS | 62.3 | (Yan et al., 1 May 2025) |
| DVD (Deep Video Discovery, no sub.) | 74.2 | (Yin et al., 20 Jan 2026) |
| Symphony | 71.8 (+5.0 vs DVD) | (Yan et al., 18 Mar 2026) |
| HAVEN (2 fps captioning) | 84.1 | (Yin et al., 20 Jan 2026) |

Accuracy figures either correspond to the main reported value or to the quoted improvement over prior open-source SoTA (values in parentheses). All methods are evaluated under the LVBench official protocol, with strict multiple-choice accuracy as the primary metric.

Several observations are apparent:

  • Simple uniform frame sampling and monolithic context increase accuracy only marginally with context size, and eventually saturate due to information overload or dilution (Arnab et al., 1 Jul 2025).
  • Modular and agentic frameworks (e.g., QViC-MF, Agentic Video Analytics, VideoDeepResearch, TimeSearch, LensWalk) use dynamic retrieval, memory feedback, and question-guided perception to address cue sparsity and the need for iterative evidence aggregation. This can yield gains of 8–13 points over the strongest static baselines.
  • Models that build persistent semantic indices (EKGs, hierarchical entity abstractions) and enable multi-stage retrieval (AVAS, HAVEN, Symphony) further boost performance, with HAVEN achieving state-of-the-art accuracy of 84.1% and excelling particularly in the reasoning (80.1%) and temporal grounding (88.2%) categories (Yin et al., 20 Jan 2026).

4. Advances in Long-Video Processing Strategies

Methodological innovations crucial to LVBench performance include:

Adaptive Visual Compression

  • AdaReTaKe utilizes temporally and layer-adaptive token pruning, allocating the compression budget dynamically to preserve semantically important frames and layers (Wang et al., 16 Mar 2025).
  • QViC-MF incorporates question-guided multimodal selective attention with memory feedback, focusing context on question-relevant information and retrieving prior relevant frames during iterative processing (Yamao et al., 16 Mar 2026).
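The common pattern behind these compressors, allocating a fixed frame or token budget to the most question-relevant content, can be illustrated with a simple top-k selection over frame–query similarity. This is a hedged sketch of the general idea, not the exact mechanism of AdaReTaKe or QViC-MF; the embedding shapes and scoring function are assumptions:

```python
import numpy as np

def select_frames(frame_embs: np.ndarray, query_emb: np.ndarray, budget: int):
    """Keep the `budget` frames most similar (cosine) to the question
    embedding, returned in temporal order. A stand-in for learned,
    adaptive compression that preserves question-relevant frames."""
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    scores = f @ q                        # relevance of each frame to the query
    keep = np.argsort(scores)[-budget:]   # top-k by relevance
    return np.sort(keep)                  # restore temporal order

rng = np.random.default_rng(0)
frames = rng.normal(size=(1000, 64))      # e.g. 1000 sampled frame embeddings
query = rng.normal(size=64)
idx = select_frames(frames, query, budget=32)
print(len(idx))                           # 32 frames now fit the context window
```

Real systems replace the static cosine score with learned, layer-wise attention and (in QViC-MF) feed retrieved frames back into subsequent iterations, but the budget-allocation skeleton is the same.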

Agentic and Modular Reasoning Pipelines

  • CAViAR and VideoDeepResearch execute agentic workflows, invoking modules (clip retrievers, segment locators, visual perceivers) in a “thought → action” loop. Tool calling is chained conditionally based on outputs, with performance reliant on the system’s ability to decompose questions and select modalities dynamically (Menon et al., 9 Sep 2025, Yuan et al., 12 Jun 2025).
  • Symphony employs a multi-agent cognitive framework, decomposing inference into discrete subtasks managed by collaborating agents for perception, attention, language, and planning, with reflection-based verification and adaptive tool invocation (Yan et al., 18 Mar 2026).
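The "thought → action" control flow shared by these pipelines can be sketched as a loop in which the model either invokes a tool or commits to an answer. The tool names, stopping rule, and stub components below are illustrative assumptions, not the interface of any cited system:

```python
def agent_answer(question, tools, llm, max_steps=8):
    """Minimal agentic loop: the LLM inspects the transcript, picks a tool
    (e.g. a clip retriever or segment locator), observes the result, and
    repeats until it emits a final answer or exhausts its step budget."""
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        name, arg = llm("\n".join(transcript))   # ("tool", args) or ("answer", text)
        if name == "answer":
            return arg
        observation = tools[name](arg)            # conditional tool chaining
        transcript.append(f"Action: {name}({arg}) -> {observation}")
    return llm("\n".join(transcript + ["Give your final answer."]))

# toy run with stub components (a real system calls an actual LLM/tools)
calls = {"n": 0}
def stub_llm(ctx):
    calls["n"] += 1
    return ("retrieve", "goal scene") if calls["n"] == 1 else ("answer", "B")
tools = {"retrieve": lambda q: f"clip@01:12:05 matching '{q}'"}
print(agent_answer("Who scored first?", tools, stub_llm))  # -> B
```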

Hierarchical and Memory-Augmented Retrieval

  • AVAS and HAVEN construct explicit event knowledge graphs and hierarchical video/scene/entity indices, enabling efficient tri-view (event/entity/frame) retrieval and Borda-aggregated evidence scoring across the video timeline (Yan et al., 1 May 2025, Yin et al., 20 Jan 2026).
  • video-SALMONN S enhances streaming LLM performance using a Hessian-free test-time-training memory module, preserving long-term evidence with a fixed memory budget while performing prompt-dependent context retrieval (Sun et al., 13 Oct 2025).
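Borda aggregation, as used for fusing evidence across the event/entity/frame retrieval views, is itself simple: each view ranks candidate segments, each rank awards points, and the totals decide the fused order. A minimal sketch (candidate names are illustrative; the cited systems layer this over learned retrievers):

```python
from collections import defaultdict

def borda_fuse(rankings):
    """Fuse several ranked candidate lists with a Borda count: a list of
    length n awards n - rank points to the candidate at each rank, and the
    point totals determine the fused ranking."""
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for rank, cand in enumerate(ranking):
            scores[cand] += n - rank
    return sorted(scores, key=scores.get, reverse=True)

event_view  = ["seg3", "seg7", "seg1"]   # ranking from event-graph retrieval
entity_view = ["seg7", "seg3", "seg9"]   # ranking from entity-index retrieval
frame_view  = ["seg7", "seg1", "seg3"]   # ranking from frame-level retrieval
print(borda_fuse([event_view, entity_view, frame_view])[0])  # -> seg7
```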

Inference-Time Context Curation

  • Temporal Chain of Thought (TCoT) iteratively invokes the model itself in a segment-wise fashion to propose, justify, and select the most relevant context frames—yielding 11.4-point gains at fixed context budgets via recursive evidence filtration (Arnab et al., 1 Jul 2025).
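The recursive filtration idea, scanning segment by segment, keeping model-endorsed frames, and re-trimming until the kept evidence fits the budget, can be sketched as follows. This is a loose structural analogy, not the TCoT algorithm itself; `relevant(frame)` stands in for a model call, and the trimming rule is a crude placeholder:

```python
def temporal_cot_select(segments, relevant, budget):
    """Segment-wise context curation: keep frames the model judges relevant
    in a first pass over each segment, then trim the survivors until they
    fit the fixed context budget. `relevant` is a stand-in for invoking
    the model to justify and select frames."""
    kept = [f for seg in segments for f in seg if relevant(f)]
    while len(kept) > budget:
        kept = kept[::2]   # coarse stand-in for a second filtering pass
    return kept

segments = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]     # frame ids per segment
print(temporal_cot_select(segments, lambda f: f % 2 == 0, budget=3))
# -> [0, 4, 8]
```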

5. Analysis of Ablations, Error Modes, and Advances

Ablation studies across these works confirm that performance gains on LVBench are linked specifically to:

  • Hierarchical or question-guided retrieval vs. uniform sampling: Reducing context to question-relevant events or frames is essential to overcome context window bottlenecks.
  • Memory/cue feedback: Memory feedback or iterative memory update (as in QViC-MF and video-SALMONN S) avoids both catastrophic forgetting and hallucination by reinforcing and reusing salient segments.
  • Reflection and verification modules: Reflection-based or critic modules filter hallucinations and rerank agentic strategies, dramatically improving robustness particularly on timestamp localization and reasoning tasks (Menon et al., 9 Sep 2025, Yan et al., 18 Mar 2026).
  • Scene abstraction and semantic grouping: Scene-localized frame grouping (SLFG) and cohesive entity representation (audio + visual) maintain global coherence and prevent entity drift, as highlighted in HAVEN and SceneQA (Yang et al., 5 Aug 2025, Yin et al., 20 Jan 2026).

Stable error modes include failures in fine-grained temporal localization, multi-hop reasoning where context is missed or overwhelmed by distractors, and degraded entity recognition when entity representations are not persistent across scenes. Recent agentic and hierarchical strategies outperform brute-force model scaling in alleviating these issues.

6. Impact, Limitations, and Future Directions

LVBench has established itself as the principal evaluation suite for hour-scale, open-domain video comprehension, influencing methodology development and benchmarking protocols for video-LLMs. Key impacts and open directions include:

  • Memory architecture research: Continued exploration of persistent memory integration, hierarchical attention, and feedback-based vision-language alignment remains an active area.
  • Agentic and compositional reasoning: Multi-agent architectures, iterative tool use, and planning-based context curation demonstrate outsized impact, especially as models generalize to multi-hour, multi-domain scenarios.
  • Expanded modalities and tasks: The LVBench setting catalyzes research into audio integration, richer annotation schema (beyond MCQ), event segmentation, generative evaluation, and internationalization.
  • Practical deployment: Inference-time efficiency, batching, and cost—though not the focus of most ablations—are nontrivial in real-world settings, with multi-agent and multi-tool methods incurring higher per-query latency.
  • Evaluation protocol expansion: While accuracy remains the dominant metric, new measures for temporal IOU, entity F1, and memory retention over extended horizons are under active consideration for future LVBench releases.

7. Summary Table: Representative Performance on LVBench

| Model (Year) | Major Methodology | Acc. (%) | Reference |
|---|---|---|---|
| LLaVA-Video-7B (2024) | Uniform sampling, 7B MLLM | 42.0 | (Yamao et al., 16 Mar 2026) |
| AdaReTaKe | Token compression, adaptive allocation | 51.2 | (Wang et al., 16 Mar 2025) |
| QViC-MF | QSMA, memory feedback | 50.3 | (Yamao et al., 16 Mar 2026) |
| TimeSearch | Hierarchical search, TAFR, reflection | 51.5 | (Pan et al., 2 Apr 2025) |
| video-SALMONN S | Streaming memory, prompt-dependency | 52.8 | (Sun et al., 13 Oct 2025) |
| ChronoForge-RL | TAD, RL frame selection | 52.7 | (Chen, 19 Sep 2025) |
| VideoDeepResearch | Agentic tool loop | 50.7–55.5 | (Yuan et al., 12 Jun 2025) |
| AVAS | EKG, agentic retrieval/generation | 62.3 | (Yan et al., 1 May 2025) |
| DVD | Large-scale offline caption + RAG | 74.2 | (Yin et al., 20 Jan 2026) |
| Symphony | Multi-agent, cognitive-inspired | 71.8 | (Yan et al., 18 Mar 2026) |
| HAVEN | Entity cohesion, hierarchy, agentic | 84.1 | (Yin et al., 20 Jan 2026) |

These results collectively demonstrate that the combination of modular, feedback-driven, and agentic approaches, together with event/entity hierarchy construction and multi-step retrieval, represents the current state of the art for extreme long-video understanding as measured by LVBench.


Key references:

  • Wang et al., 2024
  • Yamao et al., 16 Mar 2026
  • Wang et al., 16 Mar 2025
  • Pan et al., 2 Apr 2025
  • Sun et al., 13 Oct 2025
  • Chen, 19 Sep 2025
  • Yin et al., 20 Jan 2026
  • Yuan et al., 12 Jun 2025
  • Yan et al., 1 May 2025
  • Yan et al., 18 Mar 2026
  • Yang et al., 5 Aug 2025
  • Menon et al., 9 Sep 2025
  • Arnab et al., 1 Jul 2025
  • Li et al., 25 Mar 2026
  • Ren et al., 14 Mar 2025
