OVO-Bench: Online Video Benchmark
- OVO-Bench is a benchmark that evaluates online video understanding by measuring temporal reasoning capabilities in Video-LLMs.
- It assesses three distinct streaming regimes—backward tracing, real-time perception, and forward active responding—with specific evaluation protocols.
- The benchmark features 644 videos and 2,814 timestamped QA pairs across diverse domains such as sports, gaming, and instructional content.
OVO-Bench (Online-VideO-Benchmark) is a benchmark specifically designed to evaluate online video understanding in Video-LLMs, introducing the concept of temporal awareness—defined as the ability to reason about events with respect to specific timestamps during video stream processing. Unlike traditional offline video benchmarks that operate on the complete video, OVO-Bench focuses on real-world streaming conditions in which models must process videos incrementally and dynamically adapt their responses based on the query time, offering a rigorous framework to diagnose and advance the temporal reasoning capabilities of Video-LLMs (Li et al., 9 Jan 2025).
1. Temporal Awareness and Problem Formalization
Temporal awareness emerges as the key technical distinction between online and offline video reasoning. In an offline paradigm, a model has access to the entire video sequence when producing an answer. In contrast, OVO-Bench requires models to process a streaming input and generate responses conditioned on a query issued at an arbitrary time $t$. The formal requirement is for the model’s response to depend on both the query and the portion of the stream observed so far:

$$R_t = f\left(Q,\; V_{0:t}\right),$$

where $Q$ is the query, $V_{0:t}$ denotes the video content observed up to time $t$, and $R_t$ is the response issued at $t$.
This setting mandates that the model must reason over partial, dynamically unfolding content, and actively decide whether to reference the past, focus on the present, or withhold replies pending future evidence (forward responding).
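A minimal sketch of this causal constraint, assuming a hypothetical `model.answer` interface and a frame list sampled at a fixed `fps` (neither is specified by the benchmark):

```python
def answer_online(model, frames, fps, query, t_query):
    """Answer `query` posed at time `t_query` (seconds) using only the
    frames streamed so far; frames after t_query are never visible."""
    visible = frames[: int(t_query * fps) + 1]  # causal slice V[0 : t_query]
    return model.answer(visible, query)          # hypothetical Video-LLM call
```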
2. Temporal Reasoning Scenarios
OVO-Bench probes three distinct online temporal reasoning regimes, defined as follows (illustrated schematically after this list):
- Backward Tracing: answer questions by recalling events that occurred before the query time.
- Real-Time Perception: answer based on events unfolding in a recent window around the query time.
- Forward Active Responding: delay the response until enough future context has accumulated to answer correctly.
Collectively, these modes form a "Video Chain-of-Time" reasoning schema, functionally analogous to chain-of-thought reasoning in text LLMs but temporally grounded.
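The three regimes differ chiefly in where the decisive evidence lies relative to the query time. The following schematic uses invented example questions and a hypothetical `OnlineQuery` container, not items drawn from the benchmark:

```python
from dataclasses import dataclass

@dataclass
class OnlineQuery:
    question: str
    t_query: float        # timestamp (s) at which the question is posed
    evidence_window: str  # where the answer-relevant evidence lies

examples = [
    OnlineQuery("What did the chef add before the onions?", 300.0,
                "past: requires backward tracing over earlier frames"),
    OnlineQuery("What text is shown on the whiteboard right now?", 300.0,
                "present: real-time perception of the current window"),
    OnlineQuery("How many push-ups will the athlete complete in total?", 300.0,
                "future: defer answering until the repetition sequence ends"),
]
```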
3. Dataset Composition and Annotation Pipeline
The OVO-Bench dataset comprises 644 unique videos sourced from seven domains, including sports, video gaming, egocentric recordings, and instructional content. Video durations range from several minutes to roughly half an hour (mean query time: 428.9 s). The benchmark defines twelve tasks distributed across the three temporal regimes:
| Scenario | Tasks (selected) | #QA Pairs |
|---|---|---|
| Backward Tracing | Episodic Memory, Action Seq. ID, Hallucination Detection | — |
| Real-Time Perception | Spatial Understanding, Obj. Recognition, Attr. Recognition, OCR, etc. | — |
| Forward Active Responding | Repetition Event Count, Sequential Steps, Clue Reveal Responding | — |
| Total | 12 tasks | 2,814 |
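Each benchmark item pairs a question with the timestamp at which it is issued. A plausible record layout, shown only as an illustration (the field names are assumptions, not the released schema):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class OVOItem:
    video_id: str
    task: str                     # one of the 12 tasks, e.g. "Episodic Memory"
    scenario: str                 # "backward" | "realtime" | "forward"
    question: str
    options: Optional[List[str]]  # MCQ choices for backward / real-time tasks
    answer: str
    t_query: float                # timestamp (s) at which the query is issued
```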
Dataset annotation utilizes a hybrid procedure:
- Repurposing event-timestamped datasets (e.g., Ego4D QA, COIN, STAR)
- Semi-automatic annotation via prompting large video-capable LLMs (e.g., GPT-4o, Gemini-1.5-Pro)
- Human refinement of all automatically generated timestamps and questions
- Automated QA generation for “Real-Time” tasks, followed by rigorous human curation
- Rule-based and human-filtered distractor option generation for MCQs (a sketch of this step follows the list)

This hybrid pipeline yields 2,814 high-fidelity, timestamped question–answer pairs.
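A rough sketch of the rule-based distractor-filtering step; the concrete rules used in OVO-Bench are not stated here, so the heuristics (deduplication, minimum length) and the function name `filter_distractors` are assumptions:

```python
from typing import List

def filter_distractors(correct: str, candidates: List[str], max_options: int = 3) -> List[str]:
    """Illustrative rule-based filter for MCQ distractor options."""
    kept = []
    for cand in candidates:
        if cand.strip().lower() == correct.strip().lower():
            continue                    # drop duplicates of the correct answer
        if len(cand.split()) < 2:
            continue                    # drop trivially short options
        kept.append(cand)
    return kept[:max_options]           # human curation follows this automated step
```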
4. Evaluation Protocol and Metrics
Two core protocols implement OVO-Bench’s simulation of online operation (sketched in code below):
- Backward Tracing / Real-Time Perception: For each query issued at time $t$, evaluation restricts model access to the observed prefix $V_{0:t}$, enforcing causal streaming inference while enabling conventional multiple-choice (MCQ) scoring.
- Forward Active Responding: A dense multi-trigger polling protocol queries the model at successive timestamps after $t$. The model dynamically decides when it has sufficient information to respond, at which point the answer and the decision time are logged.
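A compact sketch of the two protocols, assuming a hypothetical model interface with `answer_mcq`, `ready_to_answer`, and `answer_free_form` methods and an item schema like the one above; the `poll_interval` and `horizon` values are placeholders:

```python
def eval_backward_or_realtime(model, frames, fps, item):
    """Causal MCQ scoring: the model only sees frames up to the query time."""
    visible = frames[: int(item.t_query * fps) + 1]
    pred = model.answer_mcq(visible, item.question, item.options)
    return pred == item.answer

def eval_forward_active(model, frames, fps, item, poll_interval=1.0, horizon=60.0):
    """Dense multi-trigger polling: after the query time, poll the model at
    successive timestamps until it declares it has enough evidence to answer."""
    t = item.t_query
    while t <= item.t_query + horizon:
        visible = frames[: int(t * fps) + 1]
        if model.ready_to_answer(visible, item.question):   # model decides when to reply
            pred = model.answer_free_form(visible, item.question)
            return pred == item.answer, t                   # answer plus decision time
        t += poll_interval
    return False, None                                      # no answer within the horizon
```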
Scoring uses accuracy as the primary metric:

$$\mathrm{Acc} = \frac{\text{number of correct responses}}{\text{number of queries}}.$$
For forward-active tasks, additional reward functions are introduced to incentivize both correctness and response timeliness: the repetition event count task employs a task-specific reward, while the other forward tasks share a common formulation. Both designs reward early, correct answers and penalize delayed or inaccurate responses, thereby measuring not only knowledge but also streaming inference capability.
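The exact reward formulas are not reproduced here. As a purely illustrative stand-in, a time-discounted correctness score of the following shape captures the intended behavior; the exponential schedule and the `half_life` constant are assumptions, not benchmark parameters:

```python
def timeliness_reward(correct: bool, t_answer: float, t_query: float,
                      half_life: float = 30.0) -> float:
    """Illustrative reward: 1.0 for an immediate correct answer, decaying
    toward 0 as the response is delayed; 0 for an incorrect answer."""
    if not correct:
        return 0.0
    delay = max(0.0, t_answer - t_query)
    return 0.5 ** (delay / half_life)
```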
5. Empirical Findings and Analysis
Comprehensive evaluation across nine Video-LLMs, two streaming-specific models, and human agents reveals substantial performance gaps:
| Model/System | Overall Acc. | Real-Time | Backward | Forward | Hallucination Detection |
|---|---|---|---|---|---|
| Human Agents | ~93% | 93.2% | 92.3% | 92.9% | 91.4% |
| Gemini 1.5 Pro | 65.3% | 70.8% | 62.3% | 57.2% | 52.7% |
| GPT-4o | 58.6% | — | — | — | — |
| Qwen2-VL-72B | 58.8% | — | — | — | — |
| Open-source Video-LLMs | 50–62% | — | — | <35% | — |
| Flash-VStream-7B | <30% | — | — | — | — |
| VideoLLM-online | <30% | — | — | — | — |
| GPT-4-turbo (blind) | 27.9% | — | — | — | — |
Human agents achieve an overall ~93% accuracy across all regimes. State-of-the-art offline and open Video-LLMs remain far behind, with the best model trailing by ~27 percentage points. Dedicated streaming models perform worse (<30%) than even a blind text-only baseline, and hallucination detection exposes severe reliability gaps.
This systematic deficit indicates that Video-LLMs trained and evaluated on static benchmarks do not acquire the temporal prioritization, memory management, or evidence-waiting behaviors essential for robust online comprehension.
6. Implications, Limitations, and Directions for Future Work
Key implications are threefold:
- Effective online video reasoning demands explicit temporal control, such as learnable gating or "wait-for-evidence" policies (sketched after this list), rather than merely expanding input context length.
- Memory architectures for selective backward retrieval and information forgetting are critical, as naively increasing context length fails to confer temporal reasoning.
- Rigorous assessment must move from static snapshot benchmarks to continuous-timeline protocols, as modeled in OVO-Bench, to reveal brittleness in streaming deployment.
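As a concrete illustration of such a "wait-for-evidence" policy (a hypothetical design, not one evaluated in the paper), a streaming wrapper could gate its answer on a self-reported confidence estimate exposed by an assumed `answer_with_confidence` method:

```python
def streaming_answer(model, frame_stream, question, threshold=0.8, poll_every=30):
    """Hypothetical wait-for-evidence loop: keep ingesting frames and answer
    only once the model's self-reported confidence crosses a threshold."""
    buffer = []
    for i, frame in enumerate(frame_stream):
        buffer.append(frame)
        if (i + 1) % poll_every != 0:
            continue
        answer, confidence = model.answer_with_confidence(buffer, question)
        if confidence >= threshold:
            return answer, i              # answer plus the frame index of the decision
    # stream ended before the threshold was reached: fall back to the best guess
    answer, _ = model.answer_with_confidence(buffer, question)
    return answer, len(buffer) - 1
```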
Limitations of OVO-Bench include:
- Sampling and annotation biases from source datasets
- High manual annotation costs constraining scalability
- Partial simulation of forward-responding evaluation for models trained in offline regimes
Future research trajectories prompted by OVO-Bench’s findings include:
- Developing adaptive chunking policies for dynamic context selection
- Memory-augmented or attentional mechanisms for event retrieval
- Self-supervised learning objectives tailored to “chain-of-time” reasoning
- Generalization to multi-camera, audio-integrated, or embodied robotic scenarios
OVO-Bench and accompanying code are released under a CC BY-NC-SA license (https://github.com/JoeLeelyf/OVO-Bench), enabling broad evaluation and development of Video-LLMs explicitly targeting temporal reasoning under streaming conditions (Li et al., 9 Jan 2025).