
OmniStar Dataset for Online Video Understanding

Updated 11 November 2025
  • OmniStar is a large-scale, expert-annotated video dataset featuring real-world streaming conditions and diverse video scenarios.
  • The dataset methodically categorizes over 20K YouTube videos into 15 scenarios and 46 subcategories, enabling robust model training and evaluation.
  • It supports five online video-language tasks with standardized benchmarks and metrics, such as SemCor and TimDiff, to assess model performance.

The OmniStar dataset is a large-scale, expert-annotated corpus developed for training and evaluating online video understanding models under realistic streaming conditions. Introduced in the context of the LiveStar framework, OmniStar encompasses a diverse set of YouTube live-stream and on-demand videos, methodically categorized into 15 real-world scenarios spanning 46 subcategories. The dataset provides standardized protocols and benchmarks for five distinct online evaluation tasks, each engineered to probe temporally grounded and interactive video-language comprehension. OmniStar is publicly released under the CC-BY-4.0 license, with detailed instructions for data acquisition and split reproduction.

1. Composition and Scale

OmniStar comprises a total of 20,137 live streaming video clips, each individually annotated by a professional annotation team. The source materials were sampled from YouTube, leveraging the platform’s native category system to ensure uniform scenario coverage. Durations range from short segments to extended streams exceeding ten minutes. All video data is processed at 3 FPS to emulate realistic streaming task constraints.

| Statistic | Value | Notes |
|---|---|---|
| Total videos | 20,137 | Expert-annotated |
| Training set | 19,137 | |
| Evaluation set | 1,000 (200 per task) | No overlap; single hold-out partition |
| Streaming rate | 3 FPS | Video files are downsampled |

This scale facilitates robust model training and validation over dynamically evolving, context-rich video environments.
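
To make the 3 FPS constraint concrete, the following is a minimal sketch of a streaming-style frame reader, assuming OpenCV (`cv2`) is available; the generator name and sampling logic are illustrative and not part of the released OmniStar/LiveStar code.

```python
import cv2  # assumed dependency; the official loaders may use a different decoder

def stream_frames(video_path: str, target_fps: float = 3.0):
    """Yield (timestamp_sec, frame) pairs at roughly `target_fps`,
    emulating the online setting in which frames arrive over time."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or target_fps
    step = max(int(round(native_fps / target_fps)), 1)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            yield idx / native_fps, frame
        idx += 1
    cap.release()
```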

2. Scenario and Subcategory Taxonomy

The dataset is structured around 15 high-level scenarios reflecting common live-streaming and video content domains. Each scenario is further divided into 2–4 subcategories, yielding 46 fine-grained labels in total.

  • Travel & Events: walking tours, event coverage (dynamic scenes)
  • Sports: competitive athletics, team sports (fast motion, object tracking)
  • Pets & Animals: wildlife, pet vlogs (organic motion)
  • Music: live concerts, performances (staged lighting; audio cues not used in annotation)
  • Autos & Vehicles: driving, traffic cameras (high-speed, structured scenes)
  • Film & Animation: edited animations, cartoons (synthetic, rapid cuts)
  • Nonprofits & Activism: charity events, protests (crowds, sign-reading)
  • Science & Technology: lab demonstrations, tech reviews (fine-grained objects)
  • Education: lectures, tutorials (static camera, slides)
  • Howto & Style: cooking, DIY (hand-object interactions)
  • News & Politics: reportage, talking heads (mixed domains)
  • Entertainment: TV shows, variety clips (multi-camera, graphics)
  • Comedy: sketch, standup (timing cues, audience)
  • People & Blogs: personal vlogs (handheld, informal framing)
  • Gaming: recorded gameplay (screen captures)

This taxonomy ensures comprehensive coverage of real-world video scenarios, supporting generalizable model evaluation across heterogeneous visual contexts.
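
As a quick check of that coverage, the sketch below tallies videos per scenario from the distributed metadata.csv (described in Section 5); the column name "scenario" is an assumption to be verified against the actual file headers.

```python
import csv
from collections import Counter

def scenario_distribution(metadata_path: str = "metadata.csv") -> Counter:
    """Count videos per high-level scenario to inspect taxonomy coverage.
    The "scenario" column name is assumed; check the released metadata.csv."""
    counts = Counter()
    with open(metadata_path, newline="") as f:
        for row in csv.DictReader(f):
            counts[row["scenario"]] += 1
    return counts
```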

3. Online Evaluation Tasks

OmniStar introduces five primary online video-language tasks, each tailored for streaming protocols and annotated with scenario-specific ground-truths.

| Task | Output Type | Ground Truth |
|---|---|---|
| Real-time Narration Generation (RNG) | Narration captions | Timestamped segments, storyline continuity |
| Online Temporal Grounding (OTG) | Grounded intervals | Query-based intervals $[t_{\rm start}, t_{\rm end}]$ |
| Frame-Level Dense QA (FDQ) | QA entries | Time-stamped sequence $(t_i, \text{answer}_i)$ |
| Contextual Online QA (COQ) | Multi-turn QA | Time-stamped QA turns, semantic linkage |
| Multi-turn Interactive QA (MIQ) | Parallel QA chains | Distinct, context-aware chains with timestamps |

Each online task restricts access to only observed frames, demanding real-time, temporally consistent video-language understanding.
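
A schematic sketch of the causal evaluation loop this constraint implies, reusing the `stream_frames` helper sketched in Section 1; the `OnlineModel` interface and its `decide` method are hypothetical names, not the LiveStar API.

```python
from typing import Optional, Protocol

class OnlineModel(Protocol):
    """Hypothetical interface: given only the frames seen so far, the model
    either emits a response or stays silent at the current timestamp."""
    def decide(self, timestamp: float, frame) -> Optional[str]: ...

def run_online(model: OnlineModel, video_path: str):
    """Feed frames strictly in temporal order; the model never sees future frames."""
    responses = []
    for ts, frame in stream_frames(video_path, target_fps=3.0):
        out = model.decide(ts, frame)
        if out is not None:
            responses.append((ts, out))
    return responses
```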

4. Annotation Methodology

Annotation follows a semi-automated, temporally dense workflow. Each video was split into semantic clips, and each clip received one or more paraphrased captions. For QA supervision:

  • MIQ & COQ: annotators crafted multi-turn QA chains, preserving semantic continuity and contextual dependencies.
  • FDQ: annotators monitored visual state changes, updating answers accordingly.

All annotations were completed by 20 expert annotators, using an interface built atop YouTube’s timestamping functionality. Guidelines specified precise timestamping and semantic boundaries for all tasks. The paper does not report inter-annotator agreement, which suggests reproducibility assessments are left for future work.
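
The QA supervision described above can be pictured as time-stamped chains of turns. The dataclasses below are purely illustrative (all field names are hypothetical) and only convey the shape of MIQ/COQ annotations, not their actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class QATurn:
    """One question-answer turn anchored to a video timestamp (illustrative)."""
    timestamp: float
    question: str
    answer: str

@dataclass
class QAChain:
    """A multi-turn chain preserving contextual dependencies, as crafted for
    MIQ/COQ; field names are hypothetical, not the released schema."""
    video_id: str
    turns: List[QATurn] = field(default_factory=list)
```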

5. Data Organization and Format

Data is distributed via the associated GitHub repository and organized as follows:

  • videos/: Contains ⟨video_id⟩.mp4 streams at 3 FPS.
  • annotations/: Includes rng.json (RNG captions), otg.json (OTG intervals per query), fdq.json, coq.json, miq.json (QA entries).
  • metadata.csv: Tabulates video ID, scenario, subcategory, duration, and number of clips.

A plausible implication is that this format supports efficient streaming, incremental loading, and fine-grained benchmark evaluation.
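
Under those assumptions, loading one task's annotations together with the video metadata might look like the sketch below; the internal JSON structure and the "video_id" column name are assumptions to verify against the repository.

```python
import csv
import json
from pathlib import Path

DATA_ROOT = Path("OmniStar")  # assumed local layout mirroring the repository

def load_task(task: str = "rng"):
    """Return (annotations, metadata) for one of rng/otg/fdq/coq/miq.
    The JSON schema is not specified here; inspect the files to confirm."""
    with open(DATA_ROOT / "annotations" / f"{task}.json") as f:
        annotations = json.load(f)
    with open(DATA_ROOT / "metadata.csv", newline="") as f:
        metadata = {row["video_id"]: row for row in csv.DictReader(f)}
    return annotations, metadata
```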

6. Benchmarks, Metrics, and Baselines

OmniStar standardizes five primary online metrics:

  1. Timing Difference (TimDiff):

$$\text{TimDiff} = \frac{1}{N}\sum_{i=1}^{N}\bigl|\,\hat t_i - t_i^*\bigr|$$

where $\hat t_i$ is the predicted response time and $t_i^*$ the ground-truth time; missing responses are penalized with the full clip duration.

  2. Timing Redundancy (TimRedun): Mean number of unnecessary responses per semantic clip (lower is better).

  3. Timing Coverage (TimCover):

$$\text{TimCover} = \frac{\#\{\text{clips with } \ge 1 \text{ correct response}\}}{\#\{\text{total clips}\}}$$

Measures recall (higher is better).

  4. Semantic Correctness (SemCor): GPT-4o-based scoring along three axes (Semantic Accuracy, Language Quality, Information Completeness), each on a 0–10 scale:

$$\text{SemCor} = \frac{\text{Acc} + \text{Qual} + \text{Compl}}{3}$$

  5. Summary Fluency (SumFluen): Averaged GPT-4o score over five textual criteria (each 0–10).

Offline benchmarks add Perplexity (PPL), Token Accuracy (TokAcc), and grammatical fluency metrics.
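
A minimal sketch of the timing and semantic scores as defined above; per-clip matching of predictions to ground truth and the GPT-4o judging pipeline are omitted, and `None` marks a missing response.

```python
from statistics import mean

def tim_diff(pred_times, gt_times, clip_duration):
    """Mean absolute timing error; a missing prediction (None) is
    penalized with the full clip duration."""
    errors = [clip_duration if p is None else abs(p - g)
              for p, g in zip(pred_times, gt_times)]
    return mean(errors)

def tim_cover(correct_counts_per_clip):
    """Fraction of semantic clips with at least one correct response."""
    clips = list(correct_counts_per_clip)
    return sum(1 for n in clips if n >= 1) / len(clips)

def sem_cor(acc, qual, compl):
    """Average of the three judge-scored axes (each 0-10)."""
    return (acc + qual + compl) / 3
```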

Notable baseline performances (RNG online eval, SemCor/TimDiff):

| Model | SemCor | TimDiff |
|---|---|---|
| VideoLLM-online | 1.68 | 2.67 |
| VideoLLM-MoD | 1.66 | 2.54 |
| MMDuet | 1.93 | 2.32 |
| LiveStar | 3.19 | 1.91 |

LiveStar achieves a +19.5% relative gain in SemCor and an 18.1% TimDiff reduction averaged over all tasks, at 3.82 FPS (12% faster than the next best) (Yang et al., 7 Nov 2025).

7. Licensing, Distribution, and Reproducibility

OmniStar, together with the LiveStar modeling codebase, is released via GitHub under a CC-BY-4.0 license. The repository provides comprehensive instructions for raw video acquisition, annotation processing, and train/eval split replication. The implementation is based on Python and PyTorch and builds on the InternVideo2.5 and InternLM2.5 frameworks.

Public accessibility supports downstream model development, cross-method benchmarking, and future protocol extension. The dataset’s scale, scenario diversity, coherent annotation workflows, and rigorous metrics position OmniStar as a foundation for advancing online video understanding research.
