Streamo-Instruct-465K Video Dataset

Updated 29 December 2025
  • Streamo-Instruct-465K is a large-scale, multi-task instruction tuning dataset designed for real-time, interactive video understanding.
  • It leverages 465K instruction–video pairs from ~1,150 hours of diverse public videos, supporting tasks like narration, event grounding, and time-sensitive QA.
  • The dataset enhances streaming video models by enabling unified multi-turn dialogue training through an LLM-driven, automated annotation pipeline.

Streamo-Instruct-465K is a large-scale, multi-task instruction tuning dataset specifically constructed for continuous, streaming video understanding. Comprising 465,000 instruction–video pairs drawn from 135,875 public videos, it provides unified multi-task supervision tailored for real-time, interactive video models. Streamo-Instruct-465K enables the training of video large language models (video LLMs) capable of temporally grounded narration, action and event captioning, temporal event grounding, and time-sensitive question answering, thus supporting a broad set of streaming video applications and dynamic user interactions (Xia et al., 24 Dec 2025).

1. Dataset Construction and Structure

Streamo-Instruct-465K sources videos from nine open-access corpora (Koala, LLaVA-Video, ActivityNet, QVHighlight, YouCook2, HACS, EgoTimeQA, DiDeMo, and COIN), covering domains such as cooking, procedural instruction, egocentric video, sports, and general human activity. The dataset encompasses a combined ~1,150 hours of video, with individual clips ranging from a few seconds to several minutes (median length ~30 s).

A unified annotation pipeline leverages LLMs for automatic instruction generation, segmenting each video into 1-second windows (sampled at 1 fps). Five streaming task types are formulated:

  • Real-time Narration: Per-second commentary on frame-to-frame changes using multi-window context, post-processed for redundancy and coherence by secondary LLMs.
  • Event Caption: Segment-level captioning via temporally grounded sampling, retaining only those samples where segment spans overlap consistently.
  • Action Caption: Filtering the event caption pipeline for action-centric prompts, generating step-level procedural annotations.
  • Event Grounding: Conversion of event captions into temporal grounding prompts, labeled with precise onset and offset timestamps.
  • Time-Sensitive QA (TSQA): Question-answer pairs where the answer changes at annotated temporal landmarks, supporting dynamic, evolving Q&A grounded in temporal video events.

Each task is represented using a consistent prompt template and temporal marker format (<Xs–Ys>), ensuring alignment between instructions and annotated video segments.
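
The released serialization format is not reproduced in this summary; the snippet below is a hypothetical Python sketch of what a single event-grounding record might look like, assuming a simple key–value layout and the <Xs–Ys> marker convention described above (all field names and values are illustrative).

    # Hypothetical event-grounding record; field names and values are
    # illustrative, not taken from the released dataset files.
    sample = {
        "video_id": "activitynet_v_00123",   # assumed identifier scheme
        "task": "event_grounding",
        "fps": 1,                            # videos are processed in 1-second windows
        "instruction": "When does the person start chopping the onion?",
        "response": "<12s–18s> The person chops the onion on the cutting board.",
    }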

2. Task Taxonomy and Sample Distribution

Streamo-Instruct-465K’s instruction–response pairs are distributed approximately as follows:

Task Type                   Number of Samples   Percentage of Total
Real-time Narration         140,000             ~30%
Event Caption               110,000             ~24%
Action Caption              85,000              ~18%
Event Grounding             70,000              ~15%
Time-Sensitive QA           60,000              ~13%
Offline QA (LLaVA-Video)    65,000

The five streaming task types together account for the 465,000 instruction–video pairs; the 65,000 offline QA samples drawn from LLaVA-Video are additional supervision on top of that total.

The class distribution for streaming decision-state labels (<Silence>:<Standby>:<Response>) is heavily imbalanced at roughly 12:3:2. Instructions cover a spectrum from brief observation to complex temporal reasoning, as illustrated by the example:

USER: “Notify me when the light turns green.”
ASSISTANT (frames 4s–5s): <Response> The light just turned green.

Multiple streaming-style tasks are annotated per video, enabling unified multi-task training with diverse supervision signals (Xia et al., 24 Dec 2025).
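
The per-second decision-state serialization is likewise not spelled out here; the following hypothetical sketch shows how the light-turning-green interaction above might unfold at 1-second granularity, with <Silence>, <Standby>, and <Response> attached to each window (the placement of the non-response states is an assumption for illustration).

    # Hypothetical per-second decision-state trace for the example above.
    # The assignment of <Silence> and <Standby> is assumed, not taken from the dataset.
    stream = [
        (1, "<Silence>", ""),
        (2, "<Silence>", ""),
        (3, "<Standby>", ""),                                # assumed: relevant change imminent
        (4, "<Response>", "The light just turned green."),
        (5, "<Silence>", ""),
    ]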

3. Annotation Methodology and Quality Control

The annotation pipeline is fully automated and LLM-driven, without manual annotators or inter-annotator disagreement management. Each annotation task employs task-specific prompting and robust post-processing:

  • Templates and temporal markers ensure instruction consistency.
  • GLM-4.5 is used for deduplication and smoothing of real-time narration.
  • Consistent overlapping of event segments is required for event and action captions.
  • Automated cross-model validation and redundancy filtering serve as primary quality filters.

Because the pipeline involves no human annotators, there are no inter-annotator disagreements to report; quality is instead enforced through heuristics, span-consistency checks, and multi-model comparison.
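
No concrete thresholds are reported for the span-consistency check; the sketch below illustrates one plausible form of it, assuming captions are kept only when segment spans from independent annotation passes overlap above a fixed IoU threshold (the 0.5 value is an assumption).

    # Hypothetical span-consistency filter for event/action captions.
    # Spans are (start, end) in seconds; the 0.5 threshold is assumed.
    def span_iou(a, b):
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = (a[1] - a[0]) + (b[1] - b[0]) - inter
        return inter / union if union > 0 else 0.0

    def is_consistent(span_a, span_b, threshold=0.5):
        """Keep a caption only if two annotation passes agree on its span."""
        return span_iou(span_a, span_b) >= threshold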

4. Integration into Model Training and Mathematical Formulations

All samples are formatted as interleaved, multi-turn dialogue with 1-second response granularity. Temporal tags are prepended to video-feature tokens. The model training pipeline consists of:

  • Vision encoder kept frozen; connector and LLM components are fine-tuned end-to-end.
  • A single epoch over the entire dataset (batch size 512, learning rate 1e-5).
  • Unified multi-task optimization through exhaustive shuffling; no explicit task curriculum (a minimal configuration sketch follows below).
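
As one concrete reading of the list above, the sketch below mirrors the described setup in PyTorch: the vision encoder is frozen, the connector and LLM are updated, and a single exhaustively shuffled epoch is run with batch size 512 and learning rate 1e-5. The toy model, module names, and placeholder batches are hypothetical, not the authors' code.

    # Minimal, hypothetical PyTorch sketch of the described training setup.
    # Only the freeze/optimize pattern and hyperparameters follow the text.
    import torch
    import torch.nn as nn

    class StreamingVLM(nn.Module):
        def __init__(self, d_vis=32, d_llm=64):
            super().__init__()
            self.vision_encoder = nn.Linear(d_vis, d_vis)   # kept frozen
            self.connector = nn.Linear(d_vis, d_llm)        # fine-tuned
            self.llm = nn.Linear(d_llm, d_llm)              # fine-tuned

        def forward(self, frames):
            with torch.no_grad():
                feats = self.vision_encoder(frames)         # frozen visual features
            return self.llm(self.connector(feats))

    model = StreamingVLM()
    for p in model.vision_encoder.parameters():
        p.requires_grad = False

    optimizer = torch.optim.AdamW(
        list(model.connector.parameters()) + list(model.llm.parameters()),
        lr=1e-5,
    )

    # Single epoch over an exhaustively shuffled multi-task stream (toy batches here).
    for _ in range(3):                                      # placeholder for the real loader
        frames = torch.randn(512, 32)                       # batch size 512
        loss = model(frames).pow(2).mean()                  # placeholder objective
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()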

The decision-state prediction leverages a focal cross-entropy loss to rebalance sparse <Response> labels:

w_\text{focal}(x_i) = (1 - p_{c_i})^{\gamma}, \quad \gamma = 2

\alpha_k = \frac{1}{|S|} \sum_{j \in S} \frac{n_j}{n_k}

L_i = \alpha_{t_i} \cdot w_\text{focal}(i) \cdot \mathrm{CE}(z_i, t_i), \quad \text{if } t_i \in S

L_\text{total} = \frac{1}{|M|} \sum_{i \in M} L_i
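
The formulas above translate directly into a short PyTorch function; the sketch below is a hypothetical implementation, with the roughly 12:3:2 Silence/Standby/Response counts used only as an example.

    # Hypothetical implementation of the focal, class-rebalanced decision-state loss.
    import torch
    import torch.nn.functional as F

    def decision_state_loss(logits, targets, class_counts, gamma=2.0):
        """logits: (M, K) scores z_i; targets: (M,) states t_i; class_counts: (K,) n_k."""
        n = class_counts.float()
        alpha = n.sum() / (n * len(n))                           # alpha_k = (1/|S|) sum_j n_j / n_k
        p_t = F.softmax(logits, dim=-1).gather(1, targets.unsqueeze(1)).squeeze(1)
        w_focal = (1.0 - p_t) ** gamma                           # (1 - p_{c_i})^gamma, gamma = 2
        ce = F.cross_entropy(logits, targets, reduction="none")  # CE(z_i, t_i)
        return (alpha[targets] * w_focal * ce).mean()            # L_total = (1/|M|) sum_i L_i

    # Example with the roughly 12:3:2 Silence/Standby/Response imbalance.
    counts = torch.tensor([12.0, 3.0, 2.0])
    loss = decision_state_loss(torch.randn(8, 3), torch.randint(0, 3, (8,)), counts)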

Temporal grounding is quantitatively evaluated using mean Intersection over Union (mIoU) between predicted and ground-truth event intervals:

\text{IoU}_i = \frac{\operatorname{length}\left(\operatorname{intersection}(t_i^{\text{pred}}, t_i^{\text{gt}})\right)}{\operatorname{length}\left(\operatorname{union}(t_i^{\text{pred}}, t_i^{\text{gt}})\right)}

\text{mIoU} = \frac{1}{N} \sum_{i=1}^{N} \text{IoU}_i
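
For reference, the same interval IoU used in the consistency-filter sketch above doubles as the evaluation metric; a minimal illustrative computation (interval endpoints in seconds):

    # Minimal IoU / mIoU over 1-D temporal intervals (start, end) in seconds.
    def interval_iou(pred, gt):
        inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
        union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
        return inter / union if union > 0 else 0.0

    def mean_iou(pred_spans, gt_spans):
        ious = [interval_iou(p, g) for p, g in zip(pred_spans, gt_spans)]
        return sum(ious) / len(ious)

    # e.g. mean_iou([(12.0, 18.0)], [(11.0, 19.0)]) == 0.75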

Ablation studies confirm that focal loss is critical for recovering rare response states in real-time predictions (Xia et al., 24 Dec 2025).

5. Benchmark Impact and Use Cases

Substituting Streamo-Instruct-465K for prior instruction-tuning datasets (e.g., ET-Instruct-164K) yields a +11.8% overall improvement on OVO-Bench and +7.1% on its forward-responding tasks. Models trained on Streamo-Instruct-465K achieve 2–4% absolute accuracy gains on offline video understanding suites such as MVBench, TempCompass, VideoMME, and LongVideoBench. These results indicate substantive improvements in temporal reasoning and multi-task generalization across streaming and traditional video benchmarks.

Typical application scenarios include:

  • Live narration and commentary of streaming video content
  • Real-time detection and reporting of temporally grounded events
  • On-the-fly, evolving question answering as video context changes
  • Interactive assistants that can anchor instructions and queries in dynamic visual streams

A plausible implication is that Streamo-Instruct-465K enables unified, interactive multimodal assistants that more closely approximate continuous, intelligent video understanding (Xia et al., 24 Dec 2025).

6. Limitations and Future Directions

Limitations of Streamo-Instruct-465K include its strong dependence on fully automated, LLM-based annotation, which may not capture domain diversity uniformly across video types. Average streaming context lengths per sample remain modest (~30 s), suggesting future work on scaling to longer or lower-latency streams. The dataset's heavy task and label imbalance poses challenges for less frequent real-time response states, which are only partially mitigated by loss reweighting.

Future directions indicated in experimental reports involve:

  • Expanding coverage to more complex or longer streaming scenarios
  • Optimizing for lower-latency, higher-throughput deployment
  • Exploring human-in-the-loop and hybrid annotation pipelines for broader generalization
  • Integrating planning or adaptive prompt mechanisms for improved real-time performance

Streamo-Instruct-465K represents the first large-scale, unified temporal instruction-following resource for streaming video understanding, providing the baseline for end-to-end LLM training in interactive, temporally annotated video domains (Xia et al., 24 Dec 2025).
