Streamo-Instruct-465K Video Dataset
- Streamo-Instruct-465K is a large-scale, multi-task instruction tuning dataset designed for real-time, interactive video understanding.
- It leverages 465K instruction–video pairs from ~1,150 hours of diverse public videos, supporting tasks like narration, event grounding, and time-sensitive QA.
- The dataset enhances streaming video models by enabling unified multi-turn dialogue training through an LLM-driven, automated annotation pipeline.
Streamo-Instruct-465K is a large-scale, multi-task instruction tuning dataset specifically constructed for continuous, streaming video understanding. Comprising 465,000 instruction–video pairs drawn from 135,875 public videos, it provides unified multi-task supervision tailored for real-time, interactive video models. Streamo-Instruct-465K enables the training of video LLMs capable of temporally grounded narration, action and event captioning, temporal event grounding, and time-sensitive question answering, thus supporting a broad set of streaming video applications and dynamic user interactions (Xia et al., 24 Dec 2025).
1. Dataset Construction and Structure
Streamo-Instruct-465K sources videos from nine open-access corpora—Koala, LLaVA-Video, ActivityNet, QVHighlight, YouCook2, HACS, EgoTimeQA, DiDeMo, and COIN—covering domains such as cooking, procedural instruction, egocentric perspective, sports, and general human activity. The dataset encompasses a combined ~1,150 hours of video, with individual clips ranging from a few seconds to several minutes (median length ~30 s).
A unified annotation pipeline leverages LLMs for automatic instruction generation, segmenting each video into 1-second windows (1 fps). Diverse tasks are formulated:
- Real-time Narration: Per-second commentary on frame-to-frame changes using multi-window context, post-processed for redundancy and coherence by secondary LLMs.
- Event Caption: Segment-level captioning via temporally grounded sampling, retaining only those samples where segment spans overlap consistently.
- Action Caption: Filtering the event caption pipeline for action-centric prompts, generating step-level procedural annotations.
- Event Grounding: Conversion of event captions into temporal grounding prompts, labeled with precise onset and offset timestamps.
- Time-Sensitive QA (TSQA): Question-answer pairs where the answer changes at annotated temporal landmarks, supporting dynamic, evolving Q&A grounded in temporal video events.
Each task is represented using a consistent prompt template and temporal marker format (<Xs–Ys>), ensuring alignment between instructions and annotated video segments.
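As a concrete illustration of this format, one event-grounding instance and one narration instance might be serialized as sketched below; the field names and layout are assumptions for exposition, not the dataset's published schema.

```python
# Hypothetical serialization of Streamo-Instruct-465K-style samples.
# Field names are illustrative assumptions, not the released schema.
event_grounding_sample = {
    "video_id": "hacs_000123",            # hypothetical source-video identifier
    "task": "event_grounding",
    "instruction": "When does the person start slicing the onion?",
    # Temporal markers follow the <Xs-Ys> convention described above:
    "response": "<12s-27s> The person slices the onion on the cutting board.",
}

narration_sample = {
    "video_id": "youcook2_004567",
    "task": "real_time_narration",
    # One commentary turn per 1-second window (1 fps sampling):
    "turns": [
        {"time": "<0s-1s>", "response": "A pan is placed on the stove."},
        {"time": "<1s-2s>", "response": "Oil is poured into the pan."},
    ],
}
```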
2. Task Taxonomy and Sample Distribution
Streamo-Instruct-465K’s instruction–response pairs are distributed approximately as follows; the five streaming tasks sum to the 465K core samples, while the Offline QA subset drawn from LLaVA-Video is supplemental and excluded from the percentage base:
| Task Type | Number of Samples | Percentage of Total |
|---|---|---|
| Real-time Narration | 140,000 | ~30% |
| Event Caption | 110,000 | ~24% |
| Action Caption | 85,000 | ~18% |
| Event Grounding | 70,000 | ~15% |
| Time-Sensitive QA | 60,000 | ~13% |
| Offline QA (LLaVA-Video) | 65,000 | — |
The class distribution for streaming decision-state labels (<Silence>:<Standby>:<Response>) is heavily imbalanced at roughly 12:3:2. Instructions cover a spectrum from brief observation to complex temporal reasoning, as illustrated by the example:
USER: “Notify me when the light turns green.”
ASSISTANT (frames 4s–5s): <Response> The light just turned green.
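A minimal sketch of how such a per-second streaming trace and its decision-state imbalance could be handled is given below; the schema and the inverse-frequency weighting are illustrative assumptions, not the dataset's released format.

```python
# Illustrative per-second trace for the traffic-light example above.
# The schema and field names are assumptions for exposition only.
streaming_trace = [
    {"t": "<0s-1s>", "state": "<Silence>"},   # nothing relevant yet
    {"t": "<1s-2s>", "state": "<Silence>"},
    {"t": "<2s-3s>", "state": "<Standby>"},   # light is about to change
    {"t": "<3s-4s>", "state": "<Standby>"},
    {"t": "<4s-5s>", "state": "<Response>", "text": "The light just turned green."},
]

# The reported ~12:3:2 <Silence>:<Standby>:<Response> ratio implies
# inverse-frequency class weights roughly proportional to:
ratio = {"<Silence>": 12, "<Standby>": 3, "<Response>": 2}
total = sum(ratio.values())
weights = {k: total / (len(ratio) * v) for k, v in ratio.items()}
print(weights)  # {'<Silence>': 0.47, '<Standby>': 1.89, '<Response>': 2.83} (approx.)
```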
Multiple streaming-style tasks are annotated per video, enabling unified multi-task training with diverse supervision signals (Xia et al., 24 Dec 2025).
3. Annotation Methodology and Quality Control
The annotation pipeline is fully automated and LLM-driven, without manual annotators or inter-annotator disagreement management. Each annotation task employs task-specific prompting and robust post-processing:
- Templates and temporal markers ensure instruction consistency.
- GLM-4.5 is used for deduplication and smoothing of real-time narration.
- Consistent overlapping of event segments is required for event and action captions.
- Automated cross-model validation and redundancy filtering serve as primary quality filters.
No human-annotator disagreements are reported; quality is instead enforced through heuristics, span-consistency checks, and multi-model comparison.
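One way such a span-consistency check could be implemented is sketched below, assuming two LLM annotation passes produce candidate segments and only pairs with sufficient temporal overlap are retained; the IoU threshold of 0.5 is an illustrative assumption, not the paper's stated criterion.

```python
def interval_iou(a: tuple[float, float], b: tuple[float, float]) -> float:
    """IoU of two temporal spans (start, end), in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def keep_consistent(pass_a, pass_b, threshold=0.5):
    """Keep segments from pass_a that overlap some pass_b segment above threshold."""
    return [seg for seg in pass_a
            if any(interval_iou(seg, other) >= threshold for other in pass_b)]

# Example: two annotation passes over the same video.
pass_a = [(12.0, 27.0), (40.0, 55.0)]
pass_b = [(11.0, 26.0), (80.0, 90.0)]
print(keep_consistent(pass_a, pass_b))  # -> [(12.0, 27.0)]
```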
4. Integration into Model Training and Mathematical Formulations
All samples are formatted as interleaved, multi-turn dialogue with 1-second response granularity. Temporal tags are prepended to video-feature tokens. The model training pipeline consists of:
- Vision encoder kept frozen; connector and LLM components are fine-tuned end-to-end.
- A single epoch over the entire dataset (batch size 512, learning rate 1e-5).
- Unified multi-task optimization through exhaustive shuffling of the mixed samples; no explicit task curriculum (a minimal configuration sketch follows this list).
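A minimal PyTorch-style sketch of this freeze-and-fine-tune setup is shown below; the toy module stands in for the real architecture, and all layer sizes and the dummy loss are illustrative assumptions.

```python
import torch
from torch import nn

# Minimal stand-in for the streaming video LLM; only the freeze/fine-tune
# pattern described above is reproduced, not the real architecture.
class ToyStreamingVLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(768, 768)   # placeholder for the frozen vision encoder
        self.connector = nn.Linear(768, 1024)       # projector, fine-tuned
        self.llm = nn.Linear(1024, 1024)            # placeholder for the LLM, fine-tuned

    def forward(self, frames):
        return self.llm(self.connector(self.vision_encoder(frames)))

model = ToyStreamingVLM()

# Vision encoder stays frozen; connector and LLM are trained end-to-end.
for p in model.vision_encoder.parameters():
    p.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-5)   # learning rate from the reported setup

# Single pass over an exhaustively shuffled multi-task mixture (batch size 512).
for _ in range(3):                                   # stand-in for the data loader
    frames = torch.randn(512, 768)
    loss = model(frames).pow(2).mean()               # dummy loss for illustration
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```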
The decision-state prediction leverages a focal cross-entropy loss to rebalance the sparse <Response> labels; in its standard form, with the weighting and focusing parameters $\alpha_t$ and $\gamma$ left unspecified here, the loss is:
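$$
\mathcal{L}_{\text{focal}} = -\,\alpha_t\,(1 - p_t)^{\gamma}\,\log p_t
$$

where $p_t$ is the predicted probability of the ground-truth decision state for a given 1-second turn, $\gamma > 0$ down-weights easy majority-class (<Silence>) predictions, and $\alpha_t$ is a per-class weight.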
Temporal grounding is quantitatively evaluated using mean Intersection over Union (mIoU) between predicted and ground-truth event intervals; in the standard interval form, over $N$ grounding queries with predicted span $\hat{I}_i$ and ground-truth span $I_i$:
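$$
\mathrm{mIoU} = \frac{1}{N} \sum_{i=1}^{N} \frac{|\hat{I}_i \cap I_i|}{|\hat{I}_i \cup I_i|}
$$

where $|\cdot|$ denotes temporal duration in seconds.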
Ablation studies confirm that focal loss is critical for recovering rare response states in real-time predictions (Xia et al., 24 Dec 2025).
5. Benchmark Impact and Use Cases
Substituting Streamo-Instruct-465K for prior instruction-tuning datasets (e.g., ET-Instruct-164K) yields +11.8% overall and +7.1% on forward-responding tasks on OVO-Bench. Models trained on Streamo-Instruct-465K achieve 2–4% absolute accuracy gains on offline video understanding suites such as MVBench, TempCompass, VideoMME, and LongVideoBench. These results indicate substantive improvements in temporal reasoning and multi-task generalization across streaming and traditional video benchmarks.
Typical application scenarios include:
- Live narration and commentary of streaming video content
- Real-time detection and reporting of temporally grounded events
- On-the-fly, evolving question answering as video context changes
- Interactive assistants that can anchor instructions and queries in dynamic visual streams
A plausible implication is that Streamo-Instruct-465K enables unified, interactive multimodal assistants that more closely approximate continuous, intelligent video understanding (Xia et al., 24 Dec 2025).
6. Limitations and Future Directions
Limitations of Streamo-Instruct-465K include strong dependence on LLM-based annotation pipelines, whose coverage of domain diversity may vary across video types. Average streaming context lengths per sample remain modest (~30 s), suggesting future work on scaling to longer or lower-latency streams. The dataset's heavy task and label imbalance poses challenges for the less frequent real-time response states, which are only partially mitigated by loss reweighting.
Future directions indicated in experimental reports involve:
- Expanding coverage to more complex or longer streaming scenarios
- Optimizing for lower-latency, higher-throughput deployment
- Exploring human-in-the-loop and hybrid annotation pipelines for broader generalization
- Integrating planning or adaptive prompt mechanisms for improved real-time performance
Streamo-Instruct-465K represents the first large-scale, unified temporal instruction-following resource for streaming video understanding, providing the baseline for end-to-end LLM training in interactive, temporally annotated video domains (Xia et al., 24 Dec 2025).