
Streamo: Real-Time Video Streaming LLM

Updated 27 December 2025
  • Streamo is an integrated real-time streaming video LLM system that unifies interactive narration, captioning, and event grounding over continuous video streams.
  • It employs a single next-token prediction strategy with response state tokens to synchronize timing and content, ensuring precise temporal alignment.
  • Leveraging a large-scale instruction dataset and state-aware focal loss, Streamo overcomes class imbalance to enhance multi-task streaming video understanding.

Streamo is an integrated real-time streaming video LLM system designed to provide general-purpose interactive assistance over continuous video streams. Unlike prior online video models that are limited to specific functions such as question answering or captioning, Streamo executes a broad array of streaming video understanding tasks, including real-time narration, action and event captioning, temporal event grounding, and time-sensitive question answering, within a unified end-to-end framework. The system advances the state of multimodal assistants by bridging the gap between offline video perception and real-time, continuous video understanding, leveraging large-scale instruction-tuning data and a novel state-token-based interaction scheme (Xia et al., 24 Dec 2025).

1. System Architecture and Data Flow

Streamo is built atop a vision-language backbone such as Qwen2.5-VL. The architecture freezes the vision encoder and adds a lightweight “connector” that projects per-frame visual features into a space compatible with the LLM’s embedding layer. At each time step, a single video frame is sampled, encoded by the vision module, and the resulting feature is interleaved with textual tokens in a transformer stack.
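As a rough illustration of this design, the following PyTorch-style sketch shows how such a lightweight connector could project frozen per-frame features into the LLM's embedding space; the layer sizes and module structure are assumptions for illustration, not values reported for Streamo.

```python
import torch
import torch.nn as nn

class VisualConnector(nn.Module):
    """Hypothetical lightweight connector: maps frozen vision-encoder
    features into the LLM's token-embedding space. Dimensions and the
    two-layer MLP structure are illustrative assumptions."""
    def __init__(self, vision_dim: int = 1280, llm_dim: int = 3584):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, frame_features: torch.Tensor) -> torch.Tensor:
        # frame_features: (num_visual_tokens, vision_dim) for one sampled frame
        return self.proj(frame_features)  # -> (num_visual_tokens, llm_dim)

# Per time step: one frame is encoded by the frozen vision module, projected
# by the connector, and its embeddings are interleaved with text-token
# embeddings before entering the LLM's transformer stack.
```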

The model extends the LLM’s vocabulary with three special “response state” tokens—〈Silence〉, 〈Standby〉, and 〈Response〉. These state tokens coordinate both the temporal decision of whether to emit a verbal output and the textual content itself, thus enabling the system to avoid any external control mechanisms such as independent segmentation or additional decision modules. All predictions—including state (“should I speak now?”) and content (“what should I say?”)—are generated through a single next-token prediction process. The system’s one-pass inference strategy reduces latency and achieves precise temporal alignment between model outputs and streaming input (Xia et al., 24 Dec 2025).
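The resulting one-pass control flow can be sketched as follows; the method names (`encode_frame`, `extend_context`, `next_token`, `generate_text`) are hypothetical placeholders rather than Streamo's actual API.

```python
# Illustrative per-second inference loop: a single next-token prediction over
# the extended vocabulary decides whether to speak, and content generation
# continues in the same pass only when <Response> is predicted.
SILENCE, STANDBY, RESPONSE = "<Silence>", "<Standby>", "<Response>"

def stream_step(model, kv_cache, frame, t: int):
    visual = model.encode_frame(frame)                        # frozen encoder + connector
    model.extend_context(kv_cache, f"<{t}s-{t+1}s>", visual)  # time marker + frame tokens
    state = model.next_token(kv_cache)                        # state decision by next-token
                                                              # prediction, no external controller
    if state == RESPONSE:
        return model.generate_text(kv_cache)                  # textual content follows immediately
    return None                                               # <Silence> or <Standby>: stay quiet
```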

2. Streaming Input Encoding and Dialogue Formalism

Rather than consuming entire video clips or fixed windows, Streamo reformulates the streaming task as a multi-turn, temporally indexed dialogue. A continuous video sequence $V = \{v_1, \ldots, v_T\}$ is decomposed into $N$ one-second segments $V = \{V^{(1)}, V^{(2)}, \ldots, V^{(N)}\}$. The training corpus is constructed as a tokenized sequence of video–response pairs $\mathcal{D} = \{(V^{(1)}, R^{(1)}), (V^{(2)}, R^{(2)}), \ldots\}$, where $R^{(i)}$ is either a response-state token or textual output.

Each streaming turn includes a time-boundary marker (e.g., 〈2s–3s〉), the visual feature for the current segment, and possible user instruction input. The LLM’s hidden state and the key–value cache retain contextual information across turns, enabling stateful handling of long video streams with continual memory. By sampling at one frame per second and embedding all prior tokens into the context, Streamo achieves temporal coherence and responsiveness without ad hoc FIFO buffers or external state maintenance (Xia et al., 24 Dec 2025).
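A minimal sketch of how such a temporally indexed turn might be assembled is shown below; the marker strings and the `<visual>` placeholder convention are illustrative assumptions, not the exact template used by Streamo.

```python
def build_turn(t_start: int, t_end: int, num_visual_tokens: int,
               user_instruction: str | None = None) -> list[str]:
    """Assemble one streaming turn: a time-boundary marker, per-frame visual
    placeholders (later replaced by the connector's frame embeddings), and an
    optional user instruction. The model then predicts a response-state token
    or textual content for this turn."""
    turn = [f"<{t_start}s-{t_end}s>"]
    turn += ["<visual>"] * num_visual_tokens
    if user_instruction is not None:
        turn += ["<user>", user_instruction, "</user>"]
    return turn

# Example: seconds 2-3 of the stream, with a user query arriving at that time.
print(build_turn(2, 3, num_visual_tokens=4,
                 user_instruction="Tell me when the person starts cooking."))
```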

3. Instruction-Tuning Dataset: Streamo-Instruct-465K

To support unified multi-task streaming video understanding, the developers of Streamo curated Streamo-Instruct-465K, a large-scale instruction-following dataset. Sourced from 135,875 videos across several established video QA and captioning datasets (including Koala, LLaVA-Video, ActivityNet, QVHighlight, YouCook2, HACS, EgoTimeQA, DiDeMo, and COIN), Streamo-Instruct-465K provides five major types of streaming supervision:

  • Real-Time Narration: Per-second verbal commentary extracted using Qwen2.5-VL-72B and post-processed by GLM-4.5 for coherence.
  • Event Captioning: Clip-level summaries established via ARC-Hunyuan-Video-7B, filtered for temporal consistency and precise boundaries.
  • Action Captioning: Descriptions of discrete actions or procedural steps, utilizing a similar annotation pipeline.
  • Event Grounding: The model is tasked with monitoring for predefined events and marking their temporal extent upon completion.
  • Time-Sensitive QA: The dataset provides evolving question–answer pairs about object state, actions, or scene changes, generated at precise timestamps, demanding fine-grained temporal reasoning.

Overall, 400,000 streaming examples and 65,000 offline QA samples form a unified corpus with standardized temporal and task annotations, enabling Streamo to achieve joint multi-task learning (Xia et al., 24 Dec 2025).
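For illustration only, a time-sensitive QA record in such a corpus might look like the following; the field names and values are hypothetical and do not reflect the actual schema of Streamo-Instruct-465K.

```python
# Hypothetical illustration of a temporally annotated instruction record;
# the schema below is assumed, not taken from Streamo-Instruct-465K itself.
example_record = {
    "video_id": "example_video_000123",      # made-up source video identifier
    "task": "time_sensitive_qa",
    "instruction": "Tell me when the person finishes chopping the vegetables.",
    "instruction_time_s": 4,                 # second at which the query arrives
    "response_time_s": 17,                   # second at which the answer should be emitted
    "response": "The person has just finished chopping the vegetables.",
    "silence_spans_s": [[0, 4], [5, 17]],    # seconds supervised as <Silence>/<Standby>
}
```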

4. Training Methodology and Loss Design

Streamo adopts an end-to-end training regimen using a single next-token prediction loss across interleaved multimodal streams. The presence of response-state tokens (with a frequency skew of ⟨Silence⟩ : ⟨Standby⟩ : ⟨Response⟩ ≈ 12 : 3 : 2) introduces pronounced class imbalance, making standard cross-entropy unsuitable.

To address this, a state-aware focal loss scheme is defined. For state-token positions $t_i \in \mathcal{S}$, the loss weighting components are:

  • Focal factor: $w_{\mathrm{focal}}(i) = (1 - p_{c_i})^{\gamma}$ (with $\gamma \geq 0$ favoring hard examples)
  • Frequency-based scalar: $\alpha_k = \frac{1}{|\mathcal{S}|} \cdot \frac{\sum_{j} n_j}{n_k}$

The per-token loss combines these as $\mathcal{L}_i = \alpha_{t_i}\, w_{\mathrm{focal}}(i)\, \mathcal{L}_{\mathrm{CE}}(i, t_i)$, while all non-state tokens use standard cross-entropy. The global loss is averaged over all positions in the set $\mathcal{M}$: $\mathcal{L}_{\mathrm{total}} = \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \mathcal{L}_i$. Training freezes the vision encoder and fine-tunes the connector and LLM for a single epoch with batch size 512, learning rate $1 \times 10^{-5}$, and $\gamma = 2$ (Xia et al., 24 Dec 2025).
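A minimal PyTorch sketch of this state-aware focal weighting is given below; the tensor layout and the handling of the state/non-state split are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def state_aware_focal_loss(logits, targets, is_state_token, state_counts, gamma=2.0):
    """Sketch of a state-aware focal loss.

    logits:         (N, V) token logits
    targets:        (N,)   target token ids
    is_state_token: (N,)   bool mask for positions whose target is a state token
    state_counts:   dict {state_token_id: corpus frequency n_k}
    """
    ce = F.cross_entropy(logits, targets, reduction="none")            # per-token cross-entropy
    p_target = torch.softmax(logits, dim=-1).gather(1, targets[:, None]).squeeze(1)

    # alpha_k = (1/|S|) * (sum_j n_j) / n_k: rarer state tokens get larger weight.
    total = sum(state_counts.values())
    alpha_vec = torch.ones_like(ce)
    for token_id, n_k in state_counts.items():
        alpha_vec[(targets == token_id) & is_state_token] = total / (len(state_counts) * n_k)

    focal = (1.0 - p_target) ** gamma                                   # down-weight easy predictions
    weights = torch.where(is_state_token, alpha_vec * focal, torch.ones_like(ce))
    return (weights * ce).mean()                                        # average over all positions

# With the reported ~12:3:2 frequency skew, alpha works out to roughly
# 17/36 ≈ 0.47, 17/9 ≈ 1.89, and 17/6 ≈ 2.83 for <Silence>, <Standby>, <Response>.
```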

5. Evaluation Protocols and Empirical Results

Streamo’s effectiveness is assessed on multiple axes:

  • Online Video Understanding (OVO-Bench): Streamo-7B (1 fps) achieves a 55.61% average across perception, forward/backward tracing, and response, exceeding Dispider (41.78%) by +13.83 points. Evaluation at 2 fps (without retraining) reaches 57.86%. Training on Streamo-Instruct-465K outperforms training on ET-Instruct-164K by +11.79 points.
  • Offline Video Understanding (MVBench, TempCompass, VideoMME, LongVideoBench): Streamo-7B, after streaming conversion, surpasses StreamingVLM-7B (state-of-the-art among offline methods) and improves over its own offline base by +3.3 points, demonstrating that instruction-tuned streaming training does not impede, and can enhance, core perceptive abilities.
  • Multi-Task Streaming Evaluation (Streamo-Bench): On a custom 3,000-instance benchmark comprising grounding, narration, captioning, and time-sensitive QA, Streamo-7B achieves 55.3% average, compared to 24.6% for the best existing model—demonstrating robust cross-task generalization and instruction-following.
  • Ablation on Loss Functions: Replacing the focal + α-scaling scheme with vanilla cross-entropy sharply reduces CRR (≈41.7%), while fixed inverse-frequency weighting helps only marginally (CRR ≈ 49.2%). Only the state-aware focal formulation enables full responsiveness (CRR = 82.5%) (Xia et al., 24 Dec 2025).

6. Technical Innovations and Core Contributions

Streamo’s principal contributions are as follows:

  • End-to-end integration of perception, temporal decision-making, and text generation within a single next-token LLM, eliminating external controllers or buffering modules.
  • Introduction of “response state” tokens interleaved with standard LLM output, enabling both output-timing and content-determination through unified autoregressive prediction.
  • Development of Streamo-Instruct-465K, the largest temporally annotated, multi-task instruction-tuning dataset for streaming video understanding.
  • Definition of Streamo-Bench, a benchmark specifically probing real-time, instruction-following over heterogeneous streaming tasks.
  • Formulation of a state-aware focal loss to counter severe class imbalance among response state tokens, ensuring accurate frame-level output timing and reducing degenerate silence or response collapse.
  • Demonstrated generalization gains on both streaming and offline video benchmarks over diverse backbones, supporting the efficacy of unified temporally-annotated instruction tuning (Xia et al., 24 Dec 2025).

7. Limitations and Prospective Research Directions

Despite its performance, Streamo currently faces a limitation of unbounded context growth: because every past frame feature and text token is retained to preserve context, memory demand and inference latency grow with video length. Proposed remedies include sliding-window attention, more efficient cache management, visual token pruning, and adaptive frame compression. More advanced strategies such as hierarchical temporal attention or reservoir sampling of salient frames are anticipated avenues to cap context length while maintaining temporal reasoning and real-time responsiveness in extended deployments (Xia et al., 24 Dec 2025).
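As one illustrative direction (not an implemented Streamo feature), a sliding-window policy over the per-second visual context could bound memory roughly as follows.

```python
from collections import deque

class SlidingVisualCache:
    """Illustrative only (not part of Streamo): retain the most recent
    `window_s` one-second segments of visual context, dropping the oldest
    entries so memory stays bounded for arbitrarily long streams."""

    def __init__(self, window_s: int = 600):
        self.window = deque(maxlen=window_s)   # one entry per one-second segment

    def push(self, second: int, visual_tokens) -> None:
        self.window.append((second, visual_tokens))

    def context(self) -> list:
        # Segments whose tokens (or cached key-value entries) are retained
        # when building the context for the next streaming turn.
        return list(self.window)
```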
