Real-Time Streaming Video LLM
- Real-time streaming video LLMs are unified multimodal systems that ingest, process, and interact with continuous video streams in real time.
- They leverage specialized modules, including frame-level visual encoders, lightweight adapters, and autoregressive language models, to generate timely narrations and captions.
- Innovative memory management and token buffering strategies enable low-latency inference and scalability for sustained and dynamic video understanding.
A real-time streaming video LLM is a unified multimodal system that ingests, processes, and interactively understands continuous video streams. Unlike conventional offline video LLMs, which operate batchwise on entire videos, streaming video LLMs align their architecture, supervision, and inference with the online, temporally causal, and dynamic nature of real-world video understanding. Such models are designed to emit frame- or event-aligned outputs, including narration, captioning, grounding, and time-sensitive question answering, under strict constraints on latency, memory, and computational cost. The field has rapidly evolved from offline captioning toward unified, instruction-following frameworks capable of proactive and reactive operation, enabled by new datasets, temporal objective functions, memory management paradigms, and software/hardware co-design innovations.
1. Key Architectural Principles
Streaming video LLMs are typically built from three principal modules:
- Visual Encoder: A frozen frame-level vision backbone (e.g., CLIP, Qwen-VL, SigLIP) encodes each video frame or clip into one or more dense feature vectors. Temporal granularity is typically 1–2 fps; spatial granularity depends on whether patch or region embeddings are used (Xia et al., 24 Dec 2025, Kim et al., 13 Dec 2025, Yang et al., 7 Nov 2025).
- Connector/Adapter: A lightweight projection (e.g., an MLP) maps visual features into the LLM’s token space. Additional modules such as DETR-QFormers or Q-Formers are used to further compress spatio-temporal content or to focus on salient regions (hands, objects) (Chatterjee et al., 10 Apr 2025, Ning et al., 21 May 2025).
- Autoregressive LLM: The core LLM (e.g., Qwen2.5-VL, Llama, InternLM, Vicuna) is augmented with special state tokens such as <Silence>, <Standby>, and <Response>, and is trained to process interleaved visual and text tokens in a single, chronologically ordered token stream (Xia et al., 24 Dec 2025, Yang et al., 7 Nov 2025, Zhang et al., 12 Jun 2024).
The token sequence is typically organized turn-wise, such that each second or chunk of video is input as a block of visual tokens, followed by a model-generated state token. Temporal dependencies are captured implicitly via causal self-attention over the expanding context window, and optionally enhanced by explicit temporal attention or positional encodings (Xia et al., 24 Dec 2025, Xu et al., 10 Oct 2025). Pruning and compression strategies (KV discarding, differential token drop, event gating) are critical for constraining memory and computational cost as the input grows.
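The following minimal sketch shows how these three modules can compose into a per-turn pipeline. It is illustrative only: `vision_encoder` stands in for a generic frozen backbone, and `predict_state` is a hypothetical helper on the LLM, not the API of any cited model.

```python
# Minimal per-turn pipeline sketch: frozen encoder -> adapter -> autoregressive LLM.
# All class/function names are illustrative placeholders, not a specific model's API.
import torch
import torch.nn as nn

STATE_TOKENS = ["<Silence>", "<Standby>", "<Response>"]

class StreamingAdapter(nn.Module):
    """Lightweight MLP connector mapping frame features into the LLM token space."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (num_frames, num_patches, vision_dim) -> tokens in LLM space
        return self.proj(frame_feats)

def process_turn(frame, vision_encoder, adapter, llm, context_tokens):
    """One streaming turn: encode the new frame, append its tokens to the
    chronologically ordered context, and let the LLM predict a state token."""
    with torch.no_grad():
        feats = vision_encoder(frame)              # frozen backbone, e.g. CLIP/SigLIP
    visual_tokens = adapter(feats)                 # project into the LLM token space
    context_tokens = torch.cat([context_tokens, visual_tokens.flatten(0, 1)], dim=0)
    state = llm.predict_state(context_tokens)      # hypothetical helper: one of STATE_TOKENS
    return state, context_tokens
```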
2. Streaming Input Processing and Memory Management
Continuous video ingestion is handled by:
- Turn-Based or Event-Based Segmentation: Video is sampled at fixed intervals (typically 1 fps), with each frame or chunk grouped into a “turn” for processing. Some systems utilize event-driven gating, only invoking the LLM when a trigger (e.g., semantic change, external query) is detected (Xia et al., 24 Dec 2025, Ding et al., 8 Mar 2025).
- Token Buffering and Sliding Windows: Visual and text tokens are appended to the context window as new frames arrive. As the window grows, memory and computation scale linearly (or worse), so explicit sliding-window mechanisms or dynamic key-value (KV) cache pruning are employed. Compression strategies include top-p attention-guided dropping, mean-pooling, and feature-differential masking (Ning et al., 21 May 2025, Yao et al., 24 Apr 2025, Chatterjee et al., 10 Apr 2025); a minimal buffering sketch follows this list.
- Redundancy Suppression: Differential Token Drop (DTD) identifies and prunes redundant visual tokens by comparing feature-level similarity between consecutive frames, leading to 80% token reduction while retaining >97% performance on streaming benchmarks (Yao et al., 24 Apr 2025). Event-gated frameworks further reduce LLM invocation frequency.
- Memory Hierarchies: Several architectures utilize multi-layered memory structures, partitioning short-term high-detail memories and long-term compressed memories (e.g., verbalized text events, centroid clusters), often organized by hierarchical or FIFO buffers (Xiong et al., 23 Jan 2025, Chatterjee et al., 10 Apr 2025, Zhang et al., 12 Jun 2024, Lin et al., 6 Nov 2024).
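The sketch below illustrates two of the strategies above, assuming per-frame patch tokens as input: a bounded sliding-window buffer plus a simple feature-similarity drop rule in the spirit of (but not identical to) Differential Token Drop.

```python
# Illustrative token buffer: FIFO sliding window with a crude redundancy filter.
# Thresholds and shapes are assumptions, not values from the cited works.
from collections import deque
import torch
import torch.nn.functional as F

class TokenBuffer:
    def __init__(self, max_tokens: int = 8192, drop_threshold: float = 0.95):
        self.buffer = deque()              # FIFO of per-frame visual token tensors
        self.max_tokens = max_tokens
        self.drop_threshold = drop_threshold
        self.prev_frame_feat = None

    def add_frame(self, frame_tokens: torch.Tensor):
        """frame_tokens: (num_patches, dim). Skip the frame if it is nearly
        identical to the previous one; otherwise append and trim the window."""
        frame_feat = frame_tokens.mean(dim=0)
        if self.prev_frame_feat is not None:
            sim = F.cosine_similarity(frame_feat, self.prev_frame_feat, dim=0)
            if sim > self.drop_threshold:
                return                     # redundant frame: do not grow the context
        self.prev_frame_feat = frame_feat
        self.buffer.append(frame_tokens)
        # Sliding window: evict the oldest frames once the token budget is exceeded
        while sum(t.shape[0] for t in self.buffer) > self.max_tokens:
            self.buffer.popleft()

    def context(self) -> torch.Tensor:
        return torch.cat(list(self.buffer), dim=0)
```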
3. Instruction-Following Datasets and Supervision
Unified real-time streaming video LLMs are enabled by large-scale, temporally fine-grained instruction datasets:
- Multi-Task and Temporal Coverage: Datasets such as Streamo-Instruct-465K feature 465K samples from 135K videos, covering real-time narration, action/event captioning, temporal event grounding, and time-sensitive QA, with tasks labeled by explicit time spans (Xia et al., 24 Dec 2025). Datasets incorporate samples from ActivityNet, YouCook2, COIN, DiDeMo, QVHighlight, HACS, EgoTimeQA, and others.
- Streaming Dialogue Formatting: For each turn, the video input is paired with a response state (<Silence>, <Standby>, or <Response>) and optional text. This dialogue-style formatting is critical for matching the training and streaming-inference regimes.
- Class Imbalance and Specialized Losses: Because <Silence> vastly dominates <Response>, advanced loss functions (focal loss, frequency-balanced cross-entropy) are used to prevent underfitting of response tokens. The per-position loss balances the supervision of frequent and rare response states (Xia et al., 24 Dec 2025); a hedged formulation is sketched after this list.
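One plausible instantiation of such a loss combines a focal term with inverse-frequency class weights over the state-token vocabulary. The sketch below is a generic formulation, not necessarily the exact per-position loss used by Streamo.

```python
# Generic focal loss with frequency-balanced class weights over state tokens.
# gamma and the weighting scheme are standard choices, assumed for illustration.
import torch
import torch.nn.functional as F

def balanced_focal_loss(logits, targets, class_freq, gamma: float = 2.0):
    """logits: (N, num_state_tokens); targets: (N,) state-token ids;
    class_freq: (num_state_tokens,) empirical frequency of each state token."""
    weights = 1.0 / class_freq.clamp(min=1e-6)
    weights = weights / weights.sum()                 # frequency-balanced class weights
    log_p = F.log_softmax(logits, dim=-1)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()
    focal = (1.0 - pt) ** gamma                       # down-weight easy, frequent positions
    w = weights[targets]
    return (w * focal * (-log_pt)).mean()
```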
4. Real-Time Inference, Decoding, and Performance
The runtime inference pipeline is designed for immediate, low-latency outputs as new frames arrive:
- Per-Turn Step: With each new interval, visual tokens are appended and the LLM predicts a state token; only if <Response> is triggered does the model emit a textual output (a sketch of this loop follows the list).
- Decoding: Greedy decoding is typically used to satisfy real-time constraints. Event detection (standby/response) is managed internally, without external control modules (Xia et al., 24 Dec 2025, Yang et al., 7 Nov 2025).
- Latency: Systems such as Streamo achieve inference latency comparable to a single LLM+adapter forward pass per second of video, without intermediate batching or controller overhead.
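A hedged sketch of this per-turn loop is shown below; `model`, `encode_frame`, `greedy_state`, and `generate` are illustrative placeholders rather than a specific system's API, and `TokenBuffer` is the buffer sketched in Section 2.

```python
# Per-turn streaming inference: buffer new visual tokens, greedily predict a
# state token, and generate text only when <Response> is emitted.
def streaming_inference(video_stream, model, token_buffer):
    """Yield (timestamp, text) pairs as the stream is consumed."""
    for frame in video_stream:                            # e.g. sampled at 1 fps
        # Encode and buffer the new interval's visual tokens (placeholder API)
        token_buffer.add_frame(model.encode_frame(frame))  # (num_patches, dim)
        context = token_buffer.context()
        # Greedy state prediction: argmax over {<Silence>, <Standby>, <Response>}
        state = model.greedy_state(context)
        if state == "<Response>":
            # Greedy decoding keeps per-turn latency near a single forward pass
            text = model.generate(context, do_sample=False)
            yield frame.timestamp, text                   # time-aligned narration/answer
```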
Empirical results across diverse online and offline benchmarks indicate:
| Model | OVO-Bench (online, %) | Streamo-Bench (online, %) | LongVideoBench / VideoMME (offline Δ, %) |
|---|---|---|---|
| Streamo-7B | 57.9 | 55.3 | +3.3 over Qwen2.5-VL-7B |
A focal plus frequency-balanced loss improves OVO-Bench Forward Active metrics relative to plain cross-entropy (+21.4% on CRR) (Xia et al., 24 Dec 2025).
5. Scalability: Memory, Hardware, and Software Co-Design
Streaming video LLMs encounter unique system-level bottlenecks:
- Context Window Saturation: As video streams grow arbitrarily long, linear growth in memory and compute becomes unsustainable. Context windows are either explicitly pruned (sliding buffers, KV pruning) or gated semantically (state tokens, event gates).
- KV Cache Retrieval and Compression: Dynamic KV-cache retrieval methods (e.g., ReSV in V-Rex) cluster tokens by temporal-spatial similarity and adaptively select only high-salience clusters for attention, achieving compression with negligible accuracy drop. The V-Rex hardware accelerator realizes this design with specialized engines (DRE, KVPU, cluster-mapped PCIe fetch) and delivers speedups and energy-efficiency gains for real-time edge and server deployment (Kim et al., 13 Dec 2025); a simplified retrieval sketch follows this list.
- Hardware and Implementation: A single GPU (e.g., RTX A6000 or H100) supports state-of-the-art throughput for 7–8B LLMs at 1–8 fps, with peak memory and compute dominated by attention over the current context rather than by the vision backbone (Xu et al., 10 Oct 2025).
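As a rough illustration of salience-based KV-cache retrieval, the sketch below scores contiguous temporal chunks of the cache against the current query and keeps only the top clusters. It is a generic approximation for exposition, not the ReSV/V-Rex algorithm, and the cluster/keep counts are assumed values.

```python
# Simplified salience-based KV-cache retrieval: keep only the most query-relevant
# temporal clusters of cached keys/values, preserving their chronological order.
import torch

def select_kv_clusters(keys, values, query, num_clusters: int = 32, keep: int = 8):
    """keys/values: (seq, dim) cached tensors; query: (dim,) current query vector.
    Returns a temporally ordered, compressed KV cache."""
    seq = keys.shape[0]
    chunk = max(1, seq // num_clusters)               # contiguous temporal clusters
    scores = keys @ query                             # per-token salience w.r.t. the query
    cluster_scores = torch.stack(
        [scores[i:i + chunk].max() for i in range(0, seq, chunk)]
    )
    top = torch.topk(cluster_scores, k=min(keep, cluster_scores.shape[0])).indices
    idx = torch.cat(
        [torch.arange(c * chunk, min((c + 1) * chunk, seq)) for c in top.tolist()]
    )
    idx, _ = torch.sort(idx)                          # restore temporal order
    return keys[idx], values[idx]
```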
6. Evaluation Benchmarks and Metrics
Streaming video LLMs are systematically evaluated on purpose-built benchmarks:
- StreamingBench: 900 videos, 18 tasks, and 4,500 QA pairs, measuring real-time visual, omni-source, and contextual understanding (Lin et al., 6 Nov 2024). Metrics include answer accuracy, online latency, and proactive-output accuracy.
- OVO-Bench, Streamo-Bench: Measure joint narration/captioning, temporal grounding, and time-sensitive QA in live settings (Xia et al., 24 Dec 2025).
- Real-time and Offline Transfer: Models are required to maintain state-of-the-art offline performance (MVBench, TempCompass, VideoMME, LongVideoBench) while enabling real-time, low-latency, and memory-bounded streaming inference.
7. Emerging Directions, Challenges, and Opportunities
Despite rapid advances, major open areas remain:
- Memory and Context Scaling: Practical systems must integrate KV-cache management, sliding window attention, or recurrent memory modules to handle multi-hour streams. Visual token pruning and adaptive frame sampling are under active development (Xia et al., 24 Dec 2025, Kim et al., 13 Dec 2025).
- Generalization and Task-Adaptive Computing: Real-time models must dynamically allocate computation across periods of high motion or semantic change, integrating event-driven gating and redundancy suppression.
- Proactive and Multi-Task Interaction: Proactive response triggers (e.g., from DTD or <Standby> tokens) and unified multi-task instruction fine-tuning over broad streaming tasks are now standard targets for research (Yao et al., 24 Apr 2025, Yang et al., 7 Nov 2025).
- Unified End-to-End Systems: Future advances will require end-to-end integration of vision, language, context memory, hardware-aware computation, and interactive training over heterogeneous, temporally-structured instruction datasets.
The state-of-the-art real-time streaming video LLM is not a simple question-answering system, but a tightly coupled, memory-efficient, instruction-following assistant, spanning narration, event detection, grounding, temporal reasoning, and proactive interaction in live, continuous video environments (Xia et al., 24 Dec 2025, Kim et al., 13 Dec 2025, Lin et al., 6 Nov 2024, Yang et al., 7 Nov 2025, Xu et al., 10 Oct 2025).