- The paper introduces a novel proactive VideoLLM that processes video streams in discrete chunks to achieve low latency and maintain precise dialogue turn-taking.
- It employs a lightweight FLAG token and dual-cache with reverse-RoPE for effective content regulation and infinite, context-aware streaming.
- Experimental results show that Proact-VL outperforms baselines in text quality and temporal accuracy, supporting real-time applications in gaming and education.
Proact-VL: A Proactive VideoLLM for Real-Time Interactive Companions
Motivation and Problem Definition
Advances in video large language models (VideoLLMs) have enabled AI systems that interpret video streams and interact with users in real time across domains such as game commentary, live-stream interpretation, and coaching. Human-like AI companions for these tasks present three critical challenges: (i) achieving low-latency inference under continuous streaming input, (ii) dynamically and autonomously deciding when to engage or remain silent, and (iii) regulating both the length and density of generated content to match real-time operational constraints without compromising quality. Existing paradigms for streaming video understanding—primarily chunk-wise, proactive, or real-time models—trade off response accuracy, latency, and temporal granularity. Notably, previous proactive approaches operate at coarse event-level granularity, while conventional real-time models lack explicit turn-taking and output regulation, often devolving into excessive or redundant chatter.
Live Gaming Dataset and Benchmark Construction
To address both modeling and benchmarking needs, the authors introduce the Live Gaming Dataset, comprising 561 hours of high-quality, English-language annotated commentary covering 12 game titles spanning diverse genres. The dataset is tailored for three main tasks: (1) Solo Commentary, (2) Multi-agent Co-Commentary, and (3) Player Guidance, spanning both in-domain and generalization settings. The data curation and processing pipeline merges automated ASR (WhisperX-large-v3), robust paralinguistic annotation (Qwen3-Omni-Flash), and domain-specific transcript polishing (DeepSeek-V3.2-Exp). The guidance domain is further augmented with explicit temporal alignment of visual events and query-response pairs, using clip-based segmentation and LLM-driven refinement for action description and instructional guidance.
The benchmark suite is partitioned into training, a clip-level test set (Live Gaming Benchmark), and an extended streaming test set (Live Gaming Benchmark-Streaming), collectively supporting robust evaluation at both short-horizon and long-horizon scales.
Proact-VL Framework: Modeling and Training
Proact-VL introduces a hierarchical framework for proactive, real-time interactive agents founded on three architectural pillars:
- Chunk-wise Input-Output Processing: The model ingests discretized video input in temporally-consistent chunks (e.g., one second), each accompanied by optional user query and historical context. The transformer-based architecture maintains causal state with persistent KV caching, enabling incremental, temporally-aligned dialogue context ingestion. Outputs are similarly chunk-aligned, supporting seamless multi-segment streaming responses.
- Lightweight Proactivity Mechanism: At each time step, an explicit FLAG token determines the decision boundary for triggering generation versus silence. A gated MLP response head computes a scoring function over the FLAG hidden state, thresholded to execute binary speak/silence actions independently of text generation. This decoupled “decide-then-generate” policy allows for direct, low-latency regulation of content delivery while mitigating the instability associated with silence token-based approaches.
- Transition-Aware and Regularized Training Objectives: Training is governed by two complementary loss functions: a masked causal language modeling loss for utterance quality, and a transition-smoothed, rate-regularized binary classification loss for response timing. The latter is explicitly weighted to emphasize rare but critical speak/silence transitions and further regularized to enforce both local temporal consistency and target speaking rate, calibrated against annotated human baselines.
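The chunk-wise, decide-then-generate control flow described above can be sketched in a few lines. This is a minimal toy under stated assumptions, not the paper's implementation: the "encoder" collapses each chunk to a scalar hidden state, the FLAG response head is reduced to a sigmoid read-out, and the persistent KV cache is stood in for by a growing history list.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def run_stream(chunks, threshold=0.5):
    """Toy chunk-wise decide-then-generate loop.

    For each incoming chunk: encode it (stub), append to the persistent
    context, score the FLAG state, and only then decide whether to emit
    an utterance for this chunk. All names here are illustrative.
    """
    history = []   # stands in for the persistent KV cache
    outputs = []
    for t, chunk in enumerate(chunks):
        hidden = sum(chunk) + 0.1 * len(history)   # stub "encoder"
        history.append(hidden)
        score = sigmoid(hidden)                    # FLAG response head
        if score >= threshold:                     # decide first...
            outputs.append((t, f"utterance@{t}"))  # ...then generate
    return outputs
```

The key property illustrated is the decoupling: the binary speak/silence decision is taken from the FLAG score before any text generation is attempted, so silent chunks cost no decoding.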
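The transition-aware, rate-regularized timing loss can likewise be illustrated. The up-weighting of speak/silence transitions and the penalty pulling the mean predicted speaking rate toward a human-calibrated target follow the paper's description; the specific constants (`transition_weight=5.0`, `rate_target=0.3`, `rate_reg=0.1`) are illustrative placeholders, not reported hyperparameters.

```python
import math

def transition_weighted_bce(probs, targets, transition_weight=5.0,
                            rate_target=0.3, rate_reg=0.1):
    """Binary cross-entropy over per-chunk speak labels.

    Chunks where the label flips (speak<->silence transitions) are
    up-weighted, since they are rare but critical; a regularizer pulls
    the mean predicted speaking rate toward the target rate. All
    hyperparameter values are illustrative placeholders.
    """
    total = 0.0
    for t, (p, y) in enumerate(zip(probs, targets)):
        w = transition_weight if t > 0 and targets[t] != targets[t - 1] else 1.0
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
    mean_rate = sum(probs) / len(probs)
    return total / len(probs) + rate_reg * (mean_rate - rate_target) ** 2
```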
To enable seamless, unbounded streaming (given the finite context length constraint), a dual-cache sliding window design with reverse-RoPE position correction is utilized. This ensures positional encoding continuity and context-sensitive inference beyond standard transformer context limits.
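The cache bookkeeping can be sketched under the assumption (common to sink-token sliding-window schemes) that a small prefix cache is retained alongside a recent window, and that surviving entries are re-indexed to contiguous positions after eviction. In the actual model the position correction would be applied by rotating cached keys with the inverse RoPE rotation; plain integer positions here show only the bookkeeping.

```python
def dual_cache_evict(cache, sink_len, window_len):
    """Dual-cache sliding-window sketch.

    Keep the first `sink_len` entries (long-term anchor cache) plus the
    most recent `window_len` entries (sliding window), then re-assign
    contiguous positions to the survivors so positional encoding never
    exceeds the context limit. `cache` is a list of (position, token)
    pairs; real KV entries would carry key/value tensors, with the
    position shift realized as an inverse RoPE rotation of the keys.
    """
    if len(cache) <= sink_len + window_len:
        kept = cache
    else:
        kept = cache[:sink_len] + cache[-window_len:]
    return [(i, tok) for i, (_, tok) in enumerate(kept)]
```

Because positions are closed up after every eviction, the model always sees a contiguous position range regardless of how long the stream runs, which is what permits unbounded streaming without context drift.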
Experimental Results
Quantitative Metrics
Proact-VL is evaluated using a diverse suite of metrics capturing both text quality (LLM-based scoring: CC, LiveU, FinalQ) and proactive response quality (TimeDiff, PAUC, and F1 for event alignment and temporal accuracy).
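The exact definitions of the timing metrics are not reproduced in this summary, but a plausible reading can be sketched: F1 computed over per-chunk binary speak decisions, and TimeDiff as the mean gap from each ground-truth response onset to the nearest predicted onset. Both functions below are illustrative assumptions, not the benchmark's reference implementation.

```python
def timing_f1(pred, gold):
    """F1 over per-chunk binary speak decisions (illustrative reading)."""
    tp = sum(1 for p, g in zip(pred, gold) if p and g)
    fp = sum(1 for p, g in zip(pred, gold) if p and not g)
    fn = sum(1 for p, g in zip(pred, gold) if not p and g)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def time_diff(pred_times, gold_times):
    """Mean absolute gap (in chunks) from each ground-truth response
    onset to the nearest predicted onset (illustrative reading)."""
    return sum(min(abs(p - g) for p in pred_times)
               for g in gold_times) / len(gold_times)
```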
Main numerical results and claims:
- Across Solo Commentary, Co-Commentary, and Guidance settings, Proact-VL achieves the highest or competitive text-quality scores; for example, overall CC win rates against Gemini 2.5 Pro exceed 50% in Solo and Co-Commentary (53.62 and 51.46, respectively), with F1 scores for proactive timing reaching 77.44 in multi-agent settings.
- On generalization tasks (Ego4D, Black Myth Wukong), Proact-VL leads all real-time and proactive baselines, surpassing even strong closed-source offline models on text and response alignment.
- Ablation studies reveal steep drops in both precision/recall and timing metrics when transition-aware or rate-regularization terms are omitted, confirming the necessity for both loss components.
- Live streaming stability over long videos is robust: commentary quality remains stable past the 50-minute horizon, and end-to-end per-chunk latency remains under 0.4 seconds—sustaining 10–15 FPS inference for practical use.
Qualitative Analyses and Case Study
Detailed scenario walk-throughs demonstrate Proact-VL's proficiency in role/persona coordination, conversational turn-taking in multi-agent settings, and step-wise, proactive instructional guidance. The model exhibits strong temporal alignment and turn stability (e.g., refraining from interrupting co-commentators) and delivers temporally resolved safety and tool-use guidance for users (as in Minecraft hazard navigation), all without degenerating into generic filler or excessive verbosity.
Failure Modes and Limitations
Explicit failure cases are identified:
- Limited fine-grained visual grounding due to sparse frame sampling (2 FPS) and the lack of robust OCR/numeric reasoning for in-game HUD data, leading to occasional hallucinations or filler speech when on-screen information density is high.
- Insufficient entity grounding with respect to game updates and dynamic content, relying excessively on pretrained knowledge rather than real-time entity disambiguation.
Practical and Theoretical Implications
Proact-VL demonstrates genuine advances in the construction of human-like, low-latency AI companions, with direct relevance for automated esports commentary, real-time tutoring, and live educational content generation. Its robust "decide-then-generate" design and modular chunk-wise alignment provide a functional template for scalable, high-sensitivity VideoLLMs suitable for deployment in dynamic social and educational applications. The dataset and benchmark design set new standards for evaluation granularity and open longitudinal assessment for streaming AI agents.
Theoretically, this work establishes that proactively-controlled, online generation in multimodal LLMs is achievable with lightweight output gating and transition-focused optimization—contradicting the prevailing assumption that real-time and proactive regulation require large, monolithic policy modules. The demonstrated scalability of the chunk-wise cache scheme, combined with reverse-RoPE correction, also clarifies how infinite streaming can be sustained in transformer architectures without catastrophic context drift.
Contradictory finding: Unlike prior proactive VideoLLMs, which showed a trade-off between proactive control and latency, Proact-VL delivers both high timing fidelity (low TimeDiff) and granular output control, suggesting that chunk-wise proactive gating can dominate both coarse-grained event-based models and naively unregulated real-time models for this domain.
Future Directions
Key open research problems beyond this work include: (i) enhanced evidence-grounded generation via fine-grained, high-FPS visual and OCR entity tracking; (ii) active retrieval-based grounding for up-to-date game content adaptation; (iii) scalable, high-resolution, high-FPS inference under stringent latency constraints; and (iv) safety and factual consistency verification for live deployments to mitigate risks of misinformation or off-task outputs.
Conclusion
Proact-VL contributes a functionally and empirically validated framework for proactive, regulated, real-time VideoLLM-based AI companions, achieving state-of-the-art text and response quality in both clip-level and long-form streaming contexts. Its modular chunk-wise proactivity mechanism, data curation paradigm, and regularized training objectives advance both the methodology and benchmark standards in multimodal, streaming AI research, with significant practical and theoretical ramifications for the field.