StreamBridge: Adapting Offline Video-LLMs for Streaming Scenarios
The paper introduces StreamBridge, a framework designed to convert offline video large language models (Video-LLMs) into models capable of operating in real-time streaming settings. This work targets two primary limitations of existing Video-LLMs: the inability to handle multi-turn conversations in real time and the lack of proactive response capabilities. Video-LLMs traditionally operate on pre-recorded video, which is unsuitable for applications requiring real-time analysis, such as robotics and autonomous driving.
StreamBridge Contributions:
- Memory Buffer with Round-Decayed Compression: To support multi-turn interactions, StreamBridge maintains a memory buffer that stores visual and textual embeddings over time, letting the model retain extended context without exceeding computational constraints. The round-decayed compression strategy selectively compresses older dialogue rounds first, so recent information is preserved at full fidelity while the overall buffer stays within budget.
- Decoupled Activation Model: To enable proactive responses, the authors introduce a lightweight activation model that is decoupled from the main Video-LLM. This model evaluates the incoming stream and decides when to trigger the LLM's response process, improving both responsiveness and computational efficiency.
- Stream-IT Dataset: To support training for streaming video comprehension, the authors construct Stream-IT, a purpose-built dataset of interleaved video-text samples that enables robust real-time understanding and interaction.
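The round-decayed compression idea can be sketched as follows. This is a hypothetical, simplified implementation, not the paper's: the class name, token budget, and the choice of average-pooling as the compression operator are all assumptions made for illustration. The key behavior it demonstrates is that when the buffer exceeds its budget, the oldest rounds are compressed first, leaving the most recent round untouched.

```python
from collections import deque

import numpy as np


class RoundDecayedMemory:
    """Toy multi-turn memory buffer (illustrative only; not the paper's code).

    Each dialogue round stores a block of embeddings. When the total token
    count exceeds a budget, the OLDEST rounds are average-pooled down first,
    so recent context is retained at full resolution.
    """

    def __init__(self, max_tokens=1024, pool_factor=2):
        self.max_tokens = max_tokens
        self.pool_factor = pool_factor
        self.rounds = deque()  # each item: (num_tokens, dim) ndarray

    def _total(self):
        return sum(r.shape[0] for r in self.rounds)

    def _pool(self, block):
        # Average-pool adjacent tokens: (T, D) -> (ceil(T / pool_factor), D).
        t, d = block.shape
        if t <= 1:
            return block  # cannot shrink a single token further
        pad = (-t) % self.pool_factor
        if pad:  # repeat trailing tokens so T divides evenly
            block = np.concatenate([block, block[-pad:]], axis=0)
        return block.reshape(-1, self.pool_factor, d).mean(axis=1)

    def add_round(self, embeddings):
        self.rounds.append(np.asarray(embeddings, dtype=np.float32))
        # Round-decayed compression: shrink from the oldest round forward
        # until the buffer fits the budget again.
        i = 0
        while self._total() > self.max_tokens and i < len(self.rounds):
            before = self.rounds[i].shape[0]
            self.rounds[i] = self._pool(self.rounds[i])
            if self.rounds[i].shape[0] == before:  # fully compressed; move on
                i += 1

    def context(self):
        # Concatenated embeddings fed to the Video-LLM as its context.
        return np.concatenate(list(self.rounds), axis=0)
```

Because compression always starts at the oldest round, repeated calls naturally produce a decay schedule: the further back a round is, the more times it has been pooled.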
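The decoupled activation idea can likewise be sketched. The paper trains a separate lightweight model for this; the linear scorer below is only a stand-in to show the control flow, and the class name, threshold, and scoring function are invented for illustration. The point is that a cheap per-frame check gates calls to the expensive Video-LLM, so the system stays responsive without running the full model on every frame.

```python
import numpy as np


class ActivationGate:
    """Toy stand-in for a decoupled activation model (illustrative only).

    Scores each incoming frame embedding with a cheap linear head and
    invokes the full Video-LLM only when the score crosses a threshold.
    """

    def __init__(self, dim, threshold=0.5, seed=0):
        rng = np.random.default_rng(seed)
        # Pretend-trained weights; a real gate would be a learned model.
        self.w = rng.standard_normal(dim) / np.sqrt(dim)
        self.threshold = threshold

    def score(self, frame_embedding):
        # Sigmoid over a linear projection: cheap enough to run per frame.
        z = float(self.w @ frame_embedding)
        return 1.0 / (1.0 + np.exp(-z))

    def step(self, frame_embedding, respond_fn):
        """Call `respond_fn` (the full Video-LLM) only on activation."""
        if self.score(frame_embedding) >= self.threshold:
            return respond_fn(frame_embedding)
        return None  # stay silent and keep consuming the stream
```

Keeping the gate separate from the backbone means it can be swapped or retuned (e.g. a different threshold per application) without touching the Video-LLM itself, which is the modularity the paper emphasizes.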
Empirical Evaluation:
The paper substantiates the efficacy of StreamBridge and Stream-IT through extensive empirical evaluation. Models enhanced with the framework outperform existing solutions across multiple streaming benchmarks, including OVO-Bench and StreamingBench. Notably, when adapted with StreamBridge, models such as Qwen2-VL achieve higher accuracy than leading proprietary models such as GPT-4o and Gemini 1.5 Pro.
Moreover, despite the adaptations for streaming, the models maintain or improve performance on standard video comprehension benchmarks, indicating that the framework preserves the generalization needed for a wide range of video analysis tasks.
Theoretical and Practical Implications:
Theoretically, this work expands the operational paradigm of Video-LLMs, moving them from batch processing to dynamic, interactive streams, a critical capability for future human-machine interaction in active environments. Practically, the implications are significant for industries relying on real-time video analytics, with potential gains in interactivity, efficiency, and flexibility across domains such as surveillance, vehicular automation, and real-time user-guidance systems.
While the framework represents a significant advance, areas for future development include refining the activation model's decision-making, tailoring the compression strategy to diverse scene types, and extending the approach to additional streaming modalities.
In conclusion, StreamBridge marks a significant step in transforming offline video analysis models into interactive streaming assistants, moving toward real-time, contextually aware machine perception and response in complex, dynamic environments.