Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction (2501.03218v1)

Published 6 Jan 2025 in cs.CV

Abstract: Active Real-time interaction with video LLMs introduces a new paradigm for human-computer interaction, where the model not only understands user intent but also responds while continuously processing streaming video on the fly. Unlike offline video LLMs, which analyze the entire video before answering questions, active real-time interaction requires three capabilities: 1) Perception: real-time video monitoring and interaction capturing. 2) Decision: raising proactive interaction in proper situations, 3) Reaction: continuous interaction with users. However, inherent conflicts exist among the desired capabilities. The Decision and Reaction require a contrary Perception scale and grain, and the autoregressive decoding blocks the real-time Perception and Decision during the Reaction. To unify the conflicted capabilities within a harmonious system, we present Dispider, a system that disentangles Perception, Decision, and Reaction. Dispider features a lightweight proactive streaming video processing module that tracks the video stream and identifies optimal moments for interaction. Once the interaction is triggered, an asynchronous interaction module provides detailed responses, while the processing module continues to monitor the video in the meantime. Our disentangled and asynchronous design ensures timely, contextually accurate, and computationally efficient responses, making Dispider ideal for active real-time interaction for long-duration video streams. Experiments show that Dispider not only maintains strong performance in conventional video QA tasks, but also significantly surpasses previous online models in streaming scenario responses, thereby validating the effectiveness of our architecture. The code and model are released at \url{https://github.com/Mark12Ding/Dispider}.

Summary

  • The paper introduces Dispider, a framework that disentangles perception, decision, and reaction to enable efficient real-time video interactions.
  • It demonstrates superior performance on streaming video QA tasks with significant improvements in temporal reasoning and multi-step response generation.
  • The approach is validated on benchmarks like StreamingBench and ETBench, highlighting its potential for interactive applications in dynamic environments.

Enabling Real-Time Interaction in Video LLMs with Dispider

The paper "Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction" introduces an innovative framework for interactive human-computer communication in the context of video LLMs. Dispider focuses on addressing the inherent limitations of traditional video LLMs, which typically require processing entire videos before generating responses. This approach is largely impractical for dynamic, real-time scenarios where users anticipate prompt and continuous feedback as video streams unfold.

At the core of Dispider's architecture are three key capabilities: perception, decision, and reaction. These are disentangled into separate modules that operate asynchronously, enabling real-time interaction without blocking the flow of video processing. This is achieved through:

  1. Perception Module: A lightweight module that continuously monitors the video stream and captures user interactions in real time, using scene-boundary detection to segment the stream into meaningful clips. Processing only these clip-level units keeps computation efficient (a minimal segmentation sketch follows this list).
  2. Decision Module: Evaluates when to initiate an interaction by aggregating historical context with the current video content. It combines historical decision tokens with incoming visual information to judge whether a response is warranted, keeping decisions contextually informed with minimal latency.
  3. Reaction Module: Once an interaction is triggered, this module asynchronously generates detailed, contextually appropriate responses while video processing continues uninterrupted, preserving the flow of real-time input (see the asynchronous pipeline sketch below).
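
To make the perception step concrete, below is a minimal sketch of scene-boundary-based clip segmentation. The boundary criterion shown here (cosine similarity between consecutive frame features against a fixed threshold) and the placeholder encoder are illustrative assumptions for this summary, not Dispider's actual method.

```python
import numpy as np

# Hypothetical similarity cutoff; Dispider's real boundary criterion may differ.
SIM_THRESHOLD = 0.85


def encode_frame(frame: np.ndarray) -> np.ndarray:
    """Placeholder encoder: L2-normalized raw pixels stand in for features
    from a real visual backbone."""
    feat = frame.astype(np.float32).ravel()
    return feat / (np.linalg.norm(feat) + 1e-8)


def segment_stream(frames):
    """Split streaming frames into clips, starting a new clip whenever
    consecutive frame features diverge (a simple proxy for a scene boundary)."""
    clips, current, prev_feat = [], [], None
    for frame in frames:
        feat = encode_frame(frame)
        if prev_feat is not None and float(feat @ prev_feat) < SIM_THRESHOLD:
            clips.append(current)  # boundary detected: close the current clip
            current = []
        current.append(frame)
        prev_feat = feat
    if current:
        clips.append(current)
    return clips
```

Passing clip-level units downstream rather than raw frames is consistent with the efficiency goal described in the perception item above.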

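The interplay among the three modules can be illustrated with a short asynchronous loop in which response generation never blocks stream monitoring. This is a conceptual sketch only: should_respond, generate_response, and the frame-count threshold are hypothetical stand-ins for Dispider's learned decision tokens and response decoder, not its actual interface.

```python
import asyncio

MIN_CLIP_FRAMES = 8  # hypothetical trigger rule; Dispider predicts decision tokens instead


def should_respond(clip, history) -> bool:
    """Stand-in for the decision module: a trivial rule on clip length.
    The real module also conditions on historical decision tokens;
    `history` is kept only to mirror that interface."""
    return len(clip) >= MIN_CLIP_FRAMES


async def generate_response(clip, question: str) -> str:
    """Stand-in for the reaction module; autoregressive decoding is slow,
    so it is modeled as an awaitable with noticeable latency."""
    await asyncio.sleep(0.5)  # simulate decoding time
    return f"Response about a {len(clip)}-frame clip for: {question}"


async def run_pipeline(clip_stream, question: str):
    """Monitor incoming clips and launch responses as background tasks,
    so perception and decision are never blocked by ongoing decoding."""
    history, pending = [], []
    async for clip in clip_stream:          # perception: clips keep arriving
        if should_respond(clip, history):   # decision: cheap per-clip check
            pending.append(asyncio.create_task(generate_response(clip, question)))
        history.append(clip)                # keep context for later decisions
    return await asyncio.gather(*pending)


async def demo_clip_stream(clips):
    """Toy async source that yields pre-segmented clips at intervals."""
    for clip in clips:
        await asyncio.sleep(0.1)  # simulate real-time arrival
        yield clip


if __name__ == "__main__":
    clips = [list(range(4)), list(range(16)), list(range(12))]
    print(asyncio.run(run_pipeline(demo_clip_stream(clips), "What happened?")))
```

The design point mirrored here is that the slow, token-by-token reaction runs concurrently with the lightweight perception and decision loop, which is the essence of the disentangled, asynchronous architecture the paper describes.
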
Experiments reported in the paper show that Dispider not only maintains strong performance on conventional video question-answering (QA) tasks but also substantially outperforms prior online models in streaming scenarios. The architecture is validated on benchmarks such as StreamingBench, ETBench, and long-video QA suites, demonstrating its effectiveness in temporal grounding, proactive response generation, and multi-step reasoning.

One notable result is that Dispider delivers a clear improvement over VideoLLM-online on real-time interaction tasks, particularly in temporal reasoning and in handling videos of widely varying length. These results suggest that Dispider's architecture is well suited to applications demanding both high responsiveness and accuracy when processing streaming video in real time.

Practically, the proposed framework could benefit applications ranging from live surveillance systems and autonomous vehicles to interactive educational platforms, wherever real-time video interaction is critical. Theoretically, the work contributes to the understanding of asynchronous processing in multimodal LLMs and points toward future research on reducing computational overhead while improving interaction quality.

In conclusion, Dispider represents a substantial step forward in the design of video LLMs for real-time scenarios. By disentangling perception, decision, and reaction, it offers a framework that is both efficient and effective at handling streaming video interactions. Innovations of this kind pave the way for more sophisticated, contextually aware systems, and future work may integrate Dispider with complementary technologies to broaden its use in increasingly complex environments.
