- The paper presents a novel system that addresses streaming perception, integrated memory, and reasoning in multimodal long-term interactions.
- It employs disentangled modules to simulate human cognition through real-time processing, memory compression, and dynamic query responses.
- Performance benchmarks show state-of-the-art ASR and video-understanding results among comparable open-source models, setting a new bar for multimodal large language models.
A Comprehensive Review of InternLM-XComposer2.5-OmniLive
The paper "InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions" presents a novel system for enhancing real-time interaction capabilities in Multimodal LLMs (MLLMs). The paper introduces the InternLM-XComposer2.5-OmniLive (IXC2.5-OL), addressing persistent challenges in continuous streaming perception, memory, and reasoning that are not adequately handled by existing models.
InternLM-XComposer2.5-OmniLive is motivated by an inherent limitation of the sequence-to-sequence architectures that dominate current MLLMs: they struggle to process inputs and generate responses simultaneously. The primary contribution is the disentanglement of streaming perception, reasoning, and memory into separate mechanisms that more closely simulate human cognition. The system enables real-time interaction with streaming video and audio through three integral modules:
- Streaming Perception Module: Processes multimodal input in real time, efficiently storing and retrieving information and triggering the reasoning process when a user query arrives.
- Multi-modal Long Memory Module: Integrates short-term and long-term memory, compressing short-term memories into compact long-term representations for efficient storage and retrieval, which preserves accuracy over extended interactions.
- Reasoning Module: Executes reasoning tasks and answers queries, coordinating the perception and memory modules as the cognitive core of the architecture; a minimal sketch of how the three modules might interact follows this list.
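The sketch below illustrates, under stated assumptions, how such a disentangled loop could be wired together: perception runs continuously and cheaply, memory compression happens in the background, and the reasoning model is invoked only on demand. The class, method names, and clip length are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a disentangled streaming pipeline (hypothetical API,
# not the IXC2.5-OL codebase).
from collections import deque

class StreamingPipeline:
    def __init__(self, perception, memory, reasoner, clip_len=16):
        self.perception = perception    # encodes incoming frames/audio features
        self.memory = memory            # compresses short-term -> long-term memory
        self.reasoner = reasoner        # LLM invoked only when a query arrives
        self.short_term = deque()
        self.clip_len = clip_len

    def on_frame(self, frame):
        """Runs continuously on the stream; no LLM decoding here."""
        self.short_term.append(self.perception.encode(frame))
        if len(self.short_term) >= self.clip_len:
            # Compress a full clip of short-term features into long-term memory.
            self.memory.compress(list(self.short_term))
            self.short_term.clear()

    def on_query(self, question):
        """Triggered only when the user asks something."""
        context = self.memory.retrieve(question)  # relevant long-term memory
        recent = list(self.short_term)             # not-yet-compressed recent features
        return self.reasoner.answer(question, context, recent)
```

The point the sketch makes is that no expensive LLM decoding sits on the hot streaming path; only `on_query` pays that cost, which is what allows perception and response generation to proceed concurrently.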
The IXC2.5-OL system shows strong performance across diverse benchmarks, outperforming prior MLLM architectures on both audio and video tasks. In automatic speech recognition (ASR), it achieves lower word error rates (WER) than models such as VITA and Mini-Omni on the WenetSpeech and LibriSpeech datasets.
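For reference, WER is the standard ASR metric: the number of word-level substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference length. Below is a minimal, generic implementation via word-level Levenshtein alignment, not the paper's evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: one substitution over four reference words -> WER = 0.25
print(word_error_rate("turn on the lights", "turn off the lights"))
```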
The system also excels on several rigorous video benchmarks, achieving state-of-the-art (SOTA) results among models with fewer than 10 billion parameters. On video evaluation benchmarks such as MLVU and StreamingBench, its ability to handle real-time interactions stands out: it sets a new SOTA among open-source models with a 73.79% overall success rate on StreamingBench.
Implications and Future Directions
InternLM-XComposer2.5-OmniLive significantly advances the field of MLLMs by simulating human cognitive functions, enabling continuous, dynamic interaction with multimodal data streams. Practically, this research opens avenues for applications requiring sustained AI assistance, offering robust solutions in environments that demand high adaptability and accuracy.
Theoretically, the work prompts further exploration of architectures for future AI systems that emulate human-like cognitive functions more closely. The release of code and models on public platforms invites collaborative advances from the wider AI community.
Future work may explore reducing system latency and extending joint training across modalities, leveraging the established foundation for omni-modality integration. Such advances could enable even more seamless, comprehensive interactions in AI systems built for complex, real-world applications.
Overall, the paper contributes significantly to the ongoing evolution of MLLMs and represents an important step towards developing AI systems with enhanced capabilities for long-term, real-time cognitive processing.