InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions

Published 12 Dec 2024 in cs.CV, cs.AI, and cs.CL | arXiv:2412.09596v1

Abstract: Creating AI systems that can interact with environments over long periods, similar to human cognition, has been a longstanding research goal. Recent advancements in multimodal LLMs (MLLMs) have made significant strides in open-world understanding. However, the challenge of continuous and simultaneous streaming perception, memory, and reasoning remains largely unexplored. Current MLLMs are constrained by their sequence-to-sequence architecture, which limits their ability to process inputs and generate responses simultaneously, akin to being unable to think while perceiving. Furthermore, relying on long contexts to store historical data is impractical for long-term interactions, as retaining all information becomes costly and inefficient. Therefore, rather than relying on a single foundation model to perform all functions, this project draws inspiration from the concept of the Specialized Generalist AI and introduces disentangled streaming perception, reasoning, and memory mechanisms, enabling real-time interaction with streaming video and audio input. The proposed framework InternLM-XComposer2.5-OmniLive (IXC2.5-OL) consists of three key modules: (1) Streaming Perception Module: Processes multimodal information in real-time, storing key details in memory and triggering reasoning in response to user queries. (2) Multi-modal Long Memory Module: Integrates short-term and long-term memory, compressing short-term memories into long-term ones for efficient retrieval and improved accuracy. (3) Reasoning Module: Responds to queries and executes reasoning tasks, coordinating with the perception and memory modules. This project simulates human-like cognition, enabling multimodal LLMs to provide continuous and adaptive service over time.

Summary

  • The paper introduces a robust multimodal framework that integrates streaming perception, memory compression, and reasoning to emulate human cognition.
  • It demonstrates superior performance on ASR and video benchmarks, achieving lower WERs and setting new standards on MLVU and StreamingBench.
  • Implementation reveals processing-latency challenges, motivating future work on joint modality training for more seamless interactions.

InternLM-XComposer2.5-OmniLive: A Multi-modal System Overview

Introduction

InternLM-XComposer2.5-OmniLive (IXC2.5-OL) is an advanced AI framework designed for real-time interaction with streaming video and audio. The paper addresses the challenges of continuous and simultaneous perception, memory, and reasoning by introducing a multi-module system inspired by human-like cognition. Drawing on the Specialized Generalist AI paradigm, these modules perform distinct roles in perceiving, memorizing, and reasoning over multimodal inputs, enabling long-term interaction with dynamic environments (Figure 1).

Figure 1: Inspired by human-like cognition and Specialized Generalist AI, IXC2.5-OL facilitates real-time interaction with modules for streaming perception, memory compression, and reasoning.

System Architecture

The IXC2.5-OL framework is structured around three modules that run concurrently:

  1. Streaming Perception Module: This module processes real-time video and audio inputs separately. Video is handled by a live perception model, while audio passes through an encoder, a projector, and a small language model (SLM) for ASR and audio classification tasks.
  2. Multi-modal Long Memory Module: This component integrates short-term and long-term memory, alleviating the inefficiencies associated with maintaining extensive context windows. It compresses accumulating short-term memories into compact long-term forms for efficient retrieval.
  3. Reasoning Module: Activated by perception triggers, this module answers queries using memories retrieved from the memory module, serving as the system's cognitive core (Figure 2); a minimal coordination sketch follows the figure below.

    Figure 2: Pipeline of InternLM-XComposer2.5-OmniLive. IXC2.5-OL is built from three concurrently running modules: streaming perception, multi-modal long memory, and reasoning.
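To make the disentangled design concrete, the following is a minimal sketch of how three concurrently running components might be coordinated with simple queues. The class names and methods (StreamingPerception.encode, LongMemory.insert/retrieve, Reasoner.answer) are illustrative placeholders under simplifying assumptions, not the actual IXC2.5-OL interfaces; in the real system each module is backed by dedicated vision, audio, and LLM models.

```python
# Minimal sketch of a disentangled perception / memory / reasoning loop.
# All names here are illustrative placeholders, not the IXC2.5-OL API.
import queue
import threading
import time


class StreamingPerception:
    """Consumes raw frames and emits compact clip features (placeholder)."""

    def encode(self, frame):
        # A real implementation would run a streaming video/audio encoder here.
        return {"t": time.time(), "feat": frame}


class LongMemory:
    """Holds recent clip features and periodically compresses them."""

    def __init__(self, short_capacity=64):
        self.short_term = []
        self.long_term = []
        self.short_capacity = short_capacity

    def insert(self, feat):
        self.short_term.append(feat)
        if len(self.short_term) >= self.short_capacity:
            # Placeholder for compressing short-term memories into a compact
            # long-term representation, as described in the paper.
            self.long_term.append({"summary_of": len(self.short_term)})
            self.short_term = []

    def retrieve(self, query):
        # Placeholder retrieval: return the newest long- and short-term items.
        return self.long_term[-2:], self.short_term[-8:]


class Reasoner:
    """Stands in for the MLLM that answers queries over retrieved memories."""

    def answer(self, query, memories):
        long_term, short_term = memories
        return f"answer to {query!r} using {len(long_term)} long / {len(short_term)} short memories"


def run(frames, queries):
    perception, memory, reasoner = StreamingPerception(), LongMemory(), Reasoner()
    frame_q = queue.Queue()

    def perceive():
        for frame in frames:
            frame_q.put(perception.encode(frame))
        frame_q.put(None)  # end-of-stream sentinel

    def memorize():
        while (item := frame_q.get()) is not None:
            memory.insert(item)

    workers = [threading.Thread(target=perceive), threading.Thread(target=memorize)]
    for w in workers:
        w.start()
    for q in queries:  # a user query triggers reasoning at any point in the stream
        print(reasoner.answer(q, memory.retrieve(q)))
    for w in workers:
        w.join()


if __name__ == "__main__":
    run(frames=range(200), queries=["what happened in the last minute?"])
```

The design point mirrored here is that perception and memory writing run continuously in the background, while reasoning is only triggered by a user query and works from retrieved memories rather than the full raw stream.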

Technical Evaluation and Standard Benchmarks

IXC2.5-OL demonstrates competitive performance across several benchmarks. On ASR tasks such as WenetSpeech and LibriSpeech, the system achieves markedly lower word error rates (WERs) than contemporary models. For video understanding, IXC2.5-OL sets new standards on benchmarks such as MLVU, Video-MME, and StreamingBench, showcasing its ability to handle both static queries and dynamic real-time interactions.
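For context on what the reported ASR numbers measure, the snippet below is a minimal word error rate (WER) implementation based on word-level edit distance. This is the standard metric definition rather than code from the paper; note that Chinese benchmarks such as WenetSpeech are typically scored with character-level error rates instead.

```python
# Minimal word error rate (WER) computation via word-level edit distance.
# Standard metric definition used in ASR benchmarks such as LibriSpeech,
# not code from the IXC2.5-OL release.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn the first i ref words into the first j hyp words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)


print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 errors / 6 words ≈ 0.33
```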

Benchmark Highlights

  • MLVU: Achieves a marked improvement with an M-Avg of 66.2% (a small aggregation sketch appears after Figure 3).
  • Video-MME and StreamingBench: Exhibits top-tier results in real-time video evaluation.
  • MVBench and MMBench-Video: Outperforms both open-source and closed-source solutions, demonstrating proficiency in temporal reasoning and free-form QA (Figure 3).

    Figure 3: System pipeline of the IXC2.5-OL showing components for capturing, managing, and processing streams.
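As a small worked example of how a headline score like the 66.2% M-Avg is aggregated, the sketch below macro-averages per-category accuracies, assuming M-Avg denotes an unweighted mean over MLVU's multiple-choice task categories. The category names and scores are invented for illustration and are not the paper's actual breakdown.

```python
# Hypothetical per-category accuracies on a multiple-choice video benchmark.
# Categories and numbers are illustrative only; only the aggregation is shown.
per_category_acc = {
    "topic_reasoning": 0.71,
    "anomaly_recognition": 0.64,
    "needle_qa": 0.68,
    "action_order": 0.60,
}

# M-Avg-style aggregation: unweighted mean over task categories,
# so small categories count as much as large ones.
m_avg = sum(per_category_acc.values()) / len(per_category_acc)
print(f"M-Avg: {100 * m_avg:.1f}%")  # 65.8% for these made-up numbers
```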

Implementation Considerations

The IXC2.5-OL framework benefits from disentangling perception, memory, and reasoning, loosely mirroring the division of labor in human cognition. Despite strong benchmark results, processing latency remains an implementation challenge that requires further optimization (a minimal profiling sketch follows below). Additionally, joint training across all modalities could make interactions more seamless across diverse usage contexts.
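Since latency is flagged as the main deployment concern, the following is a minimal sketch for profiling per-stage latency in a streaming loop. The stage functions are placeholders standing in for the perception and memory-write calls; nothing here reflects measured timings from the paper.

```python
# Minimal per-stage latency profiling for a streaming loop.
# The stage functions are placeholders for perception / memory-write calls.
import statistics
import time


def profile_stream(stages, frames):
    """Run each frame through the stages and report per-stage latencies (ms)."""
    latencies = {name: [] for name, _ in stages}
    for frame in frames:
        x = frame
        for name, fn in stages:
            start = time.perf_counter()
            x = fn(x)
            latencies[name].append((time.perf_counter() - start) * 1e3)
    for name, vals in latencies.items():
        print(f"{name:>10}: p50={statistics.median(vals):.3f} ms, "
              f"max={max(vals):.3f} ms")


if __name__ == "__main__":
    # Dummy stages standing in for perception and memory insertion.
    stages = [
        ("perceive", lambda f: f),        # placeholder encoder
        ("memorize", lambda feat: feat),  # placeholder memory write
    ]
    profile_stream(stages, frames=range(1000))
```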

Conclusion

InternLM-XComposer2.5-OmniLive stands as an advanced framework for long-term multimodal interaction with real-time processing capabilities. Its architecture offers a robust blueprint for approaching human-like cognition in AI systems, as reflected in its strong benchmark performance. Future work will focus on reducing system latency and introducing joint modality training to further improve interaction consistency and efficiency.
