InternLM-XComposer2.5-OmniLive Overview

Updated 17 September 2025
  • InternLM-XComposer2.5-OmniLive is an open-source multimodal AI system that integrates real-time streaming perception, efficient memory compression, and adaptive query-driven reasoning.
  • It employs specialized, disentangled modules to separately process audio and video, enabling low-latency, context-rich interactions over extended periods.
  • The system overcomes traditional sequence-to-sequence limitations by synchronizing immediate sensory input with long-term memory, closely simulating human-like continuous cognition.

InternLM-XComposer2.5-OmniLive (IXC2.5-OL) is an open-source multimodal LLM-based system designed to deliver continuous, long-term, real-time video and audio interaction and reasoning. The approach centers on specialized, disentangled modules for streaming perception, memory compression, and query-driven reasoning, oriented towards simulating human-like cognition and overcoming limitations of classical MLLMs in open-world, persistent environments.

1. System Definition and Design Motivation

IXC2.5-OL is architected to enable persistent multimodal interaction—processing simultaneous video and audio streams, maintaining compact and actionable memory representations over protracted temporal windows, and providing query-driven reasoning grounded in both real-time and historically compressed context (Zhang et al., 12 Dec 2024). Prior large vision-language models (LVLMs) typically rely on sequence-to-sequence paradigms, in which all input is acquired before output is generated, impeding true simultaneous perception and inference. IXC2.5-OL breaks this paradigm through a modular architecture—streaming perception, long memory, and asynchronous reasoning—that allows for ongoing perception, memory updating, and query response, closely simulating continuous human cognition.

Key system goals include:

  • Low-latency, real-time multimodal sensory input and semantic encoding
  • Efficient compression and integration of short-term observation into scalable long-term memory
  • Query-driven retrieval and reasoning that incorporates both instantaneous and aggregated historical context

2. Modular Architecture and Pipeline

IXC2.5-OL comprises three principal modules that operate concurrently and interact asynchronously:

Streaming Perception Module

  • Processes raw environmental streams (video and audio) on the fly
  • Audio: Whisper-based encoder → high-dimensional feature vector → audio projector → small LLM (SLM, e.g., Qwen2-1.8B) for both ASR and environmental sound classification
  • Video: Frame sampling (e.g., 1 fps) → vision encoder (CLIP-L/14) → semantic patch features
  • Outputs are buffered for memory compression and instantaneous query triggers

Multi-modal Long Memory Module

  • Receives and compresses outputs from perception
  • Two-level memory: short-term memory (within a video clip) holds down-sampled spatial features plus a per-clip global summary; long-term memory (across clips) holds aggregated, highly compressed representations
  • Compression functions are performed via an LLM-based “Compressor”:
    • $H_k, \hat{H}_k = \mathrm{Compressor}([F_k \circ H_k \circ \hat{H}_k])$
    • $\dot{H} = \mathrm{Compressor}([H_1 \circ H_2 \circ \ldots \circ H_k \circ \hat{H}_1 \circ \hat{H}_2 \circ \ldots \circ \hat{H}_k])$
    • Here, $F_k \in \mathbb{R}^{TN \times C}$ denotes the feature matrix for clip $k$, $H_k$ is the short-term memory, $\hat{H}_k$ is the global summary, and $\dot{H}$ aggregates long-term memory.
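
To make the two-stage compression concrete, the sketch below walks through the shapes involved, with the LLM-based Compressor stubbed by simple pooling. The dimensions, the short-memory length, and the pooling itself are illustrative assumptions rather than the released implementation.

```python
import numpy as np

C = 1024          # channel dimension (illustrative)
TN = 16 * 64      # T frames x N patch tokens per clip (illustrative)

def compressor(x: np.ndarray, short_len: int = 64):
    """Stand-in for the LLM-based Compressor: returns (H_k, H_hat_k)."""
    step = max(1, x.shape[0] // short_len)
    h_k = x[::step][:short_len]               # short-term memory: down-sampled tokens
    h_hat_k = x.mean(axis=0, keepdims=True)   # global summary: one pooled vector
    return h_k, h_hat_k

# Per-clip compression: H_k, H_hat_k = Compressor([F_k ∘ H_k ∘ H_hat_k])
clips = [np.random.randn(TN, C).astype(np.float32) for _ in range(4)]
short_mems, summaries = zip(*(compressor(f) for f in clips))

# Long-term aggregation: H_dot = Compressor([H_1 ∘ ... ∘ H_k ∘ H_hat_1 ∘ ... ∘ H_hat_k])
h_dot, _ = compressor(np.concatenate(list(short_mems) + list(summaries), axis=0))
print(h_dot.shape)   # compact long-term memory, e.g. (64, 1024)
```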

Reasoning Module

  • Accepts formatted user queries and coordinates with perception and memory modules
  • Leverages an enhanced InternLM-XComposer2.5-based model with a memory-projector to align retrieved memory features with the LLM input space
  • Performs instruction prediction filtering to ensure that only legitimate queries (not ambient noise) trigger response generation
  • On query, retrieves and integrates relevant historical memory and recent perception features, then generates a context-rich response
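
The gate-then-answer control flow can be sketched as follows. The is_instruction heuristic stands in for the learned instruction-prediction filter, and the memory-retrieval and generation callables are injected stubs, so this is a behavioural illustration only.

```python
def is_instruction(transcript: str) -> bool:
    """Stand-in for the learned instruction-prediction filter (the real gate is a model)."""
    t = transcript.strip().lower()
    return t.endswith("?") or t.startswith(("please", "describe", "what", "when", "where"))

def handle_audio_event(transcript, retrieve_memory, generate, recent_features):
    """Gate the transcript, then retrieve memory and answer; the callables are injected stubs."""
    if not is_instruction(transcript):
        return None                                  # ambient speech/noise: no generation
    retrieved = retrieve_memory(transcript)          # relevant clips from long-term memory
    return generate(query=transcript, memory=retrieved, video=recent_features)

# Usage with trivial stubs in place of the real memory module and LLM:
print(handle_audio_event("background chatter", lambda q: [], lambda **kw: "...", []))
print(handle_audio_event("What happened earlier?", lambda q: ["clip_3"],
                         lambda **kw: f"answer grounded in {kw['memory']}", []))
```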

High-Level Diagram

┌───────────────┐
│ Frontend      │
│ (Capture)     │
└─────┬─────────┘
      ▼
┌────────────────────┐
│Streaming Perception│
│ Audio Translation  │
│ Video Perception   │
└─────┬──────────────┘
      ▼
┌────────────────────────┐
│Multi-modal Long Memory │
│ Compression & Fusion   │
└─────┬──────────────────┘
      ▼
┌───────────────────────┐
│Reasoning Module       │
│ Query & Response      │
└───────────────────────┘

3. Streaming Perception: Algorithms and Implementation

The system decouples video and audio processing, retaining each modality’s semantic clarity while facilitating distributed computation:

Audio Translation:

  • Encoded via Whisper; projected via an MLP into the SLM's token space
  • SLM performs ASR and audio event classification concurrently
  • Model is pre-trained for ASR (GigaSpeech, WenetSpeech) and fine-tuned for environmental event classification
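
A minimal sketch of this audio path is shown below, using the Hugging Face Whisper implementation as a stand-in for the paper's exact checkpoint; the projector architecture and the SLM hidden width are illustrative assumptions.

```python
import torch
from transformers import WhisperFeatureExtractor, WhisperModel

extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
whisper = WhisperModel.from_pretrained("openai/whisper-small").eval()

slm_hidden = 2048                                     # illustrative SLM width
projector = torch.nn.Sequential(                      # MLP into the SLM token space
    torch.nn.Linear(whisper.config.d_model, slm_hidden),
    torch.nn.GELU(),
    torch.nn.Linear(slm_hidden, slm_hidden),
)

waveform = torch.zeros(16000)                         # 1 s of silence at 16 kHz (placeholder)
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    audio_feats = whisper.get_encoder()(inputs.input_features).last_hidden_state
audio_tokens = projector(audio_feats)                 # fed to the SLM for ASR / sound classification
print(audio_tokens.shape)
```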

Video Perception:

  • Frame sampling provides temporal spread while managing resource constraints
  • CLIP-L/14 vision encoder outputs semantic patch tokens for each frame
  • Features are sent to the memory module for spatial downsampling and compression
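
A corresponding sketch of the video path follows, assuming the Hugging Face CLIP ViT-L/14 vision tower and a simple stride-based sampler that approximates 1 fps; the frame source here is a blank placeholder.

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
vision = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14").eval()

def sample_frames(frames, fps, target_fps=1.0):
    """Keep roughly one frame per second from a decoded frame list."""
    step = max(1, int(round(fps / target_fps)))
    return frames[::step]

frames = [Image.new("RGB", (224, 224)) for _ in range(90)]   # placeholder: 3 s of 30 fps video
sampled = sample_frames(frames, fps=30)                      # -> ~3 frames at 1 fps
pixel_values = processor(images=sampled, return_tensors="pt").pixel_values
with torch.no_grad():
    patch_tokens = vision(pixel_values).last_hidden_state    # (frames, patches + CLS, dim)
print(patch_tokens.shape)                                    # handed to the memory module
```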

This structure enables real-time, parallel sensory encoding with scope for independent scaling and optimization per channel.
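
The sketch below illustrates that parallelism in a self-contained way: one dedicated thread per modality pushes placeholder feature packets into a shared buffer that the memory module would drain. The chunk rates and packet contents are assumptions.

```python
import queue
import threading
import time

feature_buffer: "queue.Queue[tuple]" = queue.Queue()   # shared buffer drained by the memory module

def audio_worker(stop: threading.Event):
    while not stop.is_set():
        feature_buffer.put(("audio", time.time(), [0.0]))   # placeholder audio features
        time.sleep(0.25)                                     # ~4 audio chunks per second

def video_worker(stop: threading.Event):
    while not stop.is_set():
        feature_buffer.put(("video", time.time(), [0.0]))   # placeholder frame features
        time.sleep(1.0)                                      # ~1 sampled frame per second

stop = threading.Event()
workers = [threading.Thread(target=w, args=(stop,), daemon=True) for w in (audio_worker, video_worker)]
for w in workers:
    w.start()
time.sleep(2.0)
stop.set()
print(f"buffered {feature_buffer.qsize()} feature packets")
```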

4. Multi-modal Long Memory: Compression and Retrieval Mechanisms

Addressing the impractical cost of storing all sequence data for long-term interactions, IXC2.5-OL uses an LLM-based compressor as follows:

  • Video clips are represented as feature matrices $F_k$ and compressed to short-term memories $H_k$, each with an additional global summary vector $\hat{H}_k$
  • Multiple clips' short-term memories and global summaries are compressed to form the long-term memory $\dot{H}$
  • Upon query receipt, the query is encoded and concatenated with long-term memory representations. Similarity scores between the query embedding and each global memory vector $\hat{H}_k$ enable rapid identification and retrieval of the most relevant clips for reasoning
  • This process makes the storage and retrieval of relevant historical context feasible over protracted continuous sessions
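
A minimal sketch of this retrieval step follows, scoring cosine similarity between a query embedding and the stacked global summaries $\hat{H}_k$; the embedding dimension and top-k value are illustrative.

```python
import numpy as np

def retrieve(query_emb: np.ndarray, clip_summaries: np.ndarray, top_k: int = 2):
    """Return indices of the top_k clips whose global summaries best match the query."""
    q = query_emb / np.linalg.norm(query_emb)
    s = clip_summaries / np.linalg.norm(clip_summaries, axis=1, keepdims=True)
    scores = s @ q                                   # cosine similarity per clip
    return np.argsort(scores)[::-1][:top_k], scores

rng = np.random.default_rng(0)
summaries = rng.standard_normal((10, 1024))          # H_hat_1 ... H_hat_10 (illustrative)
query = rng.standard_normal(1024)                    # encoded user query (illustrative)
idx, scores = retrieve(query, summaries)
print("retrieved clips:", idx, "scores:", scores[idx].round(3))
```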

5. Reasoning and Adaptive Query Response

The Reasoning Module processes queries in a structured format integrating the user's question, retrieved memory, and related clip information. Signal quality is preserved via instruction prediction, preventing spurious activations triggered by environmental noise. Coordinated hand-offs between modules ensure the model grounds its outputs in both immediate reality and relevant historical memory.

Query Input Format:

  • "Question: <|Que|>"
  • Referenced video information: "<|Img|>"
  • Retrieved memory: "<|Mem|>"

This format aligns with experimental protocols for grounding reasoning in both spatial and temporal context.
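
Assembling such a prompt might look like the following; the helper and the slot counts are hypothetical, and only the placeholder tokens themselves come from the format above.

```python
TEMPLATE = "Question: <|Que|>\n<|Img|>\n<|Mem|>"

def build_prompt(question: str, num_img: int = 4, num_mem: int = 4) -> str:
    # <|Img|>/<|Mem|> slots are later replaced by projected video and retrieved
    # memory features; the slot counts used here are purely illustrative.
    return (TEMPLATE
            .replace("<|Que|>", question)
            .replace("<|Img|>", "<|Img|>" * num_img)
            .replace("<|Mem|>", "<|Mem|>" * num_mem))

print(build_prompt("What did the presenter place on the table earlier?"))
```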

6. Innovations, Technical Challenges, and Model Comparison

IXC2.5-OL’s primary technical innovations include:

  • Disentangled workflow: streaming perception, compressed memory, adaptive reasoning, in contrast to monolithic sequence models
  • Real-time multimodal integration: concurrent processing through dedicated threads for each modality
  • Memory-efficient compression: LLM-based compressor mitigates prohibitive long-context cost
  • Robust query filtering: instruction prediction mechanism ensures only valid user input triggers reasoning

These innovations shift the paradigm from static, context-limited input-output exchanges towards continuous, context-rich cognition. Compared to prior InternLM-XComposer releases (Zhang et al., 2023, Dong et al., 29 Jan 2024, Zhang et al., 3 Jul 2024), IXC2.5-OL extends functionality from free-form interleaved text-image tasks and ultra-long-context support to persistent real-time multimodal interaction and memory.

7. Applications and Future Directions

Practical scenarios include:

  • Persistent AI assistants (smart home, surveillance) maintaining contextual relevance and historical continuity
  • Real-time multimodal interfaces for events, education, or collaborative environments
  • Advanced multimedia search, retrieval, and question answering from extensive video streams
  • Testbed for AI systems simulating human-like cognitive architectures—perception, memory, and reasoning as dynamically interacting yet specialized subsystems

Prospective research efforts include extending "long-context video understanding"—enabling entire movie-level analysis—and improving multi-turn, multi-image dialogue retention. The open-source ethos (Zhang et al., 12 Dec 2024) facilitates broad adoption, adaptation, and continued progression towards more adaptive, context-aware multimodal AI systems.


In summary, InternLM-XComposer2.5-OmniLive introduces a modular, real-time approach to streaming multimodal perception, memory compression, and query-driven reasoning, marking a substantial advance in continuous interactive AI and efficient long-term multimodal cognition.
