InternLM-XComposer2.5-OmniLive Overview
- InternLM-XComposer2.5-OmniLive is an open-source multimodal AI system that integrates real-time streaming perception, efficient memory compression, and adaptive query-driven reasoning.
- It employs specialized, disentangled modules to separately process audio and video, enabling low-latency, context-rich interactions over extended periods.
- The system overcomes traditional sequence-to-sequence limitations by synchronizing immediate sensory input with long-term memory, closely simulating human-like continuous cognition.
InternLM-XComposer2.5-OmniLive (IXC2.5-OL) is an open-source multimodal LLM-based system designed to deliver continuous, long-term, real-time video and audio interaction and reasoning. The approach centers on specialized, disentangled modules for streaming perception, memory compression, and query-driven reasoning, oriented towards simulating human-like cognition and overcoming limitations of classical MLLMs in open-world, persistent environments.
1. System Definition and Design Motivation
IXC2.5-OL is architected to enable persistent multimodal interaction—processing simultaneous video and audio streams, maintaining compact and actionable memory representations over protracted temporal windows, and providing query-driven reasoning grounded in both real-time and historically-compressed context (Zhang et al., 12 Dec 2024). Prior large vision-LLMs (LVLMs) typically rely on sequence-to-sequence paradigms, in which all input is acquired before output is generated, impeding true simultaneous perception and inference. IXC2.5-OL breaks this paradigm through a modular architecture—streaming perception, long memory, and asynchronous reasoning—that allows for ongoing perception, memory updating, and query response, closely simulating human continuous cognition.
Key system goals include:
- Low-latency, real-time multimodal sensory input and semantic encoding
- Efficient compression and integration of short-term observation into scalable long-term memory
- Query-driven retrieval and reasoning that incorporates both instantaneous and aggregated historical context
2. Modular Architecture and Pipeline
IXC2.5-OL comprises three principal modules that operate concurrently and interact asynchronously:
Streaming Perception Module
- Processes raw video and audio streams on the fly
- Audio: Whisper-based encoder → high-dimensional feature vector → audio projector → small LLM (SLM, e.g., Qwen2-1.8B) for both ASR and environmental sound classification
- Video: Frame-sampling (e.g., 1fps) → vision encoder (CLIP-L/14) → semantic patch features
- Outputs are buffered for memory compression and instantaneous query triggers
Multi-modal Long Memory Module
- Receives and compresses outputs from perception
- Two-level memory: Short-term (within video clip), down-sampled spatial features and global summary; Long-term (across clips), aggregated and highly compressed representations
- Compression is performed via an LLM-based “Compressor”. Writing $F_k$ for the feature matrix of clip $k$, the compressor yields a short-term memory $M^s_k$ and a global summary vector $g_k$; the short-term memories and summaries accumulated so far are further compressed into the long-term memory $M^l$:

$$(M^s_k,\ g_k) = \mathrm{Compressor}(F_k), \qquad M^l = \mathrm{Compressor}\big(\{(M^s_i,\ g_i)\}_{i \le k}\big)$$

- Here, $F_k$ denotes the feature matrix for clip $k$, $M^s_k$ is the short-term memory, $g_k$ is the global summary, and $M^l$ aggregates the long-term memory.
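The two-level scheme can be sketched as follows; the `MemoryStore` class, its method names, and the tensor shapes are illustrative assumptions, with `compressor` standing in for the LLM-based Compressor:

```python
import torch

class MemoryStore:
    """Two-level memory: per-clip short-term features plus a compact long-term store.
    `compressor` is a stand-in for the LLM-based Compressor; only the interface
    assumed by this sketch is shown, not its internals."""

    def __init__(self, compressor):
        self.compressor = compressor
        self.short_terms = []       # M^s_k for each clip k
        self.global_summaries = []  # g_k for each clip k
        self.long_term = None       # M^l

    def add_clip(self, clip_features: torch.Tensor):
        # Short-term compression of one clip's patch features F_k.
        m_s, g = self.compressor.compress_clip(clip_features)   # (m, d), (d,)
        self.short_terms.append(m_s)
        self.global_summaries.append(g)
        # Re-aggregate everything seen so far into the long-term memory M^l.
        self.long_term = self.compressor.compress_history(
            torch.cat(self.short_terms, dim=0),
            torch.stack(self.global_summaries, dim=0),
        )
```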
Reasoning Module
- Accepts formatted user queries and coordinates with perception and memory modules
- Leverages an enhanced InternLM-XComposer2.5-based model with a memory-projector to align retrieved memory features with the LLM input space
- Performs instruction prediction filtering to ensure that only legitimate queries (not ambient noise) trigger response generation
- On query, retrieves and integrates relevant historical memory and recent perception features, then generates a context-rich response
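A condensed view of this query path is sketched below; `gate`, `retriever`, `projector`, and `llm` are hypothetical stand-ins for the instruction-prediction filter, memory retrieval, memory projector, and the IXC2.5-based model, respectively:

```python
def answer_query(query, memory, perception_buffer, gate, retriever, projector, llm):
    """Query-driven reasoning path (illustrative interfaces only)."""
    if not gate.is_instruction(query):          # instruction prediction filter:
        return None                             # ambient speech/noise gets no response
    clips = retriever.retrieve(query, memory)   # similarity search over global summaries
    mem_tokens = projector(clips)               # align memory features with the LLM space
    recent = perception_buffer.latest()         # most recent perception features
    return llm.generate(question=query, video=recent, memory=mem_tokens)
```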
High-Level Diagram
```
┌───────────────┐
│   Frontend    │
│   (Capture)   │
└─────┬─────────┘
      ▼
┌────────────────────┐
│Streaming Perception│
│ Audio Translation  │
│ Video Perception   │
└─────┬──────────────┘
      ▼
┌────────────────────────┐
│Multi-modal Long Memory │
│ Compression & Fusion   │
└─────┬──────────────────┘
      ▼
┌───────────────────────┐
│Reasoning Module       │
│ Query & Response      │
└───────────────────────┘
```
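The modules interact asynchronously rather than in a single sequence-to-sequence pass. A minimal threading sketch of how the stages above could be wired together is shown below; the queue names and worker functions are illustrative, not the project's actual implementation:

```python
import queue
import threading

# Queues connecting the pipeline stages shown in the diagram above.
frames_q = queue.Queue()    # frontend capture  -> streaming perception
features_q = queue.Queue()  # perception output -> long memory
queries_q = queue.Queue()   # user queries      -> reasoning

def perception_worker(encode):
    """Continuously encode raw audio/video chunks into semantic features."""
    while True:
        chunk = frames_q.get()
        features_q.put(encode(chunk))

def memory_worker(store):
    """Continuously fold new features into short- and long-term memory."""
    while True:
        store.add_clip(features_q.get())

def reasoning_worker(respond):
    """Answer queries as they arrive, independently of the other stages."""
    while True:
        print(respond(queries_q.get()))

def start(encode, store, respond):
    # Each stage runs in its own daemon thread, so perception and memory
    # updating never block while a query is being answered.
    for fn, arg in ((perception_worker, encode),
                    (memory_worker, store),
                    (reasoning_worker, respond)):
        threading.Thread(target=fn, args=(arg,), daemon=True).start()
```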
3. Streaming Perception: Algorithms and Implementation
The system decouples video and audio processing, retaining each modality’s semantic clarity while facilitating distributed computation:
Audio Translation:
- Encoded via Whisper; projected via MLP into SLM’s token space
- SLM performs ASR and audio event classification concurrently
- Model is pre-trained for ASR (GigaSpeech, WenetSpeech) and fine-tuned for environmental event classification
Video Perception:
- Frame sampling provides temporal spread while managing resource constraints
- CLIP-L/14 vision encoder outputs semantic patch tokens for each frame
- Features are sent to the memory module for spatial downsampling and compression
This structure enables real-time, parallel sensory encoding with scope for independent scaling and optimization per channel.
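A per-channel encoding sketch follows, with `whisper_encoder`, `audio_projector`, `small_lm`, and `clip_vision` as hypothetical stand-ins for the Whisper encoder, MLP projector, SLM, and CLIP-L/14 encoder described above:

```python
import torch

def encode_audio(waveform, whisper_encoder, audio_projector, small_lm):
    """Audio path: Whisper-style encoder -> MLP projector -> small LM that emits
    both a transcript and an environmental-sound label (all components are stand-ins)."""
    feats = whisper_encoder(waveform)           # (T, d_audio) high-dimensional features
    tokens = audio_projector(feats)             # project into the SLM token space
    transcript, sound_label = small_lm(tokens)  # ASR + audio event classification
    return transcript, sound_label

def encode_video(frames, clip_vision, fps_stride=30):
    """Video path: sample frames (e.g. every 30th frame of a 30 fps stream, i.e. ~1 fps),
    encode each with a CLIP-L/14-style vision encoder, and return patch features."""
    sampled = frames[::fps_stride]
    patch_feats = [clip_vision(f) for f in sampled]  # (patches, d_vis) per frame
    return torch.stack(patch_feats)                  # forwarded for downsampling/compression
```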
4. Multi-modal Long Memory: Compression and Retrieval Mechanisms
Addressing the impractical cost of storing all sequence data for long-term interactions, IXC2.5-OL uses an LLM-based compressor as follows:
- Video clips are represented as feature matrices $F_k$ and compressed to short-term memories $M^s_k$, each with an additional global summary vector $g_k$
- Multiple clips' short-term memories and global summaries are compressed to form the long-term memory $M^l$
- Upon query receipt, the query is encoded and concatenated with the long-term memory representations; similarity scores between the query embedding and each global summary vector $g_k$ enable rapid identification and retrieval of the most relevant clips for reasoning (see the sketch after this list)
- This process makes the storage and retrieval of relevant historical context feasible over protracted continuous sessions
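A minimal retrieval sketch under these assumptions, scoring the query embedding against each clip's global summary vector (the top-k cutoff and tensor shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def retrieve_clips(query_embedding, global_summaries, short_terms, top_k=3):
    """Score the query embedding against each clip's global summary vector g_k and
    return the short-term memories of the best-matching clips."""
    # query_embedding: (d,); global_summaries: (num_clips, d)
    scores = F.cosine_similarity(global_summaries, query_embedding.unsqueeze(0), dim=-1)
    top = torch.topk(scores, k=min(top_k, len(short_terms))).indices
    return [short_terms[i] for i in top.tolist()], scores[top]
```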
5. Reasoning and Adaptive Query Response
The Reasoning Module processes queries in a structured format integrating the user's question, retrieved memory, and related clip information. Signal quality is preserved via instruction prediction, preventing spurious activations triggered by environmental noise. Coordinated hand-offs between modules ensure the model grounds its outputs in both immediate reality and relevant historical memory.
Query Input Format:
- "Question: <|Que|>"
- Referenced video information: "<|Img|>"
- Retrieved memory: "<|Mem|>"
This format aligns with experimental protocols for grounding reasoning in both spatial and temporal context.
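As a concrete illustration, a query could be assembled as below; the assumption here is that <|Que|> is replaced by the user's question text, while <|Img|> and <|Mem|> remain placeholder tokens later substituted with projected video and memory features:

```python
def build_query_prompt(question: str) -> str:
    """Assemble the structured query format (assumed substitution of <|Que|>;
    <|Img|> and <|Mem|> are filled with projected features downstream)."""
    return (
        f"Question: {question}\n"
        "Referenced video information: <|Img|>\n"
        "Retrieved memory: <|Mem|>"
    )

# Example: build_query_prompt("What was placed on the table earlier?")
```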
6. Innovations, Technical Challenges, and Model Comparison
IXC2.5-OL’s primary technical innovations include:
- Disentangled workflow: streaming perception, compressed memory, adaptive reasoning, in contrast to monolithic sequence models
- Real-time multimodal integration: concurrent processing through dedicated threads for each modality
- Memory-efficient compression: LLM-based compressor mitigates prohibitive long-context cost
- Robust query filtering: instruction prediction mechanism ensures only valid user input triggers reasoning
These innovations shift the paradigm from static, context-limited input-output exchanges towards continuous, context-rich cognition. Compared to prior InternLM-XComposer releases (Zhang et al., 2023, Dong et al., 29 Jan 2024, Zhang et al., 3 Jul 2024), IXC2.5-OL extends functionality from free-form interleaved text-image tasks and ultra-long-context support to persistent, real-time multimodal interaction and memory.
7. Applications and Future Directions
Practical scenarios include:
- Persistent AI assistants (smart home, surveillance) maintaining contextual relevance and historical continuity
- Real-time multimodal interfaces for events, education, or collaborative environments
- Advanced multimedia search, retrieval, and question answering from extensive video streams
- Testbed for AI systems simulating human-like cognitive architectures—perception, memory, and reasoning as dynamically interacting yet specialized subsystems
Prospective research efforts include extending "long-context video understanding"—enabling entire movie-level analysis—and improving multi-turn, multi-image dialogue retention. The open-source ethos (Zhang et al., 12 Dec 2024) facilitates broad adoption, adaptation, and continued progression towards more adaptive, context-aware multimodal AI systems.
In summary, InternLM-XComposer2.5-OmniLive introduces a modular, real-time approach to streaming multimodal perception, memory compression, and query-driven reasoning, marking a substantial advance in continuous interactive AI and efficient long-term multimodal cognition.