InternLM-XComposer2.5-OmniLive Overview
- InternLM-XComposer2.5-OmniLive is an open-source multimodal AI system that integrates real-time streaming perception, efficient memory compression, and adaptive query-driven reasoning.
- It employs specialized, disentangled modules to separately process audio and video, enabling low-latency, context-rich interactions over extended periods.
- The system overcomes traditional sequence-to-sequence limitations by synchronizing immediate sensory input with long-term memory, closely simulating human-like continuous cognition.
InternLM-XComposer2.5-OmniLive (IXC2.5-OL) is an open-source multimodal LLM-based system designed to deliver continuous, long-term, real-time video and audio interaction and reasoning. The approach centers on specialized, disentangled modules for streaming perception, memory compression, and query-driven reasoning, oriented towards simulating human-like cognition and overcoming limitations of classical MLLMs in open-world, persistent environments.
1. System Definition and Design Motivation
IXC2.5-OL is architected to enable persistent multimodal interaction—processing simultaneous video and audio streams, maintaining compact and actionable memory representations over protracted temporal windows, and providing query-driven reasoning grounded in both real-time and historically-compressed context (Zhang et al., 12 Dec 2024). Prior large vision-LLMs (LVLMs) typically rely on sequence-to-sequence paradigms, in which all input is acquired before output is generated, impeding true simultaneous perception and inference. IXC2.5-OL breaks this paradigm through a modular architecture—streaming perception, long memory, and asynchronous reasoning—that allows for ongoing perception, memory updating, and query response, closely simulating human continuous cognition.
Key system goals include:
- Low-latency, real-time multimodal sensory input and semantic encoding
- Efficient compression and integration of short-term observation into scalable long-term memory
- Query-driven retrieval and reasoning that incorporates both instantaneous and aggregated historical context
2. Modular Architecture and Pipeline
IXC2.5-OL comprises three principal modules that operate concurrently and interact asynchronously:
Streaming Perception Module
- Processes raw video and audio streams on the fly
- Audio: Whisper-based encoder → high-dimensional feature vector → audio projector → small LLM (SLM, e.g., Qwen2-1.8B) for both ASR and environmental sound classification
- Video: Frame-sampling (e.g., 1fps) → vision encoder (CLIP-L/14) → semantic patch features
- Outputs are buffered for memory compression and instantaneous query triggers
Multi-modal Long Memory Module
- Receives and compresses outputs from perception
- Two-level memory: Short-term (within video clip), down-sampled spatial features and global summary; Long-term (across clips), aggregated and highly compressed representations
- Compression is performed via an LLM-based “Compressor”. Writing $F_k$ for the feature matrix of clip $k$, the compressor yields a short-term memory $M^s_k$ and a global summary vector $g_k$; the short-term memories and summaries accumulated so far are further compressed into the long-term memory $M^l$:

$$(M^s_k,\ g_k) = \mathrm{Compressor}(F_k), \qquad M^l = \mathrm{Compressor}\big(\{(M^s_i,\ g_i)\}_{i \le k}\big)$$

- Here, $F_k$ denotes the feature matrix for clip $k$, $M^s_k$ is the short-term memory, $g_k$ is the global summary, and $M^l$ aggregates the long-term memory.
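The two-level scheme can be sketched as follows; the `MemoryStore` class, its method names, and the tensor shapes are illustrative assumptions, with `compressor` standing in for the LLM-based Compressor:

```python
import torch

class MemoryStore:
    """Two-level memory: per-clip short-term features plus a compact long-term store.
    `compressor` is a stand-in for the LLM-based Compressor; only the interface
    assumed by this sketch is shown, not its internals."""

    def __init__(self, compressor):
        self.compressor = compressor
        self.short_terms = []       # M^s_k for each clip k
        self.global_summaries = []  # g_k for each clip k
        self.long_term = None       # M^l

    def add_clip(self, clip_features: torch.Tensor):
        # Short-term compression of one clip's patch features F_k.
        m_s, g = self.compressor.compress_clip(clip_features)   # (m, d), (d,)
        self.short_terms.append(m_s)
        self.global_summaries.append(g)
        # Re-aggregate everything seen so far into the long-term memory M^l.
        self.long_term = self.compressor.compress_history(
            torch.cat(self.short_terms, dim=0),
            torch.stack(self.global_summaries, dim=0),
        )
```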
Reasoning Module
- Accepts formatted user queries and coordinates with perception and memory modules
- Leverages an enhanced InternLM-XComposer2.5-based model with a memory-projector to align retrieved memory features with the LLM input space
- Performs instruction prediction filtering to ensure that only legitimate queries (not ambient noise) trigger response generation
- On query, retrieves and integrates relevant historical memory and recent perception features, then generates a context-rich response
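A condensed view of this query path is sketched below; `gate`, `retriever`, `projector`, and `llm` are hypothetical stand-ins for the instruction-prediction filter, memory retrieval, memory projector, and the IXC2.5-based model, respectively:

```python
def answer_query(query, memory, perception_buffer, gate, retriever, projector, llm):
    """Query-driven reasoning path (illustrative interfaces only)."""
    if not gate.is_instruction(query):          # instruction prediction filter:
        return None                             # ambient speech/noise gets no response
    clips = retriever.retrieve(query, memory)   # similarity search over global summaries
    mem_tokens = projector(clips)               # align memory features with the LLM space
    recent = perception_buffer.latest()         # most recent perception features
    return llm.generate(question=query, video=recent, memory=mem_tokens)
```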
High-Level Diagram
```
┌───────────────┐
│   Frontend    │
│   (Capture)   │
└─────┬─────────┘
      ▼
┌────────────────────┐
│Streaming Perception│
│ Audio Translation  │
│ Video Perception   │
└─────┬──────────────┘
      ▼
┌────────────────────────┐
│Multi-modal Long Memory │
│ Compression & Fusion   │
└─────┬──────────────────┘
      ▼
┌───────────────────────┐
│Reasoning Module       │
│ Query & Response      │
└───────────────────────┘
```
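The modules interact asynchronously rather than in a single sequence-to-sequence pass. A minimal threading sketch of how the stages above could be wired together is shown below; the queue names and worker functions are illustrative, not the project's actual implementation:

```python
import queue
import threading

# Queues connecting the pipeline stages shown in the diagram above.
frames_q = queue.Queue()    # frontend capture  -> streaming perception
features_q = queue.Queue()  # perception output -> long memory
queries_q = queue.Queue()   # user queries      -> reasoning

def perception_worker(encode):
    """Continuously encode raw audio/video chunks into semantic features."""
    while True:
        chunk = frames_q.get()
        features_q.put(encode(chunk))

def memory_worker(store):
    """Continuously fold new features into short- and long-term memory."""
    while True:
        store.add_clip(features_q.get())

def reasoning_worker(respond):
    """Answer queries as they arrive, independently of the other stages."""
    while True:
        print(respond(queries_q.get()))

def start(encode, store, respond):
    # Each stage runs in its own daemon thread, so perception and memory
    # updating never block while a query is being answered.
    for fn, arg in ((perception_worker, encode),
                    (memory_worker, store),
                    (reasoning_worker, respond)):
        threading.Thread(target=fn, args=(arg,), daemon=True).start()
```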
3. Streaming Perception: Algorithms and Implementation
The system decouples video and audio processing, retaining each modality’s semantic clarity while facilitating distributed computation:
Audio Translation:
- Encoded via Whisper; projected via MLP into SLM’s token space
- SLM performs ASR and audio event classification concurrently
- Model is pre-trained for ASR (GigaSpeech, WenetSpeech) and fine-tuned for environmental event classification
Video Perception:
- Frame sampling provides temporal spread while managing resource constraints
- CLIP-L/14 vision encoder outputs semantic patch tokens for each frame
- Features are sent to the memory module for spatial downsampling and compression
This structure enables real-time, parallel sensory encoding with scope for independent scaling and optimization per channel.
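A per-channel encoding sketch follows, with `whisper_encoder`, `audio_projector`, `small_lm`, and `clip_vision` as hypothetical stand-ins for the Whisper encoder, MLP projector, SLM, and CLIP-L/14 encoder described above:

```python
import torch

def encode_audio(waveform, whisper_encoder, audio_projector, small_lm):
    """Audio path: Whisper-style encoder -> MLP projector -> small LM that emits
    both a transcript and an environmental-sound label (all components are stand-ins)."""
    feats = whisper_encoder(waveform)           # (T, d_audio) high-dimensional features
    tokens = audio_projector(feats)             # project into the SLM token space
    transcript, sound_label = small_lm(tokens)  # ASR + audio event classification
    return transcript, sound_label

def encode_video(frames, clip_vision, fps_stride=30):
    """Video path: sample frames (e.g. every 30th frame of a 30 fps stream, i.e. ~1 fps),
    encode each with a CLIP-L/14-style vision encoder, and return patch features."""
    sampled = frames[::fps_stride]
    patch_feats = [clip_vision(f) for f in sampled]  # (patches, d_vis) per frame
    return torch.stack(patch_feats)                  # forwarded for downsampling/compression
```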
4. Multi-modal Long Memory: Compression and Retrieval Mechanisms
Addressing the impractical cost of storing all sequence data for long-term interactions, IXC2.5-OL uses an LLM-based compressor as follows:
- Video clips are represented as feature matrices $F_k$ and compressed to short-term memories $M^s_k$, each with an additional global summary vector $g_k$
- Multiple clips' short-term memories and global summaries are compressed to form the long-term memory $M^l$
- Upon query receipt, the query is encoded and concatenated with the long-term memory representations; similarity scores between the query embedding and each global summary vector $g_k$ enable rapid identification and retrieval of the most relevant clips for reasoning (see the sketch after this list)
- This process makes the storage and retrieval of relevant historical context feasible over protracted continuous sessions
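A minimal retrieval sketch under these assumptions, scoring the query embedding against each clip's global summary vector (the top-k cutoff and tensor shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def retrieve_clips(query_embedding, global_summaries, short_terms, top_k=3):
    """Score the query embedding against each clip's global summary vector g_k and
    return the short-term memories of the best-matching clips."""
    # query_embedding: (d,); global_summaries: (num_clips, d)
    scores = F.cosine_similarity(global_summaries, query_embedding.unsqueeze(0), dim=-1)
    top = torch.topk(scores, k=min(top_k, len(short_terms))).indices
    return [short_terms[i] for i in top.tolist()], scores[top]
```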
5. Reasoning and Adaptive Query Response
The Reasoning Module processes queries in a structured format integrating the user's question, retrieved memory, and related clip information. Signal quality is preserved via instruction prediction, preventing spurious activations triggered by environmental noise. Coordinated hand-offs between modules ensure the model grounds its outputs in both immediate reality and relevant historical memory.
Query Input Format:
- "Question: <|Que|>"
- Referenced video information: "<|Img|>"
- Retrieved memory: "<|Mem|>"
This format aligns with experimental protocols for grounding reasoning in both spatial and temporal context.
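As a concrete illustration, a query could be assembled as below; the assumption here is that <|Que|> is replaced by the user's question text, while <|Img|> and <|Mem|> remain placeholder tokens later substituted with projected video and memory features:

```python
def build_query_prompt(question: str) -> str:
    """Assemble the structured query format (assumed substitution of <|Que|>;
    <|Img|> and <|Mem|> are filled with projected features downstream)."""
    return (
        f"Question: {question}\n"
        "Referenced video information: <|Img|>\n"
        "Retrieved memory: <|Mem|>"
    )

# Example: build_query_prompt("What was placed on the table earlier?")
```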
6. Innovations, Technical Challenges, and Model Comparison
IXC2.5-OL’s primary technical innovations include:
- Disentangled workflow: streaming perception, compressed memory, adaptive reasoning, in contrast to monolithic sequence models
- Real-time multimodal integration: concurrent processing through dedicated threads for each modality
- Memory-efficient compression: LLM-based compressor mitigates prohibitive long-context cost
- Robust query filtering: instruction prediction mechanism ensures only valid user input triggers reasoning
These innovations shift the paradigm from static, context-limited input-output exchanges towards continuous, context-rich cognition. Compared to prior InternLM-XComposer releases (Zhang et al., 2023, Dong et al., 29 Jan 2024, Zhang et al., 3 Jul 2024), IXC2.5-OL extends functionality from free-form interleaved text-image tasks and ultra-long-context support to persistent, real-time multimodal interaction and memory.
7. Applications and Future Directions
Practical scenarios include:
- Persistent AI assistants (smart home, surveillance) maintaining contextual relevance and historical continuity
- Real-time multimodal interfaces for events, education, or collaborative environments
- Advanced multimedia search, retrieval, and question answering from extensive video streams
- Testbed for AI systems simulating human-like cognitive architectures—perception, memory, and reasoning as dynamically interacting yet specialized subsystems
Prospective research efforts include extending "long-context video understanding"—enabling entire movie-level analysis—and improving multi-turn, multi-image dialogue retention. The open-source ethos (Zhang et al., 12 Dec 2024) facilitates broad adoption, adaptation, and continued progression towards more adaptive, context-aware multimodal AI systems.
In summary, InternLM-XComposer2.5-OmniLive introduces a modular, real-time approach to streaming multimodal perception, memory compression, and query-driven reasoning, marking a substantial advance in continuous interactive AI and efficient long-term multimodal cognition.