
InternLM-XComposer2.5-OmniLive Overview

Updated 17 September 2025
  • InternLM-XComposer2.5-OmniLive is an open-source multimodal AI system that integrates real-time streaming perception, efficient memory compression, and adaptive query-driven reasoning.
  • It employs specialized, disentangled modules to separately process audio and video, enabling low-latency, context-rich interactions over extended periods.
  • The system overcomes traditional sequence-to-sequence limitations by synchronizing immediate sensory input with long-term memory, closely simulating human-like continuous cognition.

InternLM-XComposer2.5-OmniLive (IXC2.5-OL) is an open-source multimodal LLM-based system designed to deliver continuous, long-term, real-time video and audio interaction and reasoning. The approach centers on specialized, disentangled modules for streaming perception, memory compression, and query-driven reasoning, oriented towards simulating human-like cognition and overcoming limitations of classical MLLMs in open-world, persistent environments.

1. System Definition and Design Motivation

IXC2.5-OL is architected to enable persistent multimodal interaction: processing simultaneous video and audio streams, maintaining compact and actionable memory representations over protracted temporal windows, and providing query-driven reasoning grounded in both real-time and historically compressed context (Zhang et al., 12 Dec 2024). Prior large vision-language models (LVLMs) typically rely on sequence-to-sequence paradigms, in which all input is acquired before output is generated, impeding true simultaneous perception and inference. IXC2.5-OL breaks this paradigm with a modular architecture (streaming perception, long memory, and asynchronous reasoning) that allows ongoing perception, memory updating, and query response, closely simulating human continuous cognition.

Key system goals include:

  • Low-latency, real-time multimodal sensory input and semantic encoding
  • Efficient compression and integration of short-term observation into scalable long-term memory
  • Query-driven retrieval and reasoning that incorporates both instantaneous and aggregated historical context

2. Modular Architecture and Pipeline

IXC2.5-OL comprises three principal modules that operate concurrently and interact asynchronously:

Streaming Perception Module

  • Processes raw environments (video & audio) on-the-fly
  • Audio: Whisper-based encoder → high-dimensional feature vector → audio projector → small LLM (SLM, e.g., Qwen2-1.8B) for both ASR and environmental sound classification
  • Video: Frame sampling (e.g., 1 fps) → vision encoder (CLIP-L/14) → semantic patch features
  • Outputs are buffered for memory compression and instantaneous query triggers (a minimal threading sketch follows this list)
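
A minimal sketch of how this concurrent hand-off could be organized in Python, using toy encoder stubs in place of the actual Whisper/CLIP paths and placeholder input streams (all names here are illustrative, not the released implementation):

```python
import queue
import threading
import time

feature_buffer = queue.Queue()   # consumed by the memory module
query_trigger  = queue.Queue()   # consumed by the reasoning module

# Toy stand-ins for the Whisper-based audio path and the CLIP-L/14 video path.
def encode_audio_chunk(chunk):
    return ("what is on the table?", [0.0] * 8)   # (transcript, audio features)

def encode_video_frame(frame):
    return [0.0] * 8                               # semantic patch features

def audio_worker(audio_stream):
    for chunk in audio_stream:
        transcript, feats = encode_audio_chunk(chunk)
        feature_buffer.put(("audio", feats))
        if transcript:                   # speech detected -> candidate query
            query_trigger.put(transcript)

def video_worker(video_stream):
    for frame in video_stream:           # assume frames arrive pre-sampled at ~1 fps
        feature_buffer.put(("video", encode_video_frame(frame)))

# Each modality runs on its own thread so perception never blocks reasoning.
threading.Thread(target=audio_worker, args=(iter([b"chunk"]),), daemon=True).start()
threading.Thread(target=video_worker, args=(iter([b"frame"]),), daemon=True).start()
time.sleep(0.1)
print(feature_buffer.qsize(), "buffered feature packets")
```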

Multi-modal Long Memory Module

  • Receives and compresses outputs from perception
  • Two-level memory: short-term memory (within a video clip) holds down-sampled spatial features plus a global summary; long-term memory (across clips) holds aggregated, highly compressed representations
  • Compression functions are performed via an LLM-based “Compressor”:
    • $H_k, \hat{H}_k = \mathrm{Compressor}([F_k \circ H_k \circ \hat{H}_k])$
    • $\dot{H} = \mathrm{Compressor}([H_1 \circ H_2 \circ \ldots \circ H_k \circ \hat{H}_1 \circ \hat{H}_2 \circ \ldots \circ \hat{H}_k])$
    • Here, $F_k \in \mathbb{R}^{TN \times C}$ denotes the feature matrix for clip $k$, $H_k$ is the short-term memory, $\hat{H}_k$ is the global summary, and $\dot{H}$ aggregates the long-term memory.
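
The two compression steps can be pictured with a toy PyTorch sketch; the cross-attention "Compressor" below is a deliberate simplification of the paper's LLM-based compressor, and all dimensions are illustrative:

```python
import torch
import torch.nn as nn

class Compressor(nn.Module):
    """Illustrative stand-in for the LLM-based compressor: it maps a
    concatenated token sequence to a fixed number of memory tokens."""
    def __init__(self, dim=256, short_tokens=16, global_tokens=1):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(short_tokens + global_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.short_tokens = short_tokens

    def forward(self, seq):                        # seq: (1, L, dim)
        q = self.queries.unsqueeze(0)
        out, _ = self.attn(q, seq, seq)            # cross-attend into the sequence
        return out[:, :self.short_tokens], out[:, self.short_tokens:]

dim, T, N = 256, 8, 16                             # clip length T, N patches, C = dim
compressor = Compressor(dim)
F_k    = torch.randn(1, T * N, dim)                # clip features F_k in R^{TN x C}
H_k    = torch.zeros(1, 16, dim)                   # previous short-term memory
Hhat_k = torch.zeros(1, 1, dim)                    # previous global summary

# H_k, Hhat_k = Compressor([F_k . H_k . Hhat_k])
H_k, Hhat_k = compressor(torch.cat([F_k, H_k, Hhat_k], dim=1))

# Toy single-clip stand-in for the long-term step
# H_dot = Compressor([H_1 ... H_k, Hhat_1 ... Hhat_k]).
H_dot, _ = compressor(torch.cat([H_k, Hhat_k], dim=1))
print(H_k.shape, Hhat_k.shape, H_dot.shape)
```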

Reasoning Module

  • Accepts formatted user queries and coordinates with perception and memory modules
  • Leverages an enhanced InternLM-XComposer2.5-based model with a memory-projector to align retrieved memory features with the LLM input space
  • Performs instruction prediction filtering to ensure that only legitimate queries (not ambient noise) trigger response generation
  • On query, retrieves and integrates relevant historical memory and recent perception features, then generates a context-rich response
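
The gating and hand-off described above can be illustrated with a small sketch; the filter, retriever, and generator below are toy placeholders standing in for the instruction-prediction head, memory retrieval, and the IXC2.5-based LLM:

```python
import queue

# Illustrative placeholders (not the real models): an instruction filter,
# a retriever over compressed memory, and a response generator.
def classify_instruction(utterance: str) -> bool:
    return utterance.endswith("?")            # toy heuristic for the filter

def retrieve_memory(memory: list, utterance: str) -> list:
    return memory[-2:]                        # toy "most relevant clips"

def generate_response(utterance: str, clips: list) -> str:
    return f"Answer to '{utterance}' grounded in {len(clips)} retrieved clips"

def reasoning_step(query_trigger: queue.Queue, memory: list) -> None:
    utterance = query_trigger.get()           # produced by the perception module
    if not classify_instruction(utterance):   # instruction-prediction filter
        return                                # drop ambient noise / chit-chat
    clips = retrieve_memory(memory, utterance)
    print(generate_response(utterance, clips))

q = queue.Queue()
q.put("what did I put on the desk earlier?")
reasoning_step(q, memory=["clip_1", "clip_2", "clip_3"])
```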

High-Level Diagram

┌───────────────┐
│ Frontend      │
│ (Capture)     │
└─────┬─────────┘
      ▼
┌────────────────────┐
│Streaming Perception│
│ Audio Translation  │
│ Video Perception   │
└─────┬──────────────┘
      ▼
┌────────────────────────┐
│Multi-modal Long Memory │
│ Compression & Fusion   │
└─────┬──────────────────┘
      ▼
┌───────────────────────┐
│Reasoning Module       │
│ Query & Response      │
└───────────────────────┘

3. Streaming Perception: Algorithms and Implementation

The system decouples video and audio processing, retaining each modality’s semantic clarity while facilitating distributed computation:

Audio Translation:

  • Audio is encoded via Whisper and projected via an MLP into the SLM's token space
  • SLM performs ASR and audio event classification concurrently
  • Model is pre-trained for ASR (GigaSpeech, WenetSpeech) and fine-tuned for environmental event classification
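
A hedged sketch of this audio path using the public Whisper encoder from Hugging Face Transformers; the checkpoint choice (`openai/whisper-small`), the MLP projector, and the SLM hidden size are illustrative assumptions rather than the paper's exact configuration (running it downloads the Whisper weights):

```python
import torch
import torch.nn as nn
from transformers import WhisperProcessor, WhisperModel

# Whisper encoder turns raw audio into high-dimensional acoustic features.
processor = WhisperProcessor.from_pretrained("openai/whisper-small")
encoder = WhisperModel.from_pretrained("openai/whisper-small").encoder

audio = torch.zeros(16000)                       # 1 s of silence as a stand-in signal
inputs = processor(audio.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    acoustic = encoder(inputs.input_features).last_hidden_state   # (1, 1500, 768)

# Toy MLP projector mapping acoustic features into the SLM's token space;
# the real system feeds these embeddings to a small LLM that performs ASR
# and environmental sound classification.
slm_dim = 1536                                   # illustrative hidden size
projector = nn.Sequential(nn.Linear(acoustic.shape[-1], slm_dim),
                          nn.GELU(),
                          nn.Linear(slm_dim, slm_dim))
audio_tokens = projector(acoustic)               # prefix embeddings for the SLM
print(audio_tokens.shape)
```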

Video Perception:

  • Frame sampling provides temporal spread while managing resource constraints
  • CLIP-L/14 vision encoder outputs semantic patch tokens for each frame
  • Features are sent to the memory module for spatial downsampling and compression
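
A comparable sketch of the video path, using the public CLIP ViT-L/14 checkpoint from Hugging Face Transformers, dummy frames, and a naive 1 fps sampler; the sampling and buffering logic is illustrative only:

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

# CLIP-L/14 vision encoder, used here only to illustrate per-frame
# patch-feature extraction.
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
vision = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")

def sample_frames(frames, fps_in=30, fps_out=1):
    """Keep roughly one frame per second from a higher-rate stream."""
    step = max(fps_in // fps_out, 1)
    return frames[::step]

frames = [Image.new("RGB", (336, 336)) for _ in range(60)]   # dummy 2 s @ 30 fps
clip_features = []
with torch.no_grad():
    for frame in sample_frames(frames):
        pixel = processor(images=frame, return_tensors="pt").pixel_values
        clip_features.append(vision(pixel).last_hidden_state)  # (1, 257, 1024) per frame
print(len(clip_features), "frames encoded for the memory module")
```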

This structure enables real-time, parallel sensory encoding with scope for independent scaling and optimization per channel.

4. Multi-modal Long Memory: Compression and Retrieval Mechanisms

Addressing the impractical cost of storing all sequence data for long-term interactions, IXC2.5-OL uses an LLM-based compressor as follows:

  • Video clips are represented as feature matrices $F_k$ and compressed into short-term memory $H_k$, with an additional global summary vector $\hat{H}_k$
  • Multiple clips' short-term memories and global summaries are compressed to form the long-term memory $\dot{H}$
  • Upon query receipt, the query is encoded and concatenated with the long-term memory representations; similarity scores between the query embedding and each global memory vector $\hat{H}_k$ enable rapid identification and retrieval of the most relevant clips for reasoning
  • This process makes the storage and retrieval of relevant historical context feasible over protracted continuous sessions
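
A toy illustration of the similarity-based retrieval step; the dimensions, random tensors, and top-k cutoff are placeholders:

```python
import torch
import torch.nn.functional as F

# Rank clips by cosine similarity between the encoded query and each
# global clip summary, then keep the top matches for reasoning.
dim, num_clips = 256, 10
global_summaries = torch.randn(num_clips, dim)        # one global summary per clip
query_embedding = torch.randn(1, dim)                 # encoded user question

scores = F.cosine_similarity(query_embedding, global_summaries)   # (num_clips,)
topk = torch.topk(scores, k=3).indices                # most relevant clips
print("retrieve clips:", topk.tolist())
```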

5. Reasoning and Adaptive Query Response

The Reasoning Module processes queries in a structured format integrating the user's question, retrieved memory, and related clip information. Signal quality is preserved via instruction prediction, preventing spurious activations triggered by environmental noise. Coordinated hand-offs between modules ensure the model grounds its outputs in both immediate reality and relevant historical memory.

Query Input Format:

  • "Question: <|Que|>"
  • Referenced video information: "<|Img|>"
  • Retrieved memory: "<|Mem|>"

This format aligns with experimental protocols for grounding reasoning in both spatial and temporal context.
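
A minimal illustration of how such a prompt might be assembled; the exact token layout and counts in the released system may differ, so treat this as a hedged sketch:

```python
def build_query(question: str, num_img_tokens: int, num_mem_tokens: int) -> str:
    """Assemble the reasoning prompt in the format sketched above.
    The placeholder tokens are later replaced by projected image and
    memory embeddings; token counts here are illustrative."""
    return ("Question: " + question + "<|Que|>"
            + "<|Img|>" * num_img_tokens      # referenced video clip features
            + "<|Mem|>" * num_mem_tokens)     # retrieved long-term memory

print(build_query("What did the speaker write on the whiteboard?", 4, 2))
```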

6. Innovations, Technical Challenges, and Model Comparison

IXC2.5-OL’s primary technical innovations include:

  • Disentangled workflow: streaming perception, compressed memory, adaptive reasoning, in contrast to monolithic sequence models
  • Real-time multimodal integration: concurrent processing through dedicated threads for each modality
  • Memory-efficient compression: LLM-based compressor mitigates prohibitive long-context cost
  • Robust query filtering: instruction prediction mechanism ensures only valid user input triggers reasoning

These innovations shift the paradigm from static, context-limited input-output exchanges toward continuous, context-rich cognition. Compared to prior InternLM-XComposer releases (Zhang et al., 2023, Dong et al., 29 Jan 2024, Zhang et al., 3 Jul 2024), IXC2.5-OL extends functionality from free-form interleaved text-image tasks and ultra-long-context support to persistent real-time multimodal interaction and memory.

7. Applications and Future Directions

Practical scenarios include:

  • Persistent AI assistants (smart home, surveillance) maintaining contextual relevance and historical continuity
  • Real-time multimodal interfaces for events, education, or collaborative environments
  • Advanced multimedia search, retrieval, and question answering from extensive video streams
  • Testbed for AI systems simulating human-like cognitive architectures—perception, memory, and reasoning as dynamically interacting yet specialized subsystems

Prospective research efforts include extending "long-context video understanding"—enabling entire movie-level analysis—and improving multi-turn, multi-image dialogue retention. The open-source ethos (Zhang et al., 12 Dec 2024) facilitates broad adoption, adaptation, and continued progression towards more adaptive, context-aware multimodal AI systems.


In summary, InternLM-XComposer2.5-OmniLive introduces a modular, real-time approach to streaming multimodal perception, memory compression, and query-driven reasoning, marking a substantial advance in continuous interactive AI and efficient long-term multimodal cognition.
