
InteractiveOmni: Unified Omni-modal AI

Updated 1 January 2026
  • InteractiveOmni is a unified omni-modal system that integrates audio, visual, and text streams to enable real-time, multi-turn interactive dialogue.
  • It utilizes a multi-stage pre-training strategy with contrastive and autoregressive losses to achieve robust cross-modal alignment and efficient streaming inference.
  • The architecture combines modular encoders and decoders, delivering competitive performance on benchmarks for vision, audio, and speech tasks.

InteractiveOmni is a class of unified omni-modal models and systems that achieve concurrent audio-visual-text understanding, long-horizon multi-turn memory, and real-time speech interaction within a single network. This category encompasses both foundational models—such as those described in "InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue"—and system-level engineering pipelines, including inference-optimized MLLM frameworks and interactive agent stacks for real-world multimodal environments. Representative implementations combine large-scale pre-training over joint audio-visual-text data, contrastively aligned cross-modal encoders, and autoregressive decoders capable of generating both text and speech, yielding highly interactive, open-source foundations that are competitive with larger proprietary baselines in key multimodal domains (Tong et al., 15 Oct 2025, Li et al., 2024).

1. Architectural Composition and Modality Fusion

The canonical InteractiveOmni system integrates four principal modules: a vision encoder (e.g., InternViT-300M), an audio encoder (e.g., Whisper-large-v3), an autoregressive LLM decoder (Qwen3-4B or Qwen3-8B), and a streaming speech decoder (CosyVoice2). Visual, acoustic, and textual signals are projected into a shared embedding space and concatenated as a unified input sequence to the decoder. Each decoder forward pass can emit interleaved outputs—text tokens and, in speech turns, discrete speech tokens interpreted by the separate speech-token LLM and ultimately synthesized by a neural vocoder. This tight integration allows concurrent understanding and generation of text and speech within a multi-turn interactive dialogue context (Tong et al., 15 Oct 2025).

A schematic flow is as follows:

  • Visual input $x_v \rightarrow \mathrm{E}_v(x_v)$
  • Audio input $x_a \rightarrow \mathrm{E}_a(x_a)$
  • Text prompt $x_t \rightarrow$ token embeddings
  • Concatenate $[\mathrm{V\text{-}features};\,\mathrm{A\text{-}features};\,T_\mathrm{input}] \rightarrow$ LLM $\rightarrow$ (text tokens $\mid$ interleaved speech tokens) $\rightarrow$ CosyVoice2 $\rightarrow$ speech waveform

The end-to-end stack supports variable-length inputs, large memory buffers (up to 32k tokens), and streaming for low-latency applications.
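
As a rough illustration of this fusion pattern, the following minimal PyTorch sketch projects dummy vision and audio features plus text embeddings into a shared space, concatenates them into one sequence, and runs a toy causal decoder. All dimensions, module choices, and the joint text-plus-speech vocabulary are illustrative assumptions, not the released architecture.

```python
# Minimal sketch of the modality-fusion flow (illustrative only): dummy projections
# and a toy decoder stand in for InternViT, Whisper, Qwen3, and CosyVoice2.
import torch
import torch.nn as nn

D_MODEL = 512   # assumed shared embedding width
VOCAB = 32000   # assumed joint text + discrete-speech-token vocabulary

class OmniFusionSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # Placeholder projections; a real system would use pretrained encoders.
        self.vision_proj = nn.Linear(1024, D_MODEL)   # project vision features
        self.audio_proj = nn.Linear(1280, D_MODEL)    # project audio features
        self.text_embed = nn.Embedding(VOCAB, D_MODEL)
        # Toy autoregressive decoder standing in for the LLM backbone.
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(D_MODEL, VOCAB)      # emits text or speech tokens

    def forward(self, vision_feats, audio_feats, text_ids):
        v = self.vision_proj(vision_feats)            # [B, Tv, D]
        a = self.audio_proj(audio_feats)              # [B, Ta, D]
        t = self.text_embed(text_ids)                 # [B, Tt, D]
        seq = torch.cat([v, a, t], dim=1)             # unified input sequence
        n = seq.size(1)
        causal = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)
        h = self.decoder(seq, mask=causal)
        return self.lm_head(h)                        # interleaved token logits

model = OmniFusionSketch()
logits = model(torch.randn(1, 16, 1024),              # fake vision tokens
               torch.randn(1, 50, 1280),              # fake audio frames
               torch.randint(0, VOCAB, (1, 8)))       # fake text prompt
print(logits.shape)                                   # torch.Size([1, 74, 32000])
```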

2. Multi-Stage Pre-training and Instruction Tuning

InteractiveOmni establishes robust cross-modal alignment via a multi-stage strategy:

  • Stage 1: Vision-Text Pretraining: Learn semantic alignment between image/video representations and language, initializing from a large vision transformer (Ev) and text backbone (LLM).
  • Stage 2: Audio-Text Pretraining: Train the audio encoder (Ea) to align audio features with text, leveraging large-scale ASR and audio QA datasets.
  • Stage 3: Mixed Modality Packing: Fine-tune with interleaved (audio, visual, text) input and combined contrastive plus autoregressive loss, ensuring joint fusion.
  • Stage 4: Instruction and Dialogue Fine-tuning: Use instruction-tuned datasets for speech-to-text QA, image+speech-to-text QA, and speech-to-speech dialogue. All encoder and decoder components remain trainable. Post-training tasks include hard-sample mining, direct preference optimization (DPO), and model merging for optimal multi-modal capability.

Data packing during these stages enables handling of long-form sequences and multi-turn dependencies. Such strategies are critical to achieving human-like conversational ability and cross-modal reasoning depth (Tong et al., 15 Oct 2025, Li et al., 2024).
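
As a hedged sketch of what the combined contrastive-plus-autoregressive objective in Stage 3 might look like, the snippet below pairs a symmetric InfoNCE alignment term with a standard next-token loss; the pooling, temperature, and weighting are assumptions rather than the paper's exact recipe.

```python
# Illustrative combined objective: InfoNCE-style cross-modal contrastive loss
# plus a next-token autoregressive loss. Temperature and weights are assumptions.
import torch
import torch.nn.functional as F

def contrastive_loss(modal_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over pooled per-sample embeddings of shape [B, D]."""
    m = F.normalize(modal_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = m @ t.T / temperature                    # [B, B] similarity matrix
    targets = torch.arange(m.size(0), device=m.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

def autoregressive_loss(token_logits, labels):
    """Standard next-token prediction; labels use -100 for ignored positions."""
    return F.cross_entropy(token_logits.view(-1, token_logits.size(-1)),
                           labels.view(-1), ignore_index=-100)

def joint_loss(modal_emb, text_emb, token_logits, labels, alpha=0.5):
    # alpha balances alignment vs. generation; the true weighting is unknown.
    return alpha * contrastive_loss(modal_emb, text_emb) + \
           (1 - alpha) * autoregressive_loss(token_logits, labels)
```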

3. Multi-Turn Memory and Benchmarking

InteractiveOmni models are explicitly evaluated on constructed multi-modal benchmarks that test long-horizon memory, multi-step reasoning, and multi-turn speech interaction.

  • Multi-Modal Multi-Turn Memory Benchmark (MMMB): Consists of 300 dialog groups (up to 15 turns each), systematically probing text-only, image-only, and mixed memory dependencies. Only the final turn is scored, requiring the model to recall historical states from potentially distant context turns.
  • Multi-Turn Speech Interaction Benchmark (MSIB): Contains 244 spoken dialogs with 2–10 turns per dialog, scoring both content and speech quality across conversational, emotional, prosodic, creative, and instruction-following axes.

Performance metrics include final-turn accuracy, mean opinion score (MOS), and automated LLM-based judging. Notably, InteractiveOmni-4B and 8B match or exceed Qwen2.5-Omni-7B, with text recall at 70–73% and image recall at 30–40% in MMMB; MSIB content and speech scores reach 4.05 out of 5, leading among comparably sized open models (Tong et al., 15 Oct 2025).
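
A schematic evaluation loop for final-turn scoring of the kind MMMB applies is sketched below; the data schema, model interface, and judge function are hypothetical placeholders, not the released benchmark harness.

```python
# Hypothetical final-turn evaluation loop: feed the full multi-turn history,
# score only the model's answer to the final turn. Schema and hooks are assumed.
from typing import Callable, Dict, List

def evaluate_final_turn(dialog_groups: List[Dict],
                        generate: Callable[[List[Dict]], str],
                        judge: Callable[[str, str], bool]) -> float:
    """dialog_groups: [{"turns": [...], "final_question": ..., "reference": ...}]"""
    correct = 0
    for group in dialog_groups:
        history = group["turns"]                      # text/image/audio turns
        question = group["final_question"]            # only this turn is scored
        answer = generate(history + [question])       # model sees full history
        correct += judge(answer, group["reference"])  # exact-match or LLM judge
    return correct / len(dialog_groups)
```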

4. Streaming Inference and Real-Time Interaction

InteractiveOmni systems implement real-time, low-latency streaming over audio and video in live dialogue. Key mechanisms include:

  • Lightweight audio boundary detection for precise speech segmentation.
  • Frame-level streaming: On-the-fly encoding of audio/video frames, continuously feeding the LLM context buffer.
  • Parallel batch processing to minimize system overhead; observed latency is 120 ms for frame+audio-to-token generation on a single modern GPU (Li et al., 2024).
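
A minimal sketch of the frame-level streaming loop described in the list above follows; the boundary detector, frame encoder, and decoding hook are hypothetical placeholders standing in for the system's actual components.

```python
# Sketch of a frame-level streaming loop: encode incoming frames on the fly,
# append them to the decoder's context buffer, and trigger incremental decoding
# at detected speech boundaries. All hooks here are hypothetical.
import queue

def streaming_loop(frame_queue: "queue.Queue", boundary_detector, encode_frame,
                   llm_step, max_context: int = 32_000):
    context = []                                  # rolling token/feature buffer
    while True:
        frame = frame_queue.get()                 # audio chunk or video frame
        if frame is None:                         # sentinel: stream closed
            break
        context.extend(encode_frame(frame))       # on-the-fly encoding
        context = context[-max_context:]          # respect the context window
        if boundary_detector(frame):              # end of a user utterance
            for token in llm_step(context):       # incremental decoding
                yield token                       # stream tokens to TTS/client
```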

API endpoints are designed for live applications: /send_audio_chunk for ongoing speech, /send_image_frame for vision updates, and /generate_response for incremental decoding. This architecture underpins use cases in live video Q&A, audio-driven image editing, and multimodal meeting assistance.
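
If these endpoints are exposed over HTTP, a client interaction might look like the sketch below; the host, payload fields, and response format are assumptions for illustration only.

```python
# Hypothetical HTTP client for the streaming endpoints listed above.
# Host, payload schema, and response fields are illustrative assumptions.
import base64
import requests

BASE = "http://localhost:8000"                       # assumed local deployment

def send_audio_chunk(pcm_bytes: bytes) -> None:
    requests.post(f"{BASE}/send_audio_chunk",
                  json={"audio": base64.b64encode(pcm_bytes).decode()})

def send_image_frame(jpeg_bytes: bytes) -> None:
    requests.post(f"{BASE}/send_image_frame",
                  json={"frame": base64.b64encode(jpeg_bytes).decode()})

def generate_response() -> str:
    # Incremental decoding endpoint; here we simply read one full reply.
    return requests.post(f"{BASE}/generate_response").json().get("text", "")
```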

Hardware specifications for deployment range from edge (quantized 7B-8B models on 20 GB GPU RAM) to full-precision inference on ≥40 GB GPUs.
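
For the edge end of that range, 4-bit loading via Hugging Face transformers and bitsandbytes is one common pattern, sketched below; the model identifier is a placeholder, and the actual InteractiveOmni loading path (custom model classes, multimodal processors) may differ.

```python
# Generic 4-bit quantized loading with transformers + bitsandbytes.
# The model id is a placeholder; the real checkpoint may require custom code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "org/interactiveomni-8b"      # hypothetical identifier
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",                   # fits quantized weights on ~20 GB GPUs
)
```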

5. Empirical Performance and Comparative Results

InteractiveOmni achieves state-of-the-art or near-SOTA results on diverse multimodal tasks across domains, with the following summary:

| Task | InteractiveOmni-4B | InteractiveOmni-8B | Qwen2.5-Omni-7B | Baseline models |
|---|---|---|---|---|
| MMBench-V1.1 (Vision) | 68.6% | 73.2% | 69.5% | InternVL3.5-8B: 74.7% |
| Video-MME (Video) | 64.4% | 67.1% | 64.5% | InternVL3.5-8B: 66.7% |
| LibriSpeech test-other (Audio, WER, lower is better) | 3.69 | 3.41 | 3.40 | WenetSpeech: 5.04–5.90 |
| MMAU/AIR/MELD/ClothoAQA (Audio) | 72.0/6.6/57.2 | 67.4/6.5/57.6 | 65.6/6.9/57.0 | – |
| OmniBench (Speech/Sound/Music) | 59.2 | 60.3 | 56.1 | – |
| Speech2Text QA (OpenAudioBench) | 69.1 | 72.7 | 66.3 | – |

InteractiveOmni-4B retains roughly 97% of the 8B model's performance, offering a substantially smaller footprint than larger proprietary baselines while achieving near-SOTA or SOTA results across a wide range of image, audio, video, and spoken-dialogue tasks (Tong et al., 15 Oct 2025).
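
For reference, the word error rate reported for LibriSpeech above is an edit-distance metric over reference words; a minimal check with the jiwer package (one common tool, not necessarily the paper's) is shown below on made-up utterances.

```python
# Minimal WER computation with the jiwer package (pip install jiwer).
# The utterances below are made-up examples, not LibriSpeech data.
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"
# 2 substitutions over 9 reference words -> approximately 0.222
print(f"WER: {jiwer.wer(reference, hypothesis):.3f}")
```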

6. Design Implications and Emerging Research Recommendations

InteractiveOmni research highlights the importance of:

  • Unified modality representation and fusion via shared decoder space, with explicit cross-modal contrastive alignment during pre-training.
  • Multi-turn data construction that systematically varies historical depth and modal dependency, essential to evaluate real-world conversational agents.
  • Modular system architecture supporting streaming, interleaved token sequences, and parallel decoding for real-time deployments in both cloud and edge settings.
  • Instruction-tuned and preference-optimized fine-tuning regimes (notably DPO; see the sketch after this list) to yield high-quality, emotionally responsive speech synthesis.
  • Systematic benchmarking on multi-turn memory and speech interaction tasks to drive diagnostic analysis and ablation-based improvements.
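
The DPO objective referenced above is, in its standard form, a logistic loss over the margin between chosen and rejected responses relative to a frozen reference model; the snippet below gives that generic formulation rather than InteractiveOmni's exact post-training recipe.

```python
# Standard DPO loss over per-sequence log-probabilities (generic formulation).
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """All inputs are summed log-probs of full responses, shape [B]."""
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    # Push the policy to prefer chosen over rejected relative to the reference.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy usage with random log-probabilities for a batch of 4 preference pairs.
batch = torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4)
print(dpo_loss(*batch))
```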

Ongoing challenges include scaling to longer context windows, tighter audio-visual integration for grounded dialogue, and robust handling of multi-party, safety-critical, or embodied scenarios (Wang et al., 29 Mar 2025, Li et al., 2024).

7. Impact, Applications, and Outlook

InteractiveOmni represents a converged research direction in omni-modal LLMs. Released models and systems have enabled:

  • Open-source, low-latency multimodal conversational agents competitive with much larger proprietary systems.
  • Strong empirical results in image understanding (MMBench, MathVista), audio/video QA, speech recognition and TTS, and complex multi-turn dialogue.
  • Versatile deployments, from assistive technologies for the visually impaired to real-time video Q&A and audio-driven editing tools.

By combining large-scale pre-training, modularized multi-modal encoding, and advanced streaming decoding, InteractiveOmni delivers a foundation for next-generation research in interactive AI. Continued advances in cross-modal optimization, efficient inference, and comprehensive benchmarking are expected to further narrow the gap between open-source and proprietary systems, with broad implications for research in embodied AI, situated dialogue, and real-world human–computer interaction (Tong et al., 15 Oct 2025, Li et al., 2024, Wang et al., 29 Mar 2025).
