- The paper introduces a full-duplex omni-modal framework that enables real-time, proactive integration of vision, audio, speech, and text.
- It leverages unified serialization and time-aligned interleaving (TAIL) to maintain continuous context and smooth output generation.
- Empirical evaluations show competitive performance on multimodal benchmarks, and the model runs on edge hardware with less than 12GB of RAM at over 200 tokens/s.
MiniCPM-o 4.5: Toward Real-Time Full-Duplex Omni-Modal Interaction
Introduction and Motivation
MiniCPM-o 4.5 (2604.27393) addresses critical limitations in state-of-the-art multimodal LLMs (MLLMs), focusing on the interaction paradigm rather than just modality coverage or raw inference latency. Conventional MLLMs are primarily turn-based, with perception (input understanding) and generation (output response) occurring in serialized, alternating phases. This separation impedes the model's ability to continuously update its outputs with new context and prohibits proactive, context-driven behaviors, which are characteristic of human-like interaction.
MiniCPM-o 4.5 introduces a real-time, full-duplex, omni-modal interaction framework that enables simultaneous “seeing”, “listening”, and “speaking”: it not only reacts to explicit user input but also proactively produces outputs informed by ongoing, dynamic environmental context. The technical core is the Omni-Flow framework, which aligns multimodal inputs and outputs along a shared temporal axis, converting traditional alternating interaction into a continuous, time-synchronized process.
Architectural Innovations
The architecture of MiniCPM-o 4.5 comprises three principal modules:
- Multimodal Encoders: Visual input processing is handled via a SigLIP ViT-based encoder with a resampler for image compression, supporting both high-resolution processing (up to 2240×2240) in turn-based mode and efficient 448×448 resolution in streaming mode. Audio is encoded using a chunk-based Whisper Medium configuration, compressed to maintain manageable sequence lengths for the LLM backbone.
- LLM Backbone: Built on Qwen3-8B, this backbone provides omni-modal reasoning, text generation, and context representation. Speech token generation is not handled directly by the backbone to avoid efficiency bottlenecks and linguistic degradation; instead, lightweight speech decoders process contextualized outputs into speech tokens and ultimately audio waveforms.
- Speech Decoders: An interleaved speech token decoder produces discrete S3 tokens, which a streaming flow-matching decoder converts into audio. Contextual and prosodic control is delegated to the backbone, letting the speech decoder focus on high-fidelity waveform reconstruction.
A notable design principle is complete end-to-end differentiability, with all components connected at the token level, enabling gradient flow across the entire omni-modal pipeline during training.
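To make the gap between the two vision modes concrete, the short calculation below compares the pixel budgets of the quoted resolutions. It is plain arithmetic on the numbers stated above; the function and constant names are illustrative, and nothing here describes how the encoder actually slices or compresses images.

```python
# Resolutions quoted in the text for the two vision modes.
TURN_BASED_SIDE = 2240  # maximum side length in turn-based mode
STREAMING_SIDE = 448    # side length in streaming mode


def pixel_count(side: int) -> int:
    """Number of pixels in a square image with the given side length."""
    return side * side


def pixel_ratio(high_side: int, low_side: int) -> float:
    """How many times more pixels the high-resolution image contains."""
    return pixel_count(high_side) / pixel_count(low_side)


if __name__ == "__main__":
    ratio = pixel_ratio(TURN_BASED_SIDE, STREAMING_SIDE)
    print(f"pixel ratio: {ratio:.0f}x")  # prints "pixel ratio: 25x"
```

A 25-fold difference in raw pixels per frame suggests why the streaming mode uses the lower resolution: every frame must be encoded inside a short time chunk.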
Omni-Flow: Full-Duplex, Proactive Streaming
Omni-Flow fundamentally redefines the serialization of multimodal interaction. By partitioning the continual, real-world signal into fine-grained time chunks, it enables the model to process new perceptual input and generate new output in closely aligned intervals. User queries become part of the continuous environment context, and the assistant can initiate outputs based on incremental observations, not merely explicit requests.
Key mechanisms within Omni-Flow include:
- Time-Aligned Interleaving (TAIL): Maintains tight temporal coupling between generated speech and evolving context using a variable-length, chunk-wise interleaving schedule. This overcomes the misalignment seen in fixed-ratio or lagging speech synthesis.
- Unified Serialization: All input and output streams (visual, audio, speech, text) are serialized with explicit temporal boundaries, making time in the world explicit as a modeling axis. This enables proactive output and timely adjustment to new context.
Design tradeoffs—such as temporal granularity (chunk size), boundary explicitness, and separation of interaction control from content generation—are systematically ablated and optimized, with explicit boundary tokens and Listen-Speak decoupling yielding the most stable performance.
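The serialization idea can be illustrated with a minimal sketch. It assumes a simple layout in which every time chunk is wrapped in explicit boundary tokens, and a separate control token records whether the assistant listens or speaks in that chunk. The token strings (`<chunk>`, `</chunk>`, `<listen>`, `<speak>`), the `TimeChunk` class, and the ordering of streams inside a chunk are all hypothetical choices made for illustration, not the format used by MiniCPM-o 4.5.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical boundary and control tokens (not taken from the paper).
CHUNK_START = "<chunk>"
CHUNK_END = "</chunk>"
LISTEN = "<listen>"
SPEAK = "<speak>"


@dataclass
class TimeChunk:
    """Everything perceived and produced within one time slice."""
    index: int
    vision: List[str] = field(default_factory=list)      # visual tokens
    audio: List[str] = field(default_factory=list)       # input audio tokens
    text_out: List[str] = field(default_factory=list)    # generated text tokens
    speech_out: List[str] = field(default_factory=list)  # generated speech tokens


def serialize(chunks: List[TimeChunk]) -> List[str]:
    """Flatten time chunks into one token sequence with explicit boundaries."""
    sequence: List[str] = []
    for chunk in chunks:
        sequence.append(CHUNK_START)
        # Perception for this time slice comes first.
        sequence.extend(chunk.vision)
        sequence.extend(chunk.audio)
        # The control decision (listen or speak) is a token of its own,
        # kept separate from the generated content.
        if chunk.text_out or chunk.speech_out:
            sequence.append(SPEAK)
            # Output length may vary from chunk to chunk.
            sequence.extend(chunk.text_out)
            sequence.extend(chunk.speech_out)
        else:
            sequence.append(LISTEN)
        sequence.append(CHUNK_END)
    return sequence


if __name__ == "__main__":
    stream = [
        TimeChunk(0, vision=["v0"], audio=["a0"]),
        TimeChunk(1, vision=["v1"], audio=["a1"],
                  text_out=["hi"], speech_out=["s0", "s1"]),
    ]
    print(" ".join(serialize(stream)))
```

In this toy layout the assistant stays silent during the first chunk and starts speaking in the second, without waiting for the input stream to end. That is the behavior the full-duplex framing is meant to allow.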
Training Methodology
MiniCPM-o 4.5 employs a multistage training pipeline:
- Speech Pretraining: Adapts the audio pathway with Whisper and aligns it to the LLM backbone, initializing speech components while freezing pretrained vision and language backbones.
- Joint Pretraining: Simultaneously exposes the model to balanced mixtures of vision-language, speech, and full-duplex omni-modal data; unified next-token prediction loss is used across all modalities and streams.
- Supervised Fine-Tuning: Large-scale instruction tuning and human-annotated scenarios further refine omni-modal behavior and instruction following.
- Reinforcement Learning: The RL stage uses GRPO together with a new smooth length reward that balances response length against informativeness, and applies RLAIF-V for hallucination mitigation.
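As a rough illustration of the reinforcement learning stage, the sketch below shows the group-relative advantage normalization that GRPO is generally described with, combined with one possible smooth length reward. The length reward shown here (flat up to a target length, then a Gaussian-shaped decay) is an assumption invented for this example, since the summary does not give the actual formula. The `target`, `tolerance`, and `weight` parameters are likewise hypothetical.

```python
import math
from statistics import mean, pstdev
from typing import List


def smooth_length_reward(length: int, target: int, tolerance: float) -> float:
    """Hypothetical smooth length reward (not the paper's formula).

    Returns 1.0 for responses up to `target` tokens, then decays smoothly
    instead of dropping to zero at a hard cutoff.
    """
    if length <= target:
        return 1.0
    excess = (length - target) / tolerance
    return math.exp(-0.5 * excess * excess)


def combined_reward(quality: float, length: int, target: int,
                    tolerance: float, weight: float = 0.5) -> float:
    """Quality score plus a weighted length term (the weighting is assumed)."""
    return quality + weight * smooth_length_reward(length, target, tolerance)


def grpo_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled responses to the same prompt: (quality score, token length).
    group = [(0.9, 150), (0.9, 400), (0.4, 120), (0.7, 220)]
    rewards = [combined_reward(q, n, target=200, tolerance=50.0) for q, n in group]
    for (q, n), adv in zip(group, grpo_advantages(rewards)):
        print(f"quality={q:.1f} length={n:3d} advantage={adv:+.2f}")
```

Under these assumptions, two responses of equal quality receive different advantages when one of them runs far past the target length, which is the kind of pressure a length reward is meant to apply.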
Datasets span large-scale, unlabeled audio for pretraining, curated vision-language and video data, and new full-duplex omni-modal samples with per-token time annotations.
Empirical Evaluation and Results
Vision-Language Understanding
MiniCPM-o 4.5 demonstrates competitive or superior performance for its 9B parameter scale, achieving average scores of 77.6 and 78.2 on OpenCompass in instruct and thinking modes, respectively. It surpasses both similarly sized models (InternVL3.5-8B, Qwen3-VL-8B) and the larger Qwen3-Omni-30B-A3B in several domains. It is particularly strong on document, OCR, and multi-image understanding, outperforming Qwen3-Omni-30B-A3B on OmniDocBench, Mantis-Eval, and MMSI-Bench.
Speech Understanding and Generation
The model achieves the lowest CER/WER on SeedTTS benchmarks for both Chinese and English, indicating robust cross-lingual TTS capability. Long-form generation is also stronger, with significantly lower English WER on LongTTS than the baselines. Emotion and style control, evaluated on Expresso and ESD, also improve, suggesting that context integration via Omni-Flow is beneficial.
Full-Duplex Omni-Modal Streaming
On LiveSports-3K-CC, the leading benchmark for real-time full-duplex vision-only interaction, MiniCPM-o 4.5 achieves a win rate of 54.4—outperforming LiveCC and Streaming VLM by substantial margins. In all tested omni-modal streaming benchmarks, it is competitive with or superior to proprietary models, despite its smaller size.
Efficiency
MiniCPM-o 4.5 is optimized for edge deployment, requiring less than 12GB RAM for inference and achieving >200 tokens/s with INT4 quantization on commodity GPUs. The custom llama.cpp-omni framework ensures sub-realtime inference with a reduced memory footprint, supporting practical deployment on a wide spectrum of hardware.
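A back-of-the-envelope estimate helps show why INT4 quantization matters for the 12GB budget. The sketch below assumes, for simplicity, that all 9B parameters are stored at a uniform bit width, and it ignores quantization scales, the KV cache, activations, and runtime overhead. It uses decimal gigabytes (1 GB = 10^9 bytes). These simplifications are assumptions of the example and are not measurements from the paper.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in decimal GB at a uniform bit width."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    params = 9e9      # 9B parameters, as stated in the text
    budget_gb = 12.0  # RAM budget quoted in the text

    for bits in (16, 8, 4):
        size = weight_memory_gb(params, bits)
        verdict = "fits" if size < budget_gb else "exceeds"
        print(f"{bits:2d}-bit weights: {size:4.1f} GB ({verdict} the 12 GB budget)")
```

Under these assumptions, 16-bit weights alone would need about 18 GB, while 4-bit weights need about 4.5 GB. The remaining budget is then available for the KV cache, encoders, and activations.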
Limitations and Future Directions
The model still exhibits instability in long-horizon context tracking, occasional speech mispronunciation, and rare cross-lingual code-mixing in speech. Its proactive behaviors are relatively simplistic, and richer agentic capabilities (such as context-sensitive initiative, task planning, or more sophisticated multimodal attention routing) remain for future exploration. Benchmark availability for comprehensive, real-time, full-duplex, omni-modal evaluation is limited, and further advances in data curation and synthetic scenario construction will be essential.
Implications and Outlook
The MiniCPM-o 4.5 model, with its unified, temporally aligned, end-to-end architecture and the Omni-Flow paradigm, is a clear step toward interactive, always-on AI agents that follow the cues of human interaction. Its open-source release at the 9B scale, with practical edge deployment, makes advanced MLLMs more accessible. The shift to full-duplex, proactive, omni-modal modeling points toward future AI systems in which real-time, context-driven interaction and initiative are foundational requirements, especially in mobile, robotics, ambient computing, and assistive technology applications.
Developing more robust, contextually aware, and self-initiating behaviors—perhaps through even tighter sensorimotor grounding, agentic planning, and offline-to-online learning transfer—constitutes the natural next step. Additionally, as edge deployment becomes mainstream, further architectural efficiency and quantization strategies will likely grow in importance, along with hardware-software co-design for latency-constrained scenarios.
Conclusion
MiniCPM-o 4.5 establishes a new interaction paradigm for MLLMs, demonstrating that full-duplex, temporally aligned, proactive omni-modal interaction is achievable at open-source, edge-friendly scales. By aligning vision, audio, speech, and text processing and output along a shared, real-time axis, the model closes gaps in dynamic, human-like interaction that previous turn-based models could not address. Its empirical results, architectural efficiency, and deployment versatility position it as a reference model for the future development of human-level interactive AI.