- The paper introduces the Thinker-Talker architecture and TMRoPE, enabling synchronized real-time text and speech generation.
- It pairs a text tokenizer with specialized audio and vision encoders, achieving state-of-the-art multimodal performance on OmniBench.
- A three-phase training recipe over extensive multimodal datasets supports coherent, low-latency streaming responses.
Introduction
The "Qwen2.5-Omni Technical Report" delineates the development of Qwen2.5-Omni, an advanced multimodal model capable of integrating text, audio, images, and video inputs. The model is designed to deliver both real-time text and speech outputs, enhancing human-computer interaction through seamless integration of multimodal information. This document explores the architectural innovations, encoding strategies, and the performance benchmarks of Qwen2.5-Omni.
Model Architecture and Design
Thinker-Talker Architecture
Qwen2.5-Omni employs the Thinker-Talker architecture (Figure 1): the Thinker, a large language model, is responsible for text generation, while the Talker generates streaming speech tokens conditioned directly on the Thinker's high-level hidden representations. This separation lets text and spoken outputs be produced efficiently while keeping them synchronized, as illustrated in the sketch after Figure 1.
Figure 1: The overview of Qwen2.5-Omni. Qwen2.5-Omni adopts the Thinker-Talker architecture. Thinker is tasked with text generation, while Talker focuses on generating streaming speech tokens by receiving high-level representations directly from Thinker.
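The snippet below is a minimal sketch of the Thinker-Talker split under stated assumptions: the class names, layer sizes, and vocabularies are hypothetical and not the released implementation. It only illustrates the data flow the report describes, with the Talker consuming the Thinker's hidden states rather than its decoded text.

```python
# Minimal sketch of the Thinker-Talker split (hypothetical module names and sizes).
# The Thinker produces text-token logits and exposes its hidden states; the Talker
# consumes those hidden states to produce streaming speech-token logits.
import torch
import torch.nn as nn

class Thinker(nn.Module):
    def __init__(self, vocab=32000, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, input_ids):
        hidden = self.backbone(self.embed(input_ids))   # high-level representations
        logits = self.lm_head(hidden)                   # text-token logits
        return logits, hidden

class Talker(nn.Module):
    def __init__(self, speech_vocab=8192, dim=256):
        super().__init__()
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
        self.speech_head = nn.Linear(dim, speech_vocab)

    def forward(self, thinker_hidden):
        # Conditions on the Thinker's hidden states rather than its decoded text,
        # so information useful for prosody and timing is not lost to tokenization.
        return self.speech_head(self.decoder(thinker_hidden))

text_ids = torch.randint(0, 32000, (1, 16))
thinker, talker = Thinker(), Talker()
text_logits, hidden = thinker(text_ids)
speech_logits = talker(hidden)                          # streaming speech-token logits
```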
Positional Encoding with TMRoPE
To synchronize multimodal inputs, Qwen2.5-Omni introduces a novel Time-aligned Multimodal Rotary Position Embedding (TMRoPE) (Figure 2). TMRoPE ties the temporal position IDs of audio and video tokens to absolute time, so that content from the two modalities occurring at the same moment is aligned within the model.
Figure 2: An illustration of Time-aligned Multimodal RoPE (TMRoPE).
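The following sketch shows the time-alignment idea in simplified form. It is illustrative only: the 40 ms slot granularity is an assumption, and the real scheme also carries height and width components for visual tokens, which are omitted here.

```python
# Simplified sketch of time-aligned temporal position IDs in the spirit of TMRoPE.
# Assumption: one temporal ID corresponds to a 40 ms slot, so audio frames and video
# frames covering the same wall-clock time receive comparable temporal IDs.

SLOT_MS = 40  # assumed granularity of one temporal position

def audio_temporal_ids(num_frames: int, frame_ms: float = 40.0) -> list[int]:
    """One ID per audio frame, advancing with absolute time."""
    return [int(i * frame_ms // SLOT_MS) for i in range(num_frames)]

def video_temporal_ids(timestamps_ms: list[float]) -> list[int]:
    """One ID per video frame, derived from each frame's timestamp."""
    return [int(t // SLOT_MS) for t in timestamps_ms]

# Example: 2 s of audio at 40 ms frames vs. video sampled at 2 fps.
audio_ids = audio_temporal_ids(num_frames=50)
video_ids = video_temporal_ids([0.0, 500.0, 1000.0, 1500.0])
print(audio_ids[:5], video_ids)   # frames covering the same time get nearby IDs
```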
Data Processing and Encoding
The model processes multimodal inputs through dedicated front ends: a tokenizer for text, a ViT-based vision encoder shared by images and video, and an audio encoder that operates on mel-spectrogram features computed from the raw waveform. Leveraging a specialized encoder per modality gives the model comprehensive perception across input types.
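As a hedged sketch of the audio front end, the snippet below converts a raw waveform into a mel-spectrogram before it would reach the audio encoder. The 16 kHz / 25 ms window / 10 ms hop / 128-mel configuration is a Whisper-style assumption, not a setting quoted from the report.

```python
# Sketch of the audio front end: waveform -> mel-spectrogram (assumed parameters).
import torch
import torchaudio

waveform = torch.randn(1, 16000 * 3)          # 3 s of dummy 16 kHz audio
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,
    n_fft=400,          # 25 ms analysis window (assumption)
    hop_length=160,     # 10 ms hop (assumption)
    n_mels=128,         # Whisper-style mel-bin count (assumption)
)(waveform)
print(mel.shape)        # (1, 128, num_frames) -> input to the audio encoder
```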
Streaming Architecture
A distinguishing feature of Qwen2.5-Omni is real-time processing. Both the audio and video encoders consume their inputs block by block, which supports low-latency streaming and coherent synthesis of speech and text outputs. A sliding-window block attention mechanism in the DiT-based codec-to-waveform module downstream of the Talker (Figure 3) bounds the receptive field so audio can be emitted chunk by chunk, minimizing the delay before the first response is heard.
Figure 3: An illustration of the sliding-window block attention mechanism in the DiT used for codec-to-waveform generation.
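The sketch below builds a block-wise sliding-window attention mask of the kind used to bound a receptive field for streaming generation. The window sizes (2 look-back blocks, 1 look-ahead block) and the helper name are illustrative assumptions rather than the report's exact configuration.

```python
# Sketch of a block-wise sliding-window attention mask (assumed window sizes).
import torch

def sliding_block_mask(num_tokens: int, block: int, back: int = 2, ahead: int = 1) -> torch.Tensor:
    """True where attention is allowed: each token sees its own block plus
    `back` preceding and `ahead` following blocks."""
    blk = torch.arange(num_tokens) // block           # block index per token
    diff = blk[None, :] - blk[:, None]                # key block minus query block
    return (diff >= -back) & (diff <= ahead)

mask = sliding_block_mask(num_tokens=12, block=3)
print(mask.int())   # bounded receptive field -> audio can be decoded chunk by chunk
```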
Training Methodology
Qwen2.5-Omni was trained in three phases: first the encoders were trained while the LLM parameters were kept frozen, then all parameters were trained on comprehensive multimodal datasets, and finally training was extended to longer sequences. Subsequent fine-tuning used extensive multimodal instruction-following datasets formatted in ChatML to strengthen conversational ability, as in the example below.
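The snippet formats a conversation in ChatML style. The special tokens follow the public ChatML convention; the sample content and the helper function are invented for illustration.

```python
# Minimal sketch of ChatML-style formatting for instruction-following data
# (sample content and helper name are hypothetical).
def to_chatml(messages: list[dict]) -> str:
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )

sample = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Describe what is happening in the attached video."},
    {"role": "assistant", "content": "A person is pouring coffee while music plays."},
]
print(to_chatml(sample))
```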
Empirical Results
Benchmarks show that Qwen2.5-Omni excels at multimodal tasks, achieving state-of-the-art performance on OmniBench. Its ability to process mixed modalities, such as time-aligned audio-video input in real time, and to follow spoken instructions coherently highlights the model's strength in integrated audio-text tasks.
Conclusion
The Qwen2.5-Omni model represents a significant step forward in multimodal AI, offering an integrative approach to processing diverse data types. By employing advanced architectural designs and encoding mechanisms, such as the Thinker-Talker architecture and TMRoPE, the model proficiently manages real-time text and speech generation across multiple input modalities. This technical advancement demonstrates potential for broad applications in interactive systems, positioning Qwen2.5-Omni as a versatile tool in the development of more intuitive AI interfaces. Future work will explore the expansion of Qwen2.5-Omni’s capabilities in generating other multimodal outputs, including visual and musical data, to further bridge the gap towards embodied artificial intelligence.