Qwen2.5-Omni Technical Report (2503.20215v1)

Published 26 Mar 2025 in cs.CL, cs.CV, cs.SD, and eess.AS

Abstract: In this report, we present Qwen2.5-Omni, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. To enable the streaming of multimodal information inputs, both audio and visual encoders utilize a block-wise processing approach. To synchronize the timestamps of video inputs with audio, we organize the audio and video sequentially in an interleaved manner and propose a novel position embedding approach, named TMRoPE (Time-aligned Multimodal RoPE). To concurrently generate text and speech while avoiding interference between the two modalities, we propose the Thinker-Talker architecture. In this framework, Thinker functions as an LLM tasked with text generation, while Talker is a dual-track autoregressive model that directly utilizes the hidden representations from the Thinker to produce audio tokens as output. Both the Thinker and Talker models are designed to be trained and inferred in an end-to-end manner. For decoding audio tokens in a streaming manner, we introduce a sliding-window DiT that restricts the receptive field, aiming to reduce the initial package delay. Qwen2.5-Omni is comparable with the similarly sized Qwen2.5-VL and outperforms Qwen2-Audio. Furthermore, Qwen2.5-Omni achieves state-of-the-art performance on multimodal benchmarks like Omni-Bench. Notably, Qwen2.5-Omni's performance in end-to-end speech instruction following is comparable to its capabilities with text inputs, as evidenced by benchmarks such as MMLU and GSM8K. As for speech generation, Qwen2.5-Omni's streaming Talker outperforms most existing streaming and non-streaming alternatives in robustness and naturalness.

Summary

Qwen2.5-Omni: An Advanced Multimodal Interaction Model

The technical report on Qwen2.5-Omni presents a sophisticated approach to developing an end-to-end multimodal interaction model. The model is engineered to perceive various input modalities such as text, images, audio, and video, while simultaneously generating coherent textual and spoken language outputs. This intricate capability is realized through a series of methodical innovations in model architecture and training strategy.

Qwen2.5-Omni is distinctive for its adoption of block-wise processing within the audio and visual encoders, which lets the model handle long, streaming sequences of multimodal data efficiently. The division of labor between multimodal encoders, responsible for perceiving input across modalities, and an LLM responsible for sequence modeling streamlines the information-processing workflow, while a shared attention mechanism fuses the multimodal inputs and improves the system's interpretative accuracy.
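
The snippet below is a minimal sketch of the block-wise idea: a long feature sequence is split into fixed-size blocks and each block is encoded as soon as it fills up, so streaming input never requires the full sequence up front. The block size and the mean-pooling "encoder" are illustrative placeholders, not the model's actual configuration.

```python
import numpy as np

def blockwise_encode(features: np.ndarray, block_size: int = 100) -> np.ndarray:
    """Encode a long feature sequence block by block.

    `features` has shape (T, D): T frames of D-dimensional features.
    Each block is processed independently, so frames arriving in a stream
    can be encoded as soon as a block fills, without waiting for the end
    of the sequence.
    """
    encoded_blocks = []
    for start in range(0, features.shape[0], block_size):
        block = features[start:start + block_size]            # (<=block_size, D)
        # Stand-in for a real audio/vision encoder applied to one block.
        encoded_blocks.append(block.mean(axis=0, keepdims=True))
    return np.concatenate(encoded_blocks, axis=0)              # (num_blocks, D)

# Example: 10 seconds of 100 Hz, 80-dim audio features encoded in blocks.
audio_features = np.random.randn(1000, 80)
print(blockwise_encode(audio_features).shape)                  # (10, 80)
```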

One of the novel contributions of Qwen2.5-Omni is TMRoPE (Time-aligned Multimodal Rotary Position Embedding), designed to synchronize temporal features across audio and video inputs via an interleaved sequence structure. This lets the model align concurrent audio and video data accurately, which is crucial for tasks that require integrated processing of these modalities.
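
The following sketch shows only the time-alignment and interleaving aspect: audio and video chunks are merged in timestamp order, and chunks covering the same time window share a temporal position index. The 2-second window and the chunk format are assumptions for illustration; the real TMRoPE additionally decomposes positions into temporal, height, and width components.

```python
def interleave_by_time(audio_chunks, video_chunks, chunk_seconds=2.0):
    """Merge audio and video token chunks into one time-ordered sequence.

    Each chunk is (start_time_seconds, tokens). Chunks that cover the same
    time span end up adjacent and receive the same temporal position index,
    keeping co-occurring audio and video aligned.
    """
    merged = sorted(audio_chunks + video_chunks, key=lambda chunk: chunk[0])
    sequence, temporal_ids = [], []
    for start, tokens in merged:
        t_id = int(start // chunk_seconds)    # same time window -> same temporal id
        sequence.extend(tokens)
        temporal_ids.extend([t_id] * len(tokens))
    return sequence, temporal_ids

# Hypothetical 2-second chunks of already-tokenized audio and video.
audio = [(0.0, ["a0", "a1"]), (2.0, ["a2", "a3"])]
video = [(0.0, ["v0"]), (2.0, ["v1"])]
print(interleave_by_time(audio, video))
```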

The Thinker-Talker architecture is another pivotal innovation, enabling Qwen2.5-Omni to generate text and speech in parallel without interference between the two processes. In this architecture, "Thinker" generates textual responses, functioning like a typical LLM, while "Talker" is a dual-track autoregressive model that consumes the Thinker's hidden representations to produce speech tokens. The two components are trained and run inference end-to-end, supporting seamless, real-time operation.
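
Here is a minimal sketch of that dataflow, using stub classes in place of the real models: each Thinker step immediately hands its hidden state to the Talker, so speech tokens can start streaming before the full text response is finished. The class names, hidden-state size, and token computations are hypothetical stand-ins, not the paper's implementation.

```python
import numpy as np

class Thinker:
    """Stand-in for the text-generating LLM: returns a text token together
    with the hidden state that produced it."""
    def step(self, context):
        hidden = np.random.randn(16)            # hypothetical hidden state
        text_token = f"tok{len(context)}"
        return text_token, hidden

class Talker:
    """Stand-in for the dual-track autoregressive speech model: it conditions
    on the Thinker's hidden representation rather than on decoded text alone."""
    def step(self, thinker_hidden, audio_history):
        # Toy mapping from a hidden state to a discrete audio-token id.
        return int(abs(thinker_hidden.sum()) * 1000) % 4096

# Interleaved decoding loop: every Thinker step immediately feeds the Talker.
thinker, talker = Thinker(), Talker()
text_tokens, audio_tokens = [], []
for _ in range(5):
    tok, hidden = thinker.step(text_tokens)
    text_tokens.append(tok)
    audio_tokens.append(talker.step(hidden, audio_tokens))
print(text_tokens, audio_tokens)
```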

For streaming audio decoding, the paper introduces a sliding-window DiT that restricts the receptive field, reducing the initial packet delay before the first audio is produced and keeping the model responsive in real-time use.
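
The sketch below illustrates the receptive-field restriction as a block-level attention mask: each block of audio tokens may attend only to a few preceding blocks and limited lookahead, so the first chunk of waveform can be decoded almost immediately instead of waiting for the whole sequence. The window sizes are assumptions for illustration, not the model's exact configuration.

```python
import numpy as np

def sliding_window_mask(num_blocks: int, lookback: int = 2, lookahead: int = 1) -> np.ndarray:
    """Block-level attention mask for streaming waveform decoding.

    Block i may attend only to blocks in [i - lookback, i + lookahead],
    so decoding never depends on far-future audio tokens.
    """
    mask = np.zeros((num_blocks, num_blocks), dtype=bool)
    for i in range(num_blocks):
        lo, hi = max(0, i - lookback), min(num_blocks, i + lookahead + 1)
        mask[i, lo:hi] = True
    return mask

print(sliding_window_mask(6).astype(int))
```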

Empirically, Qwen2.5-Omni performs comparably to the similarly sized Qwen2.5-VL and outperforms Qwen2-Audio across various benchmarks. It achieves state-of-the-art results on multimodal benchmarks such as OmniBench and shows robust speech-generation performance, with low word error rates (WER) on challenging datasets. Its ability to handle complex audio-visual interactions and generate natural, contextually aware speech marks a significant advance in multimodal AI interaction.

The implications of this research are manifold. Practically, Qwen2.5-Omni could enhance user interactions with AI systems by enabling more natural and fluid exchanges, particularly in environments where multimodal input processing is essential, such as virtual assistants and educational tools. Theoretically, the model paves the way for deeper integration of diverse sensory data in AI, setting a precedent for future developments aiming to mimic human-like understanding and response capabilities.

Looking forward, the evolution of such models could encompass expanded output modalities beyond speech and text, potentially including music or graphical outputs, to enhance interaction. Furthermore, advancements in reducing latency and increasing efficiency will be crucial for deploying these models in real-world, resource-constrained settings.
