Qwen3.5-Omni: Next-Gen Omnimodal LLM

Updated 21 April 2026
  • Qwen3.5-Omni is a next-generation omnimodal LLM that unifies text, vision, audio, and video modalities using a novel Thinker–Talker architecture and Hybrid-Attention MoE backbone.
  • It achieves state-of-the-art performance across 215 tasks through extensive pretraining on trillions of tokens and supports ultra-long context windows up to 256K tokens.
  • The model demonstrates emergent capabilities in omnimodal reasoning, synchronized speech synthesis, and hierarchical captioning, making it effective for real-time, multimodal applications.

Qwen3.5-Omni is a next-generation omnimodal LLM that unifies text, vision, audio, and audio-visual understanding and generation within a single end-to-end agent. Scaling to hundreds of billions of parameters and supporting a context window of up to 256,000 tokens (approximately 10 hours of audio or 400 seconds of 720P video at 1 FPS), Qwen3.5-Omni is designed for robust, temporally precise multimodal modeling. The model achieves state-of-the-art (SOTA) results over 215 audio and audio-visual subtasks, surpassing Gemini-3.1 Pro on key benchmarks and demonstrating strong emergent capabilities such as omnimodal code generation (Team, 17 Apr 2026).

1. Technical Architecture and Design

Qwen3.5-Omni implements a two-stage Thinker–Talker architecture built atop a Hybrid-Attention Mixture-of-Experts (MoE) Transformer. The Thinker stage ingests all incoming modalities (text, images, audio, and silent video) via modality-specific encoders: a Byte Pair Encoding (BPE) tokenizer for text, a vision encoder inherited from Qwen3.5-VL for images, and a custom “Audio Transformer” (AuT) for audio. All inputs are temporally interleaved with explicit textual timestamps (e.g., “[Time 12.3 s]”) to keep temporal modeling robust across very long context windows, obviating the need for sparse RoPE-based indexing.
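The timestamp interleaving described above can be illustrated with a minimal sketch. The `Chunk` type, the `<modality:payload>` placeholder strings, and the function name are hypothetical and chosen only for illustration; the source specifies only that inputs are interleaved in time and marked with textual timestamps of the form “[Time 12.3 s]”.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    start_s: float   # start time of the chunk, in seconds
    modality: str    # e.g. "audio" or "video" (hypothetical labels)
    payload: str     # stand-in for the encoder tokens of this chunk


def interleave_with_timestamps(chunks):
    """Order chunks by start time and emit one textual timestamp per time step."""
    sequence = []
    last_time = None
    # sorted() is stable, so chunks sharing a start time keep their input order.
    for chunk in sorted(chunks, key=lambda c: c.start_s):
        if chunk.start_s != last_time:
            sequence.append(f"[Time {chunk.start_s:.1f} s]")
            last_time = chunk.start_s
        sequence.append(f"<{chunk.modality}:{chunk.payload}>")
    return sequence


chunks = [
    Chunk(12.3, "audio", "a1"),
    Chunk(0.0, "video", "v0"),
    Chunk(12.3, "video", "v1"),
    Chunk(0.0, "audio", "a0"),
]
interleaved = interleave_with_timestamps(chunks)
print(interleaved)
```

Because the time reference is carried by ordinary text tokens, the same sequence format works regardless of how many tokens each modality contributes per second.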

Within each Thinker layer, dense multi-headed self-attention alternates with a sparse Mixture-of-Experts feed-forward block, while Gated Delta Net (GDN) modules accelerate KV-cache input/output operations during long-sequence inference. The Talker consumes the Thinker’s hidden representations and text emissions to autoregressively predict multi-codebook speech codec tokens. Its decoder stack includes a Multi-Token Prediction (MTP) head for 40 ms frame-level residual codebooks and a streaming, causal convolutional network (“Code2Wav”) for waveform synthesis at sub-5 ms latency. The ARIA (Adaptive Rate Interleave Alignment) module enforces tight alignment between text and speech generation.

Key Architectural Hyperparameters

Parameter                  Value/Characteristic
Total parameters           O(10^11)–O(10^12) (hundreds of billions)
Hidden size                ≈12,288
Attention heads/layer      96
Experts per MoE layer      128 (top-2 active)
Layers (Thinker/Talker)    60 / 16
Max context window         256K tokens (≈10 h audio or 400 s video)

(Team, 17 Apr 2026)
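A few quantities follow directly from the table by simple arithmetic. The sketch below is a back-of-envelope calculation, not a figure reported by the source: it assumes the hidden size divides evenly across heads, takes 256K as 256,000 tokens (as quoted in the introduction), and treats the audio and video budgets as if the whole window were filled by a single modality.

```python
# Values copied from the hyperparameter table.
HIDDEN_SIZE = 12_288
HEADS_PER_LAYER = 96
EXPERTS_PER_LAYER = 128
ACTIVE_EXPERTS = 2
CONTEXT_TOKENS = 256_000   # assumption: 256K read as 256,000
AUDIO_HOURS = 10
VIDEO_SECONDS = 400        # at 1 FPS, so 400 frames

# Per-head dimension, assuming an even split of the hidden size.
head_dim = HIDDEN_SIZE // HEADS_PER_LAYER

# Fraction of experts that run for each token under top-2 routing.
active_expert_fraction = ACTIVE_EXPERTS / EXPERTS_PER_LAYER

# Implied upper bounds on the token budget per second of audio
# and per video frame, if one modality filled the entire window.
audio_tokens_per_second = CONTEXT_TOKENS / (AUDIO_HOURS * 3600)
video_tokens_per_frame = CONTEXT_TOKENS / VIDEO_SECONDS

print(head_dim, active_expert_fraction)
print(round(audio_tokens_per_second, 2), video_tokens_per_frame)
```

Under these assumptions each head is 128-dimensional, about 1.6% of the experts in a layer are active per token, and the window corresponds to roughly 7 tokens per second of audio or 640 tokens per video frame.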

2. Training Corpus and Pretraining Protocol

Pretraining proceeds in three distinct stages:

  1. Encoder Alignment: The LLM is frozen, while image and audio encoders are aligned to the representation space using lightweight adapters.
  2. General Multimodal Pretraining: All model parameters are unfrozen; the model is exposed to approximately 4 trillion tokens, with a sequence length of 32,000 tokens. The training data includes:
    • 0.92T pure text tokens.
    • 1.99T audio-text pairs, drawn from over 100 million hours of audio-visual data.
    • 0.95T image-text tokens.
    • 0.14T video-text and 0.29T video-audio pairs.
  3. Long-Context Adaptation: The context window is expanded to 256K, with a higher fraction of long audio and video sequences to ensure stable, long-horizon modeling.
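As a quick consistency check on the stage-2 data mixture, the listed components can be summed and converted to shares. The dictionary keys are labels chosen here for readability; the token counts are those given in the list above.

```python
# Stage-2 token counts in trillions, as listed above.
mixture_trillions = {
    "text": 0.92,
    "audio-text": 1.99,
    "image-text": 0.95,
    "video-text": 0.14,
    "video-audio": 0.29,
}

total_trillions = sum(mixture_trillions.values())
shares = {name: count / total_trillions for name, count in mixture_trillions.items()}

print(f"total: {total_trillions:.2f}T")
for name, share in shares.items():
    print(f"{name:12s} {share:6.1%}")
```

The components sum to 4.29T, consistent with the stated figure of approximately 4 trillion tokens, and audio-text data accounts for a little under half of the mixture.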

A single unified autoregressive cross-entropy objective is applied across modalities. Formally, for modality $m$ and paired sequence $(x, y)_m$, the loss is

$$\ell_m = -\sum_{t=1}^{|y|} \log p(y_t \mid y_{<t}, x; \theta)$$

Different weights $\lambda_m$ are used to balance tasks during optimization:

$$L = \sum_m \lambda_m \ell_m$$

(Team, 17 Apr 2026)
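The weighted objective can be made concrete with a small numerical sketch. The probabilities and weights below are made-up toy values, and the function names are hypothetical; the code only evaluates the per-modality negative log-likelihood and the weighted sum defined above.

```python
import math


def sequence_nll(token_probs):
    """Per-modality loss: negative log-likelihood of one target sequence.

    token_probs[t] stands for p(y_t | y_<t, x; theta) at each target position.
    """
    return -sum(math.log(p) for p in token_probs)


def total_loss(probs_by_modality, weights):
    """Weighted sum of the per-modality losses."""
    return sum(
        weights[m] * sequence_nll(probs) for m, probs in probs_by_modality.items()
    )


# Toy values (assumptions for illustration only).
probs_by_modality = {"text": [0.5, 0.25], "audio": [0.5]}
weights = {"text": 1.0, "audio": 2.0}

loss = total_loss(probs_by_modality, weights)
print(loss)
```

Here the text sequence contributes 3 ln 2 and the audio sequence contributes 2 · ln 2, so the total equals 5 ln 2.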

3. Hybrid Attention Mixture-of-Experts (MoE) Framework

The Hybrid-Attention MoE backbone alternates between dense self-attention (using FlashAttention-2 for computational efficiency) and sparse feed-forward blocks with top-2 MoE routing. For input $x \in \mathbb{R}^d$ and $E$ experts, the routing network $G$ computes

$$g = \text{softmax}\big(G(x)\big) \in \Delta^E$$

$$i, j = \text{top2}(g)$$

$$\text{MoE}(x) = g_i \, f_i(x) + g_j \, f_j(x)$$

where $f_k$ denotes the feed-forward network of expert $k$. Only the two largest $g_k$ values are used (top-2 gating), reducing computational load. FlashAttention-2 accelerates each multi-head self-attention block. GDN modules further improve cache efficiency and throughput for very long sequences (Team, 17 Apr 2026).
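The top-2 routing step can be sketched in plain Python. This is a toy illustration rather than the model's implementation: the experts are simple scaling functions, the router logits are supplied directly instead of being computed by a learned network, and the gate values are applied without renormalization, which is an assumption.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top2_moe(x, router_logits, experts):
    """Run only the two experts with the largest gate values and mix their outputs."""
    gates = softmax(router_logits)
    i, j = sorted(range(len(gates)), key=lambda k: gates[k], reverse=True)[:2]
    out_i = experts[i](x)
    out_j = experts[j](x)
    # Assumption: gate values are used as-is, without renormalizing over the top 2.
    mixed = [gates[i] * a + gates[j] * b for a, b in zip(out_i, out_j)]
    return mixed, (i, j)


# Toy experts: expert k multiplies its input by (k + 1).
experts = [lambda v, s=s: [s * e for e in v] for s in (1.0, 2.0, 3.0, 4.0)]

x = [1.0, -2.0]
router_logits = [0.0, 2.0, 1.0, -1.0]   # stand-in for G(x)
output, selected = top2_moe(x, router_logits, experts)
print(selected, output)
```

With four toy experts only two are evaluated per input, which is the source of the compute saving; at 128 experts per layer the same routing evaluates 2 of 128.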

4. ARIA Alignment for Streaming Speech Synthesis

To address synchronization between text and speech streams during streaming TTS, Qwen3.5-Omni introduces ARIA (Adaptive Rate Interleave Alignment). ARIA enforces a global monotonicity constraint:

$$S_t \le r \cdot T_t$$

where $S_t$ is the number of accumulated speech tokens and $T_t$ the number of accumulated text tokens up to step $t$, with $r$ as the global speech-to-text token rate learned from the pretraining corpus. At each decoding step, the model computes logits for both modalities; if $S_t \ge r \cdot T_t$, only text tokens are allowed; otherwise the decoder selects the next token by maximizing over both channels. This approach minimizes misalignment and unnatural prosody without introducing additional aligners or latency (Team, 17 Apr 2026).
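The rate-gated decoding rule can be sketched as a small simulation. This is a sketch under stated assumptions, not the published algorithm: the exact form of the inequality is assumed (speech is blocked once the speech count reaches the rate times the text count), the logits are fixed toy values, and all names are hypothetical.

```python
def next_channel(text_logit, speech_logit, n_speech, n_text, rate):
    """Pick the channel of the next token under the rate constraint."""
    # Assumed gate: once speech has used up its budget, only text may be emitted.
    if n_speech >= rate * n_text:
        return "text"
    # Otherwise choose whichever channel has the larger logit.
    return "speech" if speech_logit > text_logit else "text"


def simulate(steps, rate, text_logit=0.0, speech_logit=1.0):
    """Decode `steps` tokens with fixed toy logits that always prefer speech."""
    n_speech = 0
    n_text = 0
    trace = []
    for _ in range(steps):
        channel = next_channel(text_logit, speech_logit, n_speech, n_text, rate)
        if channel == "text":
            n_text += 1
        else:
            n_speech += 1
        # The speech stream never runs ahead of its text-derived budget.
        assert n_speech <= rate * n_text
        trace.append(channel)
    return trace


trace = simulate(steps=8, rate=3)
print(trace)
```

Even though the toy logits always favor speech, the gate forces a text token whenever the speech budget is exhausted, so the two streams stay interleaved at the target rate.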

5. Benchmark Performance and Evaluation

Qwen3.5-Omni-Plus establishes SOTA performance across an extensive benchmark suite of 215 tasks, notably in audio and audio-visual reasoning, surpassing or matching Gemini-3.1 Pro. Selected performance highlights:

Task                    Gemini-3.1 Pro    Qwen3.5-Omni-Plus
MMAU (Accuracy)         81.1              82.2
RUL-MuchoMusic          59.6              72.4
Fleurs (ASR WER↓)       7.32              6.55
VoiceBench (Speech)     88.9              93.1
xx↔zh BLEU (top 59)     32.1              32.8
DailyOmni (AV)          82.7              84.6

Multilingual zero-shot TTS yields WER = 0.99 (zh) / 1.26 (en); in voice cloning, cross-lingual zh→ko error drops from 14.4 to 4.03 (−72%). Human evaluation confirms emotional prosody control within 1 dB and $F_0$ contours. Vision tasks match or exceed specialized unimodal baselines (Team, 17 Apr 2026).
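The relative reductions behind these error figures can be checked with one line of arithmetic each. The helper name is hypothetical; the input numbers are taken from the paragraph and table above.

```python
def relative_reduction(before, after):
    """Fractional drop from `before` to `after` for a lower-is-better metric."""
    return (before - after) / before


# Cross-lingual zh→ko voice-cloning error: 14.4 → 4.03.
cloning_drop = relative_reduction(14.4, 4.03)

# Fleurs ASR WER: Gemini-3.1 Pro 7.32 vs. Qwen3.5-Omni-Plus 6.55.
fleurs_drop = relative_reduction(7.32, 6.55)

print(f"voice cloning: {cloning_drop:.1%}")
print(f"Fleurs WER:    {fleurs_drop:.1%}")
```

The first value reproduces the quoted −72%; the second shows that the Fleurs gap corresponds to a relative WER reduction of about 10.5%.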

6. Key Omnimodal Capabilities and Emergent Behaviors

Qwen3.5-Omni supports:

  • Ultra-Long-Sequence Inference: Handles up to 10 hours of audio or 400 seconds of video per session with sub-second latency and stable KV-cache operation.
  • Multilingual and Emotional Speech: Recognizes 113 speech varieties (74 languages and 39 Chinese dialects); synthesizes 36 varieties with nuanced emotional control and SIM ≈ 0.80 in voice cloning from 3-s prompts.
  • Hierarchical Script-Level Captioning: Generates temporally-resolved captions, explicit scene segmentation, action/character descriptions, and event-aligned audio in structured (e.g., JSON) outputs.
  • Audio-Visual Vibe Coding: Emergent ability to perform code generation from audio-visual instructions. Given, e.g., a music loop and "Generate Python code to plot its spectrogram," the model produces a valid Matplotlib script, integrating perception, reasoning, and action within a single decoding pass.

These capabilities illustrate the frontiers of end-to-end omnimodal reasoning, grounding, and real-time multimodal interaction (Team, 17 Apr 2026).
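To make the hierarchical captioning format more tangible, the sketch below shows one possible shape for a structured output and a basic consistency check. The field names (`scenes`, `start_s`, `audio_events`, and so on) and the example content are hypothetical; the source states only that outputs are structured (e.g., JSON) and include scene segmentation, action/character descriptions, and event-aligned audio.

```python
import json

# Hypothetical caption document; field names and content are illustrative only.
caption = {
    "scenes": [
        {
            "start_s": 0.0,
            "end_s": 4.5,
            "description": "A person enters a kitchen.",
            "characters": ["person"],
            "actions": ["opens door", "walks in"],
            "audio_events": [{"time_s": 1.2, "label": "door creaks"}],
        },
        {
            "start_s": 4.5,
            "end_s": 9.0,
            "description": "The person fills a kettle.",
            "characters": ["person"],
            "actions": ["turns on tap"],
            "audio_events": [{"time_s": 5.0, "label": "running water"}],
        },
    ]
}


def scenes_are_well_formed(doc):
    """Check that scenes are ordered, non-overlapping, and contain their audio events."""
    previous_end = 0.0
    for scene in doc["scenes"]:
        if not (previous_end <= scene["start_s"] < scene["end_s"]):
            return False
        for event in scene["audio_events"]:
            if not (scene["start_s"] <= event["time_s"] <= scene["end_s"]):
                return False
        previous_end = scene["end_s"]
    return True


serialized = json.dumps(caption, indent=2)
print(serialized)
print(scenes_are_well_formed(caption))
```

A check of this kind is one way a downstream consumer might validate temporally resolved captions before using them.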
