Qwen3.5-Omni: Next-Gen Omnimodal LLM
- Qwen3.5-Omni is a next-generation omnimodal LLM that unifies text, vision, audio, and video modalities using a novel Thinker–Talker architecture and Hybrid-Attention MoE backbone.
- It achieves state-of-the-art performance across 215 tasks through extensive pretraining on trillions of tokens and supports ultra-long context windows up to 256K tokens.
- The model demonstrates emergent capabilities in omnimodal reasoning, synchronized speech synthesis, and hierarchical captioning, making it effective for real-time, multimodal applications.
Qwen3.5-Omni is a next-generation omnimodal LLM that unifies text, vision, audio, and audio-visual understanding and generation within a single end-to-end agent. Scaling to hundreds of billions of parameters and supporting a context window of up to 256,000 tokens (approximately 10 hours of audio or 400 seconds of 720P video at 1 FPS), Qwen3.5-Omni is designed for robust, temporally precise multimodal modeling. The model achieves state-of-the-art (SOTA) results on 215 audio and audio-visual subtasks, surpassing Gemini-3.1 Pro on key benchmarks and demonstrating strong emergent capabilities such as omnimodal code generation (Team, 17 Apr 2026).
1. Technical Architecture and Design
Qwen3.5-Omni implements a two-stage Thinker–Talker architecture built atop a Hybrid-Attention Mixture-of-Experts (MoE) Transformer. The Thinker stage ingests all incoming modalities, including text, images, audio, and silent video, via modality-specific encoders: a Byte Pair Encoding (BPE) tokenizer for text, a vision encoder inherited from Qwen3.5-VL for images, and a custom “Audio Transformer” (AuT) for audio. All inputs are temporally interleaved with explicit textual timestamps (e.g., “[Time 12.3 s]”) to maintain robust temporal modeling across very long context windows, obviating the need for sparse RoPE-based position indexing.
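The timestamp interleaving described above can be illustrated with a minimal sketch. The `Chunk` type, the `<modality:payload>` placeholder strings, and the function name are hypothetical stand-ins for real encoder outputs; only the `[Time 12.3 s]` marker format comes from the text.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    modality: str   # e.g. "audio" or "video" (illustrative labels)
    start_s: float  # chunk start time in seconds
    payload: str    # stand-in for the encoder tokens of this chunk


def interleave_with_timestamps(chunks):
    """Order chunks by start time and emit one textual timestamp per distinct time."""
    sequence = []
    last_time = None
    for chunk in sorted(chunks, key=lambda c: (c.start_s, c.modality)):
        if chunk.start_s != last_time:
            # Explicit textual timestamp, in the format quoted in the text.
            sequence.append(f"[Time {chunk.start_s:.1f} s]")
            last_time = chunk.start_s
        sequence.append(f"<{chunk.modality}:{chunk.payload}>")
    return sequence


chunks = [
    Chunk("video", 0.0, "frame0"),
    Chunk("audio", 0.0, "a0"),
    Chunk("audio", 12.3, "a1"),
    Chunk("video", 12.3, "frame1"),
]
interleaved = interleave_with_timestamps(chunks)
```

Because the temporal position is carried by ordinary text tokens, the sequence stays interpretable at any context length without modality-specific position handling.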
Within each Thinker layer, dense multi-headed self-attention alternates with a sparse Mixture-of-Experts feed-forward block, while Gated Delta Net (GDN) modules accelerate KV-cache input/output operations during long-sequence inference. The Talker consumes Thinker’s hidden representations and text emissions to autoregressively predict multi-codebook speech codec tokens. Its decoder stack includes a Multi-Token Prediction (MTP) head for 40 ms frame-level residual codebooks and a streaming, causal convolutional network (“Code2Wav”) for waveform synthesis at sub-5 ms latency. The ARIA (Adaptive Rate Interleave Alignment) module enforces tight alignment between text and speech generation.
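To make the idea of a streaming, causal convolution concrete, the following generic sketch shows that processing a signal chunk by chunk gives the same output as processing it all at once, provided the last `len(kernel) - 1` samples are carried between chunks. This is not the Code2Wav network; the kernel, signal, and class names are illustrative assumptions.

```python
def causal_conv1d(signal, kernel):
    """y[t] = sum_k kernel[k] * signal[t - k]; samples before t = 0 count as zero."""
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            if t - k >= 0:
                acc += w * signal[t - k]
        out.append(acc)
    return out


class StreamingCausalConv:
    """Applies the same convolution to chunks as they arrive."""

    def __init__(self, kernel):
        self.kernel = list(kernel)
        # Past samples needed to compute the first outputs of the next chunk.
        self.history = [0.0] * (len(self.kernel) - 1)

    def process(self, chunk):
        padded = self.history + list(chunk)
        offset = len(self.history)
        out = []
        for t in range(len(chunk)):
            acc = 0.0
            for k, w in enumerate(self.kernel):
                acc += w * padded[offset + t - k]
            out.append(acc)
        if offset:
            # Keep only the tail that future outputs can depend on.
            self.history = padded[-offset:]
        return out


kernel = [0.5, 0.3, 0.2]
signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
full = causal_conv1d(signal, kernel)
conv = StreamingCausalConv(kernel)
chunked = conv.process(signal[:2]) + conv.process(signal[2:5]) + conv.process(signal[5:])
```

Each output depends only on current and past samples, which is what allows audio to be emitted before the full token sequence is known.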
Key Architectural Hyperparameters
| Parameter | Value/Characteristic |
|---|---|
| Total parameters | $O(10^{11})$–$O(10^{12})$ (hundreds of billions) |
| Hidden size | ≈12,288 |
| Attention heads/layer | 96 |
| Experts per MoE layer | 128 (top-2 active) |
| Layers (Thinker/Talker) | 60 / 16 |
| Max context window | 256K tokens (≈10 h audio or 400 s video) |
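As a rough worked calculation, the stated context figures imply the following token budgets, assuming the whole 256,000-token window is spent on a single modality and ignoring timestamp and text tokens. These are upper bounds derived from the figures above, not reported encoder rates.

```python
CONTEXT_TOKENS = 256_000      # maximum context window
AUDIO_SECONDS = 10 * 3600     # approx. 10 hours of audio
VIDEO_SECONDS = 400           # approx. 400 seconds of video
VIDEO_FPS = 1                 # sampling rate quoted for 720P video

# Upper bound on tokens available per second of audio.
audio_tokens_per_second = CONTEXT_TOKENS / AUDIO_SECONDS

# Upper bound on tokens available per sampled video frame.
video_tokens_per_frame = CONTEXT_TOKENS / (VIDEO_SECONDS * VIDEO_FPS)

print(f"audio: {audio_tokens_per_second:.2f} tokens/s")
print(f"video: {video_tokens_per_frame:.0f} tokens/frame")
```

Under these assumptions the budget works out to about 7.1 tokens per second of audio and 640 tokens per video frame.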
2. Training Corpus and Pretraining Protocol
Pretraining proceeds in three distinct stages:
- Encoder Alignment: The LLM is frozen, while image and audio encoders are aligned to the representation space using lightweight adapters.
- General Multimodal Pretraining: All model parameters are unfrozen; the model is exposed to approximately 4 trillion tokens, with a sequence length of 32,000 tokens. The training data includes:
  - 0.92T pure text tokens.
  - 1.99T audio-text pairs, drawn from over 100 million hours of audio-visual data.
  - 0.95T image-text tokens.
  - 0.14T video-text and 0.29T video-audio pairs.
- Long-Context Adaptation: The context window is expanded to 256K, with a higher fraction of long audio and video sequences to ensure stable, long-horizon modeling.
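The data mixture of the second stage can be tallied with a short calculation. The dictionary keys are hypothetical labels; the token counts are the ones listed above.

```python
# Stage-2 data mixture in trillions of tokens (values from the list above).
PRETRAIN_MIX_TRILLIONS = {
    "text": 0.92,
    "audio_text": 1.99,
    "image_text": 0.95,
    "video_text": 0.14,
    "video_audio": 0.29,
}

total_trillions = sum(PRETRAIN_MIX_TRILLIONS.values())
shares = {name: count / total_trillions for name, count in PRETRAIN_MIX_TRILLIONS.items()}

for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:12s} {share:6.1%}")
print(f"total: {total_trillions:.2f}T")
```

The components sum to 4.29T, consistent with the "approximately 4 trillion" figure, and audio-text pairs make up the largest share at a little under half of the mixture.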
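A single unified autoregressive cross-entropy objective is applied across modalities. Formally, for modality $m$ and paired sequence $(x, y)$, the loss is
$$\mathcal{L}_m = -\sum_{t=1}^{|y|} \log p_\theta\left(y_t \mid y_{<t}, x\right).$$
Different weights $\lambda_m$ are used to balance tasks during optimization:
$$\mathcal{L} = \sum_m \lambda_m \, \mathcal{L}_m.$$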
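A minimal sketch of a per-modality autoregressive cross-entropy combined through a weighted sum is given below. The weight values and the toy probabilities are hypothetical; the source does not list the actual balancing weights here.

```python
import math


def sequence_nll(target_token_probs):
    """Autoregressive cross-entropy: negative sum of log p(y_t | y_<t, x)."""
    return -sum(math.log(p) for p in target_token_probs)


def weighted_total_loss(per_modality_probs, weights):
    """Weighted sum of per-modality losses; weights balance the tasks."""
    return sum(
        weights[modality] * sequence_nll(probs)
        for modality, probs in per_modality_probs.items()
    )


# Probabilities the model assigned to the correct next tokens (toy values).
per_modality_probs = {"text": [0.5, 0.25], "audio": [0.5]}
# Hypothetical balancing weights, chosen only for illustration.
weights = {"text": 1.0, "audio": 2.0}

total_loss = weighted_total_loss(per_modality_probs, weights)
```

Here the text loss is $\ln 8$ and the audio loss is $\ln 2$, so the weighted total is $\ln 8 + 2\ln 2 = \ln 32$.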
3. Hybrid Attention Mixture-of-Experts (MoE) Framework
The Hybrid-Attention MoE backbone alternates between dense self-attention (using FlashAttention-2 for computational efficiency) and sparse feed-forward blocks with top-2 MoE routing. For input $x$ and $E$ experts, the routing network computes
$$g(x) = \mathrm{softmax}(W_r x) \in \mathbb{R}^{E}.$$
Only the two largest $g_i(x)$ values are used (top-2 gating), so only two experts are evaluated per token, reducing computational load. GDN modules further improve cache efficiency and throughput for very long sequences (Team, 17 Apr 2026).
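Top-2 gating can be sketched in a few lines of plain Python. The router matrix, the toy experts, and the choice to renormalize the two selected gate values so they sum to one are all assumptions made for illustration; the source does not specify these details.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top2_route(x, router_weights):
    """Return [(expert_index, gate), ...] for the two highest-scoring experts."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    gates = softmax(logits)
    top2 = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)[:2]
    norm = sum(gates[i] for i in top2)  # renormalization is an assumption
    return [(i, gates[i] / norm) for i in top2]


def moe_forward(x, router_weights, experts):
    """Evaluate only the two selected experts and mix their outputs."""
    out = [0.0] * len(x)
    for index, gate in top2_route(x, router_weights):
        y = experts[index](x)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out


x = [1.0, 2.0]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]  # 4 toy experts
experts = [lambda v, s=s: [s * vi for vi in v] for s in (1.0, 2.0, 3.0, 4.0)]

routed = top2_route(x, router_weights)
output = moe_forward(x, router_weights, experts)
```

With four toy experts only two are ever evaluated per input; with 128 experts per layer the same mechanism keeps the active compute per token to a small fraction of the total parameters.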
4. ARIA Alignment for Streaming Speech Synthesis
To address synchronization between text and speech streams during streaming TTS, Qwen3.5-Omni introduces ARIA (Adaptive Rate Interleave Alignment). ARIA enforces a global monotonicity constraint:
$$S_t \le \rho \, T_t,$$
where $S_t$ is the number of accumulated speech tokens and $T_t$ the number of text tokens up to step $t$, with $\rho$ as the global speech-to-text token rate learned from the pretraining corpus. At each decoding step, the model computes logits for both modalities; if $S_t + 1 > \rho \, T_t$, only text tokens are allowed, otherwise the decoder selects the next token by maximizing over both channels. This approach minimizes misalignment and unnatural prosody without introducing additional aligners or latency (Team, 17 Apr 2026).
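The rate-constrained interleaving can be sketched as a simple decoding loop. This is an illustration under stated assumptions, not the ARIA implementation: the constraint is taken to be "speech tokens never exceed `rate` times text tokens", each step is reduced to one text logit and one speech logit, and all function names are hypothetical.

```python
def aria_step(speech_count, text_count, rate, text_logit, speech_logit):
    """Choose the next channel under the budget speech_count <= rate * text_count."""
    if speech_count + 1 > rate * text_count:
        # Emitting speech would break the budget, so only text is allowed.
        return "text"
    # Otherwise pick whichever channel has the larger logit.
    return "speech" if speech_logit > text_logit else "text"


def decode(rate, step_logits):
    """Run the gate over a sequence of (text_logit, speech_logit) pairs."""
    speech_count = 0
    text_count = 0
    emitted = []
    for text_logit, speech_logit in step_logits:
        choice = aria_step(speech_count, text_count, rate, text_logit, speech_logit)
        if choice == "text":
            text_count += 1
        else:
            speech_count += 1
        emitted.append(choice)
    return emitted


# The model always prefers speech here, so the gate alone forces text tokens.
emitted = decode(rate=2.0, step_logits=[(0.0, 1.0)] * 6)
```

With a rate of 2, the loop settles into one text token followed by two speech tokens, so speech can never run ahead of the text it is supposed to voice.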
5. Benchmark Performance and Evaluation
Qwen3.5-Omni-Plus establishes SOTA performance across an extensive benchmark suite of 215 tasks, notably in audio and audio-visual reasoning, surpassing or matching Gemini-3.1 Pro. Selected performance highlights:
| Task | Gemini-3.1 Pro | Qwen3.5-Omni-Plus |
|---|---|---|
| MMAU (Accuracy) | 81.1 | 82.2 |
| RUL-MuchoMusic | 59.6 | 72.4 |
| Fleurs (ASR WER↓) | 7.32 | 6.55 |
| VoiceBench (Speech) | 88.9 | 93.1 |
| xx↔zh BLEU (top 59) | 32.1 | 32.8 |
| DailyOmni (AV) | 82.7 | 84.6 |
Multilingual zero-shot TTS yields WER = 0.99 (zh) / 1.26 (en); in voice cloning, cross-lingual zh→ko error drops from 14.4 to 4.03 (−72%). Human evaluation confirms emotional prosody control within 1 dB and 8 contours. Vision tasks match or exceed specialized unimodal baselines (Team, 17 Apr 2026).
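For readers unfamiliar with the metric, word error rate (WER) and the relative reduction quoted above can be computed as follows. This is the standard word-level edit-distance definition; the example sentences are made up, and the source's evaluation may apply text normalization that is not shown here.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before, after):
    """Fractional drop from `before` to `after`."""
    return (before - after) / before


wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
drop = relative_reduction(14.4, 4.03)
```

The toy pair contains one substitution and one deletion over six reference words, and the cross-lingual drop from 14.4 to 4.03 corresponds to a reduction of about 72%.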
6. Key Omnimodal Capabilities and Emergent Behaviors
Qwen3.5-Omni supports:
- Ultra-Long-Sequence Inference: Handles up to 10 hours of audio or 400 seconds of video per session with sub-second latency and stable KV-cache operation.
- Multilingual and Emotional Speech: Recognizes 113 speech varieties (74 languages and 39 Chinese dialects); synthesizes 36 varieties with nuanced emotional control and SIM ≈ 0.80 in voice cloning from 3-s prompts.
- Hierarchical Script-Level Captioning: Generates temporally-resolved captions, explicit scene segmentation, action/character descriptions, and event-aligned audio in structured (e.g., JSON) outputs.
- Audio-Visual Vibe Coding: Emergent ability to perform code generation from audio-visual instructions. Given, e.g., a music loop and "Generate Python code to plot its spectrogram," the model produces a valid Matplotlib script, integrating perception, reasoning, and action within a single decoding pass.
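A structured script-level caption of the kind described above might look like the following sketch. The schema (field names, nesting) and the example content are entirely hypothetical; the source states only that outputs are structured, e.g. as JSON, with scene segmentation, character and action descriptions, and event-aligned audio.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Event:
    start_s: float
    end_s: float
    description: str   # action description
    audio: str = ""    # audio aligned to this event


@dataclass
class Scene:
    index: int
    start_s: float
    end_s: float
    characters: List[str] = field(default_factory=list)
    events: List[Event] = field(default_factory=list)


def to_script_json(scenes):
    """Serialize scenes to a JSON document with a top-level 'scenes' list."""
    return json.dumps({"scenes": [asdict(s) for s in scenes]}, indent=2)


# Illustrative content only.
scenes = [
    Scene(
        index=0,
        start_s=0.0,
        end_s=8.5,
        characters=["presenter"],
        events=[
            Event(0.0, 3.2, "presenter walks to the whiteboard", audio="footsteps"),
            Event(3.2, 8.5, "presenter draws a diagram", audio="speech: 'let us begin'"),
        ],
    )
]
script_json = to_script_json(scenes)
parsed = json.loads(script_json)
```

Keeping timestamps on both scenes and events makes the hierarchy explicit: downstream tools can work at scene granularity or drill down to individual events.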
These capabilities illustrate the frontiers of end-to-end omnimodal reasoning, grounding, and real-time multimodal interaction (Team, 17 Apr 2026).