SALMONN-omni: Full-Duplex Speech LLM

Updated 2 November 2025
  • SALMONN-omni is a standalone, codec-free full-duplex speech LLM that leverages continuous embedding streams and dynamic thinking for seamless human-computer dialogue.
  • It combines a Mamba-based encoder, a Llama-3-8B-Instruct backbone, and a CosyVoice2 synthesizer to manage real-time turn-taking, echo cancellation, and backchanneling.
  • End-to-end training with reinforcement learning fine-tuning enhances context awareness, yielding significant gains in QA accuracy and dialog robustness over prior systems.

SALMONN-omni denotes a standalone, codec-free, full-duplex speech LLM architecture, designed for seamless, naturalistic spoken human-computer interaction. Unlike prior modular or codec-injection-based conversational AI frameworks, SALMONN-omni achieves simultaneous speech understanding and generation exclusively via continuous embedding streams, integrating a dynamic internal state mechanism termed “thinking.” This model addresses critical challenges in dialog systems—including barge-in, turn-taking, echo cancellation, and context-dependent state prediction—by leveraging an explicit state-control strategy and unified end-to-end training. Performance evaluations demonstrate robust gains in knowledge QA, open-domain oral conversation, and full-duplex dialog robustness, with further enhancements enabled by reinforcement learning.

1. Architectural Principles: Standalone Full-duplex LLM without Codec Injection

SALMONN-omni is constructed as a single, unified system comprising:

  • Streaming Speech Encoder: Employs a Mamba-based architecture to extract continuous auditory embeddings at 25 Hz from all environmental audio (user speech, background noise, and the assistant's self-speech). Teacher distillation from Whisper-large-v3 ensures high-fidelity representations; a sketch of this objective follows the list.
  • LLM Backbone: Utilizes Llama-3-8B-Instruct, fine-tuned via LoRA (rank 32) for conversational alignment, internal state prediction, and multi-modal representation.
  • Streaming Speech Synthesizer: CosyVoice2-based (0.5B parameters), translating LLM output embeddings directly into speech with synchronized time-block processing.
  • Token Space: All inter-module communication is realized through continuous embeddings; no audio codec tokens (e.g., EnCodec sequences) are injected into the LLM vocabulary, in contrast to Moshi and SyncLLM (Yu et al., 27 Nov 2024, Yu et al., 17 May 2025). This approach preserves paralinguistic and acoustic information and markedly reduces pipeline complexity and error accumulation.
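
As a concrete illustration of the encoder's teacher distillation, here is a minimal sketch. It assumes feature-level regression of the streaming Mamba encoder's outputs onto frozen Whisper-large-v3 encoder features; the module interfaces and the MSE objective are illustrative assumptions, since the text above only states that Whisper-large-v3 serves as the teacher.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_feats: torch.Tensor,
                      teacher_feats: torch.Tensor) -> torch.Tensor:
    """Feature-level distillation loss.

    Both tensors have shape (batch, frames, dim) and are assumed to be
    aligned to the same 25 Hz frame rate. MSE is an assumed objective,
    not one confirmed by the source.
    """
    return F.mse_loss(student_feats, teacher_feats)

def distill_step(student, teacher, audio, optimizer):
    """One hypothetical training step. `student` is the streaming Mamba
    encoder and `teacher` a frozen Whisper-large-v3 encoder; both are
    placeholder modules, not a released API."""
    with torch.no_grad():
        target = teacher(audio)   # frozen teacher features
    pred = student(audio)         # streaming student features
    loss = distillation_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```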

The environment and assistant output streams are interleaved and supplied to the LLM, enabling autoregressive, block-synchronous dialog management without modular control structures.
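
The block-synchronous loop implied by this design can be sketched as follows. Every interface here is hypothetical (`encoder`, `llm.step`, `synthesizer.stream`, and `mic_stream` are placeholder names); the sketch only shows how continuous embeddings, rather than codec tokens, flow between the modules in each time block.

```python
def duplex_loop(encoder, llm, synthesizer, mic_stream):
    """Minimal sketch of block-synchronous full-duplex inference.

    Each iteration handles one 80 ms time block: environmental audio
    (user speech, background noise, and the assistant's own playback)
    is encoded into continuous embeddings, interleaved with the
    assistant's previous output embeddings, and fed to the LLM, which
    predicts a state token plus output embeddings for the block.
    """
    prev_out = []                             # assistant-side stream, last block
    for audio_block in mic_stream:            # one 80 ms chunk of raw audio
        env = encoder(audio_block)            # ~2 embeddings at 25 Hz
        state, out = llm.step(env, prev_out)  # autoregressive block step
        if state == "<speak>":
            synthesizer.stream(out)           # embeddings -> streamed speech
            prev_out = out                    # loop output back as input stream
        else:                                 # "<listen>": stay silent
            prev_out = []
```

Because the assistant's own output embeddings re-enter the model alongside the environment stream, the LLM can recognize and discount its own speech in the microphone signal, which underpins the echo handling described in the next section.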

2. Duplex Conversational State and Dynamic Thinking Mechanism

Central to SALMONN-omni’s full-duplex capabilities is an explicit “dynamic thinking” mechanism:

  • State Tokens: The LLM is trained to predict special tokens (<think>, <shift>, <listen>, <speak>) that signal conversational transitions during streaming inference. Explicit state control (tokens appearing as both input and output) yields higher task accuracy and dialog quality than implicit approaches.
  • Autonomous State Prediction: The model leverages its autoregressive generative process to decide, in real time, when to speak, listen, pause, or shift conversational context, effectively emulating human turn-taking, barge-in responsiveness, and backchanneling.
  • Echo Handling: User and assistant streams are processed jointly; explicit state tokens support effective echo suppression even in the presence of substantial self-speech input.
  • Periodic Synchronization: Dialogue is divided into time blocks (e.g., 80 ms windows); within each block, input streams are encoded, state transitions are predicted, and the corresponding speech is emitted. Output embeddings (4 tokens per block) are converted into 12 speech tokens (480 ms of audio), keeping conversational latency at or below 320 ms.
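
To make the explicit state-control strategy concrete, the following sketch shows one possible transition rule over the state tokens named above. The transition logic and the `backchannel` flag are illustrative assumptions rather than the paper's exact policy; the timing arithmetic in the comments follows directly from the block geometry just described.

```python
# Hypothetical explicit state-control step for one time block.
STATE_TOKENS = {"<think>", "<shift>", "<listen>", "<speak>"}

# Timing recap (from the description above):
#   80 ms per block, 25 Hz embeddings -> 2 input frames per block
#   4 output embeddings per block -> 12 speech tokens (480 ms of audio)
#   <= 320 ms conversational latency, i.e. at most 4 blocks of 80 ms

def step_state(current: str, predicted: str, backchannel: bool) -> str:
    """Advance the dialog state given the LLM's predicted token.

    `backchannel` stands in for the model's own judgment that the
    overlapping user audio is a minor acknowledgment ("uh-huh") rather
    than a genuine barge-in; this flag is an assumption for clarity.
    """
    if predicted not in STATE_TOKENS:
        return current                  # ordinary content token: no change
    if predicted == "<shift>":          # hand the turn to the other party
        return "<speak>" if current == "<listen>" else "<listen>"
    if predicted == "<listen>" and current == "<speak>" and backchannel:
        return current                  # ignore the backchannel, keep talking
    return predicted
```

For example, `step_state("<speak>", "<listen>", backchannel=True)` keeps the assistant talking through a user's "uh-huh", while the same prediction with `backchannel=False` yields a barge-in style hand-over of the turn.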

3. Unified End-to-end Training and Reinforcement Learning

SALMONN-omni is trained in three sequential stages:

  1. Connection Training: The Mamba encoder-LLM connector is optimized for ASR and spoken QA with the encoder and LLM weights frozen.
  2. Synthesizer Attachment: The CosyVoice2 synthesizer is integrated, and the model is trained end-to-end on ASR, QA, and multi-turn conversations, including synthetic barge-in and backchannel scenarios.
  3. Reinforcement Learning (Direct Preference Optimization, DPO): RL fine-tuning calibrates state transitions and responses to barge-in and backchanneling via the explicit state tokens, substantially increasing context awareness and dialog reliability (F1 improved from 0.86 to 0.90).

This procedure contrasts sharply with prior models that rely on external voice activity detectors, interruption modules, or codec-injected cross-modal pipelines (Yu et al., 27 Nov 2024, Yu et al., 17 May 2025).

4. Experimental Evaluation: Benchmarks and Performance Gains

SALMONN-omni is evaluated on spoken-form QA (Llama Questions, Web Questions, TriviaQA), open-domain dialogue (AlpacaEval/VoiceBench), and bespoke full-duplex dynamics (turn-taking, barge-in, backchanneling):

| Model        | Llama Q. Acc. | Web Q. Acc. | TriviaQA Acc. | AlpacaEval GPTScore | Mode        |
|--------------|---------------|-------------|---------------|---------------------|-------------|
| Moshi        | 60.8 / 54.5   | 23.4 / 22.1 | 25.6 / 16.7   | 1.84 / 1.76         | Full-Duplex |
| Freeze-Omni  | 74.2 / 56.2   | 40.8 / 27.9 | 45.1 / 28.5   | 3.90 / 2.46         | Full-Duplex |
| SALMONN-omni | 79.3 / 73.6   | 49.7 / 43.7 | 63.6 / 56.0   | 4.01 / 3.22         | Full-Duplex |
| GLM-4-Voice  | 75.0 / 65.7   | 38.5 / 37.0 | 50.8 / 47.5   | 3.82 / 3.58         | Half-Duplex |

  • Turn-taking prediction: 99.7% (Llama Questions), 92.8% (TriviaQA), and 92.0% (AlpacaEval), outperforming all full- and half-duplex baselines.
  • Barge-in/backchannel (context-independent): F1 = 0.88 (echo ×1.0), robust against system echo, with superior recall.
  • Context-dependent, RL/DPO-enhanced: F1 improved from 0.86 to 0.90, demonstrating substantial adaptation to nuanced dialog scenarios.
  • Emotion intensity score: 3.49, the highest among open models.

SALMONN-omni achieves a relative improvement of at least 35.9% over all open full-duplex systems under predicted (unassisted) turn-taking, and matches or outperforms turn-based systems trained on up to 13M hours of data despite its substantially lower data requirement.

5. Advanced Conversational Behaviors: Barge-in, Turn-taking, Backchanneling, Echo Cancellation

SALMONN-omni is capable of:

  • Simultaneous listening and speaking: Real-time handling of overlapping dialog enables barge-in interruption and context-aware dialog shifts that are not possible with half-duplex, modular, or codec-injected systems.
  • Echo cancellation: Self-speech present in the microphone input is distinguished and discounted through the interleaved stream processing.
  • Backchanneling: Explicit token prediction enables the model to ignore minor user utterances (e.g., "uh-huh") that should not interrupt the assistant's response.
  • General dialog flexibility: Autonomous state control supports dynamic, responsive conversation without reliance on external detectors or controllers.

6. Ablation Analysis and Policy Implications

  • Explicit state-control tokens provide robust, non-disruptive dialog transitions, outperforming implicit/latent state approaches.
  • Token diversity: Increasing the complexity of the state-control token set (beyond <think> and <shift>) reduces dialog quality; low-entropy representations are optimal.
  • RL (DPO): The first application of RL to a full-duplex speech LLM, producing adaptive policies for interruption, turn-taking, and echo resilience.

A plausible implication is that the LLM backbone is natively capable of learning multi-stream, temporal dialog phenomena, given appropriate autoregressive, stateful supervision.

7. Impact and Deployment Considerations

SALMONN-omni represents a paradigm shift for spoken conversational AI:

  • Direct deployment as a standalone assistant: No dependency on complex multi-module pipelines or codec-tokenized cross-modal bridges.
  • Low latency, high resilience: Internal streaming and explicit state prediction yield fluid, responsive interaction even under challenging acoustic and overlap conditions.
  • Reduced data requirement: Substantially less training data than competitive systems, with high generalizability from end-to-end joint training.
  • Streamlined architecture: Embedding-based, stateful full-duplex mechanisms may generalize to multi-agent and non-speech modalities, forming a basis for future conversational models.

8. Conclusion

SALMONN-omni establishes a new state of the art for open, full-duplex speech LLMs by discarding codec injection and external-module reliance, adopting explicit dynamic-thinking state-control tokens, and leveraging end-to-end policy optimization. It excels at simultaneous listening and speaking, robust interruption and echo handling, context-dependent state transitions, and expressive dialog, with superior scores on open benchmarks. This suggests emerging capabilities for fully autonomous conversational agents across domains, with substantial resource efficiency and deployment flexibility (Yu et al., 17 May 2025, Yu et al., 27 Nov 2024).