
Full-Duplex Speech LLMs

Updated 2 November 2025
  • Full-duplex Speech LLMs are advanced large language models that support simultaneous speech comprehension and production, enabling overlapping dialogue and dynamic turn-taking.
  • Architectural innovations include codec-free embeddings, explicit state tokens, and unified processing of speech and text streams to enhance responsiveness.
  • Rigorous evaluation protocols using metrics such as response latency and barge-in success validate these systems’ real-world performance and address synchronization challenges.

Full-duplex Speech LLMs are a class of LLMs architected to enable natural, real-time, bidirectional spoken dialogue. In contrast to half-duplex or turn-based conversational systems, full-duplex models support simultaneous speech production and comprehension, allowing for overlapping talk, user barge-in, rapid turn-switching, and handling of conversational phenomena such as backchannels and interruptions. Advances in this area have required the development of novel architectural paradigms, state or control mechanisms, codec-free embedding flows, and rigorous evaluation protocols, reflecting a shift from modular pipelines to unified, end-to-end, multimodal LLMs.

## 1. Architectural Paradigms: Modular, Codec-injected, Codec-free, and Unified Embedding Designs

Full-duplex Speech LLMs fall into several architecturally distinct lineages:

  1. Engineered Synchronization (Modular/Plug-in Control): Systems such as FlexDuo (Liao et al., 19 Feb 2025) introduce a decoupled, plug-and-play full-duplex control module that can graft onto any half-duplex LLM-based spoken dialogue system (SDS). A finite-state machine (FSM) models seven dialogue strategies, including an explicit “Idle” state for filtering non-target audio, supporting independent module optimization and robust context filtering.
  2. Codec-injected Unified LLMs: Open models like those in "Efficient and Direct Duplex Modeling" (Hu et al., 21 May 2025), Moshi, and SyncLLM inject discretized audio tokens, typically derived from neural codec models, directly into (or as output from) a large text LLM’s vocabulary. Separate pathways process user speech input (via continuous streaming embeddings) and agent output (via neural audio codec tokens), enabling channel fusion and parallel prediction. This yields a multi-channel, next-token prediction schema, with the loss function:

$$\mathcal{L} = \lambda_{\text{text}}\,\mathcal{L}_{\text{text}} + \lambda_{\text{speech}}\,\mathcal{L}_{\text{speech}}$$

Flexible data alignment, together with the omission of dedicated speech pretraining (made possible by powerful pretrained streaming encoders), enables resource-efficient scaling.
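
To make the objective concrete, here is a minimal PyTorch sketch of the weighted two-channel loss. The tensor shapes, the shared `ignore_index` padding convention, and the function name are illustrative assumptions rather than details of any cited system:

```python
import torch.nn.functional as F

def duplex_loss(text_logits, text_targets, speech_logits, speech_targets,
                lambda_text=1.0, lambda_speech=1.0, pad_id=-100):
    """Weighted sum of next-token cross-entropy losses over the two channels.

    Illustrative shapes: logits are (batch, seq, vocab), targets are
    (batch, seq); target positions equal to pad_id are ignored, so the
    two channels may have different effective lengths.
    """
    l_text = F.cross_entropy(text_logits.flatten(0, 1),
                             text_targets.flatten(), ignore_index=pad_id)
    l_speech = F.cross_entropy(speech_logits.flatten(0, 1),
                               speech_targets.flatten(), ignore_index=pad_id)
    return lambda_text * l_text + lambda_speech * l_speech
```

In codec-token systems the speech channel is typically factored over several codebooks, each contributing its own cross-entropy term, but the weighting scheme is unchanged.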

  3. Codec-free End-to-End Models: The SALMONN-omni series (Yu et al., 17 May 2025, Yu et al., 27 Nov 2024) departs decisively from codec-injected schemes, employing streaming speech encoders and synthesizers that interface with a single LLM backbone via high-dimensional continuous embeddings, without quantization to discrete audio tokens. All input/output modalities traverse the LLM as embedded representations, with synchronized blocks for streaming operation. Turn-taking and mode transitions are regulated by explicit state tokens output by the LLM, directly enabling seamless and expressive conversational control in full duplex.
  4. Flattened or Serialized Sequences: Models like OmniFlatten (Zhang et al., 23 Oct 2024) convert multi-stream (user/assistant, speech/text) data into a single token sequence via a "flattening" operation. This allows a conventional GPT transformer to learn complex full-duplex conversational patterns (overlap, interruption) using chunked, interleaved input-output data, with all modalities unified for end-to-end training.

## 2. Duplex Control: State Management and Dynamic Thinking

Full-duplex dialogue requires managing the conversational state, preventing premature responses, and enabling responsive interruption. Two key strategies emerge (a control-loop sketch follows this list):

- Explicit State Machines and Control Tokens: FSMs as in (Wang et al., 29 May 2024, Liao et al., 19 Feb 2025) define explicit transitions (e.g., SPEAK→LISTEN, KEEP LISTENING), with the LLM predicting state transitions as part of its next-token sequence. Control tokens may explicitly signal speaking, listening, conceding, or interruption, with fine-tuning for robust responses to ambiguous pauses and barge-in.
- Dynamic Thinking Mechanisms: SALMONN-omni (Yu et al., 17 May 2025, Yu et al., 27 Nov 2024) adopts a "thinking" mechanism, in which tokens such as <think> or <shift> are generated by the LLM to encode internal cognitive states like listening, planning, and transitioning to speech output. This enables asynchronous speech-text generation and tight integration with streaming input, allowing fine-grained control over conversational dynamics even in overlapping or noisy conditions.
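
The in-band control pattern can be made concrete with a toy finite-state machine. The state set, token strings, and transition table below are a minimal sketch for illustration; they do not reproduce FlexDuo's seven-state FSM or SALMONN-omni's exact token inventory:

```python
from enum import Enum, auto

class DuplexState(Enum):
    IDLE = auto()    # filter non-target audio; keep context clean
    LISTEN = auto()  # user holds the floor
    SPEAK = auto()   # agent holds the floor
    THINK = auto()   # planning a response while still listening

# Hypothetical control tokens the LLM might emit in its next-token stream.
TRANSITIONS = {
    (DuplexState.LISTEN, "<think>"): DuplexState.THINK,
    (DuplexState.THINK, "<speak>"): DuplexState.SPEAK,
    (DuplexState.SPEAK, "<yield>"): DuplexState.LISTEN,  # barge-in accepted
    (DuplexState.SPEAK, "<keep>"): DuplexState.SPEAK,    # backchannel ignored
    (DuplexState.IDLE, "<wake>"): DuplexState.LISTEN,
    (DuplexState.LISTEN, "<idle>"): DuplexState.IDLE,    # non-target audio
}

def step(state: DuplexState, token: str) -> DuplexState:
    """Advance the dialogue FSM on a control token; non-control tokens
    (ordinary text or speech units) leave the state unchanged."""
    return TRANSITIONS.get((state, token), state)
```

The property shared by the cited systems is that such transitions are predicted in-band, as ordinary tokens in the autoregressive stream, rather than decided by an external voice-activity controller.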
## 3. Streaming and Synchronization: Time-aware Modeling

True conversational fluidity requires real-time synchrony between input and output streams (a chunk-interleaving sketch follows this list):

- Periodic Blocking and Chunked Processing: Models process speech in fixed-size blocks or chunks (e.g., 80–500 ms), with synchronized processing of audio and dialogue events. In SyncLLM (Veluri et al., 23 Sep 2024), synchronization tokens (e.g., [S0], [S1]) provide clock ticks, enabling the LLM to align token generation with actual audio timelines for both participants. Deduplication and interpolation mechanisms handle the temporal mapping of speech and non-speech for efficient sequence modeling.
- Turn-level and Interleaving Guidance: Planning-inspired mechanisms like TurnGuide (Cui et al., 10 Aug 2025) segment dialogue into turns, generate planned text "guide" segments prior to speech output, and interleave text guidance with speech chunks for fluent, semantically meaningful interactions. Interleaving can be both channel-wise (user/assistant) and modality-wise (text/speech), strictly preserving temporal and content alignment.
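
The clock-tick idea can be illustrated with a small serialization routine that interleaves two chunked token streams behind per-channel synchronization markers. The chunk duration, marker names, and silence handling are assumptions for illustration, not SyncLLM's exact scheme:

```python
# A hypothetical serialization of two synchronized audio-token streams.
# Each "tick" covers one fixed-duration chunk (e.g., 160 ms) per speaker.
SYNC_TOKENS = ["[S0]", "[S1]"]  # speaker-channel clock markers

def interleave(user_chunks, agent_chunks, silence="<sil>"):
    """Flatten two chunked token streams into one clocked sequence.

    user_chunks / agent_chunks: lists of token lists, one entry per tick;
    an empty entry means silence for that tick. Consecutive silence chunks
    could additionally be deduplicated to shorten the sequence.
    """
    sequence = []
    for user, agent in zip(user_chunks, agent_chunks):
        sequence += [SYNC_TOKENS[0]] + (user or [silence])
        sequence += [SYNC_TOKENS[1]] + (agent or [silence])
    return sequence

# Example: the agent starts speaking while the user trails off.
seq = interleave(
    user_chunks=[["u1", "u2"], ["u3"], []],
    agent_chunks=[[], ["a1"], ["a2", "a3"]],
)
# -> [S0] u1 u2 [S1] <sil> [S0] u3 [S1] a1 [S0] <sil> [S1] a2 a3
```

Because every tick consumes a fixed wall-clock duration, the model's position in the sequence doubles as a clock, which is what keeps generation aligned with real-time audio.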
## 4. Robust Control of Interruptions, Overlaps, Backchannels, and Echo

Handling key human conversational behaviors is central:

- Intention-aware Barge-in: Modular dialogue managers based on LLMs (Zhang et al., 19 Feb 2025) distinguish intentional from unintentional user barge-ins by predicting specialized control tokens, ensuring that interruptions redirect or suppress output appropriately, while backchannels are ignored or allowed to pass.
- Contextual Filtering and Buffering: Idle states and semantic context buffering (Liao et al., 19 Feb 2025) prevent irrelevant or noisy audio from polluting the conversation context, reducing false triggers and managing interruption transitions.
- Echo and Overlap Handling: Codec-free models (Yu et al., 27 Nov 2024) and those with self-conditioning ensure the LLM can distinguish its own voice echo in the input stream, avoiding self-triggered interruptions and enabling robust full-duplex operation even in non-ideal acoustic environments.

## 5. Evaluation Benchmarks and Protocols

Rigorous, real-time, multi-turn benchmarks are now critical (a latency-metric sketch follows this list):

- Scenario-based Frameworks: FLEXI (Ge et al., 26 Sep 2025) evaluates full-duplex LLMs across six human-LLM interaction scenarios, including turn-taking, pause handling, user interruption, model interruption, and backchanneling. Metrics such as Takeover Rate (TOR), Turn-Termination Rate (TTR), Jump-in Rate (JIR), Emergency Detection Score (EDS), and Jensen-Shannon Divergence (JSD) for backchannel alignment allow granular assessment of real-world dialogue competencies.
- Automated, Multi-turn Streaming Evaluation: Full-Duplex-Bench-v2 (Lin et al., 9 Oct 2025) employs automated examiner agents and LLM-judged scoring for turn-taking fluency, instruction following, and scenario-specific competence across fast and slow pacing.
- Robustness, Latency, and Task Metrics: Metrics such as Interrupt-Response Delay (IRD), Success-Reply/Interrupt Rate, Word Error Rate (WER), Conditioned Perplexity (C-PPL), and task-specific evaluations (safety, correction, entity tracking) surface strengths and deficiencies that were previously hidden in turn-based evaluation regimes.
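
As an illustration of the latency-oriented metrics, the sketch below derives an interrupt-response delay and a barge-in success rate from timestamped events. The event schema and the fixed 1-second success threshold are illustrative assumptions, not the official scoring rules of the cited benchmarks:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class BargeInEvent:
    """One user barge-in attempt, with wall-clock timestamps in seconds."""
    user_interrupt_at: float         # user starts talking over the model
    model_stops_at: Optional[float]  # model yields the floor (None = never)

def interrupt_response_delay(events: List[BargeInEvent]) -> float:
    """Mean delay between a barge-in and the model yielding the floor."""
    delays = [e.model_stops_at - e.user_interrupt_at
              for e in events if e.model_stops_at is not None]
    return sum(delays) / len(delays) if delays else float("inf")

def barge_in_success_rate(events: List[BargeInEvent],
                          max_delay: float = 1.0) -> float:
    """Fraction of barge-ins where the model yields within max_delay seconds."""
    ok = sum(1 for e in events if e.model_stops_at is not None
             and e.model_stops_at - e.user_interrupt_at <= max_delay)
    return ok / len(events) if events else 0.0
```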
## 6. Data, Multi-talker, and Specialized Domains

High-fidelity full-duplex performance depends on rich data and robust handling of specialized domains:

- Synchronous, Dual-channel Training Data: The scarcity of spontaneous, multi-speaker, overlapping speech corpora is a bottleneck (Chen et al., 18 Sep 2025). Data pipelines for dual-track dialogue (DialoSpeech (Xie et al., 9 Oct 2025)), synthetic augmentation (Veluri et al., 23 Sep 2024), and real-world healthcare dialogue with fine-grained annotation (MMedFD (Chen et al., 24 Sep 2025)) improve coverage and robustness.
- Multi-talker and Attribute-aware ASR: MT-LLM (Meng et al., 13 Sep 2024) demonstrates how LLMs, with dual speech encoders (Whisper, WavLM) and LoRA adapters, can execute instruction-driven multi-talker ASR, speaker attribution, and context-sensitive transcription, enabling full-duplex assistants even in cocktail-party scenarios.

## 7. Limitations, Performance, and Toward Seamless Conversational AI

Despite substantial advances, current open-source models yield mixed results under rigorous evaluation:

- Latency and Interruption: Full-duplex LLMs now achieve response latencies as low as 0.68 s in fully streaming operation (Wang et al., 29 May 2024) and barge-in success rates exceeding 90% (Hu et al., 21 May 2025), matching or exceeding leading commercial systems in interruption precision.
- Semantic/Conversational Quality: Codec-free and planning-guided approaches (SALMONN-omni (Yu et al., 27 Nov 2024), TurnGuide (Cui et al., 10 Aug 2025)) recover much of the semantic coherence lost in naive speech-to-speech models, with significant gains in GPT-based reasoning and conversational scores.
- Challenges: Evaluations reveal persistent difficulties with backchannel timing, cross-turn correction, reference management, and emergency detection (Ge et al., 26 Sep 2025, Lin et al., 9 Oct 2025). Benchmarks underline the gap to human-level dialogue fluidity and the need for comprehensive next-token pair prediction or dual-stream autoregressive modeling.

### Summary Table: Full-Duplex Speech LLM System Properties

| System/Model | Duplex Mechanism / Synchronization | Codecs | Turn-taking Control | Evaluation |
|---|---|---|---|---|
| FlexDuo | FSM with Idle state, modular plug-in | N/A | 7-state FSM | F1, conditioned PPL |
| Efficient Duplex S2S | Channel fusion, codec tokens | Yes | End-to-end, fusion | Barge-in, UTMOS |
| SALMONN-omni | Codec-free, dynamic thinking | No | State tokens, embeddings | SQA, barge-in, RL |
| Freeze-Omni | Duplex multi-task, frozen LLM | Yes | VAD + state prediction | SQA, CER/WER |
| OmniFlatten | Sequence flattening | Yes | Chunked, flattened | LLM score, latency |
| Benchmarks (FLEXI, FD-Bench, FDB-v2) | N/A | N/A | N/A | Scenario-based, LLM-judged |

In summary, full-duplex Speech LLMs represent a convergence of advanced streaming architectures, stateful control protocols, end-to-end multimodal training, and rigorous, task-centric evaluation. Key innovations include codec-free joint embedding flows, dynamic state modeling within the autoregressive token space, and plug-and-play duplex control that can scale across diverse tasks and real-time domains. As benchmarks and open-source models mature, the focus is shifting from raw speech-text alignment toward naturalistic, interruption-tolerant, semantically robust human-computer conversation.