Full-Duplex Speech LLMs
- Full-duplex speech LLMs are dialogue systems that enable synchronous speaking, listening, and interruption using integrated ASR, TTS, and neural finite state machines.
- Recent advances achieve subsecond latency and high interruption precision by employing unified autoregressive Transformers and next-token pair prediction.
- Benchmarks like MTR-DuplexBench and FLEXI validate improvements in latency and naturalness while highlighting challenges in context drift and noise robustness.
Full-duplex speech LLMs (FD-SLMs) define a class of dialogue agents in which both user and machine can speak, listen, and interrupt each other in true real time, closely mimicking human-human conversational synchronicity. Unlike half-duplex systems constrained by round-based turn-taking protocols, FD-SLMs employ tightly coupled sensory (ASR), motor (TTS), and neural state control mechanisms—often integrating these into unified autoregressive Transformers—to achieve seamless fluid interaction. State-of-the-art schemes implement synchronous full-duplex by combining high-throughput streaming ASR, low-latency TTS, and explicit next-token prediction of both content and control signals within a neural finite state machine (FSM), yielding substantial improvements in latency, interruption precision, and conversational naturalness compared to traditional pipelines (Wang et al., 29 May 2024). Benchmarks such as MTR-DuplexBench and FLEXI demonstrate that FD-SLMs enable overlapping speech, timely barge-in, dynamic turn arbitration, and continuous dialogue evaluation, but also reveal persistent challenges such as semantic drift, context maintenance over multiple rounds, and breakdowns under noise or boundary ambiguities (Zhang et al., 13 Nov 2025, Ge et al., 26 Sep 2025). Recent advances emphasize end-to-end architectures, codec-free modalities, modular control layers, and specialized semantic event detectors, pointing toward an increasingly robust, low-latency, and cognitively flexible generation paradigm for spoken LLMs.
1. System Architectures for Full-Duplex Speech LLMs
FD-SLMs are typically composed of three primary functional modules: a streaming perception module (ASR), a streaming motor function module (TTS), and an LLM tightly aligned to a control FSM. In the earliest integrated scheme (Wang et al., 29 May 2024), the perception module continuously segments incoming speech, rapidly transcribes each chunk (e.g., every 640 ms), and feeds tokens to the LLM regardless of output state. The motor function module is activated by non-control output tokens emitted by the LLM, enabling real-time spoken response while feeding back acknowledgements of completed speech tokens to keep generation synchronized. The control layer is realized as a two-state FSM (Listening, Speaking), with transitions driven by LLM-generated control tokens (initiate speaking, interrupt, or wait). All content and control decisions are unified into a serialized next-token prediction process, offering temporal integration and millisecond-scale reaction latency.
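A minimal sketch of this serialized decoding loop is given below; the `llm`, `asr_chunks`, and `tts` interfaces and the control-token names are hypothetical placeholders (not the authors' actual API), intended only to illustrate how content and control tokens share one prediction stream.

```python
from enum import Enum

class State(Enum):
    LISTEN = 0
    SPEAK = 1

# Control tokens assumed to live in the LLM vocabulary; names are illustrative.
TOK_SPEAK, TOK_WAIT, TOK_INTERRUPT = "<speak>", "<wait>", "<interrupt>"
CONTROL_TOKENS = {TOK_SPEAK, TOK_WAIT, TOK_INTERRUPT}

def run_duplex(llm, asr_chunks, tts):
    """Serialized next-token loop: each ASR chunk (e.g. one per 640 ms) is
    appended to the token tape, then the LLM decodes until it yields control."""
    state, tape = State.LISTEN, []
    for chunk in asr_chunks:                 # streaming perception (ASR)
        tape.extend(chunk)                   # transcript tokens enter the tape
        while True:
            tok = llm.next_token(tape)       # unified next-token prediction
            tape.append(tok)
            if tok in CONTROL_TOKENS:        # FSM transition on control tokens
                state = State.SPEAK if tok != TOK_WAIT else State.LISTEN
                if state is State.LISTEN:
                    break                    # yield: wait for more user speech
            elif state is State.SPEAK:
                tts.feed(tok)                # motor module: stream content to TTS
            else:
                break                        # no speaking intent yet; keep listening
    return tape
```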
Alternatives incorporate modular controllers external to the LLM—such as FlexDuo's ternary state manager with explicit Idle, Listen, and Speak actions decoupled from the core dialogue engine for enhanced filtering and semantic robustness (Liao et al., 19 Feb 2025). Advanced FD-SLMs (e.g. SALMONN-omni) eliminate speech codecs from the token space and rely entirely on continuous streaming embeddings, with the LLM itself mediating state via "thinking" and transition tokens (Yu et al., 17 May 2025, Yu et al., 27 Nov 2024).
2. Neural FSMs and Full-Duplex Token Pipelines
Central to FD-SLM architectures is the FSM-based full-duplex pipeline. The neural FSM comprises the state set $S = \{\text{Listening}, \text{Speaking}\}$ and a control event set $E$ (initiate speaking, interrupt, wait), with transitions

$$\delta : S \times E \rightarrow S.$$
At each LLM decoding step, one of three token types is predicted: ASR transcript (external input), content (what to speak), or control (FSM event). The token tape grows as the agent listens and speaks, enabling tight coupling between user and model turns. In practical implementation, every new ASR chunk, control event, or TTS completion can trigger a new prediction step, allowing the agent to respond, hold, yield, or interrupt with subsecond latency.
Recent advances formalize full-duplex as next token-pair prediction (NTPP) (Ge et al., 26 Sep 2025), generating both the next dialogue token and a pause/don't-pause control bit conditioned on interleaved streams. This approach achieves native overlap handling and continuous encoding of incoming audio even while emitting output, reducing two-pass delays and minimizing semantic drift.
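As an illustration of the token-pair idea, the following PyTorch sketch (module and head names are hypothetical, not the NTPP authors' code) predicts a content token and a speak/pause control bit jointly from the same hidden state over the interleaved streams:

```python
import torch
import torch.nn as nn

class TokenPairHead(nn.Module):
    """Illustrative NTPP-style output head: from one decoder hidden state,
    jointly predict the next content token and a binary speak/pause bit."""
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.content_head = nn.Linear(d_model, vocab_size)  # next dialogue token
        self.control_head = nn.Linear(d_model, 2)            # 0 = pause, 1 = speak

    def forward(self, h: torch.Tensor):
        # h: (batch, d_model) hidden state over the interleaved user/agent streams
        return self.content_head(h), self.control_head(h)

# Training would combine both losses so overlap handling is learned natively, e.g.
# loss = CE(content_logits, next_token) + CE(control_logits, speak_or_pause)
```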
3. Evaluation Benchmarks and Metrics
Comprehensive evaluation of FD-SLMs increasingly relies on benchmarks tailored to overlapping, multi-round, full-duplex interaction. MTR-DuplexBench segments continuous dual-channel dialogues into discrete turns via multi-algorithm clustering and majority voting, enabling per-turn assessment of quality, dynamics, instruction following, and safety (Zhang et al., 13 Nov 2025). Metrics include GPT-score (0–5 coherence rating per turn), success rates for interruption, pause, and backchannel handling, response latency, and refusal rate in response to adversarial prompts.
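The per-turn segmentation step can be illustrated with a simplified frame-level majority vote over the labels produced by several candidate segmentation algorithms; this is a sketch of the general voting idea, not the published benchmark pipeline.

```python
import numpy as np

def majority_vote_turns(labelings: np.ndarray) -> np.ndarray:
    """Fuse frame-level turn labels from several segmentation algorithms.

    `labelings` has shape (n_algorithms, n_frames), each entry a small integer
    turn/state label; the fused label per frame is the per-column mode.
    """
    fused = np.empty(labelings.shape[1], dtype=labelings.dtype)
    for t in range(labelings.shape[1]):
        vals, counts = np.unique(labelings[:, t], return_counts=True)
        fused[t] = vals[np.argmax(counts)]   # most frequent label wins the frame
    return fused
```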
FLEXI explicitly evaluates model interruption in emergency scenarios, turn-arbitration metrics (TOR, TTR, jump-in rate), conversational latency (<400 ms as target), semantic similarity to reference, backchannel rates, and Jensen–Shannon divergence for temporal alignment (Ge et al., 26 Sep 2025). FD-Bench introduces interruption-handling rates and robust, time-aware measures of false interrupts under simulated noisy and noisy-interrupt conditions (Peng et al., 25 Jul 2025).
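For temporal-alignment scoring of this kind, a Jensen–Shannon divergence between histograms of turn-gap durations can be computed as in the sketch below; the gap definition, histogram range, and bin count are illustrative assumptions rather than the benchmark's exact configuration.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def temporal_alignment_jsd(model_gaps, human_gaps, bins=50, hist_range=(-2.0, 2.0)):
    """JS divergence between histograms of turn-gap durations in seconds
    (negative gaps denote overlapping speech)."""
    p, _ = np.histogram(model_gaps, bins=bins, range=hist_range, density=True)
    q, _ = np.histogram(human_gaps, bins=bins, range=hist_range, density=True)
    # scipy returns the JS *distance*; square it to obtain the divergence.
    return jensenshannon(p, q, base=2) ** 2
```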
Key empirical findings show FD-SLMs can achieve up to threefold reductions in average latency (down to 0.68 s), >50% of tokens generated within 0.5 s post-query, and measurable superiority in interruption precision over commercial models (e.g., GPT-4o), but degradation in dialogue quality, consistency, and latency over successive rounds remains a challenge.
| Benchmark | Dialogue Quality | Interruption Handling | Latency (s) |
|---|---|---|---|
| MTR-DuplexBench | 1.94 GPT-score (Moshi, Candor) | 54.7% (Moshi) | 0.68 (FD-SLM) |
| FLEXI | <1.0 EDS (open FD-SLM) | 0.4–0.5 TOR | 0.696 (Moshi) |
| FD-Bench | 4.43 (Moshi, subjective) | 83.1% SIR (Moshi) | 1.345 (Moshi, IRD) |
4. End-to-End Learning and Modularity
Recent FD-SLM advances emphasize end-to-end learning for both speech understanding and generation. Codec-based models are increasingly supplanted by architectures that operate over continuous embedding streams, entirely removing quantized audio tokens from the vocabulary. Models such as SALMONN-omni employ a streaming encoder (e.g., Mamba block), an LLM backbone with LoRA adapters, explicit "thinking" and "shift" transition tokens, and integrated streaming speech synthesis (Yu et al., 17 May 2025, Yu et al., 27 Nov 2024). The model jointly consumes all environmental and echo embeddings in blocks (e.g. 80 ms) and internally arbitrates state transitions.
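A schematic of such codec-free, block-wise streaming is sketched below; the `encoder` and `llm` interfaces and the `<think>`/`<shift>` token names are placeholders standing in for the published models' actual components.

```python
import numpy as np

BLOCK_MS = 80  # illustrative block size, matching the 80 ms figure cited above

def stream_blocks(waveform: np.ndarray, sample_rate: int = 16000):
    """Yield fixed-duration audio blocks for a codec-free streaming encoder."""
    hop = int(sample_rate * BLOCK_MS / 1000)
    for start in range(0, len(waveform), hop):
        yield waveform[start:start + hop]

def duplex_streaming_loop(waveform, encoder, llm):
    """Per block: encode to continuous embeddings (no quantized audio tokens),
    append them to the LLM context, and let the model emit either a 'thinking'
    token (remain silent) or a 'shift' token that toggles speaking."""
    for block in stream_blocks(waveform):
        emb = encoder.encode(block)      # continuous embeddings, not codec tokens
        decision = llm.step(emb)         # e.g. returns "<think>" or "<shift>"
        if decision == "<shift>":
            llm.toggle_speaking()        # internal arbitration of the dialogue state
```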
Plug-and-play modular control is achieved via explicit state prediction modules (FlexDuo, Phoenix-VAD, SemanticVAD), which act as lightweight neural controllers regulating dialogue state and event transitions, fully decoupled from the main LLM (Liao et al., 19 Feb 2025, Wu et al., 24 Sep 2025, Zhang et al., 19 Feb 2025). This modularity enables domain adaptation, rapid retraining, and independent optimization for latency and accuracy without the need to alter core LLM weights.
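Such a controller can be sketched as a small classifier over streaming acoustic/semantic features; the following is an illustrative stand-in, not the released FlexDuo, Phoenix-VAD, or SemanticVAD code.

```python
import torch
import torch.nn as nn

class DialogueStateController(nn.Module):
    """Lightweight plug-in controller: maps streaming features to
    Idle / Listen / Speak actions while the backbone LLM stays frozen."""
    IDLE, LISTEN, SPEAK = 0, 1, 2

    def __init__(self, feat_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),        # one logit per dialogue state
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.net(feats).argmax(dim=-1)  # predicted state per frame/chunk
```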
5. Challenges: Context, Latency, and Multi-Round Consistency
Key unsolved challenges persist:
- Context Drift and Boundary Ambiguity: FD-SLMs struggle to maintain instruction-following and dialogue memory across multiple overlapping rounds, with latency and success rates degrading by up to 30% from the first to the tenth round (Zhang et al., 13 Nov 2025).
- Latent Turn Segmentation: Blurring between user and agent turns complicates real-time arbitration, with imperfect segmentation leading to incorrect state assignments or lost input.
- Latency Accumulation: Real-time operation adds processing delay per round; only streaming architectures with sub-chunk prediction achieve sustained low latency (e.g., Freeze-Omni median end-to-end latency ≈ 753 ms) (Wang et al., 1 Nov 2024).
- Robustness under Noise and Interrupts: Background noise and frequent interruptions degrade interruption response rates and dialogue quality, requiring robust front-end filtering and noise augmentation (Peng et al., 25 Jul 2025).
- Instruction Following and Safety: Multi-round settings lead to context loss and reduced instruction-following success, while safety refusal rates remain stable (≈90%) (Zhang et al., 13 Nov 2025).
Recommended strategies include improved learned turn segmentation, extended attention memories, hierarchical encoders, benchmarking multi-round latency, and expansion to multilingual and noisy environments.
6. Future Directions and Open Research Problems
Emerging trends and open challenges for FD-SLMs include:
- Unified End-to-End Architectures: Further integration is anticipated, where ASR, TTS, and FSM are subsumed within a multimodal LLM that directly models audio and content streams, with next token-pair prediction for synchronous control.
- Data Expansion and Multimodality: Extension to richer datasets covering multi-party, multilingual, noisy, and multimodal (vision, gesture) contexts is necessary to capture complex conversational phenomena (Cui et al., 10 Aug 2025).
- Reinforcement and Human-in-the-Loop Learning: Deployment of RL-based optimization (e.g., DPO for interruption/backchanneling) and A/B user studies to refine naturalness and prosody.
- Adaptive Turn Length and Prosodic Planning: Adaptive control over turn length and timing, integration of paralinguistic features, and planning-inspired text guidance have been shown to markedly improve semantic coherence (Cui et al., 10 Aug 2025).
- Scalable Model Recipes: Recipes implementing frozen-backbone modularity, as in Freeze-Omni, allow rapid adaptation to future LLMs and avoid catastrophic forgetting (Wang et al., 1 Nov 2024).
In summary, FD-SLMs constitute a rapidly evolving paradigm for natural, synchronous spoken dialogue modeling—characterized by tightly aligned sensory and neural modules, end-to-end token prediction with embedded state control, formal benchmarks targeting multi-round and interruptive interaction, and persistent challenges in context drift, segmentation, and sustained latency. Future research is likely to consolidate modular control, streaming multimodal integration, and continuous evaluation methodologies to further close the gap between human-like interaction and machine-generated dialogue (Wang et al., 29 May 2024, Zhang et al., 13 Nov 2025, Ge et al., 26 Sep 2025, Cui et al., 10 Aug 2025, Wang et al., 1 Nov 2024, Liao et al., 19 Feb 2025, Peng et al., 25 Jul 2025).