Full-Duplex Speech Dialogue Systems
- Full-duplex SDS are conversational agents that enable simultaneous listening and speaking, bypassing traditional turn-based communication.
- They combine continuous ASR, LLM-driven next-token prediction, and streaming TTS with neural finite state machines for dynamic turn control.
- Benchmark results show significantly reduced response delays and improved interruption precision, enhancing real-time interaction quality.
Full-duplex Speech Dialogue Systems (Full-Duplex SDS) are conversational agents capable of simultaneous real-time bidirectional communication—listening and speaking at once. Unlike half-duplex systems that alternate strictly between user and system turns, full-duplex SDS eliminate pipeline stalls by interleaving continuous Automatic Speech Recognition (ASR), LLM token prediction, Text-to-Speech (TTS) synthesis, and hierarchical control logic. This architecture enables conversational fluidity with low response latency and human-like behaviors such as interruption management, backchannels, and overlapping speech. Recent developments utilize streaming ASR, control-token-driven neural finite state machines (FSMs), end-to-end joint next-token prediction, and advanced benchmarking protocols to achieve robust interaction dynamics and scalable evaluation (Wang et al., 29 May 2024, Lin et al., 6 Mar 2025, Lin et al., 30 Jul 2025, Zhang et al., 19 Feb 2025, Peng et al., 25 Jul 2025).
1. Architectural Principles and Core Modules
Modern full-duplex SDS implementations comprise three tightly coupled streaming components governed by a central LLM:
- Perception Module: Streaming ASR segments incoming audio into fixed-length frames (e.g., 640 ms) and produces user token chunks. Chunks are appended to the LLM input immediately while the neural FSM is in the LISTEN state; in the SPEAK state, silence frames are dropped (see the routing sketch after this list).
- Neural FSM for Turn-State Control: A compact FSM governs the dialogue flow, with two principal states—LISTEN and SPEAK. State transitions are triggered by LLM-emitted control tokens such as S.SPEAK (start/interrupt), C.LISTEN (continue listening), C.SPEAK (continue speaking), and S.LISTEN (yield/listen) (Wang et al., 29 May 2024). Extensions include multi-party modeling (per-speaker states) and additional states (e.g., PAUSE, THINK).
- Motor Function Module: Streaming TTS converts system tokens to audio and signals the LLM when playback of each token completes, so that newly generated tokens can be emitted immediately while the system holds the floor.
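The frame routing referenced above can be made concrete with a short sketch. This is a minimal illustration, not the papers' implementation: the 16 kHz sample rate, the energy-based silence test, and all names are assumptions; only the 640 ms framing and the LISTEN/SPEAK gating come from the description above.

```python
from enum import Enum

import numpy as np

SAMPLE_RATE = 16_000                       # assumed; not from the source
CHUNK_SAMPLES = int(0.640 * SAMPLE_RATE)   # 640 ms frames, per the text

class State(Enum):
    LISTEN = "LISTEN"
    SPEAK = "SPEAK"

def is_silence(chunk: np.ndarray, threshold: float = 1e-3) -> bool:
    """Toy energy test standing in for a real VAD/ASR endpointer."""
    return float(np.mean(chunk ** 2)) < threshold

def should_forward(state: State, chunk: np.ndarray) -> bool:
    """Gate a 640 ms frame before it reaches the streaming ASR and LLM tape:
    forward everything in LISTEN; drop silence while in SPEAK."""
    if state is State.LISTEN:
        return True
    return not is_silence(chunk)

# Example: a silent frame is dropped while the system is speaking.
frame = np.zeros(CHUNK_SAMPLES)
assert not should_forward(State.SPEAK, frame)
assert should_forward(State.LISTEN, frame)
```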
All modules operate within a unified next-token prediction loop, updating at each event (new ASR chunk, TTS completion, control-token emission), enabling the agent to anticipate, yield, or override user speech autonomously.
2. Next-Token Prediction, Control Flow, and FSM Formalization
The synchronous operation is realized through real-time next-token prediction over a serialized dialogue tape:
- The LLM conditions on the dialogue history $h_{1:t}$, recent ASR tokens $a_{1:t}$, and the motor function state $m_t$, sampling either a content or a control token: $x_{t+1} \sim p_\theta(\cdot \mid h_{1:t}, a_{1:t}, m_t)$.
- The FSM performs formal state transitions $s_{t+1} = \delta(s_t, x_{t+1})$, where $\delta$ fires only on control tokens, with rules such as LISTEN → SPEAK (on S.SPEAK) and SPEAK → LISTEN (on S.LISTEN).
- Pseudocode for the main interactive loop:

```python
tape = [SYSTEM_PROMPT]                  # serialized dialogue tape
s = LISTEN                              # FSM state

while True:
    new_tokens = wait_for_event()       # new ASR chunk, TTS completion, ...
    tape.extend(new_tokens)             # append perception/motor events
    next_token = LLM(tape)              # next-token prediction over the tape
    if next_token in CONTROL_TOKENS:
        s = FSM_transition(s, next_token)   # e.g. S.SPEAK, S.LISTEN
    elif s == SPEAK:
        TTS.play(next_token)            # stream content audio to the user
        tape.append(next_token)         # keep the tape consistent
```
- LLM-internal instruction tuning (typically on 1.5K+ synthetic transcripts with marked controls) yields robust handling of pauses, interruptions, and hand-offs (Wang et al., 29 May 2024).
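To make the training-data format tangible, here is a hypothetical serialized transcript. The control tokens (S.SPEAK, C.LISTEN, C.SPEAK, S.LISTEN) come from the paper; the single-tape serialization and word-level interleaving shown are illustrative assumptions.

```python
# Hypothetical instruction-tuning example: one serialized tape mixing ASR
# text, TTS text, and control tokens. Token names follow the paper; the
# exact serialization format here is an assumption.
example_tape = [
    "[C.LISTEN]", "what's", "the", "weather",
    "[C.LISTEN]", "in", "berlin", "today",
    "[S.SPEAK]", "it's", "sunny", "and",        # model takes the turn
    "[S.LISTEN]",                               # user barge-in: yield the floor
    "[C.LISTEN]", "actually", "for", "tomorrow",
    "[S.SPEAK]", "tomorrow", "looks", "rainy",
]
```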
3. Real-Time Metrics and Empirical Performance
Full-duplex SDS systems are quantitatively assessed using metrics standardized in recent benchmarks:
- First-Token Emission Delay (FTED): response latency from the end of the user's turn (or a mid-sentence interruption) to the system's first emitted token (see the metric sketch after this list). LLM-based full-duplex SDS achieve:
- Baseline half-duplex: 2.28 s
- Streaming ASR + LLM-fd + streaming TTS: 0.68 s
- Many responses arrive within 500 ms; the 90th-percentile latency is 1.6 s
- Interruption Precision Rate (IPR): Ratio of system interruptions at semantically appropriate mid-sentence points; Llama-3-8B-fd reaches 79.1% (8% higher than the best commercial LLM).
- Benchmarks such as Full-Duplex-Bench and FD-Bench define scenario-specific metrics including Takeover Rate (TOR), Backchannel Frequency, Jensen–Shannon Divergence (JSD) on backchannel timing, Success-Reply Rate (SRR), and robust handling under varying noise and interruption conditions (Lin et al., 6 Mar 2025, Peng et al., 25 Jul 2025).
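Two of these metrics are easy to pin down concretely. Below is a minimal sketch, assuming responses are timestamped in seconds and backchannel timings are binned into histograms; the binning parameters and variable names are illustrative, not taken from the benchmarks.

```python
import numpy as np

def fted(user_turn_end: float, first_token_time: float) -> float:
    """First-Token Emission Delay: system first-token time minus turn end."""
    return first_token_time - user_turn_end

def jsd(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """Jensen-Shannon divergence between two timing histograms."""
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)

    def kl(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.sum(a * np.log2((a + eps) / (b + eps))))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Example: compare system vs. human backchannel timing within 4 s turns,
# using 0.5 s bins (the binning is an illustrative choice).
sys_hist = np.histogram([0.4, 1.1, 2.3], bins=8, range=(0, 4))[0].astype(float)
hum_hist = np.histogram([0.5, 1.0, 2.0, 3.1], bins=8, range=(0, 4))[0].astype(float)
print(f"JSD = {jsd(sys_hist, hum_hist):.3f}")   # 0.0 means identical timing
```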
Quantitative results demonstrate an over 3× reduction in response latency (2.28 s → 0.68 s), high success rates in interruption handling and turn-taking, and substantial improvements in reply quality and conditional perplexity.
4. Comparison with Alternative and Modular Architectures
Compositional, plug-and-play full-duplex control modules such as FlexDuo decouple the FSM logic from the core LLM pipeline. FlexDuo introduces a third Idle state and semantic integrity-based buffering for noise filtering and mutual-interruption reduction (Liao et al., 19 Feb 2025); a minimal sketch of this control loop follows the table below. It operates outside the standard ASR → LLM → TTS cascade, emitting control signals and filtered audio for turn-taking without retraining the speech or dialogue models.
Table: Quantitative impact of FlexDuo on Fisher corpus (English baseline)
| System | Combined turn-taking score (↑) | False-interruption rate (↓) | Conditional PPL (↓) |
|---|---|---|---|
| VAD baseline | 0.81 | 0.53 | 64.32 |
| FlexDuo | 0.79 | 0.30 | 28.94 |
Removing the Idle state increases false interruptions and degrades turn-taking performance, affirming the importance of explicit filtering.
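The following sketch illustrates the three-state control idea referenced above. The Idle state and semantic-integrity buffering are from the paper's description; the completeness test, tick-based interface, and all names are illustrative assumptions rather than FlexDuo's actual implementation.

```python
from enum import Enum

class Ctl(Enum):
    IDLE = "IDLE"      # no one holds the floor; ambient noise is filtered
    LISTEN = "LISTEN"  # user holds the floor; audio is forwarded to ASR/LLM
    SPEAK = "SPEAK"    # system holds the floor; TTS is playing

def is_semantically_complete(text: str) -> bool:
    """Stand-in for a learned semantic-integrity classifier."""
    return text.endswith((".", "?", "!"))

def step(state: Ctl, chunk_text: str | None, buffer: list[str]) -> Ctl:
    """One control tick; chunk_text is the latest ASR hypothesis, if any."""
    if chunk_text is None:                       # silence on the user channel
        return Ctl.IDLE if state is Ctl.LISTEN else state
    buffer.append(chunk_text)
    if state is Ctl.SPEAK:
        # Interrupt only once buffered speech forms a complete semantic
        # unit, so coughs and short noise bursts do not cut the system off.
        if is_semantically_complete(" ".join(buffer)):
            buffer.clear()
            return Ctl.LISTEN
        return Ctl.SPEAK
    return Ctl.LISTEN                            # IDLE/LISTEN with user speech
```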
5. Extensions to Multilingual, Multi-Party, and Specialized Domains
Full-duplex SDS have been successfully adapted to Japanese conversational modeling by transferring architectures such as Moshi and applying stagewise pre-training, stereo fine-tuning, and synthetic dialogue augmentation (Ohashi et al., 3 Jun 2025). J-Moshi achieves improved perplexity and naturalness over the dGSLM baseline and more realistic overlap, mirroring Japanese conversational patterns.
FSM transitions and control-token spaces are generalizable for N-party interactions (LISTEN/SPEAK, per-speaker control tokens), with added states for THINK, PAUSE, and visual/gestural signals. Multi-modal extensions envisage gaze, facial cues, or context derived from third-party ambient speech (Wang et al., 29 May 2024, Liao et al., 19 Feb 2025).
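As one concrete reading of the N-party generalization, the sketch below keeps one state per speaker and dispatches per-speaker control tokens. The `action@speaker` token syntax and helper names are assumptions for illustration; only the state inventory (LISTEN/SPEAK plus THINK and PAUSE) comes from the text above.

```python
from enum import Enum

class S(Enum):
    LISTEN = "LISTEN"
    SPEAK = "SPEAK"
    THINK = "THINK"
    PAUSE = "PAUSE"

def apply_control(states: dict[str, S], token: str) -> None:
    """Apply a per-speaker control token, e.g. 'S.SPEAK@alice' (assumed syntax)."""
    action, speaker = token.split("@")
    states[speaker] = {"S.SPEAK": S.SPEAK, "S.LISTEN": S.LISTEN,
                       "S.THINK": S.THINK, "S.PAUSE": S.PAUSE}[action]

# Three-party example: the system deliberates while alice holds the floor.
states = {"system": S.LISTEN, "alice": S.LISTEN, "bob": S.LISTEN}
apply_control(states, "S.SPEAK@alice")
apply_control(states, "S.THINK@system")
```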
6. Benchmarking, Scenario Coverage, and Future Challenges
A rich ecosystem of benchmarks (Full-Duplex-Bench, FD-Bench, Full-Duplex-Bench v1.5/v2, FLEXI, MTR-DuplexBench, FDB-v2) systematizes evaluation via scenario-driven tests, multi-turn dynamics, interruption robustness, and modular protocol design (Lin et al., 6 Mar 2025, Peng et al., 25 Jul 2025, Lin et al., 30 Jul 2025, Ge et al., 26 Sep 2025, Zhang et al., 13 Nov 2025, Lin et al., 9 Oct 2025). Metrics include:
- Fluency, instruction following, task-specific competence (1–5 LLM-assigned scores)
- Success rates, latency, backchannel and overlap handling
- Multi-round degradation in feature success and instruction following (a minimal sketch of this measurement follows the list)
- Safety/refusal rates across adversarial prompts
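A minimal sketch of the multi-round degradation measurement, assuming each round yields per-dialogue pass/fail outcomes; the data layout and the choice of round 1 as the reference point are illustrative assumptions.

```python
def success_rates(rounds: list[list[bool]]) -> list[float]:
    """Per-round success rate over a set of evaluated dialogues."""
    return [sum(r) / len(r) for r in rounds]

def degradation(rounds: list[list[bool]]) -> list[float]:
    """Drop in success rate of each round relative to round 1."""
    rates = success_rates(rounds)
    return [rates[0] - r for r in rates]

rounds = [
    [True, True, True, False],    # round 1: 0.75
    [True, False, True, False],   # round 2: 0.50
    [False, False, True, False],  # round 3: 0.25
]
print(degradation(rounds))        # [0.0, 0.25, 0.5]
```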
Empirical findings highlight outstanding challenges: blurred turn boundaries, context drift, latency spikes as conversations progress, and systematic trade-offs between latency, conversational intelligence, and robustness to noise and interruption.
7. Future Directions and Research Recommendations
Research priorities identified include:
- End-to-end architectures employing next-token-pair prediction for joint listening and speaking
- Streaming semantic endpoint detection (e.g., Phoenix-VAD), modular and independently optimizable
- Hierarchical memory, explicit planning guidance (TurnGuide), and token-level safety/instruction filters
- Streaming and chunked decoding with roll-back capabilities (see the sketch after this list)
- Scalable adaptation to multi-party, cross-lingual, and multimodal conversational domains
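As one concrete reading of the roll-back idea, the sketch below decodes in chunks but keeps a revocable tail of tokens whose audio has not yet been played, so a late barge-in cancels unplayed speech. The two-queue design, chunk size, and names are assumptions for illustration, not a published algorithm.

```python
committed: list[str] = []    # tokens whose audio has already been played
pending: list[str] = []      # decoded but not yet played (revocable)

def decode_chunk(llm_step, chunk_size: int = 8) -> None:
    """Decode one chunk ahead of playback."""
    for _ in range(chunk_size):
        pending.append(llm_step())

def on_playback(n: int) -> None:
    """Playback callback: the first n pending tokens were rendered as audio."""
    committed.extend(pending[:n])
    del pending[:n]

def on_barge_in() -> None:
    """User interruption: roll back everything not yet played."""
    pending.clear()
```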
The consensus across contemporary full-duplex SDS research is that embedding listening, turn-taking, interruption detection, and backchanneling within a unified end-to-end joint token prediction paradigm yields the lowest latency and highest naturalness (Ge et al., 26 Sep 2025, Wang et al., 29 May 2024). The field is rapidly transitioning toward open, extensible benchmarks and modular designs to advance robust, context-aware human-machine interaction.