Full-Duplex Spoken Dialogue Model
- Full-duplex spoken dialogue models are systems that process user and system speech concurrently, closely mimicking natural human interactions.
- They employ advanced architectures like end-to-end streaming and dynamic state control to manage overlapping speech, interruptions, and backchannels.
- These models support low-latency bidirectional exchanges, making them well suited to applications such as customer service and interactive conversational agents.
A full-duplex spoken dialogue model is a computational system designed to engage in human-machine conversation by simultaneously listening and speaking, thereby more closely mirroring the natural characteristics of human interaction. Unlike traditional half-duplex (turn-based) systems, which alternate between receiving and generating speech, full-duplex models are engineered for real-time bidirectional exchange, accommodating overlapping speech, interruptions, and rapid backchannels.
1. Definition and Core Principles
Full-duplex spoken dialogue refers to the system's capacity to process incoming user speech and generate output speech concurrently. This design extends the telecommunication concept of full-duplex—bidirectional, simultaneous communication—to conversational AI, eliminating artificial turn-taking constraints and supporting natural dialogue constructs such as overlaps, interjections, and interruptions.
Key properties of a full-duplex dialogue model include:
- Simultaneous listening and speaking: The model processes input and produces output in real time on parallel channels.
- Responsive turn-taking: The system can handle user barge-ins, issue backchannel feedback, and dynamically cede or acquire the conversational floor.
- Low interaction latency: System response times must approach natural human conversational gaps (typically <500 ms), minimizing the temporal distinction between turns.
2. Representative Architectural Approaches
Architectural advances in full-duplex spoken dialogue modeling have transitioned from cascaded pipelines to end-to-end and streaming neural approaches.
Cascaded pipelines traditionally sequence modules for ASR, dialogue management, and TTS, coordinated through voice activity detection (VAD) and explicit turn-state FSMs. Limitations in these systems include high cumulative latency and diminished ability to model overlapping or ambiguous dialogue segments [WavChat (Ji et al., 15 Nov 2024)].
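To make the cumulative-latency problem concrete, the minimal sketch below sums hypothetical per-module delays in a cascaded pipeline. The module names and timing values are illustrative assumptions for this article, not measurements from any cited system.

```python
# Illustrative only: hypothetical per-module delays (in ms) for a cascaded pipeline.
# Real systems differ; the point is that stage latencies add up serially.
cascaded_stage_latency_ms = {
    "vad_endpointing": 300,   # waiting for end-of-turn silence
    "asr": 150,               # streaming ASR finalization
    "dialogue_manager": 100,  # NLU + response generation
    "tts_first_audio": 200,   # time to first synthesized audio frame
}

total_ms = sum(cascaded_stage_latency_ms.values())
print(f"Approximate response delay: {total_ms} ms")  # well above the ~500 ms conversational gap
```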
End-to-end models—exemplified by frameworks such as Moshi (Défossez et al., 17 Sep 2024) and SALMONN-omni (Yu et al., 17 May 2025, Yu et al., 27 Nov 2024)—integrate perception, reasoning, and generation within a single neural model. Common design patterns include:
- Multi-stream processing: Parallel channels for user and system audio streams are modeled jointly, capturing natural overlaps and interjections.
- Streaming encoders/decoders: Embedding extractors (e.g., HuBERT, Mamba, CosyVoice) process audio in fixed windows (e.g., 80–200 ms) for input and output, supporting causal, low-latency inference.
- Unified token or embedding spaces: Systems avoid intermediate text stages (codec-free or token-level modeling) to preserve paralinguistic information and lower response delay.
- Internal state control: Explicit or implicit state tokens (e.g., <shift>, start/stop-speak markers) are autoregressively predicted to regulate transitions between listening and speaking (Yu et al., 17 May 2025); a per-step sketch follows this list.
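The sketch below illustrates the per-frame loop such end-to-end models typically run: at each step the model consumes one block of user audio and jointly emits an agent audio token plus a dialogue-state token. The class, method, and token names here are placeholders for illustration, not the API of any cited system.

```python
from dataclasses import dataclass
from typing import List

# Placeholder state markers; actual systems use model-specific tokens
# such as <shift> or start/stop-speak.
@dataclass
class FullDuplexStep:
    agent_audio_token: int   # codec token for the agent's output frame
    state_token: str         # predicted dialogue-state marker

class ToyFullDuplexModel:
    """Stand-in for a streaming dual-stream model (illustrative only)."""

    def step(self, user_audio_frame: bytes, history: List[FullDuplexStep]) -> FullDuplexStep:
        # A real model would run a causal transformer over both channels here;
        # this toy rule simply cedes the floor when user audio arrives mid-speech.
        speaking = bool(history) and history[-1].state_token == "SPEAK"
        if not user_audio_frame:
            state = "SPEAK"
        else:
            state = "SHIFT" if speaking else "LISTEN"
        return FullDuplexStep(agent_audio_token=0, state_token=state)

model = ToyFullDuplexModel()
history: List[FullDuplexStep] = []
for frame in [b"", b"user-audio", b"user-audio", b""]:  # 80-200 ms blocks in practice
    out = model.step(frame, history)
    history.append(out)
    print(out.state_token)
```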
3. Dialogue State Modeling and Turn Management
Realistic full-duplex dialogue requires fine-grained state management, surpassing basic listen/speak alternation. Approaches include:
- Finite State Machines (FSMs): These coordinate system states such as SPEAK, LISTEN, and often an explicit IDLE state to handle noise, silence, or backchannels [FlexDuo (Liao et al., 19 Feb 2025)].
- Control Tokens and Semantic VAD: LLM-based dialogue managers can use control tokens (e.g., <|S-S|>, <|C-S|>, <|C-L|>, <|S-L|>) to semantically distinguish complete queries, real vs. spurious barge-ins, and to robustly manage turn switching (Zhang et al., 19 Feb 2025).
- Dynamic thinking mechanisms: Codec-free models (e.g., SALMONN-omni) incorporate "thinking" tokens, allowing the LLM to signal internal deliberation without outputting speech, thus accurately reflecting hesitation and reasoned response timing (Yu et al., 27 Nov 2024, Yu et al., 17 May 2025).
State transition logic is typically handled as a next-token or action prediction task, with supervised fine-tuning on annotated full-duplex conversations or, increasingly, RL-based policy optimization to refine turn-taking and interruption management.
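As a minimal sketch of this style of turn management, the finite-state example below uses SPEAK, LISTEN, and IDLE states driven by predicted control events. The event names follow the general pattern described above (semantic VAD or control-token outputs) and are not the exact token inventory of any cited model.

```python
from enum import Enum, auto

class DialogueState(Enum):
    IDLE = auto()    # background noise / silence, no active turn
    LISTEN = auto()  # user holds the floor
    SPEAK = auto()   # system holds the floor

# Illustrative transition table keyed on (state, predicted event).
# Events would come from a semantic VAD / control-token head in a real system.
TRANSITIONS = {
    (DialogueState.IDLE,   "user_speech_start"): DialogueState.LISTEN,
    (DialogueState.LISTEN, "query_complete"):    DialogueState.SPEAK,
    (DialogueState.LISTEN, "user_silence"):      DialogueState.IDLE,
    (DialogueState.SPEAK,  "barge_in"):          DialogueState.LISTEN,  # real interruption: cede the floor
    (DialogueState.SPEAK,  "backchannel"):       DialogueState.SPEAK,   # "uh-huh": keep speaking
    (DialogueState.SPEAK,  "response_done"):     DialogueState.IDLE,
}

def step(state: DialogueState, event: str) -> DialogueState:
    # Unknown or spurious events leave the current state unchanged.
    return TRANSITIONS.get((state, event), state)

state = DialogueState.IDLE
for event in ["user_speech_start", "query_complete", "backchannel", "barge_in"]:
    state = step(state, event)
    print(event, "->", state.name)
```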
4. Data, Training Paradigms, and Modeling Strategies
Data Requirements and Preparation
- Dual-channel and stereo datasets (e.g., Fisher corpus, custom synthetic corpora) are fundamental, as they provide precise alignment and labeling of user/system turns, overlaps, and interjections (Wang et al., 1 Jun 2025).
- Synthetic augmentation: Multi-stream TTS, as well as LLM-driven rewriting, is employed to expand coverage, simulate interruptions, and mimic spoken linguistic styles in low-resource languages (Japanese, Taiwanese Mandarin) (Ohashi et al., 3 Jun 2025, Yang et al., 11 Nov 2024).
- Time alignment: Models often align speech and text tokens at fixed steps (e.g., 12.5 Hz for Moshi, 80–200 ms for block-based models), managing variable length and asynchronous speech via padding or explicit marker tokens.
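A small sketch of the fixed-step alignment arithmetic mentioned above: at a 12.5 Hz frame rate each token position covers 80 ms, and a shorter text-token stream can be padded with an explicit marker so both streams share one time axis. The padding marker name is a placeholder, not a token defined by any cited model.

```python
FRAME_RATE_HZ = 12.5                 # e.g., a Moshi-style fixed step
FRAME_MS = 1000 / FRAME_RATE_HZ      # 80 ms of audio per token position
PAD = "<pad>"                        # illustrative marker token

def align_text_to_speech(text_tokens, num_speech_frames):
    """Pad (or truncate) the text stream so it shares the speech stream's time axis."""
    padded = list(text_tokens)[:num_speech_frames]
    padded += [PAD] * (num_speech_frames - len(padded))
    return padded

audio_seconds = 2.0
num_frames = int(audio_seconds * FRAME_RATE_HZ)   # 25 frames for 2 s of audio
print(FRAME_MS, num_frames)
print(align_text_to_speech(["hel", "lo"], num_frames))
```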
Training Objectives
- Multi-task cross-entropy losses over speech units, text tokens, and state markers are standard. Weighting is often used to prioritize semantic or conversational structure over low-level acoustics (a minimal sketch of such a weighted objective follows this list).
- Joint token/channel modeling: Novel paradigms such as next-token-pair prediction (NTPP) jointly predict next tokens for both user and agent channels at each timestep, establishing speaker independence and natural conversational dynamics (Wang et al., 1 Jun 2025).
- Semi-supervised and data augmentation strategies are used to exploit unlabeled data, increase generalizability, and improve robustness to domain or environmental variation (Lin et al., 2022).
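The sketch below shows one way a weighted multi-task cross-entropy objective over parallel text, state, and speech-unit streams could be written; the stream names, vocabulary sizes, and loss weights are assumptions for illustration rather than the configuration of any cited system.

```python
import torch
import torch.nn.functional as F

# Illustrative loss weights; real systems tune these to favor semantic/text
# and state streams over low-level acoustic detail.
WEIGHTS = {"text": 1.0, "state": 1.0, "speech": 0.5}

def multitask_loss(logits: dict, targets: dict) -> torch.Tensor:
    """Weighted sum of per-stream cross-entropy losses.

    logits[name]:  (batch, seq_len, vocab_size) predictions for one stream
    targets[name]: (batch, seq_len) integer labels for that stream
    """
    total = 0.0
    for name, weight in WEIGHTS.items():
        lgt = logits[name].flatten(0, 1)   # (batch*seq, vocab)
        tgt = targets[name].flatten()      # (batch*seq,)
        total = total + weight * F.cross_entropy(lgt, tgt)
    return total

# Toy shapes: batch of 2, 4 time steps, small per-stream vocabularies.
streams = [("text", 100), ("state", 4), ("speech", 1024)]
logits = {n: torch.randn(2, 4, v) for n, v in streams}
targets = {n: torch.randint(0, v, (2, 4)) for n, v in streams}
print(multitask_loss(logits, targets))
```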
5. Evaluation Metrics, Benchmarks, and Comparative Performance
Evaluation encompasses both objective and subjective criteria, addressing the unique requirements of real-time, interactive spoken dialogue:
- Turn-taking performance: Metrics include First Token Emission Delay (FTED), turn-switching F1, and takeover rate (TOR) during natural and pause-containing turns [Full-Duplex-Bench (Lin et al., 6 Mar 2025)].
- Backchannel and interruption handling: Frequency, latency, and appropriateness of backchannels; barge-in success rate and latency; and model’s ability to distinguish intentional vs. unintentional interruptions.
- Dialogue quality: Perplexity of transcribed outputs (relative to text baselines), conditioned on ground-truth or reference human performance (Veluri et al., 23 Sep 2024, Ohashi et al., 3 Jun 2025).
- Naturalness and human-likeness: Human Mean Opinion Scores (MOS), dialogue meaningfulness, and human evaluations of fluency and overlap dynamics.
- Automatic benchmarking frameworks: Full-Duplex-Bench is an example of a dedicated evaluation suite for pause, backchannel, turn-taking, and interruption measurement using automatic and LLM-based grading (Lin et al., 6 Mar 2025).
- Latency: Measured both as average response time and as the proportion of responses under fixed thresholds (e.g., 500 ms), with top-performing models achieving human-comparable responsiveness (<250 ms in Moshi (Défossez et al., 17 Sep 2024), <1 second in efficient S2S models (Hu et al., 21 May 2025)).
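The small sketch below computes the two latency statistics named above over a list of measured response times; the threshold and sample latencies are illustrative values, not reported results.

```python
# Hypothetical per-response latencies in milliseconds (illustrative values only).
response_latencies_ms = [180, 240, 620, 310, 450, 1200, 290]
THRESHOLD_MS = 500  # a common target near the natural conversational gap

mean_latency = sum(response_latencies_ms) / len(response_latencies_ms)
under_threshold = sum(l < THRESHOLD_MS for l in response_latencies_ms) / len(response_latencies_ms)

print(f"mean latency: {mean_latency:.0f} ms")
print(f"responses under {THRESHOLD_MS} ms: {under_threshold:.0%}")
```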
6. Deployment Considerations and Real-World Impact
Full-duplex spoken dialogue systems have been adopted in real-world applications, including:
- Commercial deployments: Alibaba’s customer service integrated a full-duplex multimodal spoken dialogue system (SDS), reporting substantial latency reductions and operational robustness (Lin et al., 2022).
- Open-source toolkits and demos: Frameworks such as ESPnet-SDS (Arora et al., 11 Mar 2025) provide comparative evaluations of cascaded and end-to-end models, supporting community benchmarking and further research.
- Hardware and system integration: Preferred architectures are streaming and lightweight, supporting deployment in resource-constrained or real-time environments, with minimal reliance on complex middleware or multi-stage inference (Lin et al., 2022, Défossez et al., 17 Sep 2024, Hu et al., 21 May 2025).
7. Limitations, Challenges, and Future Directions
Despite rapid advances, several technical challenges persist:
- Data scarcity: High-quality, dual-channel, fully annotated open conversational datasets remain limited, especially for low-resource languages (Yang et al., 11 Nov 2024, Ohashi et al., 3 Jun 2025).
- Tokenization and alignment: Language-specific writing systems can lead to extreme sparsity in semantic token space (e.g., Japanese), complicating modeling at the token level.
- Latency-accuracy trade-offs: Smaller block sizes improve responsiveness but may degrade turn or context modeling (Défossez et al., 17 Sep 2024) [FlexDuo (Liao et al., 19 Feb 2025)].
- Robustness: Handling of overlapping speech, ambiguous interjections, and noisy background audio is not fully solved in all models, particularly in open-domain settings (Zhang et al., 23 Oct 2024, Wang et al., 1 Jun 2025).
- Modality scaling: Native support for more than three modalities (audio, vision, text, action) remains a frontier, with architectural and training innovations required for omnimodal, embodied interactive agents (Yao et al., 2 Jun 2025).
- Unified benchmarking: The lack of standardized, fine-grained metrics and evaluation suites has slowed comparative assessment; recent benchmark releases aim to resolve this (Lin et al., 6 Mar 2025).
Future research will likely prioritize more robust multimodal integration, reinforcement learning for state control, low-resource and multilingual adaptation, and real-world deployment studies bridging fine-grained evaluation with interactive system performance.
Table: Selected Open-Source Full-Duplex Models and Core Features
| Model/Framework | Core Methodology | Key Feature |
| --- | --- | --- |
| Moshi (Défossez et al., 17 Sep 2024) | Dual-stream, codec tokens | Joint modeling, real-time S2S |
| SALMONN-omni (Yu et al., 17 May 2025, Yu et al., 27 Nov 2024) | Codec-free, internal "thinking" | Single LLM, explicit state control |
| Freeze-Omni (Wang et al., 1 Nov 2024) | Frozen LLM, chunked I/O | Low-latency, avoids catastrophic forgetting |
| FlexDuo (Liao et al., 19 Feb 2025) | FSM w/ Idle, plug-in design | Decouples duplexity, noise filtering |
| RoboEgo (Yao et al., 2 Jun 2025) | Omnimodal, parallel streams | Vision/audio/text/actions, 80 ms latency |
| NTPP (Wang et al., 1 Jun 2025) | Next-token-pair, 2-channel | Speaker-agnostic joint modeling |
Full-duplex spoken dialogue models represent the emerging standard for natural, robust, and responsive conversational AI, spanning foundational principles in time-synchronous neural modeling, advanced state and signal control, and multimodal integration. Their ongoing advancement is tightly coupled to data availability, systemic evaluation, and practical deployment in dynamic, real-world settings.