FD-SLMs: Simultaneous Speech Interaction
- FD-SLMs are machine learning systems that enable continuous dialogue by concurrently processing and generating speech, mirroring human conversational dynamics.
- They handle overlapping speech, cancel echo, and manage fine-grained turn-taking, using either modular finite-state machines (FSMs) or end-to-end architectures for robust duplex control.
- FD-SLMs enhance human-computer interaction by integrating natural backchannels and dynamic state transitions, outperforming traditional half-duplex systems in responsiveness and fluidity.
Full-Duplex Speech LLMs (FD-SLMs) are machine learning systems designed to enable simultaneous, low-latency spoken dialogue—allowing a single model to both listen to and emit speech concurrently, thus mirroring key aspects of natural human conversational dynamics such as overlapping speech, backchannels, context-dependent barge-in, and robust echo handling. Through recent architectural and algorithmic advances, FD-SLMs have established themselves as a foundational paradigm for human-computer interaction, outperforming modular, half-duplex predecessors in responsiveness, naturalness, and fluidity (Yu et al., 17 May 2025, Yu et al., 27 Nov 2024, Chen et al., 18 Sep 2025).
1. Defining Full-Duplex Speech LLMs
FD-SLMs formalize dialogue as a continuous, synchronous mapping between a stream of incoming audio (user speech and environment) and an outgoing stream of generated audio (machine response). The system must concurrently estimate both $y_t$ and $x_t$ at each timestep $t$, where $y_t$ is the system output and $x_t$ is the environmental input (Yu et al., 17 May 2025). Unlike turn-based (half-duplex) systems, where agents defer speaking until a segmented utterance has been processed via ASR, FD-SLMs operate in a joint generative loop with tightly coupled perception and verbalization pipelines.
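Written out explicitly (the factorization below is a standard way to express such a causally conditioned model, not a formula quoted from a specific paper), the joint generative loop conditions each output frame on all input received so far and all output already emitted:

$$
p(y_{1:T} \mid x_{1:T}) \;=\; \prod_{t=1}^{T} p\!\left(y_t \,\middle|\, x_{\le t},\, y_{<t}\right)
$$

Half-duplex systems instead condition only on completed user utterances, which is precisely what forecloses overlap and barge-in.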
Key dynamics of FD-SLMs include:
- Overlap Handling: Emitting speech concurrently with input, allowing both interruption (barge-in) and conversational backchannels (“mm-hm”, “right”) (Cui et al., 10 Aug 2025, Chen et al., 18 Sep 2025).
- Echo Cancellation: Avoiding feedback artifacts by dynamically modeling and gating self-generated speech when emitted to the environment (Yu et al., 27 Nov 2024).
- Fine-grained Turn-Taking: Learning when to yield or resume the conversational floor based on semantic and prosodic context.
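A minimal per-frame loop combining these three dynamics might look as follows; this is an illustrative sketch, and every name here (model.step, cancel_echo, the action labels) is a hypothetical stand-in rather than an interface from any cited system:

```python
def duplex_loop(model, mic, speaker):
    """Illustrative full-duplex loop: one audio frame in, one decision out.

    All interfaces (mic, speaker, model.*) are hypothetical stand-ins.
    """
    state = model.initial_state()
    while True:
        in_frame = mic.read_frame()  # environmental input x_t
        # Echo cancellation: gate out the system's own just-played audio
        # so it is not re-ingested as user speech.
        in_frame = model.cancel_echo(in_frame, speaker.last_played_frame())
        # The model jointly updates perception and decides what to emit.
        state, action, out_frame = model.step(state, in_frame)
        if action in ("speak", "backchannel"):  # overlap is permitted
            speaker.play(out_frame)             # system output y_t
        else:                                   # "listen": yield the floor
            speaker.play_silence()
```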
2. Architectures and Synchronization Strategies
2.1 Engineered Synchronization (Modular Architectures)
Modular approaches rely on explicit duplex control modules (finite-state machines or external controllers) that arbitrate “speak”, “listen”, or “idle” states and explicitly gate the LLM’s generative process. Example realizations: FlexDuo, a plug-in controller decoupled from the LLM, uses semantic-integrity buffering and sliding-window state machines for filtering and interruption handling (Liao et al., 19 Feb 2025); Freeze-Omni and VITA-1.5 embed FSM- or voice activity detection (VAD)-driven arbitration as external mediators (Chen et al., 18 Sep 2025).
Table 1. Modular Synchronization Approaches
| Model | Arbitration | Data Flow |
|---|---|---|
| FlexDuo | 3-state FSM | ASR/NLU→FSM→LLM→TTS |
| Freeze-Omni | Internal tokens | LLM (with Speak/Listen tokens) |
| VITA-1.5 | External FSM | FSM over two LLMs & VAD |
FlexDuo and VITA-1.5 exhibit predictable but sometimes delayed switching; their explicit structure is easily extensible but incurs latency and can propagate errors from imperfect perception modules (Liao et al., 19 Feb 2025, Chen et al., 18 Sep 2025).
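To make the arbitration pattern concrete, here is a toy three-state controller in the spirit of FlexDuo's plug-in FSM; the transition rules and signal names are illustrative assumptions, not the published design:

```python
from enum import Enum, auto

class Duplex(Enum):
    IDLE = auto()
    LISTEN = auto()
    SPEAK = auto()

def fsm_step(state: Duplex, user_speaking: bool,
             agent_done: bool, semantically_complete: bool) -> Duplex:
    """Toy duplex arbitration step (rules are illustrative)."""
    if state is Duplex.IDLE:
        # VAD fires: start buffering the user's speech.
        return Duplex.LISTEN if user_speaking else Duplex.IDLE
    if state is Duplex.LISTEN:
        # Semantic integrity buffering: only hand the turn to the LLM
        # once the buffered speech forms a complete unit.
        if not user_speaking and semantically_complete:
            return Duplex.SPEAK
        return Duplex.LISTEN
    # state is SPEAK: a barge-in returns control to LISTEN;
    # otherwise keep speaking until the utterance finishes.
    if user_speaking:
        return Duplex.LISTEN
    return Duplex.IDLE if agent_done else Duplex.SPEAK
```

The appeal of this structure is inspectability: every transition is an auditable rule. The cost, as noted above, is that mis-timed VAD or integrity signals propagate directly into mis-timed turns.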
2.2 Learned Synchronization (End-to-End FD-SLMs)
End-to-end FD-SLMs internalize synchrony, jointly modeling user and agent streams via a single autoregressive backbone. Canonical instantiations:
- Codecs-in-Token-Space: Moshi, SyncLLM, and OmniFlatten discretize audio via neural codecs or self-supervised speech units (e.g., VQ-VAE codecs, HuBERT units) and interleave the resulting quantized tokens in an LLM (Chen et al., 18 Sep 2025, Zhang et al., 23 Oct 2024), but this introduces significant modality gaps and re-training burdens.
- Codec-Free Embeddings: SALMONN-omni discards audio tokenization, instead using continuous log-Mel/embedding streams with cross-modal attention and a learned “thinking” mechanism for state transition (Yu et al., 17 May 2025, Yu et al., 27 Nov 2024).
- Next-Token-Pair Prediction: FLEXI and related work propose architectures in which each step produces both the next output token and a control signal (e.g., “continue”, “yield”) via a single Transformer head, yielding lower latency and more precise arbitration (Ge et al., 26 Sep 2025).
End-to-end FD-SLMs acquire conversational behaviors (overlap, barge-in, backchanneling) as emergent properties, but require careful handling of temporal alignment and often massive multi-modal datasets for effective training (Veluri et al., 23 Sep 2024, Chen et al., 18 Sep 2025).
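A minimal sketch of the next-token-pair idea follows: one shared hidden state feeds paired output heads, one over the speech-token vocabulary and one over a small control inventory. The layer sizes, the three-way control set, and the use of two linear projections are assumptions for illustration; FLEXI's exact parameterization may differ:

```python
import torch
import torch.nn as nn

class NextTokenPairHead(nn.Module):
    """Per-step prediction of (speech token, control signal) from one
    decoder hidden state. Sizes and the control inventory are assumed."""
    def __init__(self, d_model: int, vocab_size: int, n_controls: int = 3):
        super().__init__()
        self.token_head = nn.Linear(d_model, vocab_size)
        # e.g., 0 = continue, 1 = yield the floor, 2 = backchannel
        self.control_head = nn.Linear(d_model, n_controls)

    def forward(self, h_t: torch.Tensor):
        # h_t: (batch, d_model) hidden state at the current step
        return self.token_head(h_t), self.control_head(h_t)

head = NextTokenPairHead(d_model=1024, vocab_size=2048)
token_logits, control_logits = head(torch.randn(2, 1024))
```

Because arbitration is decided jointly with generation at every step, no separate controller round-trip is needed, which is where the latency advantage comes from.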
3. Training, State Transition, and Dynamic Control
3.1 Dynamic State Selection
Modern FD-SLMs such as SALMONN-omni deploy explicit state transitions as special “thinking” tokens (⟨think⟩, ⟨shift⟩, ⟨start_speak⟩, ⟨end_speak⟩) within their token streams. The probability of transitioning among the speak, listen, and think states is estimated via a learned Bernoulli distribution (a sigmoid over the LLM’s hidden state and acoustic embeddings). Supervision is applied via cross-entropy loss over both ordinary and state tokens (Yu et al., 17 May 2025, Yu et al., 27 Nov 2024).
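A minimal sketch of such a transition head, assuming a simple concatenate-and-project fusion of the LLM hidden state and the acoustic embedding (the fusion scheme and dimensions are assumptions, not SALMONN-omni's published design):

```python
import torch
import torch.nn as nn

class TransitionHead(nn.Module):
    """Bernoulli state-transition probability from fused hidden states.

    The concat-then-linear fusion and the dimensions are illustrative."""
    def __init__(self, d_hidden: int, d_audio: int):
        super().__init__()
        self.proj = nn.Linear(d_hidden + d_audio, 1)

    def forward(self, h_t: torch.Tensor, a_t: torch.Tensor) -> torch.Tensor:
        # h_t: (batch, d_hidden) LLM hidden state
        # a_t: (batch, d_audio) acoustic embedding at the same step
        logit = self.proj(torch.cat([h_t, a_t], dim=-1)).squeeze(-1)
        return torch.sigmoid(logit)  # P(transition) at this step

head = TransitionHead(d_hidden=4096, d_audio=1024)
p = head(torch.randn(2, 4096), torch.randn(2, 1024))
# Binary cross-entropy against annotated transition points, alongside the
# ordinary token-level cross-entropy over text and state tokens.
loss = nn.functional.binary_cross_entropy(p, torch.tensor([1.0, 0.0]))
```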
3.2 Control Tokenization and FSM Integration
Both engineered and hybrid models introduce control tokens (e.g., [S.SPEAK], [C.LISTEN]) into the LLM vocabulary or couple the LLM to an FSM, sometimes informed through prompt engineering, instruction-tuning, or supervised annotation of dialogue states (Wang et al., 29 May 2024, Liao et al., 19 Feb 2025).
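Mechanically, adding such tokens to an existing LLM is straightforward; a sketch using the Hugging Face transformers API follows (the token names mirror those above, while the base model choice is purely illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "meta-llama/Llama-2-7b-hf"  # illustrative base model
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Register the duplex control tokens so they tokenize atomically.
tok.add_special_tokens(
    {"additional_special_tokens": ["[S.SPEAK]", "[C.LISTEN]"]}
)
# Grow the embedding matrix; the new rows are then trained via
# instruction-tuning or supervised dialogue-state annotation.
model.resize_token_embeddings(len(tok))
```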
3.3 Reinforcement Learning for Turn Management
FD-SLMs applying reinforcement learning techniques (e.g., Direct Preference Optimization) further refine the timing of barge-in and backchannel behavior by maximizing reward for desirable real-time interruption handling (Yu et al., 17 May 2025).
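Concretely, DPO optimizes the standard pairwise preference objective; applied to turn management, a natural (assumed) construction pairs a well-timed response $y_w$ against a mistimed one $y_l$ for the same dialogue context $x$:

$$
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

where $\pi_\theta$ is the policy being tuned, $\pi_{\mathrm{ref}}$ is a frozen reference model, and $\beta$ controls deviation from the reference.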
4. Benchmarks and Evaluation
Recent work has established rigorous evaluation frameworks covering temporal, behavioral, semantic, and acoustic performance:
- Temporal Dynamics: Metrics include response latency (aiming for <200 ms), overlap ratio, and first-token offset (FTO) (Chen et al., 18 Sep 2025, Ge et al., 26 Sep 2025).
- Behavioral Arbitration: Assesses interruption response delay (IRD), barge-in success rate, and word error rate (WER) during arbitration (Lin et al., 30 Jul 2025, Ge et al., 26 Sep 2025).
- Semantic Coherence: Perplexity (PPL) and GPT-generated content scores capture the model’s ability to maintain meaningful dialogue (Chen et al., 18 Sep 2025, Zhang et al., 23 Oct 2024).
- Acoustic Quality: Human (MOS) and automatic (UTMOSv2) ratings measure perceived naturalness and intelligibility (Lin et al., 30 Jul 2025).
- Benchmark Suites: Full-Duplex-Bench v1.5/v2, MTR-DuplexBench, and FLEXI test overlap handling, multi-turn coherence, barge-in, safety, and turn-taking across both open-source and commercial systems (Zhang et al., 13 Nov 2025, Lin et al., 30 Jul 2025, Lin et al., 9 Oct 2025, Ge et al., 26 Sep 2025).
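The timing metrics reduce to simple differences over event timestamps. The definitions below follow the common usage described above, though exact conventions vary across benchmarks:

```python
def first_token_offset(user_turn_end_s: float, agent_first_audio_s: float) -> float:
    """FTO: gap between the end of the user's turn and the agent's first
    emitted audio, in seconds. Negative values indicate overlap."""
    return agent_first_audio_s - user_turn_end_s

def interruption_response_delay(barge_in_onset_s: float, agent_yield_s: float) -> float:
    """IRD: how long the agent keeps speaking after a user barge-in begins."""
    return agent_yield_s - barge_in_onset_s
```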
Table 2. Representative Full-Duplex Model Performance
| Model | FTO (s) | Barge-in F1 | MOS | Unique Features |
|---|---|---|---|---|
| SALMONN-omni | 0.38 | 0.88–0.93 | 3.85 | Standalone, codec-free, thinking tokens |
| Moshi | 2.22 | 0.80 | 3.90 | Codec injection, high data requirement |
| FlexDuo | — | — | — | Modular FSM with semantic buffering |
| Freeze-Omni | — | 0.68 | — | VAD-driven, two LLM processes |
SALMONN-omni and SyncLLM consistently deliver lower latency (FTO < 0.4 s), higher barge-in/backchannel F1, and competitive MOS using substantially less training data than prior systems (Yu et al., 17 May 2025, Veluri et al., 23 Sep 2024, Yu et al., 27 Nov 2024).
5. Empirical Findings, Limitations, and Comparative Analysis
Evaluations highlight common patterns:
- Error Cascade in Modular Systems: VAD and semantic-integrity mismatches propagate interruption errors and context pollution (Liao et al., 19 Feb 2025, Lin et al., 9 Oct 2025).
- Modality Gaps in Codec-Injection: Tokenizing audio as discrete codes requires large-scale speech-text alignment and can degrade intrinsic language ability (Yu et al., 17 May 2025, Chen et al., 18 Sep 2025).
- Responsiveness vs. Robustness Tradeoffs: Repair-first agents yield quickly in overlap settings but risk spurious interruption by noise or backchannels; continuity-first agents prioritize flow, risking delayed handover (Lin et al., 30 Jul 2025).
Benchmarks such as Full-Duplex-Bench and FLEXI reveal persistent weaknesses in multi-turn consistency, context drift, and handling of emergency or ambiguous overlap scenarios; even state-of-the-art systems lag behind human-level reactivity and semantic precision, especially under multi-round, noisy, or adversarial conditions (Zhang et al., 13 Nov 2025, Ge et al., 26 Sep 2025, Lin et al., 9 Oct 2025).
6. Open Challenges and Future Directions
Open research fronts include:
- Synchronous Data Scarcity: There remains a lack of large, real human-human full-duplex speech corpora with annotated overlaps, interruptions, and nuanced timing (Chen et al., 18 Sep 2025). Synthetic pipelines (TTS-driven, adversarially generated) are used to supplement these gaps but may not fully capture natural entrainment and topic flow.
- Multi-party and Multimodal Extension: Existing FD-SLMs mostly target dyadic (two-speaker) English conversation; scaling to multi-speaker (diarization), multi-modal (gestural, visual cues), and cross-lingual settings is largely unaddressed (Yu et al., 17 May 2025, Cui et al., 10 Aug 2025).
- Hierarchical and Adaptive Modeling: Prosody-aware objectives, explicit emotion and floor-control modeling, hierarchical reinforcement learning for complex goal management, and adaptive latency tuning present promising avenues for enhancing alignment with human discourse (Yu et al., 17 May 2025, Cui et al., 10 Aug 2025, Yu et al., 27 Nov 2024).
- Benchmark and Protocol Standardization: Continued development of open, streaming benchmarks, standardized task sets, and low-latency evaluation protocols (e.g., FDB v2, MTR-DuplexBench, FLEXI) is essential for reproducibility and fair comparison (Lin et al., 9 Oct 2025, Zhang et al., 13 Nov 2025, Ge et al., 26 Sep 2025).
7. Representative and Milestone Models
A non-exhaustive list of significant FD-SLMs and their core contributions:
- SALMONN-omni: First codec-free, standalone model with dynamic thinking state, achieves 30%+ relative improvement on dialogue and barge-in metrics over prior art (Yu et al., 17 May 2025, Yu et al., 27 Nov 2024).
- FlexDuo: Modular plug-in controller achieving ~25% reduction in false interruptions; demonstrates that explicit buffering and semantic integrity control can retrofit half-duplex models (Liao et al., 19 Feb 2025).
- OmniFlatten/FLM-Audio: Native full-duplex via "flattened" token streams or natural monologue dual-training; demonstrates lower preprocessing cost and strong language fidelity (Zhang et al., 23 Oct 2024, Yao et al., 2 Sep 2025).
- LSLM: Listening-While-Speaking fusion strategies (middle fusion) preserving TTS fidelity under simultaneous streaming inputs (Ma et al., 5 Aug 2024).
- SyncLLM (Synchronous LLM): Explicit time embeddings plus a scheduler produce human-like turn-taking, overlap, and backchannel patterns with minimal increases in generation latency (Veluri et al., 23 Sep 2024).
In sum, FD-SLMs represent a convergence of advanced speech and language modeling toward truly synchronous, contextually aware, fluid spoken interaction, with current research emphasizing the hybridization of modular and end-to-end paradigms, robust low-latency control, and holistic, multi-dimensional evaluation (Yu et al., 17 May 2025, Chen et al., 18 Sep 2025, Yu et al., 27 Nov 2024).