Full-Duplex Interaction Track
- The full-duplex interaction track is a standardized framework that defines evaluation protocols for spoken dialogue systems (SDS) that process incoming and outgoing speech simultaneously, including natural overlaps and backchannels.
- It leverages dual-channel datasets and precise metrics such as interruption detection rate, overlap handling accuracy, and latency measures to benchmark system performance.
- The track drives innovation by integrating open-source corpora, rigorous benchmark suites, and diverse synchronization strategies for natural, low-latency human-AI interactions.
The full-duplex interaction track encompasses standardized evaluation, modeling, and benchmarking protocols and datasets dedicated to the rigorous assessment and advancement of full-duplex spoken dialogue systems (SDS), in which the system listens and speaks simultaneously and natural overlaps, interruptions, and backchannels are the norm. The emergence of dedicated "interaction tracks" has been catalyzed by a wave of open-source datasets, new front-end architectures, and specialized benchmarks tailored for this regime. This article synthesizes the core definitions, historical context, evaluation methodologies, system architectures, benchmark results, and open challenges that define the Full-Duplex Interaction Track.
1. Concept and Motivation
In full-duplex interaction, a dialogue system must support fully concurrent input and output speech processing, enabling both parties to communicate simultaneously—mirroring natural human conversational behavior. This means that user and system may talk over each other (overlap), interrupt, or inject short feedback tokens (backchannels like "mm-hm") with minimal latency and no enforced alternation of turns. Traditional half-duplex (turn-based) architectures—waiting for explicit end-of-speech cues before formulating a response—exhibit high latency, fail to robustly handle user interruptions or backchannels, and degrade conversational naturalness. Full-duplex interaction thus represents a critical milestone for realistic human-AI spoken interaction (Wang et al., 23 Apr 2026, Chen et al., 18 Sep 2025, Lin et al., 30 Jul 2025).
A full-duplex spoken dialogue system must continuously analyze incoming streams (ASR, prosody, intent detection) while also actively generating output (TTS), dynamically negotiating turn-yielding, interruption, and overlap management in real time.
2. Datasets and Corpus Design
Evaluating and training full-duplex systems requires datasets that capture overlapping, natural, multi-turn dialogues with explicit time-aligned, dual-channel recordings.
A. Dual-Channel Human Dialogues
The HumDial-FDBench corpus (Wang et al., 23 Apr 2026) exemplifies the standard: more than 100 hours of two-channel bilingual recordings, with each interlocutor on a separate microphone to preserve true overlaps, interruptions, and rich prosodic cues. Annotations span speech activity boundaries, ASR transcripts (Paraformer, Parakeet-TDT), event labels (interrupt, overlap, backchannel), and turn-negotiation outcomes.
B. Annotated Open-Source Datasets
The dual-track open-source corpora of Zhou et al. (4 Sep 2025) provide isolated-microphone conversational English and Chinese with precise, overlap-rich, time-aligned transcripts and paralinguistic annotations (laughter, backchannel, filled pause), enabling objective modeling and evaluation of overlap precision and recall.
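As an illustration of what time-aligned, per-speaker annotation makes possible, the following minimal sketch derives overlap intervals and an overlap ratio from two lists of speech-activity segments. The segment representation and the function names `overlap_intervals` and `overlap_ratio` are assumptions made for this example; they are not taken from any of the cited corpora or toolkits.

```python
from typing import List, Tuple

# Hypothetical representation: one (start_s, end_s) tuple per speech-activity span.
Segment = Tuple[float, float]


def overlap_intervals(chan_a: List[Segment], chan_b: List[Segment]) -> List[Segment]:
    """Return the spans in which both channels are active.

    Assumes each list is sorted by start time and that segments within
    one channel do not overlap each other.
    """
    out: List[Segment] = []
    i = j = 0
    while i < len(chan_a) and j < len(chan_b):
        start = max(chan_a[i][0], chan_b[j][0])
        end = min(chan_a[i][1], chan_b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever segment finishes first.
        if chan_a[i][1] <= chan_b[j][1]:
            i += 1
        else:
            j += 1
    return out


def overlap_ratio(chan_a: List[Segment], chan_b: List[Segment]) -> float:
    """Overlapped time divided by the total time in which anyone speaks."""
    overlapped = sum(e - s for s, e in overlap_intervals(chan_a, chan_b))
    spoken = sum(e - s for s, e in chan_a) + sum(e - s for s, e in chan_b) - overlapped
    return overlapped / spoken if spoken > 0 else 0.0


user = [(0.0, 2.0), (3.0, 5.0)]
system = [(1.5, 3.5), (4.5, 6.0)]
print(overlap_intervals(user, system))
print(overlap_ratio(user, system))
```

With single-channel (mixed) recordings the per-speaker segments would first have to be estimated by diarization, which is one reason separate microphones are preferred for overlap-rich data.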
C. Synthetic and Augmented Data
Where human-collected corpora are scarce, synthetic pipelines (TTS-rendered, LLM-generated scripts) create dialogue with controlled interruption/overlap scenarios and staged difficulty to augment training and evaluation (Peng et al., 25 Jul 2025, Lin et al., 30 Jul 2025). Data augmentation via TTS, noise-mixing, and injection of silence or overlap intervals is now standard.
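Controlled overlap injection can be sketched in a few lines. The example below is a simplified illustration rather than the pipeline of any cited work: it treats audio as plain lists of float samples, keeps the two speakers on separate channels, and starts the user utterance a chosen number of samples before the system utterance ends. The names `make_overlap_pair` and `staged_overlaps` and the difficulty fractions are hypothetical.

```python
from typing import Iterator, List, Sequence, Tuple


def make_overlap_pair(
    system_utt: List[float], user_utt: List[float], overlap_samples: int
) -> Tuple[List[float], List[float]]:
    """Build a dual-channel clip in which the user starts speaking
    `overlap_samples` before the system utterance ends."""
    if not 0 <= overlap_samples <= len(system_utt):
        raise ValueError("overlap_samples must lie within the system utterance")
    user_start = len(system_utt) - overlap_samples
    total = max(len(system_utt), user_start + len(user_utt))
    # Pad both channels with silence so they have equal length.
    system_channel = system_utt + [0.0] * (total - len(system_utt))
    user_channel = [0.0] * user_start + user_utt
    user_channel += [0.0] * (total - len(user_channel))
    return system_channel, user_channel


def staged_overlaps(
    system_utt: List[float],
    user_utt: List[float],
    fractions: Sequence[float] = (0.0, 0.25, 0.5),
) -> Iterator[Tuple[float, Tuple[List[float], List[float]]]]:
    """Yield one clip per overlap fraction of the system utterance,
    from no overlap to heavier overlap."""
    for f in fractions:
        yield f, make_overlap_pair(system_utt, user_utt, int(f * len(system_utt)))
```

A real pipeline would operate on waveform arrays and add noise mixing; the list-based version only shows how the overlap position is controlled.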
3. Benchmark Frameworks and Metrics
Formal benchmarks operationalize the Full-Duplex Interaction Track, providing unified pipelines, scenario definitions, and rigorous metric suites.
A. Full-Duplex-Bench Family
- FD-Bench (v1.5) (Lin et al., 30 Jul 2025, Peng et al., 25 Jul 2025): Modular automation with simulated user interruption, backchannel, side conversation, and background speech; streaming user audio to model APIs; objective and subjective scoring.
- Full-Duplex-Bench-v2 (FDB-v2) (Lin et al., 9 Oct 2025): Multi-turn scenarios (daily tasks, correction, entity tracking, safety) with an automated examiner LLM orchestrating staged semantic goal progression and real-time interruptions.
B. HumDial-FDBench (Wang et al., 23 Apr 2026):
Built on dual-channel natural conversation, it measures, for each scenario:
- Interruption Detection Rate (IDR): Proportion of user interruptions correctly recognized.
- Overlap Handling Accuracy (OHA): Correct system action during overlapping speech.
- Turn-Taking F1: A single F1 score over positive turn-taking decisions (interrupt/resume).
- Latency metrics: Stop latency (time to halt after interruption), response latency, first-response time post-interruption.
- Naturalness/User satisfaction: MOS scales.
- Dialogue Completion Accuracy: End-task goal attainment.
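Two of the metrics above can be made concrete with a short sketch. It assumes a simplified trial record containing only the interruption onset and the time at which system output halted, and it counts an interruption as detected whenever the system stops at or after the onset. The benchmark's actual scoring rules may differ (for example, by imposing a maximum latency), and the names `InterruptionTrial`, `interruption_detection_rate`, and `median_stop_latency` are hypothetical.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Optional


@dataclass
class InterruptionTrial:
    interrupt_onset_s: float        # when the user begins interrupting
    system_stop_s: Optional[float]  # when system speech halted; None if it never stopped


def _detected(trial: InterruptionTrial) -> bool:
    # Assumption: any stop at or after the onset counts as a detection.
    return trial.system_stop_s is not None and trial.system_stop_s >= trial.interrupt_onset_s


def interruption_detection_rate(trials: List[InterruptionTrial]) -> float:
    """Fraction of user interruptions after which the system stopped speaking."""
    if not trials:
        return 0.0
    return sum(_detected(t) for t in trials) / len(trials)


def median_stop_latency(trials: List[InterruptionTrial]) -> Optional[float]:
    """Median time from interruption onset to system stop, over detected trials."""
    latencies = [t.system_stop_s - t.interrupt_onset_s for t in trials if _detected(t)]
    return median(latencies) if latencies else None


trials = [
    InterruptionTrial(1.0, 1.2),
    InterruptionTrial(5.0, None),
    InterruptionTrial(8.0, 8.5),
    InterruptionTrial(12.0, 12.3),
]
print(interruption_detection_rate(trials), median_stop_latency(trials))
```

Reporting the median rather than the mean keeps a few very slow stops from dominating the latency figure.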
C. Overlap, Timing, Prosody, and Quality Metrics (Lin et al., 30 Jul 2025):
- Behavioral: respond/resume/uncertain/unknown/silent labels per post-overlap trial.
- Timing: median stop/response latency.
- Prosodic adaptation: per-segment pitch, rate, intensity shift following overlap.
- Speech quality: predicted MOS (e.g., UTMOSv2).
D. Unified Metric Suite (Chen et al., 18 Sep 2025):
A four-pillar structure: Temporal Dynamics (latency, overlap ratio), Behavioral Arbitration (interruption response, ISR), Semantic Coherence, and Acoustic Performance.
Standardized benchmark configuration files (e.g., YAML/JSON scenario, streaming protocol) and adapter interfaces facilitate system extension and comparison.
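The shape of such a setup can be sketched as follows. Every field name in the scenario, the `DuplexModelAdapter` interface, and the toy `EchoSilenceAdapter` are hypothetical illustrations of the general pattern, not the schema or API of FD-Bench, Full-Duplex-Bench, or HumDial-FDBench. The sketch assumes 16-bit mono audio streamed in fixed-size chunks.

```python
import json
from typing import Iterable, Iterator, Protocol

# Hypothetical scenario file; the field names are illustrative only.
SCENARIO_JSON = """
{
  "name": "user_interruption_basic",
  "sample_rate_hz": 16000,
  "chunk_ms": 80,
  "events": [
    {"type": "user_speech", "start_s": 0.0, "end_s": 2.0},
    {"type": "user_interruption", "start_s": 4.5, "end_s": 5.5}
  ],
  "metrics": ["stop_latency", "response_latency"]
}
"""


class DuplexModelAdapter(Protocol):
    """Interface a system under test would implement (assumed, not standardized)."""

    def stream(self, user_chunks: Iterable[bytes]) -> Iterator[bytes]: ...


class EchoSilenceAdapter:
    """Toy adapter: answers every input chunk with silence of equal length."""

    def stream(self, user_chunks: Iterable[bytes]) -> Iterator[bytes]:
        for chunk in user_chunks:
            yield bytes(len(chunk))


def chunk_count(scenario: dict) -> int:
    """Number of whole chunks needed to cover the last scripted event."""
    duration_s = max(event["end_s"] for event in scenario["events"])
    return int(duration_s * 1000 // scenario["chunk_ms"])


def run_scenario(adapter: DuplexModelAdapter, scenario: dict) -> int:
    """Stream placeholder (silent) user audio and return the output byte count."""
    bytes_per_chunk = scenario["sample_rate_hz"] * scenario["chunk_ms"] // 1000 * 2
    user_chunks = (bytes(bytes_per_chunk) for _ in range(chunk_count(scenario)))
    return sum(len(out) for out in adapter.stream(user_chunks))


scenario = json.loads(SCENARIO_JSON)
print(chunk_count(scenario), run_scenario(EchoSilenceAdapter(), scenario))
```

Keeping the scenario declarative and the model behind a narrow streaming interface is what lets one harness compare systems with very different internals.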
4. System Architectures and Synchronization Strategies
The Full-Duplex Interaction Track encompasses diverse system designs, which fall broadly into two categories (Chen et al., 18 Sep 2025, Yang et al., 10 Mar 2026):
A. Engineered Synchronization (Modular)
- Separately engineered turn-taking control, interrupt/barge-in detectors, and TTS/ASR subsystems.
- External VADs and state machines arbitrate speech exchange.
- Examples: FlexDuo, VITA-1.5 (stable latency, lower semantic coherence).
- Standard pipeline: streaming audio → ASR/VAD → Duplex FSM → LLM + streaming TTS.
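The arbitration role of the duplex state machine in this pipeline can be illustrated with a minimal sketch driven by per-frame VAD decisions. It is a toy example under stated assumptions, not the controller of FlexDuo or VITA-1.5. It has two states, treats user speech shorter than `barge_in_frames` as a backchannel that does not stop the system, and does not model the system finishing its own utterance. The class name and both thresholds are hypothetical.

```python
from enum import Enum


class State(Enum):
    LISTENING = "listening"
    SPEAKING = "speaking"


class DuplexFSM:
    """Toy turn-taking controller fed one VAD decision per audio frame."""

    def __init__(self, barge_in_frames: int = 3, end_of_turn_frames: int = 5):
        self.state = State.LISTENING
        self.barge_in_frames = barge_in_frames        # speech run that counts as an interruption
        self.end_of_turn_frames = end_of_turn_frames  # silence run that ends the user turn
        self._speech_run = 0
        self._silence_run = 0
        self._heard_user = False

    def step(self, user_speaking: bool) -> State:
        # Track consecutive speech and silence frames.
        if user_speaking:
            self._speech_run += 1
            self._silence_run = 0
        else:
            self._silence_run += 1
            self._speech_run = 0

        if self.state is State.SPEAKING:
            # Sustained user speech is a barge-in: yield the floor.
            if self._speech_run >= self.barge_in_frames:
                self.state = State.LISTENING
                self._heard_user = True
        else:
            if user_speaking:
                self._heard_user = True
            elif self._heard_user and self._silence_run >= self.end_of_turn_frames:
                # The user spoke and then fell silent long enough: take the turn.
                self.state = State.SPEAKING
                self._heard_user = False
        return self.state


fsm = DuplexFSM(barge_in_frames=3, end_of_turn_frames=2)
frames = [True, True, False, False, True, False, True, True, True]
print([fsm.step(f).value for f in frames])
```

In a full system, the transition to LISTENING would also cancel streaming TTS, and the transition to SPEAKING would trigger LLM generation.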
B. Learned Synchronization (End-to-End)
- Unified model that jointly encodes incoming audio and decodes speech tokens in streaming mode, learning synchrony by modeling the joint probability of input and output sequences.
- Techniques: Next Token-Pair Prediction (NTPP) (Chen et al., 18 Sep 2025, Ge et al., 26 Sep 2025), chunk-level synchronous prediction (Veluri et al., 2024), token or time-aligned multi-stream autoregression (Li et al., 21 Apr 2026, Yao et al., 2 Sep 2025).
- Examples: Moshi, dGSLM, SyncLLM, FLM-Audio, UAF, DuplexCascade, SoulX-Duplug.
- Synchronization strategies range from micro-turn chunking (Yang et al., 10 Mar 2026) to unit-based gating (Yu et al., 28 Jan 2026) and full autoregressive front-end LLMs (Li et al., 21 Apr 2026).
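The idea of modeling the joint probability of the input and output streams can be written as a generic time-aligned factorization. The notation here is an assumption made for illustration: $x_t$ denotes the user-channel token and $y_t$ the system-channel token at step $t$. The exact formulation in the cited works may differ.

```latex
p\bigl(x_{1:T},\, y_{1:T}\bigr)
  \;=\; \prod_{t=1}^{T} p\bigl(x_t,\, y_t \mid x_{<t},\, y_{<t}\bigr)
```

Under this reading, a token-pair predictor is trained on both channels jointly. At inference time the user tokens are observed rather than sampled, so the model only needs to generate $y_t$ given the history of both streams.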
C. Hybrid and Semi-Cascaded Designs
Semi-cascaded frameworks (e.g., unit-based designs and plug-and-play state-prediction modules (Yu et al., 28 Jan 2026, Yan et al., 16 Mar 2026)) balance low latency and robust interruption handling while retaining modularity and adaptability with off-the-shelf VAD/ASR/TTS/LLM components.
5. Empirical Results and Failure Modes
A. Leaderboard Results
The ICASSP 2026 HumDial Challenge (Wang et al., 23 Apr 2026) ranks open and closed systems on interruption, rejection, and latency. Cookie_asr (modular, acoustic classifier + LLM) achieved the best balance (Int. 79.3 %, Rej. 72.2 %, Delay 1.26 s), while Badcat (unit-based, track-2) reached 89.7 % interruption and the lowest latency among end-to-end models. Moshi and Freeze-Omni lagged, with slow responses and low rejection accuracy.
B. Benchmark Insights
- Two dominant strategies: repair-first (rapid yield on interruption; e.g., Freeze-Omni, GPT-4o) vs. continuity-first (preserve output flow; e.g., Gemini, Sonic).
- Repair-first achieves low stop latency (0.18–0.23 s) at the cost of possibly over-triggering; continuity-first models sustain flow but risk ignoring critical user signals.
- All agents are 5–6× slower than the ∼200 ms human norm in re-entry after interruption, i.e., roughly 1.0–1.2 s.
- Overlap management remains challenging: models misclassify third-party/ambient speech, inconsistently respond to backchannels, or show degraded turn-tracking under noise (Lin et al., 30 Jul 2025, Ge et al., 26 Sep 2025).
C. Modeling and Robustness
Unified audio front-end LLMs (e.g., UAF (Li et al., 21 Apr 2026)) and plug-and-play state predictors (SoulX-Duplug (Yan et al., 16 Mar 2026)) achieve barge-in/interrupt detection at 220 ms average latency, with 99.2 % real-world interruption success and superior robustness to SNR degradation and cross-talk compared to cascaded VAD+ASR+LLM pipelines.
D. Dataset-Driven Gains
Dual-channel corpora (Zhou et al., 4 Sep 2025) and staged chunk-wise micro-turn pipelines (Yang et al., 10 Mar 2026) consistently improve naturalness, overlap robustness, and subjective MOS in real-world or VoiceBench/Full-Duplex-Bench evaluation.
6. Research Challenges and Future Directions
A. Data Scarcity and Annotation
There remains a dearth of large-scale, naturally overlapped, multilingual, and richly annotated corpora. Manual annotation remains resource-intensive, especially for semantic overlap events (Chen et al., 18 Sep 2025, Wang et al., 23 Apr 2026). Synthetic augmentation (e.g., TTS-based corpus expansion) only partially addresses realistic turn dynamics and prosody.
B. Evaluation and Metric Standardization
Unifying disparate metrics across benchmarks remains ongoing work: adoption of cross-task suites (FD-Bench, Full-Duplex-Bench-v2, Full-Duplex-Bench v1.5) is encouraged (Lin et al., 9 Oct 2025, Lin et al., 30 Jul 2025), but further consensus is needed on evaluation priorities (low latency vs. semantic coherence).
C. Architectural Divergence and Hybridization
There is no consensus on whether modular (engineered) or end-to-end (learned) system architectures are preferable. Hybrid approaches integrating explicit predictive synchronization (NTPP, unit-based gating) with lightweight modular arbitration show promise (Chen et al., 18 Sep 2025, Yu et al., 28 Jan 2026, Yang et al., 10 Mar 2026).
D. Handling Noise, Multi-Party, and Multimodal Contexts
Robustness to noise, multiple speakers, ambiguous signals, and complex conversational intent remains underdeveloped (Peng et al., 25 Jul 2025, Wang et al., 23 Apr 2026). Future directions emphasize multimodal integration (vision, gaze), adaptive context modeling, and dynamic thresholding for turn negotiation.
E. Extensibility and Community Resources
Benchmarks now support plug-in adapters for new models and tasks, public leaderboards, and extensibility to new scenarios via scenario files and orchestration logic (Lin et al., 9 Oct 2025, Wang et al., 23 Apr 2026). Open-sourcing of code, data, and evaluation protocols is fostering reproducibility and cross-lab comparison.
7. Impact and Outlook
The Full-Duplex Interaction Track has accelerated the transition from rigid turn-based SDS toward highly responsive, resilient, and natural spoken dialogue agents. Through standardized datasets, robust evaluation pipelines, and rigorous benchmarking, it is now possible to quantitatively rank, analyze, and compare state-of-the-art full-duplex systems across real-world interruption, overlap, backchannel, and dynamic turn-taking scenarios.
As research advances—integrating self-supervised learning, multimodal signals, reinforcement-based adaptation, and beyond—future full-duplex tracks are projected to set new standards, galvanizing progress toward truly human-like open-domain conversational AI systems (Wang et al., 23 Apr 2026, Chen et al., 18 Sep 2025, Lin et al., 9 Oct 2025, Lin et al., 30 Jul 2025, Yan et al., 16 Mar 2026, Li et al., 21 Apr 2026, Yang et al., 10 Mar 2026).