FlexDuo: Modular Full-Duplex Dialogue System
- FlexDuo is a modular full-duplex control system that decouples control logic from half-duplex systems to enable real-time bidirectional speech dialogue.
- It implements a three-state finite-state machine—Speak, Listen, and Idle—to manage turn-taking and filter contextual noise effectively.
- The plug-and-play design integrates turn-based LLMs without retraining, significantly reducing false interruptions and improving dialogue coherence.
FlexDuo is a modular full-duplex control system engineered to enable real-time, bidirectional speech dialogue on top of existing half-duplex spoken dialogue system (SDS) architectures. It addresses core limitations of monolithic, tightly coupled full-duplex SDS designs—specifically, their lack of modularity and inability to filter contextual noise—through explicit decoupling of duplex control logic and the introduction of a finite-state dialogue model with an explicit Idle state. This facilitates flexible, plug-and-play integration of turn-based LLMs into full-duplex interaction scenarios without retraining dialogue model weights, while simultaneously improving turn-taking, reducing false interruptions, and maintaining high dialogue coherence (Liao et al., 19 Feb 2025).
1. Modular Architecture and Decoupling of Duplex Control
FlexDuo is instantiated as a stand-alone control module that precedes any half-duplex speech dialogue system, such as GLM4-voice or MiniCPM-o. Rather than embedding turn-taking and interruption handling mechanisms within a monolithic SDS pipeline, FlexDuo implements these functions externally, exposing a well-defined application programming interface (API). Its primary inputs are: (1) a segmented, real-time audio stream and (2) the last generated textual response from the half-duplex LLM. The outputs are: (a) dialogue control signals—drawn from a seven-action set—and (b) a possibly filtered or buffered sequence of audio blocks for subsequent ASR or semantic encoding.
The internal architecture comprises three primary submodules:
- Context Manager: Maintains a running log of turn-by-turn text history across the interaction.
- State Manager: Implements a finite-state machine (FSM) that, on a 120 ms interval, computes dialogue state transitions using the current dialogue state, buffered audio (in a sliding window), and dialogue context.
- Sliding-Window Buffer: Dynamically accumulates or discards audio blocks based on the active FSM state.
This approach allows FlexDuo to drive existing LLMs using only high-level control signals—"Speak," "Listen," or "Idle"—eliminating the need for invasive model retraining or modification of dialogue policy logic in the generator itself. Only FlexDuo is finetuned for full-duplex control; the LLM remains fixed, except for compatibility hooks relating to external control actions (Liao et al., 19 Feb 2025).
2. Three-State Dialogue Model and Idle State
FlexDuo advances the state transition model of full-duplex SDS by explicitly introducing a three-state FSM: Speak, Listen, and Idle. This diverges from legacy binary models that only recognize speaking and listening phases.
- Speak: The assistant system actively vocalizes a response; the LLM generates audio tokens.
- Listen: The system awaits user speech; all incoming audio is appended to the buffer for ASR or semantic parsing.
- Idle: Audio received is classified as irrelevant—background noise, non-target speakers, or minimal feedback signals (e.g., "mm-hm")—and is thereby excluded from downstream processing to maintain semantic integrity.
The Idle state fulfills a dual function: (a) aggressive noise and non-target interjection filtering, and (b) control of buffer updates such that only whole, contextually meaningful utterance segments (Inter-Pausal Units, or IPUs) are passed forward to the dialogue policy. During Idle and Speak, the buffer shifts to a fixed-length sliding window (default w=5 blocks, each ≈120 ms), preventing partial or noisy units from contaminating the semantic context (Liao et al., 19 Feb 2025).
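The state-dependent buffer behaviour described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name `SlidingWindowBuffer`, the `push` method, and the choice to keep the windowed blocks when Listen begins are all assumptions made for the example.

```python
from enum import Enum
from typing import List


class State(Enum):
    SPEAK = "Speak"
    LISTEN = "Listen"
    IDLE = "Idle"


class SlidingWindowBuffer:
    """Illustrative buffer: grows in Listen, capped at `window` blocks otherwise."""

    def __init__(self, window: int = 5) -> None:
        if window < 1:
            raise ValueError("window must be at least 1 block")
        self.window = window
        self.blocks: List[bytes] = []

    def push(self, block: bytes, state: State) -> None:
        """Append one audio block (about 120 ms) and trim according to the state."""
        self.blocks.append(block)
        if state is not State.LISTEN:
            # Idle or Speak: keep only the most recent `window` blocks.
            self.blocks = self.blocks[-self.window:]
        # Listen: keep everything so the full utterance reaches ASR/encoding.


buf = SlidingWindowBuffer(window=5)
for i in range(8):
    buf.push(bytes([i]), State.IDLE)
idle_len = len(buf.blocks)      # capped at the window size

for i in range(8, 11):
    buf.push(bytes([i]), State.LISTEN)
listen_len = len(buf.blocks)    # grows while listening
```

In this sketch the blocks retained during Idle are carried over when Listen starts, so the onset of a user utterance is not lost; whether FlexDuo handles the boundary this way is not stated in the text.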
3. Dialogue Control Actions and Transition Mechanisms
The FSM at the core of FlexDuo generates one of seven control actions at each timestep:
| Action | Description | State Transition |
|---|---|---|
| K.S | Remain in Speak | Speak → Speak |
| K.L | Remain in Listen | Listen → Listen |
| K.I | Remain in Idle | Idle → Idle |
| S2L | Assistant interrupted by User | Speak → Listen |
| S2I | Assistant finishes turn | Speak → Idle |
| L2S | Assistant begins/continues speaking | Listen → Speak |
| I2L | User initiates target speech | Idle → Listen |
At each 120 ms interval, the state manager selects the control action as a function of three inputs: the current text context, the previous FSM state, and the audio buffer. This mechanism enables the controller to distinguish non-dialogic background events from user speech, and so to orchestrate conversational handover and interruption handling (Liao et al., 19 Feb 2025).
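The seven-action table maps directly onto a small transition lookup. The sketch below encodes only the transitions listed in the table; the names `TRANSITIONS`, `step`, and `valid_actions` are illustrative, and the learned policy that actually chooses an action from context, state, and audio is not modelled here.

```python
from enum import Enum
from typing import Dict, List, Tuple


class State(Enum):
    SPEAK = "Speak"
    LISTEN = "Listen"
    IDLE = "Idle"


# action label -> (required current state, resulting state)
TRANSITIONS: Dict[str, Tuple[State, State]] = {
    "K.S": (State.SPEAK, State.SPEAK),
    "K.L": (State.LISTEN, State.LISTEN),
    "K.I": (State.IDLE, State.IDLE),
    "S2L": (State.SPEAK, State.LISTEN),
    "S2I": (State.SPEAK, State.IDLE),
    "L2S": (State.LISTEN, State.SPEAK),
    "I2L": (State.IDLE, State.LISTEN),
}


def step(state: State, action: str) -> State:
    """Apply one control action, rejecting actions not defined for `state`."""
    source, target = TRANSITIONS[action]
    if source is not state:
        raise ValueError(f"action {action!r} is not valid in state {state.value}")
    return target


def valid_actions(state: State) -> List[str]:
    """List the actions the table allows from `state`."""
    return [a for a, (src, _) in TRANSITIONS.items() if src is state]


next_state = step(State.SPEAK, "S2L")       # assistant interrupted by user
idle_options = valid_actions(State.IDLE)    # keep idling or start listening
```

Read this way, the table contains no direct Idle → Speak or Listen → Idle action: the assistant only starts speaking from Listen, and only reaches Idle from Speak.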
4. Plug-and-Play Integration and Module Interfaces
FlexDuo is architected for minimal invasion: it interposes at the audio and control signal boundary. The LLM is presented with exactly three operations—"Listen," "Idle," and "Speak"—according to real-time control signals from FlexDuo. The LLM's ASR or semantic encoder is only activated during "Listen"; audio is blocked during "Idle," and generation proceeds during "Speak." During finetuning, dialogue actions are represented as special tokens within the LLM vocabulary, but at inference, these tokens are replaced by external controls.
Any turn-based LLM—so long as it exposes compatible token-control hooks—can be operated in this augmented full-duplex mode. This design also allows for independent improvement of the LLM and FlexDuo controller, further facilitating modular SDS research and deployment scenarios (Liao et al., 19 Feb 2025).
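One way to picture the interface boundary is a thin router that turns each control state into one of the three operations on the half-duplex backend. The sketch below is an assumption-laden illustration: `make_router` and the two callbacks `on_listen` and `on_speak` are hypothetical stand-ins for whatever hooks a given LLM exposes, not an API of GLM4-voice, MiniCPM-o, or FlexDuo.

```python
from typing import Callable, List


def make_router(
    on_listen: Callable[[bytes], None],
    on_speak: Callable[[], None],
) -> Callable[[str, bytes], str]:
    """Build a function that routes one audio block according to the control state."""

    def route(state: str, block: bytes) -> str:
        if state == "Listen":
            on_listen(block)      # audio reaches the ASR / semantic encoder
            return "forwarded"
        if state == "Speak":
            on_speak()            # backend continues generating its response
            return "generating"
        if state == "Idle":
            return "dropped"      # audio is blocked from the backend
        raise ValueError(f"unknown state: {state!r}")

    return route


heard: List[bytes] = []
speak_calls: List[int] = []
route = make_router(heard.append, lambda: speak_calls.append(1))

results = [
    route("Listen", b"user-speech"),
    route("Idle", b"background-noise"),
    route("Speak", b""),
]
```

Because the backend sees only these three operations, the controller and the LLM can be swapped or improved independently, which is the point of the plug-and-play design.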
5. Experimental Protocol and Evaluation Metrics
The evaluation of FlexDuo was carried out on both the English and Chinese Fisher corpora. Key aspects include:
- Data: English Fisher (671 h) and Chinese Fisher (263 h) after filtering.
- Segmentation/Labelling: IPUs were extracted by VAD and merged according to a silence-duration threshold (in ms); backchannels were identified via complete overlap.
- Training: Qwen2-audio-7B-Instruct served as the base model (audio encoder frozen), with only the LLM head finetuned using cross-entropy loss on action tokens. Optimization used AdamW (40k steps, DeepSpeed ZeRO-3).
- Baselines: Integrated full-duplex systems (Moshi, MinMo, Freeze-Omni) and VAD-based half-duplex systems (VAD+GLM4-voice, VAD+MiniCPM-o).
- Metrics:
- Turn-taking Pos.F1@K (K=1/5/10) for both Assistant (Speak transition) and User (exit from Speak)
- False interruption rates, reported for the Assistant and the User
- Dialogue quality: conditional perplexity (cond. PPL) as measured by Qwen2.5-1.5B-Instruct across multi-turn transcripts (Liao et al., 19 Feb 2025).
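For reference, conditional perplexity is conventionally the exponential of the negative mean log-probability that a scoring model assigns to the response tokens given the preceding dialogue. The sketch below implements that standard definition on a list of per-token natural-log probabilities; the function name is illustrative, and how the per-token scores were obtained from the evaluator model in the paper is not described here.

```python
import math
from typing import Sequence


def conditional_perplexity(token_logprobs: Sequence[float]) -> float:
    """exp(-mean log p(token | context, previous tokens)), natural logs assumed."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# A response whose tokens each receive probability 0.5 has perplexity 2.
example_ppl = conditional_perplexity([math.log(0.5)] * 4)
```

Lower values mean the evaluator model finds the response more predictable given the dialogue history, which is used here as a proxy for coherence.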
6. Empirical Results and Ablation Findings
FlexDuo demonstrated significant improvements on full-duplex dialogue tasks:
- Turn-taking (English): Combined F1 = 0.79 (Assistant F1@1/5/10 = 0.68/0.91/0.93; User F1@1/5/10 = 0.89/1.0/1.0)
- False interruptions: Combined = 0.30; Assistant = 0.35; User = 0.25
- Relative improvement over best integrated baseline: False interruption rate reduced by 24.9%, turn-taking accuracy increased by 7.6%
- Dialogue quality (conditional PPL): English = 28.94 versus 64.32 for the VAD baseline (an absolute reduction of about 35.4); Chinese = 44.45 versus 63.51 for the VAD baseline (an absolute reduction of about 19.1)
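The conditional-PPL comparison can be checked with simple arithmetic. The sketch below computes the absolute and the relative reduction from the reported values; the helper name `ppl_reduction` is illustrative.

```python
from typing import Tuple


def ppl_reduction(baseline: float, system: float) -> Tuple[float, float]:
    """Return (absolute reduction, relative reduction as a fraction of baseline)."""
    absolute = baseline - system
    return absolute, absolute / baseline


en_abs, en_rel = ppl_reduction(baseline=64.32, system=28.94)
zh_abs, zh_rel = ppl_reduction(baseline=63.51, system=44.45)

print(f"English: -{en_abs:.2f} PPL ({en_rel:.1%} relative)")
print(f"Chinese: -{zh_abs:.2f} PPL ({zh_rel:.1%} relative)")
```

For English the drop is 35.38 PPL points, about 55% of the baseline; for Chinese it is 19.06 points, about 30% of the baseline.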
Ablation studies confirmed the centrality of the Idle state: its removal decreased combined turn-taking from 0.79 to 0.63 (–15.7%) and increased false interruptions from 0.30 to 0.44 (+13.6%). The sliding-window size also has an optimum: smaller windows harmed semantic segmentation, while larger windows introduced additional latency (Liao et al., 19 Feb 2025).
7. Theoretical Insights, Limitations, and Extensions
Explicit modeling of non-dialogic audio via the Idle state allows FlexDuo to filter ambient speech and short acknowledgments, reducing the semantic noise that reaches the dialogue model. The FSM's action set supports rapid adaptation without requiring full LLM context resets between state transitions.
Limitations include the inability of half-duplex LLMs to generate backchannels autonomously; integration with multi-modal cues (such as gaze or gesture) remains unaddressed. Reinforcement learning could further optimize the interruption/turn-taking trade-off. The pluggable and FSM-based design of FlexDuo establishes a new technical pathway for flexible, efficient, and robust full-duplex spoken dialogue systems (Liao et al., 19 Feb 2025).