- The paper introduces a unified full-duplex model that embeds listening, speaking, planning, and tool invocation into a time-synchronized three-channel backbone.
- It reports sub-second tool-call delays (0.57–0.68 s) and turn-taking reactions (0.27–0.40 s), with tool calls roughly 4× faster than traditional ASR+LLM cascades.
- The native action channel enables real-time planning and multi-action tool calls, paving the way for richer, multimodal spoken dialogue systems.
DuplexSLA: A Full-Duplex Foundation Model for Synchronized Speech, Language, and Action
Introduction
DuplexSLA introduces a native full-duplex speech-language-action (SLA) foundation model, addressing the limitations of turn-based spoken dialogue systems and existing duplex backbones by integrating synchronized speech, semantic turn-taking, and real-time tool calling. Unlike traditional ASR+LLM+TTS cascades or models dependent on external semantic VADs, DuplexSLA embeds listening, speaking, planning, and tool invocation into a unified, time-aligned backbone. This technical leap enables the model to exhibit agentic speech behaviours (natural interruptions, backchanneling, and interleaved tool execution) on a strict conversational clock with minimal latency.
The core innovation is a dual-stream three-channel setup with fixed 160 ms conversational chunks:
- User Channel: Causally encoded user audio (2×80 ms features per chunk).
- Assistant Channel: Discrete TA4 assistant speech (text anchor plus 4 discrete audio tokens per chunk at 40 ms stride).
- Action Channel: Rate-limited (≤10 tokens per chunk) textual stream carrying delayed transcripts, turn-taking labels (interrupt, backchannel, response), free-form planning, and JSON-style tool calls.
The LLM backbone autoregressively generates the assistant TA4 and the action channel in lockstep, while the user audio is processed causally and not predicted by the model. The action channel is a critical design element: it externalizes interaction control and tool use into a time-aligned textual lane, decoupled from the assistant speech. All behavioral logic is embedded natively; there is no recourse to external VAD or rule-based controllers.
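The per-chunk layout described above can be sketched as a small data structure. This is only an illustration under stated assumptions: the `Chunk` class, its field names, and the `pack_action_tokens` helper are hypothetical, and spilling overflow action tokens into later chunks is an assumed policy, not something the summary specifies. Only the constants (160 ms chunks, 2×80 ms user features, four audio tokens at a 40 ms stride, at most 10 action tokens) come from the text.

```python
from dataclasses import dataclass, field
from typing import List

CHUNK_MS = 160               # fixed conversational chunk length (from the text)
USER_FRAME_MS = 80           # user channel: 2 x 80 ms features per chunk
AUDIO_TOKEN_STRIDE_MS = 40   # assistant channel: 4 audio tokens per chunk
MAX_ACTION_TOKENS = 10       # action channel rate limit per chunk


@dataclass
class Chunk:
    """One 160 ms step of the three channels (hypothetical container)."""
    user_frames: List[str]    # observed user-audio features, never predicted
    text_anchor: str          # the "T" in TA4
    audio_tokens: List[str]   # the four "A" tokens in TA4
    action_tokens: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Enforce the per-chunk counts implied by the timing constants.
        if len(self.user_frames) != CHUNK_MS // USER_FRAME_MS:
            raise ValueError("expected 2 user frames per chunk")
        if len(self.audio_tokens) != CHUNK_MS // AUDIO_TOKEN_STRIDE_MS:
            raise ValueError("expected 4 assistant audio tokens per chunk")
        if len(self.action_tokens) > MAX_ACTION_TOKENS:
            raise ValueError("action channel exceeds its per-chunk budget")


def pack_action_tokens(tokens: List[str],
                       budget: int = MAX_ACTION_TOKENS) -> List[List[str]]:
    """Split a long action stream into per-chunk slices of at most `budget`
    tokens (assumed overflow policy: continue in the next chunk)."""
    return [tokens[i:i + budget] for i in range(0, len(tokens), budget)]
```

Under this assumed policy, a 23-token tool call would occupy three consecutive chunks (10, 10, and 3 tokens), which is one way the rate limit could translate into emission delay.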
Chunks are serialized for the LLM as:
<|user_audio_begin|> UU <|user_audio_end|> <|assistant_audio_begin|> TAAAA <|assistant_audio_end|> (action text) <|action_end|>
This scheme ensures strict temporal synchronization between agent audio, textual actions, and external functions, enabling real-time agentic behavior.
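As a rough sketch of this serialization, the function below assembles one chunk in the order shown in the template. The delimiter strings are taken from the template; the function name, the space-separated joining, and the placeholder token names in the usage note (`<u0>`, `<a0>`, and so on) are assumptions made for illustration only.

```python
from typing import List


def serialize_chunk(user_frames: List[str],
                    text_anchor: str,
                    audio_tokens: List[str],
                    action_text: str = "") -> str:
    """Lay out one chunk as: user audio, assistant TA4, action text."""
    parts = [
        "<|user_audio_begin|>", *user_frames, "<|user_audio_end|>",
        "<|assistant_audio_begin|>", text_anchor, *audio_tokens,
        "<|assistant_audio_end|>",
    ]
    if action_text:
        # The action lane may be empty in chunks where nothing is planned.
        parts.append(action_text)
    parts.append("<|action_end|>")
    return " ".join(parts)
```

For example, `serialize_chunk(["<u0>", "<u1>"], "Hi", ["<a0>", "<a1>", "<a2>", "<a3>"], "response")` yields a single line with the two user features first, then the text anchor and four audio tokens, then the action text closed by `<|action_end|>`.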
Training Data and Recipe
DuplexSLA requires a custom data pipeline because standard dialogue corpora do not match the dual-stream three-channel structure. The data construction pipeline has three stages:
- Annotation: An LLM semantically annotates raw dialogues with tool-call objects, rationale fragments, and turn-taking events, each aligned to chunk-indexed semantic triggers.
- Synthesis: User and assistant utterances are force-aligned and voice-cloned, then merged to reflect chunk-level semantics.
- Supervision: Assistant TA4 and action streams are supervised directly; the user audio stream provides only observed inputs.
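To make "chunk-indexed" concrete, the sketch below maps annotated event times onto 160 ms chunk indices. The floor-division rule, the function names, and the event labels in the usage note are assumptions; the summary does not describe the exact alignment rule used in the pipeline.

```python
from typing import Dict, Iterable, List, Tuple

CHUNK_MS = 160  # chunk length from the text


def chunk_index(t_ms: int) -> int:
    """Index of the chunk containing time t_ms (assumed rule: floor division)."""
    if t_ms < 0:
        raise ValueError("timestamps must be non-negative")
    return t_ms // CHUNK_MS


def place_events(events: Iterable[Tuple[int, str]]) -> Dict[int, List[str]]:
    """Group (time_ms, label) events by chunk index, in time order."""
    by_chunk: Dict[int, List[str]] = {}
    for t_ms, label in sorted(events):
        by_chunk.setdefault(chunk_index(t_ms), []).append(label)
    return by_chunk
```

With this rule, an event at 330 ms lands in chunk 2, while events at 1000 ms and 1100 ms both land in chunk 6, so they would share one chunk's action budget.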
Training is staged:
- Continued Pretraining (CPT): 500k hours (320k duplex dialogues, 2×90k ASR, 1.92M text samples) introduces the backbone to dual-stream three-channel serialization, with explicit time alignment and silence handling.
- Capability-Oriented Post-Training: 50k hours, specializing in interaction control (pause, interrupt, backchannel) and three tool-call styles, fine-tunes the model for sub-second interaction behaviours and time-critical agentic actions.
Dual-side ASR is critical for enforcing time alignment between speech and action emission, a prerequisite for low-latency tool use.
DuplexSLA's fundamental distinction is the integration of semantic turn-taking decisions and in-conversation tool planning into the same decoding loop as assistant speech, yielding several unique behaviors:
- Semantic-Driven Turn-Taking: Pauses, interruptions, and backchannels are managed from the model's internal representation, not heuristics or external VAD postprocessing. The model emits control labels on the action channel, instantaneously altering assistant speech flow with chunk-level granularity.
- Native In-Conversation Planning and Tool Calls: Planning text and structured tool calls are emitted on the action channel synchronously, without halting or modifying the assistant speech stream. Backchannel-triggered and multi-action tool calls are naturally interleaved, annotated with explicit timestamps.
This coordinated emission schema eliminates the latency penalty endemic to turn-based or cascade architectures.
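A consumer of the action channel might separate control labels, planning text, and tool calls along the following lines. The wire format here is an assumption: the summary names the label vocabulary and says tool calls are JSON-style, but it does not say how they are delimited, so bare-word labels and inline JSON objects are hypothetical, as are the function name and the `get_weather` call in the usage note.

```python
import json
from typing import Any, Dict, List

# Label names come from the text; treating them as bare words is an assumption.
CONTROL_LABELS = {"interrupt", "backchannel", "response"}


def parse_action_text(text: str) -> Dict[str, Any]:
    """Split one chunk's action text into labels, planning text, tool calls."""
    decoder = json.JSONDecoder()
    tool_calls: List[dict] = []
    plain: List[str] = []
    i = 0
    while i < len(text):
        if text[i] == "{":
            try:
                # raw_decode parses one JSON value starting at index i.
                obj, end = decoder.raw_decode(text, i)
            except json.JSONDecodeError:
                obj = None
            if isinstance(obj, dict):
                tool_calls.append(obj)
                i = end
                continue
        plain.append(text[i])
        i += 1
    words = "".join(plain).split()
    labels = [w for w in words if w in CONTROL_LABELS]
    # Limitation of this sketch: a planning word equal to a label is
    # counted as a label.
    planning = " ".join(w for w in words if w not in CONTROL_LABELS)
    return {"labels": labels, "planning": planning, "tool_calls": tool_calls}
```

Calling `parse_action_text('interrupt check weather {"name": "get_weather", "arguments": {"city": "Paris"}}')` separates the `interrupt` label and the planning words from the single tool-call object; several JSON objects in one string are collected in order, which is how a multi-action call could be represented.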
Evaluation and Numerical Results
Evaluation is conducted on DuplexSLA-Bench, a dedicated 2,100-case duplex benchmark, covering four turn-taking scenarios (normal, pause, interrupt, backchannel) and three tool-call styles (single, multi, backchannel-action).
- Tool-call latency: DuplexSLA achieves average tool-call delays of 0.64 s (single), 0.68 s (multi), and 0.57 s (backchannel-action), roughly 4× faster than the ASR+LLM cascade baseline (delays of 2.33–4.71 s).
- Tool-call accuracy: DuplexSLA matches or closely trails cascade accuracy (85.6–96.0%), with the largest drop in the more difficult multi-action setting.
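The reported delays can be related back to the 160 ms conversational clock with a little arithmetic. The sketch below uses only figures quoted in this summary; because the cascade delays are given as a range rather than per style, it computes loose speedup bounds rather than per-style ratios. All names are illustrative.

```python
CHUNK_S = 0.160  # chunk length in seconds (from the text)

# Average tool-call delays reported for DuplexSLA, in seconds.
tool_call_delay_s = {"single": 0.64, "multi": 0.68, "backchannel-action": 0.57}

# Cascade baseline delays are reported only as a range, in seconds.
cascade_min_s, cascade_max_s = 2.33, 4.71


def delay_in_chunks(delay_s: float, chunk_s: float = CHUNK_S) -> float:
    """Express a delay as a number of conversational chunks."""
    return delay_s / chunk_s


chunks = {style: round(delay_in_chunks(d), 2)
          for style, d in tool_call_delay_s.items()}

# Loosest possible bounds on the speedup given only the two ranges.
speedup_low = cascade_min_s / max(tool_call_delay_s.values())
speedup_high = cascade_max_s / min(tool_call_delay_s.values())
```

Each tool-call style therefore completes within roughly four chunks of the trigger, and the speedup bounds bracket the "roughly 4×" figure quoted above.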
Turn-Taking
- Backchannel Handling: DuplexSLA attains 98.33% correct backchannel responses, substantially outperforming all baselines (≤40%).
- Latency: Sub-second reaction in all scenarios (0.27–0.40 s), consistently lower than commercial (Gemini, GPT-Realtime) and open-source (MiniCPM-o, PersonaPlex) competitors (typically 0.6–1.7 s).
- Pause/Interrupt: 93–99% accuracy across scenarios, validating robust chunk-level semantic control.
Baselines either introduce significant delay or degrade on scenarios (such as pause/backchannel) not natively represented in their architectures.
Implications and Future Directions
DuplexSLA demonstrates that agentic, real-time spoken dialogue with precise tool use and nuanced turn-taking is achievable within a single backbone, given appropriate data and serialization. The explicit action channel is critical, not just for tool-use speed but for modeling the intertwined timing of natural conversation. The approach opens several future directions:
- Richer Action/Planning Channels: Extending the action lane to complex sequential workflows, more expressive agentic planning, and richer multimodal or task-specific APIs.
- Open-Domain/Device Integration: Seamless control of broader on-device and cloud-based tool ecosystems, with real-time synchronization between speech and action.
- Multilinguality and Emotional Intelligence: Time-synchronized affective and multilingual capabilities to match the complexity of human spoken interaction.
- Fine Latency Control and Deployment Agility: Architectural decoupling of the per-chunk token budget enables practical tuning for diverse hardware and user experience constraints.
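The last point, tuning the per-chunk token budget, can be illustrated with a back-of-the-envelope throughput estimate. This sketch assumes that the model generates five TA4 tokens plus up to the action budget in every 160 ms chunk, and it ignores delimiter tokens and user-feature processing, whose cost the summary does not specify. Function names are hypothetical.

```python
CHUNK_S = 0.160   # chunk length in seconds (from the text)
TA4_TOKENS = 5    # one text anchor plus four audio tokens per chunk


def required_decode_rate(action_budget: int = 10,
                         chunk_s: float = CHUNK_S) -> float:
    """Generated tokens per second needed to keep up with real time."""
    if action_budget < 0 or chunk_s <= 0:
        raise ValueError("budget must be >= 0 and chunk length > 0")
    return (TA4_TOKENS + action_budget) / chunk_s


def max_action_budget(decode_rate_tps: float,
                      chunk_s: float = CHUNK_S) -> int:
    """Largest action budget a decoder of the given speed can sustain."""
    tokens_per_chunk = int(decode_rate_tps * chunk_s)
    return max(0, tokens_per_chunk - TA4_TOKENS)
```

Under these assumptions, the default budget of 10 action tokens calls for just under 94 generated tokens per second, and a slower decoder would have to shrink the action budget to stay on the conversational clock.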
Conclusion
DuplexSLA establishes an integrated framework for full-duplex speech-language-action, unifying semantic turn-taking, structured planning, and tool use on a strict conversational clock. It achieves state-of-the-art sub-second latency across nuanced turn-taking and tool-call scenarios, outperforming both pipeline and alternative duplex architectures in responsiveness and fluency. The model makes a compelling case for explicit, time-synchronized action channels as the foundation for future agentic spoken dialogue systems (2605.20755).